Summary: | [ILK][BAT] gem_exec_fence@await-hang-default.html | ||
---|---|---|---|
Product: | DRI | Reporter: | Jani Saarinen <jani.saarinen> |
Component: | DRM/Intel | Assignee: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Status: | CLOSED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Severity: | normal | ||
Priority: | medium | CC: | intel-gfx-bugs |
Version: | DRI git | ||
Hardware: | Other | ||
OS: | All | ||
Whiteboard: | |||
i915 platform: | ILK | i915 features: | GEM/execlists |
Description
Jani Saarinen
2017-03-10 07:57:54 UTC
(In reply to Jani Saarinen from comment #0)
> Maybe could have also opened old
> https://bugs.freedesktop.org/show_bug.cgi?id=99736?

No, because it is not the same bug. Waiting on CI providing complete debug logs. Without capturing the oops here there is little progress we can make.

[ 209.126253] [IGT] gem_exec_fence: starting subtest await-hang-default
[ 212.698224] [drm:missed_breadcrumb [i915]] render ring missed breadcrumb at intel_breadcrumbs_hangcheck+0x5c/0x80 [i915], irq posted? no
[ 218.719419] [drm] GPU HANG: ecode 5:0:0xe77fff7f, in gem_exec_fence [6613], reason: Hang on render ring, action: reset
[ 218.719698] [drm:i915_reset_and_wakeup [i915]] resetting chip
[ 218.719770] drm/i915: Resetting chip after gpu hang
[ 218.720366] [drm:i915_gem_reset [i915]] context gem_exec_fence[6613]/0 marked guilty (score 10) banned? no
[ 218.720388] [drm:i915_gem_reset [i915]] resetting render ring to restart from tail of request 0x11884
[ 218.720506] [drm:init_workarounds_ring [i915]] render ring: Number of context specific w/a: 0
[ 220.697838] [drm:missed_breadcrumb [i915]] bsd ring missed breadcrumb at intel_breadcrumbs_hangcheck+0x5c/0x80 [i915], irq posted? no
[ 225.867366] [IGT] gem_exec_fence: exiting, ret=0
[ 225.868419] ------------[ cut here ]------------
[ 225.868427] kernel BUG at drivers/gpu/drm/i915/i915_gem_request.c:413!
[ 225.868434] invalid opcode: 0000 [#1] PREEMPT SMP
[ 225.868440] Modules linked in: intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul snd_hda_codec_hdmi ghash_clmulni_intel snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm lpc_ich mei_me mei i915 e1000e ptp pps_core prime_numbers
[ 225.868478] CPU: 3 PID: 30 Comm: kworker/3:0 Tainted: G W 4.11.0-rc2-CI-Trybot_650+ #1
[ 225.868487] Hardware name: Hewlett-Packard HP Compaq 8100 Elite SFF PC/304Ah, BIOS 786H1 v01.13 07/14/2011
[ 225.868523] Workqueue: events i915_clflush_work [i915]
[ 225.868529] task: ffff880212190040 task.stack: ffffc90000138000
[ 225.868554] RIP: 0010:__i915_gem_request_submit+0x19d/0x1a0 [i915]
[ 225.868560] RSP: 0018:ffffc9000013bc90 EFLAGS: 00010016
[ 225.868566] RAX: 0000000000010ccc RBX: ffff88020cc2f9c0 RCX: 0000000000000001
[ 225.868573] RDX: 00000000ffffffff RSI: 00000000ffffffff RDI: ffff880206f67b28
[ 225.868580] RBP: ffffc9000013bcb8 R08: 000000005640d526 R09: b90709f000000000
[ 225.868586] R10: ffffffff82782218 R11: ffff880212190040 R12: 0000000000004b3f
[ 225.868593] R13: ffff880206f67b00 R14: ffff880206e8a158 R15: ffff88020888c748
[ 225.868600] FS: 0000000000000000(0000) GS:ffff88021bcc0000(0000) knlGS:0000000000000000
[ 225.868609] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 225.868616] CR2: 00000000009ba888 CR3: 0000000207414000 CR4: 00000000000006e0
[ 225.868624] Call Trace:
[ 225.868648] i915_gem_request_submit+0x2b/0x50 [i915]
[ 225.868675] i9xx_submit_request+0x16/0x50 [i915]
[ 225.868700] submit_notify+0x3f/0x5c [i915]
[ 225.868720] __i915_sw_fence_complete+0x176/0x220 [i915]
[ 225.868730] ? try_to_del_timer_sync+0x4d/0x60
[ 225.868751] i915_sw_fence_complete+0x25/0x40 [i915]
[ 225.868772] dma_i915_sw_fence_wake+0x26/0x60 [i915]
[ 225.868781] dma_fence_signal+0x146/0x230
[ 225.868804] i915_clflush_work+0x3d/0x140 [i915]
[ 225.868826] ? i915_clflush_work+0x61/0x140 [i915]
[ 225.868836] process_one_work+0x1f4/0x6d0
[ 225.868843] ? process_one_work+0x16e/0x6d0
[ 225.868850] worker_thread+0x49/0x4a0
[ 225.868857] kthread+0x107/0x140
[ 225.868863] ? process_one_work+0x6d0/0x6d0
[ 225.868870] ? kthread_create_on_node+0x40/0x40
[ 225.868878] ret_from_fork+0x2e/0x40
[ 225.868884] Code: 18 a0 be 93 01 00 00 48 c7 c7 20 f2 19 a0 e8 8b 3b fc e0 e9 a6 fe ff ff 48 89 df e8 ce 16 01 00 e9 f9 fe ff ff 0f 0b 0f 0b 0f 0b <0f> 0b 90 55 48 89 e5 41 55 41 54 53 48 8b 9f a8 00 00 00 49 89
[ 225.868961] RIP: __i915_gem_request_submit+0x19d/0x1a0 [i915] RSP: ffffc9000013bc90
[ 225.868972] ---[ end trace 991ea876a8672e61 ]---
[ 225.868979] BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:33
[ 225.868990] in_atomic(): 1, irqs_disabled(): 1, pid: 30, name: kworker/3:0
[ 225.868998] INFO: lockdep is turned off.
[ 225.869003] irq event stamp: 1192454
[ 225.869010] hardirqs last enabled at (1192453): [<ffffffff81876587>] _raw_spin_unlock_irq+0x27/0x50
[ 225.869021] hardirqs last disabled at (1192454): [<ffffffff81876377>] _raw_spin_lock_irqsave+0x17/0x60
[ 225.869034] softirqs last enabled at (1192118): [<ffffffff81085cd9>] __do_softirq+0x1d9/0x4c0
[ 225.869044] softirqs last disabled at (1192097): [<ffffffff81086139>] irq_exit+0xa9/0xc0
[ 225.869054] Preemption disabled at:
[ 225.869056] [<ffffffff815ec000>] dma_fence_signal+0x100/0x230
[ 225.869069] CPU: 3 PID: 30 Comm: kworker/3:0 Tainted: G D W 4.11.0-rc2-CI-Trybot_650+ #1
[ 225.869080] Hardware name: Hewlett-Packard HP Compaq 8100 Elite SFF PC/304Ah, BIOS 786H1 v01.13 07/14/2011
[ 225.869107] Workqueue: events i915_clflush_work [i915]
[ 225.869114] Call Trace:
[ 225.869122] dump_stack+0x67/0x92
[ 225.869129] ___might_sleep+0x162/0x250
[ 225.869135] __might_sleep+0x45/0x80
[ 225.869143] exit_signals+0x1f/0x2a0
[ 225.869150] do_exit+0x9f/0xcd0
[ 225.869156] ? kthread+0x107/0x140
[ 225.869162] ? process_one_work+0x6d0/0x6d0
[ 225.869169] ? kthread_create_on_node+0x40/0x40
[ 225.869176] rewind_stack_do_exit+0x17/0x20
[ 225.869185] note: kworker/3:0[30] exited with preempt_count 2

That's pretty unbelievable and scary.

Statistics: Failure rate 1/165 run(s) (0%)

From Chris: it actually matches the bug I've been looking at the last 2 weeks. Keeping active/open still.

Patch on review: https://patchwork.freedesktop.org/series/22527/

The symptoms are close enough that I'm treating

commit cbb60b4b987c8a57533dca0f66887ed14a9498e5
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Thu Apr 6 18:00:28 2017 +0100

drm/i915: Advance ring->head fully when idle

as the likely fix (even though a seqno reset during that hang is actually unlikely). The bug cbb60b4b98 fixes is real enough.

Re-open as seen on pw run: https://patchwork.freedesktop.org/series/23227/ and comment from Chris on IRC: "fdo finally replied, yup it has the same impossible bug-on as 100144"

commit e6ba9992de6c63fe86c028b4876338e1cb7dac34
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Tue Apr 25 14:00:49 2017 +0100

drm/i915: Differentiate between sw write location into ring and last hw read

We need to keep track of the last location we ask the hw to read up to (RING_TAIL) separately from our last write location into the ring, so that in the event of a GPU reset we do not tell the HW to proceed into a partially written request (which can happen if that request is waiting for an external signal before being executed).

v2: Refactor intel_ring_reset() (Mika)

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=100144
Testcase: igt/gem_exec_fence/await-hang
Fixes: 821ed7df6e2a ("drm/i915: Update reset path to fix incomplete requests")
Fixes: d55ac5bf97c6 ("drm/i915: Defer transfer onto execution timeline to actual hw submission")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20170425130049.26147-1-chris@chris-wilson.co.uk
Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>

Thanks for the fix, whitelisting again from CI.
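For readers following the commit message of e6ba9992de6c above, here is a minimal user-space sketch of the idea it describes: keeping the software write offset separate from the last offset handed to the hardware. This is not the i915 code. The `toy_ring` struct, its `emit` and `hw_tail` fields, and all function names are hypothetical and exist only for illustration, and the ring is reduced to two offsets with no real command buffer behind them.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_RING_SIZE 64u /* hypothetical size; power of two so offsets wrap with a mask */

/* Hypothetical toy model, not the driver's struct. */
struct toy_ring {
	uint32_t emit;    /* where software will write next */
	uint32_t hw_tail; /* last offset the hardware was told to read up to */
};

/* Software writes `bytes` of a request into the ring; returns the request's end offset. */
static uint32_t toy_ring_emit(struct toy_ring *ring, uint32_t bytes)
{
	ring->emit = (ring->emit + bytes) & (TOY_RING_SIZE - 1);
	return ring->emit;
}

/* Submission: only now does the hardware-visible tail advance to the request's end. */
static void toy_ring_submit(struct toy_ring *ring, uint32_t request_end)
{
	ring->hw_tail = request_end;
}

/* After a reset, restart the hardware from what was submitted, never from `emit`. */
static uint32_t toy_ring_reset_tail(const struct toy_ring *ring)
{
	return ring->hw_tail;
}

/* Scenario from the commit message: request B is written but still waits on an external signal. */
static uint32_t toy_ring_demo(void)
{
	struct toy_ring ring = { 0, 0 };
	uint32_t a_end = toy_ring_emit(&ring, 16); /* request A fully written */

	toy_ring_submit(&ring, a_end);             /* A handed to the hardware */
	toy_ring_emit(&ring, 24);                  /* B written, not yet submitted */

	assert(ring.emit != ring.hw_tail);         /* the two offsets have diverged */
	return toy_ring_reset_tail(&ring);         /* stops at the end of A */
}
```

With a single shared offset, a reset in this scenario would point the hardware at the end of request B even though B was never submitted. Tracking the two offsets separately keeps the restart point at the end of request A.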