On CI: https://intel-gfx-ci.01.org/CI/CI_DRM_2310/fi-ilk-650/igt@gem_exec_fence@await-hang-default.html
Dmesg: https://intel-gfx-ci.01.org/CI/CI_DRM_2310/fi-ilk-650/dmesg-during.log
Maybe we could also have reopened the old https://bugs.freedesktop.org/show_bug.cgi?id=99736?
(In reply to Jani Saarinen from comment #0)
> Maybe could have also opened old
> https://bugs.freedesktop.org/show_bug.cgi?id=99736?

No, because it is not the same bug. Waiting on CI to provide complete debug logs.
Without capturing the oops here there is little progress we can make.
[ 209.126253] [IGT] gem_exec_fence: starting subtest await-hang-default
[ 212.698224] [drm:missed_breadcrumb [i915]] render ring missed breadcrumb at intel_breadcrumbs_hangcheck+0x5c/0x80 [i915], irq posted? no
[ 218.719419] [drm] GPU HANG: ecode 5:0:0xe77fff7f, in gem_exec_fence [6613], reason: Hang on render ring, action: reset
[ 218.719698] [drm:i915_reset_and_wakeup [i915]] resetting chip
[ 218.719770] drm/i915: Resetting chip after gpu hang
[ 218.720366] [drm:i915_gem_reset [i915]] context gem_exec_fence[6613]/0 marked guilty (score 10) banned? no
[ 218.720388] [drm:i915_gem_reset [i915]] resetting render ring to restart from tail of request 0x11884
[ 218.720506] [drm:init_workarounds_ring [i915]] render ring: Number of context specific w/a: 0
[ 220.697838] [drm:missed_breadcrumb [i915]] bsd ring missed breadcrumb at intel_breadcrumbs_hangcheck+0x5c/0x80 [i915], irq posted? no
[ 225.867366] [IGT] gem_exec_fence: exiting, ret=0
[ 225.868419] ------------[ cut here ]------------
[ 225.868427] kernel BUG at drivers/gpu/drm/i915/i915_gem_request.c:413!
[ 225.868434] invalid opcode: 0000 [#1] PREEMPT SMP
[ 225.868440] Modules linked in: intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul snd_hda_codec_hdmi ghash_clmulni_intel snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm lpc_ich mei_me mei i915 e1000e ptp pps_core prime_numbers
[ 225.868478] CPU: 3 PID: 30 Comm: kworker/3:0 Tainted: G W 4.11.0-rc2-CI-Trybot_650+ #1
[ 225.868487] Hardware name: Hewlett-Packard HP Compaq 8100 Elite SFF PC/304Ah, BIOS 786H1 v01.13 07/14/2011
[ 225.868523] Workqueue: events i915_clflush_work [i915]
[ 225.868529] task: ffff880212190040 task.stack: ffffc90000138000
[ 225.868554] RIP: 0010:__i915_gem_request_submit+0x19d/0x1a0 [i915]
[ 225.868560] RSP: 0018:ffffc9000013bc90 EFLAGS: 00010016
[ 225.868566] RAX: 0000000000010ccc RBX: ffff88020cc2f9c0 RCX: 0000000000000001
[ 225.868573] RDX: 00000000ffffffff RSI: 00000000ffffffff RDI: ffff880206f67b28
[ 225.868580] RBP: ffffc9000013bcb8 R08: 000000005640d526 R09: b90709f000000000
[ 225.868586] R10: ffffffff82782218 R11: ffff880212190040 R12: 0000000000004b3f
[ 225.868593] R13: ffff880206f67b00 R14: ffff880206e8a158 R15: ffff88020888c748
[ 225.868600] FS: 0000000000000000(0000) GS:ffff88021bcc0000(0000) knlGS:0000000000000000
[ 225.868609] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 225.868616] CR2: 00000000009ba888 CR3: 0000000207414000 CR4: 00000000000006e0
[ 225.868624] Call Trace:
[ 225.868648] i915_gem_request_submit+0x2b/0x50 [i915]
[ 225.868675] i9xx_submit_request+0x16/0x50 [i915]
[ 225.868700] submit_notify+0x3f/0x5c [i915]
[ 225.868720] __i915_sw_fence_complete+0x176/0x220 [i915]
[ 225.868730] ? try_to_del_timer_sync+0x4d/0x60
[ 225.868751] i915_sw_fence_complete+0x25/0x40 [i915]
[ 225.868772] dma_i915_sw_fence_wake+0x26/0x60 [i915]
[ 225.868781] dma_fence_signal+0x146/0x230
[ 225.868804] i915_clflush_work+0x3d/0x140 [i915]
[ 225.868826] ? i915_clflush_work+0x61/0x140 [i915]
[ 225.868836] process_one_work+0x1f4/0x6d0
[ 225.868843] ? process_one_work+0x16e/0x6d0
[ 225.868850] worker_thread+0x49/0x4a0
[ 225.868857] kthread+0x107/0x140
[ 225.868863] ? process_one_work+0x6d0/0x6d0
[ 225.868870] ? kthread_create_on_node+0x40/0x40
[ 225.868878] ret_from_fork+0x2e/0x40
[ 225.868884] Code: 18 a0 be 93 01 00 00 48 c7 c7 20 f2 19 a0 e8 8b 3b fc e0 e9 a6 fe ff ff 48 89 df e8 ce 16 01 00 e9 f9 fe ff ff 0f 0b 0f 0b 0f 0b <0f> 0b 90 55 48 89 e5 41 55 41 54 53 48 8b 9f a8 00 00 00 49 89
[ 225.868961] RIP: __i915_gem_request_submit+0x19d/0x1a0 [i915] RSP: ffffc9000013bc90
[ 225.868972] ---[ end trace 991ea876a8672e61 ]---
[ 225.868979] BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:33
[ 225.868990] in_atomic(): 1, irqs_disabled(): 1, pid: 30, name: kworker/3:0
[ 225.868998] INFO: lockdep is turned off.
[ 225.869003] irq event stamp: 1192454
[ 225.869010] hardirqs last enabled at (1192453): [<ffffffff81876587>] _raw_spin_unlock_irq+0x27/0x50
[ 225.869021] hardirqs last disabled at (1192454): [<ffffffff81876377>] _raw_spin_lock_irqsave+0x17/0x60
[ 225.869034] softirqs last enabled at (1192118): [<ffffffff81085cd9>] __do_softirq+0x1d9/0x4c0
[ 225.869044] softirqs last disabled at (1192097): [<ffffffff81086139>] irq_exit+0xa9/0xc0
[ 225.869054] Preemption disabled at:
[ 225.869056] [<ffffffff815ec000>] dma_fence_signal+0x100/0x230
[ 225.869069] CPU: 3 PID: 30 Comm: kworker/3:0 Tainted: G D W 4.11.0-rc2-CI-Trybot_650+ #1
[ 225.869080] Hardware name: Hewlett-Packard HP Compaq 8100 Elite SFF PC/304Ah, BIOS 786H1 v01.13 07/14/2011
[ 225.869107] Workqueue: events i915_clflush_work [i915]
[ 225.869114] Call Trace:
[ 225.869122] dump_stack+0x67/0x92
[ 225.869129] ___might_sleep+0x162/0x250
[ 225.869135] __might_sleep+0x45/0x80
[ 225.869143] exit_signals+0x1f/0x2a0
[ 225.869150] do_exit+0x9f/0xcd0
[ 225.869156] ? kthread+0x107/0x140
[ 225.869162] ? process_one_work+0x6d0/0x6d0
[ 225.869169] ? kthread_create_on_node+0x40/0x40
[ 225.869176] rewind_stack_do_exit+0x17/0x20
[ 225.869185] note: kworker/3:0[30] exited with preempt_count 2

That's pretty unbelievable and scary.
Statistics: Failure rate 1/165 run(s) (0%)

From Chris: it actually matches the bug I've been looking at for the last 2 weeks. Keeping this active/open for now.
Patch under review: https://patchwork.freedesktop.org/series/22527/
The symptoms are close enough that I'm treating

commit cbb60b4b987c8a57533dca0f66887ed14a9498e5
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Apr 6 18:00:28 2017 +0100

    drm/i915: Advance ring->head fully when idle

as the likely fix (even though a seqno reset during that hang is actually unlikely). The bug cbb60b4b98 fixes is real enough.
Reopening, as this was seen again on a patchwork run: https://patchwork.freedesktop.org/series/23227/

Comment from Chris on IRC: "fdo finally replied, yup it has the same impossible bug-on as 100144"
commit e6ba9992de6c63fe86c028b4876338e1cb7dac34
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Apr 25 14:00:49 2017 +0100

    drm/i915: Differentiate between sw write location into ring and last hw read

    We need to keep track of the last location we ask the hw to read up to
    (RING_TAIL) separately from our last write location into the ring, so
    that in the event of a GPU reset we do not tell the HW to proceed into
    a partially written request (which can happen if that request is waiting
    for an external signal before being executed).

    v2: Refactor intel_ring_reset() (Mika)

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=100144
    Testcase: igt/gem_exec_fence/await-hang
    Fixes: 821ed7df6e2a ("drm/i915: Update reset path to fix incomplete requests")
    Fixes: d55ac5bf97c6 ("drm/i915: Defer transfer onto execution timeline to actual hw submission")
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Cc: Mika Kuoppala <mika.kuoppala@intel.com>
    Link: http://patchwork.freedesktop.org/patch/msgid/20170425130049.26147-1-chris@chris-wilson.co.uk
    Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
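For readers following along, the idea in that commit message can be illustrated with a toy model. This is only a sketch in Python, not the i915 code: the class `Ring`, its fields `emit` and `tail`, and the method names are illustrative assumptions chosen to mirror the commit's wording ("sw write location" versus "last hw read"), and the sizes are made up.

```python
from dataclasses import dataclass


@dataclass
class Ring:
    """Toy ring buffer that tracks two offsets separately (illustrative only)."""
    size: int
    emit: int = 0  # assumed name: last offset software has written up to
    tail: int = 0  # last offset the hardware was asked to read up to (RING_TAIL)

    def write_request(self, nbytes: int) -> int:
        """Software writes a request into the ring; the hardware is not told yet."""
        self.emit = (self.emit + nbytes) % self.size
        return self.emit

    def submit(self, request_end: int) -> None:
        """The request is ready to run: let the hardware read up to its end."""
        self.tail = request_end

    def tail_after_reset(self) -> int:
        """Value to program into RING_TAIL after a reset.

        Using `tail` rather than `emit` keeps the hardware out of requests
        that were written but never submitted.
        """
        return self.tail


ring = Ring(size=4096)
first_end = ring.write_request(64)
ring.submit(first_end)      # first request handed to the hardware
ring.write_request(128)     # second request written, still waiting on a fence
print(ring.emit, ring.tail_after_reset())
```

In this model a single shared offset would make the reset path point the hardware at the end of the second, unsubmitted request, which is the situation the commit message describes as proceeding into a partially written request.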
Thanks for the fix, whitelisting again from CI.