Created attachment 135561 [details]
Dmesg output showing the error.

I can run into this issue by executing piglit on CNL with the following branch:

https://cgit.freedesktop.org/~rantogno/mesa/log/?h=wip/cnl_disable_pc

The kernel emits a backtrace and the machine becomes unusable shortly after (it's not a simple GPU hang). Even ssh eventually stops working.

It takes a couple of tries to happen. I have seen it with other branches too, and I believe master also has the same issue.

I was using drm-tip from today, commit hash:
74ae8acff97c1739330154fa34bf5a64e28d608f

I also artificially disabled preemption by setting .has_logical_ring_preemption = 0 on GEN9_FEATURES. I'll re-enable it and confirm that the bug still happens (just wanted to start the discussion early).

Steps to reproduce:
- Disable gdm so that no X, gnome-wayland or anything similar is running.
- With the above kernel, the mentioned mesa branch, and piglit installed, run something like:

$ EGL_PLATFORM=gbm PIGLIT_PLATFORM=gbm ./piglit run gpu ~/piglit_output/

Make sure you have mesa, waffle and piglit built with gbm support. I believe the bug would also happen on other platforms, like X or wayland.
try

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index fc098e6cca23..75a048eab127 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -154,8 +154,7 @@
 #define GEN8_CTX_STATUS_LITE_RESTORE	(1 << 15)
 
 #define GEN8_CTX_STATUS_COMPLETED_MASK \
-	(GEN8_CTX_STATUS_ACTIVE_IDLE | \
-	 GEN8_CTX_STATUS_PREEMPTED | \
+	(GEN8_CTX_STATUS_PREEMPTED | \
 	 GEN8_CTX_STATUS_ELEMENT_SWITCH)
 
 #define CTX_LRI_HEADER_0		0x01
@@ -842,8 +841,6 @@ static void execlists_submission_tasklet(unsigned long data)
 			GEM_TRACE("%s csb[%dd]: status=0x%08x:0x%08x\n",
 				  engine->name, head,
 				  status, buf[2*head + 1]);
-			if (!(status & GEN8_CTX_STATUS_COMPLETED_MASK))
-				continue;
 
 			if (status & GEN8_CTX_STATUS_ACTIVE_IDLE &&
 			    buf[2*head + 1] == PREEMPT_ID) {
@@ -862,6 +859,9 @@ static void execlists_submission_tasklet(unsigned long data)
 						EXECLISTS_ACTIVE_PREEMPT))
 				continue;
 
+			if (!(status & GEN8_CTX_STATUS_COMPLETED_MASK))
+				continue;
+
 			GEM_BUG_ON(!execlists_is_active(execlists,
 							EXECLISTS_ACTIVE_USER));
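For anyone following along, the effect of the reordering in the diff above can be sketched as a standalone function. This is only an illustration, not the driver code: the bit values other than LITE_RESTORE are assumed to match intel_lrc.c of that era, SKETCH_PREEMPT_ID is a placeholder value, the names classify_csb and enum csb_action are made up for this note, and the other checks that sit between the two tests in the real tasklet are left out.

```c
#include <stdint.h>

/* Assumed bit layout of the CSB status dword; only LITE_RESTORE
 * (1 << 15) is visible in the diff itself. */
#define CTX_STATUS_IDLE_ACTIVE    (1u << 0)
#define CTX_STATUS_PREEMPTED      (1u << 1)
#define CTX_STATUS_ELEMENT_SWITCH (1u << 2)
#define CTX_STATUS_ACTIVE_IDLE    (1u << 3)
#define CTX_STATUS_COMPLETE       (1u << 4)

/* Patched mask: ACTIVE_IDLE on its own no longer counts as completion. */
#define CTX_STATUS_COMPLETED_MASK \
	(CTX_STATUS_PREEMPTED | CTX_STATUS_ELEMENT_SWITCH)

/* Hypothetical placeholder for the driver's PREEMPT_ID. */
#define SKETCH_PREEMPT_ID 0x1u

/* Hypothetical result type, introduced only for this sketch. */
enum csb_action {
	CSB_IGNORE,        /* event is skipped ("continue" in the tasklet) */
	CSB_PREEMPT_DONE,  /* preempt context went idle */
	CSB_PORT_COMPLETE  /* falls through to the port completion path */
};

/* Mirrors the patched order: preempt-complete test first, mask test second. */
enum csb_action classify_csb(uint32_t status, uint32_t ctx_id)
{
	if ((status & CTX_STATUS_ACTIVE_IDLE) && ctx_id == SKETCH_PREEMPT_ID)
		return CSB_PREEMPT_DONE;

	if (!(status & CTX_STATUS_COMPLETED_MASK))
		return CSB_IGNORE;

	return CSB_PORT_COMPLETE;
}
```

Under these assumptions an ACTIVE_IDLE event for the preempt context is still recognised, because the mask test now runs after it, while an ACTIVE_IDLE event for any other context (for example status 0x18 with context 0x7) is ignored.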
Created attachment 135562 [details] Same kernel, with preemption enabled. The backtrace is different, though.
Fwiw, this is just the cnl bug, but without debugging enabled so instead of early detection it runs until the machine can run no more.
Created attachment 135563 [details] [review] Ignore standalone ACTIVE_IDLE events.
Created attachment 135564 [details] [review] Ignore standalone ACTIVE_IDLE events.
Created attachment 135567 [details] [review] Ignore standalone ACTIVE_IDLE events v2.
reference: https://patchwork.freedesktop.org/series/34081/ and patch https://patchwork.freedesktop.org/patch/189248/
Hi Jani, is this series supposed to solve the issue? If so, should I apply it together with Chris Wilson's patches?
Created attachment 135607 [details]
dmesg output with debug enabled

Added the dmesg output with drm.debug=0x1f. The file was 2.6GB, so I kept only the last 10k lines (let me know if you need more). This was with only the last patch from Chris Wilson applied.
/o\ Sorry, wrong debug log... We needed the GEM_TRACE output, which needs ftrace_dump_on_oops=1, not drm.debug. Anyway, it proves that the series committed so far doesn't do anything for this bug. The last remaining hope is https://patchwork.freedesktop.org/patch/189291/

After that we have to start thinking again, for which we need the ftrace log.
I did have ftrace_dump_on_oops on my kernel cmdline too. In the dmesg output I see something like:

[ 3976.735435] Dumping ftrace buffer:
[ 3976.735454]    (ftrace buffer empty)

Does that mean we have no trace? Or do I still need something else?
Ah, and that output was from last week, I don't know if some of your patches have landed since then. I'll try with drm-tip of today now.
If you pull drm-tip, you should get

commit ba74cb10c775c839f6e1d0fabd1e772eabd9c43f
Author: Michel Thierry <michel.thierry@intel.com>
Date:   Mon Nov 20 12:34:58 2017 +0000

    drm/i915/execlists: Delay writing to ELSP until HW has processed the previous write

which is the other bug fix. Also make sure you have CONFIG_DRM_I915_TRACE_GEM=y
OK, seems to be fixed for now. I've been running piglit for a while this morning already, and no similar crash happened yet (5 runs so far, and it would crash on the first or second run).
I haven't seen this issue in a long time, so please feel free to close it.
No worries then, we'll just assume that it was fixed by

commit ba74cb10c775c839f6e1d0fabd1e772eabd9c43f
Author: Michel Thierry <michel.thierry@intel.com>
Date:   Mon Nov 20 12:34:58 2017 +0000

    drm/i915/execlists: Delay writing to ELSP until HW has processed the previous write

    The hardware needs some time to process the information received in the
    ExecList Submission Port, and expects us to not write anything more until
    it has 'acknowledged' this new submission by sending an IDLE_ACTIVE or
    PREEMPTED CSB event.

    If we do not follow this, the driver could write new data into the ELSP
    before HW had finished fetching the previous one, putting us in
    'undefined behaviour' space.

    This seems to be the problem causing the spurious PREEMPTED & COMPLETE
    events after a COMPLETE like the one below:

    [] vcs0: sw rd pointer = 2, hw wr pointer = 0, current 'head' = 3.
    [] vcs0: Execlist CSB[0]: 0x00000018 _ 0x00000007
    [] vcs0: Execlist CSB[1]: 0x00000001 _ 0x00000000
    [] vcs0: Execlist CSB[2]: 0x00000018 _ 0x00000007 <<< COMPLETE
    [] vcs0: Execlist CSB[3]: 0x00000012 _ 0x00000007 <<< PREEMPTED & COMPLETE
    [] vcs0: Execlist CSB[4]: 0x00008002 _ 0x00000006
    [] vcs0: Execlist CSB[5]: 0x00000014 _ 0x00000006

    The ELSP writes that lead to this CSB sequence show that the HW hadn't
    started executing the previous execlist (the one with only ctx 0x6) by
    the time the new one was submitted; this is a bit more clear in the data
    shown in the EXECLIST_STATUS register at the time of the ELSP write.

    [] vcs0: ELSP[0] = 0x0_0 [execlist1] - status_reg = 0x0_302
    [] vcs0: ELSP[1] = 0x6_fedb2119 [execlist0] - status_reg = 0x0_8302
    [] vcs0: ELSP[2] = 0x7_fedaf119 [execlist1] - status_reg = 0x0_8308
    [] vcs0: ELSP[3] = 0x6_fedb2119 [execlist0] - status_reg = 0x7_8308

    Note that having to wait for this ack does not disable lite-restores,
    although it may reduce their numbers.

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=102035
    Signed-off-by: Michel Thierry <michel.thierry@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/<20171118003038.7935-1-michel.thierry@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20171120123458.23242-4-chris@chris-wilson.co.uk
    Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
    Tested-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
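The rule described in the commit message (no further ELSP write until the hardware has acknowledged the previous one with an IDLE_ACTIVE or PREEMPTED CSB event) can be sketched as a small gate. This is a minimal illustration under stated assumptions, not the code from the commit: struct elsp_gate and both functions are hypothetical names, and the two bit values are assumed to match the driver's CSB status layout.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed CSB status bits that count as an acknowledgement. */
#define CSB_IDLE_ACTIVE (1u << 0)
#define CSB_PREEMPTED   (1u << 1)

/* Hypothetical bookkeeping for one engine's submission port. */
struct elsp_gate {
	bool awaiting_ack;    /* true between an ELSP write and its ack */
	unsigned int writes;  /* number of ELSP writes actually issued */
};

/* Try to submit: refused while the previous write is unacknowledged. */
bool elsp_try_submit(struct elsp_gate *g)
{
	if (g->awaiting_ack)
		return false;

	g->awaiting_ack = true;  /* the real driver would write ELSP here */
	g->writes++;
	return true;
}

/* Feed one CSB status dword; an ack event reopens the gate. */
void elsp_process_csb(struct elsp_gate *g, uint32_t status)
{
	if (status & (CSB_IDLE_ACTIVE | CSB_PREEMPTED))
		g->awaiting_ack = false;
}
```

With this gate, a status such as 0x18 from the log above does not reopen submission, while 0x01 or 0x12 does, since only those carry one of the two ack bits under the assumed layout.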