Summary: | CNL unable to handle kernel paging request at 00000000fffffffc | ||
---|---|---|---|
Product: | DRI | Reporter: | Rafael Antognolli <rafael.antognolli> |
Component: | DRM/Intel | Assignee: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Status: | CLOSED WORKSFORME | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Severity: | normal | ||
Priority: | medium | CC: | intel-gfx-bugs, rafael.antognolli |
Version: | XOrg git | ||
Hardware: | x86-64 (AMD64) | ||
OS: | Linux (All) | ||
Whiteboard: | |||
i915 platform: | CNL | i915 features: | |
Attachments: |
Description
Rafael Antognolli
2017-11-17 20:01:50 UTC
try

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index fc098e6cca23..75a048eab127 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -154,8 +154,7 @@
 #define GEN8_CTX_STATUS_LITE_RESTORE (1 << 15)

 #define GEN8_CTX_STATUS_COMPLETED_MASK \
-	 (GEN8_CTX_STATUS_ACTIVE_IDLE | \
-	  GEN8_CTX_STATUS_PREEMPTED | \
+	 (GEN8_CTX_STATUS_PREEMPTED | \
 	  GEN8_CTX_STATUS_ELEMENT_SWITCH)

 #define CTX_LRI_HEADER_0 0x01
@@ -842,8 +841,6 @@ static void execlists_submission_tasklet(unsigned long data)
 			GEM_TRACE("%s csb[%dd]: status=0x%08x:0x%08x\n",
 				  engine->name, head,
 				  status, buf[2*head + 1]);
-			if (!(status & GEN8_CTX_STATUS_COMPLETED_MASK))
-				continue;

 			if (status & GEN8_CTX_STATUS_ACTIVE_IDLE &&
 			    buf[2*head + 1] == PREEMPT_ID) {
@@ -862,6 +859,9 @@ static void execlists_submission_tasklet(unsigned long data)
 						EXECLISTS_ACTIVE_PREEMPT))
 				continue;

+			if (!(status & GEN8_CTX_STATUS_COMPLETED_MASK))
+				continue;
+
 			GEM_BUG_ON(!execlists_is_active(execlists,
 							EXECLISTS_ACTIVE_USER));

Created attachment 135562 [details]
Same kernel, with preemption enabled.
The backtrace is different, though.
Fwiw, this is just the cnl bug, but without debugging enabled so instead of early detection it runs until the machine can run no more.

Created attachment 135563 [details] [review]
Ignore standalone ACTIVE_IDLE events.

Created attachment 135564 [details] [review]
Ignore standalone ACTIVE_IDLE events.

Created attachment 135567 [details] [review]
Ignore standalone ACTIVE_IDLE events v2.

reference: https://patchwork.freedesktop.org/series/34081/ and patch https://patchwork.freedesktop.org/patch/189248/

Hi Jani, are these series supposed to solve the issue? If so, should I apply them with Chris Wilson's patches too?

Created attachment 135607 [details]
dmesg output with debug enabled
Added the dmesg output with drm.debug=0x1f. The file was 2.6GB, so I kept only the last 10k lines (let me know if you need more).
This was only with the last patch from Chris Wilson applied.

/o\ Sorry, wrong debug log... We needed the GEM_TRACE, which needs ftrace_dump_on_oops=1, not drm.debug. Anyway, it proves that the series committed so far doesn't do anything for this bug. The last remaining hope is https://patchwork.freedesktop.org/patch/189291/ After that we have to start thinking again, for which we need the ftrace log.

I did have ftrace_dump_on_oops on my kernel cmdline too. On the dmesg output I see something like:

[ 3976.735435] Dumping ftrace buffer:
[ 3976.735454] (ftrace buffer empty)

Does it mean we have no trace? Or do I still need something else? Ah, and that output was from last week, I don't know if some of your patches have landed since then. I'll try with drm-tip of today now.

If you pull drm-tip, you should get

commit ba74cb10c775c839f6e1d0fabd1e772eabd9c43f
Author: Michel Thierry <michel.thierry@intel.com>
Date: Mon Nov 20 12:34:58 2017 +0000

    drm/i915/execlists: Delay writing to ELSP until HW has processed the previous write

which is the other bug fix. Also make sure you have CONFIG_DRM_I915_TRACE_GEM=y

OK, seems to be fixed for now. I've been running piglit for a while this morning already, and no similar crash happened yet (5 runs so far, and it would crash on the first or second run).

I haven't seen this issue in a long time, so please feel free to close it.

No worries then, we'll just assume that it was fixed by

commit ba74cb10c775c839f6e1d0fabd1e772eabd9c43f
Author: Michel Thierry <michel.thierry@intel.com>
Date: Mon Nov 20 12:34:58 2017 +0000

    drm/i915/execlists: Delay writing to ELSP until HW has processed the previous write

    The hardware needs some time to process the information received in the ExecList Submission Port, and expects us to not write anything more until it has 'acknowledged' this new submission by sending an IDLE_ACTIVE or PREEMPTED CSB event.

    If we do not follow this, the driver could write new data into the ELSP before HW had finishing fetching the previous one, putting us in 'undefined behaviour' space.

    This seems to be the problem causing the spurious PREEMPTED & COMPLETE events after a COMPLETE like the one below:

    [] vcs0: sw rd pointer = 2, hw wr pointer = 0, current 'head' = 3.
    [] vcs0: Execlist CSB[0]: 0x00000018 _ 0x00000007
    [] vcs0: Execlist CSB[1]: 0x00000001 _ 0x00000000
    [] vcs0: Execlist CSB[2]: 0x00000018 _ 0x00000007 <<< COMPLETE
    [] vcs0: Execlist CSB[3]: 0x00000012 _ 0x00000007 <<< PREEMPTED & COMPLETE
    [] vcs0: Execlist CSB[4]: 0x00008002 _ 0x00000006
    [] vcs0: Execlist CSB[5]: 0x00000014 _ 0x00000006

    The ELSP writes that lead to this CSB sequence show that the HW hadn't started executing the previous execlist (the one with only ctx 0x6) by the time the new one was submitted; this is a bit more clear in the data show in the EXECLIST_STATUS register at the time of the ELSP write.

    [] vcs0: ELSP[0] = 0x0_0 [execlist1] - status_reg = 0x0_302
    [] vcs0: ELSP[1] = 0x6_fedb2119 [execlist0] - status_reg = 0x0_8302
    [] vcs0: ELSP[2] = 0x7_fedaf119 [execlist1] - status_reg = 0x0_8308
    [] vcs0: ELSP[3] = 0x6_fedb2119 [execlist0] - status_reg = 0x7_8308

    Note that having to wait for this ack does not disable lite-restores, although it may reduce their numbers.

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=102035
    Signed-off-by: Michel Thierry <michel.thierry@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/<20171118003038.7935-1-michel.thierry@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20171120123458.23242-4-chris@chris-wilson.co.uk
    Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
    Tested-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>