Bug 103800 - CNL unable to handle kernel paging request at 00000000fffffffc
Summary: CNL unable to handle kernel paging request at 00000000fffffffc
Status: CLOSED WORKSFORME
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: XOrg git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-11-17 20:01 UTC by Rafael Antognolli
Modified: 2018-01-04 18:08 UTC

See Also:
i915 platform: CNL
i915 features:


Attachments
Dmesg output showing the error. (61.44 KB, text/plain)
2017-11-17 20:01 UTC, Rafael Antognolli
Same kernel, with preemption enabled. (181.59 KB, text/plain)
2017-11-17 21:30 UTC, Rafael Antognolli
Ignore standalone ACTIVE_IDLE events. (1.20 KB, patch)
2017-11-17 21:54 UTC, Chris Wilson
Ignore standalone ACTIVE_IDLE events. (2.22 KB, patch)
2017-11-17 21:55 UTC, Chris Wilson
Ignore standalone ACTIVE_IDLE events v2. (725 bytes, patch)
2017-11-17 22:43 UTC, Chris Wilson
dmesg output with debug enabled (766.63 KB, text/plain)
2017-11-20 16:50 UTC, Rafael Antognolli

Description Rafael Antognolli 2017-11-17 20:01:50 UTC
Created attachment 135561
Dmesg output showing the error.

I can run into this issue by executing piglit on CNL with the following branch:

https://cgit.freedesktop.org/~rantogno/mesa/log/?h=wip/cnl_disable_pc

The kernel emits a backtrace and the machine becomes unusable shortly after (it's not a simple GPU hang). Even ssh eventually stops working.

It takes a couple of tries to reproduce. I have hit it with other branches too, and I believe master has the same issue.

I was using drm-tip from today, commit hash: 74ae8acff97c1739330154fa34bf5a64e28d608f

I also artificially disabled preemption by setting .has_logical_ring_preemption = 0 on GEN9_FEATURES. I'll re-enable it and confirm that the bug still happens (just wanted to start the discussion early).

Steps to reproduce:

 - Disable gdm so that neither X nor gnome-wayland nor anything similar is running.
 - With the above kernel, the mentioned mesa branch, and piglit installed, run something like:

$ EGL_PLATFORM=gbm PIGLIT_PLATFORM=gbm ./piglit run gpu ~/piglit_output/

Make sure you have mesa, waffle and piglit built with gbm support. I believe the bug would also happen on other platforms, such as X or Wayland.
Comment 1 Chris Wilson 2017-11-17 21:27:51 UTC
try
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index fc098e6cca23..75a048eab127 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -154,8 +154,7 @@
 #define GEN8_CTX_STATUS_LITE_RESTORE   (1 << 15)
 
 #define GEN8_CTX_STATUS_COMPLETED_MASK \
-        (GEN8_CTX_STATUS_ACTIVE_IDLE | \
-         GEN8_CTX_STATUS_PREEMPTED | \
+        (GEN8_CTX_STATUS_PREEMPTED | \
          GEN8_CTX_STATUS_ELEMENT_SWITCH)
 
 #define CTX_LRI_HEADER_0               0x01
@@ -842,8 +841,6 @@ static void execlists_submission_tasklet(unsigned long data)
                        GEM_TRACE("%s csb[%dd]: status=0x%08x:0x%08x\n",
                                  engine->name, head,
                                  status, buf[2*head + 1]);
-                       if (!(status & GEN8_CTX_STATUS_COMPLETED_MASK))
-                               continue;
 
                        if (status & GEN8_CTX_STATUS_ACTIVE_IDLE &&
                            buf[2*head + 1] == PREEMPT_ID) {
@@ -862,6 +859,9 @@ static void execlists_submission_tasklet(unsigned long data)
                                                EXECLISTS_ACTIVE_PREEMPT))
                                continue;
 
+                       if (!(status & GEN8_CTX_STATUS_COMPLETED_MASK))
+                               continue;
+
                        GEM_BUG_ON(!execlists_is_active(execlists,
                                                        EXECLISTS_ACTIVE_USER));
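
The control flow that the diff above produces can be sketched as a standalone function. This is only a reading of the diff, not the driver code: the enum and function names are hypothetical, the status bit values other than LITE_RESTORE are assumed from the intel_lrc.c of that period, and SKETCH_PREEMPT_ID stands in for PREEMPT_ID with an assumed value of 0x1. It mirrors the order of the checks and makes no claim about whether the hardware events are handled correctly as a result.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed CSB status bits (only LITE_RESTORE is visible in the diff). */
#define CTX_STATUS_IDLE_ACTIVE     (1u << 0)
#define CTX_STATUS_PREEMPTED       (1u << 1)
#define CTX_STATUS_ELEMENT_SWITCH  (1u << 2)
#define CTX_STATUS_ACTIVE_IDLE     (1u << 3)
#define CTX_STATUS_COMPLETE        (1u << 4)
#define CTX_STATUS_LITE_RESTORE    (1u << 15)

/* Mask as it reads after the diff: ACTIVE_IDLE is no longer part of it. */
#define CTX_STATUS_COMPLETED_MASK \
        (CTX_STATUS_PREEMPTED | CTX_STATUS_ELEMENT_SWITCH)

/* Hypothetical stand-in for PREEMPT_ID; the value is an assumption. */
#define SKETCH_PREEMPT_ID 0x1u

enum csb_action {
        CSB_IGNORE,        /* skip this event ("continue" in the tasklet) */
        CSB_PREEMPT_DONE,  /* the preemption context went idle */
        CSB_CONTEXT_DONE,  /* fall through to the completion handling */
};

/* Same order of checks as the patched loop body in the diff. */
static enum csb_action classify_csb(uint32_t status, uint32_t ctx_id,
                                    bool preempt_in_flight)
{
        /* 1. ACTIVE_IDLE from the preempt context: preemption finished. */
        if ((status & CTX_STATUS_ACTIVE_IDLE) && ctx_id == SKETCH_PREEMPT_ID)
                return CSB_PREEMPT_DONE;

        /* 2. PREEMPTED while a preemption is still in flight: skip. */
        if ((status & CTX_STATUS_PREEMPTED) && preempt_in_flight)
                return CSB_IGNORE;

        /* 3. The mask test now runs last, so a standalone ACTIVE_IDLE
         *    from any other context is skipped here. */
        if (!(status & CTX_STATUS_COMPLETED_MASK))
                return CSB_IGNORE;

        return CSB_CONTEXT_DONE;
}
```

Under these assumptions, classify_csb(0x18, 0x7, false) returns CSB_IGNORE, because 0x18 carries ACTIVE_IDLE and COMPLETE but neither PREEMPTED nor ELEMENT_SWITCH, while classify_csb(0x14, 0x6, false) returns CSB_CONTEXT_DONE.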
Comment 2 Rafael Antognolli 2017-11-17 21:30:37 UTC
Created attachment 135562
Same kernel, with preemption enabled.

The backtrace is different, though.
Comment 3 Chris Wilson 2017-11-17 21:39:05 UTC
Fwiw, this is just the cnl bug, but without debugging enabled so instead of early detection it runs until the machine can run no more.
Comment 4 Chris Wilson 2017-11-17 21:54:50 UTC
Created attachment 135563
Ignore standalone ACTIVE_IDLE events.
Comment 5 Chris Wilson 2017-11-17 21:55:35 UTC
Created attachment 135564
Ignore standalone ACTIVE_IDLE events.
Comment 6 Chris Wilson 2017-11-17 22:43:26 UTC
Created attachment 135567
Ignore standalone ACTIVE_IDLE events v2.
Comment 8 Rafael Antognolli 2017-11-20 16:22:28 UTC
Hi Jani, are these series supposed to solve the issue? If so, should I apply them with Chris Wilson's patches too?
Comment 9 Rafael Antognolli 2017-11-20 16:50:13 UTC
Created attachment 135607
dmesg output with debug enabled

Added the dmesg output with drm.debug=0x1f. The file was 2.6GB, so I kept only the last 10k lines (let me know if you need more).

This was only with the last patch from Chris Wilson applied.
Comment 10 Chris Wilson 2017-11-20 16:56:21 UTC
/o\ Sorry, wrong debug log... We needed the GEM_TRACE, which needs ftrace_dump_on_oops=1, not drm.debug.

Anyway, it proves that the series committed so far doesn't do anything for this bug. The last remaining hope is this patch: https://patchwork.freedesktop.org/patch/189291/
After that we have to start thinking again, for which we need the ftrace log.
Comment 11 Rafael Antognolli 2017-11-20 16:59:59 UTC
I did have ftrace_dump_on_oops on my kernel cmdline too.

In the dmesg output I see something like:

[ 3976.735435] Dumping ftrace buffer:
[ 3976.735454]    (ftrace buffer empty)


Does it mean we have no trace? Or do I still need something else?
Comment 12 Rafael Antognolli 2017-11-20 17:01:02 UTC
Ah, and that output was from last week; I don't know whether some of your patches have landed since then. I'll try with today's drm-tip now.
Comment 13 Chris Wilson 2017-11-20 17:05:58 UTC
If you pull drm-tip, you should get

commit ba74cb10c775c839f6e1d0fabd1e772eabd9c43f
Author: Michel Thierry <michel.thierry@intel.com>
Date:   Mon Nov 20 12:34:58 2017 +0000

    drm/i915/execlists: Delay writing to ELSP until HW has processed the previous write

which is the other bug fix. Also make sure you have

CONFIG_DRM_I915_TRACE_GEM=y
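
Going by the subject line of that commit and the full message quoted in comment 16, the idea is to hold back further ELSP writes until the hardware has acknowledged the previous one with an IDLE_ACTIVE or PREEMPTED CSB event. A minimal sketch of that gating follows. The struct, the function names and the bit values are hypothetical and chosen for illustration; this is not the code from the commit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed CSB status bits that count as an acknowledgement. */
#define CSB_IDLE_ACTIVE (1u << 0)
#define CSB_PREEMPTED   (1u << 1)

/* Hypothetical model of one submission port. */
struct port_sketch {
        bool awaiting_ack;  /* an ELSP write was issued, no ack seen yet */
        unsigned writes;    /* number of ELSP writes issued so far */
};

/* Try to submit; refuse while the previous write is unacknowledged. */
static bool try_submit(struct port_sketch *p)
{
        if (p->awaiting_ack)
                return false;   /* hardware has not fetched the last write */

        p->writes++;            /* stands in for the actual ELSP write */
        p->awaiting_ack = true;
        return true;
}

/* Process one CSB event; only IDLE_ACTIVE or PREEMPTED lifts the gate. */
static void on_csb_event(struct port_sketch *p, uint32_t status)
{
        if (status & (CSB_IDLE_ACTIVE | CSB_PREEMPTED))
                p->awaiting_ack = false;
}
```

In this model a second try_submit() right after the first returns false, and it keeps returning false after an event such as 0x18, which carries neither ack bit. Only after on_csb_event() sees 0x01 or 0x02 does the next submission go through.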
Comment 14 Rafael Antognolli 2017-11-20 19:44:58 UTC
OK, it seems to be fixed for now. I've been running piglit for a while this morning, and no similar crash has happened yet (5 runs so far, whereas it used to crash on the first or second run).
Comment 15 Rafael Antognolli 2017-11-29 00:33:39 UTC
I haven't seen this issue in a long time, so please feel free to close it.
Comment 16 Chris Wilson 2017-11-29 00:35:50 UTC
No worries then, we'll just assume that it was fixed by

commit ba74cb10c775c839f6e1d0fabd1e772eabd9c43f
Author: Michel Thierry <michel.thierry@intel.com>
Date:   Mon Nov 20 12:34:58 2017 +0000

    drm/i915/execlists: Delay writing to ELSP until HW has processed the previous write
    
    The hardware needs some time to process the information received in the
    ExecList Submission Port, and expects us to not write anything more until
    it has 'acknowledged' this new submission by sending an IDLE_ACTIVE or
    PREEMPTED CSB event.
    
    If we do not follow this, the driver could write new data into the ELSP
    before HW had finishing fetching the previous one, putting us in
    'undefined behaviour' space.
    
    This seems to be the problem causing the spurious PREEMPTED & COMPLETE
    events after a COMPLETE like the one below:
    
    [] vcs0: sw rd pointer = 2, hw wr pointer = 0, current 'head' = 3.
    [] vcs0:  Execlist CSB[0]: 0x00000018 _ 0x00000007
    [] vcs0:  Execlist CSB[1]: 0x00000001 _ 0x00000000
    [] vcs0:  Execlist CSB[2]: 0x00000018 _ 0x00000007  <<< COMPLETE
    [] vcs0:  Execlist CSB[3]: 0x00000012 _ 0x00000007  <<< PREEMPTED & COMPLETE
    [] vcs0:  Execlist CSB[4]: 0x00008002 _ 0x00000006
    [] vcs0:  Execlist CSB[5]: 0x00000014 _ 0x00000006
    
    The ELSP writes that lead to this CSB sequence show that the HW hadn't
    started executing the previous execlist (the one with only ctx 0x6) by the
    time the new one was submitted; this is a bit more clear in the data
    show in the EXECLIST_STATUS register at the time of the ELSP write.
    
    [] vcs0: ELSP[0] = 0x0_0        [execlist1] - status_reg = 0x0_302
    [] vcs0: ELSP[1] = 0x6_fedb2119 [execlist0] - status_reg = 0x0_8302
    
    [] vcs0: ELSP[2] = 0x7_fedaf119 [execlist1] - status_reg = 0x0_8308
    [] vcs0: ELSP[3] = 0x6_fedb2119 [execlist0] - status_reg = 0x7_8308
    
    Note that having to wait for this ack does not disable lite-restores,
    although it may reduce their numbers.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=102035
    Signed-off-by: Michel Thierry <michel.thierry@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/<20171118003038.7935-1-michel.thierry@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20171120123458.23242-4-chris@chris-wilson.co.uk
    Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
    Tested-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
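
The first dword of each CSB entry in the log above can be decoded by hand. The sketch below does that with a small helper. The bit assignments are assumed from the intel_lrc.c of that period (only LITE_RESTORE as bit 15 appears in this report), though they agree with the COMPLETE and PREEMPTED & COMPLETE annotations in the commit message. The table and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct csb_bit {
        uint32_t mask;
        const char *name;
};

/* Assumed CSB status bit layout, lowest bit first. */
static const struct csb_bit csb_bits[] = {
        { 1u << 0,  "IDLE_ACTIVE" },
        { 1u << 1,  "PREEMPTED" },
        { 1u << 2,  "ELEMENT_SWITCH" },
        { 1u << 3,  "ACTIVE_IDLE" },
        { 1u << 4,  "COMPLETE" },
        { 1u << 15, "LITE_RESTORE" },
};

/* Write the names of all set bits into out, joined by '|'. Returns out. */
static char *decode_csb_status(uint32_t status, char *out, size_t len)
{
        size_t used = 0;

        if (len == 0)
                return out;
        out[0] = '\0';

        for (size_t i = 0; i < sizeof(csb_bits) / sizeof(csb_bits[0]); i++) {
                if (!(status & csb_bits[i].mask))
                        continue;

                int n = snprintf(out + used, len - used, "%s%s",
                                 used ? "|" : "", csb_bits[i].name);
                if (n < 0 || (size_t)n >= len - used)
                        break;  /* output truncated, stop here */
                used += (size_t)n;
        }
        return out;
}
```

With a 64-byte buffer, the statuses from the log decode as follows under these assumptions: 0x18 gives ACTIVE_IDLE|COMPLETE, 0x01 gives IDLE_ACTIVE, 0x12 gives PREEMPTED|COMPLETE, 0x8002 gives PREEMPTED|LITE_RESTORE, and 0x14 gives ELEMENT_SWITCH|COMPLETE.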

