Bug 105888

Summary:

[CNL] System hang: GEM_BUG_ON(buf[2 * head + 1] != port->context_id) during piglit/gem_ctx_switch, masked by tracing enabled

Product:

DRI

Reporter:

Rafael Antognolli <rafael.antognolli>

Component:

DRM/Intel

Assignee:

Intel GFX Bugs mailing list <intel-gfx-bugs>

Status:

CLOSED FIXED

QA Contact:

Intel GFX Bugs mailing list <intel-gfx-bugs>

Severity:

normal

Priority:

medium

CC:

intel-gfx-bugs

Version:

DRI git

Hardware:

Other

OS:

All

Whiteboard:

i915 platform:

CNL

i915 features:

GEM/Other

Attachments:

Description	Flags
dmesg output of the hang.	none
kernel config	none

Description Rafael Antognolli 2018-04-04 18:04:41 UTC

Created attachment 138599 [details]
dmesg output of the hang.

We are seeing system hangs on our CI. They seem to go away with some kernel config flags:

CONFIG_DRM_I915_WERROR, CONFIG_DRM_I915_DEBUG_GEM and CONFIG_DRM_I915_TRACE_GEM

The last one seems to be the one that makes the hang go away entirely.

I'm attaching a dmesg output with the error, with a kernel compiled with CONFIG_DRM_I915_DEBUG_GEM at least.

Comment 1 Rafael Antognolli 2018-04-04 18:09:18 UTC

Last commit that I tested, which still has the bug:

commit 4165791d29f64e01860a064f3c649447dbac41c3
Author: Ville Syrjälä <ville.syrjala@linux.intel.com>
Date:   Thu Mar 22 19:41:35 2018 +0200
                                                     
    drm/i915: Make force_load_detect effective even w/ DMI quirks/hotplug

    When doing forced load detection testing we should totally ignore any
    hotplug status for the connector. This is mostly relevant for machines
    where we already ignore the hotplug status based on the DMI quirks. On
    other machines we would currently skip the force load detection tests
    on account of the connector already being connected.

    v2: Drop the other force_load_detect check since it's useless now (Maarten)

    Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>            
    Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>          
    Link: https://patchwork.freedesktop.org/patch/msgid/20180322174135.5982-1-ville.syrjala@linux.intel.com
    Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>

Comment 2 Chris Wilson 2018-04-04 21:45:44 UTC

Too bad, looks like the GEM_TRACE might have the clue as to what happened.

The complaint is that the CSB value read back from the hw doesn't match the tag we programmed into the ELSP. It might be a garbage value, or there might be extra bits in the dword we need to mask. The main worry is that the read is garbage, and it disappearing with GEM_TRACE suggests timing (the GEM_TRACE here will also do extra mmio reads to cross-check the HWSP against the register block).

DMAR / VT-d is not mentioned in the log; I presume you have it disabled? An iommu is always a worry wrt random memory latencies and order of operations.

You can force us to use the older mmio only path with

diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
index 12486d8f534b..78e15e1805f4 100644
--- a/drivers/gpu/drm/i915/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/intel_engine_cs.c
@@ -463,6 +463,8 @@ static void intel_engine_init_batch_pool(struct intel_engine_cs *engine)
 
 static bool csb_force_mmio(struct drm_i915_private *i915)
 {
+       return true;
+
        /*
         * IOMMU adds unpredictable latency causing the CSB write (from the
         * GPU into the HWSP) to only be visible some time after the interrupt

which should be less susceptible to timing. Seeing whether the bug goes away with that may be helpful.
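For readers following the thread, the check that fires can be sketched in isolation. This is only an illustration built from the expression in the bug title: it assumes each CSB entry is two dwords, with the context ID in the second one. The helper name `csb_entry_matches` and the `CSB_ENTRIES` value are hypothetical and are not taken from the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed number of entries in the status buffer, for illustration only. */
#define CSB_ENTRIES 6u

/*
 * Hypothetical helper mirroring the condition in the GEM_BUG_ON:
 * returns true when the entry at `head` carries the tag we expect.
 */
static bool csb_entry_matches(const uint32_t *buf, unsigned int head,
                              uint32_t expected_context_id)
{
    assert(head < CSB_ENTRIES);

    /* buf[2 * head] is the status dword, buf[2 * head + 1] the context ID. */
    return buf[2 * head + 1] == expected_context_id;
}
```

If the read of `buf` races with the hardware write, the second dword can still hold a stale value, and a comparison like this one fails even though the entry is valid a moment later.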

Comment 3 Rafael Antognolli 2018-04-04 21:56:57 UTC

Cool, I'll give this a try.

Paulo also suggested disabling some individual GEM_TRACE messages, and it looks like this brings the hang back:

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c                           
index f60b61bf8b3b..37365ec55e63 100644                                         
--- a/drivers/gpu/drm/i915/intel_lrc.c         
+++ b/drivers/gpu/drm/i915/intel_lrc.c               
@@ -927,10 +927,10 @@ static void execlists_submission_tasklet(unsigned long data)
                        head = execlists->csb_head;                               
                        tail = READ_ONCE(buf[write_idx]);
                }                                        
-               GEM_TRACE("%s cs-irq head=%d [%d%s], tail=%d [%d%s]\n",
-                         engine->name,                                
-                         head, GEN8_CSB_READ_PTR(readl(dev_priv->regs + i915_mmio_reg_offset(RING_CONTEXT_STATUS_PTR(engine)))), fw ? "" : "?",
-                         tail, GEN8_CSB_WRITE_PTR(readl(dev_priv->regs + i915_mmio_reg_offset(RING_CONTEXT_STATUS_PTR(engine)))), fw ? "" : "?");
+               /* GEM_TRACE("%s cs-irq head=%d [%d%s], tail=%d [%d%s]\n", */                                                                     
+               /*        engine->name, */                                   
+               /*        head, GEN8_CSB_READ_PTR(readl(dev_priv->regs + i915_mmio_reg_offset(RING_CONTEXT_STATUS_PTR(engine)))), fw ? "" : "?", */
+               /*        tail, GEN8_CSB_WRITE_PTR(readl(dev_priv->regs + i915_mmio_reg_offset(RING_CONTEXT_STATUS_PTR(engine)))), fw ? "" : "?"); */
                                                                                                                                                     
                while (head != tail) {                                                                      
                        struct i915_request *rq;


I'll test your suggestion and report back (I might only have results by tomorrow, since sometimes it takes a while to reproduce it).

Comment 4 Rafael Antognolli 2018-04-04 22:00:39 UTC

Created attachment 138608 [details]
kernel config

And to your question, I don't know what DMAR / VT-d is.

I'm attaching the kernel config.

Comment 5 Rafael Antognolli 2018-04-04 22:38:50 UTC

And it crashes at the same place with the csb_force_mmio returning false.

Is that at least a good thing (not timing)?

Comment 6 Chris Wilson 2018-04-05 09:55:25 UTC

(In reply to Rafael Antognolli from comment #5)
> And it crashes at the same place with the csb_force_mmio returning false.
> 
> Is that at least a good thing (not timing)?

No, because that's the only thing removing the GEM_TRACE affects; timing. But it does tell us that at that moment, the CSB/mmio would be in sync. Only you removed the message that tells us what's wrong...

Comment 7 Mika Kuoppala 2018-04-06 08:11:45 UTC

(In reply to Rafael Antognolli from comment #5)
> And it crashes at the same place with the csb_force_mmio returning false.
> 
> Is that at least a good thing (not timing)?

Just to confirm: with csb_force_mmio returning true, even with traces off it is stable?

Comment 8 Rafael Antognolli 2018-04-06 14:26:04 UTC

Ugh, Chris suggested returning true, and I went and added a return false :-/

So, let me test this again and come back in a bit.

Comment 9 Rafael Antognolli 2018-04-06 17:28:02 UTC

So far it's looking good with csb_force_mmio returning true and traces off. I did try another run without that hack and the system hung on the first run.

But so far after adding the return true; there, it's been stable for 7-8 runs. I'll leave it running the whole day just to be sure.

Comment 10 Mika Kuoppala 2018-04-09 14:22:48 UTC

I managed to hit this with gem_ctx_switch also:

[ 1469.191109] [IGT] gem_ctx_switch: starting subtest bsd-interruptible
[ 1609.465467] execlists_submission_tasklet:1025 GEM_BUG_ON(buf[2 * head + 1] != port->context_id)
[ 1609.465527] ------------[ cut here ]------------
[ 1609.465530] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:1025!

Comment 11 Rafael Antognolli 2018-04-09 14:29:38 UTC

Oh, interesting.

Well, if it helps, with csb_force_mmio() always returning true, I didn't see this issue anymore.

Comment 12 Chris Wilson 2018-04-09 14:52:46 UTC

Ok, trim the GEM_TRACE there to exclude the mmio reads, let's just dump what the CSB status values are before it dies.

Comment 13 Rafael Antognolli 2018-04-11 00:52:29 UTC

So, trimming the MMIO reads doesn't help: I still cannot reproduce the issue. In fact, even removing the lines that I said would make it reproducible does not make it happen. I believe I may have had a false positive, given how hard it was to reproduce with piglit :-/

Comment 14 Jani Saarinen 2018-04-25 11:49:02 UTC

Closing, please re-open if the issue still exists.

Comment 15 Chris Wilson 2018-05-11 15:44:58 UTC

commit 77dfedb5be03779f9a5d83e323a1b36e32090105
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri May 11 13:11:45 2018 +0100

    drm/i915/execlists: Use rmb() to order CSB reads
    
    We assume that the CSB is written using the normal ringbuffer
    coherency protocols, as outlined in kernel/events/ring_buffer.c:
    
        *   (HW)                              (DRIVER)
        *
        *   if (LOAD ->data_tail) {            LOAD ->data_head
        *                      (A)             smp_rmb()       (C)
        *      STORE $data                     LOAD $data
        *      smp_wmb()       (B)             smp_mb()        (D)
        *      STORE ->data_head               STORE ->data_tail
        *   }
    
    So we assume that the HW fulfils its ordering requirements (B), and so
    we should use a complimentary rmb (C) to ensure that our read of its
    WRITE pointer is completed before we start accessing the data.
    
    The final mb (D) is implied by the uncached mmio we perform to inform
    the HW of our READ pointer.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=105064
    References: https://bugs.freedesktop.org/show_bug.cgi?id=105888
    References: https://bugs.freedesktop.org/show_bug.cgi?id=106185
    Fixes: 767a983ab255 ("drm/i915/execlists: Read the context-status HEAD from the HWSP")
    References: 61bf9719fa17 ("drm/i915/cnl: Use mmio access to context status buffer")
    Suggested-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Cc: Michał Winiarski <michal.winiarski@intel.com>
    Cc: Rafael Antognolli <rafael.antognolli@intel.com>
    Cc: Michel Thierry <michel.thierry@intel.com>
    Cc: Timo Aaltonen <tjaalton@ubuntu.com>
    Tested-by: Timo Aaltonen <tjaalton@ubuntu.com>
    Acked-by: Michel Thierry <michel.thierry@intel.com>
    Acked-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180511121147.31915-1-chris@chris-wilson.co.uk
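The ordering described in the commit message can be sketched in portable C11. This is a user-space analogy only, not the driver code: the acquire fence stands in for the rmb() at (C), the release fence for the hardware's ordering at (B), and every name here (`struct ring`, `ring_publish`, `ring_drain`, `RING_ENTRIES`) is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define RING_ENTRIES 6u /* assumed ring size, for illustration only */

struct ring {
    _Atomic uint32_t tail;        /* index of the last entry written */
    uint32_t data[RING_ENTRIES];  /* payload, written before tail moves */
};

static void ring_init(struct ring *r)
{
    atomic_init(&r->tail, 0);
}

/* Producer (stands in for the hardware): write the data, then the pointer. */
static void ring_publish(struct ring *r, uint32_t value)
{
    uint32_t next =
        (atomic_load_explicit(&r->tail, memory_order_relaxed) + 1) % RING_ENTRIES;

    r->data[next] = value;                      /* STORE $data */
    atomic_thread_fence(memory_order_release);  /* (B) */
    atomic_store_explicit(&r->tail, next, memory_order_relaxed);
}

/* Consumer (stands in for the driver): read the pointer, fence, then the data. */
static unsigned int ring_drain(struct ring *r, uint32_t *head,
                               uint32_t *out, unsigned int max)
{
    unsigned int n = 0;
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);

    assert(*head < RING_ENTRIES);

    /* (C): the data reads below must not be satisfied before the tail read. */
    atomic_thread_fence(memory_order_acquire);

    while (*head != tail && n < max) {
        *head = (*head + 1) % RING_ENTRIES;
        out[n++] = r->data[*head];              /* LOAD $data */
    }
    return n;
}
```

Without the fence at (C), the consumer may observe the new write pointer but still read stale data for the entries it covers, which is consistent with the mismatch the assertion caught.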
