105888 – [CNL] System hang: GEM_BUG_ON(buf[2 * head + 1] != port->context_id) during piglit/gem_ctx_switch, masked by tracing enabled

Bug 105888 - [CNL] System hang: GEM_BUG_ON(buf[2 * head + 1] != port->context_id) during piglit/gem_ctx_switch, masked by tracing enabled

Summary: [CNL] System hang: GEM_BUG_ON(buf[2 * head + 1] != port->context_id) during p...

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel (show other bugs)
Version:	DRI git
Hardware:	Other All

Importance:	medium normal
Assignee:	Intel GFX Bugs mailing list
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2018-04-04 18:04 UTC by Rafael Antognolli
Modified:	2018-05-11 15:44 UTC (History)
CC List:	1 user (show)

See Also:
i915 platform:	CNL
i915 features:	GEM/Other

Attachments
dmesg output of the hang. (117.95 KB, text/plain) 2018-04-04 18:04 UTC, Rafael Antognolli	no flags	Details
kernel config (170.07 KB, text/x-mpsub) 2018-04-04 22:00 UTC, Rafael Antognolli	no flags	Details
View All

Description Rafael Antognolli 2018-04-04 18:04:41 UTC

Created attachment 138599 [details]
dmesg output of the hang.

We are seeing system hangs on our CI. They seem to go away with some kernel config flags:

CONFIG_DRM_I915_WERROR, CONFIG_DRM_I915_DEBUG_GEM and CONFIG_DRM_I915_TRACE_GEM

The last one seems to be the one that make the hang entirely go away.

I'm attaching a dmesg output with the error, with a kernel compiled with CONFIG_DRM_I915_DEBUG_GEM at least.

Comment 1 Rafael Antognolli 2018-04-04 18:09:18 UTC

Last commit that I tested and still has the bug:

commit 4165791d29f64e01860a064f3c649447dbac41c3
Author: Ville Syrjälä <ville.syrjala@linux.intel.com>
Date:   Thu Mar 22 19:41:35 2018 +0200
                                                     
    drm/i915: Make force_load_detect effective even w/ DMI quirks/hotplug

    When doing forced load detection testing we should totally ignore any
    hotplug status for the connector. This is mostly relevant for machines
    where we already ignore the hotplug status based on the DMI quirks. On
    other machines we would currently skip the force load detection tests
    on account of the connector already being connected.

    v2: Drop the other force_load_detect check since it's useless now (Maarten)

    Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>            
    Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>          
    Link: https://patchwork.freedesktop.org/patch/msgid/20180322174135.5982-1-ville.syrjala@linux.intel.com
    Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>

Comment 2 Chris Wilson 2018-04-04 21:45:44 UTC

Too bad, looks like the GEM_TRACE might have the clue as to what happened.

The complaint is that the CSB value read back from the hw doesn't match the tag we programmed into the ELSP. It might be a garbage value, or there might be extra bits in the dword we need to mask. The main worry is that the read is garbage, and it disappearing with GEM_TRACE suggests timing (the GEM_TRACE here will also do extra mmio read to cross the HWSP against the register block).

DMAR / Vt'd is not mentioned in the log, I presume you have it disabled? iommu is always a worry wrt random memory latencices and order of operations.

You can force us to use the older mmio only path with

diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
index 12486d8f534b..78e15e1805f4 100644
--- a/drivers/gpu/drm/i915/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/intel_engine_cs.c
@@ -463,6 +463,8 @@ static void intel_engine_init_batch_pool(struct intel_engine_cs *engine)
 
 static bool csb_force_mmio(struct drm_i915_private *i915)
 {
+       return true;
+
        /*
         * IOMMU adds unpredictable latency causing the CSB write (from the
         * GPU into the HWSP) to only be visible some time after the interrupt

which should be less susceptible to timing. See if the bug goes away with that may be helpful.

Comment 3 Rafael Antognolli 2018-04-04 21:56:57 UTC

Cool, I'll give this a try.

Paulo also suggested disabling some individual GEM_TRACE messages, and it looks like this brings the hang back:

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c                           
index f60b61bf8b3b..37365ec55e63 100644                                         
--- a/drivers/gpu/drm/i915/intel_lrc.c         
+++ b/drivers/gpu/drm/i915/intel_lrc.c               
@@ -927,10 +927,10 @@ static void execlists_submission_tasklet(unsigned long data)
                        head = execlists->csb_head;                               
                        tail = READ_ONCE(buf[write_idx]);
                }                                        
-               GEM_TRACE("%s cs-irq head=%d [%d%s], tail=%d [%d%s]\n",
-                         engine->name,                                
-                         head, GEN8_CSB_READ_PTR(readl(dev_priv->regs + i915_mmio_reg_offset(RING_CONTEXT_STATUS_PTR(engine)))), fw ? "" : "?",
-                         tail, GEN8_CSB_WRITE_PTR(readl(dev_priv->regs + i915_mmio_reg_offset(RING_CONTEXT_STATUS_PTR(engine)))), fw ? "" : "?");
+               /* GEM_TRACE("%s cs-irq head=%d [%d%s], tail=%d [%d%s]\n", */                                                                     
+               /*        engine->name, */                                   
+               /*        head, GEN8_CSB_READ_PTR(readl(dev_priv->regs + i915_mmio_reg_offset(RING_CONTEXT_STATUS_PTR(engine)))), fw ? "" : "?", */
+               /*        tail, GEN8_CSB_WRITE_PTR(readl(dev_priv->regs + i915_mmio_reg_offset(RING_CONTEXT_STATUS_PTR(engine)))), fw ? "" : "?"); */
                                                                                                                                                     
                while (head != tail) {                                                                      
                        struct i915_request *rq;


I'll test your suggestion and report back (might only have results by tomorrow, since some times it takes a while to reproduce it).

Comment 4 Rafael Antognolli 2018-04-04 22:00:39 UTC

Created attachment 138608 [details]
kernel config

And to your question, I don't know what is DMAR / Vt'd.

I'm attaching the kernel config.

Comment 5 Rafael Antognolli 2018-04-04 22:38:50 UTC

And it crashes at the same place with the csb_force_mmio returning false.

Is that at least a good thing (not timing)?

Comment 6 Chris Wilson 2018-04-05 09:55:25 UTC

(In reply to Rafael Antognolli from comment #5)
> And it crashes at the same place with the csb_force_mmio returning false.
> 
> Is that at least a good thing (not timing)?

No, because that's the only thing removing the GEM_TRACE affects; timing. But it does tell us that at that moment, the CSB/mmio would be in sync. Only you removed the message that tells us what's wrong...

Comment 7 Mika Kuoppala 2018-04-06 08:11:45 UTC

(In reply to Rafael Antognolli from comment #5)
> And it crashes at the same place with the csb_force_mmio returning false.
> 
> Is that at least a good thing (not timing)?

Just to confirm: with csb_force_mmio returning true, even with traces off it is stable?

Comment 8 Rafael Antognolli 2018-04-06 14:26:04 UTC

Ugh, Chris suggested returning true, and I went and added a return false :-/

So, let me test this again and come back in a bit.

Comment 9 Rafael Antognolli 2018-04-06 17:28:02 UTC

So far it's looking good with csb_force_mmio returning true and traces off. I did try another run without that hack and it had the system hang on the first run.

But so far after adding the return true; there, it's been stable for 7-8 runs. I'll leave it running the whole day just to be sure.

Comment 10 Mika Kuoppala 2018-04-09 14:22:48 UTC

I managed to hit this with gem_ctx_switch also:

[ 1469.191109] [IGT] gem_ctx_switch: starting subtest bsd-interruptible
[ 1609.465467] execlists_submission_tasklet:1025 GEM_BUG_ON(buf[2 * head + 1] != port->context_id)
[ 1609.465527] ------------[ cut here ]------------
[ 1609.465530] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:1025!

Comment 11 Rafael Antognolli 2018-04-09 14:29:38 UTC

Oh, interesting.

Well, if it helps, with csb_force_mmio() returning always true, I didn't see this issue anymore.

Comment 12 Chris Wilson 2018-04-09 14:52:46 UTC

Ok, trim the GEM_TRACE there to exclude the mmio reads, let's just dump what the CSB status values are before it dies.

Comment 13 Rafael Antognolli 2018-04-11 00:52:29 UTC

So, trimming the MMIO reads doesn't help, I still cannot reproduce the issue. In fact, even removing those lines that I mentioned would make it reproducible do not really make it happen. I believe I may had a false positive, given how hard it was to reproduce it with piglit :-/

Comment 14 Jani Saarinen 2018-04-25 11:49:02 UTC

Closing, please re-open is issue still exists.

Comment 15 Chris Wilson 2018-05-11 15:44:58 UTC

commit 77dfedb5be03779f9a5d83e323a1b36e32090105
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri May 11 13:11:45 2018 +0100

    drm/i915/execlists: Use rmb() to order CSB reads
    
    We assume that the CSB is written using the normal ringbuffer
    coherency protocols, as outlined in kernel/events/ring_buffer.c:
    
        *   (HW)                              (DRIVER)
        *
        *   if (LOAD ->data_tail) {            LOAD ->data_head
        *                      (A)             smp_rmb()       (C)
        *      STORE $data                     LOAD $data
        *      smp_wmb()       (B)             smp_mb()        (D)
        *      STORE ->data_head               STORE ->data_tail
        *   }
    
    So we assume that the HW fulfils its ordering requirements (B), and so
    we should use a complimentary rmb (C) to ensure that our read of its
    WRITE pointer is completed before we start accessing the data.
    
    The final mb (D) is implied by the uncached mmio we perform to inform
    the HW of our READ pointer.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=105064
    References: https://bugs.freedesktop.org/show_bug.cgi?id=105888
    References: https://bugs.freedesktop.org/show_bug.cgi?id=106185
    Fixes: 767a983ab255 ("drm/i915/execlists: Read the context-status HEAD from the HWSP")
    References: 61bf9719fa17 ("drm/i915/cnl: Use mmio access to context status buffer")
    Suggested-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Cc: Michał Winiarski <michal.winiarski@intel.com>
    Cc: Rafael Antognolli <rafael.antognolli@intel.com>
    Cc: Michel Thierry <michel.thierry@intel.com>
    Cc: Timo Aaltonen <tjaalton@ubuntu.com>
    Tested-by: Timo Aaltonen <tjaalton@ubuntu.com>
    Acked-by: Michel Thierry <michel.thierry@intel.com>
    Acked-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180511121147.31915-1-chris@chris-wilson.co.uk

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.