Summary: | [CNL] System hang: GEM_BUG_ON(buf[2 * head + 1] != port->context_id) during piglit/gem_ctx_switch, masked by tracing enabled | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | DRI | Reporter: | Rafael Antognolli <rafael.antognolli> | ||||||
Component: | DRM/Intel | Assignee: | Intel GFX Bugs mailing list <intel-gfx-bugs> | ||||||
Status: | CLOSED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> | ||||||
Severity: | normal | ||||||||
Priority: | medium | CC: | intel-gfx-bugs | ||||||
Version: | DRI git | ||||||||
Hardware: | Other | ||||||||
OS: | All | ||||||||
Whiteboard: | |||||||||
i915 platform: | CNL | i915 features: | GEM/Other | ||||||
Attachments: |
|
Last commit that I tested and still has the bug: commit 4165791d29f64e01860a064f3c649447dbac41c3 Author: Ville Syrjälä <ville.syrjala@linux.intel.com> Date: Thu Mar 22 19:41:35 2018 +0200 drm/i915: Make force_load_detect effective even w/ DMI quirks/hotplug When doing forced load detection testing we should totally ignore any hotplug status for the connector. This is mostly relevant for machines where we already ignore the hotplug status based on the DMI quirks. On other machines we would currently skip the force load detection tests on account of the connector already being connected. v2: Drop the other force_load_detect check since it's useless now (Maarten) Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20180322174135.5982-1-ville.syrjala@linux.intel.com Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Too bad, looks like the GEM_TRACE might have the clue as to what happened. The complaint is that the CSB value read back from the hw doesn't match the tag we programmed into the ELSP. It might be a garbage value, or there might be extra bits in the dword we need to mask. The main worry is that the read is garbage, and it disappearing with GEM_TRACE suggests timing (the GEM_TRACE here will also do extra mmio read to cross the HWSP against the register block). DMAR / Vt'd is not mentioned in the log, I presume you have it disabled? iommu is always a worry wrt random memory latencices and order of operations. You can force us to use the older mmio only path with diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c index 12486d8f534b..78e15e1805f4 100644 --- a/drivers/gpu/drm/i915/intel_engine_cs.c +++ b/drivers/gpu/drm/i915/intel_engine_cs.c @@ -463,6 +463,8 @@ static void intel_engine_init_batch_pool(struct intel_engine_cs *engine) static bool csb_force_mmio(struct drm_i915_private *i915) { + return true; + /* * IOMMU adds unpredictable latency causing the CSB write (from the * GPU into the HWSP) to only be visible some time after the interrupt which should be less susceptible to timing. See if the bug goes away with that may be helpful. Cool, I'll give this a try. Paulo also suggested disabling some individual GEM_TRACE messages, and it looks like this brings the hang back: diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c index f60b61bf8b3b..37365ec55e63 100644 --- a/drivers/gpu/drm/i915/intel_lrc.c +++ b/drivers/gpu/drm/i915/intel_lrc.c @@ -927,10 +927,10 @@ static void execlists_submission_tasklet(unsigned long data) head = execlists->csb_head; tail = READ_ONCE(buf[write_idx]); } - GEM_TRACE("%s cs-irq head=%d [%d%s], tail=%d [%d%s]\n", - engine->name, - head, GEN8_CSB_READ_PTR(readl(dev_priv->regs + i915_mmio_reg_offset(RING_CONTEXT_STATUS_PTR(engine)))), fw ? "" : "?", - tail, GEN8_CSB_WRITE_PTR(readl(dev_priv->regs + i915_mmio_reg_offset(RING_CONTEXT_STATUS_PTR(engine)))), fw ? "" : "?"); + /* GEM_TRACE("%s cs-irq head=%d [%d%s], tail=%d [%d%s]\n", */ + /* engine->name, */ + /* head, GEN8_CSB_READ_PTR(readl(dev_priv->regs + i915_mmio_reg_offset(RING_CONTEXT_STATUS_PTR(engine)))), fw ? "" : "?", */ + /* tail, GEN8_CSB_WRITE_PTR(readl(dev_priv->regs + i915_mmio_reg_offset(RING_CONTEXT_STATUS_PTR(engine)))), fw ? "" : "?"); */ while (head != tail) { struct i915_request *rq; I'll test your suggestion and report back (might only have results by tomorrow, since some times it takes a while to reproduce it). Created attachment 138608 [details]
kernel config
And to your question, I don't know what is DMAR / Vt'd.
I'm attaching the kernel config.
And it crashes at the same place with the csb_force_mmio returning false. Is that at least a good thing (not timing)? (In reply to Rafael Antognolli from comment #5) > And it crashes at the same place with the csb_force_mmio returning false. > > Is that at least a good thing (not timing)? No, because that's the only thing removing the GEM_TRACE affects; timing. But it does tell us that at that moment, the CSB/mmio would be in sync. Only you removed the message that tells us what's wrong... (In reply to Rafael Antognolli from comment #5) > And it crashes at the same place with the csb_force_mmio returning false. > > Is that at least a good thing (not timing)? Just to confirm: with csb_force_mmio returning true, even with traces off it is stable? Ugh, Chris suggested returning true, and I went and added a return false :-/ So, let me test this again and come back in a bit. So far it's looking good with csb_force_mmio returning true and traces off. I did try another run without that hack and it had the system hang on the first run. But so far after adding the return true; there, it's been stable for 7-8 runs. I'll leave it running the whole day just to be sure. I managed to hit this with gem_ctx_switch also: [ 1469.191109] [IGT] gem_ctx_switch: starting subtest bsd-interruptible [ 1609.465467] execlists_submission_tasklet:1025 GEM_BUG_ON(buf[2 * head + 1] != port->context_id) [ 1609.465527] ------------[ cut here ]------------ [ 1609.465530] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:1025! Oh, interesting. Well, if it helps, with csb_force_mmio() returning always true, I didn't see this issue anymore. Ok, trim the GEM_TRACE there to exclude the mmio reads, let's just dump what the CSB status values are before it dies. So, trimming the MMIO reads doesn't help, I still cannot reproduce the issue. In fact, even removing those lines that I mentioned would make it reproducible do not really make it happen. I believe I may had a false positive, given how hard it was to reproduce it with piglit :-/ Closing, please re-open is issue still exists. commit 77dfedb5be03779f9a5d83e323a1b36e32090105 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Fri May 11 13:11:45 2018 +0100 drm/i915/execlists: Use rmb() to order CSB reads We assume that the CSB is written using the normal ringbuffer coherency protocols, as outlined in kernel/events/ring_buffer.c: * (HW) (DRIVER) * * if (LOAD ->data_tail) { LOAD ->data_head * (A) smp_rmb() (C) * STORE $data LOAD $data * smp_wmb() (B) smp_mb() (D) * STORE ->data_head STORE ->data_tail * } So we assume that the HW fulfils its ordering requirements (B), and so we should use a complimentary rmb (C) to ensure that our read of its WRITE pointer is completed before we start accessing the data. The final mb (D) is implied by the uncached mmio we perform to inform the HW of our READ pointer. References: https://bugs.freedesktop.org/show_bug.cgi?id=105064 References: https://bugs.freedesktop.org/show_bug.cgi?id=105888 References: https://bugs.freedesktop.org/show_bug.cgi?id=106185 Fixes: 767a983ab255 ("drm/i915/execlists: Read the context-status HEAD from the HWSP") References: 61bf9719fa17 ("drm/i915/cnl: Use mmio access to context status buffer") Suggested-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Michał Winiarski <michal.winiarski@intel.com> Cc: Rafael Antognolli <rafael.antognolli@intel.com> Cc: Michel Thierry <michel.thierry@intel.com> Cc: Timo Aaltonen <tjaalton@ubuntu.com> Tested-by: Timo Aaltonen <tjaalton@ubuntu.com> Acked-by: Michel Thierry <michel.thierry@intel.com> Acked-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20180511121147.31915-1-chris@chris-wilson.co.uk |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.
Created attachment 138599 [details] dmesg output of the hang. We are seeing system hangs on our CI. They seem to go away with some kernel config flags: CONFIG_DRM_I915_WERROR, CONFIG_DRM_I915_DEBUG_GEM and CONFIG_DRM_I915_TRACE_GEM The last one seems to be the one that make the hang entirely go away. I'm attaching a dmesg output with the error, with a kernel compiled with CONFIG_DRM_I915_DEBUG_GEM at least.