Summary: | [CI][BAT] igt@gem_sync@basic-many-each - fail - Failed assertion: !"GPU hung" | ||
---|---|---|---|
Product: | DRI | Reporter: | Martin Peres <martin.peres> |
Component: | DRM/Intel | Assignee: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Status: | CLOSED DUPLICATE | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Severity: | normal | ||
Priority: | medium | CC: | intel-gfx-bugs |
Version: | XOrg git | ||
Hardware: | Other | ||
OS: | All | ||
Whiteboard: | ReadyForDev | ||
i915 platform: | BYT | i915 features: | GEM/Other |
Description
Martin Peres
2018-08-28 12:58:41 UTC
It reads as a TLB miss (jump into a batch in the middle of nowhere and only found zeros). Chief suspect is ppGTT invalidation.

Something odd I spotted here:

<7>[ 232.791376] missed_breadcrumb [head 149a8, postfix 149f8, tail 14a20, batch 0x00000000_00002000]:
<7>[ 232.791404] missed_breadcrumb [0000] 13244001 00000104 00000000 00000000 11000001 00022220 ffffffff 11000001
<7>[ 232.791412] missed_breadcrumb [0020] 00022228 00470000 12400001 00022228 7fffc000 00000000 18800100 00002000
<7>[ 232.791420] missed_breadcrumb [0040] 13204001 00000104 00000000 00000000 11000001 00002044 00000001 11000001
<7>[ 232.791426] missed_breadcrumb [0060] 00012040 00000001 10800001 000000c0 00000001 01000000

There's only one MI_FLUSH_DW for the invalidate at the start of the request, but in emit_mi_flush_dw(), there's the comment:

/*
 * Not only do we need a full barrier (post-sync write) after
 * invalidating the TLBs, but we need to wait a little bit
 * longer. Whether this is merely delaying us, or the
 * subsequent flush is a key part of serialising with the
 * post-sync op, this extra pass appears vital before a
 * mm switch!
 */

Hmm.

ARGH. The answer is I never pushed

    drm/i915/ringbuffer: Delay after invalidating gen6+ xcs

    During stress testing of full-ppgtt (on Baytrail at least), we found
    that the invalidation around a context/mm switch was insufficient (writes
    would go astray). Adding a second MI_FLUSH_DW barrier prevents this, but
    it is unclear as to whether this is merely a delaying tactic or if it is
    truly serialising with the TLB invalidation. Either way, it is
    empirically required.

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Cc: Matthew Auld <matthew.william.auld@gmail.com>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Fingers crossed:

commit 70b73f9ac113983f9c7db9887447f1344ac5b69b (HEAD -> drm-intel-next-queued, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Thu Aug 30 17:10:42 2018 +0100

    drm/i915/ringbuffer: Delay after invalidating gen6+ xcs

    During stress testing of full-ppgtt (on Baytrail at least), we found
    that the invalidation around a context/mm switch was insufficient (writes
    would go astray). Adding a second MI_FLUSH_DW barrier prevents this, but
    it is unclear as to whether this is merely a delaying tactic or if it is
    truly serialising with the TLB invalidation. Either way, it is
    empirically required.

    v2: Avoid the loop for readability;

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107715
    References: https://bugs.freedesktop.org/show_bug.cgi?id=107759
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Cc: Matthew Auld <matthew.william.auld@gmail.com>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180830161042.29193-1-chris@chris-wilson.co.uk

(In reply to Chris Wilson from comment #4)
> Fingers crossed:
>
> commit 70b73f9ac113983f9c7db9887447f1344ac5b69b (HEAD ->
> drm-intel-next-queued, drm-intel/drm-intel-next-queued)
>
> drm/i915/ringbuffer: Delay after invalidating gen6+ xcs

Still happening:

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_100/fi-byt-j1900/igt@gem_sync@basic-many-each.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4744/fi-byt-squawks/igt@gem_sync@basic-many-each.html

(gem_sync:2671) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:500:
(gem_sync:2671) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest basic-many-each failed.

Forward dup for the same full-ppgtt invalidation bug.

*** This bug has been marked as a duplicate of bug 107769 ***

Closed as duplicate.
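
For anyone re-checking the missed_breadcrumb dump from the description, here is a minimal Python sketch that walks the ring dwords and counts how many MI_FLUSH_DW commands with the TLB-invalidate bit precede the batch start. The opcode table, the MI_INVALIDATE_TLB bit position and the length rule are assumptions based on my reading of the i915 command definitions, so please verify them against the kernel headers. The helper names (parse_dwords, decode, invalidates_before_batch) are made up for this example and are not part of any existing tool.

```python
import re

# Opcode values (bits 28:23 of an MI command header). Assumed to match the
# MI_INSTR(opcode, flags) definitions in the i915 driver; check the headers.
MI_OPCODES = {
    0x00: "MI_NOOP",
    0x02: "MI_USER_INTERRUPT",
    0x21: "MI_STORE_DWORD_INDEX",
    0x22: "MI_LOAD_REGISTER_IMM",
    0x24: "MI_STORE_REGISTER_MEM",
    0x26: "MI_FLUSH_DW",
    0x31: "MI_BATCH_BUFFER_START",
}
MI_INVALIDATE_TLB = 1 << 18  # assumed flag bit in the MI_FLUSH_DW header

# Ring contents copied from the missed_breadcrumb lines in this report.
DUMP = """\
[0000] 13244001 00000104 00000000 00000000 11000001 00022220 ffffffff 11000001
[0020] 00022228 00470000 12400001 00022228 7fffc000 00000000 18800100 00002000
[0040] 13204001 00000104 00000000 00000000 11000001 00002044 00000001 11000001
[0060] 00012040 00000001 10800001 000000c0 00000001 01000000
"""


def parse_dwords(text):
    """Collect the 8-hex-digit dwords, skipping the 4-digit [offset] column."""
    return [int(tok, 16) for tok in re.findall(r"\b[0-9a-f]{8}\b", text)]


def decode(dwords):
    """Walk the stream, returning (dword_index, name, header) per command."""
    commands = []
    i = 0
    while i < len(dwords):
        header = dwords[i]
        opcode = (header >> 23) & 0x3F
        # Assumption: MI opcodes below 0x10 are single-dword commands; the
        # rest carry (total length - 2) in the low six bits of the header.
        length = 1 if opcode < 0x10 else (header & 0x3F) + 2
        name = MI_OPCODES.get(opcode, f"unknown(0x{opcode:02x})")
        commands.append((i, name, header))
        i += length
    return commands


def invalidates_before_batch(commands):
    """Count MI_FLUSH_DW commands with the invalidate bit before the batch."""
    count = 0
    for _, name, header in commands:
        if name == "MI_BATCH_BUFFER_START":
            break
        if name == "MI_FLUSH_DW" and header & MI_INVALIDATE_TLB:
            count += 1
    return count


if __name__ == "__main__":
    cmds = decode(parse_dwords(DUMP))
    for index, name, header in cmds:
        print(f"[{index * 4:04x}] {header:08x} {name}")
    print("invalidating flushes before batch:", invalidates_before_batch(cmds))
```

Under those assumptions the dump decodes to a single invalidating MI_FLUSH_DW (13244001) ahead of MI_BATCH_BUFFER_START, and a second MI_FLUSH_DW (13204001) without the invalidate bit only after the batch, which matches the observation above that there is only one flush for the invalidate at the start of the request.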