Bug 107715 - [CI][BAT] igt@gem_sync@basic-many-each - fail - Failed assertion: !"GPU hung"
Status: CLOSED DUPLICATE of bug 107769
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: XOrg git
Hardware: Other All
Importance: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
Whiteboard: ReadyForDev
Reported: 2018-08-28 12:58 UTC by Martin Peres
Modified: 2018-10-23 12:32 UTC

i915 platform: BYT
i915 features: GEM/Other


Description Martin Peres 2018-08-28 12:58:41 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4703/fi-byt-clapper/igt@gem_sync@basic-many-each.html


(gem_sync:2723) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:500:
(gem_sync:2723) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest basic-many-each failed.
Comment 1 Chris Wilson 2018-08-28 13:05:46 UTC
It reads as a TLB miss (the GPU jumped into a batch in the middle of nowhere and found only zeros). The chief suspect is ppGTT invalidation.
Comment 2 Chris Wilson 2018-08-28 16:54:15 UTC
Something odd I spotted here:

<7>[  232.791376] missed_breadcrumb [head 149a8, postfix 149f8, tail 14a20, batch 0x00000000_00002000]:
<7>[  232.791404] missed_breadcrumb [0000] 13244001 00000104 00000000 00000000 11000001 00022220 ffffffff 11000001
<7>[  232.791412] missed_breadcrumb [0020] 00022228 00470000 12400001 00022228 7fffc000 00000000 18800100 00002000
<7>[  232.791420] missed_breadcrumb [0040] 13204001 00000104 00000000 00000000 11000001 00002044 00000001 11000001
<7>[  232.791426] missed_breadcrumb [0060] 00012040 00000001 10800001 000000c0 00000001 01000000

There's only one MI_FLUSH_DW for the invalidate at the start of the request, but in emit_mi_flush_dw(), there's the comment:

                /*
                 * Not only do we need a full barrier (post-sync write) after
                 * invalidating the TLBs, but we need to wait a little bit
                 * longer. Whether this is merely delaying us, or the
                 * subsequent flush is a key part of serialising with the
                 * post-sync op, this extra pass appears vital before a
                 * mm switch!
                 */

Hmm.
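
For readers following the ring dump above, here is a minimal sketch of how the command headers can be decoded. It assumes the usual MI command layout (command type in bits 31:29, opcode in bits 28:23) and the opcode values used by the i915 driver headers. The function names are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* MI command headers carry the opcode in bits 28:23. */
static unsigned int mi_opcode(uint32_t header)
{
	return (header >> 23) & 0x3f;
}

/* Map a header dword to a command name (only the commands seen in the dump). */
static const char *mi_name(uint32_t header)
{
	if ((header >> 29) != 0)
		return "non-MI";

	switch (mi_opcode(header)) {
	case 0x00: return "MI_NOOP";
	case 0x02: return "MI_USER_INTERRUPT";
	case 0x21: return "MI_STORE_DWORD_INDEX";
	case 0x22: return "MI_LOAD_REGISTER_IMM";
	case 0x24: return "MI_STORE_REGISTER_MEM";
	case 0x26: return "MI_FLUSH_DW";
	case 0x31: return "MI_BATCH_BUFFER_START";
	default:   return "unknown";
	}
}

/* True only for an MI_FLUSH_DW header with the TLB invalidate bit (bit 18). */
static int flush_invalidates_tlb(uint32_t header)
{
	return mi_opcode(header) == 0x26 && (header & (1u << 18)) != 0;
}
```

Applied to the dump, 0x13244001 at offset 0000 decodes as MI_FLUSH_DW with the invalidate bit set, while 0x13204001 at offset 0040 is an MI_FLUSH_DW without it, which matches the observation that only one invalidating flush is emitted at the start of the request.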
Comment 3 Chris Wilson 2018-08-28 17:00:01 UTC
ARGH. The answer is that I never pushed:

    drm/i915/ringbuffer: Delay after invalidating gen6+ xcs
    
    During stress testing of full-ppgtt (on Baytrail at least), we found
    that the invalidation around a context/mm switch was insufficient (writes
    would go astray). Adding a second MI_FLUSH_DW barrier prevents this, but
    it is unclear as to whether this is merely a delaying tactic or if it is
    truly serialising with the TLB invalidation. Either way, it is
    empirically required.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Cc: Matthew Auld <matthew.william.auld@gmail.com>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Comment 4 Chris Wilson 2018-08-30 18:12:14 UTC
Fingers crossed:

commit 70b73f9ac113983f9c7db9887447f1344ac5b69b (HEAD -> drm-intel-next-queued, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Aug 30 17:10:42 2018 +0100

    drm/i915/ringbuffer: Delay after invalidating gen6+ xcs
    
    During stress testing of full-ppgtt (on Baytrail at least), we found
    that the invalidation around a context/mm switch was insufficient (writes
    would go astray). Adding a second MI_FLUSH_DW barrier prevents this, but
    it is unclear as to whether this is merely a delaying tactic or if it is
    truly serialising with the TLB invalidation. Either way, it is
    empirically required.
    
    v2: Avoid the loop for readability;
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107715
    References: https://bugs.freedesktop.org/show_bug.cgi?id=107759
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Cc: Matthew Auld <matthew.william.auld@gmail.com>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180830161042.29193-1-chris@chris-wilson.co.uk
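
To illustrate the idea in the commit message, here is a userspace sketch of emitting an invalidating MI_FLUSH_DW followed by a second MI_FLUSH_DW barrier. It is not the patch itself: the helper names and the SCRATCH_ADDR constant are hypothetical, the bit values are inferred from the ring dump in comment 2 and the i915 command definitions, and whether the real second flush carries the same flags is an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define MI_INSTR(opcode, flags)  (((uint32_t)(opcode) << 23) | (flags))
#define MI_NOOP                  MI_INSTR(0x00, 0)
#define MI_FLUSH_DW              MI_INSTR(0x26, 1)
#define MI_FLUSH_DW_STORE_INDEX  (1u << 21)
#define MI_INVALIDATE_TLB        (1u << 18)
#define MI_FLUSH_DW_OP_STOREDW   (1u << 14)
#define MI_FLUSH_DW_USE_GTT      (1u << 2)
/* Hypothetical name: scratch slot offset, inferred from dword 0x00000104 in the dump. */
#define SCRATCH_ADDR             0x100u

/* Write one 4-dword MI_FLUSH_DW with a post-sync write; returns the new write pointer. */
static uint32_t *emit_flush_dw(uint32_t *cs, uint32_t flags)
{
	assert(cs);
	*cs++ = MI_FLUSH_DW | MI_FLUSH_DW_STORE_INDEX |
		MI_FLUSH_DW_OP_STOREDW | flags;
	*cs++ = SCRATCH_ADDR | MI_FLUSH_DW_USE_GTT;
	*cs++ = 0;
	*cs++ = MI_NOOP;
	return cs;
}

/* Invalidate the TLBs, then emit a second flush as the extra barrier/delay. */
static uint32_t *emit_invalidate_with_delay(uint32_t *cs)
{
	cs = emit_flush_dw(cs, MI_INVALIDATE_TLB);
	cs = emit_flush_dw(cs, 0); /* assumption: the second pass is a plain flush */
	return cs;
}
```

Calling emit_invalidate_with_delay() on an 8-dword buffer yields 0x13244001 as the first header and 0x13204001 as the second, the same two encodings that appear in the dump. Whether the second flush truly serialises with the invalidation or merely delays execution is, as the commit message says, unclear.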
Comment 5 Martin Peres 2018-09-03 11:24:20 UTC
(In reply to Chris Wilson from comment #4)

Still happening:

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_100/fi-byt-j1900/igt@gem_sync@basic-many-each.html

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4744/fi-byt-squawks/igt@gem_sync@basic-many-each.html

(gem_sync:2671) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:500:
(gem_sync:2671) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest basic-many-each failed.
Comment 6 Chris Wilson 2018-09-03 11:26:47 UTC
Marking as a forward duplicate: this is the same full-ppgtt invalidation bug.

*** This bug has been marked as a duplicate of bug 107769 ***
Comment 7 Lakshmi 2018-10-23 12:32:13 UTC
Closed as duplicate.