105957 – [CI] igt@gem_eio@* - fail - Test assertion failure function trigger_reset - Failed assertion: igt_seconds_elapsed(&ts) < 2

Status:	RESOLVED WONTFIX

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	DRI git
Hardware:	Other All

Importance:	medium normal
Assignee:	Chris Wilson
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:	ReadyForDev
Keywords:

Depends on:
Blocks:

Reported:	2018-04-09 12:39 UTC by Marta Löfstedt
Modified:	2019-10-02 18:17 UTC
CC List:	1 user

See Also:
i915 platform:	ALL
i915 features:	GEM/Other

Description Marta Löfstedt 2018-04-09 12:39:15 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_15/fi-byt-n2820/igt@gem_eio@unwedge-stress.html

(gem_eio:1302) CRITICAL: Test assertion failure function trigger_reset, file ../tests/gem_eio.c:81:
(gem_eio:1302) CRITICAL: Failed assertion: igt_seconds_elapsed(&ts) < 2
Subtest unwedge-stress failed.

Comment 1 Martin Peres 2018-04-20 12:29:13 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_23/fi-cfl-u/igt@gem_eio@in-flight-suspend.html
	
(gem_eio:1624) CRITICAL: Test assertion failure function trigger_reset, file ../tests/gem_eio.c:81:
(gem_eio:1624) CRITICAL: Failed assertion: igt_seconds_elapsed(&ts) < 2
Subtest in-flight-suspend failed.

Comment 2 Chris Wilson 2018-05-17 22:34:10 UTC

commit 89ae332745e31a075747a63ac5acc5baccf75769
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri May 11 18:58:59 2018 +0100

    tests/gem_eio: Only wait-for-idle inside trigger_reset()
    
    trigger_reset() imposes a tight time constraint (2s) so that we verify
    that the reset itself completes quickly. In the middle of this check, we
    call gem_quiescent_gpu() which may invoke an rcu_barrier() or two to
    clear out the freed memory (DROP_FREED). Those barriers may have
    unbounded latency pushing beyond the 2s timeout, so restrict the
    operation to only wait-for-idle (DROP_ACTIVE).
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=105957
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Optimistically marking as fixed to see what happens. It's doubtful that the rcu_barrier alone is causing the grief, so I suspect there might be an outside timing influence -- as far as I can tell, the driver is doing the right thing and isn't causing the delay itself.
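
A minimal sketch of the kind of timed window the commit message describes, assuming the test samples a monotonic clock before the reset and again once the GPU is idle. The helper names `elapsed_ns` and `within_budget` are hypothetical and are not IGT functions. Everything executed between the two timestamps counts against the budget, which is why a slow barrier inside the window can trip the assertion even when the reset itself is quick.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC INT64_C(1000000000)

/* Hypothetical helper: nanoseconds between two monotonic clock samples. */
static int64_t elapsed_ns(const struct timespec *start,
                          const struct timespec *end)
{
    int64_t ns = (int64_t)(end->tv_sec - start->tv_sec) * NSEC_PER_SEC +
                 (int64_t)(end->tv_nsec - start->tv_nsec);

    /* A monotonic clock should never run backwards. */
    assert(ns >= 0);
    return ns;
}

/* Hypothetical check: 1 if the window is strictly shorter than the budget. */
static int within_budget(const struct timespec *start,
                         const struct timespec *end,
                         int64_t budget_ns)
{
    return elapsed_ns(start, end) < budget_ns;
}
```

In a real test the two samples would presumably come from `clock_gettime(CLOCK_MONOTONIC, ...)`, taken immediately before triggering the reset and immediately after the wait-for-idle returns.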

Comment 3 Martin Peres 2018-05-22 06:14:41 UTC

It was definitely not fixed:

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_46/fi-kbl-7567u/igt@gem_eio@in-flight-suspend.html

Comment 4 Chris Wilson 2018-05-22 08:32:39 UTC

(In reply to Martin Peres from comment #3)
> It was definitely not fixed:
> 
> https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_46/fi-kbl-7567u/igt@gem_eio@in-flight-suspend.html

But that isn't the same bug. In that case there was an unexpected GPU hangcheck after resume.

Comment 5 Lakshmi 2018-09-27 10:26:24 UTC

This issue is occurring regularly. 
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_115/fi-bsw-n3050/igt@gem_eio@in-flight-1us.html

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_113/fi-bsw-n3050/igt@gem_eio@in-flight-external.html

Comment 6 Lakshmi 2018-10-19 15:29:28 UTC

Update: Last seen CI_DRM_4943_full (1 week, 6 days / 164 runs ago).

Comment 7 Andi 2018-12-07 16:54:01 UTC

This has a very low failure rate: I have been running the test list from IGT_4727 for a long time without a single failure.

So far I have run the test 236 times over more than 48 hours.

Is it OK to lower the "importance" of this bug to "lowest"?

Comment 8 Francesco Balestrieri 2018-12-11 09:13:38 UTC

Last seen 2 weeks ago on GLK:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5207/shard-glk7/igt@gem_eio@wait-wedge-immediate.html

Before that, it happened with weekly frequency.

Comment 9 Chris Wilson 2019-02-15 15:43:51 UTC

Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Feb 12 20:40:34 2019 +0000

    i915/gem_eio: Check average reset times
    
    As we have moved to rcu/srcu to serialise the resets, individual resets
    are subject to small variations in system grace periods. Allow for this
    by only expecting the median reset time to be within our target, thereby
    excluding noisy outliers from perturbing our results (but keep the
    maximum capped to prevent horrid failures!)
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
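
The median-plus-cap idea in the commit message above can be sketched roughly as follows. This is an illustration only, not the code from the commit: the function names, the use of nanosecond samples, and the `target_ns` and `cap_ns` parameters are all assumptions. The median tolerates a few slow outliers, while the separate cap on the slowest sample still catches a pathological reset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator for 64-bit durations. */
static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;

    return (x > y) - (x < y);
}

/* Sorts the samples in place and returns the median
 * (the upper of the two middle values when n is even). */
static uint64_t median_ns(uint64_t *samples, size_t n)
{
    assert(n > 0);
    qsort(samples, n, sizeof(*samples), cmp_u64);
    return samples[n / 2];
}

/* Hypothetical check: the median must be under target_ns and the
 * slowest sample under cap_ns. Returns 1 on success, 0 otherwise. */
static int reset_times_ok(uint64_t *samples, size_t n,
                          uint64_t target_ns, uint64_t cap_ns)
{
    uint64_t med = median_ns(samples, n);
    uint64_t max = samples[n - 1]; /* array is sorted by median_ns() */

    return med < target_ns && max < cap_ns;
}
```

With samples of 100, 200, 300, 900 and 5000 ns, a 1000 ns target and a 10000 ns cap, the check passes because the median is 300 ns, even though one sample is far above the target.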

Comment 10 CI Bug Log 2019-03-12 11:16:19 UTC

A CI Bug Log filter associated to this bug has been updated:

{- all machines: igt@gem_eio@* - fail - Failed assertion: igt_seconds_elapsed(&ts) < 2 -}
{+ all machines: igt@gem_eio@* - fail - Failed assertion: igt_seconds_elapsed(&ts) < 2 +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5729/shard-glk4/igt@gem_eio@context-create.html

Comment 11 Lakshmi 2019-03-12 11:17:10 UTC

This bug is reopened due to this failure:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5729/shard-glk4/igt@gem_eio@context-create.html

Starting subtest: context-create
(gem_eio:2574) CRITICAL: Test assertion failure function trigger_reset, file ../tests/i915/gem_eio.c:82:
(gem_eio:2574) CRITICAL: Failed assertion: igt_seconds_elapsed(&ts) < 2
Subtest context-create failed.
**** DEBUG ****
(gem_eio:2574) i915/gem_context-DEBUG: Test requirement passed: gem_has_contexts(fd)
(gem_eio:2574) igt_debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(gem_eio:2574) DEBUG: Disabling GPU reset
(gem_eio:2574) DEBUG: Test requirement passed: fd >= 0
(gem_eio:2574) DEBUG: Test requirement passed: i915_reset_control(false)
(gem_eio:2574) igt_debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(gem_eio:2574) DEBUG: Enabling GPU reset
(gem_eio:2574) DEBUG: Test requirement passed: fd >= 0
(gem_eio:2574) igt_gt-DEBUG: Triggering GPU reset
(gem_eio:2574) igt_debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(gem_eio:2574) DEBUG: Checking that the GPU recovered
(gem_eio:2574) igt_debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(gem_eio:2574) CRITICAL: Test assertion failure function trigger_reset, file ../tests/i915/gem_eio.c:82:
(gem_eio:2574) CRITICAL: Failed assertion: igt_seconds_elapsed(&ts) < 2
(gem_eio:2574) igt_core-INFO: Stack trace:
(gem_eio:2574) igt_core-INFO:   #0 ../lib/igt_core.c:1474 __igt_fail_assert()
(gem_eio:2574) igt_core-INFO:   #1 ../tests/i915/gem_eio.c:83 trigger_reset()
(gem_eio:2574) igt_core-INFO:   #2 ../tests/i915/gem_eio.c:132 test_context_create()
(gem_eio:2574) igt_core-INFO:   #3 ../tests/i915/gem_eio.c:835 __real_main814()
(gem_eio:2574) igt_core-INFO:   #4 ../tests/i915/gem_eio.c:814 main()
(gem_eio:2574) igt_core-INFO:   #5 ../csu/libc-start.c:344 __libc_start_main()
(gem_eio:2574) igt_core-INFO:   #6 [_start+0x2a]
****  END  ****

Comment 12 Francesco Balestrieri 2019-06-03 06:13:39 UTC

Latest occurrence from two weeks ago: 

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6087/shard-glk1/igt@gem_eio@wait-10ms.html

Comment 13 Francesco Balestrieri 2019-07-23 08:16:46 UTC

Continues to occur every 1-2 weeks, e.g.:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6402/shard-glk6/igt@gem_eio@in-flight-10ms.html

Comment 14 CI Bug Log 2019-09-19 05:57:32 UTC

A CI Bug Log filter associated to this bug has been updated:

{- all machines: igt@gem_eio@* - fail - Failed assertion: igt_seconds_elapsed(&ts) < 2 -}
{+ all machines: igt@gem_eio@* - fail - Failed assertion: igt_seconds_elapsed(&ts) < 2 +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-tgl-u/igt@gem_eio@in-flight-internal-immediate.html

Comment 15 CI Bug Log 2019-10-01 08:46:51 UTC

A CI Bug Log filter associated to this bug has been updated:

{- all machines: igt@gem_eio@* - fail - Failed assertion: igt_seconds_elapsed(&ts) < 2 -}
{+ all machines: igt@gem_eio@* - fail - Failed assertion: igt_seconds_elapsed(&ts) < 2 +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6976/shard-snb1/igt@gem_eio@kms.html

Comment 16 Chris Wilson 2019-10-02 11:15:50 UTC

Note to self: with the upcoming heartbeat change for snb/ivb, it's likely we need to bump this to allow for the heartbeat timeout, say 5s.

Comment 17 Chris Wilson 2019-10-02 18:17:56 UTC

Give up.

commit 74f55119f9920b65996535210a09147997804136 (HEAD, upstream/master)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Oct 2 12:17:07 2019 +0100

    i915/gem_eio: Relax timeout for forced resets
    
    It appears we cannot consistently hit our self-imposed QoS target of 2s
    for performing the reset (my theory is that is some RCU scheduling
    quirk), so relax the assertion to a measly 10s.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=105957
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
