Bug 107799 - [CI][SHARDS] igt@gem_eio@in-flight-*-(10ms|1us|immediate) - fail - Failed assertion: elapsed < 250e6
Summary: [CI][SHARDS] igt@gem_eio@in-flight-*-(10ms|1us|immediate) - fail - Failed ass...
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel (show other bugs)
Version: XOrg git
Hardware: Other All
: low normal
Assignee: Francesco Balestrieri
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
Depends on:
Blocks:
 
Reported: 2018-09-03 12:08 UTC by Martin Peres
Modified: 2019-03-06 18:22 UTC (History)
1 user (show)

See Also:
i915 platform: BSW/CHT, BYT, G45, GLK, I915G, I945GM, I965G, SNB
i915 features: GEM/Other


Attachments

Description Martin Peres 2018-09-03 12:08:08 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4749/shard-glk5/igt@gem_eio@in-flight-contexts-10ms.html

(gem_eio:2578) CRITICAL: Test assertion failure function check_wait, file ../tests/gem_eio.c:258:
(gem_eio:2578) CRITICAL: Failed assertion: elapsed < 250e6
(gem_eio:2578) CRITICAL: Wake up following reset+wedge took 1170.696ms
Subtest in-flight-contexts-10ms failed.
Comment 1 Martin Peres 2018-09-11 13:48:23 UTC
Also seen in pre-merge:

https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_1821/shard-glk8/igt@gem_eio@in-flight-contexts-immediate.html

(gem_eio:15447) CRITICAL: Test assertion failure function check_wait, file ../tests/gem_eio.c:258:
(gem_eio:15447) CRITICAL: Failed assertion: elapsed < 250e6
(gem_eio:15447) CRITICAL: Wake up following reset+wedge took 1658.497ms
Subtest in-flight-contexts-immediate failed.
Comment 2 Martin Peres 2018-09-20 09:17:19 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4647/shard-glk1/igt@gem_eio@in-flight-1us.html

Starting subtest: in-flight-1us
(gem_eio:1942) CRITICAL: Test assertion failure function check_wait, file ../tests/gem_eio.c:258:
(gem_eio:1942) CRITICAL: Failed assertion: elapsed < 250e6
(gem_eio:1942) CRITICAL: Wake up following reset+wedge took 2523.865ms
Subtest in-flight-1us failed.
Comment 3 Martin Peres 2018-09-21 14:43:36 UTC
Also seen on BYT: https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_115/fi-byt-n2820/igt@gem_eio@in-flight-1us.html

Starting subtest: in-flight-1us
(gem_eio:947) CRITICAL: Test assertion failure function check_wait, file ../tests/gem_eio.c:258:
(gem_eio:947) CRITICAL: Failed assertion: elapsed < 250e6
(gem_eio:947) CRITICAL: Wake up following reset+wedge took 2703.681ms
Subtest in-flight-1us failed.
Comment 4 Martin Peres 2018-10-30 16:29:19 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_135/fi-bwr-2160/igt@gem_eio@wait-immediate.html

Starting subtest: wait-immediate
(gem_eio:882) CRITICAL: Test assertion failure function check_wait, file ../tests/i915/gem_eio.c:258:
(gem_eio:882) CRITICAL: Failed assertion: elapsed < 250e6
(gem_eio:882) CRITICAL: Wake up following reset+wedge took 290.703ms
Subtest wait-immediate failed.
Comment 5 Joonas Lahtinen 2018-11-19 12:20:56 UTC
After some discussion, agreement is that the issue is actually an intermittent scheduling delay for the userspace process. So it is an issue, but not really high and could also be something outside of our scope. Changing importance to low for reconsideration.
Comment 6 Chris Wilson 2018-11-19 12:47:44 UTC
(In reply to Martin Peres from comment #4)
> https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_135/fi-bwr-2160/
> igt@gem_eio@wait-immediate.html
> 
> Starting subtest: wait-immediate
> (gem_eio:882) CRITICAL: Test assertion failure function check_wait, file
> ../tests/i915/gem_eio.c:258:
> (gem_eio:882) CRITICAL: Failed assertion: elapsed < 250e6
> (gem_eio:882) CRITICAL: Wake up following reset+wedge took 290.703ms
> Subtest wait-immediate failed.

Fwiw, bwr appears to be slowed down by display post-reset probing. Tempting to move that into its own thread, if it can't be dramatically sped up since the state is known.
Comment 7 Martin Peres 2018-11-19 13:35:50 UTC
(In reply to Chris Wilson from comment #6)
> (In reply to Martin Peres from comment #4)
> > https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_135/fi-bwr-2160/
> > igt@gem_eio@wait-immediate.html
> > 
> > Starting subtest: wait-immediate
> > (gem_eio:882) CRITICAL: Test assertion failure function check_wait, file
> > ../tests/i915/gem_eio.c:258:
> > (gem_eio:882) CRITICAL: Failed assertion: elapsed < 250e6
> > (gem_eio:882) CRITICAL: Wake up following reset+wedge took 290.703ms
> > Subtest wait-immediate failed.
> 
> Fwiw, bwr appears to be slowed down by display post-reset probing. Tempting
> to move that into its own thread, if it can't be dramatically sped up since
> the state is known.

Well, that could explain :)

(In reply to Joonas Lahtinen from comment #5)
> After some discussion, agreement is that the issue is actually an
> intermittent scheduling delay for the userspace process. So it is an issue,
> but not really high and could also be something outside of our scope.
> Changing importance to low for reconsideration.

Scheduling delay resulting in over 1s latency? That sounds odd and could easily be worked-around by changing the scheduling policy to realtime.

Anyway, right now, we suppress:
 - SNB BYT BSW GLK: igt@gem_eio@in-flight-*-(10ms|1us|immediate) - fail - Failed assertion: elapsed < 250e6
 - BWR: igt@gem_eio@wait-immediate - fail - Failed assertion: elapsed < 250e6

If you think the customer impact is really warranting a low, then go for it. Otherwise, bump the priority to what you thing is acceptable.
Comment 8 Martin Peres 2018-11-19 14:31:48 UTC
Danvet's take on this:

<danvet> so a few options seem reasonable:
<danvet> 1) fix test assumptions (if those are wrong)
<danvet> 2) fix test (maybe needs SCHED_RT)
<danvet> 3) fix env (maybe we need -rt kernels, but i915 is anything but -rt compliant)
<mupuf> 4) bump the GPU and CPU clocks if PM is slowing you down
<danvet> 5) fix kernel (not the first dead/livelock we've seen)
<danvet> 6) Adrinael's suggestion of moving the check around so we can have more coverage on other issues before we blow up on this
<danvet> 7) don't care
<danvet> re-running doesn't seem to be reasonable ever
<dolphin> yeah, because world is black and white
<dolphin> I'll go with the 7, and do something productive
<danvet> atm choice seems to be 7), which is not great if applied to igt at large
<danvet> because that's the tragedy of the commons problem
<danvet> it's always all the other people's 2% floppy tests :-)
Comment 9 Joonas Lahtinen 2018-11-19 14:35:56 UTC
Just to be explicit: The reason to assign a low priority for this task is that the potential user visible falout is just a longer stutter in a GPU hang scenario (where we've already hit an another issue that causes the hang).
Comment 10 Lakshmi 2018-11-20 12:40:15 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_144/fi-gdg-551/igt@gem_eio@wait-immediate.html
	
Starting subtest: wait-immediate
(gem_eio:1150) CRITICAL: Test assertion failure function check_wait, file ../tests/i915/gem_eio.c:258:
(gem_eio:1150) CRITICAL: Failed assertion: elapsed < 250e6
(gem_eio:1150) CRITICAL: Wake up following reset+wedge took 256.381ms
Subtest wait-immediate failed.
Comment 11 Martin Peres 2018-11-30 13:09:46 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4736/shard-glk9/igt@gem_eio@reset-stress.html

Starting subtest: reset-stress
(gem_eio:7946) CRITICAL: Test assertion failure function check_wait, file ../tests/i915/gem_eio.c:258:
(gem_eio:7946) CRITICAL: Failed assertion: elapsed < 250e6
(gem_eio:7946) CRITICAL: Wake up following reset+wedge took 445.988ms
Subtest reset-stress failed.

	
<6> [3804.883762] Console: switching to colour dummy device 80x25
<6> [3804.883866] [IGT] gem_eio: executing
<5> [3804.897459] Setting dangerous option reset - tainting kernel
<7> [3804.901938] [drm:i915_reset_device [i915]] resetting chip
<5> [3804.902142] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff
<5> [3804.914782] Setting dangerous option reset - tainting kernel
<6> [3804.926600] [IGT] gem_eio: starting subtest reset-stress
<5> [3804.936303] Setting dangerous option reset - tainting kernel
<7> [3805.045373] [drm:i915_reset_device [i915]] resetting chip
<5> [3805.045512] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff
<6> [3805.610883] [IGT] gem_eio: exiting, ret=99
<5> [3805.611436] Setting dangerous option reset - tainting kernel
<7> [3805.613946] [drm:i915_reset_device [i915]] resetting chip
<5> [3805.614067] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff
<6> [3805.642387] Console: switching to colour frame buffer device 240x67
Comment 12 Martin Peres 2018-12-20 14:58:49 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_156/fi-elk-e7500/igt@gem_eio@in-flight-10ms.html

Starting subtest: in-flight-10ms
(gem_eio:1254) CRITICAL: Test assertion failure function check_wait, file ../tests/i915/gem_eio.c:258:
(gem_eio:1254) CRITICAL: Failed assertion: elapsed < 250e6
(gem_eio:1254) CRITICAL: Wake up following reset+wedge took 421.463ms
Subtest in-flight-10ms failed.
Comment 13 CI Bug Log 2019-02-11 10:50:20 UTC
A CI Bug Log filter associated to this bug has been updated:

{- GDG BWR GLK: igt@gem_eio@wait-immediate|(reset|unwedge)-stress - fail - Failed assertion: elapsed &lt; 250e6 -}
{+ GDG BWR SNB GLK: igt@gem_eio@wait-immediate|(reset|unwedge)-stress - fail - Failed assertion: elapsed &lt; 250e6 +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_213/fi-snb-2600/igt@gem_eio@reset-stress.html
* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5579/shard-snb4/igt@gem_eio@unwedge-stress.html
* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5583/shard-snb5/igt@gem_eio@reset-stress.html
Comment 14 Chris Wilson 2019-02-15 15:41:44 UTC
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Feb 12 20:40:34 2019 +0000

    i915/gem_eio: Check average reset times
    
    As we have moved to rcu/srcu to serialise the resets, individual resets
    are subject to small variations in system grace periods. Allow for this
    by only expecting the median reset time to be within our target, thereby
    excluding noisy outliers from perturbing our results (but keep the
    maximum capped to prevent horrid failures!)
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Comment 15 CI Bug Log 2019-02-15 17:26:30 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* GLK: igt@gem_eio@wait-1us - fail - Failed assertion: elapsed &lt; 250e6
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4829/shard-glk9/igt@gem_eio@wait-1us.html
Comment 16 Martin Peres 2019-03-06 18:22:15 UTC
(In reply to Chris Wilson from comment #14)
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Tue Feb 12 20:40:34 2019 +0000
> 
>     i915/gem_eio: Check average reset times
>     
>     As we have moved to rcu/srcu to serialise the resets, individual resets
>     are subject to small variations in system grace periods. Allow for this
>     by only expecting the median reset time to be within our target, thereby
>     excluding noisy outliers from perturbing our results (but keep the
>     maximum capped to prevent horrid failures!)
>     
>     Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>     Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>

Looks pretty good! Closing!
Comment 17 CI Bug Log 2019-03-06 18:22:30 UTC
The CI Bug Log issue associated to this bug has been archived.

New failures matching the above filters will not be associated to this bug anymore.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.