Bug 107799 - [CI][SHARDS] igt@gem_eio@in-flight-*-(10ms|1us|immediate) - fail - Failed assertion: elapsed < 250e6
Summary: [CI][SHARDS] igt@gem_eio@in-flight-*-(10ms|1us|immediate) - fail - Failed ass...
Status: NEW
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel (show other bugs)
Version: XOrg git
Hardware: Other All
: low normal
Assignee: Francesco Balestrieri
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
Depends on:
Blocks:
 
Reported: 2018-09-03 12:08 UTC by Martin Peres
Modified: 2018-11-30 13:09 UTC (History)
1 user (show)

See Also:
i915 platform: BYT, GLK, I915G, I945GM
i915 features: GEM/Other


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Martin Peres 2018-09-03 12:08:08 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4749/shard-glk5/igt@gem_eio@in-flight-contexts-10ms.html

(gem_eio:2578) CRITICAL: Test assertion failure function check_wait, file ../tests/gem_eio.c:258:
(gem_eio:2578) CRITICAL: Failed assertion: elapsed < 250e6
(gem_eio:2578) CRITICAL: Wake up following reset+wedge took 1170.696ms
Subtest in-flight-contexts-10ms failed.
Comment 1 Martin Peres 2018-09-11 13:48:23 UTC
Also seen in pre-merge:

https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_1821/shard-glk8/igt@gem_eio@in-flight-contexts-immediate.html

(gem_eio:15447) CRITICAL: Test assertion failure function check_wait, file ../tests/gem_eio.c:258:
(gem_eio:15447) CRITICAL: Failed assertion: elapsed < 250e6
(gem_eio:15447) CRITICAL: Wake up following reset+wedge took 1658.497ms
Subtest in-flight-contexts-immediate failed.
Comment 2 Martin Peres 2018-09-20 09:17:19 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4647/shard-glk1/igt@gem_eio@in-flight-1us.html

Starting subtest: in-flight-1us
(gem_eio:1942) CRITICAL: Test assertion failure function check_wait, file ../tests/gem_eio.c:258:
(gem_eio:1942) CRITICAL: Failed assertion: elapsed < 250e6
(gem_eio:1942) CRITICAL: Wake up following reset+wedge took 2523.865ms
Subtest in-flight-1us failed.
Comment 3 Martin Peres 2018-09-21 14:43:36 UTC
Also seen on BYT: https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_115/fi-byt-n2820/igt@gem_eio@in-flight-1us.html

Starting subtest: in-flight-1us
(gem_eio:947) CRITICAL: Test assertion failure function check_wait, file ../tests/gem_eio.c:258:
(gem_eio:947) CRITICAL: Failed assertion: elapsed < 250e6
(gem_eio:947) CRITICAL: Wake up following reset+wedge took 2703.681ms
Subtest in-flight-1us failed.
Comment 4 Martin Peres 2018-10-30 16:29:19 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_135/fi-bwr-2160/igt@gem_eio@wait-immediate.html

Starting subtest: wait-immediate
(gem_eio:882) CRITICAL: Test assertion failure function check_wait, file ../tests/i915/gem_eio.c:258:
(gem_eio:882) CRITICAL: Failed assertion: elapsed < 250e6
(gem_eio:882) CRITICAL: Wake up following reset+wedge took 290.703ms
Subtest wait-immediate failed.
Comment 5 Joonas Lahtinen 2018-11-19 12:20:56 UTC
After some discussion, agreement is that the issue is actually an intermittent scheduling delay for the userspace process. So it is an issue, but not really high and could also be something outside of our scope. Changing importance to low for reconsideration.
Comment 6 Chris Wilson 2018-11-19 12:47:44 UTC
(In reply to Martin Peres from comment #4)
> https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_135/fi-bwr-2160/
> igt@gem_eio@wait-immediate.html
> 
> Starting subtest: wait-immediate
> (gem_eio:882) CRITICAL: Test assertion failure function check_wait, file
> ../tests/i915/gem_eio.c:258:
> (gem_eio:882) CRITICAL: Failed assertion: elapsed < 250e6
> (gem_eio:882) CRITICAL: Wake up following reset+wedge took 290.703ms
> Subtest wait-immediate failed.

Fwiw, bwr appears to be slowed down by display post-reset probing. Tempting to move that into its own thread, if it can't be dramatically sped up since the state is known.
Comment 7 Martin Peres 2018-11-19 13:35:50 UTC
(In reply to Chris Wilson from comment #6)
> (In reply to Martin Peres from comment #4)
> > https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_135/fi-bwr-2160/
> > igt@gem_eio@wait-immediate.html
> > 
> > Starting subtest: wait-immediate
> > (gem_eio:882) CRITICAL: Test assertion failure function check_wait, file
> > ../tests/i915/gem_eio.c:258:
> > (gem_eio:882) CRITICAL: Failed assertion: elapsed < 250e6
> > (gem_eio:882) CRITICAL: Wake up following reset+wedge took 290.703ms
> > Subtest wait-immediate failed.
> 
> Fwiw, bwr appears to be slowed down by display post-reset probing. Tempting
> to move that into its own thread, if it can't be dramatically sped up since
> the state is known.

Well, that could explain :)

(In reply to Joonas Lahtinen from comment #5)
> After some discussion, agreement is that the issue is actually an
> intermittent scheduling delay for the userspace process. So it is an issue,
> but not really high and could also be something outside of our scope.
> Changing importance to low for reconsideration.

Scheduling delay resulting in over 1s latency? That sounds odd and could easily be worked-around by changing the scheduling policy to realtime.

Anyway, right now, we suppress:
 - SNB BYT BSW GLK: igt@gem_eio@in-flight-*-(10ms|1us|immediate) - fail - Failed assertion: elapsed < 250e6
 - BWR: igt@gem_eio@wait-immediate - fail - Failed assertion: elapsed < 250e6

If you think the customer impact is really warranting a low, then go for it. Otherwise, bump the priority to what you thing is acceptable.
Comment 8 Martin Peres 2018-11-19 14:31:48 UTC
Danvet's take on this:

<danvet> so a few options seem reasonable:
<danvet> 1) fix test assumptions (if those are wrong)
<danvet> 2) fix test (maybe needs SCHED_RT)
<danvet> 3) fix env (maybe we need -rt kernels, but i915 is anything but -rt compliant)
<mupuf> 4) bump the GPU and CPU clocks if PM is slowing you down
<danvet> 5) fix kernel (not the first dead/livelock we've seen)
<danvet> 6) Adrinael's suggestion of moving the check around so we can have more coverage on other issues before we blow up on this
<danvet> 7) don't care
<danvet> re-running doesn't seem to be reasonable ever
<dolphin> yeah, because world is black and white
<dolphin> I'll go with the 7, and do something productive
<danvet> atm choice seems to be 7), which is not great if applied to igt at large
<danvet> because that's the tragedy of the commons problem
<danvet> it's always all the other people's 2% floppy tests :-)
Comment 9 Joonas Lahtinen 2018-11-19 14:35:56 UTC
Just to be explicit: The reason to assign a low priority for this task is that the potential user visible falout is just a longer stutter in a GPU hang scenario (where we've already hit an another issue that causes the hang).
Comment 10 Lakshmi 2018-11-20 12:40:15 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_144/fi-gdg-551/igt@gem_eio@wait-immediate.html
	
Starting subtest: wait-immediate
(gem_eio:1150) CRITICAL: Test assertion failure function check_wait, file ../tests/i915/gem_eio.c:258:
(gem_eio:1150) CRITICAL: Failed assertion: elapsed < 250e6
(gem_eio:1150) CRITICAL: Wake up following reset+wedge took 256.381ms
Subtest wait-immediate failed.
Comment 11 Martin Peres 2018-11-30 13:09:46 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4736/shard-glk9/igt@gem_eio@reset-stress.html

Starting subtest: reset-stress
(gem_eio:7946) CRITICAL: Test assertion failure function check_wait, file ../tests/i915/gem_eio.c:258:
(gem_eio:7946) CRITICAL: Failed assertion: elapsed < 250e6
(gem_eio:7946) CRITICAL: Wake up following reset+wedge took 445.988ms
Subtest reset-stress failed.

	
<6> [3804.883762] Console: switching to colour dummy device 80x25
<6> [3804.883866] [IGT] gem_eio: executing
<5> [3804.897459] Setting dangerous option reset - tainting kernel
<7> [3804.901938] [drm:i915_reset_device [i915]] resetting chip
<5> [3804.902142] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff
<5> [3804.914782] Setting dangerous option reset - tainting kernel
<6> [3804.926600] [IGT] gem_eio: starting subtest reset-stress
<5> [3804.936303] Setting dangerous option reset - tainting kernel
<7> [3805.045373] [drm:i915_reset_device [i915]] resetting chip
<5> [3805.045512] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff
<6> [3805.610883] [IGT] gem_eio: exiting, ret=99
<5> [3805.611436] Setting dangerous option reset - tainting kernel
<7> [3805.613946] [drm:i915_reset_device [i915]] resetting chip
<5> [3805.614067] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff
<6> [3805.642387] Console: switching to colour frame buffer device 240x67


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.