Summary: | [CI][SHARDS] igt@gem_eio@in-flight-*-(10ms|1us|immediate) - fail - Failed assertion: elapsed < 250e6 | ||
---|---|---|---|
Product: | DRI | Reporter: | Martin Peres <martin.peres> |
Component: | DRM/Intel | Assignee: | Francesco Balestrieri <francesco.balestrieri> |
Status: | CLOSED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Severity: | normal | ||
Priority: | low | CC: | intel-gfx-bugs |
Version: | XOrg git | ||
Hardware: | Other | ||
OS: | All | ||
Whiteboard: | ReadyForDev | ||
i915 platform: | BSW/CHT, BYT, G45, GLK, I915G, I945GM, I965G, SNB | i915 features: | GEM/Other |
Description
Martin Peres
2018-09-03 12:08:08 UTC
Also seen in pre-merge: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_1821/shard-glk8/igt@gem_eio@in-flight-contexts-immediate.html (gem_eio:15447) CRITICAL: Test assertion failure function check_wait, file ../tests/gem_eio.c:258: (gem_eio:15447) CRITICAL: Failed assertion: elapsed < 250e6 (gem_eio:15447) CRITICAL: Wake up following reset+wedge took 1658.497ms Subtest in-flight-contexts-immediate failed. https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4647/shard-glk1/igt@gem_eio@in-flight-1us.html Starting subtest: in-flight-1us (gem_eio:1942) CRITICAL: Test assertion failure function check_wait, file ../tests/gem_eio.c:258: (gem_eio:1942) CRITICAL: Failed assertion: elapsed < 250e6 (gem_eio:1942) CRITICAL: Wake up following reset+wedge took 2523.865ms Subtest in-flight-1us failed. Also seen on BYT: https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_115/fi-byt-n2820/igt@gem_eio@in-flight-1us.html Starting subtest: in-flight-1us (gem_eio:947) CRITICAL: Test assertion failure function check_wait, file ../tests/gem_eio.c:258: (gem_eio:947) CRITICAL: Failed assertion: elapsed < 250e6 (gem_eio:947) CRITICAL: Wake up following reset+wedge took 2703.681ms Subtest in-flight-1us failed. https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_135/fi-bwr-2160/igt@gem_eio@wait-immediate.html Starting subtest: wait-immediate (gem_eio:882) CRITICAL: Test assertion failure function check_wait, file ../tests/i915/gem_eio.c:258: (gem_eio:882) CRITICAL: Failed assertion: elapsed < 250e6 (gem_eio:882) CRITICAL: Wake up following reset+wedge took 290.703ms Subtest wait-immediate failed. After some discussion, agreement is that the issue is actually an intermittent scheduling delay for the userspace process. So it is an issue, but not really high and could also be something outside of our scope. Changing importance to low for reconsideration. (In reply to Martin Peres from comment #4) > https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_135/fi-bwr-2160/ > igt@gem_eio@wait-immediate.html > > Starting subtest: wait-immediate > (gem_eio:882) CRITICAL: Test assertion failure function check_wait, file > ../tests/i915/gem_eio.c:258: > (gem_eio:882) CRITICAL: Failed assertion: elapsed < 250e6 > (gem_eio:882) CRITICAL: Wake up following reset+wedge took 290.703ms > Subtest wait-immediate failed. Fwiw, bwr appears to be slowed down by display post-reset probing. Tempting to move that into its own thread, if it can't be dramatically sped up since the state is known. (In reply to Chris Wilson from comment #6) > (In reply to Martin Peres from comment #4) > > https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_135/fi-bwr-2160/ > > igt@gem_eio@wait-immediate.html > > > > Starting subtest: wait-immediate > > (gem_eio:882) CRITICAL: Test assertion failure function check_wait, file > > ../tests/i915/gem_eio.c:258: > > (gem_eio:882) CRITICAL: Failed assertion: elapsed < 250e6 > > (gem_eio:882) CRITICAL: Wake up following reset+wedge took 290.703ms > > Subtest wait-immediate failed. > > Fwiw, bwr appears to be slowed down by display post-reset probing. Tempting > to move that into its own thread, if it can't be dramatically sped up since > the state is known. Well, that could explain :) (In reply to Joonas Lahtinen from comment #5) > After some discussion, agreement is that the issue is actually an > intermittent scheduling delay for the userspace process. So it is an issue, > but not really high and could also be something outside of our scope. > Changing importance to low for reconsideration. Scheduling delay resulting in over 1s latency? That sounds odd and could easily be worked-around by changing the scheduling policy to realtime. Anyway, right now, we suppress: - SNB BYT BSW GLK: igt@gem_eio@in-flight-*-(10ms|1us|immediate) - fail - Failed assertion: elapsed < 250e6 - BWR: igt@gem_eio@wait-immediate - fail - Failed assertion: elapsed < 250e6 If you think the customer impact is really warranting a low, then go for it. Otherwise, bump the priority to what you thing is acceptable. Danvet's take on this: <danvet> so a few options seem reasonable: <danvet> 1) fix test assumptions (if those are wrong) <danvet> 2) fix test (maybe needs SCHED_RT) <danvet> 3) fix env (maybe we need -rt kernels, but i915 is anything but -rt compliant) <mupuf> 4) bump the GPU and CPU clocks if PM is slowing you down <danvet> 5) fix kernel (not the first dead/livelock we've seen) <danvet> 6) Adrinael's suggestion of moving the check around so we can have more coverage on other issues before we blow up on this <danvet> 7) don't care <danvet> re-running doesn't seem to be reasonable ever <dolphin> yeah, because world is black and white <dolphin> I'll go with the 7, and do something productive <danvet> atm choice seems to be 7), which is not great if applied to igt at large <danvet> because that's the tragedy of the commons problem <danvet> it's always all the other people's 2% floppy tests :-) Just to be explicit: The reason to assign a low priority for this task is that the potential user visible falout is just a longer stutter in a GPU hang scenario (where we've already hit an another issue that causes the hang). https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_144/fi-gdg-551/igt@gem_eio@wait-immediate.html Starting subtest: wait-immediate (gem_eio:1150) CRITICAL: Test assertion failure function check_wait, file ../tests/i915/gem_eio.c:258: (gem_eio:1150) CRITICAL: Failed assertion: elapsed < 250e6 (gem_eio:1150) CRITICAL: Wake up following reset+wedge took 256.381ms Subtest wait-immediate failed. https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4736/shard-glk9/igt@gem_eio@reset-stress.html Starting subtest: reset-stress (gem_eio:7946) CRITICAL: Test assertion failure function check_wait, file ../tests/i915/gem_eio.c:258: (gem_eio:7946) CRITICAL: Failed assertion: elapsed < 250e6 (gem_eio:7946) CRITICAL: Wake up following reset+wedge took 445.988ms Subtest reset-stress failed. <6> [3804.883762] Console: switching to colour dummy device 80x25 <6> [3804.883866] [IGT] gem_eio: executing <5> [3804.897459] Setting dangerous option reset - tainting kernel <7> [3804.901938] [drm:i915_reset_device [i915]] resetting chip <5> [3804.902142] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff <5> [3804.914782] Setting dangerous option reset - tainting kernel <6> [3804.926600] [IGT] gem_eio: starting subtest reset-stress <5> [3804.936303] Setting dangerous option reset - tainting kernel <7> [3805.045373] [drm:i915_reset_device [i915]] resetting chip <5> [3805.045512] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff <6> [3805.610883] [IGT] gem_eio: exiting, ret=99 <5> [3805.611436] Setting dangerous option reset - tainting kernel <7> [3805.613946] [drm:i915_reset_device [i915]] resetting chip <5> [3805.614067] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff <6> [3805.642387] Console: switching to colour frame buffer device 240x67 https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_156/fi-elk-e7500/igt@gem_eio@in-flight-10ms.html Starting subtest: in-flight-10ms (gem_eio:1254) CRITICAL: Test assertion failure function check_wait, file ../tests/i915/gem_eio.c:258: (gem_eio:1254) CRITICAL: Failed assertion: elapsed < 250e6 (gem_eio:1254) CRITICAL: Wake up following reset+wedge took 421.463ms Subtest in-flight-10ms failed. A CI Bug Log filter associated to this bug has been updated: {- GDG BWR GLK: igt@gem_eio@wait-immediate|(reset|unwedge)-stress - fail - Failed assertion: elapsed < 250e6 -} {+ GDG BWR SNB GLK: igt@gem_eio@wait-immediate|(reset|unwedge)-stress - fail - Failed assertion: elapsed < 250e6 +} New failures caught by the filter: * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_213/fi-snb-2600/igt@gem_eio@reset-stress.html * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5579/shard-snb4/igt@gem_eio@unwedge-stress.html * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5583/shard-snb5/igt@gem_eio@reset-stress.html Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Tue Feb 12 20:40:34 2019 +0000 i915/gem_eio: Check average reset times As we have moved to rcu/srcu to serialise the resets, individual resets are subject to small variations in system grace periods. Allow for this by only expecting the median reset time to be within our target, thereby excluding noisy outliers from perturbing our results (but keep the maximum capped to prevent horrid failures!) Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> The CI Bug Log issue associated to this bug has been updated. ### New filters associated * GLK: igt@gem_eio@wait-1us - fail - Failed assertion: elapsed < 250e6 - https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4829/shard-glk9/igt@gem_eio@wait-1us.html (In reply to Chris Wilson from comment #14) > Author: Chris Wilson <chris@chris-wilson.co.uk> > Date: Tue Feb 12 20:40:34 2019 +0000 > > i915/gem_eio: Check average reset times > > As we have moved to rcu/srcu to serialise the resets, individual resets > are subject to small variations in system grace periods. Allow for this > by only expecting the median reset time to be within our target, thereby > excluding noisy outliers from perturbing our results (but keep the > maximum capped to prevent horrid failures!) > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> > Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> Looks pretty good! Closing! The CI Bug Log issue associated to this bug has been archived. New failures matching the above filters will not be associated to this bug anymore. |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.