Bug 109661 - [CI][SHARDS] igt@gem_eio@unwedge-stress - fail - Failed assertion: med < limit && max < 5 * limit
Summary: [CI][SHARDS] igt@gem_eio@unwedge-stress - fail - Failed assertion: med < limi...
Status: REOPENED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: XOrg git
Hardware: Other All
Importance: high normal
Assignee: Chris Wilson
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
Depends on:
Blocks:
 
Reported: 2019-02-18 10:02 UTC by Martin Peres
Modified: 2019-07-15 08:26 UTC
CC List: 1 user

See Also:
i915 platform: BYT, GLK, ICL, SNB
i915 features: GEM/Other


Attachments
attachment-13473-0.html (36.68 KB, text/html)
2019-06-25 06:47 UTC, Shuang He

Description Martin Peres 2019-02-18 10:02:29 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5614/shard-snb7/igt@gem_eio@unwedge-stress.html

Starting subtest: unwedge-stress
(gem_eio:1403) CRITICAL: Test assertion failure function check_wait_elapsed, file ../tests/i915/gem_eio.c:292:
(gem_eio:1403) CRITICAL: Failed assertion: med < limit && max < 5 * limit
(gem_eio:1403) CRITICAL: Wake up following reset+wedge took 187.662+-491.413ms (min:8.917ms, median:22.893ms, max:1810.883ms); limit set to 250ms on average and 1250ms maximum
Subtest unwedge-stress failed.
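
To make the failing criterion concrete, here is a minimal Python sketch of the check `med < limit && max < 5 * limit`. It is not the IGT code from `gem_eio.c`: the function below is a hypothetical stand-in that only shares the name from the log, and the sample list is an assumption built from the three values the log reports (min, median, max), since the full set of measurements is not shown here.

```python
from statistics import median


def check_wait_elapsed(samples_ms, limit_ms):
    """Evaluate the criterion from the log: med < limit and max < 5 * limit.

    Hypothetical stand-in for the IGT helper of the same name.
    Returns (ok, med, worst) so the caller can see which bound was hit.
    """
    med = median(samples_ms)
    worst = max(samples_ms)
    ok = med < limit_ms and worst < 5 * limit_ms
    return ok, med, worst


# Illustrative samples only: the min, median and max reported in the log.
reported = [8.917, 22.893, 1810.883]
ok, med, worst = check_wait_elapsed(reported, limit_ms=250.0)
print(ok, med, worst)
```

With these values the median is well under the 250ms limit, and the check fails only because the maximum exceeds 5 * 250ms = 1250ms.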
Comment 1 CI Bug Log 2019-02-18 10:03:17 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* SNB: igt@gem_eio@unwedge-stress - fail - Failed assertion: med < limit && max < 5 * limit
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5614/shard-snb7/igt@gem_eio@unwedge-stress.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5615/shard-snb7/igt@gem_eio@unwedge-stress.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5622/shard-snb1/igt@gem_eio@unwedge-stress.html
Comment 2 Chris Wilson 2019-02-18 12:37:22 UTC
It exceeded 3s in some runs. Gah.

https://patchwork.freedesktop.org/patch/286706/ is my hope.
Comment 3 Chris Wilson 2019-02-19 14:49:44 UTC
Fingers crossed once again,

commit 8f54b3c6c921275d10e33746553c40294ffa0d58
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Feb 19 12:21:57 2019 +0000

    drm/i915: Trim delays for wedging
    
    CI still reports the occasional multi-second delay for resets, in
    particular along the wedge+recovery paths. As the likely, and unbounded,
    delay here is from sync_rcu, use the expedited variant instead.
    
    Testcase: igt/gem_eio/unwedge-stress
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@intel.com>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20190219122215.8941-7-chris@chris-wilson.co.uk
Comment 4 CI Bug Log 2019-02-20 12:21:03 UTC
A CI Bug Log filter associated to this bug has been updated:

{- SNB: igt@gem_eio@unwedge-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4838/shard-snb5/igt@gem_eio@reset-stress.html
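
The widened filter pattern explains why reset-stress results now land on this bug. A small Python sketch of the matching follows; how CI Bug Log actually applies the pattern is not described here, so the use of `fullmatch` and the helper name `matches_filter` are assumptions, and `other-subtest` is a made-up name used only as a negative example.

```python
import re

# Test-name pattern copied from the updated filter.
FILTER = re.compile(r"igt@gem_eio@(reset|unwedge)-stress")


def matches_filter(test_name):
    """Return True if the whole test name matches the filter pattern."""
    return FILTER.fullmatch(test_name) is not None


for name in (
    "igt@gem_eio@unwedge-stress",
    "igt@gem_eio@reset-stress",
    "igt@gem_eio@other-subtest",  # hypothetical name, should not match
):
    print(name, matches_filter(name))
```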
Comment 5 Chris Wilson 2019-02-20 12:24:21 UTC
Now that's just cruel, having supplied a patch specifically for the unwedge-stress subtest, you cross-pollute it with reset-stress!

Not that it'll make much difference, but there is quite a difference in driver paths between the two subtests.
Comment 6 Martin Peres 2019-03-06 15:31:51 UTC
(In reply to Chris Wilson from comment #5)
> Now that's just cruel, having supplied a patch specifically for the
> unwedge-stress subtest, you cross-pollute it with reset-stress!
> 
> Not that it'll make much difference, but there is quite a difference in
> driver paths between the two subtests.

Sorry about that! However, unwedge-stress is still failing:
 - https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4855/shard-snb5/igt@gem_eio@unwedge-stress.html
 - https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4858/shard-snb4/igt@gem_eio@unwedge-stress.html
 - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5671/shard-snb4/igt@gem_eio@unwedge-stress.html
 - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5672/shard-snb2/igt@gem_eio@unwedge-stress.html


If the fix for these failures does not also fix the reset-stress failures, we'll file a new bug!
Comment 7 Chris Wilson 2019-03-06 15:39:14 UTC
We're just at the mercy of an unbounded wait. We're using synchronize_rcu_expedited() everywhere we can here and still we get delayed. I'm tempted to remove the failure on the max timeout reaching several seconds, so long as the median is reasonable (all the limits are arbitrary anyway).
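
As a rough illustration of that relaxation (a sketch only, not a patch to the test), a median-only criterion could look like the following in Python. The function name `median_only_check` is hypothetical, and the sample values are the min, median and max from the original report rather than a full set of measurements.

```python
from statistics import median


def median_only_check(samples_ms, limit_ms):
    """Pass if the median wake-up time is under the limit, ignoring the max."""
    return median(samples_ms) < limit_ms


# Values taken from the original failure report (min, median, max in ms).
reported = [8.917, 22.893, 1810.883]
print(median_only_check(reported, limit_ms=250.0))
```

Under this criterion a single multi-second outlier no longer fails the subtest, which also means a genuine regression in worst-case latency would go unnoticed.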
Comment 8 CI Bug Log 2019-04-18 07:24:06 UTC
A CI Bug Log filter associated to this bug has been updated:

{- SNB: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB GLK: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5947/shard-glk1/igt@gem_eio@unwedge-stress.html

  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5949/shard-glk8/igt@gem_eio@unwedge-stress.html

  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5950/shard-glk8/igt@gem_eio@unwedge-stress.html

  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5951/shard-glk5/igt@gem_eio@unwedge-stress.html
Comment 9 Lakshmi 2019-04-18 07:24:36 UTC
(In reply to CI Bug Log from comment #8)
> A CI Bug Log filter associated to this bug has been updated:
> 
> {- SNB: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
> {+ SNB GLK: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}
> 
> New failures caught by the filter:
> 
>   * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5947/shard-glk1/igt@gem_eio@unwedge-stress.html
> 
>   * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5949/shard-glk8/igt@gem_eio@unwedge-stress.html
> 
>   * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5950/shard-glk8/igt@gem_eio@unwedge-stress.html
> 
>   * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5951/shard-glk5/igt@gem_eio@unwedge-stress.html

Also seen on GLK.
Comment 10 CI Bug Log 2019-04-25 21:16:03 UTC
A CI Bug Log filter associated to this bug has been updated:

{- SNB GLK: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB BYT GLK: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_258/fi-byt-n2820/igt@gem_eio@unwedge-stress.html
Comment 11 Chris Wilson 2019-04-27 09:11:30 UTC
It looks like it was the reset worker feeding in the restart request that dragged us down.

commit 79ffac8599c4d8aa84d313920d3d86d7361c252b
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Apr 24 21:07:17 2019 +0100

    drm/i915: Invert the GEM wakeref hierarchy
Comment 12 CI Bug Log 2019-05-06 14:12:21 UTC
A CI Bug Log filter associated to this bug has been updated:

{- SNB BYT GLK: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB BYT SKL GLK: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6043/shard-skl5/igt@gem_eio@reset-stress.html
Comment 13 CI Bug Log 2019-05-27 13:18:41 UTC
A CI Bug Log filter associated to this bug has been updated:

{- SNB BYT SKL GLK: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB BYT SKL GLK ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_293/fi-icl-u3/igt@gem_eio@unwedge-stress.html
Comment 14 Lakshmi 2019-05-27 13:19:30 UTC
(In reply to CI Bug Log from comment #13)
> A CI Bug Log filter associated to this bug has been updated:
> 
> {- SNB BYT SKL GLK: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
> {+ SNB BYT SKL GLK ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}
> 
> New failures caught by the filter:
> 
>   * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_293/fi-icl-u3/igt@gem_eio@unwedge-stress.html

Reopened this bug as this failure happened on ICL.
Comment 15 CI Bug Log 2019-06-25 06:46:41 UTC
A CI Bug Log filter associated to this bug has been updated:

{- SNB BYT SKL GLK ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB BYT SKL GLK ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}


  No new failures caught with the new filter
Comment 16 Shuang He 2019-06-25 06:47:13 UTC
Created attachment 144631 [details]
attachment-13473-0.html

Dear sender,
I am on leave during ww26.2. Please call my cell phone if it is urgent. Sorry for any inconvenience this may cause.
Comment 17 Chris Wilson 2019-07-05 11:31:10 UTC
For reference,

commit f0e39642f6f8da5406627bfa79c6600df949e203 (upstream/master, origin/master, origin/HEAD)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Jul 2 12:40:45 2019 +0100

    i915/gem_eio: Assert the hanging request is correctly identified
    
    When forcing a reset, it is crucial that the kernel correctly identifies
    the injected hang. Verify this is the case for reset-stress.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>

One hypothesis is that we are not resetting the guilty request and so hitting a hangcheck instead.
Comment 18 CI Bug Log 2019-07-08 09:06:57 UTC
A CI Bug Log filter associated to this bug has been updated:

{- SNB BYT SKL GLK ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB BYT SKL APL GLK ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/IGT_5088/shard-apl8/igt@gem_eio@reset-stress.html
Comment 19 Chris Wilson 2019-07-09 15:04:01 UTC
<7> [944.138584] [IGT] Forcing GPU reset
<7> [944.138848] [drm:i915_reset_device [i915]] resetting chip
<5> [944.138957] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff
<7> [944.139197] [IGT] Checking that the GPU recovered
<5> [944.162438] Setting dangerous option reset - tainting kernel
<7> [944.275166] [drm:i915_reset_device [i915]] resetting chip
<5> [944.276899] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff
<5> [944.277178] Setting dangerous option reset - tainting kernel
<7> [944.277284] [IGT] Forcing GPU reset
<7> [944.277557] [drm:i915_reset_device [i915]] resetting chip
<5> [944.278273] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff
<7> [944.278579] [IGT] Checking that the GPU recovered
<5> [944.302432] Setting dangerous option reset - tainting kernel
<7> [946.381889] [drm:i915_reset_device [i915]] resetting chip
<5> [946.382011] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff
<5> [946.382270] Setting dangerous option reset - tainting kernel
<7> [946.382345] [IGT] Forcing GPU reset
<7> [946.382557] [drm:i915_reset_device [i915]] resetting chip
<5> [946.383318] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff
<7> [946.383621] [IGT] Checking that the GPU recovered
<6> [946.475026] [IGT] gem_eio: exiting, ret=98

This confirms that we normally expect quick reset+recovery cycles (with a reset period of 100ms between iterations). It also tells us that the delay occurs before i915_reset_device (although we could do with drm.debug=7 to be sure), i.e. in the preamble in i915_handle_error(). Of note, the only thing there is synchronize_rcu_expedited(). :|
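
For anyone wanting to locate such a stall mechanically, the following Python sketch scans dmesg-style lines and reports the largest jump between consecutive timestamps. The regular expression assumes the `<level> [seconds] message` layout seen in the excerpt above, `largest_gap` is a hypothetical helper name, and the input is a four-line subset of that excerpt.

```python
import re

# Assumed layout: "<level> [timestamp] message", as in the excerpt above.
LINE = re.compile(r"<(\d)> \[(\d+\.\d+)\] (.*)")


def largest_gap(log_lines):
    """Return (gap_seconds, message_before, message_after) for the biggest jump.

    Lines that do not match the assumed layout are skipped.
    Returns None if fewer than two lines could be parsed.
    """
    events = []
    for line in log_lines:
        m = LINE.match(line.strip())
        if m:
            events.append((float(m.group(2)), m.group(3)))
    best = None
    for (t0, msg0), (t1, msg1) in zip(events, events[1:]):
        gap = t1 - t0
        if best is None or gap > best[0]:
            best = (gap, msg0, msg1)
    return best


excerpt = [
    "<7> [944.278579] [IGT] Checking that the GPU recovered",
    "<5> [944.302432] Setting dangerous option reset - tainting kernel",
    "<7> [946.381889] [drm:i915_reset_device [i915]] resetting chip",
    "<5> [946.382011] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff",
]
gap, before, after = largest_gap(excerpt)
print(f"{gap:.3f}s between {before!r} and {after!r}")
```

On this subset the largest gap is roughly 2.08s, sitting between the "Setting dangerous option reset" line and the following "resetting chip" line, consistent with the delay occurring before i915_reset_device.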
Comment 20 CI Bug Log 2019-07-15 08:26:46 UTC
A CI Bug Log filter associated to this bug has been updated:

{- SNB BYT SKL APL GLK ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB BYT SKL APL GLK CML ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6470/re-cml-u/igt@gem_eio@reset-stress.html

