Bug 109661

Summary:

[CI][SHARDS] igt@gem_eio@unwedge-stress - fail - Failed assertion: med < limit && max < 5 * limit

Product:

DRI

Reporter:

Martin Peres <martin.peres>

Component:

DRM/Intel

Assignee:

Chris Wilson <chris>

Status:

RESOLVED MOVED

QA Contact:

Intel GFX Bugs mailing list <intel-gfx-bugs>

Severity:

normal

Priority:

high

CC:

intel-gfx-bugs

Version:

XOrg git

Hardware:

Other

OS:

All

Whiteboard:

ReadyForDev

i915 platform:

SNB

i915 features:

GEM/Other

Attachments:

Description	Flags
attachment-13473-0.html	none

Description Martin Peres 2019-02-18 10:02:29 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5614/shard-snb7/igt@gem_eio@unwedge-stress.html

Starting subtest: unwedge-stress
(gem_eio:1403) CRITICAL: Test assertion failure function check_wait_elapsed, file ../tests/i915/gem_eio.c:292:
(gem_eio:1403) CRITICAL: Failed assertion: med < limit && max < 5 * limit
(gem_eio:1403) CRITICAL: Wake up following reset+wedge took 187.662+-491.413ms (min:8.917ms, median:22.893ms, max:1810.883ms); limit set to 250ms on average and 1250ms maximum
Subtest unwedge-stress failed.
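
For illustration, the failed assertion can be replayed with the numbers from the log above. This is a minimal sketch, not the code in tests/i915/gem_eio.c: the helper name wait_within_limits() and the use of milliseconds as the unit are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper mirroring the assertion text
 * "med < limit && max < 5 * limit". All values are in milliseconds.
 */
static bool wait_within_limits(double med, double max, double limit)
{
	return med < limit && max < 5 * limit;
}

/* Values reported in the log: median 22.893ms, max 1810.883ms, limit 250ms. */
static bool logged_run_passes(void)
{
	return wait_within_limits(22.893, 1810.883, 250.0);
}
```

With these inputs the median check passes (22.893ms < 250ms), so it is the maximum (1810.883ms against 5 * 250ms = 1250ms) that trips the assertion.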

Comment 1 CI Bug Log 2019-02-18 10:03:17 UTC

The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* SNB: igt@gem_eio@unwedge-stress - fail - Failed assertion: med < limit && max < 5 * limit
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5614/shard-snb7/igt@gem_eio@unwedge-stress.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5615/shard-snb7/igt@gem_eio@unwedge-stress.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5622/shard-snb1/igt@gem_eio@unwedge-stress.html

Comment 2 Chris Wilson 2019-02-18 12:37:22 UTC

It exceeded 3s in some runs. Gah.

https://patchwork.freedesktop.org/patch/286706/ is my hope.

Comment 3 Chris Wilson 2019-02-19 14:49:44 UTC

Fingers crossed once again,

commit 8f54b3c6c921275d10e33746553c40294ffa0d58
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Feb 19 12:21:57 2019 +0000

    drm/i915: Trim delays for wedging
    
    CI still reports the occasional multi-second delay for resets, in
    particular along the wedge+recovery paths. As the likely, and unbounded,
    delay here is from sync_rcu, use the expedited variant instead.
    
    Testcase: igt/gem_eio/unwedge-stress
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@intel.com>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20190219122215.8941-7-chris@chris-wilson.co.uk

Comment 4 CI Bug Log 2019-02-20 12:21:03 UTC

A CI Bug Log filter associated to this bug has been updated:

{- SNB: igt@gem_eio@unwedge-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4838/shard-snb5/igt@gem_eio@reset-stress.html
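
The widened filter can be checked against test names with a small sketch. It assumes the filter is an extended regular expression searched within the test name; how CI Bug Log actually evaluates its filters is not stated in this thread, and filter_matches() is a hypothetical helper built on POSIX regcomp()/regexec().

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns true if the POSIX extended regex
 * `pattern` matches anywhere inside `test_name`.
 */
static bool filter_matches(const char *pattern, const char *test_name)
{
	regex_t re;
	bool hit;

	if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
		return false; /* treat an invalid pattern as "no match" */

	hit = regexec(&re, test_name, 0, NULL, 0) == 0;
	regfree(&re);
	return hit;
}
```

Under that assumption, the pattern igt@gem_eio@(reset|unwedge)-stress covers both igt@gem_eio@reset-stress and igt@gem_eio@unwedge-stress, which is why the reset-stress failure was folded into this bug.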

Comment 5 Chris Wilson 2019-02-20 12:24:21 UTC

Now that's just cruel, having supplied a patch specifically for the unwedge-stress subtest, you cross-pollute it with reset-stress!

Not that it'll make much difference, but there is quite a difference in driver paths between the two subtests.

Comment 6 Martin Peres 2019-03-06 15:31:51 UTC

(In reply to Chris Wilson from comment #5)
> Now that's just cruel, having supplied a patch specifically for the
> unwedge-stress subtest, you cross-pollute it with reset-stress!
> 
> Not that it'll make much difference, but there is quite a difference in
> driver paths between the two subtests.

Sorry about that! However, unwedge-stress is still failing:
 - https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4855/shard-snb5/igt@gem_eio@unwedge-stress.html
 - https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4858/shard-snb4/igt@gem_eio@unwedge-stress.html
 - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5671/shard-snb4/igt@gem_eio@unwedge-stress.html
 - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5672/shard-snb2/igt@gem_eio@unwedge-stress.html


If the fix for these issues is not fixing the reset-stress issues, we'll create a new bug!

Comment 7 Chris Wilson 2019-03-06 15:39:14 UTC

We're just at the mercy of an unbounded wait. We're using synchronize_rcu_expedited() everywhere we can here and still we get delayed. I'm tempted to remove the failure for the max timeout being several seconds, so long as the median is reasonable (all the limits are arbitrary anyway).
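
A minimal sketch of what such a relaxed check could look like, assuming the same med/max/limit inputs as the current assertion. The enum and the function name check_wait_relaxed() are hypothetical and are not taken from gem_eio.c; the idea is only that a slow maximum is downgraded from a failure to a warning.

```c
#include <assert.h>

/* Hypothetical verdicts for the relaxed check. */
enum wait_verdict { WAIT_PASS, WAIT_WARN, WAIT_FAIL };

/*
 * Fail only when the median exceeds the limit; report an excessive
 * maximum (>= 5 * limit) as a warning instead of a failure.
 * All values share one unit (e.g. milliseconds).
 */
static enum wait_verdict check_wait_relaxed(double med, double max, double limit)
{
	if (!(med < limit))
		return WAIT_FAIL;
	if (!(max < 5 * limit))
		return WAIT_WARN;
	return WAIT_PASS;
}
```

With the numbers from the original report (median 22.893ms, max 1810.883ms, limit 250ms) this variant would return WAIT_WARN rather than fail the subtest.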

Comment 8 CI Bug Log 2019-04-18 07:24:06 UTC

A CI Bug Log filter associated to this bug has been updated:

{- SNB: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB GLK: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5947/shard-glk1/igt@gem_eio@unwedge-stress.html

  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5949/shard-glk8/igt@gem_eio@unwedge-stress.html

  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5950/shard-glk8/igt@gem_eio@unwedge-stress.html

  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5951/shard-glk5/igt@gem_eio@unwedge-stress.html

Comment 9 Lakshmi 2019-04-18 07:24:36 UTC

(In reply to CI Bug Log from comment #8)
> A CI Bug Log filter associated to this bug has been updated:
> 
> {- SNB: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
> {+ SNB GLK: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}
> 
> New failures caught by the filter:
> 
>   * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5947/shard-glk1/igt@gem_eio@unwedge-stress.html
> 
>   * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5949/shard-glk8/igt@gem_eio@unwedge-stress.html
> 
>   * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5950/shard-glk8/igt@gem_eio@unwedge-stress.html
> 
>   * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5951/shard-glk5/igt@gem_eio@unwedge-stress.html

Also seen on GLK.

Comment 10 CI Bug Log 2019-04-25 21:16:03 UTC

A CI Bug Log filter associated to this bug has been updated:

{- SNB GLK: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB BYT GLK: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_258/fi-byt-n2820/igt@gem_eio@unwedge-stress.html

Comment 11 Chris Wilson 2019-04-27 09:11:30 UTC

It looks like it was the reset worker feeding in the restart request that dragged us down.

commit 79ffac8599c4d8aa84d313920d3d86d7361c252b
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Apr 24 21:07:17 2019 +0100

    drm/i915: Invert the GEM wakeref hierarchy

Comment 12 CI Bug Log 2019-05-06 14:12:21 UTC

A CI Bug Log filter associated to this bug has been updated:

{- SNB BYT GLK: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB BYT SKL GLK: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6043/shard-skl5/igt@gem_eio@reset-stress.html

Comment 13 CI Bug Log 2019-05-27 13:18:41 UTC

A CI Bug Log filter associated to this bug has been updated:

{- SNB BYT SKL GLK: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB BYT SKL GLK ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_293/fi-icl-u3/igt@gem_eio@unwedge-stress.html

Comment 14 Lakshmi 2019-05-27 13:19:30 UTC

(In reply to CI Bug Log from comment #13)
> A CI Bug Log filter associated to this bug has been updated:
> 
> {- SNB BYT SKL GLK: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
> {+ SNB BYT SKL GLK ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}
> 
> New failures caught by the filter:
> 
>   * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_293/fi-icl-u3/igt@gem_eio@unwedge-stress.html

Reopened this bug as this failure happened on ICL.

Comment 15 CI Bug Log 2019-06-25 06:46:41 UTC

A CI Bug Log filter associated to this bug has been updated:

{- SNB BYT SKL GLK ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB BYT SKL GLK ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}


  No new failures caught with the new filter

Comment 16 Shuang He 2019-06-25 06:47:13 UTC

Created attachment 144631 [details]
attachment-13473-0.html

Dear sender,
I am on leave during ww26.2. Please call my cell phone if it is urgent; sorry for any inconvenience this may cause.

Comment 17 Chris Wilson 2019-07-05 11:31:10 UTC

For reference,

commit f0e39642f6f8da5406627bfa79c6600df949e203 (upstream/master, origin/master, origin/HEAD)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Jul 2 12:40:45 2019 +0100

    i915/gem_eio: Assert the hanging request is correctly identified
    
    When forcing a reset, it is crucial that the kernel correctly identifies
    the injected hang. Verify this is the case for reset-stress.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>

One hypothesis is that we are not resetting the guilty request and so hitting a hangcheck instead.

Comment 18 CI Bug Log 2019-07-08 09:06:57 UTC

A CI Bug Log filter associated to this bug has been updated:

{- SNB BYT SKL GLK ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB BYT SKL APL GLK ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/IGT_5088/shard-apl8/igt@gem_eio@reset-stress.html

Comment 19 Chris Wilson 2019-07-09 15:04:01 UTC

<7> [944.138584] [IGT] Forcing GPU reset
<7> [944.138848] [drm:i915_reset_device [i915]] resetting chip
<5> [944.138957] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff
<7> [944.139197] [IGT] Checking that the GPU recovered
<5> [944.162438] Setting dangerous option reset - tainting kernel
<7> [944.275166] [drm:i915_reset_device [i915]] resetting chip
<5> [944.276899] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff
<5> [944.277178] Setting dangerous option reset - tainting kernel
<7> [944.277284] [IGT] Forcing GPU reset
<7> [944.277557] [drm:i915_reset_device [i915]] resetting chip
<5> [944.278273] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff
<7> [944.278579] [IGT] Checking that the GPU recovered
<5> [944.302432] Setting dangerous option reset - tainting kernel
<7> [946.381889] [drm:i915_reset_device [i915]] resetting chip
<5> [946.382011] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff
<5> [946.382270] Setting dangerous option reset - tainting kernel
<7> [946.382345] [IGT] Forcing GPU reset
<7> [946.382557] [drm:i915_reset_device [i915]] resetting chip
<5> [946.383318] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff
<7> [946.383621] [IGT] Checking that the GPU recovered
<6> [946.475026] [IGT] gem_eio: exiting, ret=98

This confirms that we normally expect quick reset+recovery cycles (with a reset period of 100ms between iterations); the outlier is the roughly 2s gap between 944.302432 and 946.381889. It also tells us that the delay occurs before i915_reset_device (although we could do with drm.debug=7 to be sure), i.e. in the preamble in i915_handle_error(). Of note, the only thing there is synchronize_rcu_expedited(). :|
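
The stall can be located mechanically by scanning the timestamps in a log like the one above and reporting the largest gap between consecutive lines. The following is a rough sketch only; dmesg_timestamp() and largest_gap() are hypothetical helpers, and the "<level> [seconds]" line layout is assumed from the excerpt in this comment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Parse the timestamp from a line shaped like
 * "<7> [944.138584] [IGT] Forcing GPU reset".
 * Returns the time in seconds, or -1.0 if the line does not match.
 */
static double dmesg_timestamp(const char *line)
{
	int level;
	double ts;

	if (sscanf(line, "<%d> [%lf]", &level, &ts) != 2)
		return -1.0;
	return ts;
}

/*
 * Find the largest gap between consecutive timestamped lines.
 * Returns the index of the line that follows the gap (0 if there are
 * fewer than two parsable lines) and stores the gap in *gap.
 */
static size_t largest_gap(const char *const *lines, size_t n, double *gap)
{
	size_t worst = 0;
	double prev = -1.0;

	*gap = 0.0;
	for (size_t i = 0; i < n; i++) {
		double ts = dmesg_timestamp(lines[i]);

		if (ts < 0.0)
			continue; /* skip lines without a timestamp */
		if (prev >= 0.0 && ts - prev > *gap) {
			*gap = ts - prev;
			worst = i;
		}
		prev = ts;
	}
	return worst;
}
```

Applied to the excerpt, the largest gap is the one ending at the "resetting chip" line stamped 946.381889, about 2.08s after the preceding "Setting dangerous option reset" line.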

Comment 20 CI Bug Log 2019-07-15 08:26:46 UTC

A CI Bug Log filter associated to this bug has been updated:

{- SNB BYT SKL APL GLK ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB BYT SKL APL GLK CML ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6470/re-cml-u/igt@gem_eio@reset-stress.html

Comment 21 CI Bug Log 2019-08-09 09:38:18 UTC

A CI Bug Log filter associated to this bug has been updated:

{- SNB BYT SKL APL GLK CML ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB BYT SKL BXT APL GLK CML ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_338/fi-bxt-j4205/igt@gem_eio@reset-stress.html

Comment 22 CI Bug Log 2019-09-09 10:13:37 UTC

A CI Bug Log filter associated to this bug has been updated:

{- SNB BYT SKL BXT APL GLK CML ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB BYT SKL BXT APL GLK CFL CML ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-cfl-8109u/igt@gem_eio@reset-stress.html
  * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_364/fi-cfl-guc/igt@gem_eio@reset-stress.html

Comment 23 CI Bug Log 2019-09-20 08:02:38 UTC

A CI Bug Log filter associated to this bug has been updated:

{- SNB BYT SKL BXT APL GLK CFL CML ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB BYT SKL BXT APL GLK CFL CML ICL: igt@gem_eio@(reset|unwedge)-stress|igt@gem_eio@kms - fail - Failed assertion: med < limit && max < 5 * limit +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/IGT_5193/shard-snb6/igt@gem_eio@kms.html

Comment 24 CI Bug Log 2019-09-20 08:07:43 UTC

The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* SNB: igt@runner@aborted - fail -Previous test: gem_eio (kms)
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3440/shard-snb5/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGT_5193/shard-snb6/igt@runner@aborted.html

Comment 25 CI Bug Log 2019-09-20 11:02:04 UTC

A CI Bug Log filter associated to this bug has been updated:

{- SNB BYT SKL BXT APL GLK CFL CML ICL: igt@gem_eio@(reset|unwedge)-stress|igt@gem_eio@kms - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB BYT SKL BXT APL GLK CFL CML ICL: igt@gem_eio@(reset|unwedge)-stress|igt@gem_eio@kms - fail - Failed assertion: med < limit && max < 5 * limit +}


  No new failures caught with the new filter

Comment 26 CI Bug Log 2019-09-20 11:02:23 UTC

A CI Bug Log filter associated to this bug has been updated:

{- SNB BYT SKL BXT APL GLK CFL CML ICL: igt@gem_eio@(reset|unwedge)-stress|igt@gem_eio@kms - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB BYT SKL BXT APL GLK CFL CML ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}


  No new failures caught with the new filter

Comment 27 CI Bug Log 2019-09-20 11:02:28 UTC

The CI Bug Log issue associated to this bug has been updated.

### Removed filters

* SNB: igt@runner@aborted - fail -Previous test: gem_eio (kms) (added on 2 hours ago)

Comment 28 ashutosh.dixit 2019-10-23 03:08:24 UTC

Bug assessment: for over a month, the reset-stress and unwedge-stress gem_eio subtests have been passing on all platforms (including ICL) except SNB. Will watch for some more time and reduce the severity of the bug if the failures are not seen on other platforms. Also, perhaps the SNB failures can be fixed by somewhat increasing the time allowed to complete the wedge on SNB.

Comment 29 ashutosh.dixit 2019-10-23 03:10:37 UTC

Submitted the following patch (not tested) as a candidate fix for the SNB issue:

    i915/gem_eio: Attempt to fix reset-stress/unwedge-stress failures on SNB

    gem_eio reset-stress and unwedge-stress subtests are now passing on
    all platforms except SNB. Attempt to fix failures in SNB by giving
    a little more time to complete the wedge.

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109661
    Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>

diff --git a/tests/i915/gem_eio.c b/tests/i915/gem_eio.c
index 892f3657c..20f66e00d 100644
--- a/tests/i915/gem_eio.c
+++ b/tests/i915/gem_eio.c
@@ -300,7 +300,7 @@ static void check_wait_elapsed(const char *prefix, int fd, igt_stats_t *st)
         * modeset back on) around resets, so may take a lot longer.
         */
        limit = 250e6;
-       if (intel_gen(intel_get_drm_devid(fd)) < 5)
+       if (intel_gen(intel_get_drm_devid(fd)) <= 5)
                limit += 300e6; /* guestimate for 2x worstcase modeset */

        med = igt_stats_get_median(st);
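
To see which generations the changed comparison affects, the limit selection can be sketched as a standalone function. This is an illustration only: wait_limit_ns() is a hypothetical stand-in that takes the generation number directly instead of calling intel_gen(intel_get_drm_devid(fd)), and it assumes the limit is in nanoseconds (250e6 corresponding to the 250ms reported in the failure log).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the limit selection in check_wait_elapsed().
 * `patched` selects the comparison from the diff above:
 *   false: gen < 5   (before the patch)
 *   true:  gen <= 5  (after the patch)
 */
static uint64_t wait_limit_ns(int gen, bool patched)
{
	uint64_t limit = 250000000ull; /* 250e6 ns = 250ms */

	if (patched ? gen <= 5 : gen < 5)
		limit += 300000000ull; /* extra 300ms allowance */

	return limit;
}
```

Note that Sandy Bridge is normally identified as gen 6, so under that assumption wait_limit_ns(6, true) still returns the 250ms base limit; the comparison would likely need to be `<= 6` (or `< 7`) for the extra allowance to apply to SNB.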

Comment 30 Francesco Balestrieri 2019-11-11 10:21:43 UTC

Updating platform field accordingly.

Comment 31 Martin Peres 2019-11-29 18:08:17 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/232.