Bug 110429 - [CI][SHARDS] igt@i915_selftest@live_hangcheck - dmesg-fail - igt_atomic_reset_engine timed out, cancelling test.
Status: RESOLVED WORKSFORME
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: XOrg git
Hardware: Other All
Importance: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
Depends on:
Blocks:
 
Reported: 2019-04-15 07:43 UTC by Martin Peres
Modified: 2019-07-20 12:59 UTC

See Also:
i915 platform: ICL
i915 features: GEM/Other



Description Martin Peres 2019-04-15 07:43:15 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5922/shard-iclb3/igt@i915_selftest@live_hangcheck.html

<6> [2204.414577] i915: Running intel_hangcheck_live_selftests/igt_atomic_reset
<3> [2204.468308] igt_atomic_reset_engine timed out, cancelling test.
Comment 1 CI Bug Log 2019-04-15 07:43:55 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* ICL: igt@i915_selftest@live_hangcheck - dmesg-fail - igt_atomic_reset_engine timed out, cancelling test.
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5922/shard-iclb3/igt@i915_selftest@live_hangcheck.html
Comment 2 Francesco Balestrieri 2019-04-18 08:40:10 UTC
From:

<3> [2204.524458] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout

It seems that the HW failed to respond and the test timed out. We should increase the timeout of the test to get the actual failure.
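
To make the ordering of events concrete, here is a minimal sketch that parses the three dmesg lines quoted in this report and prints each one's offset from the start of the subtest. It assumes all three lines come from the same dmesg log, and that the "Running ... igt_atomic_reset" line is a fair reference point. The helper names (`parse`, `deltas_ms`) are made up for illustration and are not part of IGT or i915.

```python
import re

# Lines copied from the dmesg excerpts quoted in this report.
# Assumption: they all belong to the same boot log.
DMESG = """\
<6> [2204.414577] i915: Running intel_hangcheck_live_selftests/igt_atomic_reset
<3> [2204.468308] igt_atomic_reset_engine timed out, cancelling test.
<3> [2204.524458] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
"""

# "<level> [seconds.micros] message"
LINE_RE = re.compile(r"^<(\d)> \[\s*(\d+\.\d+)\] (.*)$")


def parse(text):
    """Return a list of (loglevel, timestamp_seconds, message) tuples."""
    events = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            events.append((int(m.group(1)), float(m.group(2)), m.group(3)))
    return events


def deltas_ms(events):
    """Return (offset_ms, message) pairs relative to the first event."""
    start = events[0][1]
    return [(round((t - start) * 1000, 3), msg) for _, t, msg in events]


if __name__ == "__main__":
    for ms, msg in deltas_ms(parse(DMESG)):
        print(f"+{ms:8.3f} ms  {msg}")
```

Under those assumptions, the selftest gives up about 54 ms after the subtest starts, while the driver's own "reset request timeout" error is only logged at about 110 ms. That ordering is consistent with the suggestion that a longer test timeout would let the underlying failure surface first.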
Comment 3 Arek Hiler 2019-04-25 11:18:31 UTC
(In reply to Francesco Balestrieri from comment #2)
> From:
> 
> <3> [2204.524458] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset
> request timeout
> 
> It seems that the HW failed to respond and the test timed out. We should
> increase the timeout of the test to get the actual failure.

Was there anything done already? Did we increase the timeout? The issue was seen only once in CI_DRM_5922.

If it is a real issue and not a one-off fluke, it means we may occasionally fail to reset from atomic context, which would leave the GPU wedged if it ever happened on a live system. But that's a stretch.

TBH, I don't think that this particular issue has anything to do with atomic contexts. It is just one engine being randomly stuck or taking more time to reset than expected. So it is not that serious, especially with this failure rate.

Let's keep an eye on this.
Comment 4 Chris Wilson 2019-04-25 12:00:20 UTC
(In reply to Arek Hiler from comment #3)
> (In reply to Francesco Balestrieri from comment #2)
> > From:
> > 
> > <3> [2204.524458] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset
> > request timeout
> > 
> > It seems that the HW failed to respond and the test timed out. We should
> > increase the timeout of the test to get the actual failure.
> 
> Was there anything done already? Did we increase the timeout? The issue was
> seen only once in CI_DRM_5922.

The error has occurred quite rarely across the reset tests over the years, and over the years we have applied whatever w/a we could find to reduce the rate of incidence. We haven't increased the timeout applied to the selftest yet -- and it would not fix the bug, just make the cause more obvious.
Comment 5 Martin Peres 2019-04-26 11:32:04 UTC
(In reply to Chris Wilson from comment #4)
> (In reply to Arek Hiler from comment #3)
> > (In reply to Francesco Balestrieri from comment #2)
> > > From:
> > > 
> > > <3> [2204.524458] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset
> > > request timeout
> > > 
> > > It seems that the HW failed to respond and the test timed out. We should
> > > increase the timeout of the test to get the actual failure.
> > 
> > Was there anything done already? Did we increase the timeout? The issue was
> > seen only once in CI_DRM_5922.
> 
> The error has occurred quite rarely across the reset tests over the years,
> and over the years we have applied whatever w/a we could find to reduce the
> rate of incidence. We haven't increased the timeout applied to the selftest
> yet -- and it would not fix the bug, just make the cause more obvious.

So this failure has been seen twice, once on CFL and once on ICL.

I think your explanation makes sense, and we should try to reduce the reproduction rate as much as possible, but this does not look like a new regression, more like an architectural issue.

Dropping the priority to medium so that we can periodically check that this does not become more frequent.
Comment 6 CI Bug Log 2019-05-15 10:02:30 UTC
A CI Bug Log filter associated to this bug has been updated:

{- ICL: igt@i915_selftest@live_hangcheck - dmesg-fail - igt_atomic_reset_engine timed out, cancelling test. -}
{+ VEGA_M ICL: igt@i915_selftest@live_hangcheck - dmesg-fail - igt_atomic_reset_engine timed out, cancelling test. +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4988/fi-kbl-8809g/igt@i915_selftest@live_hangcheck.html
Comment 7 Chris Wilson 2019-05-15 20:52:20 UTC
(In reply to CI Bug Log from comment #6)
> A CI Bug Log filter associated to this bug has been updated:
> 
> {- ICL: igt@i915_selftest@live_hangcheck - dmesg-fail -
> igt_atomic_reset_engine timed out, cancelling test. -}
> {+ VEGA_M ICL: igt@i915_selftest@live_hangcheck - dmesg-fail -
> igt_atomic_reset_engine timed out, cancelling test. +}
> 
> New failures caught by the filter:
> 
>   *
> https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4988/fi-kbl-8809g/
> igt@i915_selftest@live_hangcheck.html

Different bug. That one looks like context corruption.
Comment 8 Chris Wilson 2019-07-20 12:59:20 UTC
Original bug 404.

