https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5922/shard-iclb3/igt@i915_selftest@live_hangcheck.html
<6> [2204.414577] i915: Running intel_hangcheck_live_selftests/igt_atomic_reset
<3> [2204.468308] igt_atomic_reset_engine timed out, cancelling test.
The CI Bug Log issue associated to this bug has been updated.
### New filters associated
* ICL: igt@i915_selftest@live_hangcheck - dmesg-fail - igt_atomic_reset_engine timed out, cancelling test.
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5922/shard-iclb3/igt@i915_selftest@live_hangcheck.html
From:
<3> [2204.524458] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
It seems that the HW failed to respond and the test timed out. We should increase the timeout of the test to get the actual failure.
(In reply to Francesco Balestrieri from comment #2)
> From:
>
> <3> [2204.524458] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
>
> It seems that the HW failed to respond and the test timed out. We should increase the timeout of the test to get the actual failure.
Was there anything done already? Did we increase the timeout? The issue was seen only once in CI_DRM_5922.
If it is a real issue and not a one-off random fluke, it means we may occasionally fail to reset from atomic context, leading to a wedged GPU if it ever happens on a live system. But that's a stretch.
TBH, I don't think this particular issue has anything to do with atomic contexts. It is just one engine being randomly stuck or taking longer to reset than expected. So it is not that serious, especially with this failure rate. Let's keep an eye on this.
(In reply to Arek Hiler from comment #3)
> (In reply to Francesco Balestrieri from comment #2)
> > From:
> >
> > <3> [2204.524458] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
> >
> > It seems that the HW failed to respond and the test timed out. We should increase the timeout of the test to get the actual failure.
>
> Was there anything done already? Did we increase the timeout? The issue was seen only once in CI_DRM_5922.
The error has occurred quite rarely across the reset tests over the years, and in that time we have applied whatever workarounds (w/a) we could find to reduce its rate of incidence. We haven't increased the timeout applied to the selftest yet; doing so would not fix the bug, just make the cause more obvious.
(In reply to Chris Wilson from comment #4)
> (In reply to Arek Hiler from comment #3)
> > (In reply to Francesco Balestrieri from comment #2)
> > > From:
> > >
> > > <3> [2204.524458] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
> > >
> > > It seems that the HW failed to respond and the test timed out. We should increase the timeout of the test to get the actual failure.
> >
> > Was there anything done already? Did we increase the timeout? The issue was seen only once in CI_DRM_5922.
>
> The error has occurred quite rarely across the reset tests over the years, and over the years we have applied whatever w/a we could find to reduce the rate of incidence. We haven't increased the timeout applied to the selftest yet -- and it would not fix the bug, just make the cause more obvious.
So this failure has been seen twice, once on CFL and once on ICL. I think your explanation makes sense, and we should try to reduce the reproduction rate as much as possible, but this does not look like a new regression, more like an architectural issue. Dropping the priority to medium so that we can periodically check that this does not become more apparent.
A CI Bug Log filter associated to this bug has been updated:
{- ICL: igt@i915_selftest@live_hangcheck - dmesg-fail - igt_atomic_reset_engine timed out, cancelling test. -}
{+ VEGA_M ICL: igt@i915_selftest@live_hangcheck - dmesg-fail - igt_atomic_reset_engine timed out, cancelling test. +}
New failures caught by the filter:
* https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4988/fi-kbl-8809g/igt@i915_selftest@live_hangcheck.html
(In reply to CI Bug Log from comment #6)
> A CI Bug Log filter associated to this bug has been updated:
>
> {- ICL: igt@i915_selftest@live_hangcheck - dmesg-fail - igt_atomic_reset_engine timed out, cancelling test. -}
> {+ VEGA_M ICL: igt@i915_selftest@live_hangcheck - dmesg-fail - igt_atomic_reset_engine timed out, cancelling test. +}
>
> New failures caught by the filter:
>
> * https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4988/fi-kbl-8809g/igt@i915_selftest@live_hangcheck.html
Different bug. That one looks like context corruption.
Original bug 404.