Summary: | [CI] igt@.* - dmesg-warn/dmesg-fail - *ERROR* rcs0: reset request timeout | ||
---|---|---|---|
Product: | DRI | Reporter: | Martin Peres <martin.peres> |
Component: | DRM/Intel | Assignee: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Status: | CLOSED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Severity: | normal | ||
Priority: | medium | CC: | intel-gfx-bugs, petri.latvala, tomi.p.sarvela |
Version: | XOrg git | ||
Hardware: | Other | ||
OS: | All | ||
Whiteboard: | ReadyForDev | ||
i915 platform: | KBL | i915 features: | GEM/Other |
Description
Martin Peres
2018-05-28 11:51:38 UTC
The GPU totally died, and I mean totally! It's kind of the reason why we have:

taint:
        /*
         * History tells us that if we cannot reset the GPU now, we
         * never will. This then impacts everything that is run
         * subsequently. On failing the reset, we mark the driver
         * as wedged, preventing further execution on the GPU.
         * We also want to go one step further and add a taint to the
         * kernel so that any subsequent faults can be traced back to
         * this failure. This is important for CI, where if the
         * GPU/driver fails we would like to reboot and restart testing
         * rather than continue on into oblivion. For everyone else,
         * the system should still plod along, but they have been warned!
         */
        add_taint(TAINT_WARN, LOCKDEP_STILL_OK);

for this condition with the expectation that CI was going to reboot the machine when it occurs.

(In reply to Chris Wilson from comment #1)

OK! Adding Tomi and Petri to see what we can do here :)

If doing a reboot cycle is still too slow, a full suspend/resume should be enough. And we can also check whether the device recovered immediately, going to a reboot afterwards as required.
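For reference, a rough sketch of how a CI runner might act on that taint. The kernel exposes its taint bitmask in /proc/sys/kernel/tainted, and TAINT_WARN is bit 9 (value 512). The `next_action` policy and its return strings are assumptions made up for illustration; they are not taken from the actual CI scripts.

```python
from pathlib import Path
from typing import Optional

TAINT_WARN_BIT = 9  # 'W' flag, set by add_taint(TAINT_WARN, ...)


def read_taint(path: str = "/proc/sys/kernel/tainted") -> int:
    """Return the kernel taint bitmask as an integer."""
    return int(Path(path).read_text().strip())


def is_warn_tainted(taint_value: int) -> bool:
    """True if the TAINT_WARN bit is set in the bitmask."""
    return bool(taint_value & (1 << TAINT_WARN_BIT))


def next_action(taint_value: int,
                recovered_after_resume: Optional[bool] = None) -> str:
    """Hypothetical CI policy: try the cheap recovery first, reboot if needed.

    recovered_after_resume is None until a suspend/resume cycle has been
    tried; afterwards it says whether the GPU came back.
    """
    if not is_warn_tainted(taint_value):
        return "continue"
    if recovered_after_resume is None:
        return "suspend-resume"
    return "continue" if recovered_after_resume else "reboot"
```

Note that the taint flag stays set until the next reboot, so a runner that continues after a successful suspend/resume would have to remember that it has already handled this taint.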
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4426/shard-kbl7/igt@gem_eio@wait-wedge-immediate.html

[ 331.691790] Setting dangerous option reset - tainting kernel
[ 331.692131] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff
[ 331.693980] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[ 331.801322] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[ 331.905263] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[ 332.007147] i915 0000:00:02.0: Failed to reset chip
[ 332.011489] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout

The behaviour should have substantially changed with

commit f4e60c5cfbf217cc9faa3aeb63742860154fcfef (HEAD -> drm-intel-next-queued, drm-intel/drm-intel-next-queued)
Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Date: Mon Aug 13 16:01:16 2018 +0300

    drm/i915: Force reset on unready engine

    If engine reports that it is not ready for reset, we
    give up. Evidence shows that forcing a per engine reset
    on an engine which is not reporting to be ready for reset,
    can bring it back into a working order. There is risk that
    we corrupt the context image currently executing on that
    engine. But that is a risk worth taking as if we unblock
    the engine, we prevent a whole device wedging in a case
    of full gpu reset.

    Reset individual engine even if it reports that it is not
    prepared for reset, but only if we aim for full gpu reset
    and not on first reset attempt.

    v2: force reset only on later attempts, readability (Chris)
    v3: simplify with adequate caffeine levels (Chris)
    v4: comment about risks and migitations (Chris)

    Cc: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180813130116.7250-1-mika.kuoppala@linux.intel.com

(In reply to Chris Wilson from comment #5)

Seems like it fixed it. Thanks!
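For readers skimming the thread, the rule described in that commit message can be modelled roughly as below. This is a simplified Python model of the wording only, not the i915 code: the function names and the `max_attempts=3` default are assumptions made for illustration.

```python
from typing import Optional


def should_reset_engine(engine_ready: bool, full_gpu_reset: bool,
                        attempt: int) -> bool:
    """Decide whether to go ahead with resetting one engine.

    attempt is 0 for the first reset attempt, 1 for the second, and so on.
    """
    if engine_ready:
        return True
    # Engine did not report ready: force the reset only when aiming for a
    # full GPU reset, and never on the first attempt.
    return full_gpu_reset and attempt > 0


def first_forced_attempt(full_gpu_reset: bool,
                         max_attempts: int = 3) -> Optional[int]:
    """For an engine that never reports ready, return the first attempt at
    which the reset would go ahead anyway, or None if we would give up.
    max_attempts is a hypothetical retry budget."""
    for attempt in range(max_attempts):
        if should_reset_engine(False, full_gpu_reset, attempt):
            return attempt
    return None
```

Under this model an unready engine is skipped on the first attempt, forced on the second attempt of a full GPU reset, and never forced otherwise.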