Starting from CI_DRM_2765, the test igt@gem_ringfill@basic-default-hang started passing on a lot of platforms, but on some it also started outputting the following dmesg message, which triggers a dmesg-warn instead of a pass:

[ 201.235709] i915 0000:00:02.0: i915_reset_device timed out, cancelling all in-flight rendering.

Full logs:
- fi-blb-e6850: https://intel-gfx-ci.01.org/CI/CI_DRM_2765/fi-blb-e6850/igt@gem_ringfill@basic-default-hang.html
- fi-pnv-d510: https://intel-gfx-ci.01.org/CI/CI_DRM_2770/fi-pnv-d510/igt@gem_ringfill@basic-default-hang.html
This is still bug 99093, but the deadlock has been downgraded to rendering corruption, user data loss and a warning in dmesg.
(In reply to Chris Wilson from comment #1)
> This is still bug 99093, but the deadlock has been downgraded to rendering
> corruption, user data loss and a warning in dmesg.

I assumed so, but should we ignore that in CI for now until this is properly fixed?
Yes, I left the dmesg warning in there so that there remained some impetus from BAT to fix this. I'm optimistic that this is fixed properly along with the remaining bugs surrounding GPU hangs and atomic.
Adding tag into "Whiteboard" field - ReadyForDev
* Status is correct
* Platform is included
* Feature is included
* Priority and Severity correctly set
* Logs included
Quick update: the issue remains the same. Tracking is kept on cibuglog.
pnv isn't really a top priority ...
A bit more context, requested by Marta: Throwing the work away on reset is how older platforms worked for years. Fixing them would require rewriting a chunk of the driver, which we really can't justify.
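To make the trade-off concrete, here is a purely illustrative sketch (hypothetical names, not the real driver code) of the "throw the work away" reset model: on a full-chip reset, in-flight requests are failed rather than replayed, which is exactly the rendering corruption and user data loss mentioned in comment #1.

#include <linux/errno.h>
#include <linux/list.h>

/* Hypothetical request tracking, for illustration only. */
struct pending_request {
	struct list_head link;
	int status;
};

/* On reset, older platforms do not replay pending work: every
 * in-flight request is simply failed and retired. */
static void cancel_all_inflight(struct list_head *requests)
{
	struct pending_request *rq, *next;

	list_for_each_entry_safe(rq, next, requests, link) {
		rq->status = -EIO;	/* userspace sees the lost rendering */
		list_del(&rq->link);	/* retire: waiters observe completion */
	}
}

Replaying the requests instead would require tracking enough submission state to rebuild the rings after a reset, which is the "chunk of the driver" rewrite referred to above.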
Note that sometimes we hit:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3731/fi-pnv-d510/igt@gem_ringfill@basic-default-hang.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3733/fi-blb-e6850/igt@gem_ringfill@basic-default-hang.html

[ 332.256029] i915 0000:00:02.0: i915_reset_device timed out, cancelling all in-flight rendering.
[ 332.339420] i915 0000:00:02.0: Resetting chip after gpu hang
[ 332.988985] [drm:i915_reset [i915]] *ERROR* Failed hw init on reset -5

This causes a lot of consecutive tests to be skipped (the -5 is -EIO: hardware re-initialisation after the reset failed).
Odd that we haven't hit this on CTG before. Anyway:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3795/fi-ctg-p8600/igt@gem_ringfill@basic-default-hang.html

[ 248.770066] i915 0000:00:02.0: Resetting chip after gpu hang
[ 253.921249] i915 0000:00:02.0: i915_reset_device timed out, cancelling all in-flight rendering.
[ 258.132155] i915 0000:00:02.0: Failed to reset chip
The dmesg spam should be silenced (by avoiding the mutex deadlock in this and only this scenario):

commit 8d52e447807b350b98ffb4e64bc2fcc1f181c5be
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat Jun 23 11:39:51 2018 +0100

    drm/i915: Defer modeset cleanup to a secondary task

    If we avoid cleaning up the old state immediately in
    intel_atomic_commit_tail() and defer it to a second task, we can avoid
    taking heavily contended locks when the caller is ready to proceed.
    Subsequent modesets will wait for the cleanup operation (either directly
    via the ordered modeset wq or indirectly through the atomic helper),
    which keeps the number of in-flight cleanup tasks in check.

    As an example, during reset an immediate modeset is performed to disable
    the displays before the HW is reset, which must avoid struct_mutex to
    avoid recursion. Moving the cleanup to a separate task defers acquiring
    the struct_mutex to after the GPU is running again, allowing it to
    complete. Even in a few patches' time (optimist!) when we no longer
    require struct_mutex to unpin the framebuffers, it will still be good
    practice to minimise the number of contention points along reset. The
    mutex dependency still exists (as one modeset flushes the other), but in
    the short term it resolves the deadlock for simple reset cases.

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=101600
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180623103951.23889-1-chris@chris-wilson.co.uk
    Acked-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
    Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
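For readers unfamiliar with the pattern, here is a minimal, hypothetical sketch of what "defer modeset cleanup to a secondary task" means in practice. The struct and function names below are illustrative only, not the actual i915 code; the real commit hooks into the driver's own modeset workqueue and atomic-state tracking.

#include <linux/slab.h>
#include <linux/workqueue.h>
#include <drm/drm_atomic.h>

/* Container pairing the work item with the state it must release. */
struct old_state_cleanup {
	struct work_struct work;
	struct drm_atomic_state *state;
};

static void cleanup_old_state_work(struct work_struct *work)
{
	struct old_state_cleanup *c =
		container_of(work, struct old_state_cleanup, work);

	/* By the time this runs, the reset path has finished, so taking
	 * contended locks (e.g. struct_mutex) here cannot deadlock it. */
	drm_atomic_state_put(c->state);
	kfree(c);
}

static void commit_tail_deferred_cleanup(struct drm_atomic_state *state)
{
	struct old_state_cleanup *c;

	/* ... program the hardware for the new state here ... */

	c = kmalloc(sizeof(*c), GFP_KERNEL);
	if (!c) {
		/* Fall back to inline cleanup if the allocation fails. */
		drm_atomic_state_put(state);
		return;
	}

	c->state = state;
	INIT_WORK(&c->work, cleanup_old_state_work);
	/* Hand the lock-taking cleanup off to a worker instead of
	 * doing it inline in the commit path. */
	queue_work(system_unbound_wq, &c->work);
}

As the commit message notes, in the real driver subsequent modesets flush this work, so cleanup tasks cannot pile up indefinitely.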
(In reply to Chris Wilson from comment #10)
> The dmesg spam should be silenced (by avoiding the mutex deadlock in this
> and only this scenario):
>
> commit 8d52e447807b350b98ffb4e64bc2fcc1f181c5be
> [...]

Thanks! Closing!
The CI Bug Log issue associated with this bug has been archived. New failures matching the above filters will no longer be associated with this bug.