Bug 101600

Summary:	[BAT][PNV,BLB] i915_reset_device timed out, cancelling all in-flight rendering. provokes a dmesg-warn
Product:	DRI	Reporter:	Martin Peres <martin.peres>
Component:	DRM/Intel	Assignee:	Intel GFX Bugs mailing list <intel-gfx-bugs>
Status:	CLOSED WORKSFORME	QA Contact:	Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity:	critical
Priority:	high	CC:	intel-gfx-bugs
Version:	DRI git
Hardware:	Other
OS:	All
Whiteboard:	ReadyForDev
i915 platform:	PNV	i915 features:	GPU hang

Description Martin Peres 2017-06-26 13:41:25 UTC

Starting from CI_DRM_2765, the test igt@gem_ringfill@basic-default-hang started passing on a lot of platforms, but on some, it also started outputting the following dmesg message, which triggers a dmesg-warn instead of pass:

[  201.235709] i915 0000:00:02.0: i915_reset_device timed out, cancelling all in-flight rendering.

Full logs:
 - fi-blb-e6850: https://intel-gfx-ci.01.org/CI/CI_DRM_2765/fi-blb-e6850/igt@gem_ringfill@basic-default-hang.html
 - fi-pnv-d510: https://intel-gfx-ci.01.org/CI/CI_DRM_2770/fi-pnv-d510/igt@gem_ringfill@basic-default-hang.html

Comment 1 Chris Wilson 2017-06-27 13:39:32 UTC

This is still bug 99093, but the deadlock has been downgraded to rendering corruption, user data loss and a warning in dmesg.

Comment 2 Martin Peres 2017-06-27 13:41:23 UTC

(In reply to Chris Wilson from comment #1)
> This is still bug 99093, but the deadlock has been downgraded to rendering
> corruption, user data loss and a warning in dmesg.

I assumed so, but should we ignore that in CI for now until this is properly fixed?

Comment 3 Chris Wilson 2017-06-27 13:45:50 UTC

Yes, I left the dmesg warning in there so that there remained some impetus from BAT to fix this. I'm optimistic that this is fixed properly along with the remaining bugs surrounding GPU hangs and atomic.

Comment 4 Elizabeth 2017-06-27 14:44:05 UTC

Adding tag into "Whiteboard" field - ReadyForDev
*Status is correct
*Platform is included
*Feature is included
*Priority and Severity correctly set
*Logs included

Comment 5 Elizabeth 2017-10-13 20:17:17 UTC

Quick update: issue remains the same.Track is keep on cibuglog.

Comment 6 Daniel Vetter 2017-11-08 10:16:47 UTC

pnv isn't really a top priority ...

Comment 7 Daniel Vetter 2017-11-08 12:06:04 UTC

A bit more context, requested by Marta: Throwing the work away on reset is how older platforms worked for years. Fixing them would require rewriting a chunk of the driver, which we really can't justify.

Comment 8 Marta Löfstedt 2018-02-07 13:47:13 UTC

Note that sometime we hit:
 https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3731/fi-pnv-d510/igt@gem_ringfill@basic-default-hang.html

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3733/fi-blb-e6850/igt@gem_ringfill@basic-default-hang.html
	

[  332.256029] i915 0000:00:02.0: i915_reset_device timed out, cancelling all in-flight rendering.
[  332.339420] i915 0000:00:02.0: Resetting chip after gpu hang
[  332.988985] [drm:i915_reset [i915]] *ERROR* Failed hw init on reset -5

this cause a lot of consecutive tests to be skipped.

Comment 9 Marta Löfstedt 2018-02-19 10:28:05 UTC

Odd that we haven't hit this on CTG before, anyways:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3795/fi-ctg-p8600/igt@gem_ringfill@basic-default-hang.html


[  248.770066] i915 0000:00:02.0: Resetting chip after gpu hang
[  253.921249] i915 0000:00:02.0: i915_reset_device timed out, cancelling all in-flight rendering.
[  258.132155] i915 0000:00:02.0: Failed to reset chip

Comment 10 Chris Wilson 2018-06-25 16:29:27 UTC

dmesg spam be silenced (by avoiding the mutex deadlock in this and only this scenario):

commit 8d52e447807b350b98ffb4e64bc2fcc1f181c5be
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat Jun 23 11:39:51 2018 +0100

    drm/i915: Defer modeset cleanup to a secondary task
    
    If we avoid cleaning up the old state immediately in
    intel_atomic_commit_tail() and defer it to a second task, we can avoid
    taking heavily contended locks when the caller is ready to procede.
    Subsequent modesets will wait for the cleanup operation (either directly
    via the ordered modeset wq or indirectly through the atomic helperr)
    which keeps the number of inflight cleanup tasks in check.
    
    As an example, during reset an immediate modeset is performed to disable
    the displays before the HW is reset, which must avoid struct_mutex to
    avoid recursion. Moving the cleanup to a separate task, defers acquiring
    the struct_mutex to after the GPU is running again, allowing it to
    complete. Even in a few patches time (optimist!) when we no longer
    require struct_mutex to unpin the framebuffers, it will still be good
    practice to minimise the number of contention points along reset. The
    mutex dependency still exists (as one modeset flushes the other), but in
    the short term it resolves the deadlock for simple reset cases.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=101600
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180623103951.23889-1-chris@chris-wilson.co.uk
    Acked-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
    Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>

Comment 11 Martin Peres 2019-03-08 15:53:16 UTC

(In reply to Chris Wilson from comment #10)
> dmesg spam be silenced (by avoiding the mutex deadlock in this and only this
> scenario):
> 
> commit 8d52e447807b350b98ffb4e64bc2fcc1f181c5be
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Sat Jun 23 11:39:51 2018 +0100
> 
>     drm/i915: Defer modeset cleanup to a secondary task
>     
>     If we avoid cleaning up the old state immediately in
>     intel_atomic_commit_tail() and defer it to a second task, we can avoid
>     taking heavily contended locks when the caller is ready to procede.
>     Subsequent modesets will wait for the cleanup operation (either directly
>     via the ordered modeset wq or indirectly through the atomic helperr)
>     which keeps the number of inflight cleanup tasks in check.
>     
>     As an example, during reset an immediate modeset is performed to disable
>     the displays before the HW is reset, which must avoid struct_mutex to
>     avoid recursion. Moving the cleanup to a separate task, defers acquiring
>     the struct_mutex to after the GPU is running again, allowing it to
>     complete. Even in a few patches time (optimist!) when we no longer
>     require struct_mutex to unpin the framebuffers, it will still be good
>     practice to minimise the number of contention points along reset. The
>     mutex dependency still exists (as one modeset flushes the other), but in
>     the short term it resolves the deadlock for simple reset cases.
>     
>     Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=101600
>     Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>     Link:
> https://patchwork.freedesktop.org/patch/msgid/20180623103951.23889-1-
> chris@chris-wilson.co.uk
>     Acked-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
>     Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>

Thanks! Closing!

Comment 12 CI Bug Log 2019-03-08 15:53:22 UTC

The CI Bug Log issue associated to this bug has been archived.

New failures matching the above filters will not be associated to this bug anymore.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.