100181 – [hsw] GPU hang on context restore

Summary: [hsw] GPU hang on context restore

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	XOrg git
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium major
Assignee:	Intel GFX Bugs mailing list
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:
Keywords:

Duplicates (4):	100214 100315 100347 100521
Depends on:
Blocks:

Reported:	2017-03-13 09:32 UTC by me@tobin.cc
Modified:	2017-03-31 19:40 UTC
CC List:	6 users

See Also:
i915 platform:	HSW
i915 features:	GPU hang

Attachments
output from /sys/class/drm/card0/error (14.08 KB, application/x-bzip) 2017-03-13 09:32 UTC, me@tobin.cc
Complete dmesg output since boot (58.25 KB, text/plain) 2017-03-13 09:33 UTC, me@tobin.cc
cat /sys/class/drm/card0/error (9.04 KB, text/plain) 2017-03-27 09:48 UTC, martink
just the next example of /sys/class/drm/card0/error (9.04 KB, text/plain) 2017-03-27 10:06 UTC, martink

Description me@tobin.cc 2017-03-13 09:32:51 UTC

Created attachment 130189
output from /sys/class/drm/card0/error

GPU hangs during 'normal' computer usage. Cannot reproduce on demand; it happens approximately every hour or two.

Desktop System:
Linux 4.11.0-rc1-torvalds #6 SMP x86_64 x86_64 GNU/Linux

Primary applications running:
Firefox, Terminator, Emacs, Rhythmbox

Edited dmesg output

[drm] GPU HANG: ecode 7:0:0xf3cffffe, in compiz [1167], reason: Hang on render ring, action: reset
[drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[drm] GPU crash dump saved to /sys/class/drm/card0/error
drm/i915: Resetting chip after gpu hang
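
For anyone triaging several of these reports, the hang line above can be split into its fields mechanically. This is a minimal sketch in Python, assuming the line keeps exactly the layout shown in this report. The function name `parse_gpu_hang` and the group names `gen`, `engine` and `code` are labels chosen here for illustration; how to interpret the three ecode fields is an assumption, not something stated in this report.

```python
import re

# Pattern matching the "GPU HANG" line layout quoted in this report.
# The group names (gen, engine, code) are illustrative labels only.
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<gen>\d+):(?P<engine>\d+):(?P<code>0x[0-9a-fA-F]+), "
    r"in (?P<comm>.+?) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>.*?), action: (?P<action>\w+)"
)


def parse_gpu_hang(line):
    """Return the fields of a GPU HANG dmesg line as a dict, or None."""
    match = HANG_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["pid"] = int(fields["pid"])
    return fields


if __name__ == "__main__":
    sample = ("[drm] GPU HANG: ecode 7:0:0xf3cffffe, in compiz [1167], "
              "reason: Hang on render ring, action: reset")
    print(parse_gpu_hang(sample))
```

Comparing the process name and reason across reports is a quick first pass; the crash dump in /sys/class/drm/card0/error is still what the developers ask for.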

Bug present while running all mainline kernels since about 4.10-rc8. Also present when running Ubuntu kernel 4.4.0-65-generic.

Comment 1 me@tobin.cc 2017-03-13 09:33:53 UTC

Created attachment 130190
Complete dmesg output since boot

Comment 2 me@tobin.cc 2017-03-14 21:40:32 UTC

The bug is reproducible by building a kernel with threading set to utilize all CPUs, i.e. with all CPUs running at 100% load.
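
A rough sketch of that reproduction attempt, written in Python. It assumes the build is driven by `make` with one job per CPU and that a hang shows up in the kernel log with the "GPU HANG" text quoted in the description. The helper names `build_command` and `saw_gpu_hang` are hypothetical.

```python
import os


def build_command(jobs=None):
    """Return a make invocation using one job per CPU unless told otherwise."""
    if jobs is None:
        jobs = os.cpu_count() or 1
    return ["make", f"-j{jobs}"]


def saw_gpu_hang(dmesg_text):
    """Return True if any line of the given kernel log text reports a GPU hang."""
    return any("GPU HANG" in line for line in dmesg_text.splitlines())


if __name__ == "__main__":
    print("build with:", " ".join(build_command()))
```

The command would be run inside a kernel source tree (for example with `subprocess.run(build_command(), cwd=...)`), and the output of `dmesg` passed to `saw_gpu_hang` afterwards.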

Comment 3 me@tobin.cc 2017-03-15 05:11:23 UTC

The kernel build does not trigger the bug 100% of the time.

Comment 4 Chris Wilson 2017-03-15 13:17:02 UTC

*** Bug 100214 has been marked as a duplicate of this bug. ***

Comment 5 Chris Wilson 2017-03-22 20:52:00 UTC

*** Bug 100315 has been marked as a duplicate of this bug. ***

Comment 6 Chris Wilson 2017-03-22 20:52:06 UTC

*** Bug 100347 has been marked as a duplicate of this bug. ***

Comment 7 Chris Wilson 2017-03-23 10:00:34 UTC

commit 5d4bac5503fcc67dd7999571e243cee49371aef7
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Mar 22 20:59:30 2017 +0000

    drm/i915: Restore marking context objects as dirty on pinning
    
    Commit e8a9c58fcd9a ("drm/i915: Unify active context tracking between
    legacy/execlists/guc") converted the legacy intel_ringbuffer submission
    to the same context pinning mechanism as execlists - that is to pin the
    context until the subsequent request is retired. Previously it used the
    vma retirement of the context object to keep itself pinned until the
    next request (after i915_vma_move_to_active()). In the conversion, I
    missed that the vma retirement was also responsible for marking the
    object as dirty. Mark the context object as dirty when pinning
    (equivalent to execlists) which ensures that if the context is swapped
    out due to mempressure or suspend/hibernation, when it is loaded back in
    it does so with the previous state (and not all zero).
    
    Fixes: e8a9c58fcd9a ("drm/i915: Unify active context tracking between legacy/execlists/guc")
    Reported-by: Dennis Gilmore <dennis@ausil.us>
    Reported-by: Mathieu Marquer <mathieu.marquer@gmail.com>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99993
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=100181
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Cc: <drm-intel-fixes@lists.freedesktop.org> # v4.11-rc1
    Link: http://patchwork.freedesktop.org/patch/msgid/20170322205930.12762-1-chris@chris-wilson.co.uk
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
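
The failure described in the commit message can be illustrated with a toy model. This is only a sketch in Python, not the i915 code: `BackingObject`, `pin_context` and `run_cycle` are hypothetical names, and the model assumes that eviction writes back only objects flagged dirty and drops clean ones, so a reload sees whatever the backing store last held (initially all zero).

```python
class BackingObject:
    """Toy stand-in for a context object whose pages can be swapped out."""

    def __init__(self, size):
        self.resident = bytearray(size)  # pages currently in memory
        self.stored = bytes(size)        # backing store, initially all zero
        self.dirty = False

    def evict(self):
        # Only dirty pages are written back; clean pages are simply dropped.
        if self.dirty:
            self.stored = bytes(self.resident)
            self.dirty = False
        self.resident = None

    def load(self):
        # Bring the pages back from the backing store if they were evicted.
        if self.resident is None:
            self.resident = bytearray(self.stored)
        return self.resident


def pin_context(obj, mark_dirty):
    """Pin the object; optionally flag it dirty, as the fix does on pinning."""
    pages = obj.load()
    if mark_dirty:
        obj.dirty = True
    return pages


def run_cycle(mark_dirty):
    """Pin, let the 'GPU' save state, evict, then pin again and read back."""
    ctx = BackingObject(4)
    pages = pin_context(ctx, mark_dirty)
    pages[:] = b"\x01\x02\x03\x04"  # hardware saves context state here
    ctx.evict()                     # memory pressure or suspend
    return bytes(pin_context(ctx, mark_dirty))


if __name__ == "__main__":
    print("with dirty marking:   ", run_cycle(True))
    print("without dirty marking:", run_cycle(False))
```

With the dirty flag set on pinning, the saved state survives the eviction. Without it, the reloaded image is all zero, which is the situation the commit message says leads to a bad context restore.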

Comment 8 martink 2017-03-27 09:48:49 UTC

Created attachment 130478
cat /sys/class/drm/card0/error

This is not fixed in -rc4, which includes said patch. I attach /sys/class/drm/card0/error from my GPU hang under load (compiling stuff).

Again: 4.10 is clearly *good*. This was introduced in 4.11-rc1.

thanks
                  martin

Comment 9 martink 2017-03-27 09:50:39 UTC

-rc4 is *better* than rc3, so this patch clearly improved the situation. -rc4 is still worse than 4.10 as I experienced a gpu hang again.

Comment 10 martink 2017-03-27 10:06:02 UTC

Created attachment 130479
just the next example of /sys/class/drm/card0/error

Comment 11 Chris Wilson 2017-03-27 11:32:31 UTC

-rc4 doesn't include that patch.
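
Whether a given tag contains a commit can be checked locally rather than guessed. The sketch below, in Python, wraps `git merge-base --is-ancestor`, which exits with status 0 when the first commit is an ancestor of the second and 1 when it is not. The helper names are hypothetical, and one caveat is an assumption worth stating: a fix that reaches mainline as a cherry-pick may carry a different hash there, so a negative result for the hash quoted in comment 7 is only meaningful in a tree that actually contains that hash.

```python
import subprocess


def is_ancestor_cmd(commit, ref):
    """Build the git command that tests whether `commit` is contained in `ref`."""
    return ["git", "merge-base", "--is-ancestor", commit, ref]


def ref_contains(commit, ref, repo="."):
    """Return True if `ref` contains `commit` in the git repository at `repo`."""
    result = subprocess.run(is_ancestor_cmd(commit, ref), cwd=repo)
    if result.returncode not in (0, 1):
        # Any other status means git itself failed (unknown ref, not a repo, ...).
        raise RuntimeError(f"git exited with status {result.returncode}")
    return result.returncode == 0


if __name__ == "__main__":
    fix = "5d4bac5503fcc67dd7999571e243cee49371aef7"
    print(" ".join(is_ancestor_cmd(fix, "v4.11-rc4")))
```

Running `ref_contains(fix, "v4.11-rc4", repo=...)` against a kernel checkout answers the question for that tree.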

Comment 12 Chris Wilson 2017-03-31 19:40:12 UTC

*** Bug 100521 has been marked as a duplicate of this bug. ***
