Bug 100181

Summary: [hsw] GPU hang on context restore
Product: DRI
Reporter: me <me>
Component: DRM/Intel
Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: CLOSED FIXED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: major
Priority: medium
CC: dennis, intel-gfx-bugs, martink, mathieu.marquer, me, mikhail.v.gavrilov
Version: XOrg git
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard:
i915 platform: HSW
i915 features: GPU hang
Attachments (description, flags):
output from /sys/class/drm/card0/error (flags: none)
Complete dmesg output since boot (flags: none)
cat /sys/class/drm/card0/error (flags: none)
just the next example of /sys/class/drm/card0/error (flags: none)

Description me@tobin.cc 2017-03-13 09:32:51 UTC
Created attachment 130189 [details]
output from /sys/class/drm/card0/error

GPU hangs during 'normal' computer usage. Cannot reproduce on demand; it happens approximately every hour or two.

Desktop System:
Linux 4.11.0-rc1-torvalds #6 SMP x86_64 x86_64 GNU/Linux

Primary applications running:
Firefox, Terminator, Emacs, Rhythmbox

Edited dmesg output

[drm] GPU HANG: ecode 7:0:0xf3cffffe, in compiz [1167], reason: Hang on render ring, action: reset
[drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[drm] GPU crash dump saved to /sys/class/drm/card0/error
drm/i915: Resetting chip after gpu hang
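
When comparing hang reports such as the duplicates of this bug, it can help to pull the fields out of the "GPU HANG" line mechanically. The following is a minimal sketch that parses the line quoted above. The function name parse_gpu_hang and the dictionary keys are hypothetical, and reading the three ecode components as generation, engine and error code is an assumption based on the line's layout, not something stated in this report.

```python
import re

# Pattern for the "[drm] GPU HANG: ..." line as it appears in the dmesg
# excerpt above. The group names are illustrative labels (assumptions).
GPU_HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<gen>\d+):(?P<engine>\d+):(?P<ecode>0x[0-9a-fA-F]+), "
    r"in (?P<process>.+?) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>.+?), action: (?P<action>\w+)"
)


def parse_gpu_hang(line):
    """Return the fields of a GPU HANG dmesg line as a dict, or None."""
    match = GPU_HANG_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric fields; the rest stay as strings.
    fields["gen"] = int(fields["gen"])
    fields["engine"] = int(fields["engine"])
    fields["pid"] = int(fields["pid"])
    fields["ecode"] = int(fields["ecode"], 16)
    return fields


if __name__ == "__main__":
    sample = ("[drm] GPU HANG: ecode 7:0:0xf3cffffe, in compiz [1167], "
              "reason: Hang on render ring, action: reset")
    print(parse_gpu_hang(sample))
```

Lines that do not match the pattern return None, so the function can be run over a whole dmesg log to collect only the hang entries.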

Bug present while running all mainline kernels since about 4.10-rc8. Also present when running Ubuntu kernel 4.4.0-65-generic.
Comment 1 me@tobin.cc 2017-03-13 09:33:53 UTC
Created attachment 130190 [details]
Complete dmesg output since boot
Comment 2 me@tobin.cc 2017-03-14 21:40:32 UTC
Bug is reproducible by building a kernel with threading set to utilize all CPUs, i.e. all CPUs are running at 100% load.
Comment 3 me@tobin.cc 2017-03-15 05:11:23 UTC
Kernel build does not trigger bug 100% of the time.
Comment 4 Chris Wilson 2017-03-15 13:17:02 UTC
*** Bug 100214 has been marked as a duplicate of this bug. ***
Comment 5 Chris Wilson 2017-03-22 20:52:00 UTC
*** Bug 100315 has been marked as a duplicate of this bug. ***
Comment 6 Chris Wilson 2017-03-22 20:52:06 UTC
*** Bug 100347 has been marked as a duplicate of this bug. ***
Comment 7 Chris Wilson 2017-03-23 10:00:34 UTC
commit 5d4bac5503fcc67dd7999571e243cee49371aef7
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Mar 22 20:59:30 2017 +0000

    drm/i915: Restore marking context objects as dirty on pinning
    
    Commit e8a9c58fcd9a ("drm/i915: Unify active context tracking between
    legacy/execlists/guc") converted the legacy intel_ringbuffer submission
    to the same context pinning mechanism as execlists - that is to pin the
    context until the subsequent request is retired. Previously it used the
    vma retirement of the context object to keep itself pinned until the
    next request (after i915_vma_move_to_active()). In the conversion, I
    missed that the vma retirement was also responsible for marking the
    object as dirty. Mark the context object as dirty when pinning
    (equivalent to execlists) which ensures that if the context is swapped
    out due to mempressure or suspend/hibernation, when it is loaded back in
    it does so with the previous state (and not all zero).
    
    Fixes: e8a9c58fcd9a ("drm/i915: Unify active context tracking between legacy/execlists/guc")
    Reported-by: Dennis Gilmore <dennis@ausil.us>
    Reported-by: Mathieu Marquer <mathieu.marquer@gmail.com>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99993
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=100181
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Cc: <drm-intel-fixes@lists.freedesktop.org> # v4.11-rc1
    Link: http://patchwork.freedesktop.org/patch/msgid/20170322205930.12762-1-chris@chris-wilson.co.uk
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
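
The reasoning in the commit message can be illustrated with a toy model. This is only a sketch and not i915 code: the class ContextObject, its methods and the roundtrip helper are all hypothetical. It assumes a simplified swap model in which eviction writes pages back only when the object is flagged dirty, and in which the GPU writes context state without the CPU-side bookkeeping noticing.

```python
class ContextObject:
    """Toy stand-in for a context image whose pages can be swapped out."""

    def __init__(self, size):
        self.backing = bytes(size)      # contents held in swap, zero-filled
        self.pages = bytearray(size)    # resident copy, None when evicted
        self.dirty = False

    def pin(self, mark_dirty):
        # Bring the pages back in if they were evicted.
        if self.pages is None:
            self.pages = bytearray(self.backing)
        # The fix described above corresponds to mark_dirty=True.
        if mark_dirty:
            self.dirty = True

    def gpu_save_state(self, state):
        # The GPU writes the saved state directly; nothing sets the
        # dirty flag here, which is the crux of the problem.
        self.pages[:len(state)] = state

    def evict(self):
        # Write back only if dirty; clean pages are simply discarded.
        if self.dirty:
            self.backing = bytes(self.pages)
            self.dirty = False
        self.pages = None


def roundtrip(mark_dirty):
    """Pin, let the GPU save state, evict, pin again, return the contents."""
    ctx = ContextObject(4)
    ctx.pin(mark_dirty)
    ctx.gpu_save_state(b"\x01\x02\x03\x04")
    ctx.evict()
    ctx.pin(mark_dirty)
    return bytes(ctx.pages)


if __name__ == "__main__":
    print("dirty on pin:    ", roundtrip(True))
    print("not marked dirty:", roundtrip(False))
```

In this model, pinning without marking the object dirty loses the saved state across eviction and the context comes back all zero, which matches the failure the commit message describes for contexts swapped out under memory pressure or suspend/hibernation.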
Comment 8 martink 2017-03-27 09:48:49 UTC
Created attachment 130478 [details]
cat /sys/class/drm/card0/error

This is not fixed in -rc4, which includes said patch. I append /sys/class/drm/card0/error from my gpu hang under load (compiling stuff).

Again. 4.10 is clearly *good*. This is introduced in 4.11-rc1.

thanks
                  martin
Comment 9 martink 2017-03-27 09:50:39 UTC
-rc4 is *better* than rc3, so this patch clearly improved the situation. -rc4 is still worse than 4.10 as I experienced a gpu hang again.
Comment 10 martink 2017-03-27 10:06:02 UTC
Created attachment 130479 [details]
just the next example of /sys/class/drm/card0/error
Comment 11 Chris Wilson 2017-03-27 11:32:31 UTC
-rc4 doesn't include that patch.
Comment 12 Chris Wilson 2017-03-31 19:40:12 UTC
*** Bug 100521 has been marked as a duplicate of this bug. ***
