Bug 98288

Summary: linux 4.9-r1: gpu hangs after hibernation
Product: DRI
Reporter: Martin Ziegler <ziegler>
Component: DRM/Intel
Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: CLOSED DUPLICATE
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: blocker
Priority: highest
CC: intel-gfx-bugs, ziegler
Version: XOrg git
Keywords: regression
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard:
i915 platform: BDW
i915 features: display/DP
Attachments:
Description                              Flags
..config of my linux kernel              none
content of /sys/class/drm/card0/error    none
kernel log with drm.debug=0x1e           none

Description Martin Ziegler 2016-10-17 10:37:53 UTC
If I hibernate from the console, thaw, and switch between the console
and X, the GPU hangs. The machine becomes unresponsive except
for the power button.

    21:12:16  kernel: [drm] GPU HANG: ecode 8:0:0x0f71ffff, in Xorg [2047], reason: Hang on render ring, action: reset
    21:12:16  kernel: [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
    21:12:16  kernel: [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
    21:12:16  kernel: [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
    21:12:16  kernel: [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
    21:12:16  kernel: [drm] GPU crash dump saved to /sys/class/drm/card0/error
    21:12:16  kernel: drm/i915: Resetting chip after gpu hang
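
When triaging many logs like the one above, it can help to pull the fields out of the "GPU HANG" line mechanically. The following is a minimal sketch, not part of any i915 tooling: the function name `parse_gpu_hang` and the field names are hypothetical, and reading the three ecode parts as hardware generation, engine id and a hash is an assumption based on how the line looks here (8 for Broadwell, 0 for the render ring).

```python
import re

# Pattern for the "GPU HANG" line as it appears in the log above.
# The meaning given to the three ecode parts is an assumption.
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<gen>\d+):(?P<ring>\d+):(?P<ecode>0x[0-9a-fA-F]+), "
    r"in (?P<process>\S+) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>.+?), action: (?P<action>\w+)"
)


def parse_gpu_hang(line):
    """Return a dict of fields from a GPU HANG log line, or None if absent."""
    match = HANG_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric fields; ecode is kept as an integer.
    fields["gen"] = int(fields["gen"])
    fields["ring"] = int(fields["ring"])
    fields["pid"] = int(fields["pid"])
    fields["ecode"] = int(fields["ecode"], 16)
    return fields


line = ("21:12:16  kernel: [drm] GPU HANG: ecode 8:0:0x0f71ffff, "
        "in Xorg [2047], reason: Hang on render ring, action: reset")
print(parse_gpu_hang(line))
```

Lines that do not contain a hang report simply yield `None`, so the function can be mapped over a whole kernel log.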

I tried to bisect, and found

  commit 068715b922a6f87c454cdfa15bb8049d2076eee6
  Author: Chris Wilson <chris@chris-wilson.co.uk>
  Date:   Thu Aug 18 17:17:11 2016 +0100

    drm/i915/cmdparser: Add the TIMESTAMP register for the other
    engines

as the first bad commit, i.e. the first with a GPU hang after hibernation.
This seems spurious: reverting this commit did not help.

This is a Lenovo T450s with Broadwell-U Integrated Graphics. I attach
my .config and /sys/class/drm/card0/error.

    Regards,
    Martin
Comment 1 Martin Ziegler 2016-10-17 10:39:31 UTC
Created attachment 127348 [details]
..config of my linux kernel
Comment 2 Martin Ziegler 2016-10-17 10:43:40 UTC
Created attachment 127349 [details]
content of /sys/class/drm/card0/error
Comment 3 yann 2016-10-17 14:41:55 UTC
Can you attach your kernel log as well? Please add "drm.debug=0x1e log_buf_len=1M" to your boot command line.
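
For readers wondering what the requested drm.debug=0x1e turns on: the value is a bitmask of debug categories. Below is a small sketch that decodes it. The bit-to-name table is an assumption taken from the DRM_UT_* debug categories in kernels of roughly this era; check the kernel source for the version in use before relying on it. The helper name `decode_drm_debug` is made up for illustration.

```python
# Debug category bits. Assumed to match the DRM_UT_* values in kernels
# around 4.9; later kernels may define more or different categories.
DRM_DEBUG_BITS = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "PRIME",
    0x10: "ATOMIC",
    0x20: "VBL",
}


def decode_drm_debug(mask):
    """Return the names of the debug categories enabled by the mask."""
    return [name for bit, name in sorted(DRM_DEBUG_BITS.items()) if mask & bit]


print(decode_drm_debug(0x1E))  # ['DRIVER', 'KMS', 'PRIME', 'ATOMIC']
```

Under that assumption, 0x1e enables driver, KMS, PRIME and atomic messages but leaves out the core and vblank categories.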
Comment 4 Martin Ziegler 2016-10-17 16:22:02 UTC
The website did not let me add an attachment. You can find it on my homepage:

http://home.mathematik.uni-freiburg.de/ziegler/kern_log
Comment 5 Martin Ziegler 2016-10-17 16:23:37 UTC
The kernel log contains two runs of linux-4.9-rc1

In the first run I hibernated at 17:13:09 and again at 17:14:13,
but could not trigger the gpu hang.

The second run started at 17:16:38, hibernation at 17:17:15 and
the gpu hang at 17:17:49
Comment 6 Martin Ziegler 2016-10-17 16:27:25 UTC
Created attachment 127364 [details]
kernel log with drm.debug=0x1e
Comment 7 Martin Ziegler 2016-10-17 16:28:23 UTC
the kernel log is attached now.
Comment 8 yann 2016-10-18 16:05:08 UTC
(In reply to Martin Ziegler from comment #7)
> the kernel log is attached now.

Thanks, Martin.

It looks like, before the GPU hang happens, there are many warning messages that seem linked to DP link training (?) in intel_dp_aux_transfer (triggered from i915_hotplug_work_func): WARN_ON(!msg->buffer != !msg->size). This is the same as bug 98304 and bug 97344.
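
The WARN_ON condition quoted above is a compact C idiom: it fires when exactly one of "buffer is NULL" and "size is zero" holds, that is, when the message claims data without a buffer or supplies a buffer with no data. A minimal Python restatement of that truth table follows; the function name is hypothetical and `None` stands in for a NULL pointer.

```python
def aux_msg_is_inconsistent(buffer, size):
    """Mirror of the C condition !msg->buffer != !msg->size.

    buffer: bytes-like object, or None to model a NULL pointer.
    size:   number of bytes the message claims to carry.
    """
    buffer_missing = buffer is None
    size_is_zero = size == 0
    # True only when the two flags disagree.
    return buffer_missing != size_is_zero


# No buffer and no data, or a buffer with data: consistent.
print(aux_msg_is_inconsistent(None, 0))
print(aux_msg_is_inconsistent(b"\x00\x01", 2))
# A size without a buffer, or a buffer without a size: would warn.
print(aux_msg_is_inconsistent(None, 4))
print(aux_msg_is_inconsistent(b"\x00\x01", 0))
```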

*** This bug has been marked as a duplicate of bug 97344 ***
Comment 9 Martin Steigerwald 2016-12-01 08:23:33 UTC
According to Jani in https://bugs.freedesktop.org/show_bug.cgi?id=97344#c10 the GPU hang is unrelated to bug #97344.

I can confirm a GPU hang on resume from hibernation for 4.9-rc7 plus some mini merges by Linus, compiled yesterday. I am back on 4.8 again, as I cannot afford an unstable kernel at the moment.

I also had a GPU hang with PlaneShift with a slightly older kernel (4.9-rc4 + drm-intel-fixes), but as instructed by the kernel log, I reported this as a new bug, #98922. Yet it may be related to this one and various others such as #98794, #98860, #98891.
Comment 10 Chris Wilson 2016-12-01 08:41:58 UTC
The RING_HEAD (loaded from the context) is at the old tail pointer (minus the WA_TAIL) which was lost over suspend (due to stolen memory being reused). We wrote the next request after the WA_TAIL leaving 2 dwords of garbage in the ring.
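
The mismatch described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not the i915 implementation: the class `Ring`, its attribute names, the lack of wrap-around handling and the choice to reset everything to offset 0 on resume are all made up for illustration. Only the idea (two workaround dwords the GPU was never told about, and ring contents lost across hibernation) comes from the comment.

```python
WA_TAIL_DWORDS = 2        # workaround dwords emitted after each request
NOOP = 0x00000000         # harmless filler
GARBAGE = 0xDEADBEEF      # stands in for lost stolen-memory contents


class Ring:
    """Toy ring buffer; no wrap-around, for illustration only."""

    def __init__(self, size):
        self.dwords = [NOOP] * size
        self.sw_tail = 0  # driver bookkeeping (includes workaround dwords)
        self.hw_head = 0  # where the GPU resumes executing
        self.hw_tail = 0  # tail the GPU was told about

    def emit_request(self, payload):
        for dword in payload:
            self.dwords[self.sw_tail] = dword
            self.sw_tail += 1
        # The GPU is told about the payload but not the workaround dwords.
        self.hw_tail = self.sw_tail
        for _ in range(WA_TAIL_DWORDS):
            self.dwords[self.sw_tail] = NOOP
            self.sw_tail += 1

    def gpu_run(self):
        executed = self.dwords[self.hw_head:self.hw_tail]
        self.hw_head = self.hw_tail
        return executed

    def hibernate(self):
        # Ring contents are not saved, so they come back as garbage.
        self.dwords = [GARBAGE] * len(self.dwords)

    def reset_on_resume(self):
        # Hypothetical fix: make GPU and driver agree again.
        self.sw_tail = self.hw_head = self.hw_tail = 0


def run_after_hibernate(apply_fix):
    ring = Ring(16)
    ring.emit_request([1, 2, 3])
    ring.gpu_run()
    ring.hibernate()
    if apply_fix:
        ring.reset_on_resume()
    ring.emit_request([4, 5])
    return ring.gpu_run()


print([hex(d) for d in run_after_hibernate(apply_fix=False)])
print([hex(d) for d in run_after_hibernate(apply_fix=True)])
```

Without the reset, the toy GPU executes two garbage dwords before the new request; with it, only the new request is executed.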
Comment 11 Chris Wilson 2016-12-01 08:42:59 UTC
commit bafb2f7d4755bf1571bd5e9a03b97f3fc4fe69ae
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Sep 21 14:51:08 2016 +0100

    drm/i915/execlists: Reset RING registers upon resume
    
    There is a disparity in the context image saved to disk and our own
    bookkeeping - that is we presume the RING_HEAD and RING_TAIL match our
    stored ce->ring->tail value. However, as we emit WA_TAIL_DWORDS into the
    ring but may not tell the GPU about them, the GPU may be lagging behind
    our bookkeeping. Upon hibernation we do not save stolen pages, presuming
    that their contents are volatile. This means that although we start
    writing into the ring at tail, the GPU starts executing from its HEAD
    and there may be some garbage in between and so the GPU promptly hangs
    upon resume.
    
    Testcase: igt/gem_exec_suspend/basic-S4
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=96526
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Link: http://patchwork.freedesktop.org/patch/msgid/20160921135108.29574-3-chris@chris-wilson.co.uk

*** This bug has been marked as a duplicate of bug 96526 ***
Comment 12 Martin Ziegler 2016-12-11 12:45:19 UTC
Update: The bug is still present in the recent kernel.

 commit 045169816b31b10faed984b01c390db1b32ee4c1
 Merge: cd66289 678b5c6
 Author: Linus Torvalds <torvalds@linux-foundation.org>
 Date:   Sat Dec 10 09:47:13 2016 -0800

Hibernation has been unusable since 4.9-rc1.
Comment 13 Jari Tahvanainen 2016-12-13 08:10:37 UTC
Closing resolved as duplicate of closed+fixed.
Comment 14 Martin Ziegler 2016-12-13 09:28:48 UTC
Chris' patch solved the problem.

Thanks.

The patch is not yet in Linus' v4.9 though
Comment 15 Martin Ziegler 2017-01-17 21:12:31 UTC
Chris's patch appeared in linux-4.10-rc1.

But it is still not in linux-4.9.4 (from Jan 15, 2017).
The GPU hang is still reproducible.
Comment 16 Jani Nikula 2017-01-18 08:41:35 UTC
Chris, is bafb2f7d4755 ("drm/i915/execlists: Reset RING registers upon resume") cc: stable material?
Comment 17 Martin Ziegler 2017-02-07 16:10:34 UTC
The patch is in 4.9.9-rc1:

https://lkml.org/lkml/2017/2/7/311
