Bug 101991

Summary: [drm] GPU HANG: ecode 9:0:0xeede0199, in chrome [6476], reason: Hang on render ring, action: reset
Product: DRI
Reporter: Thiago Macieira <thiago>
Component: DRM/Intel
Assignee: Thiago Macieira <thiago>
Status: CLOSED FIXED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal
Priority: medium
CC: anatol.pomozov, intel-gfx-bugs, marc
Version: unspecified
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard: ReadyForDev
i915 platform: SKL
i915 features: GPU hang
Attachments:
Description                  Flags
i915_error_state             none
i915_error_state 2017-08-09  none
card0_error                  none
card0_error 2017-09-19       none
card0_error 2017-10-05       none
card0_error 2017-11-10       none

Description Thiago Macieira 2017-07-31 19:53:54 UTC
Created attachment 133163 [details]
i915_error_state

dmesg:
[293734.793602] [drm] GPU HANG: ecode 9:0:0xeede0199, in chrome [6476], reason: Hang on render ring, action: reset
[293734.793647] drm/i915: Resetting chip after gpu hang
[293735.502144] [drm:gen8_reset_engines [i915]] *ERROR* render ring: reset request timeout
[293735.502190] [drm:i915_reset [i915]] *ERROR* Failed to reset chip: -5

i915_error_state attached.
Comment 1 Thiago Macieira 2017-07-31 20:02:50 UTC
$ uname -a
Linux tjmaciei-mobl1 4.11.8-2-default #1 SMP PREEMPT Thu Jun 29 14:37:33 UTC 2017 (42bd7a0) x86_64 x86_64 x86_64 GNU/Linux

The kernel is OpenSUSE's build, which appears to contain patch "drm/i915: Fix S4 resume breakage" to drivers/gpu/drm/i915. I don't see any other patches that affect i915.
Comment 2 Thiago Macieira 2017-07-31 20:08:02 UTC
Oops, forgot the HW description. It's an XPS 13 9350.

00:02.0 VGA compatible controller: Intel Corporation Iris Graphics 540 (rev 0a)
        Subsystem: Dell Device 0704
        Flags: bus master, fast devsel, latency 0, IRQ 131
        Memory at db000000 (64-bit, non-prefetchable) [size=16M]
        Memory at 90000000 (64-bit, prefetchable) [size=256M]
        I/O ports at f000 [size=64]
        [virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
        Capabilities: [40] Vendor Specific Information: Len=0c <?>
        Capabilities: [70] Express Root Complex Integrated Endpoint, MSI 00
        Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable- 64bit-
        Capabilities: [d0] Power Management version 2
        Capabilities: [100] Process Address Space ID (PASID)
        Capabilities: [200] Address Translation Service (ATS)
        Capabilities: [300] Page Request Interface (PRI)
        Kernel driver in use: i915
        Kernel modules: i915

model name      : Intel(R) Core(TM) i7-6560U CPU @ 2.20GHz

Mesa-17.1.5 (though chrome may have been running against 17.1.4 [started before upgrade]).
Comment 3 Chris Wilson 2017-08-01 18:56:03 UTC
It didn't respond to the write into the ELSP and so got stuck. As the reset failed, that suggests it got stuck pretty hard, and just stopped responding to mmio entirely. Oh well, first step is to try and establish a pattern. Please do report any more odd hangs.
Comment 4 Thiago Macieira 2017-08-01 19:22:47 UTC
I think this has happened before (once). It usually happens when I plug the USB-C dock. One other time, I had the network stack freeze up when I connected the USB-C.

Maybe interesting: the entire UI was frozen, except for the mouse pointer. That still moved. There were no processes in "D" in ps.
Comment 5 Elizabeth 2017-08-04 17:16:24 UTC
Changing to NEEDINFO while more odd hangs are reported. Thank you.
Comment 6 Thiago Macieira 2017-08-09 17:31:48 UTC
Created attachment 133414 [details]
i915_error_state 2017-08-09

Accompanying dmesg:

[470554.774098] pci 0000:01:00.0: [8086:1576] type 01 class 0x060400
...
[471250.681874] drm/i915: Resetting chip after gpu hang
[471251.390960] [drm:gen8_reset_engines [i915]] *ERROR* render ring: reset request timeout
[471251.390996] [drm:i915_reset [i915]] *ERROR* Failed to reset chip: -5

The last relevant action was the USB-C dock being plugged, 1250-554 = 696 seconds (11 minutes and 36 seconds) before the hang. Chrome was on screen again.
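The interval above can be reproduced from the two dmesg timestamps. A minimal sketch, assuming whole-second precision is enough (as in the arithmetic above); `elapsed_between` is a hypothetical helper name, not part of any tool:

```shell
# Sketch: seconds elapsed between two dmesg timestamps.
# elapsed_between is a hypothetical helper used only for illustration.
elapsed_between() {
    # $1 = earlier timestamp, $2 = later timestamp (seconds since boot)
    awk -v a="$1" -v b="$2" 'BEGIN {
        s = int(b) - int(a)    # compare whole seconds only, fractions dropped
        printf "%d s (%d min %d s)\n", s, int(s / 60), s % 60
    }'
}

# Dock plugged at 470554.774098, reset attempted at 471250.681874
elapsed_between 470554.774098 471250.681874
# prints: 696 s (11 min 36 s)
```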

xrandr was:

Screen 0: minimum 8 x 8, current 7040 x 2160, maximum 32767 x 32767
eDP1 connected primary 3200x1800+0+0 (normal left inverted right x axis y axis) 290mm x 170mm
   3200x1800     59.98*+
   2560x1440     60.00  
   2048x1536     60.00  
   1920x1440     60.00  
   1856x1392     60.01  
   1792x1344     60.01  
   2048x1152     60.00  
   1920x1080     60.00  
   1600x1200     60.00  
   1400x1050     59.98  
   1600x900      60.00  
   1280x1024     60.02  
   1280x960      60.00  
   1368x768      60.00  
   1280x720      60.00  
   1024x768      60.00  
   1024x576      60.00  
   960x540       60.00  
   800x600       60.32    56.25  
   864x486       60.00  
   640x480       59.94  
   720x405       60.00  
   640x360       60.00  
DP1 disconnected (normal left inverted right x axis y axis)
DP1-1 connected 3840x2160+3200+0 (normal left inverted right x axis y axis) 600mm x 340mm
   3840x2160     30.00*+  25.00    24.00    29.97    23.98  
   1920x1200     59.95  
   1920x1080     60.00    50.00    59.94    24.00    23.98  
   1600x1200     60.00  
   1680x1050     59.88  
   1280x1024     75.02    60.02  
   1280x800      59.91  
   1152x864      75.00  
   1280x720      60.00    50.00    59.94  
   1024x768      75.03    60.00  
   800x600       75.00    60.32  
   720x576       50.00  
   720x480       60.00    59.94  
   640x480       75.00    60.00    59.94  
   720x400       70.08  
DP1-2 disconnected (normal left inverted right x axis y axis)
DP1-3 disconnected (normal left inverted right x axis y axis)
DP2 disconnected (normal left inverted right x axis y axis)
HDMI1 disconnected (normal left inverted right x axis y axis)
HDMI2 disconnected (normal left inverted right x axis y axis)
VIRTUAL1 disconnected (normal left inverted right x axis y axis)
Comment 7 Thiago Macieira 2017-08-30 19:15:46 UTC
Created attachment 133892 [details]
card0_error

Another error state, this time the GPU hang happened immediately after resume from hibernation. There was an earlier hang a few days ago, but I didn't capture the log -- I mistook the blank screen for a failed resume, as opposed to a GPU hang.

The i915_error_state file came up empty, probably because I made a mistake capturing it or because the FS didn't sync properly when rebooting. Don't read too much into that.
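To reduce the chance of an empty capture, the copy can be followed by an explicit sync and a size check. This is only a sketch: `save_error_state` is a hypothetical helper, the default source path is the one the kernel prints in the log ("GPU crash dump saved to /sys/class/drm/card0/error"), and it assumes the caller has permission to read that file.

```shell
# Sketch: copy the i915 error state to a file and flush it to disk.
# save_error_state is a hypothetical helper, not an existing tool.
save_error_state() {
    src="${1:-/sys/class/drm/card0/error}"      # path taken from the kernel log message
    dest="${2:-card0_error.$(date +%Y-%m-%d)}"  # default output name is an assumption
    cat "$src" > "$dest" || return 1
    sync                                        # flush to disk before any reboot
    if [ ! -s "$dest" ]; then
        echo "warning: $dest is empty" >&2
        return 1
    fi
    echo "saved $(wc -c < "$dest") bytes to $dest"
}
```

Running `save_error_state` with no arguments right after a hang, before rebooting, would write the dump to the current directory.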

dmesg:

[256003.269843] [drm] GPU HANG: ecode 9:0:0x8ad3e08a, in kmail [3457], reason: Hang on rcs, action: reset
[256003.269845] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[256003.269846] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[256003.269846] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[256003.269847] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[256003.269848] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[256003.269899] drm/i915: Resetting chip after gpu hang
[256003.977305] [drm:gen8_reset_engines [i915]] *ERROR* rcs: reset request timeout
[256003.977319] [drm:i915_reset [i915]] *ERROR* Failed to reset chip: -5

$ uname -a
Linux tjmaciei-mobl1 4.12.8-1-default #1 SMP PREEMPT Thu Aug 17 05:30:12 UTC 2017 (4d7933a) x86_64 x86_64 x86_64 GNU/Linux
Comment 8 Thiago Macieira 2017-09-20 01:13:22 UTC
Created attachment 134351 [details]
card0_error 2017-09-19

New crash.

[314545.822232] asynchronous wait on fence i915:X[4343]/0:5ff4fe timed out
[314547.720790] [drm] GPU HANG: ecode 9:0:0x8b0152ff, in chrome [92664], reason: Hang on rcs, action: reset
[314547.720792] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[314547.720792] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[314547.720793] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[314547.720793] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[314547.720794] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[314547.720847] drm/i915: Resetting chip after gpu hang
[314548.426677] [drm:gen8_reset_engines [i915]] *ERROR* rcs: reset request timeout
[314548.426736] [drm:i915_reset [i915]] *ERROR* Failed to reset chip: -5

This was 4.12.11. That was the last 4.12 crash; I am now running 4.13.
Comment 9 Thiago Macieira 2017-09-25 17:22:33 UTC
GPU hang on 4.13. Couldn't capture the card0 error log this time because it happened while no other networking was available.

How many more hangs are necessary?

[drm] GPU HANG: ecode 9:0:0x2296f923, in kmail [56266], reason: Hang on rcs0, action: reset
[drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[drm] GPU crash dump saved to /sys/class/drm/card0/error
drm/i915: Resetting chip after gpu hang
[drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[drm:i915_reset [i915]] *ERROR* Failed to reset chip: -5
Comment 10 Thiago Macieira 2017-10-02 16:14:54 UTC
Another hang today. The error file just says "No error state collected"

[332618.574549] drm/i915: Resetting chip after gpu hang
[332619.282459] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[332619.282512] [drm:i915_reset [i915]] *ERROR* Failed to reset chip: -5
Comment 11 Elizabeth 2017-10-02 16:44:58 UTC
(In reply to Thiago Macieira from comment #9)
> ... How many more hangs are necessary?...
Hello Thiago, the idea is to identify a pattern: any event that triggers the hang, a hang ecode that keeps repeating, similar or identical dmesg warnings/errors/messages appearing just before the hang, etc.
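One possible way to look for a repeating ecode is to tally the "GPU HANG" lines in the kernel log. The following is a sketch only; `tally_hangs` is a hypothetical helper that reads log text on stdin and assumes the message format shown in the dmesg excerpts in this report.

```shell
# Sketch: count how often each (ecode, process) pair appears in a kernel log.
# tally_hangs is a hypothetical helper; it reads log text on stdin.
tally_hangs() {
    # Keep only "GPU HANG" lines, reduced to "<ecode> <process>"
    sed -n 's/.*GPU HANG: ecode \([^,]*\), in \([^ ]*\) .*/\1 \2/p' |
        sort | uniq -c | sort -rn    # most frequent pair first
}
```

It could be fed from `dmesg | tally_hangs`, or from a saved log file with `tally_hangs < saved.log`.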

So far, the first error state shows a hang on the render ring, stuck on the ELSP write (see comment #3):
  hangcheck stall: yes
  hangcheck action: dead
  hangcheck action timestamp: 4368325000, 304160 ms ago
  ELSP[0]:  pid 6476, ban score 0, seqno       31:00e19461, emitted 306200ms ago, head 00000000, tail 00000078

The second hangs on the blitter ring:
  hangcheck stall: yes
  hangcheck action: active head
  hangcheck action timestamp: 4333247497, 318116380 ms ago
  ELSP[0]:  pid 2065, ban score 0, seqno        1:00872687, emitted 318102344ms ago, head 00003f10, tail 00003f60

The third and fourth hang on the render ring again, but these reach further than the first:
  hangcheck stall: yes
  hangcheck action: dead
  hangcheck action timestamp: 4358891744, 156332 ms ago
  ELSP[0]:  pid 3457, ban score 0, seqno        e:00993440, emitted 159048ms ago, head 00000000, tail 00000080
  ELSP[1]:  pid 2463, ban score 0, seqno        1:00993441, emitted 158976ms ago, head 00000080, tail 000000f8
  Active context: kmail[3457] user_handle 3 hw_id 14, ban score 0 guilty 0 active 0

and 

  hangcheck stall: yes
  hangcheck action: dead
  hangcheck action timestamp: 4373527248, 210700 ms ago
  ELSP[0]:  pid 92664, ban score 0, seqno       14:0143a658, emitted 212956ms ago, head 00000000, tail 00000080
  ELSP[1]:  pid 4343, ban score 0, seqno        1:0143a659, emitted 212900ms ago, head 00000180, tail 000001f8
  Active context: chrome[92664] user_handle 3 hw_id 20, ban score 0 guilty 0 active 0

Maybe it would be possible to get a dmesg with debug info, since the hang seems to repeat. Just add drm.debug=0x1e log_buf_len=2M to the kernel command line in GRUB.
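After rebooting, it may be worth confirming that the parameters actually reached the kernel. Note that the kernel's log buffer size parameter is spelled log_buf_len. The sketch below assumes a GRUB2 setup where the parameters were appended to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and the configuration was regenerated; `has_debug_params` is a hypothetical helper.

```shell
# Sketch: check that the debug parameters are on the kernel command line.
# has_debug_params is a hypothetical helper. Pass a command line string,
# or let it default to the running kernel's /proc/cmdline.
has_debug_params() {
    cmdline="${1:-$(cat /proc/cmdline)}"
    for want in drm.debug=0x1e log_buf_len=2M; do
        case " $cmdline " in
            *" $want "*) ;;                          # parameter present
            *) echo "missing: $want"; return 1 ;;    # report the first missing one
        esac
    done
    echo "ok"
}
```

Calling `has_debug_params` with no arguments checks the running kernel.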
Comment 12 Thiago Macieira 2017-10-02 18:37:52 UTC
(In reply to Elizabeth from comment #11)
> (In reply to Thiago Macieira from comment #9)
> > ... How many more hangs are necessary?...
> Hello Thiago, the idea is to identify a pattern: any event that triggers the
> hang, a hang ecode that keeps repeating, similar or identical dmesg
> warnings/errors/messages appearing just before the hang, etc.

The only pattern I've identified so far is that it happens within 2 hours of resuming from hibernation. I'll try to narrow the time down.

I can tell you it's not related to the USB-C dock being plugged, since this morning it happened without the dock being in use (hadn't been for over 60 hours).

> Maybe it would be possible to get a dmesg with debug info since it seems to 
> repeat. Just add drm.debug=0x1e log_buf_len=2M to the grub.

I'll do that.
Comment 13 Thiago Macieira 2017-10-05 16:06:16 UTC
Created attachment 134686 [details]
card0_error 2017-10-05

Within 5 of resuming from hibernation.
Comment 14 Thiago Macieira 2017-11-11 08:07:09 UTC
Created attachment 135392 [details]
card0_error 2017-11-10

Over the past month, I've avoided doing hibernation to see if the hangs went away. I did hibernate twice in this period: once, with no ill effect. The other was today, causing a GPU hang as soon as I resumed from hibernation.

[  +0,803738] [drm] GPU HANG: ecode 9:0:0x89f8e6a2, in kmail [128749], reason: Hang on rcs0, action: reset
[  +0,000003] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[  +0,000002] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[  +0,000002] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[  +0,000002] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[  +0,000002] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[  +0,000097] drm/i915: Resetting chip after gpu hang
[  +0,707253] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[  +0,000136] [drm:i915_reset [i915]] *ERROR* Failed to reset chip: -5
Comment 15 Chris Wilson 2017-11-23 21:26:56 UTC
commit 4f0aa1fa3e3849caee450ee5d14fcc289cf16703
Author: Anusha Srivatsa <anusha.srivatsa@intel.com>
Date:   Thu Nov 9 10:51:43 2017 -0800

    drm/i915/dmc: DMC 1.04 for Kabylake

and dmc-1.04 is required.
Comment 16 Thiago Macieira 2017-11-23 21:47:50 UTC
How do I verify which dmc I have?

Please note: I'm on Skylake, not Kabylake.

 # ls -l /lib/firmware/i915/*dmc*
lrwxrwxrwx 1 root root   19 Oct  9 10:03 /lib/firmware/i915/bxt_dmc_ver1.bin -> bxt_dmc_ver1_07.bin
-rw-r--r-- 1 root root 8380 Oct  9 10:03 /lib/firmware/i915/bxt_dmc_ver1_07.bin
lrwxrwxrwx 1 root root   19 Oct  9 10:03 /lib/firmware/i915/kbl_dmc_ver1.bin -> kbl_dmc_ver1_01.bin
-rw-r--r-- 1 root root 8616 Oct  9 10:03 /lib/firmware/i915/kbl_dmc_ver1_01.bin
lrwxrwxrwx 1 root root   19 Oct  9 10:03 /lib/firmware/i915/skl_dmc_ver1.bin -> skl_dmc_ver1_26.bin
-rw-r--r-- 1 root root 8824 Oct  9 10:03 /lib/firmware/i915/skl_dmc_ver1_23.bin
-rw-r--r-- 1 root root 8928 Oct  9 10:03 /lib/firmware/i915/skl_dmc_ver1_26.bin
Comment 17 Thiago Macieira 2017-11-23 21:49:25 UTC
Ah, from dmesg:

# dmesg | grep -i dmc
[    3.773463] [drm] Finished loading DMC firmware i915/skl_dmc_ver1_26.bin (v1.26)

So how does KBL DMC version 1.04 help me?
Comment 18 Imre Deak 2017-11-24 10:13:27 UTC
(In reply to Thiago Macieira from comment #17)
> Ah, from dmesg:
> 
> # dmesg | grep -i dmc
> [    3.773463] [drm] Finished loading DMC firmware i915/skl_dmc_ver1_26.bin
> (v1.26)
> 
> So how does KBL DMC version 1.04 help me?

Yes, SKL needs its own fix (version 1.27). Currently waiting for it to be pulled to linux-firmware.git (after which the corresponding kernel patch needs to be merged).

The pull request for the firmware is at:
https://lists.freedesktop.org/archives/intel-gfx/2017-November/146326.html

The firmware is in the following branch:
https://github.com/anushasr/linux-firmware/commits/SKL_DMC

The kernel patch:
https://patchwork.freedesktop.org/patch/187559/
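
Once the updated firmware is installed, the loaded version could be checked against 1.27 along these lines. This is a sketch, assuming the "Finished loading DMC firmware" message format quoted above and a major.minor version string; `loaded_dmc_version` and `dmc_at_least` are hypothetical helpers.

```shell
# Sketch: read the loaded DMC version from kernel log text on stdin.
# Both function names are hypothetical helpers.
loaded_dmc_version() {
    sed -n 's/.*Finished loading DMC firmware [^ ]* (v\([0-9.]*\)).*/\1/p' | tail -n 1
}

# Succeeds if version $1 is at least version $2 (major.minor, compared numerically).
dmc_at_least() {
    awk -v have="$1" -v need="$2" 'BEGIN {
        split(have, h, "."); split(need, n, ".")
        ok = (h[1] + 0 > n[1] + 0) || (h[1] + 0 == n[1] + 0 && h[2] + 0 >= n[2] + 0)
        exit !ok
    }'
}
```

For example, `dmc_at_least "$(dmesg | loaded_dmc_version)" 1.27 && echo "DMC is new enough"` prints the message only when the loaded firmware is 1.27 or later.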
Comment 19 Thiago Macieira 2018-02-06 02:41:31 UTC
New hang in bug 104959, not related to hibernating.
Comment 20 Thiago Macieira 2018-05-02 02:36:39 UTC
I CANNOT verify that this is fixed. I've just got a new hang, though the behaviour after hanging is different.

For that reason, I've opened a new bug report: Bug 106342.
