Created attachment 133163 [details]
i915_error_state
dmesg:
[293734.793602] [drm] GPU HANG: ecode 9:0:0xeede0199, in chrome [6476], reason: Hang on render ring, action: reset
[293734.793647] drm/i915: Resetting chip after gpu hang
[293735.502144] [drm:gen8_reset_engines [i915]] *ERROR* render ring: reset request timeout
[293735.502190] [drm:i915_reset [i915]] *ERROR* Failed to reset chip: -5
i915_error_state attached.
$ uname -a
Linux tjmaciei-mobl1 4.11.8-2-default #1 SMP PREEMPT Thu Jun 29 14:37:33 UTC 2017 (42bd7a0) x86_64 x86_64 x86_64 GNU/Linux
The kernel is OpenSUSE's build, which appears to contain patch "drm/i915: Fix S4 resume breakage" to drivers/gpu/drm/i915. I don't see any other patches that affect i915.
Oops, forgot the HW description. It's an XPS 13 9350.
00:02.0 VGA compatible controller: Intel Corporation Iris Graphics 540 (rev 0a)
        Subsystem: Dell Device 0704
        Flags: bus master, fast devsel, latency 0, IRQ 131
        Memory at db000000 (64-bit, non-prefetchable) [size=16M]
        Memory at 90000000 (64-bit, prefetchable) [size=256M]
        I/O ports at f000 [size=64]
        [virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
        Capabilities: [40] Vendor Specific Information: Len=0c <?>
        Capabilities: [70] Express Root Complex Integrated Endpoint, MSI 00
        Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable- 64bit-
        Capabilities: [d0] Power Management version 2
        Capabilities: [100] Process Address Space ID (PASID)
        Capabilities: [200] Address Translation Service (ATS)
        Capabilities: [300] Page Request Interface (PRI)
        Kernel driver in use: i915
        Kernel modules: i915
model name : Intel(R) Core(TM) i7-6560U CPU @ 2.20GHz
Mesa-17.1.5 (though chrome may have been running against 17.1.4 [started before upgrade]).
It didn't respond to the write into the ELSP and so got stuck. As the reset failed, that suggests it got stuck pretty hard, and just stopped responding to mmio entirely. Oh well, the first step is to try to establish a pattern. Please do report any more odd hangs.
I think this has happened before (once). It usually happens when I plug in the USB-C dock. One other time, I had the network stack freeze up when I connected the USB-C dock. Maybe interesting: the entire UI was frozen, except for the mouse pointer. That still moved. There were no processes in state "D" in ps.
Changing to NEEDINFO while more odd hangs are reported. Thank you.
Created attachment 133414 [details]
i915_error_state 2017-08-09
Accompanying dmesg:
[470554.774098] pci 0000:01:00.0: [8086:1576] type 01 class 0x060400
...
[471250.681874] drm/i915: Resetting chip after gpu hang
[471251.390960] [drm:gen8_reset_engines [i915]] *ERROR* render ring: reset request timeout
[471251.390996] [drm:i915_reset [i915]] *ERROR* Failed to reset chip: -5
The last relevant action was the USB-C dock being plugged, 1250-554 = 696 seconds (11 minutes and 36 seconds) before the hang. Chrome was on screen again.
xrandr was:
Screen 0: minimum 8 x 8, current 7040 x 2160, maximum 32767 x 32767
eDP1 connected primary 3200x1800+0+0 (normal left inverted right x axis y axis) 290mm x 170mm
   3200x1800     59.98*+
   2560x1440     60.00
   2048x1536     60.00
   1920x1440     60.00
   1856x1392     60.01
   1792x1344     60.01
   2048x1152     60.00
   1920x1080     60.00
   1600x1200     60.00
   1400x1050     59.98
   1600x900      60.00
   1280x1024     60.02
   1280x960      60.00
   1368x768      60.00
   1280x720      60.00
   1024x768      60.00
   1024x576      60.00
   960x540       60.00
   800x600       60.32    56.25
   864x486       60.00
   640x480       59.94
   720x405       60.00
   640x360       60.00
DP1 disconnected (normal left inverted right x axis y axis)
DP1-1 connected 3840x2160+3200+0 (normal left inverted right x axis y axis) 600mm x 340mm
   3840x2160     30.00*+  25.00    24.00    29.97    23.98
   1920x1200     59.95
   1920x1080     60.00    50.00    59.94    24.00    23.98
   1600x1200     60.00
   1680x1050     59.88
   1280x1024     75.02    60.02
   1280x800      59.91
   1152x864      75.00
   1280x720      60.00    50.00    59.94
   1024x768      75.03    60.00
   800x600       75.00    60.32
   720x576       50.00
   720x480       60.00    59.94
   640x480       75.00    60.00    59.94
   720x400       70.08
DP1-2 disconnected (normal left inverted right x axis y axis)
DP1-3 disconnected (normal left inverted right x axis y axis)
DP2 disconnected (normal left inverted right x axis y axis)
HDMI1 disconnected (normal left inverted right x axis y axis)
HDMI2 disconnected (normal left inverted right x axis y axis)
VIRTUAL1 disconnected (normal left inverted right x axis y axis)
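For anyone repeating the elapsed-time arithmetic above on their own logs, here is a minimal Python sketch. It assumes the default dmesg format with a leading `[seconds.microseconds]` stamp; the helper names `stamp_seconds` and `elapsed` are made up for illustration and are not part of any tool.

```python
import re

# Leading "[seconds.microseconds]" stamp of a default-format dmesg line.
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]")


def stamp_seconds(line):
    """Return the kernel timestamp of a dmesg line, in seconds since boot."""
    match = STAMP.match(line)
    if match is None:
        raise ValueError("no dmesg timestamp in: %r" % line)
    return float(match.group(1))


def elapsed(first, second):
    """Return (minutes, seconds) between two dmesg lines, rounded to 1 s."""
    total = round(stamp_seconds(second) - stamp_seconds(first))
    return divmod(total, 60)


# The two lines quoted in this comment: dock enumeration, then the reset.
dock = "[470554.774098] pci 0000:01:00.0: [8086:1576] type 01 class 0x060400"
hang = "[471250.681874] drm/i915: Resetting chip after gpu hang"

minutes, seconds = elapsed(dock, hang)
print("%d min %d s" % (minutes, seconds))  # 11 min 36 s
```

This does not handle the relative `[ +0,803738]` stamps that appear in a later comment; those would need a different pattern.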
Created attachment 133892 [details]
card0_error
Another error state, this time the GPU hang happened immediately after resume from hibernation. There was an earlier hang a few days ago, but I didn't capture the log -- I mistook the blank screen for a failed resume, as opposed to a GPU hang. The i915_error_state file came up empty, but that is likely because I made a mistake capturing it or because the FS didn't sync properly when rebooting. Don't read too much into that.
dmesg:
[256003.269843] [drm] GPU HANG: ecode 9:0:0x8ad3e08a, in kmail [3457], reason: Hang on rcs, action: reset
[256003.269845] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[256003.269846] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[256003.269846] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[256003.269847] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[256003.269848] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[256003.269899] drm/i915: Resetting chip after gpu hang
[256003.977305] [drm:gen8_reset_engines [i915]] *ERROR* rcs: reset request timeout
[256003.977319] [drm:i915_reset [i915]] *ERROR* Failed to reset chip: -5
$ uname -a
Linux tjmaciei-mobl1 4.12.8-1-default #1 SMP PREEMPT Thu Aug 17 05:30:12 UTC 2017 (4d7933a) x86_64 x86_64 x86_64 GNU/Linux
Created attachment 134351 [details]
card0_error 2017-09-19
New crash.
[314545.822232] asynchronous wait on fence i915:X[4343]/0:5ff4fe timed out
[314547.720790] [drm] GPU HANG: ecode 9:0:0x8b0152ff, in chrome [92664], reason: Hang on rcs, action: reset
[314547.720792] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[314547.720792] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[314547.720793] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[314547.720793] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[314547.720794] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[314547.720847] drm/i915: Resetting chip after gpu hang
[314548.426677] [drm:gen8_reset_engines [i915]] *ERROR* rcs: reset request timeout
[314548.426736] [drm:i915_reset [i915]] *ERROR* Failed to reset chip: -5
This was 4.12.11. That was the last 4.12 crash, I am now running 4.13.
GPU hang on 4.13. Couldn't capture the card0 error log this time because it happened while no other networking was available. How many more hangs are necessary?
[drm] GPU HANG: ecode 9:0:0x2296f923, in kmail [56266], reason: Hang on rcs0, action: reset
[drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[drm] GPU crash dump saved to /sys/class/drm/card0/error
drm/i915: Resetting chip after gpu hang
[drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[drm:i915_reset [i915]] *ERROR* Failed to reset chip: -5
Another hang today. The error file just says "No error state collected"
[332618.574549] drm/i915: Resetting chip after gpu hang
[332619.282459] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[332619.282512] [drm:i915_reset [i915]] *ERROR* Failed to reset chip: -5
(In reply to Thiago Macieira from comment #9)
> ... How many more hangs are necessary?...
Hello Thiago, the idea is to identify a pattern: any event that triggers the hang, a hang ecode that keeps repeating, similar or identical dmesg warnings/errors/messages appearing just before the hang, etc.
So far, the first error state shows a hang on the render ring, stuck on the ELSP write (see comment #3):
  hangcheck stall: yes
  hangcheck action: dead
  hangcheck action timestamp: 4368325000, 304160 ms ago
  ELSP[0]: pid 6476, ban score 0, seqno 31:00e19461, emitted 306200ms ago, head 00000000, tail 00000078
The second hangs on the blitter ring:
  hangcheck stall: yes
  hangcheck action: active head
  hangcheck action timestamp: 4333247497, 318116380 ms ago
  ELSP[0]: pid 2065, ban score 0, seqno 1:00872687, emitted 318102344ms ago, head 00003f10, tail 00003f60
The third and fourth are on the render ring again, but these get further than the first:
  hangcheck stall: yes
  hangcheck action: dead
  hangcheck action timestamp: 4358891744, 156332 ms ago
  ELSP[0]: pid 3457, ban score 0, seqno e:00993440, emitted 159048ms ago, head 00000000, tail 00000080
  ELSP[1]: pid 2463, ban score 0, seqno 1:00993441, emitted 158976ms ago, head 00000080, tail 000000f8
  Active context: kmail[3457] user_handle 3 hw_id 14, ban score 0 guilty 0 active 0
and
  hangcheck stall: yes
  hangcheck action: dead
  hangcheck action timestamp: 4373527248, 210700 ms ago
  ELSP[0]: pid 92664, ban score 0, seqno 14:0143a658, emitted 212956ms ago, head 00000000, tail 00000080
  ELSP[1]: pid 4343, ban score 0, seqno 1:0143a659, emitted 212900ms ago, head 00000180, tail 000001f8
  Active context: chrome[92664] user_handle 3 hw_id 20, ban score 0 guilty 0 active 0
Since the hang seems to repeat, maybe it would be possible to get a dmesg with debug info. Just add drm.debug=0x1e log_buf_len=2M to the kernel command line in grub.
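To make the pattern hunt described above less manual, here is a rough Python sketch that tabulates the "GPU HANG" lines posted in this report so far. The regular expression is only an assumption based on the message format seen in these logs, and `parse_hangs` is an illustrative name, not an existing tool.

```python
import re
from collections import Counter

# Pattern inferred from the "GPU HANG" lines in this report (an assumption,
# not a documented format). search() is used so any timestamp prefix is ignored.
HANG = re.compile(
    r"GPU HANG: ecode (?P<ecode>[^,]+), in (?P<proc>\S+) \[(?P<pid>\d+)\], "
    r"reason: Hang on (?P<engine>.+?), action: (?P<action>\w+)"
)


def parse_hangs(lines):
    """Return one dict (ecode, proc, pid, engine, action) per GPU HANG line."""
    matches = (HANG.search(line) for line in lines)
    return [m.groupdict() for m in matches if m]


# The GPU HANG lines quoted in this bug report.
log = [
    "[293734.793602] [drm] GPU HANG: ecode 9:0:0xeede0199, in chrome [6476], reason: Hang on render ring, action: reset",
    "[256003.269843] [drm] GPU HANG: ecode 9:0:0x8ad3e08a, in kmail [3457], reason: Hang on rcs, action: reset",
    "[314547.720790] [drm] GPU HANG: ecode 9:0:0x8b0152ff, in chrome [92664], reason: Hang on rcs, action: reset",
    "[drm] GPU HANG: ecode 9:0:0x2296f923, in kmail [56266], reason: Hang on rcs0, action: reset",
    "[ +0,803738] [drm] GPU HANG: ecode 9:0:0x89f8e6a2, in kmail [128749], reason: Hang on rcs0, action: reset",
]

hangs = parse_hangs(log)
by_proc = Counter(h["proc"] for h in hangs)
by_engine = Counter(h["engine"] for h in hangs)
distinct_ecodes = {h["ecode"] for h in hangs}

print(by_proc)
print(by_engine)
print(len(distinct_ecodes), "distinct ecodes out of", len(hangs), "hangs")
```

On these lines no ecode repeats, so the ecode by itself does not give a pattern. The engine shows up as "render ring", "rcs" and "rcs0" depending on the kernel version; going by the analysis above, those refer to the same render engine.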
(In reply to Elizabeth from comment #11)
> (In reply to Thiago Macieira from comment #9)
> > ... How many more hangs are necessary?...
> Hello Thiago, the idea is to identify a pattern. Any event that triggers the
> hang, hangs ecode repeating constantly, similar or same dmesg
> warnings/errors/messages repeating just before the hang happening, etc.
The only pattern I've identified so far is that it happens within 2 hours of resuming from hibernation. I'll try to narrow the time down. I can tell you it's not related to the USB-C dock being plugged, since this morning it happened without the dock being in use (hadn't been for over 60 hours).
> Maybe it would be possible to get a dmesg with debug info since it seems to
> repeat. Just add drm.debug=0x1e log_buf_len=2M to the grub.
I'll do that.
Created attachment 134686 [details] card0_error 2017-10-05 Within 5 of resuming from hibernation.
Created attachment 135392 [details]
card0_error 2017-11-10
Over the past month, I've avoided doing hibernation to see if the hangs went away. I did hibernate twice in this period: once, with no ill effect. The other was today, causing a GPU hang as soon as I resumed from hibernation.
[ +0,803738] [drm] GPU HANG: ecode 9:0:0x89f8e6a2, in kmail [128749], reason: Hang on rcs0, action: reset
[ +0,000003] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ +0,000002] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ +0,000002] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ +0,000002] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ +0,000002] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ +0,000097] drm/i915: Resetting chip after gpu hang
[ +0,707253] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[ +0,000136] [drm:i915_reset [i915]] *ERROR* Failed to reset chip: -5
commit 4f0aa1fa3e3849caee450ee5d14fcc289cf16703 Author: Anusha Srivatsa <anusha.srivatsa@intel.com> Date: Thu Nov 9 10:51:43 2017 -0800 drm/i915/dmc: DMC 1.04 for Kabylake and dmc-1.04 required.
How do I verify which dmc I have? Please note: I'm on Skylake, not Kabylake.
# ls -l /lib/firmware/i915/*dmc*
lrwxrwxrwx 1 root root   19 Oct 9 10:03 /lib/firmware/i915/bxt_dmc_ver1.bin -> bxt_dmc_ver1_07.bin
-rw-r--r-- 1 root root 8380 Oct 9 10:03 /lib/firmware/i915/bxt_dmc_ver1_07.bin
lrwxrwxrwx 1 root root   19 Oct 9 10:03 /lib/firmware/i915/kbl_dmc_ver1.bin -> kbl_dmc_ver1_01.bin
-rw-r--r-- 1 root root 8616 Oct 9 10:03 /lib/firmware/i915/kbl_dmc_ver1_01.bin
lrwxrwxrwx 1 root root   19 Oct 9 10:03 /lib/firmware/i915/skl_dmc_ver1.bin -> skl_dmc_ver1_26.bin
-rw-r--r-- 1 root root 8824 Oct 9 10:03 /lib/firmware/i915/skl_dmc_ver1_23.bin
-rw-r--r-- 1 root root 8928 Oct 9 10:03 /lib/firmware/i915/skl_dmc_ver1_26.bin
Ah, from dmesg:
# dmesg | grep -i dmc
[ 3.773463] [drm] Finished loading DMC firmware i915/skl_dmc_ver1_26.bin (v1.26)
So how does KBL DMC version 1.04 help me?
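If someone wants to script the same check, here is a small Python sketch that pulls the loaded DMC firmware path and version out of dmesg text. It assumes the exact "Finished loading DMC firmware" wording shown above; `loaded_dmc` is a made-up helper name.

```python
import re

# Wording taken from the dmesg line quoted above; other kernel versions
# may phrase this message differently (assumption).
DMC = re.compile(
    r"Finished loading DMC firmware (?P<path>\S+) "
    r"\(v(?P<major>\d+)\.(?P<minor>\d+)\)"
)


def loaded_dmc(dmesg_text):
    """Return (firmware_path, (major, minor)), or None if no DMC line is found."""
    match = DMC.search(dmesg_text)
    if match is None:
        return None
    return match["path"], (int(match["major"]), int(match["minor"]))


line = "[ 3.773463] [drm] Finished loading DMC firmware i915/skl_dmc_ver1_26.bin (v1.26)"
path, version = loaded_dmc(line)
print(path, version)  # i915/skl_dmc_ver1_26.bin (1, 26)
```

Keeping the version as an integer tuple means it can be compared directly against whatever minimum version a fix turns out to need, e.g. `version >= (1, 27)`.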
(In reply to Thiago Macieira from comment #17)
> Ah, from dmesg:
>
> # dmesg | grep -i dmc
> [ 3.773463] [drm] Finished loading DMC firmware i915/skl_dmc_ver1_26.bin (v1.26)
>
> So how does KBL DMC version 1.04 help me?
Yes, SKL needs its own fix (version 1.27). Currently waiting for it to be pulled to linux-firmware.git (after which the corresponding kernel patch needs to be merged).
The pull request for the firmware is at: https://lists.freedesktop.org/archives/intel-gfx/2017-November/146326.html
The firmware is in the following branch: https://github.com/anushasr/linux-firmware/commits/SKL_DMC
The kernel patch: https://patchwork.freedesktop.org/patch/187559/
New hang in bug 104959, not related to hibernating.
I CANNOT verify that this is fixed. I've just got a new hang, though the behaviour after hanging is different. For that reason, I've opened a new bug report: Bug 106342.