Bug 111812 - i915 0000:00:02.0: GPU HANG: ecode 9:0:0x00000000, hang on rcs0
Summary: i915 0000:00:02.0: GPU HANG: ecode 9:0:0x00000000, hang on rcs0
Status: NEW
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: high major
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: Triaged, ReadyForDev
Keywords:
Depends on:
Blocks:
 
Reported: 2019-09-25 08:49 UTC by Tom
Modified: 2019-10-15 06:50 UTC
CC List: 2 users

See Also:
i915 platform: CFL
i915 features: GPU hang


Attachments
dmesg log (3.94 MB, text/plain), 2019-09-26 12:22 UTC, Tom
crash dump - /sys/class/drm/card0/error (16.22 KB, text/plain), 2019-10-03 12:17 UTC, mo-son
dmesg, kernel log (2.28 MB, text/plain), 2019-10-03 12:18 UTC, mo-son
additional crash dump (16.41 KB, text/plain), 2019-10-03 19:59 UTC, csw
/sys/class/drm/card0/error (5.21 KB, text/plain), 2019-10-03 23:07 UTC, Kenneth C
/sys/class/drm/card0/error (5.19 KB, text/plain), 2019-10-03 23:07 UTC, Kenneth C
/sys/class/drm/card0/error (5.21 KB, text/plain), 2019-10-03 23:08 UTC, Kenneth C
/sys/class/drm/card0/error (5.19 KB, text/plain), 2019-10-03 23:08 UTC, Kenneth C
/sys/class/drm/card0/error (5.18 KB, text/plain), 2019-10-04 07:36 UTC, Kenneth C
Complete dmesg with crash info around 3975 (73.42 KB, text/plain), 2019-10-14 09:19 UTC, Tom
[drm] GPU crash dump saved to /sys/class/drm/card0/error (16.58 KB, text/plain), 2019-10-14 09:21 UTC, Tom

Description Tom 2019-09-25 08:49:03 UTC
I am on the latest Arch with all packages up to date. I just experienced a GUI hang with mostly terminals, Firefox and Emacs open, all running on sway / Wayland. An external 4K monitor was attached via USB-C.

HW is a recent Lenovo X390.
[ 9225.720061] i915 0000:00:02.0: GPU HANG: ecode 9:0:0x00000000, hang on rcs0
[ 9225.720062] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 9225.720063] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 9225.720063] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 9225.720063] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 9225.720064] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 9225.721091] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0

/sys/class/drm/card0/error is empty but was created on hang. This is what the directory looks like:

card0 l
total 0
drwxr-xr-x 9 root root    0 Sep 25 08:05 .
drwxr-xr-x 4 root root    0 Sep 25 08:05 ..
drwxr-xr-x 5 root root    0 Sep 25 08:05 card0-DP-1
drwxr-xr-x 5 root root    0 Sep 25 08:05 card0-DP-2
drwxr-xr-x 6 root root    0 Sep 25 08:05 card0-eDP-1
drwxr-xr-x 3 root root    0 Sep 25 08:05 card0-HDMI-A-1
drwxr-xr-x 3 root root    0 Sep 25 08:05 card0-HDMI-A-2
-r--r--r-- 1 root root 4.0K Sep 25 10:46 dev
lrwxrwxrwx 1 root root    0 Sep 25 08:05 device -> ../../../0000:00:02.0
-rw------- 1 root root    0 Sep 25 10:46 error
-r--r--r-- 1 root root 4.0K Sep 25 10:46 gt_act_freq_mhz
-rw-r--r-- 1 root root 4.0K Sep 25 10:46 gt_boost_freq_mhz
-r--r--r-- 1 root root 4.0K Sep 25 10:46 gt_cur_freq_mhz
-rw-r--r-- 1 root root 4.0K Sep 25 10:46 gt_max_freq_mhz
-rw-r--r-- 1 root root 4.0K Sep 25 10:46 gt_min_freq_mhz
-r--r--r-- 1 root root 4.0K Sep 25 10:46 gt_RP0_freq_mhz
-r--r--r-- 1 root root 4.0K Sep 25 10:46 gt_RP1_freq_mhz
-r--r--r-- 1 root root 4.0K Sep 25 10:46 gt_RPn_freq_mhz
drwxr-xr-x 3 root root    0 Sep 25 10:46 metrics
drwxr-xr-x 2 root root    0 Sep 25 08:05 power
lrwxrwxrwx 1 root root    0 Sep 25 08:05 subsystem -> ../../../../../class/drm
-rw-r--r-- 1 root root 4.0K Sep 25 08:05 uevent
Comment 1 Tom 2019-09-25 08:52:03 UTC
I did not experience this before updating to Kernel 5.3.
Comment 2 Lakshmi 2019-09-25 09:45:13 UTC
(In reply to Tom from comment #0)
> I am on latest Arch, all recent. Just experience a GUI-hang with mostly
> Terminals, Firefox and Emacs open. All on sway / Wayland. An external
> 4k-Monitor was attached via USB-C.
> 
> HW is a recent Lenovo X390.
> [ 9225.720061] i915 0000:00:02.0: GPU HANG: ecode 9:0:0x00000000, hang on
> rcs0
> [ 9225.720062] [drm] GPU hangs can indicate a bug anywhere in the entire gfx
> stack, including userspace.
> [ 9225.720063] [drm] Please file a _new_ bug report on bugs.freedesktop.org
> against DRI -> DRM/Intel
> [ 9225.720063] [drm] drm/i915 developers can then reassign to the right
> component if it's not a kernel issue.
> [ 9225.720063] [drm] The gpu crash dump is required to analyze gpu hangs, so
> please always attach it.
> [ 9225.720064] [drm] GPU crash dump saved to /sys/class/drm/card0/error

Can you please attach the crash dump file? 
Also, can you please attach the dmesg from boot when the issue is seen? Ensure that you set the kernel parameters drm.debug=0x1e log_buf_len=4M.
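
Before reproducing the hang, it may help to confirm that the requested parameters actually reached the running kernel. The following is only a minimal Python sketch, assuming the parameters appear verbatim in /proc/cmdline; the helper names missing_params and running_kernel_cmdline are made up for illustration.

```python
from pathlib import Path

# Parameters requested in this comment.
REQUIRED = ("drm.debug=0x1e", "log_buf_len=4M")


def missing_params(cmdline: str, required=REQUIRED) -> list:
    """Return the required parameters that are absent from a kernel command line."""
    present = set(cmdline.split())
    return [param for param in required if param not in present]


def running_kernel_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line of the running kernel (Linux only)."""
    return Path(path).read_text()
```

Calling missing_params(running_kernel_cmdline()) should return an empty list when both parameters are set.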
Comment 3 Tom 2019-09-26 12:22:07 UTC
cat /sys/class/drm/card0/error
No error state collected

The attached dmesg was collected after a reboot with drm.debug=0x1e log_buf_len=4M set.
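
Since the error file can exist and still hold no dump, a small check can tell the two cases apart before anything is attached. This is a hedged Python sketch: the function name read_error_state is hypothetical, the marker string is the one printed above, and reading the real file is assumed to need root because the listing in the description shows it as readable by root only.

```python
from pathlib import Path
from typing import Optional

# Message printed by the file when no dump was captured (seen in this comment).
EMPTY_MARKER = "No error state collected"


def read_error_state(path: str = "/sys/class/drm/card0/error") -> Optional[str]:
    """Return the crash dump text, or None if nothing was captured."""
    text = Path(path).read_text(errors="replace")
    if not text.strip() or text.startswith(EMPTY_MARKER):
        return None
    return text
```

A None result means there is nothing useful to attach yet; any other result can be written to a regular file and uploaded.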
Comment 4 Tom 2019-09-26 12:22:37 UTC
Created attachment 145526 [details]
dmesg log
Comment 5 CI Bug Log 2019-09-27 13:36:00 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* TGL: igt@kms_psr2_su@frontbuffer - fail - Failed assertion: result,  No matching selective update blocks read from debugfs
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6963/re-tgl1-display/igt@kms_psr2_su@frontbuffer.html
Comment 6 Lakshmi 2019-09-27 14:27:22 UTC
(In reply to CI Bug Log from comment #5)
> The CI Bug Log issue associated to this bug has been updated.
> 
> ### New filters associated
> 
> * TGL: igt@kms_psr2_su@frontbuffer - fail - Failed assertion: result,  No
> matching selective update blocks read from debugfs
>   -
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6963/re-tgl1-display/
> igt@kms_psr2_su@frontbuffer.html

Please ignore this comment.
Comment 7 Lakshmi 2019-09-30 07:07:37 UTC
(In reply to Tom from comment #4)
> Created attachment 145526 [details]
> dmesg log

The attached log doesn't contain a GPU hang. How often does the GPU hang occur? Is there any pattern that triggers it? In any case, the error log is needed to look into this issue further. Can you check the error log if the hang occurs again?
Also, please attach the dmesg from boot.
Comment 8 Tom 2019-09-30 08:16:11 UTC
Hi Lakshmi,

The error log is empty and I no longer have the dmesg from the hang (I think the ring buffer is cleared on every reboot?).
I have not experienced another hang, probably a neutrino :)
Please close this bug for now; I will open a new one once the machine hangs again and I can collect meaningful logs. Thanks!
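
On systems where the systemd journal is stored persistently, kernel messages from an earlier boot may still be available even though the dmesg ring buffer is gone. The sketch below is an assumption about such a setup, not something confirmed in this report: it builds a journalctl call for the previous boot and keeps only the lines around each "GPU HANG" message. The function names are hypothetical.

```python
import subprocess

# Assumed command: kernel messages (-k) from the previous boot (-b -1).
JOURNALCTL_PREVIOUS_BOOT = ["journalctl", "-k", "-b", "-1", "--no-pager"]


def previous_boot_kernel_log() -> str:
    """Return the previous boot's kernel log; raises if journalctl fails."""
    result = subprocess.run(
        JOURNALCTL_PREVIOUS_BOOT, capture_output=True, text=True, check=True
    )
    return result.stdout


def hang_context(log_text: str, before: int = 5, after: int = 20) -> list:
    """Return the lines surrounding each 'GPU HANG' message, without duplicates."""
    lines = log_text.splitlines()
    keep = set()
    for index, line in enumerate(lines):
        if "GPU HANG" in line:
            start = max(0, index - before)
            stop = min(len(lines), index + after + 1)
            keep.update(range(start, stop))
    return [lines[i] for i in sorted(keep)]
```

For example, hang_context(previous_boot_kernel_log()) would give a short excerpt that fits within the attachment size limit, provided the journal still holds that boot.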
Comment 9 mo-son 2019-10-03 12:17:33 UTC
Created attachment 145624 [details]
crash dump - /sys/class/drm/card0/error
Comment 10 mo-son 2019-10-03 12:18:09 UTC
Created attachment 145625 [details]
dmesg, kernel log
Comment 11 mo-son 2019-10-03 12:19:35 UTC
Same issue here.
I've attached the kernel log (the full log is too large, so I attached the portion where the crash occurred) and the crash dump. The error occurs at 14:05:19.
I was watching a movie for about 7 minutes (mpv, hw-decoding).

It seems to happen only with kernel 5.3.

files: crash dump - /sys/class/drm/card0/error; dmesg, kernel log
Comment 12 csw 2019-10-03 19:59:12 UTC
Created attachment 145630 [details]
additional crash dump

Crash dump of the same problem here, but I'm using i3wm on Arch Linux with kernel 5.3.1.
Comment 13 Kenneth C 2019-10-03 23:07:27 UTC
Created attachment 145634 [details]
/sys/class/drm/card0/error

This is happening to me at least twice daily now. I have several crash dumps and will upload them all.
Comment 14 Kenneth C 2019-10-03 23:07:50 UTC
Created attachment 145635 [details]
/sys/class/drm/card0/error

Comment 15 Kenneth C 2019-10-03 23:08:09 UTC
Created attachment 145636 [details]
/sys/class/drm/card0/error

Comment 16 Kenneth C 2019-10-03 23:08:25 UTC
Created attachment 145637 [details]
/sys/class/drm/card0/error

Comment 17 Kenneth C 2019-10-04 07:36:31 UTC
Created attachment 145638 [details]
/sys/class/drm/card0/error

This was encouraging; I'm running the latest drm-tip and this time, it managed to recover:

----
Oct  4 00:32:54 hp-x360n kernel: [10308.045206] i915 0000:00:02.0: GPU HANG: ecode 9:1:0x00000000, hang on rcs0
Oct  4 00:32:54 hp-x360n kernel: [10308.045210] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Oct  4 00:32:54 hp-x360n kernel: [10308.045212] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Oct  4 00:32:54 hp-x360n kernel: [10308.045213] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Oct  4 00:32:54 hp-x360n kernel: [10308.045214] The GPU crash dump is required to analyze GPU hangs, so please always attach it.
Oct  4 00:32:54 hp-x360n kernel: [10308.045216] GPU crash dump saved to /sys/class/drm/card0/error
Oct  4 00:32:54 hp-x360n kernel: [10308.046223] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct  4 00:32:54 hp-x360n kernel: [10308.046988] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Oct  4 00:32:54 hp-x360n kernel: [10308.047094] i915 0000:00:02.0: Resetting chip for hang on rcs0
Oct  4 00:32:54 hp-x360n kernel: [10308.048105] [drm] GuC communication stopped
Oct  4 00:32:54 hp-x360n kernel: [10308.048847] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Oct  4 00:32:54 hp-x360n kernel: [10308.049582] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Oct  4 00:32:54 hp-x360n kernel: [10308.051162] [drm] GuC communication enabled
Oct  4 00:32:54 hp-x360n kernel: [10308.051208] i915 0000:00:02.0: GuC firmware i915/kbl_guc_33.0.0.bin version 33.0 submission:disabled
Oct  4 00:32:54 hp-x360n kernel: [10308.051212] i915 0000:00:02.0: HuC firmware i915/kbl_huc_4.0.0.bin version 4.0 authenticated:yes
Oct  4 00:33:02 hp-x360n kernel: [10316.044654] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct  4 00:33:10 hp-x360n kernel: [10324.044128] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
----

Error report is attached.
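
The log above shows an engine reset, several timed-out reset requests, and then a full chip reset. When comparing several reports, tallying these events can make the differences easier to see. A minimal Python sketch, assuming the message wording stays as it appears in this log; the labels and the function name summarize_resets are illustrative only.

```python
import re
from collections import Counter

# Patterns copied from the messages in this comment, mapped to short labels.
PATTERNS = {
    "hang": re.compile(r"GPU HANG: ecode"),
    "engine_reset": re.compile(r"Resetting (?!chip)\w+ for hang on"),
    "chip_reset": re.compile(r"Resetting chip for hang on"),
    "reset_timeout": re.compile(r"reset request timed out"),
}


def summarize_resets(log_text: str) -> dict:
    """Count hang and reset related messages in a kernel log excerpt."""
    counts = Counter()
    for line in log_text.splitlines():
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    # Report every label, including those that never matched.
    return {label: counts[label] for label in PATTERNS}
```

Running this over each attached dmesg excerpt gives one small dictionary per report that can be compared side by side.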
Comment 18 Francesco Balestrieri 2019-10-10 06:18:35 UTC
"This is happening to me at least twice daily now." - setting severity to major based on this.
Comment 19 Tom 2019-10-14 09:17:35 UTC
It happened again, on the latest Arch and the X390 as originally stated. I will upload the card0 error file and dmesg now.
Comment 20 Tom 2019-10-14 09:19:54 UTC
Created attachment 145731 [details]
Complete dmesg with crash info around 3975

Crash parts:
[ 3975.559717] i915 0000:00:02.0: GPU HANG: ecode 9:0:0x00000000, hang on rcs0
[ 3975.559721] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 3975.559723] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 3975.559725] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 3975.559726] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 3975.559728] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 3975.560776] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
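
The hang lines in this report differ only in timestamp and in the ecode value (9:0 here, 9:1 in comment 17), so extracting those fields may help when grouping reports. The following Python sketch is based solely on the line format visible in this thread and treats the ecode as an opaque string; parse_hang is a hypothetical helper name.

```python
import re
from typing import Optional

# Matches lines such as:
# [ 3975.559717] i915 0000:00:02.0: GPU HANG: ecode 9:0:0x00000000, hang on rcs0
HANG_RE = re.compile(
    r"\[\s*(?P<timestamp>\d+\.\d+)\]\s+i915 (?P<device>\S+): "
    r"GPU HANG: ecode (?P<ecode>\S+), hang on (?P<engine>\w+)"
)


def parse_hang(line: str) -> Optional[dict]:
    """Extract timestamp, device, ecode and engine from a GPU HANG line."""
    match = HANG_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["timestamp"] = float(fields["timestamp"])
    return fields
```

Because the pattern is searched rather than anchored, syslog-prefixed lines like those in comment 17 are handled as well; lines that wrap in quoted replies are not.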
Comment 21 Tom 2019-10-14 09:21:21 UTC
Created attachment 145732 [details]
[drm] GPU crash dump saved to /sys/class/drm/card0/error

