Bug 111812 - i915 0000:00:02.0: GPU HANG: ecode 9:0:0x00000000, hang on rcs0
Status: RESOLVED MOVED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: high major
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
Whiteboard: Triaged, ReadyForDev
Reported: 2019-09-25 08:49 UTC by Tom
Modified: 2019-11-29 19:34 UTC

i915 platform: CFL
i915 features: GPU hang


Attachments
dmesg log (3.94 MB, text/plain)
2019-09-26 12:22 UTC, Tom
crash dump - /sys/class/drm/card0/error (16.22 KB, text/plain)
2019-10-03 12:17 UTC, moson
dmesg, kernel log (2.28 MB, text/plain)
2019-10-03 12:18 UTC, moson
additional crash dump (16.41 KB, text/plain)
2019-10-03 19:59 UTC, csw
/sys/class/drm/card0/error (5.21 KB, text/plain)
2019-10-03 23:07 UTC, Kenneth C
/sys/class/drm/card0/error (5.19 KB, text/plain)
2019-10-03 23:07 UTC, Kenneth C
/sys/class/drm/card0/error (5.21 KB, text/plain)
2019-10-03 23:08 UTC, Kenneth C
/sys/class/drm/card0/error (5.19 KB, text/plain)
2019-10-03 23:08 UTC, Kenneth C
/sys/class/drm/card0/error (5.18 KB, text/plain)
2019-10-04 07:36 UTC, Kenneth C
Complete dmesg with crash info around 3975 (73.42 KB, text/plain)
2019-10-14 09:19 UTC, Tom
[drm] GPU crash dump saved to /sys/class/drm/card0/error (16.58 KB, text/plain)
2019-10-14 09:21 UTC, Tom
GPU HANG: ecode 9:0:0x00000000, hang on rcs0 (16.05 KB, text/plain)
2019-11-02 13:15 UTC, stoffel.010170
/sys/class/drm/card0/error (16.44 KB, text/plain)
2019-11-19 01:15 UTC, Michael

Description Tom 2019-09-25 08:49:03 UTC
I am on the latest Arch with all packages up to date. I just experienced a GUI hang with mostly terminals, Firefox and Emacs open, all on sway / Wayland. An external 4K monitor was attached via USB-C.

HW is a recent Lenovo X390.
[ 9225.720061] i915 0000:00:02.0: GPU HANG: ecode 9:0:0x00000000, hang on rcs0
[ 9225.720062] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 9225.720063] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 9225.720063] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 9225.720063] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 9225.720064] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 9225.721091] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0

/sys/class/drm/card0/error is empty, but it was created on the hang. This is what the directory looks like:

card0 l
total 0
drwxr-xr-x 9 root root    0 Sep 25 08:05 .
drwxr-xr-x 4 root root    0 Sep 25 08:05 ..
drwxr-xr-x 5 root root    0 Sep 25 08:05 card0-DP-1
drwxr-xr-x 5 root root    0 Sep 25 08:05 card0-DP-2
drwxr-xr-x 6 root root    0 Sep 25 08:05 card0-eDP-1
drwxr-xr-x 3 root root    0 Sep 25 08:05 card0-HDMI-A-1
drwxr-xr-x 3 root root    0 Sep 25 08:05 card0-HDMI-A-2
-r--r--r-- 1 root root 4.0K Sep 25 10:46 dev
lrwxrwxrwx 1 root root    0 Sep 25 08:05 device -> ../../../0000:00:02.0
-rw------- 1 root root    0 Sep 25 10:46 error
-r--r--r-- 1 root root 4.0K Sep 25 10:46 gt_act_freq_mhz
-rw-r--r-- 1 root root 4.0K Sep 25 10:46 gt_boost_freq_mhz
-r--r--r-- 1 root root 4.0K Sep 25 10:46 gt_cur_freq_mhz
-rw-r--r-- 1 root root 4.0K Sep 25 10:46 gt_max_freq_mhz
-rw-r--r-- 1 root root 4.0K Sep 25 10:46 gt_min_freq_mhz
-r--r--r-- 1 root root 4.0K Sep 25 10:46 gt_RP0_freq_mhz
-r--r--r-- 1 root root 4.0K Sep 25 10:46 gt_RP1_freq_mhz
-r--r--r-- 1 root root 4.0K Sep 25 10:46 gt_RPn_freq_mhz
drwxr-xr-x 3 root root    0 Sep 25 10:46 metrics
drwxr-xr-x 2 root root    0 Sep 25 08:05 power
lrwxrwxrwx 1 root root    0 Sep 25 08:05 subsystem -> ../../../../../class/drm
-rw-r--r-- 1 root root 4.0K Sep 25 08:05 uevent
Comment 1 Tom 2019-09-25 08:52:03 UTC
I did not experience this before updating to Kernel 5.3.
Comment 2 Lakshmi 2019-09-25 09:45:13 UTC
(In reply to Tom from comment #0)
> I am on latest Arch, all recent. Just experience a GUI-hang with mostly
> Terminals, Firefox and Emacs open. All on sway / Wayland. An external
> 4k-Monitor was attached via USB-C.
> 
> HW is a recent Lenovo X390.
> [ 9225.720061] i915 0000:00:02.0: GPU HANG: ecode 9:0:0x00000000, hang on
> rcs0
> [ 9225.720062] [drm] GPU hangs can indicate a bug anywhere in the entire gfx
> stack, including userspace.
> [ 9225.720063] [drm] Please file a _new_ bug report on bugs.freedesktop.org
> against DRI -> DRM/Intel
> [ 9225.720063] [drm] drm/i915 developers can then reassign to the right
> component if it's not a kernel issue.
> [ 9225.720063] [drm] The gpu crash dump is required to analyze gpu hangs, so
> please always attach it.
> [ 9225.720064] [drm] GPU crash dump saved to /sys/class/drm/card0/error

Can you please attach the crash dump file? 
Also, can you please attach the dmesg from boot when the issue is seen? Ensure that you set the kernel parameters drm.debug=0x1e log_buf_len=4M.
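Before rebooting to collect the log, it can help to confirm that the requested parameters are actually on the kernel command line. The following is only a minimal sketch: the helper names `parse_cmdline` and `missing_params` are hypothetical, and the naive whitespace split assumes no quoted values containing spaces.

```python
from typing import Dict, List

# Parameters requested in this comment; everything else here is illustrative.
REQUESTED = {"drm.debug": "0x1e", "log_buf_len": "4M"}


def parse_cmdline(cmdline: str) -> Dict[str, str]:
    """Split a kernel command line into key/value pairs (flags get an empty value)."""
    params: Dict[str, str] = {}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        params[key] = value
    return params


def missing_params(cmdline: str, requested: Dict[str, str] = REQUESTED) -> List[str]:
    """Return the requested 'key=value' settings that are absent or set differently."""
    params = parse_cmdline(cmdline)
    return [f"{k}={v}" for k, v in requested.items() if params.get(k) != v]
```

On a running system the input would typically be the contents of /proc/cmdline, for example `missing_params(open("/proc/cmdline").read())`; an empty list means both settings are in effect.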
Comment 3 Tom 2019-09-26 12:22:07 UTC
cat /sys/class/drm/card0/error
No error state collected

The attached dmesg is collected after a reboot with drm.debug=0x1e log_buf_len=4M set.
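Since the error file reads "No error state collected" when nothing was captured, a small check before attaching can avoid uploading an empty dump. This is a hedged sketch only: `has_error_state` and `save_error_state` are hypothetical helper names, the destination path is up to the user, and reading the file is assumed to require root (the directory listing above shows it as -rw------- root).

```python
from pathlib import Path

# Text shown in this comment when no dump is available.
NO_STATE = "No error state collected"


def has_error_state(text: str) -> bool:
    """True if the text looks like a real dump rather than empty or the placeholder."""
    stripped = text.strip()
    return bool(stripped) and not stripped.startswith(NO_STATE)


def save_error_state(src: str, dst: str) -> bool:
    """Copy the error state from src to dst; return False if nothing was collected."""
    text = Path(src).read_text(errors="replace")
    if not has_error_state(text):
        return False
    Path(dst).write_text(text)
    return True
```

A call such as `save_error_state("/sys/class/drm/card0/error", "card0_error.txt")` right after a hang would keep a copy that survives a reboot.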
Comment 4 Tom 2019-09-26 12:22:37 UTC
Created attachment 145526 [details]
dmesg log
Comment 5 CI Bug Log 2019-09-27 13:36:00 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* TGL: igt@kms_psr2_su@frontbuffer - fail - Failed assertion: result,  No matching selective update blocks read from debugfs
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6963/re-tgl1-display/igt@kms_psr2_su@frontbuffer.html
Comment 6 Lakshmi 2019-09-27 14:27:22 UTC
(In reply to CI Bug Log from comment #5)
> The CI Bug Log issue associated to this bug has been updated.
> 
> ### New filters associated
> 
> * TGL: igt@kms_psr2_su@frontbuffer - fail - Failed assertion: result,  No
> matching selective update blocks read from debugfs
>   -
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6963/re-tgl1-display/
> igt@kms_psr2_su@frontbuffer.html

Please ignore this comment.
Comment 7 Lakshmi 2019-09-30 07:07:37 UTC
(In reply to Tom from comment #4)
> Created attachment 145526 [details]
> dmesg log

The attached log doesn't contain a GPU hang. How often does the GPU hang occur? Is there any pattern that causes the hang? In any case, the error log is needed to look into this issue further. Can you check the error log if the hang occurs again?
Also, please attach the dmesg from boot.
Comment 8 Tom 2019-09-30 08:16:11 UTC
Hi Lakshmi,

the error log is empty and I don't have the dmesg from the hang any more (I think the ring buffer is deleted on every reboot??).
I have not experienced another hang, probably a neutrino :)
Please close this bug for now, I will open a new one once the machine hangs again and I can collect meaningful logs. Thanks!
Comment 9 moson 2019-10-03 12:17:33 UTC
Created attachment 145624 [details]
crash dump - /sys/class/drm/card0/error
Comment 10 moson 2019-10-03 12:18:09 UTC
Created attachment 145625 [details]
dmesg, kernel log
Comment 11 moson 2019-10-03 12:19:35 UTC
Same issue here.
I've attached the kernel log (the full log is too large, so I attached the portion where the crash occurred) and the crash dump. The error occurs at 14:05:19.
Was watching a movie for about 7 minutes (mpv, hw-decoding).

Seems it only happens with kernel 5.3

files: crash dump - /sys/class/drm/card0/error; dmesg, kernel log
Comment 12 csw 2019-10-03 19:59:12 UTC
Created attachment 145630 [details]
additional crash dump

Crash dump of the same problem here, but I'm using i3wm on Archlinux Kernel 5.3.1
Comment 13 Kenneth C 2019-10-03 23:07:27 UTC
Created attachment 145634 [details]
/sys/class/drm/card0/error

This is happening to me at least twice daily now. I have several crash dumps, will upload them all
Comment 14 Kenneth C 2019-10-03 23:07:50 UTC
Created attachment 145635 [details]
/sys/class/drm/card0/error

This is happening to me at least twice daily now. I have several crash dumps, will upload them all
Comment 15 Kenneth C 2019-10-03 23:08:09 UTC
Created attachment 145636 [details]
/sys/class/drm/card0/error

This is happening to me at least twice daily now. I have several crash dumps, will upload them all
Comment 16 Kenneth C 2019-10-03 23:08:25 UTC
Created attachment 145637 [details]
/sys/class/drm/card0/error

This is happening to me at least twice daily now. I have several crash dumps, will upload them all
Comment 17 Kenneth C 2019-10-04 07:36:31 UTC
Created attachment 145638 [details]
/sys/class/drm/card0/error

This was encouraging; I'm running the latest drm-tip and this time, it managed to recover:

----
Oct  4 00:32:54 hp-x360n kernel: [10308.045206] i915 0000:00:02.0: GPU HANG: ecode 9:1:0x00000000, hang on rcs0
Oct  4 00:32:54 hp-x360n kernel: [10308.045210] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Oct  4 00:32:54 hp-x360n kernel: [10308.045212] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Oct  4 00:32:54 hp-x360n kernel: [10308.045213] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Oct  4 00:32:54 hp-x360n kernel: [10308.045214] The GPU crash dump is required to analyze GPU hangs, so please always attach it.
Oct  4 00:32:54 hp-x360n kernel: [10308.045216] GPU crash dump saved to /sys/class/drm/card0/error
Oct  4 00:32:54 hp-x360n kernel: [10308.046223] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct  4 00:32:54 hp-x360n kernel: [10308.046988] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Oct  4 00:32:54 hp-x360n kernel: [10308.047094] i915 0000:00:02.0: Resetting chip for hang on rcs0
Oct  4 00:32:54 hp-x360n kernel: [10308.048105] [drm] GuC communication stopped
Oct  4 00:32:54 hp-x360n kernel: [10308.048847] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Oct  4 00:32:54 hp-x360n kernel: [10308.049582] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Oct  4 00:32:54 hp-x360n kernel: [10308.051162] [drm] GuC communication enabled
Oct  4 00:32:54 hp-x360n kernel: [10308.051208] i915 0000:00:02.0: GuC firmware i915/kbl_guc_33.0.0.bin version 33.0 submission:disabled
Oct  4 00:32:54 hp-x360n kernel: [10308.051212] i915 0000:00:02.0: HuC firmware i915/kbl_huc_4.0.0.bin version 4.0 authenticated:yes
Oct  4 00:33:02 hp-x360n kernel: [10316.044654] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Oct  4 00:33:10 hp-x360n kernel: [10324.044128] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
----

Error report is attached.
Comment 18 Francesco Balestrieri 2019-10-10 06:18:35 UTC
"This is happening to me at least twice daily now." - setting severity to major based on this.
Comment 19 Tom 2019-10-14 09:17:35 UTC
It happened again. Latest arch, X390 as originally stated. Will upload card0_error and dmesg now.
Comment 20 Tom 2019-10-14 09:19:54 UTC
Created attachment 145731 [details]
Complete dmesg with crash info around 3975

Crash parts:
[ 3975.559717] i915 0000:00:02.0: GPU HANG: ecode 9:0:0x00000000, hang on rcs0
[ 3975.559721] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 3975.559723] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 3975.559725] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 3975.559726] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 3975.559728] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 3975.560776] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Comment 21 Tom 2019-10-14 09:21:21 UTC
Created attachment 145732 [details]
[drm] GPU crash dump saved to /sys/class/drm/card0/error
Comment 22 stoffel.010170 2019-11-02 13:15:25 UTC
Created attachment 145877 [details]
GPU HANG: ecode 9:0:0x00000000, hang on rcs0

[Nov 2 12:38] i915 0000:00:02.0: GPU HANG: ecode 9:0:0x00000000, hang on rcs0
[  +0,000006] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including u>
[  +0,000001] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/>
[  +0,000000] [drm] drm/i915 developers can then reassign to the right component if it's not a>
[  +0,000001] [drm] The gpu crash dump is required to analyze gpu hangs, so please always atta>
[  +0,000001] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[  +0,001072] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Nov 2 14:03] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0


Fedora 31:
intel-media-driver-19.3.0-1.fc31.x86_64
libva-intel-driver-2.3.0-5.fc31.x86_64
mesa-vulkan-drivers-19.2.2-1.fc31.x86_64

Linux voyager 5.3.8-300.fc31.x86_64 #1 SMP Tue Oct 29 14:28:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Comment 23 arek.burdach 2019-11-09 11:30:40 UTC
The same situation on:
- mesa: 19.2.1
- i915 module built from 41eb27f39e60d822edc75e6aaeb416b72bc1dcf2 drm-tip (it has some other i915 fixes: https://bugzilla.kernel.org/show_bug.cgi?id=205229)

In my case it usually occurs when I use Intellij Idea, so it looks like something from openjdk stack makes it happen. Idea uses:
- openjdk: 11.0.4+10-b304.77

Any thoughts on which feature from this stack could break things? I can prepare some tests if I know what I should test...
Comment 24 Leho Kraav (:macmaN :lkraav) 2019-11-09 12:22:47 UTC
> In my case it usually occurs when I use Intellij Idea, so it looks like something from openjdk stack makes it happen. Idea uses:
> - openjdk: 11.0.4+10-b304.77

It's interesting you would point out IntelliJ IDEA, because I made the same observation at another rcs0 hang bug https://bugs.freedesktop.org/show_bug.cgi?id=111805#c37

Are these bugs potentially duplicates of each other?
Comment 25 arek.burdach 2019-11-09 12:43:22 UTC
(In reply to Leho Kraav (:macmaN :lkraav) from comment #24)
> It's interesting you would point out IntelliJ IDEA, because I made the same
> observation at another rcs0 hang bug
> https://bugs.freedesktop.org/show_bug.cgi?id=111805#c37
> 
> Are these bugs potentially duplicates of each other?

Maybe... I'm not the original reporter of this issue. I just found this issue and wrote down my experience, which produces the same dmesg output.

I see that both you and Kenneth are now testing on drm-tip. Which commit exactly did you build? As I wrote, I tested on 41eb27f39e60d822edc75e6aaeb416b72bc1dcf2 (only the i915 and drm modules), and it didn't resolve this bug.

I'll build the whole kernel from current drm-tip (e7de48a8b1161a99f4b8e4483bc1bb85f5d31039) and see what happens. I will also add drm.debug. In my case the hang is quite easy to reproduce: after about 1h of work in IntelliJ IDEA it always hangs...
Comment 26 Leho Kraav (:macmaN :lkraav) 2019-11-09 12:57:50 UTC
My current session is on:

* 41eb27f39e60 - (drm-tip/drm-tip) drm-tip: 2019y-11m-07d-17h-06m-16s UTC integration manifest (2 days ago) <Chris Wilson>

Already 40h of uptime.

On 5.4.0-rc6 vanilla, I would get the first hang usually around 3-4 days uptime. Felt like something in the stack takes its time to fill up / leak, then maybe overflow and we get a hang.
Comment 27 Michael 2019-11-19 01:15:24 UTC
Created attachment 145996 [details]
/sys/class/drm/card0/error

I have experienced the same hang a few times. (just got one scrolling this bug tracker).

* I am also running arch linux and i3wm (kernel 5.3.11.1-1).
* I am also on a relatively modern lenovo machine (X1 carbon)
* I have not been running Intellij Idea or much of anything (chrome and a terminal)
* I have not had significant uptime for my machine.
* I first noticed the issue today a couple of hours after upgrading both intel-ucode (20191113-1 -> 20191115-1) and xf86-video-intel (1:2.99.917+893+gbff5eca4-1 -> 1:2.99.917+895+gcb6bff95-1)
* The hang has occurred 6 times today (once every half-hour or hour). I hadn't noticed it before, but grepping my logs, I have seen the same log lines once on each of Oct 24, Nov 4 and Nov 13 (i.e. it had happened before very infrequently, and now it is happening frequently).
* The hang just occurred a 7th time while writing this :)
* I will attach a card error dump as well.

dmesg output:
Nov 18 20:10:21 archlinux kernel: i915 0000:00:02.0: GPU HANG: ecode 9:0:0x00000000, hang on rcs0
Nov 18 20:10:21 archlinux kernel: [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Nov 18 20:10:21 archlinux kernel: [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Nov 18 20:10:21 archlinux kernel: [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Nov 18 20:10:21 archlinux kernel: [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
Nov 18 20:10:21 archlinux kernel: [drm] GPU crash dump saved to /sys/class/drm/card0/error
Nov 18 20:10:21 archlinux kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0

Subsequent log lines repeat the last one:
Nov 19 01:04:40 archlinux kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0

Can I provide or generate any more useful information?
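Grepping logs for past hangs, as described in the list above, could be sketched roughly as follows. This is an illustration only: the function name `hangs_per_day` and the regular expression are assumptions based on the syslog-style lines quoted in this thread, and other log formats (for example raw dmesg timestamps) would need a different pattern.

```python
import re
from collections import Counter
from typing import Iterable

# Matches lines such as:
#   "Nov 18 20:10:21 archlinux kernel: i915 0000:00:02.0: GPU HANG: ecode ..."
HANG_RE = re.compile(
    r"^(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+\d{2}:\d{2}:\d{2}\s+\S+\s+"
    r"kernel:.*GPU HANG: ecode"
)


def hangs_per_day(lines: Iterable[str]) -> Counter:
    """Count 'GPU HANG: ecode' lines per calendar day (keyed as 'Mon D')."""
    counts: Counter = Counter()
    for line in lines:
        match = HANG_RE.match(line)
        if match:
            counts[f"{match.group('month')} {int(match.group('day'))}"] += 1
    return counts
```

Feeding it the lines of a saved kernel log gives one count per day, which makes a change in frequency (such as once per day versus several times per day) easy to report.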
Comment 28 Michael 2019-11-19 01:31:56 UTC
Also, I am wondering if this bug is a duplicate of https://bugs.freedesktop.org/show_bug.cgi?id=111970


(sorry if there is a process for marking something as a potential duplicate, I'm new to this bug tracker).
Comment 29 Lakshmi 2019-11-20 08:47:23 UTC
(In reply to Michael from comment #27)
> Created attachment 145996 [details]
> /sys/class/drm/card0/error
> 
> I have experienced the same hang a few times. (just got one scrolling this
> bug tracker).
> 
> * I am also running arch linux and i3wm (kernel 5.3.11.1-1).
> * I am also on a relatively modern lenovo machine (X1 carbon)
> * I have not been running Intellij Idea or much of anything (chrome and a
> terminal)
> * I have not had significant uptime for my machine.
> * I first noticed the issue today a couple of hours after upgrading both
> intel-ucode (20191113-1 -> 20191115-1) and xf86-video-intel
> (1:2.99.917+893+gbff5eca4-1 -> 1:2.99.917+895+gcb6bff95-1)
> * The hang has occured 6 times today (once every half-hour or hour). I
> hadn't noticed it before, but grepping my logs, I have seen the same log
> lines once on each of the days of Oct 24, Nov 4th, Nov 13th. (i.e. it had
> happened before very infrequently, now it is happening frequently.)
> * The hang just occured a 7th time while writing this :)
> * I will attach a card error dump as well.
> 
> dmseg output:
> Nov 18 20:10:21 archlinux kernel: i915 0000:00:02.0: GPU HANG: ecode
> 9:0:0x00000000, hang on rcs0
> Nov 18 20:10:21 archlinux kernel: [drm] GPU hangs can indicate a bug
> anywhere in the entire gfx stack, including userspace.
> Nov 18 20:10:21 archlinux kernel: [drm] Please file a _new_ bug report on
> bugs.freedesktop.org against DRI -> DRM/Intel
> Nov 18 20:10:21 archlinux kernel: [drm] drm/i915 developers can then
> reassign to the right component if it's not a kernel issue.
> Nov 18 20:10:21 archlinux kernel: [drm] The gpu crash dump is required to
> analyze gpu hangs, so please always attach it.
> Nov 18 20:10:21 archlinux kernel: [drm] GPU crash dump saved to
> /sys/class/drm/card0/error
> Nov 18 20:10:21 archlinux kernel: i915 0000:00:02.0: Resetting rcs0 for hang
> on rcs0
> 
> Subsequent log lines repeat the last one:
> Nov 19 01:04:40 archlinux kernel: i915 0000:00:02.0: Resetting rcs0 for hang
> on rcs0
> 
> Can I provide or generate any more useful information?

Michael, this issue is on KBL whereas the original issue is on CFL, so this is likely a different issue. Can you please reproduce it with drm-tip (https://cgit.freedesktop.org/drm-tip)?
Please report a new issue if it is seen on drm-tip.

@Tom, are you still able to reproduce the issue with some information in the crash dump file? I would recommend verifying the issue with drm-tip.
Comment 30 Martin Peres 2019-11-29 19:34:55 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/451.