111805 – Constant "Resetting rcs0 for hang on rcs0" and machine lockup

Status:	RESOLVED WORKSFORME

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	high major
Assignee:	Intel GFX Bugs mailing list
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:	Triaged, ReadyForDev
Keywords:

Depends on:
Blocks:

Reported:	2019-09-25 01:04 UTC by Kenneth C
Modified:	2019-12-09 06:25 UTC
CC List:	6 users

See Also:
i915 platform:	CFL
i915 features:	GPU hang

Attachments
egrep -r . /sys/kernel/debug/dri (537.58 KB, text/plain) 2019-09-25 01:04 UTC, Kenneth C
Relevant dump (again) from /var/log/syslog (10.91 KB, text/plain) 2019-09-25 01:06 UTC, Kenneth C
/sys/class/drm/card0/error (5.21 KB, text/plain) 2019-09-27 22:19 UTC, Kenneth C
/sys/class/drm/card0/error (5.19 KB, text/plain) 2019-10-01 21:53 UTC, Kenneth C
/sys/class/drm/card0/error (5.21 KB, text/plain) 2019-10-03 17:44 UTC, Kenneth C
/sys/class/drm/card0/error (5.19 KB, text/plain) 2019-10-03 19:30 UTC, Kenneth C
/sys/class/drm/card0/error (5.18 KB, text/plain) 2019-10-04 16:48 UTC, Kenneth C
/sys/class/drm/card0/error (5.15 KB, text/plain) 2019-10-11 20:05 UTC, Kenneth C
/sys/class/drm/card0/error (5.15 KB, text/plain) 2019-10-12 06:02 UTC, Kenneth C
dmesg.txt, non-debug (69.91 KB, text/plain) 2019-10-30 12:07 UTC, Leho Kraav (:macmaN :lkraav)
sys-class-drm-card0-error.txt (4.72 KB, text/plain) 2019-11-07 19:41 UTC, Leho Kraav (:macmaN :lkraav)
sys-class-drm-card0-error.txt (4.72 KB, text/plain) 2019-11-09 17:49 UTC, Leho Kraav (:macmaN :lkraav)
java_error during Resetting rcs0 for preemption time out (8.24 KB, text/x-log) 2019-11-09 19:22 UTC, arek.burdach

Description Kenneth C 2019-09-25 01:04:58 UTC

Created attachment 145504 [details]
egrep -r . /sys/kernel/debug/dri

I'm running the latest tip of Linus' tree, which incorporates the DRM/i915 changes of Thursday September 19th.

Since then, I've had my box (HP Spectre X360) lock up hard several times, usually when a secondary monitor is connected. I have pstore set up to log OOPSes, but the lockups never leave one behind, and SysRq is unresponsive, so I have to hard-power-cycle the machine.

I've tried not enabling GuC/HuC, no difference.

However, the last time it happened, I was able to get it to SysRq "S" and was able to save some of the OOPS into /var/log/syslog. Unfortunately I don't have a /sys/class/drm/card0/error file (if there's a way to trigger creation, I can add it as an additional comment).

I have attached the output of "egrep -r . /sys/kernel/debug/dri".

From the syslog, I do have this:

----
Sep 24 16:47:48 hp-x360n kernel: [ 5527.664338] i915 0000:00:02.0: GPU HANG: ecode 9:1:0x00000000, hang on rcs0
Sep 24 16:47:48 hp-x360n kernel: [ 5527.664340] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Sep 24 16:47:48 hp-x360n kernel: [ 5527.664341] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Sep 24 16:47:48 hp-x360n kernel: [ 5527.664342] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Sep 24 16:47:48 hp-x360n kernel: [ 5527.664342] The GPU crash dump is required to analyze GPU hangs, so please always attach it.
Sep 24 16:47:48 hp-x360n kernel: [ 5527.664343] GPU crash dump saved to /sys/class/drm/card0/error
Sep 24 16:47:48 hp-x360n kernel: [ 5527.665348] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Sep 24 16:47:48 hp-x360n kernel: [ 5527.666072] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Sep 24 16:47:48 hp-x360n kernel: [ 5527.675065] i915 0000:00:02.0: Resetting chip for hang on rcs0
Sep 24 16:47:48 hp-x360n kernel: [ 5527.676851] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Sep 24 16:47:48 hp-x360n kernel: [ 5527.677568] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423700] INFO: task kworker/7:1H:28983 blocked for more than 122 seconds.
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423702]       Tainted: G     U     O      5.3.0-Kenny+ #3
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423703] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423704] kworker/7:1H    D    0 28983      2 0x80004000
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423730] Workqueue: events_highpri intel_atomic_cleanup_work [i915]
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423731] Call Trace:
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423736]  ? __schedule+0x293/0x530
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423737]  schedule+0x36/0xc0
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423739]  schedule_preempt_disabled+0x11/0x20
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423741]  __mutex_lock.isra.10+0x2f0/0x4f0
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423759]  intel_cleanup_plane_fb+0x2d/0x80 [i915]
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423762]  drm_atomic_helper_cleanup_planes+0x4f/0x70
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423780]  intel_atomic_cleanup_work+0x1f/0x50 [i915]
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423783]  process_one_work+0x1fb/0x3e0
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423785]  worker_thread+0x2d/0x3d0
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423786]  kthread+0x10c/0x130
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423788]  ? process_one_work+0x3e0/0x3e0
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423789]  ? kthread_create_on_node+0x60/0x60
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423790]  ret_from_fork+0x1f/0x30
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423792] INFO: task kworker/1:0H:30982 blocked for more than 122 seconds.
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423793]       Tainted: G     U     O      5.3.0-Kenny+ #3
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423793] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423794] kworker/1:0H    D    0 30982      2 0x80004000
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423817] Workqueue: events_highpri intel_atomic_cleanup_work [i915]
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423818] Call Trace:
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423820]  ? __schedule+0x293/0x530
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423821]  schedule+0x36/0xc0
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423823]  schedule_preempt_disabled+0x11/0x20
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423824]  __mutex_lock.isra.10+0x2f0/0x4f0
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423826]  ? set_next_entity+0x98/0x1a0
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423843]  intel_cleanup_plane_fb+0x2d/0x80 [i915]
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423845]  drm_atomic_helper_cleanup_planes+0x4f/0x70
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423861]  intel_atomic_cleanup_work+0x1f/0x50 [i915]
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423863]  process_one_work+0x1fb/0x3e0
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423865]  worker_thread+0x2d/0x3d0
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423866]  kthread+0x10c/0x130
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423867]  ? process_one_work+0x3e0/0x3e0
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423868]  ? kthread_create_on_node+0x60/0x60
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423869]  ret_from_fork+0x1f/0x30
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423885] INFO: task kworker/u16:4:17890 blocked for more than 122 seconds.
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423886]       Tainted: G     U     O      5.3.0-Kenny+ #3
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423887] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423887] kworker/u16:4   D    0 17890      2 0x80004000
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423905] Workqueue: i915 __i915_gem_free_work [i915]
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423906] Call Trace:
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423908]  ? __schedule+0x293/0x530
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423909]  schedule+0x36/0xc0
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423910]  schedule_preempt_disabled+0x11/0x20
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423912]  __mutex_lock.isra.10+0x2f0/0x4f0
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423928]  ? i915_global_objects_shrink+0x20/0x20 [i915]
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423943]  __i915_gem_free_objects+0x66/0x1b0 [i915]
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423946]  process_one_work+0x1fb/0x3e0
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423947]  worker_thread+0x2d/0x3d0
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423948]  kthread+0x10c/0x130
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423950]  ? process_one_work+0x3e0/0x3e0
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423951]  ? kthread_create_on_node+0x60/0x60
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423952]  ret_from_fork+0x1f/0x30
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304145] INFO: task kworker/7:1H:28983 blocked for more than 245 seconds.
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304152]       Tainted: G     U     O      5.3.0-Kenny+ #3
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304154] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304158] kworker/7:1H    D    0 28983      2 0x80004000
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304243] Workqueue: events_highpri intel_atomic_cleanup_work [i915]
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304247] Call Trace:
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304261]  ? __schedule+0x293/0x530
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304267]  schedule+0x36/0xc0
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304273]  schedule_preempt_disabled+0x11/0x20
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304279]  __mutex_lock.isra.10+0x2f0/0x4f0
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304347]  intel_cleanup_plane_fb+0x2d/0x80 [i915]
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304356]  drm_atomic_helper_cleanup_planes+0x4f/0x70
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304420]  intel_atomic_cleanup_work+0x1f/0x50 [i915]
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304429]  process_one_work+0x1fb/0x3e0
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304435]  worker_thread+0x2d/0x3d0
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304439]  kthread+0x10c/0x130
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304444]  ? process_one_work+0x3e0/0x3e0
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304448]  ? kthread_create_on_node+0x60/0x60
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304473]  ret_from_fork+0x1f/0x30
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304478] INFO: task kworker/1:0H:30982 blocked for more than 245 seconds.
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304480]       Tainted: G     U     O      5.3.0-Kenny+ #3
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304482] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304483] kworker/1:0H    D    0 30982      2 0x80004000
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304528] Workqueue: events_highpri intel_atomic_cleanup_work [i915]
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304530] Call Trace:
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304535]  ? __schedule+0x293/0x530
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304539]  schedule+0x36/0xc0
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304543]  schedule_preempt_disabled+0x11/0x20
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304547]  __mutex_lock.isra.10+0x2f0/0x4f0
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304551]  ? set_next_entity+0x98/0x1a0
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304594]  intel_cleanup_plane_fb+0x2d/0x80 [i915]
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304599]  drm_atomic_helper_cleanup_planes+0x4f/0x70
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304640]  intel_atomic_cleanup_work+0x1f/0x50 [i915]
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304644]  process_one_work+0x1fb/0x3e0
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304648]  worker_thread+0x2d/0x3d0
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304651]  kthread+0x10c/0x130
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304654]  ? process_one_work+0x3e0/0x3e0
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304657]  ? kthread_create_on_node+0x60/0x60
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304660]  ret_from_fork+0x1f/0x30
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304680] INFO: task kworker/u16:4:17890 blocked for more than 245 seconds.
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304682]       Tainted: G     U     O      5.3.0-Kenny+ #3
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304683] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304685] kworker/u16:4   D    0 17890      2 0x80004000
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304728] Workqueue: i915 __i915_gem_free_work [i915]
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304730] Call Trace:
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304734]  ? __schedule+0x293/0x530
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304738]  schedule+0x36/0xc0
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304742]  schedule_preempt_disabled+0x11/0x20
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304746]  __mutex_lock.isra.10+0x2f0/0x4f0
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304787]  ? i915_global_objects_shrink+0x20/0x20 [i915]
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304827]  __i915_gem_free_objects+0x66/0x1b0 [i915]
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304832]  process_one_work+0x1fb/0x3e0
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304836]  worker_thread+0x2d/0x3d0
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304839]  kthread+0x10c/0x130
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304842]  ? process_one_work+0x3e0/0x3e0
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304845]  ? kthread_create_on_node+0x60/0x60
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304848]  ret_from_fork+0x1f/0x30
----
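
For anyone scripting against logs like the one above, here is a minimal sketch that pulls the fields out of the "GPU HANG" banner. It assumes the three colon-separated values after "ecode" are the hardware generation, a hex engine mask, and a 32-bit hash; the dictionary keys and the `parse_hang` name are illustrative and do not come from the kernel.

```python
import re

# Assumed layout of the i915 hang banner: "ecode <gen>:<engine mask>:<hash>".
# The field names are illustrative, not taken from the kernel source.
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<gen>\d+):(?P<engines>[0-9a-fA-F]+):0x(?P<code>[0-9a-fA-F]{8}), "
    r"hang on (?P<engine>\w+)"
)


def parse_hang(line):
    """Return the hang fields from one log line, or None if it is not a hang banner."""
    m = HANG_RE.search(line)
    if m is None:
        return None
    return {
        "gen": int(m.group("gen")),
        "engine_mask": int(m.group("engines"), 16),
        "code": int(m.group("code"), 16),
        "engine": m.group("engine"),
    }


line = ("Sep 24 16:47:48 hp-x360n kernel: [ 5527.664338] i915 0000:00:02.0: "
        "GPU HANG: ecode 9:1:0x00000000, hang on rcs0")
print(parse_hang(line))  # {'gen': 9, 'engine_mask': 1, 'code': 0, 'engine': 'rcs0'}
```

Lines that do not contain the banner return None, so the function can be run over a whole syslog to find every hang.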

Comment 1 Kenneth C 2019-09-25 01:06:32 UTC

Created attachment 145505 [details]
Relevant dump (again) from /var/log/syslog

Added the relevant syslog parts from the bug description (cleaner to see)

Comment 2 Lakshmi 2019-09-25 06:45:33 UTC

Can you please attach this file /sys/class/drm/card0/error here?
What is the impact of this issue to you? How do you recover from this situation?

Can you please verify the issue with drm-tip and provide feedback? (https://cgit.freedesktop.org/drm-tip).

@Chris, any further suggestions?

Comment 3 Chris Wilson 2019-09-25 07:40:47 UTC

It all starts with the error state.

Comment 4 Kenneth C 2019-09-25 17:38:32 UTC

(In reply to Lakshmi from comment #2)

> Can you please attach this file /sys/class/drm/card0/error here?

When it hangs (and it did when I tried to reply to this just now, BTW) I'm left without a usable display, so I can't switch to a VT to save off the error file.

It seems to be most reproducible if I have an external monitor connected.

> What is the impact of this issue to you?
> How do you recover from this situation?

It locks my system hard and I require a hard power-cycle reboot to recover.

I don't have another machine to attempt to SSH into this one, either.

Is there a mechanism to save the error file to non-volatile storage?

> Can you please verify the issue with drmtip and provide the feedback?
> (https://cgit.freedesktop.org/drm-tip).  

I'll try that next.

Comment 5 Kenneth C 2019-09-25 21:44:45 UTC

I'm running drm-tip right now (as of late-afternoon PST). 

So far, so good, but usually the best way to get something to break is to declare it "fixed", so here goes.

Comment 6 Kenneth C 2019-09-25 22:45:07 UTC

(In reply to Kenneth C from comment #5)

> ...  but usually the best way to get something to break is to declare it "fixed", so here goes.

*ugh* ... never fails:
----
Sep 25 15:35:22 hp-x360n kernel: [12908.352199] i915 0000:00:02.0: GPU HANG: ecode 9:1:0x00000000, hang on rcs0
Sep 25 15:35:22 hp-x360n kernel: [12908.352203] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Sep 25 15:35:22 hp-x360n kernel: [12908.352205] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Sep 25 15:35:22 hp-x360n kernel: [12908.352206] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Sep 25 15:35:22 hp-x360n kernel: [12908.352207] The GPU crash dump is required to analyze GPU hangs, so please always attach it.
Sep 25 15:35:22 hp-x360n kernel: [12908.352209] GPU crash dump saved to /sys/class/drm/card0/error
Sep 25 15:35:22 hp-x360n kernel: [12908.353216] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Sep 25 15:35:22 hp-x360n kernel: [12908.353969] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Sep 25 15:35:22 hp-x360n kernel: [12908.354079] i915 0000:00:02.0: Resetting chip for hang on rcs0
Sep 25 15:35:22 hp-x360n kernel: [12908.355089] [drm] GuC communication stopped
Sep 25 15:35:22 hp-x360n kernel: [12908.355831] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Sep 25 15:35:22 hp-x360n kernel: [12908.356549] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Sep 25 15:35:22 hp-x360n kernel: [12908.358073] [drm] GuC communication enabled
Sep 25 15:35:22 hp-x360n kernel: [12908.358112] i915 0000:00:02.0: GuC firmware i915/kbl_guc_33.0.0.bin version 33.0 submission:disabled
Sep 25 15:35:22 hp-x360n kernel: [12908.358113] i915 0000:00:02.0: HuC firmware i915/kbl_huc_4.0.0.bin version 4.0 authenticated:yes
...
Sep 25 15:37:06 hp-x360n kernel: [13012.350433] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Sep 25 15:37:14 hp-x360n kernel: [13020.350394] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Sep 25 15:37:16 hp-x360n kernel: [13022.334380] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Sep 25 15:37:18 hp-x360n kernel: [13024.318362] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Sep 25 15:37:20 hp-x360n kernel: [13026.302373] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Sep 25 15:37:22 hp-x360n kernel: [13028.350328] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Sep 25 15:37:24 hp-x360n kernel: [13030.334352] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Sep 25 15:37:26 hp-x360n kernel: [13032.318344] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Sep 25 15:37:28 hp-x360n kernel: [13034.302337] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Sep 25 15:37:30 hp-x360n kernel: [13036.350297] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Sep 25 15:37:32 hp-x360n kernel: [13038.334281] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
----

Comment 7 Kenneth C 2019-09-25 22:48:01 UTC

Is there any way to get the "/sys/class/drm/card0/error" file into non-volatile storage, or dumped to the log_buf so I can get to it after a reboot?

At least now with the drm-tip code I can SysRq-S instead of hitting the hard lockup I got before.
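
One possible workaround, sketched here rather than offered as an official mechanism: a small poller that copies the error state to disk as soon as it appears, so it is already saved when the display becomes unusable. The placeholder string, the destination directory `/var/tmp/i915-errors`, the 5 second interval, and the function names are all assumptions for illustration.

```python
import os
import time
from pathlib import Path

# Assumption: the node reads as this placeholder text until a hang has been captured.
NO_ERROR_MARKER = b"No error state collected"


def read_error_state(src):
    """Return the raw error state at src, or None if it is missing, empty, or the placeholder."""
    try:
        data = Path(src).read_bytes()
    except OSError:
        return None
    if not data.strip() or data.startswith(NO_ERROR_MARKER):
        return None
    return data


def write_snapshot(data, dest_dir):
    """Write data to a timestamped file in dest_dir and fsync it to push it out to disk."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / f"i915-error-{time.strftime('%Y%m%d-%H%M%S')}.txt"
    with open(dest, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    return dest


def watch(src="/sys/class/drm/card0/error", dest_dir="/var/tmp/i915-errors", interval=5.0):
    """Poll src forever and save each new error state once (hypothetical defaults)."""
    last = None
    while True:
        data = read_error_state(src)
        if data is not None and data != last:
            print("saved", write_snapshot(data, dest_dir))
            last = data
        time.sleep(interval)
```

Calling `watch()` from a spare terminal or a service before reproducing the hang would leave the saved copy on disk after a reboot. Reading the node probably requires root, and the poller only helps if the machine stays alive long enough after the hang to write the file.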

Comment 8 Kenneth C 2019-09-27 22:19:47 UTC

Created attachment 145559 [details]
/sys/class/drm/card0/error

Comment 9 Kenneth C 2019-09-27 23:07:46 UTC

Finally captured the error state, see above

Comment 10 Kenneth C 2019-10-01 21:52:45 UTC

It happened again, was able to get error state (I had to use "Sysrq-K" to kill off Kwin/Plasma then I could log in again):

----
[Tue Oct  1 14:28:09 2019] i915 0000:00:02.0: GPU HANG: ecode 9:1:0x00000000, hang on rcs0
[Tue Oct  1 14:28:09 2019] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[Tue Oct  1 14:28:09 2019] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[Tue Oct  1 14:28:09 2019] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[Tue Oct  1 14:28:09 2019] The GPU crash dump is required to analyze GPU hangs, so please always attach it.
[Tue Oct  1 14:28:09 2019] GPU crash dump saved to /sys/class/drm/card0/error
[Tue Oct  1 14:28:09 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:28:09 2019] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[Tue Oct  1 14:28:09 2019] i915 0000:00:02.0: Resetting chip for hang on rcs0
[Tue Oct  1 14:28:09 2019] [drm] GuC communication stopped
[Tue Oct  1 14:28:09 2019] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[Tue Oct  1 14:28:09 2019] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[Tue Oct  1 14:28:09 2019] [drm] GuC communication enabled
[Tue Oct  1 14:28:09 2019] i915 0000:00:02.0: GuC firmware i915/kbl_guc_33.0.0.bin version 33.0 submission:disabled
[Tue Oct  1 14:28:09 2019] i915 0000:00:02.0: HuC firmware i915/kbl_huc_4.0.0.bin version 4.0 authenticated:yes
[Tue Oct  1 14:28:15 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:28:23 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:28:25 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:28:27 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:28:29 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:28:31 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:28:33 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:28:35 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:28:37 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:28:39 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:28:41 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:28:43 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:28:45 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:28:47 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:28:49 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:28:51 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:28:53 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:28:55 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:28:57 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:28:59 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:29:01 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:29:03 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:29:05 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:29:07 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:29:09 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:29:11 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:29:13 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:29:15 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:29:17 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:29:19 2019] i915 0000:00:02.0: GPU recovery timed out, cancelling all in-flight rendering.
[Tue Oct  1 14:29:19 2019] [drm] GuC communication stopped
[Tue Oct  1 14:29:19 2019] i915 0000:00:02.0: Resetting chip for hang on rcs0
[Tue Oct  1 14:29:19 2019] [drm] GuC communication enabled
[Tue Oct  1 14:29:19 2019] i915 0000:00:02.0: GuC firmware i915/kbl_guc_33.0.0.bin version 33.0 submission:disabled
[Tue Oct  1 14:29:19 2019] i915 0000:00:02.0: HuC firmware i915/kbl_huc_4.0.0.bin version 4.0 authenticated:yes
[Tue Oct  1 14:30:00 2019] sysrq: Keyboard mode set to system default
----
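
To compare episodes like the one above, a rough sketch that counts the engine resets and measures how long the driver kept retrying before giving up. It assumes the bracketed human-readable timestamps shown in this log and an English/C locale for the day and month names; `summarize_resets` and the `sample` list are illustrative only.

```python
import re
from datetime import datetime

# Matches the timestamp prefix seen in the log above, e.g. "[Tue Oct  1 14:28:09 2019]".
STAMP_RE = re.compile(r"^\[(\w{3} \w{3} +\d+ \d{2}:\d{2}:\d{2} \d{4})\] (.*)$")


def summarize_resets(lines):
    """Count rcs0 engine resets and time the gap from the first hang to the recovery timeout."""
    first_hang = None
    timed_out = None
    engine_resets = 0
    for line in lines:
        m = STAMP_RE.match(line)
        if not m:
            continue
        # Collapse the padded day-of-month ("Oct  1") before parsing.
        when = datetime.strptime(" ".join(m.group(1).split()), "%a %b %d %H:%M:%S %Y")
        msg = m.group(2)
        if "GPU HANG:" in msg and first_hang is None:
            first_hang = when
        elif "Resetting rcs0 for hang on rcs0" in msg:
            engine_resets += 1
        elif "GPU recovery timed out" in msg and timed_out is None:
            timed_out = when
    seconds = None
    if first_hang is not None and timed_out is not None:
        seconds = (timed_out - first_hang).total_seconds()
    return {"engine_resets": engine_resets, "seconds_to_timeout": seconds}


sample = [
    "[Tue Oct  1 14:28:09 2019] i915 0000:00:02.0: GPU HANG: ecode 9:1:0x00000000, hang on rcs0",
    "[Tue Oct  1 14:28:09 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0",
    "[Tue Oct  1 14:28:15 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0",
    "[Tue Oct  1 14:29:19 2019] i915 0000:00:02.0: GPU recovery timed out, cancelling all in-flight rendering.",
]
print(summarize_resets(sample))  # {'engine_resets': 2, 'seconds_to_timeout': 70.0}
```

The sample is a four-line excerpt of the log above; feeding the full log in would give the complete reset count for that episode.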

Comment 11 Kenneth C 2019-10-01 21:53:14 UTC

Created attachment 145609 [details]
/sys/class/drm/card0/error

Comment 12 Kenneth C 2019-10-01 21:55:17 UTC

Running the tip of Linus' tree (54ecb8f7028c5e) merged with drm-tip/drm-tip (9300459553e8c1032f10).

Comment 13 Kenneth C 2019-10-03 17:44:48 UTC

Created attachment 145628 [details]
/sys/class/drm/card0/error

It KEEPS happening, ever since the DRM updates were merged to Linus' master.

Is anyone reading thru these card0/error reports? Any clues? Anything I can/should try?

Daily I merge drm-tip and remain optimistic, but it's been two weeks of unstable operation (and because of the hardware I'm running, I need Linus' tip for power-management, and platform fixes).

FWIW, I was able to kill off X and run a hibernate; it brought the card back from constantly hanging (no doubt due to the power cycle) but I saw this (one on the way down, the latter on the way back up):

----
[102207.765555] i915 0000:00:02.0: Failed to idle engines, declaring wedged!
...
[102208.753786] i915 0000:00:02.0: Failed to idle engines, declaring wedged!
----

Comment 14 Kenneth C 2019-10-03 19:30:41 UTC

Created attachment 145629 [details]
/sys/class/drm/card0/error

:(

Comment 15 Lakshmi 2019-10-04 09:41:06 UTC

(In reply to Kenneth C from comment #14)
> Created attachment 145629 [details]
> /sys/class/drm/card0/error
> 
> :(

Can you disable the GuC and check whether the issue still occurs? If it persists, can you attach the error log?

Comment 16 Kenneth C 2019-10-04 16:48:31 UTC

Created attachment 145649 [details]
/sys/class/drm/card0/error

Thank you for looking at my error traces again.

I've tried it once before without the GuC loaded, and it still hung, but I'll try it again.

Ironically enough, entering text on this website seems to trigger this bug; go figure. I was in the middle of typing this when it locked up again (which has happened before on this site). Error report attached. I had to unplug/replug my secondary monitor to unwedge the GPU again.

Comment 17 Kenneth C 2019-10-05 18:36:27 UTC

It's been about 24 hours without the GuC loaded, and it hasn't happened yet ... this is while running the drm-tip (as of 42dcf5adc9c4).

I'll let it go another day or so before saying that fixed it (that, or a combination of the stuff in the drm-tip) and if so, I'll try turning on GuC(/HuC) again and trying that as a control.

But what am I giving up by not using GuC(/HuC)? I run KDE/Plasma (with the compositor) and VMware with 3D turned on, rarely view videos outside of Plex and YouTube, and never run games.

Comment 18 Kenneth C 2019-10-07 02:08:59 UTC

This was a hang without GuC(/HuC) enabled, as requested. Had to reboot to clear it up.

(ETA: apparently I cannot add attachments anymore; I'm getting an error when I hit "Submit". I have an error state for the non-GuC case I'd like to attach)

Comment 19 Kenneth C 2019-10-08 00:28:53 UTC

Another Non-GuC hangup, posted to https://bugs.freedesktop.org/show_bug.cgi?id=111920

Comment 20 Francesco Balestrieri 2019-10-10 06:21:04 UTC

Changing component to GuC given the feedback.

Comment 21 Lakshmi 2019-10-10 07:01:48 UTC

(In reply to Kenneth C from comment #16)
> Created attachment 145649 [details]
> /sys/class/drm/card0/error
> 
> Thank you for looking at my error traces again.
> 
> I've tried it once before without the GuC loaded, and it still had hung, but
> I'll try it again.
> 
> Ironically enough, entering text on this website seems to trigger this bug-
> go figure. I was in the middle of typing this when it locked up again (which
> has happened before on this site). Error report attached. I had to
> unplug/replug my secondary monitor to unwedge the GPU again.

Can you also attach the full dmesg from boot with kernel parameters drm.debug=0x1e log_buf_len=4M ?
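
As a reading aid for the requested parameter, a small sketch that decodes a drm.debug bitmask into category names. The bit table reflects the drm.debug categories as I understand them for kernels of this period and should be treated as an assumption; check it against the drm_print.h of the kernel actually being booted.

```python
# Assumed drm.debug bit assignments; verify against your kernel's drm_print.h.
DRM_DEBUG_BITS = [
    (0x01, "CORE"),
    (0x02, "DRIVER"),
    (0x04, "KMS"),
    (0x08, "PRIME"),
    (0x10, "ATOMIC"),
    (0x20, "VBL"),
    (0x40, "STATE"),
    (0x80, "LEASE"),
    (0x100, "DP"),
]


def decode_drm_debug(mask):
    """Return the names of the debug categories enabled by mask, lowest bit first."""
    return [name for bit, name in DRM_DEBUG_BITS if mask & bit]


print(decode_drm_debug(0x1e))  # ['DRIVER', 'KMS', 'PRIME', 'ATOMIC']
```

Under that table, 0x1e turns on driver, modesetting, PRIME, and atomic messages while leaving the core category off.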

Comment 22 Kenneth C 2019-10-10 11:07:41 UTC

(In reply to Francesco Balestrieri from comment #20)
> Changing component to GuC given the feedback.

See https://bugs.freedesktop.org/show_bug.cgi?id=111920 ; it happens WITHOUT GuC enabled as well.

Comment 23 Kenneth C 2019-10-10 11:28:31 UTC

(In reply to Lakshmi from comment #21)

> Can you also attach the full dmesg from boot with kernel parameters drm.debug=0x1e log_buf_len=4M ?

Guys, I appreciate the work and effort going into the i915 driver (I've spent a lot of time reading the code thinking I could help fix this, and it's highly complex), but it's been three weeks now and this regression is killing my workflow. So I "fixed" the issue by reverting the Sept 19th drm-next merge from Linus' master[1], and I've had reliable operation again for days now.

I'll keep watching for the next DRM update, and I really hope this is happening in enough places to give you guys an idea of what's going on so it can get fixed, but I can't beta-test this code any longer ... sorry.

I still have the branch with the faulty DRM code, though, and the next time I reboot I'll try to remember to add "drm.debug=0x1e" to the cmdline and boot the faulty branch so I can upload the dmesg.

[1] - Turned out to be less painful than I'd thought, too. If anyone else needs to do this, it's "git revert -m 1 574cc4539762", checking out "HEAD" for everything conflicted outside i915, fixing up a minor conflict in .../i915/, and cherry-picking 72e67f0463.

Comment 24 Kenneth C 2019-10-11 17:01:14 UTC

I see there are a number of commits pushed to Linus' tip for the i915 today, some of which seem to be relevant to this issue, so I'll try them out.

Fingers crossed ....

Comment 25 Kenneth C 2019-10-11 20:05:04 UTC

Created attachment 145715 [details]
/sys/class/drm/card0/error

... nope :(

At least this time it recovered.

Comment 26 Kenneth C 2019-10-12 06:02:11 UTC

Created attachment 145717 [details]
/sys/class/drm/card0/error

Wow. This time I wasn't even doing anything, came back to it after a couple of hours to an unresponsive system. Back to my discarded-drm-next branch :(

Comment 27 Leho Kraav (:macmaN :lkraav) 2019-10-30 11:51:48 UTC

Hi. I'm hitting the same problem now with 5.4.0-rc4+ series, Latitude 7400 CFL.

Mucking around in Firefox, suddenly the graphics hang. Fortunately I can switch to a console, so I can gracefully terminate some apps, but there is no way to get back into the GUI without a reboot.

I just had 38 days perfectly stable uptime on 5.3.0, before upgrading to 5.4.0-rc series for improvements in S0ix suspend residency.

Unfortunately some kind of a serious graphics regression has happened here.

```
...
okt   30 13:18:23 papaya org.gnome.Shell.desktop[1581]: Fontconfig error: Cannot load default config file
okt   30 13:21:02 papaya kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
okt   30 13:21:02 papaya kernel: [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
okt   30 13:21:02 papaya kernel: i915 0000:00:02.0: Resetting chip for hang on rcs0
okt   30 13:21:02 papaya kernel: [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
okt   30 13:21:02 papaya kernel: [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
okt   30 13:21:02 papaya org.gnome.Shell.desktop[1581]: Window manager warning: last_user_time (27091247) is greater than comparison timestamp (27084779).  This most likely represents a buggy client sending inaccurate timestamps in messa>
okt   30 13:21:02 papaya org.gnome.Shell.desktop[1581]: Window manager warning: 0x260108f appears to be one of the offending windows with a timestamp of 27091247.  Working around...
okt   30 13:21:10 papaya kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
okt   30 13:21:18 papaya kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
okt   30 13:21:20 papaya kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
okt   30 13:21:22 papaya kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
okt   30 13:21:24 papaya kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
okt   30 13:21:26 papaya kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
okt   30 13:21:28 papaya kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
okt   30 13:21:30 papaya kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
okt   30 13:21:32 papaya kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
okt   30 13:21:34 papaya kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
okt   30 13:21:36 papaya kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
okt   30 13:21:38 papaya kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
okt   30 13:21:40 papaya kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
okt   30 13:21:42 papaya kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
okt   30 13:21:44 papaya kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
okt   30 13:21:46 papaya kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
okt   30 13:21:48 papaya kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
okt   30 13:21:50 papaya kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
okt   30 13:21:52 papaya kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
okt   30 13:21:54 papaya kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
okt   30 13:21:56 papaya kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
okt   30 13:21:58 papaya kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
okt   30 13:22:00 papaya kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
okt   30 13:22:02 papaya kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
okt   30 13:22:04 papaya kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
okt   30 13:22:06 papaya kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
okt   30 13:22:08 papaya kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
okt   30 13:22:10 papaya kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
okt   30 13:22:12 papaya kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
okt   30 13:22:14 papaya kernel: i915 0000:00:02.0: GPU recovery timed out, cancelling all in-flight rendering.
okt   30 13:22:14 papaya kernel: i915 0000:00:02.0: Resetting chip for hang on rcs0
okt   30 13:22:16 papaya kernel: i915 0000:00:02.0: GPU recovery timed out, cancelling all in-flight rendering.
okt   30 13:22:16 papaya kernel: i915 0000:00:02.0: Resetting chip for hang on rcs0
okt   30 13:22:30 papaya kernel: i915 0000:00:02.0: Resetting rcs0 for no progress on rcs0
okt   30 13:22:38 papaya kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
okt   30 13:22:46 papaya kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
okt   30 13:22:48 papaya kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
okt   30 13:22:50 papaya kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
okt   30 13:22:52 papaya kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
okt   30 13:22:54 papaya kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
...
```
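For anyone comparing logs like the one above, a rough way to tally the reset messages per reason is sketched below. It assumes the `i915 <pci-address>: Resetting <engine> for <reason>` wording seen in this log; `count_i915_resets` is a hypothetical helper name, not part of any existing tool.

```shell
# Tally i915 reset messages read from stdin, one output line per
# distinct "<engine> for <reason>" string, most frequent first.
# Assumption: messages follow the wording shown in the log above.
count_i915_resets() {
    awk '
        /i915 .*: Resetting [a-z0-9]+ for / {
            sub(/.*: Resetting /, "")   # keep only "<engine> for <reason>"
            count[$0]++
        }
        END { for (r in count) printf "%d %s\n", count[r], r }
    ' | sort -rn
}
```

Typical use would be something like `journalctl -k -b | count_i915_resets`, which reads the kernel messages of the current boot.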

Comment 28 Leho Kraav (:macmaN :lkraav) 2019-10-30 12:00:39 UTC

PS This has happened 2 times; both incidents were with a USB-C 5K monitor connected.

Comment 29 Leho Kraav (:macmaN :lkraav) 2019-10-30 12:07:30 UTC

Created attachment 145842 [details]
dmesg.txt, non-debug

Regular boot dmesg attached.

I updated libdrm 2.4.99 -> 2.4.100, mesa 19.1.7 -> 19.2.2

If I still get another hang here, I'll add debug params to kernel cmdline.

Comment 30 Jon Ewins 2019-10-30 20:22:16 UTC

Removing firmware/guc i915/feature label as not guc specific.

Comment 31 Kenneth C 2019-10-31 00:02:50 UTC

(In reply to Leho Kraav (:macmaN :lkraav) from comment #27)

> I just had 38 days perfectly stable uptime on 5.3.0, before upgrading to
> 5.4.0-rc series for improvements in S0ix suspend residency.
> Unfortunately some kind of a serious graphics regression has happened here.

If you build your own kernel and wish to carve out drm-next from Linus' tip, see my post above here: https://bugs.freedesktop.org/show_bug.cgi?id=111805#c23

You may have to also do "git reset <linus-tip> -- drivers/gpu/drm/i915" as there's some other changes that have come along since drm-next was merged.

Did this a couple of weeks ago and my box has been rock-solid again ever since.

Comment 32 Kenneth C 2019-10-31 00:34:34 UTC

> You may have to also do ..

"git checkout HEAD -- drivers/gpu/drm/i915" pre-git-merge <linus-tip> commit, that is

Comment 33 Lakshmi 2019-10-31 07:21:43 UTC

(In reply to Kenneth C from comment #31)
> (In reply to Leho Kraav (:macmaN :lkraav) from comment #27)
> 
> > I just had 38 days perfectly stable uptime on 5.3.0, before upgrading to
> > 5.4.0-rc series for improvements in S0ix suspend residency.
> > Unfortunately some kind of a serious graphics regression has happened here.
> 
> If you build your own kernel and wish to carve out drm-next from Linus' tip,
> see my post above here:
> https://bugs.freedesktop.org/show_bug.cgi?id=111805#c23
> 
> You may have to also do "git reset <linus-tip> -- drivers/gpu/drm/i915" as
> there's some other changes that have come along since drm-next was merged.
> 
> Did this a couple of weeks ago and my box has been rock-solid again ever
> since.

Kenneth, Do you mean to say the issue is not seen on Linus' tip?

Comment 34 Kenneth C 2019-10-31 18:41:34 UTC

(In reply to Lakshmi from comment #33)

> Kenneth, Do you mean to say the issue is not seen on Linus' tip?

No that's not what I'm saying, please re-read the comments carefully.

Linus' tip has the drm-next of mid-September pushed to it in commit 574cc4539762; it was this commit that rendered i915 DRM completely hang-ridden and unstable for me and others. My instructions above show how to UNMERGE the drm-next changes, so that someone who needs to run Linus' tip (or several released kernels) for whatever upstream benefit can still go back to the working DRM/i915.

Apparently you guys were warned of this issue regularly occurring by what appears to be an Intel Mesa(?) developer, and he'd even bisected it to a range of commits: https://bugs.freedesktop.org/show_bug.cgi?id=111385#c12

... and although no resolution to that bug was given, and it was apparently easily reproducible, this was still merged into the Linux tree. Those commits must be in Intel-local repos, as I can't see those commits in any git repo I have.

Comment 35 Kenneth C 2019-11-03 23:16:32 UTC

Since Chris W has fixed an issue with the drm-tip I'd had with increased current draw when in S0 (see https://bugs.freedesktop.org/show_bug.cgi?id=111909) in commit c601cb2135, I'd decided to try drm-tip again.

So last night, I took drm-tip as of 9d229bec4f5, merged Linus' tip with that, and have been running with an external monitor, playing videos continuously for hours now, and have run several suspend cycles (which previously seemed to exacerbate this issue).

I'm glad to say I haven't seen a hiccup in over 15 hours and only a couple of "Atomic update failure" and "Failed to enable link training" messages in dmesg (which appear to be normal, but I can post them if needed).

So to make sure, apparently the best way to get something to break is to publicly announce it "fixed", so here I go: I think this is now fixed in the drm-tip.

... fingers crossed

Comment 36 Kenneth C 2019-11-07 07:41:18 UTC

Well, I'm satisfied. I don't know what commit fixed it, but I haven't seen any hangs, recoverable or otherwise since returning to drm-tip a few days ago. I'm willing to say "fixed".

Comment 37 Leho Kraav (:macmaN :lkraav) 2019-11-07 19:41:32 UTC

Created attachment 145910 [details]
sys-class-drm-card0-error.txt

I'm still crashing on 5.4.0-rc6. I should probably try drm-tip next per Kenneth's experience.

Regardless, I got a card0 error report now to attach. (Had to enable relevant kernel option)
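A minimal sketch for saving the error state to a file before it is lost on reboot. Two assumptions are baked in: the kernel option mentioned above is taken to be `CONFIG_DRM_I915_CAPTURE_ERROR`, and the driver is taken to return a short "No error state collected" notice when nothing has been captured. `save_gpu_error` is a hypothetical helper name.

```shell
# Copy the i915 error state to a timestamped file, if one was captured.
# $1: error-state node (default: the path used in this report)
# $2: output directory (default: current directory)
save_gpu_error() {
    src=${1:-/sys/class/drm/card0/error}
    dir=${2:-.}
    [ -r "$src" ] || { echo "cannot read $src" >&2; return 1; }
    # Assumption: with no recorded hang the node holds only this notice.
    if grep -q '^No error state collected' "$src"; then
        echo "no error state collected" >&2
        return 2
    fi
    out="$dir/i915-error-$(date +%Y%m%d-%H%M%S).txt"
    cat "$src" > "$out" && echo "$out"
}
```

Reading the node usually needs root, so in practice this would be run from a root shell; the printed file name is what gets attached to the bug.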

Surprisingly, while I did have to SysRq-E terminate active processes, GPU then managed to recover itself without a reboot. I was even able to re-launch gdm and log back in to post this.

PS I'm working with Java-based PhpStorm a lot, this was the active app when the hang occurred. Maybe somehow relevant to what parts of the stack may be triggering this. System uptime was ~3.5 days, which I think is also visible in error.txt.

Comment 38 Leho Kraav (:macmaN :lkraav) 2019-11-08 12:01:52 UTC

Kenneth, this might be a lot to ask, but since you seem to have reached stable uptime, do you have any interest in trying to bisect this problem? You have a `bisect good` point to go off of.

My `bisect good` is in 5.3.0, where going back would suck for power management.

I'm now running 5.4.0-rc6-drm-tip since yesterday, we'll see - maybe this also proves stable.

Comment 39 Kenneth C 2019-11-08 12:42:51 UTC

(In reply to Leho Kraav (:macmaN :lkraav) from comment #38)

> do you have any interest in trying to bisect this problem.

I've reported a couple of bugs here for which I determined the bad commit via sometimes lengthy bisection sessions, but those were cases that were easily or uniquely reproducible. The problem with this set of GPU HANGs was that they could happen with anywhere from 20 mins to 20 hours of uptime, and between the time I gave up on the stuff in Linus' trees and the time it was fixed there have been so many commits that the series could easily take 15+ bisection trials, and I unfortunately don't have the time.

But IIRC you, like me, need Linus' master tree for PM improvements (for me, it's because of the S0ix stuff I need). If that's the case, you can always do what I do, which is to maintain several git remotes in your kernel tree: I keep Linus' tip, drm-tip, and RJW's PM tip as remotes. I can then pull a branch for drm-tip, then "git merge" Linus' master on top of that (and merge or cherry-pick from the PM tree from time to time); I don't recall much in the way of fixups, and I'm now running a kernel that's got the latest from both trees.
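The remotes workflow described here could look roughly like the sketch below. The remote names, the `hybrid` branch name and the default URLs are assumptions for illustration (the URLs may have moved since); only the `drm-tip/drm-tip` and `master` branch names come from this thread. Run it inside an existing kernel git tree.

```shell
# Assumed upstream locations; override the variables to use other mirrors.
LINUS_URL=${LINUS_URL:-https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git}
DRM_TIP_URL=${DRM_TIP_URL:-git://anongit.freedesktop.org/drm-tip}
PM_URL=${PM_URL:-https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git}

# Register the three trees as remotes (names are hypothetical).
add_remotes() {
    git remote add linus "$LINUS_URL" &&
    git remote add drm-tip "$DRM_TIP_URL" &&
    git remote add pm "$PM_URL"
}

# Start a local branch at drm-tip, then merge Linus' master on top of it.
# PM changes are not merged here; pick those by hand when needed.
build_hybrid() {
    git fetch linus && git fetch drm-tip || return 1
    git checkout -B hybrid drm-tip/drm-tip &&
    git merge --no-edit linus/master
}
```

Calling `add_remotes` once and `build_hybrid` whenever either tree moves gives a branch carrying both; merge conflicts, if any, still have to be resolved manually.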

Comment 40 Leho Kraav (:macmaN :lkraav) 2019-11-09 17:49:52 UTC

Created attachment 145926 [details]
sys-class-drm-card0-error.txt

Just got a hang again, on drm-tip:

* 41eb27f39e60 - (drm-tip/drm-tip) drm-tip: 2019y-11m-07d-17h-06m-16s UTC integration manifest (2 days ago) <Chris Wilson>

Both IntelliJ and Firefox open, on USB-C 5K monitor. Firefox was animating its sidebar drawer close, when it suddenly froze.

Surprisingly, GPU recovered and I'm writing this message in Firefox.

Although I see it's now constantly spewing "kernel: i915 0000:00:02.0: Resetting rcs0 for stopped heartbeat on rcs0" every 3 seconds.

DRM card0 error log attached.

EDIT while I was writing this message, machine suddenly went autonomously into suspend and, I think simultaneously, gnome-shell died with "i965: Failed to submit batchbuffer: Input/output error"

```
nov   09 19:42:53 papaya kernel: i915 0000:00:02.0: Resetting rcs0 for stopped heartbeat on rcs0
nov   09 19:42:56 papaya kernel: i915 0000:00:02.0: Resetting rcs0 for stopped heartbeat on rcs0
nov   09 19:42:59 papaya kernel: i915 0000:00:02.0: Resetting rcs0 for stopped heartbeat on rcs0
nov   09 19:43:02 papaya kernel: i915 0000:00:02.0: Resetting rcs0 for stopped heartbeat on rcs0
nov   09 19:43:05 papaya kernel: i915 0000:00:02.0: Resetting rcs0 for stopped heartbeat on rcs0
nov   09 19:43:08 papaya kernel: i915 0000:00:02.0: Resetting rcs0 for stopped heartbeat on rcs0
nov   09 19:43:11 papaya kernel: i915 0000:00:02.0: Resetting rcs0 for stopped heartbeat on rcs0
nov   09 19:43:14 papaya kernel: i915 0000:00:02.0: Resetting rcs0 for stopped heartbeat on rcs0
nov   09 19:43:14 papaya org.gnome.Shell.desktop[10561]: i965: Failed to submit batchbuffer: Input/output error
nov   09 19:43:14 papaya polkitd[9832]: Unregistered Authentication Agent for unix-session:3 (system bus name :1.72, object path /org/freedesktop/PolicyKit1/AuthenticationAgent, locale et_EE.utf8) (disconnected from bus)
nov   09 19:43:14 papaya gnome-session[10493]: gnome-session-binary[10493]: WARNING: App 'org.gnome.Shell.desktop' exited with code 1
nov   09 19:43:14 papaya gnome-session-binary[10493]: WARNING: App 'org.gnome.Shell.desktop' exited with code 1
```

After gdm restart, kernel log is staying clean, so looks like some kind of a graphics stack reset happened.

(PS All hail Firefox, who successfully saved all typed work in the form, when the session died.)

Comment 41 Kenneth C 2019-11-09 17:53:29 UTC

Now I'm wondering if it's a combination of some kernel fix combined with the drm-tip; I've been flawless for over a week now.

Can you try merging Linus' master and see how that goes?

Comment 42 arek.burdach 2019-11-09 18:22:30 UTC

(In reply to Kenneth C from comment #41)
> Now I'm wondering if it's a combination of some kernel fix combined with the
> drm-tip; I've been flawless for over a week now.

I have the same problem.
Tried kernel from Linus tree (5.4-rc6) - ended up with GPU HANG in dmesg,
and now on today's drm-tip (e7de48a8b1161a99f4b8e4483bc1bb85f5d31039) - ended up with "Resetting rcs0 for hang on rcs0".

I'm working on xps 13 7390 2-in-1, Intel Ice Lake i7-1065G7
mesa: 19.2.1
Always when I use IntelliJ IDEA, on openjdk 11.0.4+10-b304.77.

Comment 43 arek.burdach 2019-11-09 19:19:03 UTC

(In reply to arek.burdach from comment #42)
> Tried kernel from Linus tree (5.4-rc6) - ended up with GPU HANG id dmesg,
> and now on today's drm-tip (e7de48a8b1161a99f4b8e4483bc1bb85f5d31039) -
> ended up with "Resetting rcs0 for hang on rcs0".

My mistake, on 5.4-rc6 it was:
>  GPU HANG: ecode 9:0:0x00000000, hang on rcs0

And on drm-tip it is (from syslog):
> Nov  9 14:07:03 arek-xps13 kernel: [  705.899311] i915 0000:00:02.0: Resetting rcs0 for preemption time out
> Nov  9 14:07:03 arek-xps13 /usr/lib/gdm3/gdm-x-session[1555]: i965: Failed to submit batchbuffer: Input/output error
> Nov  9 14:07:03 arek-xps13 gnome-terminal-[3004]: gnome-terminal-server: Fatal IO error 11 (Resource temporarily unavailable) on X server :0.
> Nov  9 14:07:03 arek-xps13 evolution-alarm[2152]: evolution-alarm-notify: Fatal IO error 11 (Resource temporarily unavailable) on X server :0.
> Nov  9 14:07:03 arek-xps13 pulseaudio[1543]: X connection to :0 broken (explicit kill or server shutdown).
> Nov  9 14:07:03 arek-xps13 at-spi-bus-launcher[1721]: X connection to :0 broken (explicit kill or server shutdown).
> Nov  9 14:07:03 arek-xps13 update-notifier[3351]: update-notifier: Fatal IO error 11 (Resource temporarily unavailable) on X server :0.
> Nov  9 14:07:03 arek-xps13 jetbrains-idea.desktop[1809]: #
> Nov  9 14:07:03 arek-xps13 jetbrains-idea.desktop[1809]: # A fatal error has been detected by the Java Runtime Environment:
> Nov  9 14:07:03 arek-xps13 jetbrains-idea.desktop[1809]: #
> Nov  9 14:07:03 arek-xps13 jetbrains-idea.desktop[1809]: #  SIGSEGV (0xb) at pc=0x00007f04aa039b76, pid=4644, tid=4694
> Nov  9 14:07:03 arek-xps13 jetbrains-idea.desktop[1809]: #
> Nov  9 14:07:03 arek-xps13 jetbrains-idea.desktop[1809]: # JRE version: OpenJDK Runtime Environment (11.0.4+10) (build 11.0.4+10-b304.77)
> Nov  9 14:07:03 arek-xps13 jetbrains-idea.desktop[1809]: # Java VM: OpenJDK 64-Bit Server VM (11.0.4+10-b304.77, mixed mode, tiered, compressed oops, concurrent mark sweep gc, linux-amd64)
> Nov  9 14:07:03 arek-xps13 jetbrains-idea.desktop[1809]: # Problematic frame:
> Nov  9 14:07:03 arek-xps13 jetbrains-idea.desktop[1809]: # C  [libc.so.6+0x49b76]

I'll attach java_error.log. It looks like DRM hits some "hiccups" which cause a hang on 5.4-rc6 and are handled better on drm-tip.

I see in the log that the problem occurred while rendering MemoryUsagePanel. @Leho, do you also have it enabled? I'll check whether the stack trace is the same next time.

Comment 44 arek.burdach 2019-11-09 19:22:06 UTC

Created attachment 145927 [details]
java_error during Resetting rcs0 for preemption time out

Comment 45 Leho Kraav (:macmaN :lkraav) 2019-11-10 08:03:01 UTC

Yeah, looks like your stack ends up in exactly the same place as mine.

I've been running IntelliJ with system OpenJDK 8 (icedtea-bin-3.13.0). Good to know that their new embedded OpenJDK 11 doesn't do anything to alleviate this issue.

Comment 46 Kenneth C 2019-11-10 19:16:39 UTC

Is there a quick-start guide to running "IntelliJ"? I can see if I can reproduce you guys' issue with my hybrid kernel.

Comment 47 arek.burdach 2019-11-10 19:27:48 UTC

You just need to download it from: https://www.jetbrains.com/idea/download/#section=linux . I use the Ultimate Edition, but the situation should be the same with the Community Edition. It is a rather "intuitive" tool, so I think you won't have a problem using it after download. Just open some project or import one from sources. After a few minutes of working with it (running some tests, editing files) you should get this "hang".

A few important things:
1. You probably need to enable the memory indicator (disabled by default). It is in: Settings -> Appearance & Behavior -> Appearance -> Windows options -> Memory option
2. I see this only when I use the internal UHD display. On an external FHD display I haven't hit it in two weeks of work.

Comment 48 Lakshmi 2019-11-11 15:35:02 UTC

@Kenneth, If you don't see any issues I would like to close this issue considering the original issue is not reproducible with drmtip. 

Other issues reported in this bug have to be investigated separately and are not part of this bug.

Comment 49 Kenneth C 2019-11-11 16:52:31 UTC

(In reply to Lakshmi from comment #48)

> @Kenneth, If you don't see any issues I would like to close this issue
> considering the original issue is not reproducible with drmtip. 

I personally am satisfied; I'm GPU HANG free over several monitor configurations and many suspend-resume cycles.

> Other issues reported in this bug has to be investigated separately but not
> part of this bug.

I understand that, and it makes sense.

Comment 50 arek.burdach 2019-11-11 18:08:55 UTC

I agree that after the changes in drm-tip the GPU driver works much more reliably. There are still some troubles with IntelliJ IDEA, but I think we should treat them separately. It could be some corner case, such as wrong handling of HiDPI in the IntelliJ code. I saw in the stack trace that there was code computing dimensions and shift based on scaling.

Comment 51 Lakshmi 2019-11-12 12:57:06 UTC

Thanks for the feedback. Closing this bug as WORKSFORME with latest drmtip.

Comment 52 Tox Mi 2019-12-09 05:48:53 UTC

I'm a newbie here, but I still have this problem from time to time (almost once a day) on my laptop (connected to an external monitor).

I'm running `Linux Hexasus 5.3.13-arch1-1` and using i3wm.

Comment 53 elizabeth789 2019-12-09 06:05:57 UTC

The machine starts entering suspend but comes back online immediately when phone charges through USB-C.

Comment 54 Kenneth C 2019-12-09 06:25:58 UTC

(In reply to elizabeth789 from comment #53)

> The machine starts entering suspend but comes back online immediately when
> phone charges through USB-C.

This isn't a DRM/i915 issue, but try this and see if it stops this from happening:

$ echo "XHC" | sudo dd of=/proc/acpi/wakeup
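One caveat with the command above: writing a device name to the wakeup file toggles its state, so running it a second time would turn wakeup back on. A sketch that checks the current state first is given below; `disable_wakeup` is a hypothetical helper, and the column layout it parses (device name in the first column, `*enabled` or `*disabled` in the third) is an assumption about the file format.

```shell
# Disable ACPI wakeup for a device only if it is currently enabled.
# $1: device name (e.g. XHC)
# $2: wakeup file (default: /proc/acpi/wakeup)
disable_wakeup() {
    dev=$1
    file=${2:-/proc/acpi/wakeup}
    # Assumed layout: "<device> <S-state> <status> <sysfs node>"
    state=$(awk -v d="$dev" '$1 == d { print $3; exit }' "$file")
    case $state in
        '*enabled'|enabled)   echo "$dev" > "$file" ;;   # the write toggles it off
        '*disabled'|disabled) echo "$dev already disabled" ;;
        *) echo "$dev not found in $file" >&2; return 1 ;;
    esac
}
```

This needs to run as root, for example from a root shell with `disable_wakeup XHC`. The setting does not survive a reboot.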
