Bug 111805 - Constant "Resetting rcs0 for hang on rcs0" and machine lockup
Summary: Constant "Resetting rcs0 for hang on rcs0" and machine lockup
Status: NEEDINFO
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: high major
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: Triaged, ReadyForDev
Keywords:
Depends on:
Blocks:
 
Reported: 2019-09-25 01:04 UTC by Kenneth C
Modified: 2019-10-12 06:02 UTC
CC List: 4 users

See Also:
i915 platform: CFL
i915 features: firmware/guc


Attachments
egrep -r . /sys/kernel/debug/dri (537.58 KB, text/plain)
2019-09-25 01:04 UTC, Kenneth C
Relevant dump (again) from /var/log/syslog (10.91 KB, text/plain)
2019-09-25 01:06 UTC, Kenneth C
/sys/class/drm/card0/error (5.21 KB, text/plain)
2019-09-27 22:19 UTC, Kenneth C
/sys/class/drm/card0/error (5.19 KB, text/plain)
2019-10-01 21:53 UTC, Kenneth C
/sys/class/drm/card0/error (5.21 KB, text/plain)
2019-10-03 17:44 UTC, Kenneth C
/sys/class/drm/card0/error (5.19 KB, text/plain)
2019-10-03 19:30 UTC, Kenneth C
/sys/class/drm/card0/error (5.18 KB, text/plain)
2019-10-04 16:48 UTC, Kenneth C
/sys/class/drm/card0/error (5.15 KB, text/plain)
2019-10-11 20:05 UTC, Kenneth C
/sys/class/drm/card0/error (5.15 KB, text/plain)
2019-10-12 06:02 UTC, Kenneth C

Description Kenneth C 2019-09-25 01:04:58 UTC
Created attachment 145504 [details]
egrep -r . /sys/kernel/debug/dri

I'm running the latest tip of Linus' tree, which incorporates the DRM/i915 changes of Thursday September 19th.

Since then, my box (HP Spectre X360) has locked up hard several times, usually when a secondary monitor is connected. I have oops logging to pstore enabled, but the lockups never leave an oops behind, and SysRq is unresponsive, so I have to hard-power-cycle the machine.

I've tried not enabling GuC/HuC, no difference.

However, the last time it happened, I was able to get it to SysRq "S" and was able to save some of the OOPS into /var/log/syslog. Unfortunately I don't have a /sys/class/drm/card0/error file (if there's a way to trigger creation, I can add it as an additional comment).

I have attached the output of "egrep -r . /sys/kernel/debug/dri".

From the syslog, I do have this:

----
Sep 24 16:47:48 hp-x360n kernel: [ 5527.664338] i915 0000:00:02.0: GPU HANG: ecode 9:1:0x00000000, hang on rcs0
Sep 24 16:47:48 hp-x360n kernel: [ 5527.664340] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Sep 24 16:47:48 hp-x360n kernel: [ 5527.664341] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Sep 24 16:47:48 hp-x360n kernel: [ 5527.664342] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Sep 24 16:47:48 hp-x360n kernel: [ 5527.664342] The GPU crash dump is required to analyze GPU hangs, so please always attach it.
Sep 24 16:47:48 hp-x360n kernel: [ 5527.664343] GPU crash dump saved to /sys/class/drm/card0/error
Sep 24 16:47:48 hp-x360n kernel: [ 5527.665348] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Sep 24 16:47:48 hp-x360n kernel: [ 5527.666072] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Sep 24 16:47:48 hp-x360n kernel: [ 5527.675065] i915 0000:00:02.0: Resetting chip for hang on rcs0
Sep 24 16:47:48 hp-x360n kernel: [ 5527.676851] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Sep 24 16:47:48 hp-x360n kernel: [ 5527.677568] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423700] INFO: task kworker/7:1H:28983 blocked for more than 122 seconds.
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423702]       Tainted: G     U     O      5.3.0-Kenny+ #3
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423703] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423704] kworker/7:1H    D    0 28983      2 0x80004000
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423730] Workqueue: events_highpri intel_atomic_cleanup_work [i915]
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423731] Call Trace:
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423736]  ? __schedule+0x293/0x530
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423737]  schedule+0x36/0xc0
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423739]  schedule_preempt_disabled+0x11/0x20
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423741]  __mutex_lock.isra.10+0x2f0/0x4f0
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423759]  intel_cleanup_plane_fb+0x2d/0x80 [i915]
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423762]  drm_atomic_helper_cleanup_planes+0x4f/0x70
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423780]  intel_atomic_cleanup_work+0x1f/0x50 [i915]
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423783]  process_one_work+0x1fb/0x3e0
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423785]  worker_thread+0x2d/0x3d0
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423786]  kthread+0x10c/0x130
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423788]  ? process_one_work+0x3e0/0x3e0
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423789]  ? kthread_create_on_node+0x60/0x60
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423790]  ret_from_fork+0x1f/0x30
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423792] INFO: task kworker/1:0H:30982 blocked for more than 122 seconds.
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423793]       Tainted: G     U     O      5.3.0-Kenny+ #3
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423793] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423794] kworker/1:0H    D    0 30982      2 0x80004000
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423817] Workqueue: events_highpri intel_atomic_cleanup_work [i915]
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423818] Call Trace:
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423820]  ? __schedule+0x293/0x530
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423821]  schedule+0x36/0xc0
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423823]  schedule_preempt_disabled+0x11/0x20
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423824]  __mutex_lock.isra.10+0x2f0/0x4f0
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423826]  ? set_next_entity+0x98/0x1a0
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423843]  intel_cleanup_plane_fb+0x2d/0x80 [i915]
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423845]  drm_atomic_helper_cleanup_planes+0x4f/0x70
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423861]  intel_atomic_cleanup_work+0x1f/0x50 [i915]
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423863]  process_one_work+0x1fb/0x3e0
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423865]  worker_thread+0x2d/0x3d0
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423866]  kthread+0x10c/0x130
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423867]  ? process_one_work+0x3e0/0x3e0
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423868]  ? kthread_create_on_node+0x60/0x60
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423869]  ret_from_fork+0x1f/0x30
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423885] INFO: task kworker/u16:4:17890 blocked for more than 122 seconds.
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423886]       Tainted: G     U     O      5.3.0-Kenny+ #3
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423887] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423887] kworker/u16:4   D    0 17890      2 0x80004000
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423905] Workqueue: i915 __i915_gem_free_work [i915]
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423906] Call Trace:
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423908]  ? __schedule+0x293/0x530
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423909]  schedule+0x36/0xc0
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423910]  schedule_preempt_disabled+0x11/0x20
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423912]  __mutex_lock.isra.10+0x2f0/0x4f0
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423928]  ? i915_global_objects_shrink+0x20/0x20 [i915]
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423943]  __i915_gem_free_objects+0x66/0x1b0 [i915]
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423946]  process_one_work+0x1fb/0x3e0
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423947]  worker_thread+0x2d/0x3d0
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423948]  kthread+0x10c/0x130
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423950]  ? process_one_work+0x3e0/0x3e0
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423951]  ? kthread_create_on_node+0x60/0x60
Sep 24 16:49:54 hp-x360n kernel: [ 5653.423952]  ret_from_fork+0x1f/0x30
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304145] INFO: task kworker/7:1H:28983 blocked for more than 245 seconds.
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304152]       Tainted: G     U     O      5.3.0-Kenny+ #3
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304154] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304158] kworker/7:1H    D    0 28983      2 0x80004000
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304243] Workqueue: events_highpri intel_atomic_cleanup_work [i915]
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304247] Call Trace:
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304261]  ? __schedule+0x293/0x530
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304267]  schedule+0x36/0xc0
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304273]  schedule_preempt_disabled+0x11/0x20
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304279]  __mutex_lock.isra.10+0x2f0/0x4f0
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304347]  intel_cleanup_plane_fb+0x2d/0x80 [i915]
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304356]  drm_atomic_helper_cleanup_planes+0x4f/0x70
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304420]  intel_atomic_cleanup_work+0x1f/0x50 [i915]
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304429]  process_one_work+0x1fb/0x3e0
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304435]  worker_thread+0x2d/0x3d0
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304439]  kthread+0x10c/0x130
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304444]  ? process_one_work+0x3e0/0x3e0
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304448]  ? kthread_create_on_node+0x60/0x60
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304473]  ret_from_fork+0x1f/0x30
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304478] INFO: task kworker/1:0H:30982 blocked for more than 245 seconds.
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304480]       Tainted: G     U     O      5.3.0-Kenny+ #3
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304482] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304483] kworker/1:0H    D    0 30982      2 0x80004000
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304528] Workqueue: events_highpri intel_atomic_cleanup_work [i915]
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304530] Call Trace:
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304535]  ? __schedule+0x293/0x530
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304539]  schedule+0x36/0xc0
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304543]  schedule_preempt_disabled+0x11/0x20
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304547]  __mutex_lock.isra.10+0x2f0/0x4f0
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304551]  ? set_next_entity+0x98/0x1a0
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304594]  intel_cleanup_plane_fb+0x2d/0x80 [i915]
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304599]  drm_atomic_helper_cleanup_planes+0x4f/0x70
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304640]  intel_atomic_cleanup_work+0x1f/0x50 [i915]
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304644]  process_one_work+0x1fb/0x3e0
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304648]  worker_thread+0x2d/0x3d0
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304651]  kthread+0x10c/0x130
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304654]  ? process_one_work+0x3e0/0x3e0
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304657]  ? kthread_create_on_node+0x60/0x60
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304660]  ret_from_fork+0x1f/0x30
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304680] INFO: task kworker/u16:4:17890 blocked for more than 245 seconds.
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304682]       Tainted: G     U     O      5.3.0-Kenny+ #3
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304683] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304685] kworker/u16:4   D    0 17890      2 0x80004000
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304728] Workqueue: i915 __i915_gem_free_work [i915]
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304730] Call Trace:
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304734]  ? __schedule+0x293/0x530
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304738]  schedule+0x36/0xc0
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304742]  schedule_preempt_disabled+0x11/0x20
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304746]  __mutex_lock.isra.10+0x2f0/0x4f0
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304787]  ? i915_global_objects_shrink+0x20/0x20 [i915]
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304827]  __i915_gem_free_objects+0x66/0x1b0 [i915]
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304832]  process_one_work+0x1fb/0x3e0
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304836]  worker_thread+0x2d/0x3d0
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304839]  kthread+0x10c/0x130
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304842]  ? process_one_work+0x3e0/0x3e0
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304845]  ? kthread_create_on_node+0x60/0x60
Sep 24 16:51:57 hp-x360n kernel: [ 5776.304848]  ret_from_fork+0x1f/0x30
----
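For anyone skimming the trace above, here is a minimal sketch (Python, not part of any i915 tooling) that condenses the hung-task reports into one entry per task. The two regular expressions and the function name `summarize_hung_tasks` are assumptions based only on the message format seen in this log.

```python
import re

# Both patterns are assumptions derived from the log lines quoted in this report.
BLOCKED_RE = re.compile(r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds")
WORKQUEUE_RE = re.compile(r"Workqueue: (\S+) (\S+)")


def summarize_hung_tasks(lines):
    """Map (task name, pid) to its longest reported block time and work function."""
    tasks = {}
    current = None
    for line in lines:
        m = BLOCKED_RE.search(line)
        if m:
            current = (m.group(1), int(m.group(2)))
            entry = tasks.setdefault(current, {"max_blocked": 0, "work_fn": None})
            entry["max_blocked"] = max(entry["max_blocked"], int(m.group(3)))
            continue
        m = WORKQUEUE_RE.search(line)
        if m and current is not None:
            # The "Workqueue:" line follows the "blocked" line of the same report.
            tasks[current]["work_fn"] = m.group(2)
    return tasks


if __name__ == "__main__":
    import sys
    for (name, pid), info in summarize_hung_tasks(sys.stdin).items():
        print(name, pid, info["max_blocked"], info["work_fn"])
```

Feeding it the syslog excerpt on stdin should show the same three kworkers in both rounds of reports, all waiting in `__mutex_lock`.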
Comment 1 Kenneth C 2019-09-25 01:06:32 UTC
Created attachment 145505 [details]
Relevant dump (again) from /var/log/syslog

Added the relevant syslog parts from the bug description (cleaner to see)
Comment 2 Lakshmi 2019-09-25 06:45:33 UTC
Can you please attach this file /sys/class/drm/card0/error here?
What is the impact of this issue to you? How do you recover from this situation?

Can you please verify the issue with drmtip and provide the feedback? (https://cgit.freedesktop.org/drm-tip).  

@Chris, any further suggestions?
Comment 3 Chris Wilson 2019-09-25 07:40:47 UTC
It all starts with the error state.
Comment 4 Kenneth C 2019-09-25 17:38:32 UTC
(In reply to Lakshmi from comment #2)

> Can you please attach this file /sys/class/drm/card0/error here?

When it hangs (and it did when I tried to reply to this just now, BTW) I'm left without a usable display, so I can't switch to a VT to save off the error file.

It seems to be most reproducible if I have an external monitor connected.

> What is the impact of this issue to you?
> How do you recover from this situation?

It locks my system hard and I require a hard power-cycle reboot to recover.

I don't have another machine to attempt to SSH into this one, either.

Is there a mechanism to save the error file to non-volatile storage?

> Can you please verify the issue with drmtip and provide the feedback?
> (https://cgit.freedesktop.org/drm-tip).  

I'll try that next.
Comment 5 Kenneth C 2019-09-25 21:44:45 UTC
I'm running drm-tip right now (as of late-afternoon PST). 

So far, so good- but usually the best way to get something to break is to declare it "fixed", so here goes.
Comment 6 Kenneth C 2019-09-25 22:45:07 UTC
(In reply to Kenneth C from comment #5)

> ...  but usually the best way to get something to break is to declare it "fixed", so here goes.

*ugh* ... never fails:
----
Sep 25 15:35:22 hp-x360n kernel: [12908.352199] i915 0000:00:02.0: GPU HANG: ecode 9:1:0x00000000, hang on rcs0
Sep 25 15:35:22 hp-x360n kernel: [12908.352203] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Sep 25 15:35:22 hp-x360n kernel: [12908.352205] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Sep 25 15:35:22 hp-x360n kernel: [12908.352206] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Sep 25 15:35:22 hp-x360n kernel: [12908.352207] The GPU crash dump is required to analyze GPU hangs, so please always attach it.
Sep 25 15:35:22 hp-x360n kernel: [12908.352209] GPU crash dump saved to /sys/class/drm/card0/error
Sep 25 15:35:22 hp-x360n kernel: [12908.353216] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Sep 25 15:35:22 hp-x360n kernel: [12908.353969] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Sep 25 15:35:22 hp-x360n kernel: [12908.354079] i915 0000:00:02.0: Resetting chip for hang on rcs0
Sep 25 15:35:22 hp-x360n kernel: [12908.355089] [drm] GuC communication stopped
Sep 25 15:35:22 hp-x360n kernel: [12908.355831] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Sep 25 15:35:22 hp-x360n kernel: [12908.356549] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Sep 25 15:35:22 hp-x360n kernel: [12908.358073] [drm] GuC communication enabled
Sep 25 15:35:22 hp-x360n kernel: [12908.358112] i915 0000:00:02.0: GuC firmware i915/kbl_guc_33.0.0.bin version 33.0 submission:disabled
Sep 25 15:35:22 hp-x360n kernel: [12908.358113] i915 0000:00:02.0: HuC firmware i915/kbl_huc_4.0.0.bin version 4.0 authenticated:yes
...
Sep 25 15:37:06 hp-x360n kernel: [13012.350433] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Sep 25 15:37:14 hp-x360n kernel: [13020.350394] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Sep 25 15:37:16 hp-x360n kernel: [13022.334380] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Sep 25 15:37:18 hp-x360n kernel: [13024.318362] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Sep 25 15:37:20 hp-x360n kernel: [13026.302373] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Sep 25 15:37:22 hp-x360n kernel: [13028.350328] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Sep 25 15:37:24 hp-x360n kernel: [13030.334352] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Sep 25 15:37:26 hp-x360n kernel: [13032.318344] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Sep 25 15:37:28 hp-x360n kernel: [13034.302337] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Sep 25 15:37:30 hp-x360n kernel: [13036.350297] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Sep 25 15:37:32 hp-x360n kernel: [13038.334281] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
----
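To put a number on how "constant" the resets are, here is a small sketch that measures the gap between consecutive engine-reset messages using the kernel's monotonic timestamps. The regex and the function name `reset_intervals` are assumptions tied to the exact message text in this log.

```python
import re

# Assumption: matches lines such as
# "... kernel: [13012.350433] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0"
RESET_RE = re.compile(r"\[\s*(\d+\.\d+)\] i915 \S+ Resetting rcs0 for hang on rcs0")


def reset_intervals(lines):
    """Return the seconds between consecutive rcs0 engine-reset messages."""
    stamps = []
    for line in lines:
        m = RESET_RE.search(line)
        if m:
            stamps.append(float(m.group(1)))
    return [round(later - earlier, 3) for earlier, later in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    import sys
    print(reset_intervals(sys.stdin))
```

On the tail of the excerpt above, the gaps come out at about two seconds each, so the engine is being reset again almost as soon as the previous attempt finishes.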
Comment 7 Kenneth C 2019-09-25 22:48:01 UTC
Is there any way to get the "/sys/class/drm/card0/error" file into non-volatile storage, or dumped to the log_buf so I can get to it after a reboot?

At least now with the drm-tip code I can SysRq-S, instead of the hard lockup I had before.
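One possible workaround, sketched below, is a small watcher that polls the error file and writes any captured state to disk with an fsync. It only helps if userspace is still running when the hang is recorded, and it normally needs root. The destination directory `/var/tmp/i915-errors`, the function names, and the "No error state collected" placeholder text are assumptions, not something confirmed in this thread.

```python
import os
import time

ERROR_PATH = "/sys/class/drm/card0/error"
# Assumption: text the file is believed to contain while no hang has been recorded.
EMPTY_MARKER = b"No error state collected"


def save_error_state(src, dest_dir):
    """Copy src into dest_dir if it holds an error state; return the new path or None."""
    with open(src, "rb") as f:
        data = f.read()
    if not data.strip() or data.startswith(EMPTY_MARKER):
        return None
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, "i915-error-%d.txt" % int(time.time()))
    with open(dest, "wb") as out:
        out.write(data)
        out.flush()
        os.fsync(out.fileno())  # force the data to disk in case the machine locks up next
    return dest


def watch(src=ERROR_PATH, dest_dir="/var/tmp/i915-errors", interval=1.0):
    """Poll src until an error state has been saved, then return its path."""
    while True:
        saved = save_error_state(src, dest_dir)
        if saved:
            return saved
        time.sleep(interval)


if __name__ == "__main__":
    print("saved", watch())
```

Polling is used here instead of a file-change notification because sysfs files generally do not emit modification events.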
Comment 8 Kenneth C 2019-09-27 22:19:47 UTC
Created attachment 145559 [details]
/sys/class/drm/card0/error
Comment 9 Kenneth C 2019-09-27 23:07:46 UTC
Finally captured the error state, see above
Comment 10 Kenneth C 2019-10-01 21:52:45 UTC
It happened again, was able to get error state (I had to use "Sysrq-K" to kill off Kwin/Plasma then I could log in again):

----
[Tue Oct  1 14:28:09 2019] i915 0000:00:02.0: GPU HANG: ecode 9:1:0x00000000, hang on rcs0
[Tue Oct  1 14:28:09 2019] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[Tue Oct  1 14:28:09 2019] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[Tue Oct  1 14:28:09 2019] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[Tue Oct  1 14:28:09 2019] The GPU crash dump is required to analyze GPU hangs, so please always attach it.
[Tue Oct  1 14:28:09 2019] GPU crash dump saved to /sys/class/drm/card0/error
[Tue Oct  1 14:28:09 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:28:09 2019] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[Tue Oct  1 14:28:09 2019] i915 0000:00:02.0: Resetting chip for hang on rcs0
[Tue Oct  1 14:28:09 2019] [drm] GuC communication stopped
[Tue Oct  1 14:28:09 2019] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[Tue Oct  1 14:28:09 2019] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[Tue Oct  1 14:28:09 2019] [drm] GuC communication enabled
[Tue Oct  1 14:28:09 2019] i915 0000:00:02.0: GuC firmware i915/kbl_guc_33.0.0.bin version 33.0 submission:disabled
[Tue Oct  1 14:28:09 2019] i915 0000:00:02.0: HuC firmware i915/kbl_huc_4.0.0.bin version 4.0 authenticated:yes
[Tue Oct  1 14:28:15 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:28:23 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:28:25 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:28:27 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:28:29 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:28:31 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:28:33 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:28:35 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:28:37 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:28:39 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:28:41 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:28:43 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:28:45 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:28:47 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:28:49 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:28:51 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:28:53 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:28:55 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:28:57 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:28:59 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:29:01 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:29:03 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:29:05 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:29:07 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:29:09 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:29:11 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:29:13 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:29:15 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:29:17 2019] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[Tue Oct  1 14:29:19 2019] i915 0000:00:02.0: GPU recovery timed out, cancelling all in-flight rendering.
[Tue Oct  1 14:29:19 2019] [drm] GuC communication stopped
[Tue Oct  1 14:29:19 2019] i915 0000:00:02.0: Resetting chip for hang on rcs0
[Tue Oct  1 14:29:19 2019] [drm] GuC communication enabled
[Tue Oct  1 14:29:19 2019] i915 0000:00:02.0: GuC firmware i915/kbl_guc_33.0.0.bin version 33.0 submission:disabled
[Tue Oct  1 14:29:19 2019] i915 0000:00:02.0: HuC firmware i915/kbl_huc_4.0.0.bin version 4.0 authenticated:yes
[Tue Oct  1 14:30:00 2019] sysrq: Keyboard mode set to system default
----
Comment 11 Kenneth C 2019-10-01 21:53:14 UTC
Created attachment 145609 [details]
/sys/class/drm/card0/error
Comment 12 Kenneth C 2019-10-01 21:55:17 UTC
Running the tip of Linus' tree (54ecb8f7028c5e) merged with drm-tip/drm-tip (9300459553e8c1032f10).
Comment 13 Kenneth C 2019-10-03 17:44:48 UTC
Created attachment 145628 [details]
/sys/class/drm/card0/error

It KEEPS happening, ever since the DRM updates were merged to Linus' master.

Is anyone reading thru these card0/error reports? Any clues? Anything I can/should try?

Daily I merge drm-tip and remain optimistic, but it's been two weeks of unstable operation (and because of the hardware I'm running, I need Linus' tip for power-management, and platform fixes).

FWIW, I was able to kill off X and run a hibernate; it brought the card back from constantly hanging (no doubt due to the power cycle) but I saw this (one on the way down, the latter on the way back up):

----
[102207.765555] i915 0000:00:02.0: Failed to idle engines, declaring wedged!
...
[102208.753786] i915 0000:00:02.0: Failed to idle engines, declaring wedged!
----
Comment 14 Kenneth C 2019-10-03 19:30:41 UTC
Created attachment 145629 [details]
/sys/class/drm/card0/error

:(
Comment 15 Lakshmi 2019-10-04 09:41:06 UTC
(In reply to Kenneth C from comment #14)
> Created attachment 145629 [details]
> /sys/class/drm/card0/error
> 
> :(

Can you disable the GUC and verify the issue? If the issue persists can you attach the error log?
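A minimal sketch for checking what the driver is currently configured to do, assuming the i915 module exposes its `enable_guc` parameter under `/sys/module/i915/parameters/` and that the value is a bitmask (-1 = auto, 0 = disabled, bit 0 = GuC submission, bit 1 = HuC load), as in kernels of this era. The function names are illustrative only.

```python
# Assumption: sysfs location of the i915 module parameter (usually readable by root only).
PARAM_PATH = "/sys/module/i915/parameters/enable_guc"


def decode_enable_guc(value):
    """Describe an i915.enable_guc value under the bitmask assumption above."""
    value = int(value)
    if value < 0:
        return "auto (platform default)"
    if value == 0:
        return "disabled"
    parts = []
    if value & 1:
        parts.append("GuC submission")
    if value & 2:
        parts.append("HuC load")
    return " + ".join(parts) or "unknown bits"


def read_enable_guc(path=PARAM_PATH):
    """Read the current parameter value from sysfs and decode it."""
    with open(path) as f:
        return decode_enable_guc(f.read().strip())


if __name__ == "__main__":
    print(read_enable_guc())
```

Under that assumption, booting with `i915.enable_guc=0` would disable both, and the "submission:disabled" plus "HuC ... authenticated:yes" lines in the logs above would be consistent with a value of 2.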
Comment 16 Kenneth C 2019-10-04 16:48:31 UTC
Created attachment 145649 [details]
/sys/class/drm/card0/error

Thank you for looking at my error traces again.

I've tried it once before without the GuC loaded, and it still hung, but I'll try it again.

Ironically enough, entering text on this website seems to trigger this bug- go figure. I was in the middle of typing this when it locked up again (which has happened before on this site). Error report attached. I had to unplug/replug my secondary monitor to unwedge the GPU again.
Comment 17 Kenneth C 2019-10-05 18:36:27 UTC
It's been about 24 hours without the GuC loaded, and it hasn't happened yet ... this is while running the drm-tip (as of 42dcf5adc9c4).

I'll let it go another day or so before saying that fixed it (that, or a combination of the stuff in the drm-tip) and if so, I'll try turning on GuC(/HuC) again and trying that as a control.

But what am I giving up by not using GuC(/HuC)? I run KDE/Plasma (with the compositor), rarely view videos outside of Plex and YouTube, VMWare with 3D turned on, but I never run games.
Comment 18 Kenneth C 2019-10-07 02:08:59 UTC
This was a hang without GuC(/HuC) enabled, as requested. Had to reboot to clear it up.

(ETA: apparently I cannot add attachments anymore; I'm getting an error when I hit "Submit". I have an error state for the non-GuC case I'd like to attach)
Comment 19 Kenneth C 2019-10-08 00:28:53 UTC
Another Non-GuC hangup, posted to https://bugs.freedesktop.org/show_bug.cgi?id=111920
Comment 20 Francesco Balestrieri 2019-10-10 06:21:04 UTC
Changing component to GuC given the feedback.
Comment 21 Lakshmi 2019-10-10 07:01:48 UTC
(In reply to Kenneth C from comment #16)
> Created attachment 145649 [details]
> /sys/class/drm/card0/error
> 
> Thank you for looking at my error traces again.
> 
> I've tried it once before without the GuC loaded, and it still had hung, but
> I'll try it again.
> 
> Ironically enough, entering text on this website seems to trigger this bug-
> go figure. I was in the middle of typing this when it locked up again (which
> has happened before on this site). Error report attached. I had to
> unplug/replug my secondary monitor to unwedge the GPU again.

Can you also attach the full dmesg from boot with kernel parameters drm.debug=0x1e log_buf_len=4M ?
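For reference, a short sketch of what the `drm.debug=0x1e` mask selects and how to confirm that both parameters made it onto the kernel command line. The bit names follow the DRM debug categories as I understand them for kernels of this era, so treat the table as an assumption. The helper names are illustrative.

```python
# Assumption: DRM debug category bits for kernels around 5.3.
DRM_DEBUG_BITS = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "PRIME",
    0x10: "ATOMIC",
    0x20: "VBL",
    0x40: "STATE",
    0x80: "LEASE",
    0x100: "DP",
}


def decode_drm_debug(mask):
    """Return the names of the debug categories enabled by mask, lowest bit first."""
    return [name for bit, name in sorted(DRM_DEBUG_BITS.items()) if mask & bit]


def cmdline_has(params, cmdline):
    """Report which of the requested parameters appear verbatim in cmdline."""
    present = set(cmdline.split())
    return {p: p in present for p in params}


if __name__ == "__main__":
    print(decode_drm_debug(0x1E))
    with open("/proc/cmdline") as f:
        print(cmdline_has(["drm.debug=0x1e", "log_buf_len=4M"], f.read()))
```

With that table, 0x1e enables the DRIVER, KMS, PRIME and ATOMIC categories. The larger `log_buf_len` keeps the extra output from overflowing the kernel ring buffer before it can be saved.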
Comment 22 Kenneth C 2019-10-10 11:07:41 UTC
(In reply to Francesco Balestrieri from comment #20)
> Changing component to GuC given the feedback.

See https://bugs.freedesktop.org/show_bug.cgi?id=111920 ; it happens WITHOUT GuC enabled as well.
Comment 23 Kenneth C 2019-10-10 11:28:31 UTC
(In reply to Lakshmi from comment #21)

> Can you also attach the full dmesg from boot with kernel parameters drm.debug=0x1e log_buf_len=4M ?

Guys, I appreciate the work and effort being done in the i915 driver (I've spent a lot of time looking at the code thinking I could help fix this and it's highly complex) but it's been three weeks now and this regression is killing my workflow- so I "fixed" the issue by reverting Sept 19th's drm-next merge from Linus' master[1], and I've had reliable operation again for days now.

I'll keep watching for the next DRM update, and I really hope this is happening in enough places to give you guys an idea of what's been happening so it can get fixed, but I can't beta-test this code any longer ... sorry.

I still have the branch with the faulty DRM code, though, and the next time I reboot I'll try to remember to add "drm.debug=0x1e" to the cmdline and boot the faulty branch so I can upload the dmesg.




[1] - Turned out to be less painful than I'd thought, too. If anyone else needs to do this, it's "git revert -m 1 574cc4539762", checking out "HEAD" for all the conflicted devices other than i915, fixing up a minor conflict in .../i915/ and cherry-picking 72e67f0463
Comment 24 Kenneth C 2019-10-11 17:01:14 UTC
I see there's a number of commits pushed to Linus' tip for the i915 today, some of which seem to be relevant to this issue, so I'll try them out.

Fingers crossed ....
Comment 25 Kenneth C 2019-10-11 20:05:04 UTC
Created attachment 145715 [details]
/sys/class/drm/card0/error

... nope :(

At least this time it recovered.
Comment 26 Kenneth C 2019-10-12 06:02:11 UTC
Created attachment 145717 [details]
/sys/class/drm/card0/error

Wow. This time I wasn't even doing anything, came back to it after a couple of hours to an unresponsive system. Back to my discarded-drm-next branch :(

