Bug 111978

Summary: GPU Hang and Failed to reset chip.
Product: DRI
Reporter: Yoshinori Gento <oss.linuxpf>
Component: DRM/Intel
Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: RESOLVED FIXED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: critical
Priority: not set
CC: intel-gfx-bugs
Version: unspecified   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: SKL
i915 features: GPU hang
Attachments:
Description Flags
/sys/class/drm/card0/error none

Description Yoshinori Gento 2019-10-11 10:16:44 UTC
Created attachment 145706
/sys/class/drm/card0/error

[Environment]
CPU: Skylake (Core i5-6500TE)
Distribution: Debian (customised)
Kernel: 4.19.57
Mesa: 18.3.6
libdrm: 2.4.89

[dmesg]
[10524.095632] [drm] GPU HANG: ecode 9:0:0x85dffffb, in xxxx [2606], reason: hang on rcs0, action: reset
[10524.096671] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[160603.626044] mod_lipc:lipc_write_lipc() destination is full, tid=3275, comm=CcmTimer
[160608.822588] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0, bcs0
[160608.825117] i915 0000:00:02.0: Resetting chip for hang on rcs0, bcs0
[160608.829851] [drm:gen8_reset_engines] *ERROR* bcs0: reset request timeout
[160608.940281] [drm:gen8_reset_engines] *ERROR* bcs0: reset request timeout
[160609.048277] [drm:gen8_reset_engines] *ERROR* bcs0: reset request timeout
[160609.153847] i915 0000:00:02.0: Failed to reset chip
[160609.158573] [drm:gen8_reset_engines] *ERROR* bcs0: reset request timeout
i965: Failed to submit batchbuffer: Input/output error

[description]
It occurred only once in about 400 days of total (non-continuous) operation.
The GPU appears to have hung twice on this machine.
The first hang was recovered by resetting rcs0.
The second could not be recovered, even by a full chip reset.
I have attached the error file.
It contains information about the first hang only.
I do not know whether the second hang is related to the first.
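
For anyone who wants to spot the same sequence in their own logs, here is a minimal sketch that classifies the dmesg lines quoted above into hang, engine reset, chip reset, and failed chip reset events. The patterns are copied from this report's 4.19 output and are an assumption for any other kernel version; the names `summarize` and `unrecoverable` are hypothetical helpers, not part of i915 or any existing tool.

```python
import re

# Sample lines copied from the [dmesg] section of this report.
SAMPLE = [
    "[10524.095632] [drm] GPU HANG: ecode 9:0:0x85dffffb, in xxxx [2606], reason: hang on rcs0, action: reset",
    "[10524.096671] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0",
    "[160608.822588] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0, bcs0",
    "[160608.825117] i915 0000:00:02.0: Resetting chip for hang on rcs0, bcs0",
    "[160608.829851] [drm:gen8_reset_engines] *ERROR* bcs0: reset request timeout",
    "[160609.153847] i915 0000:00:02.0: Failed to reset chip",
]

# Assumption: message wording as seen on kernel 4.19.57; other kernels may differ.
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]")
RESET = re.compile(r"Resetting (\w+) for hang on ")


def summarize(lines):
    """Return a list of (timestamp, kind) tuples for hang-related lines."""
    events = []
    for line in lines:
        ts_match = TIMESTAMP.match(line)
        ts = float(ts_match.group(1)) if ts_match else None
        if "GPU HANG:" in line:
            events.append((ts, "hang"))
        elif "Failed to reset chip" in line:
            events.append((ts, "chip_reset_failed"))
        else:
            m = RESET.search(line)
            if m:
                # "Resetting chip ..." is a full reset; anything else is one engine.
                kind = "chip_reset" if m.group(1) == "chip" else "engine_reset"
                events.append((ts, kind))
    return events


def unrecoverable(events):
    """True if the log shows a full chip reset that did not succeed."""
    return any(kind == "chip_reset_failed" for _, kind in events)


if __name__ == "__main__":
    for ts, kind in summarize(SAMPLE):
        print(ts, kind)
    print("unrecoverable:", unrecoverable(summarize(SAMPLE)))
```

To check a live system, the same functions could be fed the lines of saved `dmesg` output instead of `SAMPLE`.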

Comment 1 Chris Wilson 2019-10-11 20:07:17 UTC
The GPU dying as a result of an invalid sequence of instructions is not entirely impossible. It could be a result of a missed application of a workaround after the reset, my memory suggests that there was such a bug on Skylake circa v4.19. I would strongly suggest checking with a later kernel. However, that is only likely to fix the subsequent lockup...

Comment 2 Yoshinori Gento 2019-10-16 01:33:47 UTC
> The GPU dying as a result of an invalid sequence of instructions is not
> entirely impossible. It could be a result of a missed application of a
> workaround after the reset, my memory suggests that there was such a bug on
> Skylake circa v4.19. I would strongly suggest checking with a later kernel.
> However, that is only likely to fix the subsequent lockup...

I will try a newer kernel.
It seems to be a rare bug.
I am closing this ticket as fixed; if I see the same symptom again, I will re-open it.
Thank you.
