107866 – GPU Hangs Abruptly

Summary: GPU Hangs Abruptly

Status:	RESOLVED MOVED

Alias:	None

Product:	Mesa
Classification:	Unclassified
Component:	Drivers/DRI/i965
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	high blocker
Assignee:	Intel 3D Bugs Mailing List
QA Contact:	Intel 3D Bugs Mailing List

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2018-09-08 12:24 UTC by bluePain
Modified:	2019-09-25 19:13 UTC
CC List:	3 users

See Also:
i915 platform:
i915 features:

Attachments
Intel crash log read from /sys/class/drm/card0/error (55.44 KB, text/x-log) 2018-09-08 12:24 UTC, bluePain
2. Intel crash log read from /sys/class/drm/card0/error (84.89 KB, text/x-log) 2018-09-08 12:24 UTC, bluePain
3. Intel crash log read from /sys/class/drm/card0/error (77.99 KB, text/x-log) 2018-09-11 11:25 UTC, bluePain
glxinfo -B output as required by Denis (1.08 KB, text/plain) 2018-09-11 21:37 UTC, bluePain
4. Intel crash log read from /sys/class/drm/card0/error (63.90 KB, text/plain) 2018-10-10 07:11 UTC, bluePain
Error in dmesg (2.73 KB, text/plain) 2018-10-27 02:19 UTC, lefteye
Crash log found at /sys/class/drm/card0/error (76.60 KB, text/plain) 2018-10-27 02:20 UTC, lefteye
glxinfo -B (1.06 KB, text/plain) 2018-10-27 02:22 UTC, lefteye
Crash log found at /sys/class/drm/card0/error (31/10/2018) (57.21 KB, text/plain) 2018-10-31 15:17 UTC, lefteye
Supertuxkart i915/i965 Apitrace dump (3.73 KB, text/plain) 2018-10-31 15:19 UTC, lefteye
Error in dmesg (31/10/2018) (642 bytes, text/plain) 2018-10-31 15:21 UTC, lefteye
/sys/class/drm/card0/error from Mike (15.31 KB, text/plain) 2019-04-04 07:33 UTC, Mike Kuznetsov

Description bluePain 2018-09-08 12:24:13 UTC

Created attachment 141481 [details]
Intel crash log read from /sys/class/drm/card0/error

The GPU hangs suddenly. It has happened twice in the last 1.5 months. Background applications, such as Spotify, keep running. The DE stops responding entirely for a short time (~30 seconds) and then comes back with a crash dump.

Linux 4.18.5-arch1-1-ARCH #1 SMP PREEMPT Fri Aug 24 12:48:58 UTC 2018 x86_64 GNU/Linux

00:02.0 VGA compatible controller: Intel Corporation UHD Graphics 620 (rev 07)
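
For anyone collecting the same data: the crash dump attached here is read from /sys/class/drm/card0/error. A minimal Python sketch for copying it to a timestamped file before it is lost on reboot is shown below. The function name `save_crash_dump`, the output file name pattern, and the "skip if empty" check are illustrative assumptions, not part of any i915 tooling; `card0` may be a different card on other machines, and reading the node may require root.

```python
import time
from pathlib import Path

# Path taken from the dmesg message in this report; "card0" is an
# assumption for other machines.
ERROR_NODE = Path("/sys/class/drm/card0/error")


def save_crash_dump(src=ERROR_NODE, dest_dir="."):
    """Copy the GPU error state to a timestamped file.

    Returns the path of the new file, or None if the source is empty.
    (Hypothetical helper; the empty check is only a convenience.)
    """
    data = Path(src).read_bytes()
    if not data.strip():
        return None
    dest = Path(dest_dir) / time.strftime("i915-error-%Y%m%d-%H%M%S.log")
    dest.write_bytes(data)
    return dest
```

Calling `save_crash_dump()` right after a hang would then give a file that can be attached to the bug.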

Comment 1 bluePain 2018-09-08 12:24:49 UTC

Created attachment 141482 [details]
2. Intel crash log read from /sys/class/drm/card0/error

Comment 2 bluePain 2018-09-08 12:30:24 UTC

dmesg output when the GPU has crashed:

[150585.225717] PPP generic driver version 2.4.2
[150585.486803] PPP BSD Compression module registered
[150585.488316] PPP Deflate Compression module registered
[151567.369673] [drm] GPU HANG: ecode 9:0:0x87f9fff9, in chrome [3234], reason: hang on rcs0, action: reset
[151567.369674] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[151567.369674] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[151567.369675] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[151567.369675] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[151567.369675] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[151567.369769] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[151570.350999] asynchronous wait on fence i915:gnome-shell[5831]/1:860d0 timed out
[151570.351004] asynchronous wait on fence i915:gnome-shell[5831]/1:860d0 timed out
[151575.257790] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[151583.364426] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[151591.257657] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[151599.364228] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
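
The "GPU HANG" line in the log above carries the fields that matter for triage (ecode, process, PID, reason, action). A rough Python sketch for pulling them out of a dmesg line follows; the regular expression is written against the lines quoted in this bug only, and the name `parse_gpu_hang` is made up for illustration.

```python
import re

# Pattern modelled on the "GPU HANG" lines quoted in this report;
# other kernel versions may format the message differently.
HANG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+\[drm\] GPU HANG: ecode (?P<ecode>[^,]+), "
    r"in (?P<process>.+?) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>.+?), action: (?P<action>\w+)"
)


def parse_gpu_hang(line):
    """Return a dict of fields from a GPU HANG dmesg line, or None."""
    match = HANG_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    info["ts"] = float(info["ts"])    # seconds since boot
    info["pid"] = int(info["pid"])
    return info
```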

Comment 3 bluePain 2018-09-11 11:24:34 UTC

Today the GPU hung again, with the following dmesg output.

(Log file attached)

[95759.685883] [drm] GPU HANG: ecode 9:0:0x87f9fff9, in chrome [17795], reason: hang on rcs0, action: reset
[95759.685885] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[95759.685885] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[95759.685885] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[95759.685885] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[95759.685886] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[95759.685971] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[95767.787768] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[95775.684168] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[95783.787272] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[95791.680429] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
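
Going by the timestamps above, the "Resetting rcs0" messages recur roughly every 8 seconds. A small Python sketch that computes the gaps between consecutive reset messages from pasted dmesg text is given below; `reset_intervals` is a hypothetical helper name, and the pattern only covers the "Resetting ... for hang on" wording seen in this bug.

```python
import re

# Matches the reset lines quoted in this report, e.g.
# "[95759.685971] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0"
RESET_RE = re.compile(r"\[\s*(\d+\.\d+)\].*Resetting \w+ for hang on")


def reset_intervals(dmesg_text):
    """Return the time in seconds between consecutive reset messages."""
    stamps = []
    for line in dmesg_text.splitlines():
        match = RESET_RE.search(line)
        if match:
            stamps.append(float(match.group(1)))
    # Pair each timestamp with the next one and take the difference.
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]
```

A steady interval like this suggests the engine hangs again shortly after each reset rather than recovering, though that reading is only an inference from the log.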

Comment 4 bluePain 2018-09-11 11:25:07 UTC

Created attachment 141525 [details]
3. Intel crash log read from /sys/class/drm/card0/error

Comment 5 bluePain 2018-09-11 11:26:41 UTC

I'll try running the applications that trigger the GPU hangs on the dGPU using optirun.

Comment 6 Denis 2018-09-11 13:42:51 UTC

Hello. Please provide the Mesa version you are using (glxinfo -B).
Also, could you please try downgrading the kernel to 4.17 or 4.15 and test on that?

Comment 7 bluePain 2018-09-11 21:37:51 UTC

Created attachment 141531 [details]
glxinfo -B output as required by Denis

Comment 8 bluePain 2018-09-11 21:47:10 UTC

Hello Denis,

I attached the glxinfo output as you requested.

Today I got another crash while resuming after suspend. Here is the only output I got from dmesg:

[10081.255459] [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe B FIFO underrun
[10081.335164] [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe A FIFO underrun


gdm couldn't restore the session and I ended up in a login loop. I could see the login screen, but the session hung after I entered my password. I tried killing all open gdm sessions and restarting the service, but no luck: I got the same errors both times I restarted gdm. Then I rebooted and the problem was gone.

As I mentioned in my previous comment, I started using optirun for the apps, such as Chrome, that were causing i915 to crash, but nothing changed. Chrome was running on the dGPU before I put the device into suspend.

Comment 9 bluePain 2018-09-12 06:54:14 UTC

Here is another error regarding i915 after resuming from suspend:

[31086.679530] [drm:intel_pipe_update_end [i915]] *ERROR* Atomic update failure on pipe A (start=15769 end=15770) time 646 us, min 1073, max 1079, scanline start 1056, end 1098

Comment 10 bluePain 2018-10-10 07:02:41 UTC

Today I got another hang, with the following errors in dmesg.

[66562.914106] [drm:intel_pipe_update_end [i915]] *ERROR* Atomic update failure on pipe B (start=1242982 end=1242983) time 1329 us, min 1073, max 1079, scanline start 1033, end 0
[87748.971802] [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe A FIFO underrun


This time, fortunately, GNOME successfully restarted the session after 2 or 3 minutes of black screen, but all applications from the previous session were closed.

Comment 11 bluePain 2018-10-10 07:11:25 UTC

Created attachment 141971 [details]
4. Intel crash log read from /sys/class/drm/card0/error

Comment 12 lefteye 2018-10-27 02:18:28 UTC

Hi,
I'm having the exact same issue on my laptop (Core i7 Sandy Bridge with HD 3000), which otherwise runs fine and clean. It occurs almost every time I play a game like 0ad or supertuxkart. Everything is fine until the whole desktop freezes for 30 seconds while music is still playing. Several seconds later, everything is back to normal. Sometimes the game just crashes, but the rest of the desktop always ends up coming back as if nothing happened... I tried to reproduce it with other intensive apps, with no success.

My dmesg shows the same messages as yours. There's also an Intel crash log (see attachments).

Comment 13 lefteye 2018-10-27 02:19:55 UTC

Created attachment 142224 [details]
Error in dmesg

Comment 14 lefteye 2018-10-27 02:20:41 UTC

Created attachment 142225 [details]
Crash log found at /sys/class/drm/card0/error

Comment 15 lefteye 2018-10-27 02:22:20 UTC

Created attachment 142226 [details]
glxinfo -B

Comment 16 Denis 2018-10-30 12:59:22 UTC

Hi. How long do you play the game before the hang occurs?
I played supertuxkart for about 1 hour and nothing happened.
I used the same Mesa and kernel versions as you (18.2.3),
on an SNB CPU (Intel Core i5-2520M, Intel® HD Graphics 3000).

Any additional information would be helpful. Also, if it reproduces reliably on your PC, could you make an apitrace? https://github.com/apitrace/apitrace

Comment 17 lefteye 2018-10-30 19:10:33 UTC

Hi Denis,

Thanks for your message.
Well, the GPU sometimes hangs a few seconds or a few minutes after I launch supertuxkart. It happens faster if some other application that needs the GPU is running. For example, with SMPlayer (mpv). While the computer is frozen, I can still hear SMPlayer playing. There is no overheating (I fully cleaned my laptop last summer, even changed the thermal paste). CPU tops at 65ºC under load, and 35-43ºC when idle.

I tried a few things: disabling IOMMU and other stuff in the kernel parameters, recompiling a custom kernel for Arch, creating a new user profile... Nothing helped. I tried the latest Debian testing: same result. I'm starting to suspect VA-API or other GL/DRM-related stuff, so I'll keep on testing, but I'm a bit lost. By the way, I always use KDE Plasma.
Unfortunately, I have since reinstalled the whole computer and now use Gentoo. But the issue also occurred earlier, when I tried in Arch. So I'll post the apitrace output when it happens in Gentoo (the kernel and Mesa versions are different).
Cheers,
Chris

Comment 18 lefteye 2018-10-31 15:14:44 UTC

(In reply to Denis from comment #16)

Hi Denis,
I ran a few tests today in Gentoo. Supertuxkart still crashes with the same kind of messages, although a bit different.
I have attached the apitrace dump, dmesg messages and the error in /sys/class/drm/card0.
Thanks.

Chris

Comment 19 lefteye 2018-10-31 15:17:07 UTC

Created attachment 142304 [details]
Crash log found at /sys/class/drm/card0/error (31/10/2018)

Comment 20 lefteye 2018-10-31 15:19:38 UTC

Created attachment 142305 [details]
Supertuxkart i915/i965 Apitrace dump

Comment 21 lefteye 2018-10-31 15:21:23 UTC

Created attachment 142306 [details]
Error in dmesg (31/10/2018)

Comment 22 lefteye 2018-10-31 19:45:06 UTC

I have just realized that the supertuxkart apitrace is 700 MB... I have uploaded the trace to my Google Drive at:

https://drive.google.com/file/d/18sVMRL7VpWvh8-1KzBHODLR-iEfIFPS_/view?usp=sharing

Comment 23 lefteye 2018-10-31 19:46:46 UTC

Comment on attachment 142305 [details]
Supertuxkart i915/i965 Apitrace dump

This is only the first page of the supertuxkart apitrace dump.
Full apitrace downloadable here:
https://drive.google.com/file/d/18sVMRL7VpWvh8-1KzBHODLR-iEfIFPS_/view?usp=sharing

Comment 24 Denis 2018-11-01 10:46:40 UTC

Thank you, I will check. I forgot to mention how large apitraces can get.

Comment 25 Denis 2018-11-01 17:34:42 UTC

OK... it looks like I reproduced the issue with the provided apitrace, but I have no idea how... I mean, the issue is not straightforward or stable to reproduce. I launched a browser with OpenGL rendering in it (one of the available demos) and then replayed the provided trace, maybe 3 or 5 times.

And, according to dmesg, I got 1 hang:

[ 7991.390603] powercap intel-rapl:0: package locked by BIOS, monitoring only
[ 8214.539084] perf: interrupt took too long (3142 > 3137), lowering kernel.perf_event_max_sample_rate to 63500
[ 8360.400165] [drm] GPU HANG: ecode 6:0:0x85fffffc, in glretrace [20413], reason: hang on rcs0, action: reset
[ 8360.400167] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 8360.400168] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 8360.400168] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 8360.400169] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 8360.400170] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 8360.400214] i915 0000:00:02.0: Resetting chip for hang on rcs0
[ 8456.495729] intel_powerclamp: Start idle injection to reduce power


Investigating

Comment 26 Denis 2018-11-02 09:36:54 UTC

I also re-checked the existing bugs for SNB, and it looks like this bug has the same root cause:
https://bugs.freedesktop.org/show_bug.cgi?id=102379

At least, the ecode is the same as ours.

Comment 27 lefteye 2018-11-02 13:13:16 UTC

(In reply to Denis from comment #26)
Thanks Denis, if there is something else that you need me to report, please tell me.

Chris

Comment 28 Denis 2018-11-05 14:40:19 UTC

BTW, Chris, I checked the logs one more time, and I think your crash differs from the topic starter's... You have SNB, he has KBL. And the error codes are also different.

So I think a separate bug report needs to be created.
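
To make the comparison concrete, here is a minimal Python sketch that groups the ecodes quoted in this thread. The helper names are hypothetical. Reading the first ecode field as the GPU generation is an assumption; it is at least consistent with the logs here (9 on the KBL machine, 6 on SNB).

```python
from collections import defaultdict


def split_ecode(ecode):
    """Split an ecode such as "9:0:0x87f9fff9" into three integers.

    Assumption: the first field is the GPU generation; the meaning of
    the other two fields is not interpreted here.
    """
    first, second, code = ecode.split(":")
    return int(first), int(second), int(code, 16)


def group_by_ecode(reports):
    """Map each distinct ecode to the list of reports that showed it."""
    groups = defaultdict(list)
    for source, ecode in reports:
        groups[split_ecode(ecode)].append(source)
    return dict(groups)


# ecodes copied from the dmesg excerpts in this bug
REPORTS = [
    ("comment 2 (bluePain)", "9:0:0x87f9fff9"),
    ("comment 3 (bluePain)", "9:0:0x87f9fff9"),
    ("comment 25 (Denis, SNB replay)", "6:0:0x85fffffc"),
]
```

`group_by_ecode(REPORTS)` puts both bluePain hangs in one group and the SNB replay in another, which matches the suggestion to split the reports.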

Comment 29 Mike Kuznetsov 2019-04-04 07:32:12 UTC

My GPU freezes after opening a page with many images (20+) in Chromium.
/sys/class/drm/card0/error attached

mesa: 19.0.1-1ubuntu1

mike ~$ uname -a
Linux delorean 5.0.0-7-generic #8-Ubuntu SMP Mon Mar 4 16:27:25 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux


mike ~$ glxinfo -B
name of display: :0
display: :0  screen: 0
direct rendering: Yes
Extended renderer info (GLX_MESA_query_renderer):
    Vendor: Intel Open Source Technology Center (0x8086)
    Device: Mesa DRI Intel(R) Ironlake Mobile  (0x46)
    Version: 19.0.1
    Accelerated: yes
    Video memory: 1536MB
    Unified memory: yes
    Preferred profile: compat (0x2)
    Max core profile version: 0.0
    Max compat profile version: 2.1
    Max GLES1 profile version: 1.1
    Max GLES[23] profile version: 2.0
OpenGL vendor string: Intel Open Source Technology Center
OpenGL renderer string: Mesa DRI Intel(R) Ironlake Mobile 
OpenGL version string: 2.1 Mesa 19.0.1
OpenGL shading language version string: 1.20

OpenGL ES profile version string: OpenGL ES 2.0 Mesa 19.0.1
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 1.0.16

mike ~$ cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=19.04
DISTRIB_CODENAME=disco
DISTRIB_DESCRIPTION="Ubuntu Disco Dingo (development branch)"
mike ~$ dmesg | tail 
[288446.586620] i915 0000:00:02.0: Resetting chip for hang on rcs0
[288454.590794] i915 0000:00:02.0: Resetting chip for hang on rcs0
[288462.586720] i915 0000:00:02.0: Resetting chip for hang on rcs0
[288470.586669] i915 0000:00:02.0: Resetting chip for hang on rcs0
[288478.586768] i915 0000:00:02.0: Resetting chip for hang on rcs0
[288486.586821] i915 0000:00:02.0: Resetting chip for hang on rcs0
[288496.570617] i915 0000:00:02.0: Resetting chip for hang on rcs0
[288506.586681] i915 0000:00:02.0: Resetting chip for hang on rcs0
[288514.586677] i915 0000:00:02.0: Resetting chip for hang on rcs0
[288522.586699] i915 0000:00:02.0: Resetting chip for hang on rcs0
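
Since the device and Mesa version are what triage asks for first, the `glxinfo -B` output pasted earlier in this comment can be reduced to a key/value mapping. The Python sketch below assumes the "key: value" layout shown here; `parse_glxinfo` is an illustrative name, not an existing tool.

```python
def parse_glxinfo(text):
    """Turn `glxinfo -B` style output into a dict of key -> value.

    Lines without a value after the first colon (section headers,
    blank lines) are skipped.
    """
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            info[key.strip()] = value.strip()
    return info
```

With the output above, `info["Device"]` and `info["Version"]` would give the Ironlake Mobile device string and 19.0.1 respectively.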

Comment 30 Mike Kuznetsov 2019-04-04 07:33:12 UTC

Created attachment 143860 [details]
/sys/class/drm/card0/error from Mike

Comment 31 Denis 2019-04-04 14:32:10 UTC

Mike, please create a separate issue, because your case (steps) and HW/SW are quite different from the current problem.

Comment 32 GitLab Migration User 2019-09-25 19:13:53 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1757.
