Bug 111544 - GPU HANG: ecode 6:1:0xfffffffe, in spring-main [10578], hang on rcs0
Summary: GPU HANG: ecode 6:1:0xfffffffe, in spring-main [10578], hang on rcs0
Status: RESOLVED DUPLICATE of bug 102379
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: not set major
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-09-03 09:15 UTC by Christian Rebischke
Modified: 2019-09-06 10:45 UTC

See Also:
i915 platform:
i915 features:


Attachments
/sys/class/drm/card0/error (136.68 KB, text/plain)
2019-09-03 09:15 UTC, Christian Rebischke

Description Christian Rebischke 2019-09-03 09:15:45 UTC
Created attachment 145249
/sys/class/drm/card0/error

[drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[drm] GPU crash dump saved to /sys/class/drm/card0/error
Comment 1 Christian Rebischke 2019-09-03 09:18:18 UTC
More Information:

Hardware: Thinkpad X220
OS: Arch Linux
Windowmanager: Sway (Wayland + XWayland)

Additional information:
The CPU stayed near 97 degrees C at 100% usage the whole time because I was playing games on the Spring RTS engine.

Here is more dmesg output:
[13980.504547] mce: CPU1: Core temperature above threshold, cpu clock throttled (total events = 1)
[13980.504548] mce: CPU0: Core temperature above threshold, cpu clock throttled (total events = 1)
[13980.504550] mce: CPU0: Package temperature above threshold, cpu clock throttled (total events = 1)
[13980.504553] mce: CPU1: Package temperature above threshold, cpu clock throttled (total events = 1)
[13980.504571] mce: CPU2: Package temperature above threshold, cpu clock throttled (total events = 1)
[13980.504572] mce: CPU3: Package temperature above threshold, cpu clock throttled (total events = 1)
[13980.505517] mce: CPU0: Core temperature/speed normal
[13980.505518] mce: CPU1: Core temperature/speed normal
[13980.505520] mce: CPU2: Package temperature/speed normal
[13980.505521] mce: CPU3: Package temperature/speed normal
[13980.505522] mce: CPU1: Package temperature/speed normal
[13980.505523] mce: CPU0: Package temperature/speed normal
[14280.506447] mce: CPU0: Package temperature above threshold, cpu clock throttled (total events = 34472)
[14280.506448] mce: CPU1: Package temperature above threshold, cpu clock throttled (total events = 34472)
[14280.506460] mce: CPU3: Package temperature above threshold, cpu clock throttled (total events = 34472)
[14280.506461] mce: CPU2: Package temperature above threshold, cpu clock throttled (total events = 34472)
[14324.705510] mce: CPU3: Core temperature above threshold, cpu clock throttled (total events = 2838)
[14324.705511] mce: CPU2: Core temperature above threshold, cpu clock throttled (total events = 2838)
[14324.707529] mce: CPU2: Core temperature/speed normal
[14324.707530] mce: CPU3: Core temperature/speed normal
[14580.502402] mce: CPU3: Package temperature above threshold, cpu clock throttled (total events = 100030)
[14580.502404] mce: CPU2: Package temperature above threshold, cpu clock throttled (total events = 100030)
[14580.502420] mce: CPU0: Package temperature above threshold, cpu clock throttled (total events = 100030)
[14580.502422] mce: CPU1: Package temperature above threshold, cpu clock throttled (total events = 100030)
[14591.465253] i915 0000:00:02.0: GPU HANG: ecode 6:1:0xfffffffe, in spring-main [10578], hang on rcs0
[14591.465258] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[14591.465259] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[14591.465260] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[14591.465260] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[14591.465262] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[14591.466173] i915 0000:00:02.0: Resetting chip for hang on rcs0
[14880.500328] mce: CPU1: Package temperature/speed normal
[14880.500329] mce: CPU0: Package temperature/speed normal
[14880.500342] mce: CPU2: Package temperature/speed normal
[14880.500344] mce: CPU3: Package temperature/speed normal
[14988.455723] i915 0000:00:02.0: Resetting chip for hang on rcs0
[15180.454085] i915 0000:00:02.0: Resetting chip for hang on rcs0
[15180.498204] mce: CPU1: Package temperature above threshold, cpu clock throttled (total events = 213447)
[15180.498206] mce: CPU2: Package temperature above threshold, cpu clock throttled (total events = 213447)
[15180.498208] mce: CPU0: Package temperature above threshold, cpu clock throttled (total events = 213447)
[15180.498210] mce: CPU3: Package temperature above threshold, cpu clock throttled (total events = 213447)
[15180.532334] mce: CPU0: Core temperature above threshold, cpu clock throttled (total events = 81692)
[15180.532336] mce: CPU1: Core temperature above threshold, cpu clock throttled (total events = 81692)
[15180.533319] mce: CPU1: Core temperature/speed normal
[15180.533323] mce: CPU0: Core temperature/speed normal
[15480.496118] mce: CPU1: Package temperature/speed normal
[15480.496120] mce: CPU0: Package temperature/speed normal
[15480.496144] mce: CPU3: Package temperature/speed normal
[15480.496146] mce: CPU2: Package temperature/speed normal
[15480.529257] mce: CPU0: Core temperature above threshold, cpu clock throttled (total events = 125082)
[15480.529264] mce: CPU1: Core temperature above threshold, cpu clock throttled (total events = 125082)
[15480.530242] mce: CPU1: Core temperature/speed normal
[15480.530243] mce: CPU0: Core temperature/speed normal
[15588.558999] i915 0000:00:02.0: Resetting chip for hang on rcs0
[15590.478834] Asynchronous wait on fence i915:sway[1328]:163300 timed out (hint:intel_atomic_commit_ready+0x0/0x54 [i915])
[15596.452130] i915 0000:00:02.0: Resetting chip for hang on rcs0
[15604.558768] i915 0000:00:02.0: Resetting chip for hang on rcs0
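A minimal sketch, assuming Python is at hand, of how a log like the one above could be checked for how long after the last thermal throttle message each GPU hang or reset was logged. The function names and the sample text are illustrative only; the sample lines are copied from the dmesg output above.

```python
import re

# A dmesg line starting with a bracketed seconds-since-boot timestamp.
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def parse_dmesg(text):
    """Return (timestamp, message) pairs for every timestamped line."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            events.append((float(match.group(1)), match.group(2)))
    return events


def gpu_events_with_last_throttle(events):
    """For each i915 hang/reset line, return (timestamp, seconds since the
    most recent 'temperature above threshold' line), or None for the gap
    if no such line was seen earlier in the log."""
    last_throttle = None
    result = []
    for ts, msg in events:
        if "temperature above threshold" in msg:
            last_throttle = ts
        elif msg.startswith("i915") and (
            "GPU HANG" in msg or "Resetting chip" in msg
        ):
            gap = None if last_throttle is None else round(ts - last_throttle, 3)
            result.append((ts, gap))
    return result


# Three lines taken from the log above, used as sample input.
SAMPLE = """\
[14580.502422] mce: CPU1: Package temperature above threshold, cpu clock throttled (total events = 100030)
[14591.465253] i915 0000:00:02.0: GPU HANG: ecode 6:1:0xfffffffe, in spring-main [10578], hang on rcs0
[14591.466173] i915 0000:00:02.0: Resetting chip for hang on rcs0
"""

GAPS = gpu_events_with_last_throttle(parse_dmesg(SAMPLE))
```

Feeding it the full output of `dmesg` instead of `SAMPLE` would list every hang and reset with its distance from the preceding throttle message. A short gap only shows that the two events were close in time, not that one caused the other.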
Comment 2 Denis 2019-09-03 09:34:49 UTC
Hi Christian, which game was it? Also, please provide your Mesa version and tell us how reproducible this is: did it happen only once, or can you reproduce it reliably?
Comment 3 Chris Wilson 2019-09-03 11:28:32 UTC
Was the system using any swap at the time? There have been a couple of Sandybridge bugs that correlate with swapping, so I am checking whether this might fit that pattern.
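A minimal sketch of one way to answer the swap question, assuming the usual `/proc/meminfo` layout with `SwapTotal` and `SwapFree` fields reported in kB. The helper name `swap_usage_kib` is made up for illustration.

```python
def swap_usage_kib(meminfo_text):
    """Return (total, used) swap in KiB from /proc/meminfo-style text.

    Missing SwapTotal/SwapFree fields are treated as 0.
    """
    fields = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only "Name: <integer> [unit]" lines.
        if sep and parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    total = fields.get("SwapTotal", 0)
    free = fields.get("SwapFree", 0)
    return total, total - free
```

On a running system it could be called with the contents of `/proc/meminfo`, for example `swap_usage_kib(open("/proc/meminfo").read())`. A result of `(0, 0)` would mean no swap is configured at all. This is a snapshot, so it would have to be sampled while the game is running to say anything about the moment of the hang.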
Comment 4 Christian Rebischke 2019-09-03 16:10:00 UTC
No, there was no swapping involved. There was enough RAM, and I have neither a swap partition nor a swap file.

The game is called `spring1944`, but it also happens with 0ad, a game unrelated to Spring.

Mesa version is: 19.1.5-1

Basically, it happens whenever I get high CPU load together with graphical applications.
My guess is that the CPU throttles down and somehow the GPU gets confused by this. I have no idea whether this is related.

It happens every time during a game. But it's difficult to reproduce, because the screen freezes are not logged; only when the game ultimately crashes due to a GPU hang does anything show up in dmesg.

I tried capturing a GPU backtrace with Spring long ago, but it generates over 5 GB of data, and if only screen freezes occur (without a full game crash), nothing is logged about it.

I guess this is also related to: https://bugs.freedesktop.org/show_bug.cgi?id=102379
and https://bugs.freedesktop.org/show_bug.cgi?id=110971

Sorry, I only just realized that I have opened so many bug reports for the same thing. Feel free to close two of them, but as you can see the problem has existed since 2017. I just forgot that I had already reported this. Sorry.
Comment 5 Denis 2019-09-04 15:03:53 UTC
Oh, yes, you already reported this issue: https://bugs.freedesktop.org/show_bug.cgi?id=102379


>Basically it happens whenever I get high CPU load + graphical applications.
>My guess is that the CPU throttles down and somehow the GPU gets confused about
>this. No idea if this is related.

This is 100% related. When I tried to reproduce this issue with your apitrace or with a game, I got 1 or 2 hangs, exactly when I loaded the CPU/GPU as much as I could. But reproducibility is so poor that it is impossible to debug.

I also suspect that all these issues share the same root cause (but I couldn't prove it):

BZ IDs: 105288, 105219, 105116, 104180, 104044, 103745, 101822, 101604, 100396, 100103, 99864, 93402, 97271, 107866, 102379, 106495, 104822, 93842

I suggest closing this one as a duplicate of the ticket mentioned above.
Comment 6 Lakshmi 2019-09-06 10:45:45 UTC
(In reply to Denis from comment #5)
> oh, yes, you already reported this issue
> https://bugs.freedesktop.org/show_bug.cgi?id=102379
> 
> 
> >Basically it happens whenever I get high CPU load + graphical applications.
> >My guess is that the CPU throttles down and somehow the GPU gets confused about
> >this. No idea if this is related.
> 
> This is 100% related. When I tried to reproduce this issue with your
> apitrace or with a game, I got 1 or 2 hangs, exactly when I loaded the CPU/GPU
> as much as I could. But reproducibility is so poor that it is impossible to
> debug.
> 
> I also suspect that all these issues share the same root cause (but I couldn't
> prove it):
> 
> BZ IDs:
> 105288,105219,105116,104180,104044,103745,101822,101604,100396,100103,99864,
> 93402,97271,107866,102379,106495,104822,93842
> 
> I suggest closing this one as a duplicate of the ticket mentioned above.

Denis, Thanks for your assessment. Resolved as duplicate.

*** This bug has been marked as a duplicate of bug 102379 ***

