Bug 102435 - [skl,kbl] [drm] GPU hang in Valve games based on Source 1
Summary: [skl,kbl] [drm] GPU hang in Valve games based on Source 1
Status: RESOLVED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/DRI/i965
Version: 17.2
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Jason Ekstrand
QA Contact: Intel 3D Bugs Mailing List
URL:
Whiteboard:
Keywords: bisected, regression
Duplicates: 103973 104223 104324
Depends on:
Blocks: mesa-17.3
Reported: 2017-08-27 18:46 UTC by Robert
Modified: 2018-02-16 16:18 UTC (History)
CC List: 7 users

See Also:
i915 platform: KBL
i915 features: GPU hang


Attachments
GPU dump file, CSGO dump file and dmesg output (187.51 KB, application/gzip)
2017-08-27 18:46 UTC, Robert
2nd set - Triplet of Crashes (406.43 KB, application/gzip)
2017-09-01 01:16 UTC, Robert
hack (739 bytes, patch)
2017-11-02 09:22 UTC, Tapani Pälli
Error state from sklgt4 (68.27 KB, application/octet-stream)
2017-11-02 23:21 UTC, Jordan Justen
crash dump (90.38 KB, text/plain)
2018-02-16 15:04 UTC, Yahor Berdnikau

Description Robert 2017-08-27 18:46:18 UTC
Created attachment 133817 [details]
GPU dump file, CSGO dump file and dmesg output

CSGO crashed after playing ~2 hours in and out of matches.  The following was reported in dmesg:

[ 7987.649974] [drm] GPU HANG: ecode 9:0:0x86df7cf9, in csgo_linux64 [4947], reason: Hang on rcs, action: reset
[ 7987.649976] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 7987.649978] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 7987.649979] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 7987.649980] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 7987.649981] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 7987.650057] drm/i915: Resetting chip after gpu hang
[ 7987.650622] [drm] RC6 on
[ 8001.652386] drm/i915: Resetting chip after gpu hang
[ 8001.652537] [drm] RC6 on
[ 8013.652392] drm/i915: Resetting chip after gpu hang
[ 8013.652531] [drm] RC6 on
[ 8027.636176] drm/i915: Resetting chip after gpu hang
[ 8027.636314] [drm] RC6 on
[ 8038.644153] drm/i915: Resetting chip after gpu hang
[ 8038.644306] [drm] RC6 on
[ 8038.843763] show_signal_msg: 65 callbacks suppressed
[ 8038.843765] csgo_linux64[5008]: segfault at 1338 ip 00007f04bfe3f2a9 sp 00007f0444182710 error 6 in client_client.so[7f04bf1c6000+17cf000]

I've included this as well as the GPU crash dump in the attachment.
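
For anyone triaging similar reports, the GPU HANG summary line in the dmesg output above can be picked apart with a short script. This is only a minimal sketch: the regular expression and the helper name `parse_gpu_hang` are illustrative assumptions based on the log lines quoted in this bug, and other kernel versions may word the line differently.

```python
import re
from typing import Optional

# Pattern for the i915 "GPU HANG" summary line as quoted in this report.
# The exact wording is an assumption taken from those samples only.
GPU_HANG_RE = re.compile(
    r"\[\s*(?P<timestamp>\d+\.\d+)\]\s+\[drm\] GPU HANG: "
    r"ecode (?P<ecode>\S+), "
    r"in (?P<process>\S+) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>.+?), "
    r"action: (?P<action>\w+)"
)


def parse_gpu_hang(line: str) -> Optional[dict]:
    """Return the fields of a GPU HANG line, or None if the line does not match."""
    match = GPU_HANG_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric fields so they can be compared or sorted.
    fields["timestamp"] = float(fields["timestamp"])
    fields["pid"] = int(fields["pid"])
    return fields


if __name__ == "__main__":
    sample = (
        "[ 7987.649974] [drm] GPU HANG: ecode 9:0:0x86df7cf9, "
        "in csgo_linux64 [4947], reason: Hang on rcs, action: reset"
    )
    print(parse_gpu_hang(sample))
```

The process name and ecode are usually the first things worth comparing between reports when deciding whether two hangs might be the same issue.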
Comment 1 Robert 2017-08-27 18:53:48 UTC
I'd also like to mention:

Dell XPS 13 9360 DE
Ubuntu(Xubuntu) 17.10 (in development with current updates)

I will gladly provide more information if there are any other questions regarding packages or version numbers.
Comment 2 Robert 2017-08-27 19:16:00 UTC
Mesa:  17.2.0~rc4-0ubuntu3
xserver-xorg-video-intel:  2:2.99.917+git20170309-0ubuntu1
Comment 3 Tapani Pälli 2017-08-28 03:48:49 UTC
Possibly related to bug #102226.
Comment 4 Robert 2017-09-01 01:16:37 UTC
Created attachment 133915 [details]
2nd set - Triplet of Crashes
Comment 5 Robert 2017-09-01 03:10:15 UTC
FYI, it seems I've cornered the cause of the crashing.  If I set Multisampling Anti-Aliasing Mode to None, the game crashes every time I join a server and view the first in-game [Continue] banner screen.  If I set Multisampling Anti-Aliasing Mode to 2xMSAA, the game loads and plays just fine.
Comment 6 Robert 2017-09-07 01:15:55 UTC
Issue still present with 17.2.0-0ubuntu1
Comment 7 Robert 2017-09-20 01:41:24 UTC
There are reports of possibly the same issue being solved by downgrading from 17.2.0 to 17.1.8-2:

https://github.com/ValveSoftware/csgo-osx-linux/issues/1523

Maybe that helps in narrowing down the issue to a change between versions.
Comment 8 Robert 2017-09-20 02:05:02 UTC
Changing Multisampling Anti-Aliasing Mode from 2xMSAA (which works fine) to None in Team Fortress 2 also crashes that game immediately.
Comment 9 Tapani Pälli 2017-09-21 07:42:28 UTC
FYI reproduced on my KBL too, Team Fortress 2 seems to trigger this quite easily.
Comment 10 Kerrick Staley 2017-09-27 08:34:25 UTC
Confirming what others have said: this issue is only present with Mesa 17.2.0 with MSAA disabled in the game settings. The issue is not present with Mesa 17.1.8 or with MSAA set to "2x MSAA" in the game settings.
Comment 11 Robert 2017-10-21 22:33:42 UTC
Still present in the official release of Ubuntu 17.10 along with Mesa 17.2.2.
Comment 12 MD 2017-10-24 19:06:34 UTC
Can confirm exact same issue for Counter-Strike: Source via Steam on Intel Corporation HD Graphics 520 (i915) on Ubuntu 17.10 (kernel 4.13.0-16-generic). Setting Aliasing Mode to 2xMSAA also helped (how did you even find out that it does?).
Comment 13 Robert 2017-10-24 23:19:52 UTC
I was trying various settings to see if anything helped alleviate the crashes, and got lucky rather early with making that single change. :)
Comment 14 Tapani Pälli 2017-10-27 06:23:46 UTC
Could someone else give a test with latest Mesa master? I played TF2 for a while now without MSAA and did not reproduce the hang.
Comment 15 Tapani Pälli 2017-10-27 08:04:24 UTC
(In reply to Tapani Pälli from comment #14)
> Could someone else give a test with latest Mesa master? I played TF2 for a
> while now without MSAA and did not reproduce the hang.

Forget about that, just reproduced it again :/ Will attempt a bisect later.
Comment 16 Robert 2017-11-02 00:53:16 UTC
Kisak-valve had performed a bisect here:
https://github.com/ValveSoftware/csgo-osx-linux/issues/1509#issuecomment-339126634

Also looks like he contacted a dev?
Comment 17 Matt Turner 2017-11-02 01:15:44 UTC
(In reply to Robert from comment #16)
> Kisak-valve had performed a bisect here:
> https://github.com/ValveSoftware/csgo-osx-linux/issues/1509#issuecomment-339126634
> 
> Also looks like he contacted a dev?

The commit he bisected to is:

commit 3e57e9494c2279580ad6a83ab8c065d01e7e634e
Author: Jason Ekstrand <jason.ekstrand@intel.com>
Date:   Wed Jun 21 21:35:07 2017 -0700

    i965: Enable regular fast-clears (CCS_D) on gen9+

Reassigning.
Comment 18 Tapani Pälli 2017-11-02 09:22:18 UTC
Created attachment 135206 [details] [review]
hack

This hack applies on top of bisected commit. With CCS_E commented out, cannot reproduce the hang.
Comment 19 Tapani Pälli 2017-11-02 11:35:27 UTC
(In reply to Tapani Pälli from comment #18)
> Created attachment 135206 [details] [review]
> hack
> 
> This hack applies on top of bisected commit. With CCS_E commented out,
> cannot reproduce the hang.

With the caveat that I was not playing for ~2 hours, but with Team Fortress 2 this typically happens very fast.
Comment 20 Jordan Justen 2017-11-02 23:21:05 UTC
Created attachment 135218 [details]
Error state from sklgt4
Comment 21 Jason Ekstrand 2017-11-09 01:14:11 UTC
I did a little looking at this and can repro with TF2.  I pulled two error states and both seem to be the third PIPE_CONTROL after a stream of 3DPRIMITIVE calls each of which draws a single quad.  The 3DPRIMITIVE is writing PS depth count.  I have no idea how much of that information is useful yet.
Comment 22 Mark Janes 2017-11-27 21:22:27 UTC
Jason has suggested reverting 3e57e9494c2279580ad6a83ab8c065d01e7e634e for mesa 17.3
Comment 23 Kenneth Graunke 2017-11-30 01:35:10 UTC
This one still needs a partial-revert from Jason
Comment 24 Mark Janes 2017-12-01 16:30:58 UTC
*** Bug 103973 has been marked as a duplicate of this bug. ***
Comment 25 Kenneth Graunke 2017-12-01 18:21:35 UTC
This should be fixed by:

commit ee57b15ec764736e2d5360beaef9fb2045ed0f68
Author: Jason Ekstrand <jason.ekstrand@intel.com>
Date:   Wed Nov 29 16:22:42 2017 -0800

    i965: Disable regular fast-clears (CCS_D) on gen9+
    
    This partially reverts commit 3e57e9494c2279580ad6a83ab8c065d01e7e634e
    which caused a bunch of GPU hangs on several Source titles.  To date, we
    have no clue why these hangs are actually happening.  This undoes the
    final effect of 3e57e9494c227 and gets us back to not hanging.  Tested
    with Team Fortress 2.
    
    Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=102435
    Fixes: 3e57e9494c2279580ad6a83ab8c065d01e7e634e
    Cc: mesa-stable@lists.freedesktop.org

If not, please reopen.  Thanks for the reports and your patience!
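
To check whether a local Mesa git checkout already carries the partial revert, one option is `git merge-base --is-ancestor`. The following is a minimal sketch, not an official Mesa tool: the helper names and the `repo_dir` parameter are hypothetical, and it assumes a checkout of master. Stable branches receive cherry-picks under different hashes, so the check may report a false negative there.

```python
import subprocess
from typing import List

# Commit hashes quoted in this bug.
REGRESSION = "3e57e9494c2279580ad6a83ab8c065d01e7e634e"
FIX = "ee57b15ec764736e2d5360beaef9fb2045ed0f68"


def is_ancestor_cmd(commit: str, ref: str = "HEAD") -> List[str]:
    """Build the git command that tests whether `commit` is an ancestor of `ref`."""
    return ["git", "merge-base", "--is-ancestor", commit, ref]


def contains_commit(repo_dir: str, commit: str, ref: str = "HEAD") -> bool:
    """Return True if `ref` in the checkout at `repo_dir` contains `commit`.

    git exits with 0 when the commit is an ancestor and 1 when it is not;
    any other status (for example an unknown hash) is raised as an error.
    """
    result = subprocess.run(is_ancestor_cmd(commit, ref), cwd=repo_dir)
    if result.returncode in (0, 1):
        return result.returncode == 0
    raise RuntimeError(f"git merge-base failed with status {result.returncode}")


def is_affected(repo_dir: str, ref: str = "HEAD") -> bool:
    """A checkout is suspect if it has the regression but not the partial revert."""
    return contains_commit(repo_dir, REGRESSION, ref) and not contains_commit(
        repo_dir, FIX, ref
    )
```

For example, `is_affected("/path/to/mesa")` (path is a placeholder) would return True for a master checkout taken between the two commits.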
Comment 26 Chris Wilson 2017-12-12 13:11:09 UTC
*** Bug 104223 has been marked as a duplicate of this bug. ***
Comment 27 Jason Ekstrand 2017-12-14 01:56:27 UTC
A proper fix for this is now on the mailing list:

https://patchwork.freedesktop.org/series/35325/

With that, I can now run TF2 just fine on SKL with CCS_E re-enabled for sRGB.
Comment 28 Horst Schirmeier 2017-12-19 07:36:59 UTC
*** Bug 104324 has been marked as a duplicate of this bug. ***
Comment 29 omega 2017-12-19 08:20:10 UTC
We also have this one:

https://bugs.freedesktop.org/show_bug.cgi?id=103509
Comment 30 Yahor Berdnikau 2018-02-16 15:02:07 UTC
I just had a similar crash:

[90130.822757] [drm] GPU HANG: ecode 9:0:0x84dfbffc, in X [10017], reason: Hang on rcs0, action: reset
[90130.822759] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[90130.822760] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[90130.822760] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[90130.822760] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[90130.822761] [drm] GPU crash dump saved to /sys/class/drm/card1/error
[90130.822766] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[90142.849781] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[90156.801766] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[90166.849852] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[90178.817831] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[90179.535886] nouveau 0000:01:00.0: disp: 0x00006820[0]: INIT_GENERIC_CONDITON: unknown 0x07
[90180.623289] [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.

Gentoo linux (Lenovo Thinkpad P51):
mesa-17.3.3
gentoo-sources-4.15.3
xorg-server-1.19.5

Using modesetting driver.
Comment 31 Yahor Berdnikau 2018-02-16 15:04:19 UTC
Created attachment 137394 [details]
crash dump
Comment 32 Jason Ekstrand 2018-02-16 16:17:13 UTC
Unless this happened while playing a Valve game it is likely completely unrelated.  Please file a new bug and include enough details to reproduce.

