99561 – [SKL] GPU HANG: ecode 9:0:0x85dffffb, in portal2_linux [20076], reason: Hang on render ring, action: reset

Status:	RESOLVED FIXED

Alias:	None

Product:	Mesa
Classification:	Unclassified
Component:	Drivers/DRI/i965
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium normal
Assignee:	Joel
QA Contact:	Intel 3D Bugs Mailing List

URL:
Whiteboard:
Keywords:

Duplicates (2):	100906 101389
Depends on:
Blocks:

Reported:	2017-01-27 05:48 UTC by Joel
Modified:	2017-12-01 18:49 UTC
CC List:	6 users

See Also:
i915 platform:
i915 features:

Attachments
/sys/class/drm/card0/error (39.71 KB, application/x-bzip) 2017-01-27 05:48 UTC, Joel
/sys/class/drm/card0/error (37.91 KB, application/x-bzip) 2017-03-16 00:39 UTC, Robert
/sys/class/drm/card0/error (80.88 KB, text/plain) 2017-03-16 18:47 UTC, André Stein
/sys/class/drm/card0/error (53.02 KB, text/plain) 2017-04-02 01:39 UTC, Nicolas Dufresne
Latest dmesg (20.05 KB, text/plain) 2017-10-16 21:59 UTC, Robert
Latest /sys/class/drm/card0/error (426.37 KB, text/plain) 2017-10-16 22:00 UTC, Robert

Description Joel 2017-01-27 05:48:16 UTC

Created attachment 129174 [details]
/sys/class/drm/card0/error

I got this in my dmesg while playing Portal 2. The game would hang for about 5 seconds, recover, and then hang again after a random amount of time.

[36111.153352] [drm] GPU HANG: ecode 9:0:0x85dffffb, in portal2_linux [20076], reason: Hang on render ring, action: reset
[36111.153356] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[36111.153358] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[36111.153359] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[36111.153361] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[36111.153363] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[36111.153487] drm/i915: Resetting chip after gpu hang
[36111.153920] [drm] GuC firmware load skipped
[36113.070947] [drm] RC6 on
[36157.230576] drm/i915: Resetting chip after gpu hang
[36157.231028] [drm] GuC firmware load skipped
[36159.150428] [drm] RC6 on
[36224.537792] thinkpad_acpi: EC reports that Thermal Table has changed
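For anyone comparing several of these reports, the "GPU HANG" line can be split into its fields with a short script. This is only a sketch based on the line format quoted above; the function name and field names (gen, engine, code) are my own labels, and reading the first ecode number as the hardware generation is an assumption (9 would match Skylake).

```python
import re

# Pattern modelled on the "[drm] GPU HANG" line quoted in this report.
# Group names are illustrative labels, not official i915 terminology.
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<gen>\d+):(?P<engine>\d+):(?P<code>0x[0-9a-fA-F]+), "
    r"in (?P<process>\S+) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>.+?), action: (?P<action>\w+)"
)


def parse_gpu_hang(line):
    """Return the fields of a GPU HANG dmesg line as a dict, or None if no match."""
    match = HANG_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the purely numeric fields; the ecode stays a hex string.
    for key in ("gen", "engine", "pid"):
        fields[key] = int(fields[key])
    return fields


if __name__ == "__main__":
    sample = (
        "[36111.153352] [drm] GPU HANG: ecode 9:0:0x85dffffb, in portal2_linux "
        "[20076], reason: Hang on render ring, action: reset"
    )
    print(parse_gpu_hang(sample))
```

Lines that are not hang reports (for example the "RC6 on" messages) simply return None, so the function can be applied to a whole dmesg capture.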

Comment 1 yann 2017-01-30 14:37:32 UTC

Improvements have been pushed to the kernel, xf86-video-intel and Mesa that will benefit your system, so please re-test with the latest kernel, xf86-video-intel (in case you use SNA) and Mesa to see if this issue still occurs.
Mark as REOPENED if you can reproduce it (please capture and upload an apitrace (https://github.com/apitrace/apitrace) so that we can easily reproduce it as well) and RESOLVED/* if you cannot reproduce it.
In both cases, please confirm your environment (see below).


* Details:
- Kernel: 4.8.13-1-ARCH
- Platform: Skylake (PCI ID: 0x1916, PCI Revision: 0x07, PCI Subsystem: 17aa:504a)
- Mesa: [Please confirm your version]
- xf86-video-intel: [Please confirm your version]

From this error dump, the hang is happening in the render ring batch with the active head at 0xef313088, with 0x7a000004 (PIPE_CONTROL) as IPEHR.

Batch extract (around 0xef313088):

0xef313014:      0x00000000: MI_NOOP
0xef313018:      0x00000000: MI_NOOP
0xef31301c:      0x78490001: 3D UNKNOWN: 3d_965 opcode = 0x7849
0xef313020:      0x00000001: MI_NOOP
0xef313024:      0x00000000: MI_NOOP
0xef313028:      0x78490001: 3D UNKNOWN: 3d_965 opcode = 0x7849
0xef31302c:      0x00000002: MI_NOOP
0xef313030:      0x00000000: MI_NOOP
0xef313034:      0x78490001: 3D UNKNOWN: 3d_965 opcode = 0x7849
0xef313038:      0x00000003: MI_NOOP
0xef31303c:      0x00000000: MI_NOOP
0xef313040:      0x78490001: 3D UNKNOWN: 3d_965 opcode = 0x7849
0xef313044:      0x00000004: MI_NOOP
0xef313048:      0x00000000: MI_NOOP
0xef31304c:      0x780c0000: 3D UNKNOWN: 3d_965 opcode = 0x780c
0xef313050:      0x00000000: MI_NOOP
Bad length 7 in (null), expected 6-6
0xef313054:      0x7b000005: 3DPRIMITIVE: fail sequential
0xef313058:      0x00000104:    vertex count
0xef31305c:      0x00000fba:    start vertex
0xef313060:      0x00000000:    instance count
0xef313064:      0x00000001:    start instance
0xef313068:      0x00000000:    index bias
0xef31306c:      0x00000000: MI_NOOP
Bad count in PIPE_CONTROL
0xef313070:      0x7a000004: PIPE_CONTROL: no write, no depth stall, no RC write flush, no inst flush
0xef313074:      0x00101001:    destination address
0xef313078:      0x00000000:    immediate dword low
0xef31307c:      0x00000000:    immediate dword high
Bad count in PIPE_CONTROL
0xef313088:      0x7a000004: PIPE_CONTROL: no write, no depth stall, no RC write flush, no inst flush
0xef31308c:      0x00000408:    destination address
0xef313090:      0x00000000:    immediate dword low
0xef313094:      0x00000000:    immediate dword high
0xef3130a0:      0x78300000: 3D UNKNOWN: 3d_965 opcode = 0x7830
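The "Bad length" and "Bad count" warnings in the extract above can be cross-checked by decoding the command header dwords by hand. The sketch below assumes the usual 3D command header layout from Intel's public programming manuals (command type in bits 31:29, subtype in 28:27, opcode in 26:24, sub-opcode in 23:16, and a DWord Length field in bits 7:0 holding the total length minus 2); that layout is not stated anywhere in this report, and the class and function names are made up for illustration.

```python
from typing import NamedTuple


class GfxPipeHeader(NamedTuple):
    """Fields of a 3D command header dword (layout assumed, see lead-in)."""
    command_type: int   # bits 31:29
    subtype: int        # bits 28:27
    opcode: int         # bits 26:24
    sub_opcode: int     # bits 23:16
    total_dwords: int   # DWord Length field (bits 7:0) plus the bias of 2


def decode_header(dword: int) -> GfxPipeHeader:
    """Split a 32-bit command header into its bit fields."""
    return GfxPipeHeader(
        command_type=(dword >> 29) & 0x7,
        subtype=(dword >> 27) & 0x3,
        opcode=(dword >> 24) & 0x7,
        sub_opcode=(dword >> 16) & 0xFF,
        total_dwords=(dword & 0xFF) + 2,
    )


if __name__ == "__main__":
    # Header values taken from the batch extract above.
    for value in (0x7A000004, 0x7B000005, 0x78490001):
        print(hex(value), decode_header(value))
```

Under that assumption 0x7a000004 decodes to a 6-dword packet, which is consistent with the jump from 0xef313070 to 0xef313088 between the two PIPE_CONTROLs, and 0x7b000005 decodes to 7 dwords, matching the "Bad length 7" message. That would suggest the warnings come from the decoder expecting the older packet sizes rather than from a malformed batch, though the error state alone does not prove it.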

Comment 2 Robert 2017-03-16 00:39:33 UTC

Created attachment 130249 [details]
/sys/class/drm/card0/error

I think I'm having the same issue with a Kaby Lake laptop.

Dell XPS 13 9360 Dev Edition
Ubuntu 16.04 LTS w/ HWE stack running
4.8.0-42-generic #45~16.04.1-Ubuntu SMP Thu Mar 9 14:10:58 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

/sys/class/drm/card0/error

Comment 3 Robert 2017-03-16 14:00:55 UTC

Mesa:   v12.0.6-0ubuntu0.16.04.1
xserver-xorg-video-intel:   2:2.99.917+git20160325-1ubuntu1.2

Although I have the 16.04.1 Hardware Enablement (HWE) kernel, it appears there is an HWE version of xserver-xorg-video-intel I was not aware of, and which was not automatically selected with the HWE kernel.  I will try that now.

xserver-xorg-video-intel-hwe-16.04:   2:2.99.917+git20160706-1ubuntu1~16.04.1

Comment 4 Robert 2017-03-16 15:15:43 UTC

Still crashes with the following:

Dell XPS 13 9360 Dev Edition (Kabylake)
Ubuntu 16.04 LTS w/ HWE stack running
Kernel:   4.8.0-42-generic #45~16.04.1-Ubuntu SMP Thu Mar 9 14:10:58 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Mesa:   v12.0.6-0ubuntu0.16.04.1
xserver-xorg-video-intel-hwe-16.04:   2:2.99.917+git20160706-1ubuntu1~16.04.1

Comment 5 André Stein 2017-03-16 18:47:15 UTC

I have the same problem and Portal 2 crashes quite quickly after a few minutes of playing.

I had the problem with Ubuntu 16.10 and upgraded to 17.04 beta yesterday to check with newest kernels and Mesa.

- Kernel: 4.10.0-11-generic
- Platform: Skylake and Intel Corporation Iris Pro Graphics 580 (rev 09)
- Mesa: 17.0.1-1ubuntu1
- xf86-video-intel: 2.99.917+git20160706-1ubuntu1

Unfortunately I had some trouble getting apitrace to run with Portal 2. I will try again tomorrow because the problem is easily reproducible. Uploading my card0/error file too.

Comment 6 André Stein 2017-03-16 18:47:54 UTC

Created attachment 130269 [details]
/sys/class/drm/card0/error

Comment 7 André Stein 2017-03-18 18:52:31 UTC

I still haven't got apitrace to work, because portal2 is 32-bit and the apitrace distributed with Ubuntu doesn't support 32-bit easily. If anyone happens to have a 32-bit system or a 32-bit apitrace build, here are the commands to trace portal2 properly (which was quite fiddly because of the many library paths that have to be added):

# in portal 2 folder
$ cd "$HOME/.steam/steam/steamapps/common/Portal 2"

$ LD_LIBRARY_PATH=":$HOME/.steam/steam/steamapps/common/Portal 2:$HOME/.steam/steam/steamapps/common/Portal 2/bin:$HOME/.steam/ubuntu12_32:$HOME/.steam/ubuntu12_32/panorama:$HOME/.steam/ubuntu12_32/steam-runtime/amd64/lib:$HOME/.steam/ubuntu12_32/steam-runtime/amd64/lib/x86_64-linux-gnu:$HOME/.steam/ubuntu12_32/steam-runtime/amd64/usr/lib:$HOME/.steam/ubuntu12_32/steam-runtime/amd64/usr/lib/x86_64-linux-gnu:$HOME/.steam/ubuntu12_32/steam-runtime/i386/lib:$HOME/.steam/ubuntu12_32/steam-runtime/i386/lib/i386-linux-gnu:$HOME/.steam/ubuntu12_32/steam-runtime/i386/usr/lib:$HOME/.steam/ubuntu12_32/steam-runtime/i386/usr/lib/i386-linux-gnu:$HOME/.steam/ubuntu12_64:/lib:/lib/i386-linux-gnu:/lib/x86_64-linux-gnu:/usr/lib:/usr/lib/i386-linux-gnu:/usr/lib/i386-linux-gnu/mesa:/usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu/libfakeroot:/usr/lib/x86_64-linux-gnu/mesa:/usr/lib/x86_64-linux-gnu/mesa-egl:/usr/local/lib:" apitrace trace -a gl ./portal2_linux -game portal2 -steam

Comment 8 Nicolas Dufresne 2017-04-02 01:39:11 UTC

Created attachment 130635 [details]
/sys/class/drm/card0/error

I believe I got the same hang with a NUC Skull Canyon (Iris 580, Skylake), this time on Fedora 25.

Kernel: 4.10.6-200.fc25.x86_64
Mesa: 13.0.4 (also tried 17.1.0-devel (git-31970ab))
Xorg: xorg-x11-drv-intel-2.99.917-26.20160929

Is an apitrace what is missing to help with this issue? If so, let me know and I'll manually rebuild apitrace in 32-bit.

Comment 9 Nicolas Dufresne 2017-04-02 15:35:23 UTC

OK, made some progress: I have a 32-bit apitrace working now. Initially, to reproduce, I was raising the video quality level. For the test I set that to maximum, and it leads to a crash in Mesa (I will report that separately later). I'll keep tweaking the settings until I can reproduce the hang. Meanwhile, if anyone needs a 32-bit apitrace, I can provide instructions, or even the binary.

Comment 10 Nicolas Dufresne 2017-04-02 16:01:18 UTC

I have a trace now, though I notice that replaying the trace does not trigger the hang. Let's hope it will at least be useful for understanding what is going on.

Hang with medium settings:
http://people.collabora.co.uk/~nicolas/portal2_linux.hang.trace
SHA256 a17185c9eeb322a73a9a4202a14ade9710a328bfe64837bd54bde11f1b41f28d 
806M / 844585317 bytes
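Since the trace is a large download, it may be worth checking it against the size and SHA256 given above before replaying it. A minimal sketch, assuming the file was saved locally under the hypothetical name portal2_linux.hang.trace; the function names are illustrative only.

```python
import hashlib
import os

# Values copied from this comment.
EXPECTED_SHA256 = "a17185c9eeb322a73a9a4202a14ade9710a328bfe64837bd54bde11f1b41f28d"
EXPECTED_SIZE = 844585317  # bytes


def sha256_of(path, chunk_size=1 << 20):
    """Hash the file in chunks so an 800 MB trace is never fully held in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256, expected_size):
    """Return True only if both the size and the SHA256 digest match."""
    if os.path.getsize(path) != expected_size:
        return False  # cheap check first, avoids hashing a truncated file
    return sha256_of(path) == expected_sha256.lower()
```

For example, verify_download("portal2_linux.hang.trace", EXPECTED_SHA256, EXPECTED_SIZE) should return True for a complete, uncorrupted download.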

Comment 11 Robert 2017-10-16 21:58:31 UTC

Seems this is still happening:

Ubuntu 17.10 (Artful)
Kernel:  4.13.0-16-generic #19-Ubuntu SMP Wed Oct 11 18:35:14 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Mesa:  17.2.2-0ubuntu1
xserver-xorg-video-intel:  2:2.99.917+git20170309-0ubuntu1

The crash seems to occur after a certain number of passages through portals. Each passage makes the game more jittery, until it just freezes and then crashes.

Comment 12 Robert 2017-10-16 21:59:38 UTC

Created attachment 134873 [details]
Latest dmesg

Comment 13 Robert 2017-10-16 22:00:25 UTC

Created attachment 134874 [details]
Latest /sys/class/drm/card0/error

Comment 14 Robert 2017-10-16 22:58:25 UTC

Just noticed what I encountered today might be a separate issue...

Hang on render ring, action: reset

vs

Hang on rcs0, action: reset

Comment 15 Kenneth Graunke 2017-10-16 23:01:33 UTC

Render ring, RCS, and RCS0 are all interchangeable names. I think they probably just changed intel_error_decode's naming convention.
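When comparing hang reasons across kernel versions, the three names can be folded into one before comparing. A small sketch, assuming the "Hang on <engine>" wording quoted in this thread; the helper names and the canonical label "render" are arbitrary choices.

```python
import re

# Names treated as the same engine, per the comment above.
RENDER_ENGINE_ALIASES = {"render ring", "rcs", "rcs0"}


def engine_of(reason):
    """Extract the engine from a 'Hang on <engine>' string and normalise it."""
    match = re.search(r"Hang on ([^,]+)", reason)
    name = (match.group(1) if match else reason).strip().lower()
    return "render" if name in RENDER_ENGINE_ALIASES else name


def same_engine(reason_a, reason_b):
    """True if two hang reasons refer to the same engine after normalisation."""
    return engine_of(reason_a) == engine_of(reason_b)
```

With this, "Hang on render ring, action: reset" and "Hang on rcs0, action: reset" compare as the same engine.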

Comment 16 Robert 2017-10-16 23:42:03 UTC

Oh, thank you for clarifying that.  Good to know :)

Comment 17 Kenneth Graunke 2017-12-01 18:37:01 UTC

*** Bug 101389 has been marked as a duplicate of this bug. ***

Comment 18 Kenneth Graunke 2017-12-01 18:46:22 UTC

*** Bug 100906 has been marked as a duplicate of this bug. ***

Comment 19 Kenneth Graunke 2017-12-01 18:49:01 UTC

Some hangs affecting Portal 2 were fixed in:

commit ee57b15ec764736e2d5360beaef9fb2045ed0f68
Author: Jason Ekstrand <jason.ekstrand@intel.com>
Date:   Wed Nov 29 16:22:42 2017 -0800

    i965: Disable regular fast-clears (CCS_D) on gen9+
    
    This partially reverts commit 3e57e9494c2279580ad6a83ab8c065d01e7e634e
    which caused a bunch of GPU hangs on several Source titles.  To date, we
    have no clue why these hangs are actually happening.  This undoes the
    final effect of 3e57e9494c227 and gets us back to not hanging.  Tested
    with Team Fortress 2.
    
    Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=102435
    Fixes: 3e57e9494c2279580ad6a83ab8c065d01e7e634e
    Cc: mesa-stable@lists.freedesktop.org

The original report from January looks a bit different though, so there may be additional hangs.  Please reopen and attach a new error state if you still experience issues with Mesa master or 17.3.0 once it's released.  I've been testing it locally and it appears to be working fine.

Thanks for the reports, and your patience!
