Bug 90481 - Radeonsi driver, X crash while playing "Spec ops: the line"
Summary: Radeonsi driver, X crash while playing "Spec ops: the line"
Status: RESOLVED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/Gallium/radeonsi
Version: 11.1
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium critical
Assignee: Default DRI bug account
QA Contact: Default DRI bug account
 
Reported: 2015-05-16 14:59 UTC by Ivan Viktorov
Modified: 2018-07-04 06:00 UTC


Attachments
kernel log with drm.debug=1 (740.14 KB, text/plain)
2015-05-16 14:59 UTC, Ivan Viktorov
kernel log 4.1-rc3 (365.53 KB, text/plain)
2015-05-17 18:38 UTC, Ivan Viktorov
kern.log 4.4.0-amd64 (458.09 KB, text/x-log)
2016-01-31 16:25 UTC, Xavier Sellier
apitrace (16.51 MB, application/octet-stream)
2016-02-07 21:50 UTC, Aaron Paden
GALLIUM_DDEBUG="800 noflush" dump (36.00 KB, text/plain)
2016-06-13 15:40 UTC, Daniel Scharrer
GALLIUM_DDEBUG="pipelined 10000" dump (61.93 KB, text/plain)
2016-08-12 18:06 UTC, Daniel Scharrer
Another GALLIUM_DDEBUG="pipelined 10000" dump (60.71 KB, text/plain)
2016-08-12 19:18 UTC, Daniel Scharrer
Crash information (11.91 KB, text/plain)
2016-08-13 11:19 UTC, Daniel Scharrer

Description Ivan Viktorov 2015-05-16 14:59:29 UTC
Created attachment 115834 [details]
kernel log with drm.debug=1

Playing Spec Ops: The Line from Steam causes a GPU lockup.

The behaviour varies. First the system freezes, and after a few seconds the screen goes black. Sometimes only SysRq works; sometimes I can switch to a VT and log in (the screen stays black, but sudo reboot works). Sometimes it unfreezes and everything works again.

System:                  Fedora 22 x86_64
kernel:                  4.0.2-300.fc22.x86_64
mesa:                    10.5.4-1.20150505.fc22
xorg-server:             1.17.1-11.fc22
libdrm:                  2.4.61-3.fc22
xorg-x11-drv-ati.x86_64: 7.5.0-3.fc22
window manager:          kwin-5.3.0-2.fc22

Video card: Radeon R9 270X
OpenGL vendor string: X.Org
OpenGL renderer string: Gallium 0.4 on AMD PITCAIRN
OpenGL core profile version string: 3.3 (Core Profile) Mesa 10.5.4
OpenGL core profile shading language version string: 3.30
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile
OpenGL core profile extensions:
OpenGL version string: 3.0 Mesa 10.5.4
OpenGL shading language version string: 1.30
OpenGL context flags: (none)
OpenGL extensions:
OpenGL ES profile version string: OpenGL ES 3.0 Mesa 10.5.4
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.00
OpenGL ES profile extensions:

Always reproducible, after 15-40 minutes of playing.
Comment 1 Ivan Viktorov 2015-05-17 18:37:10 UTC
Same situation with kernel 4.1.0-rc3.
Comment 2 Ivan Viktorov 2015-05-17 18:38:21 UTC
Created attachment 115859 [details]
kernel log 4.1-rc3
Comment 3 Ivan Viktorov 2015-05-22 22:06:43 UTC
The issue is still present with:
mesa 10.6.0-0.devel.6.5a55f68.fc23
llvm 3.6.0-1.fc23
Comment 4 Xavier Sellier 2016-01-31 16:25:57 UTC
Created attachment 121426 [details]
kern.log 4.4.0-amd64
Comment 5 Xavier Sellier 2016-01-31 16:33:38 UTC
Same issue.

20-radeon.conf:
Section "Device"
        Identifier  "AMD Radeon 7850 HD"
        Driver      "radeon"
        Option      "DRI"              "3"
        Option      "TearFree"         "on"
        Option      "AccelMethod"      "glamor"
Endsection

Versions
System: Linux 4.4.0-trunk-amd64 #1 SMP Debian 4.4-1~exp1 (2016-01-19) x86_64 GNU/Linux
firmware-linux-nonfree:all/testing 20160110-1 uptodate
libgl1-mesa-dri:amd64/testing 11.1.1-2 uptodate
libgl1-mesa-dri:i386/testing 11.1.1-2 uptodate
libgl1-mesa-glx:amd64/testing 11.1.1-2 uptodate
libgl1-mesa-glx:i386/testing 11.1.1-2 uptodate
mesa-utils:amd64/testing 8.3.0-1 uptodate
mesa-utils:i386 not installed
xserver-xorg-core:amd64/unstable 2:1.18.0-3 uptodate
xserver-xorg-core:i386 not installed

Video card AMD Radeon R9 290 4 GB:
OpenGL vendor string: X.Org
OpenGL renderer string: Gallium 0.4 on AMD HAWAII (DRM 2.43.0, LLVM 3.7.1)
OpenGL core profile version string: 4.1 (Core Profile) Mesa 11.1.1
OpenGL core profile shading language version string: 4.10
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile
OpenGL core profile extensions:
OpenGL version string: 3.0 Mesa 11.1.1
OpenGL shading language version string: 1.30
OpenGL context flags: (none)
OpenGL extensions:
OpenGL ES profile version string: OpenGL ES 3.0 Mesa 11.1.1
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.00
OpenGL ES profile extensions:
Comment 6 Xavier Sellier 2016-01-31 17:09:02 UTC
radeontop shows GPU usage at 100% (for everything, even after lightdm has been stopped).
Comment 7 Aaron Paden 2016-02-07 21:50:50 UTC
Created attachment 121577 [details]
apitrace

I'm uploading a trimmed apitrace of the first five and last five frames. The crash is consistent, but it takes several minutes of play, so the complete apitrace would be several GiB. Here's the output from replaying it:

86314: message: major api error 1: GL_INVALID_ENUM in glTexImage2DMultisample(internalformat=GL_RGB9_E5)
86314 @0 glTexImage2DMultisample(target = GL_TEXTURE_2D_MULTISAMPLE, samples = 2, internalformat = GL_RGB9_E5, width = 32, height = 32, fixedsamplelocations = GL_TRUE)
86314: warning: glGetError(glTexImage2DMultisample) = GL_INVALID_ENUM
96483: message: major api error 1: GL_INVALID_VALUE in glClientWaitSync (not a valid sync object)
96483 @2 glClientWaitSync(sync = 0x7e7f1ed0, flags = 0x0, timeout = 0) = GL_ALREADY_SIGNALED
96483: warning: got GL_WAIT_FAILED
96483: warning: glGetError(glClientWaitSync) = GL_INVALID_VALUE
96484: message: major api error 1: GL_INVALID_VALUE in glDeleteSync (not a valid sync object)
96484 @2 glDeleteSync(sync = 0x7e7f1ed0)
96484: warning: glGetError(glDeleteSync) = GL_INVALID_VALUE
96485: message: major api error 1: GL_INVALID_VALUE in glClientWaitSync (not a valid sync object)
96485 @2 glClientWaitSync(sync = 0x82a26dc0, flags = 0x0, timeout = 0) = GL_TIMEOUT_EXPIRED
96485: warning: glGetError(glClientWaitSync) = GL_INVALID_VALUE
97973: message: major api error 1: GL_INVALID_VALUE in glUseProgram
97973 @2 glUseProgram(program = 2147)
97973: warning: glGetError(glUseProgram) = GL_INVALID_VALUE
97975: message: major api error 1: GL_INVALID_OPERATION in glBindFramebuffer(buffer)
97975 @2 glBindFramebuffer(target = GL_FRAMEBUFFER, framebuffer = 22)
97975: warning: glGetError(glBindFramebuffer) = GL_INVALID_OPERATION
97990: message: major api error 1: GL_INVALID_OPERATION in glBindBufferRange(non-gen name)
97990 @2 glBindBufferRange(target = GL_UNIFORM_BUFFER, index = 7, buffer = 970, offset = 140112, size = 144)
97990: warning: glGetError(glBindBufferRange) = GL_INVALID_OPERATION
97991: message: major api error 1: GL_INVALID_OPERATION in glBindBuffer(non-gen name)
97991 @2 glBindBuffer(target = GL_ARRAY_BUFFER, buffer = 4986)
97991: warning: glGetError(glBindBuffer) = GL_INVALID_OPERATION
97994: message: major api error 1: GL_INVALID_OPERATION in glBindBuffer(non-gen name)
97994 @2 glBindBuffer(target = GL_ELEMENT_ARRAY_BUFFER, buffer = 4984)
97994: warning: glGetError(glBindBuffer) = GL_INVALID_OPERATION
98013: message: major api error 1: GL_INVALID_OPERATION in glBindBuffer(non-gen name)
98013 @2 glBindBuffer(target = GL_ARRAY_BUFFER, buffer = 5163)
98013: warning: glGetError(glBindBuffer) = GL_INVALID_OPERATION
98015: message: major api error 1: GL_INVALID_OPERATION in glBindBuffer(non-gen name)
98015 @2 glBindBuffer(target = GL_ELEMENT_ARRAY_BUFFER, buffer = 5161)
98015: warning: glGetError(glBindBuffer) = GL_INVALID_OPERATION
98018: message: major api error 1: GL_INVALID_OPERATION in glBindBuffer(non-gen name)
98018 @2 glBindBuffer(target = GL_ARRAY_BUFFER, buffer = 4989)
98018: warning: glGetError(glBindBuffer) = GL_INVALID_OPERATION
98020: message: major api error 1: GL_INVALID_OPERATION in glBindBuffer(non-gen name)
98020 @2 glBindBuffer(target = GL_ELEMENT_ARRAY_BUFFER, buffer = 4987)
98020: warning: glGetError(glBindBuffer) = GL_INVALID_OPERATION
apitrace: warning: caught signal 11
98023: error: caught an unhandled exception
/usr/bin/glretrace+0x24285c
/usr/lib/libpthread.so.0+0x10d5f
/usr/lib/libc.so.6+0x90e90
/usr/lib/xorg/modules/dri/radeonsi_dri.so+0x3a5f9b
/usr/lib/xorg/modules/dri/radeonsi_dri.so+0x600493
/usr/lib/xorg/modules/dri/radeonsi_dri.so+0x3a7e0b
/usr/lib/xorg/modules/dri/radeonsi_dri.so+0x1fde1e
/usr/lib/xorg/modules/dri/radeonsi_dri.so+0x1cdc87
/usr/lib/xorg/modules/dri/radeonsi_dri.so+0x1cdf93
/usr/lib/xorg/modules/dri/radeonsi_dri.so+0x1ceeaa
/usr/bin/glretrace+0xd0a3b
/usr/bin/glretrace+0xcccc
/usr/bin/glretrace+0xd467
/usr/bin/glretrace+0xd711
/usr/lib/libpthread.so.0+0x74a3
/usr/lib/libc.so.6: clone+0x6c
?
apitrace: info: taking default action for signal 11
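
Editorial sketch: replay output like the above is easier to scan once the repeated `glGetError` warnings are tallied per call. The Python snippet below is a minimal illustration that assumes the line format shown in this comment; the function name `tally_gl_errors` and the `sample` list are made up for the example and are not part of apitrace.

```python
import re
from collections import Counter

# Matches glretrace warning lines of the form shown above, e.g.
#   97991: warning: glGetError(glBindBuffer) = GL_INVALID_OPERATION
GL_ERROR_RE = re.compile(
    r"^(?P<call_no>\d+): warning: glGetError\((?P<func>\w+)\) = (?P<error>GL_\w+)$"
)


def tally_gl_errors(lines):
    """Count (function, error) pairs reported in glretrace output lines."""
    counts = Counter()
    for line in lines:
        match = GL_ERROR_RE.match(line.strip())
        if match:
            counts[(match["func"], match["error"])] += 1
    return counts


# A few lines copied from the replay output above (hypothetical selection).
sample = [
    "86314: warning: glGetError(glTexImage2DMultisample) = GL_INVALID_ENUM",
    "96483: warning: got GL_WAIT_FAILED",
    "97991: warning: glGetError(glBindBuffer) = GL_INVALID_OPERATION",
    "97994: warning: glGetError(glBindBuffer) = GL_INVALID_OPERATION",
    "apitrace: warning: caught signal 11",
]

for (func, error), count in sorted(tally_gl_errors(sample).items()):
    print(f"{func}: {error} x{count}")
```

Lines that are not `glGetError` warnings (such as the `GL_WAIT_FAILED` notice or the signal message) are simply skipped, so the tally only reflects errors tied to a named GL call.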
Comment 8 Aaron Paden 2016-02-12 03:16:53 UTC
Still an issue with Mesa 11.1.2 and Linux 4.5-rc3.
Comment 9 Daniel Scharrer 2016-04-10 12:47:34 UTC
I'm also seeing frequent lockups with VI using git Mesa and LLVM (X unresponsive, radeontop showing everything at 100%). Nothing in dmesg, but that's probably just because (afaik) gpu reset is not implemented for amdgpu in 4.5.

GPU: R9 380X (tonga)
Mesa 11.3.0-devel (git-715e97e)
LLVM r265649
Comment 10 Bas Nieuwenhuizen 2016-06-12 15:03:57 UTC
I tried the attached apitrace, but the segfault most likely occurs because the apitrace was trimmed: there is no index buffer bound in one of the glDrawRangeElements calls, so the offset is interpreted as a pointer and the call segfaults.

It is *very* unlikely that this is the bug from the original report, as it should have no side effects besides a segfault of the game.

Could you try to create an untrimmed apitrace which reproduces the issue and upload it somewhere?
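
Editorial sketch: the failure mode described in this comment can be modelled in a few lines. This is a toy Python model, not Mesa or OpenGL code; `resolve_indices` and its arguments are hypothetical names, and it only illustrates the assumption that the same integer is read as a buffer offset when an index buffer is bound and as a raw address when none is.

```python
def resolve_indices(element_buffer, indices_arg):
    """Model how a draw call's `indices` argument is interpreted.

    element_buffer: bytes of the bound index buffer, or None if nothing is bound.
    indices_arg: the integer passed as the `indices` parameter.
    """
    if element_buffer is not None:
        # An index buffer is bound: the value is an offset into that buffer.
        return ("buffer_offset", indices_arg)
    # Nothing is bound: the same value is treated as a raw memory address.
    return ("client_pointer", indices_arg)


index_data = bytes(64)  # stand-in for a bound index buffer

print(resolve_indices(index_data, 24))  # small offset into the buffer
print(resolve_indices(None, 24))        # same value, now read as address 24
```

In the trimmed trace the buffer binding is missing, so a small offset such as 24 would end up being dereferenced as an address, which is consistent with a segfault inside the driver rather than a GPU hang.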
Comment 11 Daniel Scharrer 2016-06-13 15:40:05 UTC
Created attachment 124508 [details]
GALLIUM_DDEBUG="800 noflush" dump

I tried to record an apitrace but could not get any lockups while recording or glretracing the traces. I also was not able to get a hang while using GALLIUM_DDEBUG="800" without noflush. Maybe the hang is framerate related, or at least much less likely to occur at really low framerates?

However, I did reproduce a hang while using GALLIUM_DDEBUG="800 noflush". Attached is the ddebug dump, not sure if it will be of any use.

Mesa 12.1.0-devel (git-a048047)
LLVM r272544
Comment 12 Nicolai Hähnle 2016-06-14 05:55:22 UTC
Based on the reported GRBM_STATUS registers, the hang is probably somewhere in the pixel pipe, since VGT_BUSY and PA_BUSY = 0. But it's difficult to say more. Framerate sensitivity is definitely possible.
Comment 13 Daniel Scharrer 2016-07-01 14:37:30 UTC
Is there anything else I could try to help track this down?

I tried running the game with R600_DEBUG=nodma and while that seemed to fix the issue at first, I still got a lockup after a couple of runs. Perhaps nodma just made the lockup less likely by slowing things down (although perf was not *that* different).

Btw, jaycee1980 in #radeon seemed open to providing AMD Mesa devs keys to Virtual Programming games, so you could try to reproduce this on your end as well.
Comment 14 at46n 2016-07-10 11:36:50 UTC
I'm also affected by this bug with my R7 260X on Ubuntu 16.04. If I'm able to switch to a tty, my system gives me an endless stream of messages saying "radeon 000:01:00.0: ring 0 stalled for more than 10004msec". Last time I also got
"[drm:ci_dpm_enable [radeon]] *ERROR* ci_start_dpm failed
[drm:radeon_pm_resume [radeon]] *ERROR* radeon: dpm resume failed
[drm:radeon_pm_resume [radeon]] *ERROR* radeon: dpm resume failed"
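
Editorial sketch: to check a saved kernel log for the stall message quoted in this comment, something like the following Python snippet could be used. It assumes only the message wording shown above; `find_ring_stalls` and the sample lines are illustrative, and reading the lines from an actual log file is left to the reader.

```python
import re

# Matches messages such as:
#   radeon 000:01:00.0: ring 0 stalled for more than 10004msec
STALL_RE = re.compile(r"ring (?P<ring>\d+) stalled for more than (?P<msec>\d+)msec")


def find_ring_stalls(log_lines):
    """Return (ring, milliseconds) for every ring-stall message found."""
    stalls = []
    for line in log_lines:
        match = STALL_RE.search(line)
        if match:
            stalls.append((int(match["ring"]), int(match["msec"])))
    return stalls


# Lines taken from the messages quoted in this comment.
sample_log = [
    "radeon 000:01:00.0: ring 0 stalled for more than 10004msec",
    "[drm:ci_dpm_enable [radeon]] *ERROR* ci_start_dpm failed",
    "[drm:radeon_pm_resume [radeon]] *ERROR* radeon: dpm resume failed",
]

for ring, msec in find_ring_stalls(sample_log):
    print(f"ring {ring} stalled for {msec} ms")
```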
Comment 15 Marek Olšák 2016-08-01 16:29:58 UTC
You can try to test with:

GALLIUM_DDEBUG="pipelined 10000"
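
Editorial sketch: the debug variables mentioned in this thread (GALLIUM_DDEBUG, R600_DEBUG) are ordinary environment variables, so a test run can be scripted. The Python helper below is only an illustration; `debug_env`, `run_with_debug` and the `./game-binary` placeholder are hypothetical, and the variable values are the ones quoted in the comments.

```python
import os
import subprocess


def debug_env(ddebug=None, r600_debug=None, base=None):
    """Return a copy of the environment with the requested debug variables set."""
    env = dict(os.environ if base is None else base)
    if ddebug is not None:
        env["GALLIUM_DDEBUG"] = ddebug
    if r600_debug is not None:
        env["R600_DEBUG"] = r600_debug
    return env


def run_with_debug(command, **options):
    """Run `command` (a list of arguments) with the debug variables applied."""
    return subprocess.run(command, env=debug_env(**options))


# Build the environment only; nothing is launched here.
env = debug_env(ddebug="pipelined 10000", base={})
print(env)
```

A test run would then look like `run_with_debug(["./game-binary"], ddebug="pipelined 10000")`, with the placeholder replaced by whatever actually starts the game on the system being tested.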
Comment 16 Daniel Scharrer 2016-08-03 23:45:59 UTC
Unfortunately I am not able to get a lockup when using GALLIUM_DDEBUG="pipelined 10000" - it seems the perf impact is still too big on my PC. I also checked that it still hangs without GALLIUM_DDEBUG.

Kernel: 4.7.0-gentoo
Mesa: git-6fb6201
LLVM: r277571
Comment 17 Marek Olšák 2016-08-10 22:03:19 UTC
Does this fix it?

https://cgit.freedesktop.org/mesa/mesa/commit/?id=947e0614d091c260651e4f3d6209bd6bcc2cfa0d

In other words, does mesa/master work?
Comment 18 Aaron Paden 2016-08-11 23:22:41 UTC
Crashed for me again after about 30 minutes of play using the latest mesa-git and Linux 4.8rc1
Comment 19 Daniel Scharrer 2016-08-12 04:12:43 UTC
I also still get hangs, but they seem to be less frequent than they used to be. However, the framerate seems to be a bit lower compared to the last time I tested - ~70 vs. (iirc) 80+ FPS in the menu - so maybe it's just that. Both the game and glamor were using the updated Mesa version.

Curiously, the first freeze I got when testing didn't look like a GPU lockup but rather a (partial) X server lockup: all blocks were at 0% in radeontop and I was able to switch to a different VT using Ctrl+Alt+F1, and while switching back to X blocked further VT switches I was able to restart the X server normally (the log indicated a clean shutdown) and everything including OpenGL seemed to work fine after that.

The next freeze I got was a proper GPU lockup though - Event Engine and Texture Addresser at 0%, everything else at 100%, unable to switch VTs even with chvt over ssh.

Kernel: 4.7.0-gentoo
Mesa: git-50b49d2
LLVM: r278309
Comment 20 Daniel Scharrer 2016-08-12 18:06:14 UTC
Created attachment 125752 [details]
GALLIUM_DDEBUG="pipelined 10000" dump

I played more of the game with GALLIUM_DDEBUG="pipelined 10000" and was able to eventually catch a lockup. Fewer blocks busy this time.
Comment 21 Daniel Scharrer 2016-08-12 18:07:13 UTC
The game also segfaulted a few times while playing - still need to get a backtrace of that.
Comment 22 Daniel Scharrer 2016-08-12 19:18:45 UTC
Created attachment 125754 [details]
Another GALLIUM_DDEBUG="pipelined 10000" dump
Comment 23 Daniel Scharrer 2016-08-13 11:19:11 UTC
Created attachment 125765 [details]
Crash information

One segfault I observed was due to sctx->b.dma.cs->current.buf being NULL in cik_sdma.c:377 (the first radeon_emit call in that block). Attached is the full stack trace and some additional info.

Another crash didn't have any Mesa stack frames. Not sure what's going on there.

I played a bit using amdgpu-pro 16.30.3.306809 (on top of the upstream 4.7.0 amdgpu kernel module), and there were no crashes or lockups. Also, the game runs noticeably faster on the blob :(
Comment 24 Marek Olšák 2016-08-13 17:28:37 UTC
Does it hang with R600_DEBUG=nohyperz ?
Comment 25 Daniel Scharrer 2016-08-13 23:03:50 UTC
I still get lockups with R600_DEBUG=nohyperz.
Comment 26 Ryan Williams 2017-01-22 23:13:10 UTC
Tested with recent mesa git, Ubuntu 16.04 Padoka PPA, kernel 4.8.11, no gpu lockup with ~2 hour play. Probably fixed by same commit that fixed Arkham Origins in Wine and XCOM:EU.
Comment 27 Samuel Pitoiset 2017-01-23 10:52:58 UTC
(In reply to Ryan Williams from comment #26)
> Tested with recent mesa git, Ubuntu 16.04 Padoka PPA, kernel 4.8.11, no gpu
> lockup with ~2 hour play. Probably fixed by same commit that fixed Arkham
> Origins in Wine and XCOM:EU.

If you have a VI+ card and this commit e490b7812cae778c61004971d86dc8299b6cd240 in your build, that would make sense. But the original ticket is for SI. Should probably be closed because mesa 10.5 is very old though.
Comment 28 Timothy Arceri 2018-07-04 06:00:51 UTC
As per the previous comment, let's close this and file a new bug if this is still an issue.

