Bug 80419 - XCOM: Enemy Unknown Causes lockup
Summary: XCOM: Enemy Unknown Causes lockup
Status: RESOLVED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/Gallium/radeonsi
Version: 10.1
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Default DRI bug account
QA Contact:
URL:
Whiteboard:
Keywords:
Duplicates: 80922 81576 85334
Depends on:
Blocks: 77449
Reported: 2014-06-23 18:30 UTC by Ryan Williams
Modified: 2017-01-07 20:58 UTC (History)
CC List: 21 users

See Also:
i915 platform:
i915 features:


Attachments
X log (103.07 KB, text/plain), 2014-06-24 04:11 UTC, Ryan Williams
Dmesg (92.51 KB, text/plain), 2014-06-24 04:28 UTC, Ryan Williams
glxinfo (57.14 KB, text/plain), 2014-06-24 04:29 UTC, Ryan Williams
X log.old (119.46 KB, text/plain), 2014-06-24 17:51 UTC, Ryan Williams
Dmesg 2 (139.39 KB, text/plain), 2014-06-27 08:13 UTC, Ryan Williams
XCOM apitrace Segfault Output (6.01 KB, text/plain), 2014-06-30 19:19 UTC, Ryan Williams
dmesg xcom after lock-ups (83.25 KB, text/plain), 2014-08-03 13:26 UTC, Vladimir Usikov
dmesg during lockup (5.71 KB, text/plain), 2015-02-08 20:42 UTC, Kamil Páral
dmesg during lockup 2 (8.09 KB, text/plain), 2015-03-27 22:33 UTC, darkm00n
dmesg for Kaveri (6.53 KB, text/plain), 2015-05-03 18:18 UTC, Andrei Slavoiu
dmesg output (93.51 KB, text/plain), 2015-05-30 05:13 UTC, Alassane Maiga
Syslog excerpt showing the GPU stall and kernel backtrace (32.90 KB, application/x-xz), 2015-06-16 09:08 UTC, Kai
Output of GALLIUM_DDEBUG="800 noflush" and R600_DEBUG="ps,vs,gs,vm" (5.97 MB, application/x-xz), 2015-12-18 23:49 UTC, David Beswick
kernel messages from xcom hang (4.48 KB, text/plain), 2015-12-19 14:37 UTC, Kamil Páral
system journal containing gpu hang during apitrace replay (66.36 KB, text/plain), 2015-12-21 23:03 UTC, Kamil Páral
gdb backtrace from xorg when system locks up during glretrace replay (16.85 KB, text/plain), 2015-12-22 22:38 UTC, Kamil Páral
apitrace patch for honoring range in DrawRangeElementsX commands (1.55 KB, patch), 2015-12-30 03:48 UTC, Roland Scheidegger
more conservative apitrace patch (1.92 KB, text/plain), 2015-12-31 02:22 UTC, Nicolai Hähnle
journald xcom: ew lockup (33.07 KB, text/plain), 2016-02-05 02:27 UTC, darkm00n
journald xcom: ew apitrace crash (19.20 KB, text/plain), 2016-02-05 02:29 UTC, darkm00n
crash (5.79 KB, text/plain), 2016-06-12 18:12 UTC, Daniel Exner
crash journal kernel 4.7.0-rc6 (66.11 KB, text/plain), 2016-07-04 17:57 UTC, Daniel Exner

Description Ryan Williams 2014-06-23 18:30:19 UTC
After a few minutes of playing the recently released XCOM: Enemy Unknown, the game locks up the display completely. The mouse pointer still moves but nothing responds. The lockup affects the whole display, not just the game, as running the game in windowed mode shows. Keyboard input is also unresponsive, forcing a hard reset.

Ubuntu 14.04
AMD HD7770 2GB
Mesa 10.1.3

Only response from Feral on the issue was to use Catalyst.
Comment 1 Michel Dänzer 2014-06-24 02:53:13 UTC
Please attach /var/log/Xorg.0.log and the output of dmesg and glxinfo.

> AMD HD7770 2GB
> Mesa 10.1.3

Which version of LLVM? Might be worth trying Mesa 10.2 or even current Git master if possible.
Comment 2 Ryan Williams 2014-06-24 04:11:00 UTC
Created attachment 101631 [details]
X log
Comment 3 Ryan Williams 2014-06-24 04:28:08 UTC
Created attachment 101632 [details]
Dmesg
Comment 4 Ryan Williams 2014-06-24 04:29:59 UTC
Created attachment 101633 [details]
glxinfo
Comment 5 Ryan Williams 2014-06-24 04:38:09 UTC
> Which version of LLVM? Might be worth trying Mesa 10.2 or even current Git
> master if possible.

I'm using llvm 3.4 (default Ubuntu). Will try Mesa 10.2 as soon as possible.
Comment 6 Ryan Williams 2014-06-24 05:43:43 UTC
Tried it with the Oibaf PPA, LLVM 3.4.2 and Mesa Git, and the issue is still there. It seemed to lock up even quicker this time.
Comment 7 Michel Dänzer 2014-06-24 06:27:51 UTC
Any chance you could try an LLVM 3.5 snapshot?

Also, if you could log in via ssh and grab dmesg after the problem occurs, that might be interesting.
Comment 8 Sylvain BERTRAND 2014-06-24 12:18:12 UTC
I have been playing xcom:enemy within for a few hours on an up-to-date x86_64 fedora rawhide.
No lock-up yet, but slow like hell. Tahiti XT.
Comment 9 Laurent carlier 2014-06-24 13:24:24 UTC
(In reply to comment #8)
> I have been playing xcom:enemy within for a few hours on an up-to-date
> x86_64 fedora rawhide.
> No lock-up yet, but slow like hell. Tahiti XT.

I can reproduce the lockup with mesa 10.2.1 with llvm-3.4.2, mesa-git with llvm-3.5svn and a radeon PITCAIRN/kernel 3.15.1/kernel 3.16rc1 (Archlinux x86_64)
Comment 10 Sylvain BERTRAND 2014-06-24 17:20:33 UTC
I stand corrected: Got the lockup and crash in the middle of a mission.

up-to-date x86_64 fedora rawhide. Tahiti XT.
Comment 11 Ryan Williams 2014-06-24 17:50:42 UTC
(In reply to comment #7)
> Any chance you could try an LLVM 3.5 snapshot?
> 
> Also, if you could log in via ssh and grab dmesg after the problem occurs,
> that might be interesting.

I don't have a second machine of my own set up to try it from, but I'll see what I can do. For now, Xorg.0.log.old seems to better match up with the time of the error, and shows a number of EQ overflow errors that aren't in Xorg.0.log.
Comment 12 Ryan Williams 2014-06-24 17:51:41 UTC
Created attachment 101680 [details]
X log.old
Comment 13 Michel Dänzer 2014-06-25 03:49:32 UTC
(In reply to comment #11)
> For now, Xorg.0.log.old seems to better match up with the time of the error,
> and shows a number of EQ overflow errors that aren't in Xorg.0.log.

Those are just symptoms of the GPU hang, they don't say anything about its cause.

The corresponding dmesg output should be available in /var/log/kern.log* as well.
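
A minimal sketch of how a kernel log could be filtered for hang reports. It assumes the radeon driver words them with phrases such as "GPU lockup" or "ring 0 stalled"; the pattern list and the helper name `gpu_hang_lines` are illustrative, not an official list.

```shell
# Print only the lines on stdin that look like radeon GPU hang reports.
# The patterns are assumptions; extend them to match your kernel's wording.
gpu_hang_lines() {
    grep -E -i 'GPU lockup|ring [0-9]+ stalled'
}

# Example usage (rotated logs may be compressed, hence zcat -f):
#   zcat -f /var/log/kern.log* | gpu_hang_lines
#   dmesg | gpu_hang_lines
```

If the filter prints nothing, the log most likely does not cover the moment of the hang.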

Also, I wonder if this is reproducible enough to create an apitrace reproducing it?
Comment 14 Ryan Williams 2014-06-27 08:13:03 UTC
Created attachment 101844 [details]
Dmesg 2
Comment 15 Ryan Williams 2014-06-27 08:16:56 UTC
SSH'd and tried dmesg like you said, and attached the output. I've been trying apitrace, but the game depends on Steam and uses a separate launcher to start the game, and I can't figure out how to get a proper trace because of it. Launching what looks like the game binary directly (game.x86_64) still brings up the launcher.

Just this morning I was able to play without incident for ~ 1 hour for the first time, then locked up almost immediately in a mission again after restarting.
Comment 16 Michel Dänzer 2014-06-27 08:51:49 UTC
(In reply to comment #15)
> SSH'd and tried dmesg like you said and attached.

Thanks, but this doesn't have anything about a GPU lockup or anything like that.
Comment 17 Ryan Williams 2014-06-27 10:20:27 UTC
>Thanks, but this doesn't have anything about a GPU lockup or anything like that.

I looked through it and figured as much but attached anyway as maybe I just wasn't looking for the right thing. It's the output given when the game locked up the display, so I don't know what else to do besides apitrace.
Comment 18 higuita 2014-06-27 16:08:17 UTC
In Steam, right-click the game, select Properties, then Set Launch Options, and enter:

apitrace %command%

You can run the game with gdb, strace, etc. using this option. It will output to the console, so start Steam in a terminal if you need to see the output.
Comment 19 Ryan Williams 2014-06-27 23:01:45 UTC
(In reply to comment #18)
> In Steam, right-click the game, select Properties, then Set Launch Options,
> and enter:
> 
> apitrace %command%
> 
> You can run the game with gdb, strace, etc. using this option. It will
> output to the console, so start Steam in a terminal if you need to see the
> output.

Thanks, but the game simply segfaults when I do this:

Dumped crashlog to /home/ryan/.local/share/feral-interactive/XCOM/crashes//50524115-5e25-7dad-395afa0f-6f10b83e.dmp
/home/ryan/.local/share/Steam/SteamApps/common/XCom-Enemy-Unknown/binaries/linux/xcom.sh: line 39: 9211 Segmentation fault (core dumped) ${DEBUGGER} "${GAMEBINARY}" $@
Game removed: AppID 200510 "XCOM: Enemy Unknown", ProcID 9211

Feral have mentioned to me that they're willing to give Mesa/RadeonSI devs Steam keys to help find and fix the issue, I just need to email them (http://steamcommunity.com/app/200510/discussions/0/648811852226640080/#c522730701427327317). If you guys are willing I'll do that.
Comment 20 Edwin Smith (Feral Interactive) 2014-06-28 20:21:15 UTC
My name is Edwin Smith and I work for Feral Interactive. We don't support the Mesa drivers because of their stability issues compared to the closed-source drivers. However, we can pass on any crash logs and, if it helps, give some complimentary XCOM keys to members of the driver team to help with the debugging effort.

The crash looks like a complete GPU hang that locks up the entire card. If it helps I can look into getting exact instructions on how to attach apitrace to the Steam release.
Comment 21 Ryan Williams 2014-06-29 07:50:24 UTC
(In reply to comment #20)
> The crash looks like a complete GPU hang that locks up the entire card. If
> it helps I can look into getting exact instructions on how to attach
> apitrace to the Steam release.

That would certainly help. Thanks for taking the time to do this.
Comment 22 James Legg 2014-06-30 09:10:16 UTC
To use apitrace on XCOM, follow these instructions:
     1. In the Steam client library list, right click the game
     2. Select "Properties"
     3. Switch to the "GENERAL" tab
     4. Press "SET LAUNCH OPTIONS..."
     5. Put this in the text box "DEBUGGER="apitrace trace" %command%"
     6. Press OK
     7. Close the Properties Window
     8. Hit Play
     9. Select the game to test
    10. Find the trace file in the steam library
        common/XCom-Enemy-Unknown/game.x86_64.trace or
        common/XCom-Enemy-Unknown/xew/game.x86_64.trace depending on
        which game was launched.

We've also seen the GPU hang using fglrx, but haven't reproduced a GPU hang with Intel graphics.
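
As a rough sketch of step 10, the trace file could be located with `find`. The default library path below is an assumption based on the paths quoted elsewhere in this report, and `find_xcom_traces` is a hypothetical helper name.

```shell
# List apitrace output files below a Steam library directory.
# The default path is an assumption; pass your own library path if it differs.
find_xcom_traces() {
    lib="${1:-$HOME/.local/share/Steam/SteamApps}"
    find "$lib" -type f -name 'game.x86_64.trace' 2>/dev/null
}

# Example usage:
#   find_xcom_traces
#   find_xcom_traces /path/to/other/steam/library
```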
Comment 23 Ryan Williams 2014-06-30 15:59:19 UTC
(In reply to comment #22)
> To use apitrace on XCOM, follow these instructions:
>      1. In the Steam client library list, right click the game
>      2. Select "Properties"
>      3. Switch to the "GENERAL" tab
>      4. Press "SET LAUNCH OPTIONS..."
>      5. Put this in the text box "DEBUGGER="apitrace trace" %command%"
>      6. Press OK
>      7. Close the Properties Window
>      8. Hit Play
>      9. Select the game to test
>     10. Find the trace file in the steam library
>         common/XCom-Enemy-Unknown/game.x86_64.trace or
>         common/XCom-Enemy-Unknown/xew/game.x86_64.trace depending on
>         which game was launched.
> 
> We've also seen the GPU hang using fglrx, but haven't reproduced a GPU hang
> with Intel graphics.

Thanks, but again it won't work. This time it complains about missing libsteam_api.so with the message:

/home/ryan/.local/share/Steam/SteamApps/common/XCom-Enemy-Unknown/binaries/linux/../../binaries/linux/game.x86_64: error while loading shared libraries: libsteam_api.so: cannot open shared object file: No such file or directory

Appears to be something going on with xcom.sh?
Comment 24 James Legg 2014-06-30 16:59:49 UTC
(In reply to comment #23)
> Thanks, but again it won't work. This time it complains about missing
> libsteam_api.so

It's possible running through apitrace is somehow losing the LD_LIBRARY_PATH working variable or working directory. You can attach it with LD_PRELOAD which should prevent this. Set the launch options to something like this:
LD_PRELOAD=/usr/local/lib/apitrace/wrappers/glxtrace.so:$LD_PRELOAD %command%
Adjust the path to the x86_64 glxtrace.so if necessary.
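
A small sketch of building the LD_PRELOAD value outside of Steam's launch options. The helper name `prepend_preload` is hypothetical; it only avoids leaving a stray ':' when LD_PRELOAD was previously empty.

```shell
# Prepend a library to an existing LD_PRELOAD value without leaving a
# trailing ':' when the existing value is empty.
prepend_preload() {
    lib="$1"
    current="$2"
    if [ -n "$current" ]; then
        printf '%s:%s\n' "$lib" "$current"
    else
        printf '%s\n' "$lib"
    fi
}

# Example usage (the wrapper path is the one suggested above; adjust as needed):
#   LD_PRELOAD="$(prepend_preload /usr/local/lib/apitrace/wrappers/glxtrace.so "$LD_PRELOAD")"
#   export LD_PRELOAD
```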
Comment 25 James Legg 2014-06-30 17:00:34 UTC
(In reply to comment #24)
> It's possible running through apitrace is somehow losing the LD_LIBRARY_PATH
> working variable

*Environment variable even.
Comment 26 Ryan Williams 2014-06-30 17:45:44 UTC
(In reply to comment #24)
> (In reply to comment #23)
> > Thanks, but again it won't work. This time it complains about missing
> > libsteam_api.so
> 
> It's possible running through apitrace is somehow losing the LD_LIBRARY_PATH
> working variable or working directory. You can attach it with LD_PRELOAD
> which should prevent this. Set the launch options to something like this:
> LD_PRELOAD=/usr/local/lib/apitrace/wrappers/glxtrace.so:$LD_PRELOAD %command%
> Adjust the path to the x86_64 glxtrace.so if necessary.

And now it segfaults again, just like before:

apitrace: redirecting dlopen("libGL.so.1", 0x102)
apitrace: tracing to /home/ryan/.local/share/Steam/SteamApps/common/XCom-Enemy-Unknown/game.x86_64.trace
Dumped crashlog to /home/ryan/.local/share/feral-interactive/XCOM/crashes//7f31c2e8-309f-1dae-547b3bc4-458a6b0d.dmp
/home/ryan/.local/share/Steam/SteamApps/common/XCom-Enemy-Unknown/binaries/linux/xcom.sh: line 39:  6587 Segmentation fault      (core dumped) ${DEBUGGER} "${GAMEBINARY}" $@
Game removed: AppID 200510 "XCOM: Enemy Unknown", ProcID 6576
Comment 27 Laurent carlier 2014-06-30 18:33:41 UTC
(In reply to comment #26)
> (In reply to comment #24)
> > (In reply to comment #23)
> > > Thanks, but again it won't work. This time it complains about missing
> > > libsteam_api.so
> > 
> > It's possible running through apitrace is somehow losing the LD_LIBRARY_PATH
> > working variable or working directory. You can attach it with LD_PRELOAD
> > which should prevent this. Set the launch options to something like this:
> > LD_PRELOAD=/usr/local/lib/apitrace/wrappers/glxtrace.so:$LD_PRELOAD %command%
> > Adjust the path to the x86_64 glxtrace.so if necessary.
> 
> And now it segfaults again, just like before:
> 
> apitrace: redirecting dlopen("libGL.so.1", 0x102)
> apitrace: tracing to
> /home/ryan/.local/share/Steam/SteamApps/common/XCom-Enemy-Unknown/game.
> x86_64.trace
> Dumped crashlog to
> /home/ryan/.local/share/feral-interactive/XCOM/crashes//7f31c2e8-309f-1dae-
> 547b3bc4-458a6b0d.dmp
> /home/ryan/.local/share/Steam/SteamApps/common/XCom-Enemy-Unknown/binaries/
> linux/xcom.sh: line 39:  6587 Segmentation fault      (core dumped)
> ${DEBUGGER} "${GAMEBINARY}" $@
> Game removed: AppID 200510 "XCOM: Enemy Unknown", ProcID 6576

In ~/.local/share/Steam/SteamApps/common/XCom-Enemy-Unknown/ directory
* do a backup of xcom.sh file then edit the file
* line 79, change the line: eval "$GAMESCRIPT" $@
  into: apitrace trace "$GAMESCRIPT" $@
* then save and launch the game, now the trace is properly generated
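
The edit above could be scripted roughly as follows. This is a sketch, not a tested recipe for the real launcher: it matches on the text of the line instead of on line 79 (an assumption, so that small changes to the script do not break it), and `patch_launcher` is a hypothetical helper name.

```shell
# Back up a launcher script, then replace its 'eval "$GAMESCRIPT" $@' call
# with an apitrace invocation. The backup is written to "<script>.orig".
patch_launcher() {
    script="$1"
    cp -p "$script" "$script.orig" || return 1
    # \$ matches a literal dollar sign; the replacement text is literal.
    sed 's|eval "\$GAMESCRIPT" \$@|apitrace trace "$GAMESCRIPT" $@|' \
        "$script.orig" > "$script"
}

# Example usage:
#   patch_launcher /path/to/xcom.sh
# To undo:
#   mv /path/to/xcom.sh.orig /path/to/xcom.sh
```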
Comment 28 Ryan Williams 2014-06-30 19:17:26 UTC
(In reply to comment #27)
> 
> In ~/.local/share/Steam/SteamApps/common/XCom-Enemy-Unknown/ directory
> * do a backup of xcom.sh file then edit the file
> * line 79, change the line: eval "$GAMESCRIPT" $@
>   into: apitrace trace "$GAMESCRIPT" $@
> * then save and launch the game, now the trace is properly generated

Continues to segfault immediately on starting, just after the initial XCOM splash screen, though there's now output from apitrace in the terminal and a trace. Is this just something wrong with my system? I've checked the game cache several times already to make sure there are no corrupted files, so there's nothing wrong there.
Comment 29 Ryan Williams 2014-06-30 19:19:22 UTC
Created attachment 102029 [details]
XCOM apitrace Segfault Output
Comment 30 Laurent carlier 2014-06-30 19:32:38 UTC
(In reply to comment #28)
> (In reply to comment #27)
> > 
> > In ~/.local/share/Steam/SteamApps/common/XCom-Enemy-Unknown/ directory
> > * do a backup of xcom.sh file then edit the file
> > * line 79, change the line: eval "$GAMESCRIPT" $@
> >   into: apitrace trace "$GAMESCRIPT" $@
> > * then save and launch the game, now the trace is properly generated
> 
> > Continues to segfault immediately on starting, just after the initial XCOM
> splash screen, though there's now output from apitrace in terminal and a
> trace. Is this just something wrong with my system? I've checked the game
> cache several times already to make sure there's no corrupted files so
> there's nothing wrong there.

Have you tried with apitrace built from Git?
Comment 31 Ryan Williams 2014-06-30 20:37:54 UTC
(In reply to comment #30)
> Have you tried with apitrace built from git ?

Git failed to build, but 5.0 built fine and worked! It's a 3.2 GB trace though; I will attach it once it's finished trimming.
Comment 32 Alex 2014-07-01 17:42:02 UTC
FYI, have been running XCom happily using Mesa with Sandybridge (2500k, GPU slightly overclocked to 1.4). I think this just affects Radeon.
Comment 33 nicolas 2014-07-02 18:43:25 UTC
Hi, I tried to make a trace too, but I failed to build apitrace from Git, and the version in Ubuntu 14.04 (3.0) segfaults with XCOM :(

Would it help if I used R600_DUMP_SHADERS=1 instead?

If so, can someone explain how to redirect the output to a file? >> doesn't work.
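
One possible reason `>>` appears not to work is that the driver writes its debug output to stderr rather than stdout; that is an assumption here, not something confirmed in this thread. A minimal sketch of a wrapper that appends both streams to a file (`run_logged` is a hypothetical helper name):

```shell
# Run a command and append both its stdout and its stderr to a log file.
run_logged() {
    log="$1"
    shift
    "$@" >> "$log" 2>&1
}

# Example usage:
#   run_logged shaders.log env R600_DUMP_SHADERS=1 glxgears
```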
Comment 34 Michel Dänzer 2014-07-03 01:07:52 UTC
(In reply to comment #33)
> Would it help if I used R600_DUMP_SHADERS=1 instead?

I'm afraid not.
Comment 35 Ryan Williams 2014-07-03 19:18:31 UTC
(In reply to comment #33)
> Hi, I tried to make a trace too, but I failed to build apitrace from Git,
> and the version in Ubuntu 14.04 (3.0) segfaults with XCOM :(
> 
> Would it help if I used R600_DUMP_SHADERS=1 instead?
> 
> If so, can someone explain how to redirect the output to a file? >> doesn't
> work.

Compile apitrace 5.0 from here:

https://github.com/apitrace/apitrace/releases
Comment 36 nicolas 2014-07-05 10:09:20 UTC
(In reply to comment #35)
> (In reply to comment #33)
> > Hi, I tried to make a trace too, but I failed to build apitrace from Git,
> > and the version in Ubuntu 14.04 (3.0) segfaults with XCOM :(
> > 
> > Would it help if I used R600_DUMP_SHADERS=1 instead?
> > 
> > If so, can someone explain how to redirect the output to a file? >> doesn't
> > work.
> 
> Compile apitrace 5.0 from here:
> 
> https://github.com/apitrace/apitrace/releases

Thanks, it works!
I have put the file here http://dl.free.fr/htMkkwBOy

I use a Radeon HD 4870 with r600g on Ubuntu 14.04, kernel 3.15 and the Oibaf PPA.
The symptom is that the game crashes and returns to the desktop.

more details here :

https://bugs.freedesktop.org/show_bug.cgi?id=80618
Comment 37 Michel Dänzer 2014-07-09 06:58:58 UTC
(In reply to comment #36)
> I have put the file here http://dl.free.fr/htMkkwBOy

That runs into what looks like bug 80673 to me.
Comment 38 Will H 2014-07-14 19:47:34 UTC
I'm getting a related issue at bug 80922
Comment 39 Vladimir Usikov 2014-08-03 13:26:39 UTC
Created attachment 103904 [details]
dmesg xcom after lock-ups

ArchLinux x86-64; linux 3.16rc6; mesa git; llvm svn; Radeon HD 7950

Lock-ups occur while playing a tactical mission. After waiting several seconds, the game is playable again.
Comment 40 Alex Deucher 2014-10-26 17:49:34 UTC
*** Bug 85334 has been marked as a duplicate of this bug. ***
Comment 41 Alex Deucher 2014-10-26 17:49:45 UTC
*** Bug 81576 has been marked as a duplicate of this bug. ***
Comment 42 Kamil Páral 2015-02-08 20:42:02 UTC
Created attachment 113261 [details]
dmesg during lockup

My system completely freezes while playing XCOM and I have to use SysRQ to reboot. Waiting does not help.

Radeon R9 270, Fedora 21, kernel-3.18.6-200.fc21.x86_64, mesa-dri-drivers-10.4.3-1.20150124.fc21.x86_64, xorg-x11-drv-ati-7.5.0-1.fc21.x86_64, llvm-libs-3.5.0-6.fc21.x86_64.
Comment 43 José Suárez 2015-02-09 17:39:52 UTC
I would suggest updating to a llvm-3.6 enabled mesa (and even git llvm 3.7). I had suffered from this lockup bug but the game has been pretty stable lately.
Comment 44 darkm00n 2015-03-27 21:04:47 UTC
I'm having the same lockup issue. About 15 minutes into a mission the game freezes, but I can still move the mouse and hear the music. After about 10 seconds of waiting the screen turns black and comes back, but now the mouse is frozen, there is no music and the system is completely frozen.

AMD Radeon HD 7870 GHz Edition, Arch Linux x64, mesa 10.5.1-2, kernel-3.19.2-1-ARCH x86_64, xf86-video-ati 1:7.5.0-2, llvm-libs 3.6.0-3.
Comment 45 darkm00n 2015-03-27 22:33:30 UTC
Created attachment 114671 [details]
dmesg during lockup 2

dmesg added
Comment 46 darkm00n 2015-04-02 00:32:25 UTC
Tested on Ubuntu 15.04 with Oibaf PPA, same thing happens...

mesa 10.6~git1504011930.5604d7~gd~v
xserver-xorg-video-ati 1:7.5.0-1ubuntu2
libllvm3.6 1:3.6-2ubuntu1
kernel 3.19.0.11.10

The saddest part is that game runs much smoother with mesa driver than with fglrx.
Comment 47 Laurent carlier 2015-04-15 04:55:07 UTC
Please test: I cannot reproduce the lockup with mesa-git 69411.05a1d84 and llvm-libs-svn 234894 (played for 20 minutes).
Comment 48 Andrei Slavoiu 2015-05-03 18:18:28 UTC
Created attachment 115526 [details]
dmesg for Kaveri

Slightly different dmesg from an A10-7850K system, involving the IOMMU. Maybe because of the HSA feature?

Mesa: 10.5.4
Llvm: 3.6.0
Kernel: 4.0.0
Comment 49 Alassane Maiga 2015-05-29 16:54:08 UTC
Hello,
I created an account because 2 days ago I was having the same issue and wanted to provide my dmesg. But when I was trying to reproduce the issue yesterday, I couldn't... I played for hours and the performance was actually quite good.
I am on Fedora 22 with KDE 5 now, and I did install some updates before trying again. The other thing I did was disable the option to suspend compositing for full-screen windows. I'm not sure which of these fixed it, though.
Comment 50 Alassane Maiga 2015-05-30 05:13:44 UTC
Created attachment 116166 [details]
dmesg output

I was able to reproduce the bug. It happens much less frequently, but it still does. My apitrace is huge though. How can I trim/compress it efficiently?
Here is my lspci -v output. I have also attached my dmesg:
05:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cape Verde XT [Radeon HD 7770/8760 / R7 250X] (prog-if 00 [VGA controller])
        Subsystem: Diamond Multimedia Systems Device 7770
        Flags: bus master, fast devsel, latency 0, IRQ 40
        Memory at d0000000 (64-bit, prefetchable) [size=256M]
        Memory at fbc80000 (64-bit, non-prefetchable) [size=256K]
        I/O ports at c000 [size=256]
        Expansion ROM at fbcc0000 [disabled] [size=128K]
        Capabilities: [48] Vendor Specific Information: Len=08 <?>
        Capabilities: [50] Power Management version 3
        Capabilities: [58] Express Legacy Endpoint, MSI 00
        Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Capabilities: [150] Advanced Error Reporting
        Capabilities: [270] #19
        Kernel driver in use: radeon
        Kernel modules: radeon
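
On the compression half of the question above, a minimal sketch, assuming a general-purpose compressor is acceptable for the upload (several attachments on this bug are xz archives). `compress_trace` is a hypothetical helper name; trimming the trace itself is a separate step not covered here.

```shell
# Compress a trace file next to the original, keeping the original.
# Prefers xz when it is installed and falls back to gzip otherwise.
compress_trace() {
    trace="$1"
    if command -v xz >/dev/null 2>&1; then
        xz -k "$trace"                      # writes "$trace.xz"
    else
        gzip -c "$trace" > "$trace.gz"      # writes "$trace.gz"
    fi
}

# Example usage:
#   compress_trace game.x86_64.trace
```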
Comment 51 Kai 2015-06-16 09:08:21 UTC
Created attachment 116529 [details]
Syslog excerpt showing the GPU stall and kernel backtrace

I can report a "me too". According to the syslog (see the attached file for more details), the crash happens in: drivers/gpu/drm/radeon/radeon_object.c:84 radeon_ttm_bo_destroy+0x86/0x9d with the stack detailed below.

Let me know if you need anything else to debug this.

My current stack (Debian testing as a base):
GPU: Hawaii PRO [Radeon R9 290] (ChipID = 0x67b1)
Mesa: Git:master/4d35eef326
libdrm: 2.4.60-3
LLVM: SVN:trunk/r239668 (3.7 devel)
X.Org: 2:1.17.1-2
Linux: 4.0.5
Firmware: <https://secure.freedesktop.org/~agd5f/radeon_ucode/hawaii/>
> 286640da3d90d7b51bdb038b65addc47  hawaii_ce.bin
> 161105a73f7dfb2fca513327491c32d6  hawaii_mc.bin
> d6195059ea724981c9acd3abd6ee5166  hawaii_me.bin
> ad511d31a4fe3147c8d80b8f6770b8d5  hawaii_mec.bin
> 63eae3f33c77aadbc6ed1a09a2aed81e  hawaii_pfp.bin
> 5b72c73acf0cbd0cbb639302f65bc7dc  hawaii_rlc.bin
> f00de91c24b3520197e1ddb85d99c34a  hawaii_sdma1.bin
> 8e16f749d62b150d0d1f580d71bc4348  hawaii_sdma.bin
> 7b6ca5302b56bd35bf52804919d57e63  hawaii_smc.bin
> 9f2ba7e720e2af4d7605a9a4fd903513  hawaii_uvd.bin
> b0f2a043e72fbf265b2f858b8ddbdb09  hawaii_vce.bin
libclc: Git:master/5cd2688a9f
DDX: Git:master/d7c82731a8
Comment 52 Marek Olšák 2015-08-02 12:38:30 UTC
*** Bug 80922 has been marked as a duplicate of this bug. ***
Comment 53 Kai 2015-08-02 12:53:20 UTC
As written in bug 80922, comment 2, I can't seem to trigger this any longer with the stack detailed there. But it's probably best if others report in as well, and I should give it a bit more than 1.5 h to happen.
Comment 54 Daniel Exner 2015-08-04 20:46:44 UTC
I'm sorry to inform you that I still see this bug.

Radeon R270X
Mesa 11.0.0-devel git-5f247a9
xf86-video-ati git-09c7cdb
llvm svn-243977
Comment 55 montsegur87 2015-09-21 14:57:47 UTC
This bug is still happening. Sometimes I can play 2 hours without a crash, sometimes 5 minutes.

Mesa 11.1.0-devel 
kernel 4.2
llvm 3.8
Comment 56 David Beswick 2015-10-03 07:19:20 UTC
Hi, I have an apitrace that hopefully shows the problem. It's 17GB at the moment -- if someone would like to give me some pointers on cutting it down or where to upload it then I can do that.

OpenGL core profile version string: 3.3 (Core Profile) Mesa 11.1.0-devel (git-511a863 2015-09-26 vivid-oibaf-ppa)

I'll continue testing and may generate further traces.
Comment 57 Michel Dänzer 2015-10-04 03:26:46 UTC
(In reply to David Beswick from comment #56)
> Hi, I have an apitrace that hopefully shows the problem.

Hopefully? Does it reproduce the problem for you?
Comment 58 David Beswick 2015-10-04 11:15:34 UTC
I didn't think to try replaying the trace as I assumed it would have the necessary data in it, apologies. I've done so now, but the system didn't hang.

Is it possible that the last part of the trace never makes it into the file because the system locks up? If I switch to another TTY I do see "ring 0 stalled" as in the syslog excerpt attached to the bug. Is there anything else besides Alt+PrtScn+REISUB that may help to capture the problematic part?

Otherwise, I'll continue gathering traces and will let you know if I get one that a replay can trigger.
Comment 59 David Beswick 2015-11-01 13:55:25 UTC
Just to update, I've captured three different traces but none have been able to reproduce the problem on replay. I've also tried the following:

* Looping a trace replay over 24 hours continuously -- no repro
* Running with a -O0 Mesa build -- hang remains
* Going directly to fallback in all cases during si_dma_copy (wild guess based on code comments) -- hang remains

I don't think traces will be a fruitful method of debugging, unless someone can suggest something I'm doing wrong. I'm continuing to look at this. If anyone has a hypothesis and would like to send a patch then I could build and test with it.
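
A sketch of how a long replay loop like the one in the first bullet could be scripted. The helper name `replay_loop` is hypothetical, and stopping on the first failed run is a design choice made here so that a crashing replay can be inspected rather than silently repeated.

```shell
# Run a command repeatedly until DURATION seconds have passed or the
# command fails, then print how many runs completed successfully.
replay_loop() {
    duration="$1"
    shift
    deadline=$(( $(date +%s) + duration ))
    runs=0
    while [ "$(date +%s)" -lt "$deadline" ]; do
        # Send the command's own output to stderr so stdout holds only the count.
        "$@" >&2 || break
        runs=$((runs + 1))
    done
    printf '%s\n' "$runs"
}

# Example usage (24 hours of replaying one trace):
#   replay_loop 86400 glretrace game.x86_64.trace
```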
Comment 60 David Beswick 2015-11-01 13:58:02 UTC
Forgot to add that I also tried replaying traces via Steam, in case the Steam overlay somehow had something to do with it. It doesn't seem to help as I can't reproduce the problem via a trace that way either.
Comment 61 Paulo Dias 2015-11-23 03:18:22 UTC
@David, have you tried with vsync on and off? I found that if I disable the SO sync and just leave the in-game one, the hang takes much longer to occur. Might be worth a shot?
Comment 62 Kamil Páral 2015-11-23 11:50:00 UTC
Paulo, what's "the SO sync"?
Comment 63 Paulo Dias 2015-11-23 22:13:08 UTC
Sorry, my bad, it was a typo. I meant the OS sync, for example the KF5 (KDE 5.4) vsync setting.
Comment 64 Alassane Maiga 2015-11-25 18:10:54 UTC
FWIW, I noticed the game seems to freeze for a second when playing on Windows, and then comes back. Could it be related? Maybe the game hangs the GPU on Windows too, but the Windows driver manages to recover?
Comment 65 David Beswick 2015-12-18 09:45:53 UTC
Hello everyone,

Just to keep you up to date, I switched tack and have been using the "GALLIUM_DDEBUG" environment variable to try and capture data about the crash. I found out about this option via a Google search. As I understand it, it creates fences around each GPU operation to detect when and if they complete, dumping the contents of the draw call if it doesn't finish in a timely way.

Unfortunately, I haven't been able to reproduce the crash with this variable enabled, despite playing for more than 5 hours (not consecutively.) The GALLIUM_DDEBUG mode is certainly working as performance is quite severely impacted. Maybe the way GALLIUM_DDEBUG is implemented unfortunately also prevents the issue from happening.

All I can say so far is that I suspect the problem is related to vertex buffer drawing. On one occasion I disabled the fences around vertex buffer drawing while enabling GALLIUM_DDEBUG (to try and get some more performance) and I did experience a hard lock as usual. I will continue running in this mode to see if it turns up a result.

If anyone else would like to try, you can modify the "~/.steam/steamapps/common/XCom-Enemy-Unknown/binaries/linux/xcom.sh" file. On a line before the call to "${GAMEBINARY}", write "export GALLIUM_DDEBUG=800". You can probably also set this environment variable before running Steam.
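
A minimal sketch of setting the variable for a single command instead of editing xcom.sh. The helper name `with_ddebug` is hypothetical, and 800 is simply the value used in this comment.

```shell
# Run a command with GALLIUM_DDEBUG set, keeping any value that is
# already present in the caller's environment.
with_ddebug() {
    env GALLIUM_DDEBUG="${GALLIUM_DDEBUG:-800}" "$@"
}

# Example usage:
#   with_ddebug glxgears
```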

Thank you Paulo for your suggestion, I will try that if I have time to see if it affects the frequency of the crashes.

The commit I tested with was 55365a7ad50c2e4547f58995a8e3411d8f2b00a2
Comment 66 Nicolai Hähnle 2015-12-18 20:38:35 UTC
Hi David, thank you for your efforts! Note that GALLIUM_DDEBUG=help explains a "noflush" option that you can also use. In any case, can you post the resulting file from ~/ddebug_dumps/?

Also, next time you do this experiment, please run with R600_DEBUG=ps,vs,gs,vm and post the output in addition to the ~/ddebug_dumps/.
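
For attaching the dumps, a sketch that packs the directory into one archive. The default directory follows the ~/ddebug_dumps/ path mentioned above; the helper name `bundle_dumps` and the output file name are assumptions.

```shell
# Pack a dump directory into a gzip-compressed tarball for attaching.
bundle_dumps() {
    dir="${1:-$HOME/ddebug_dumps}"
    out="${2:-ddebug_dumps.tar.gz}"
    # -C keeps only the directory's own name in the archive paths.
    tar -czf "$out" -C "$(dirname "$dir")" "$(basename "$dir")"
}

# Example usage:
#   bundle_dumps
#   bundle_dumps "$HOME/ddebug_dumps" /tmp/xcom_ddebug.tar.gz
```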

Though frankly, the best chance of getting this fixed would be a way to reliably reproduce it on a developer's machine. It's a pity the apitraces seem to be unreliable.
Comment 67 David Beswick 2015-12-18 23:49:35 UTC
Created attachment 120587 [details]
Output of GALLIUM_DDEBUG="800 noflush" and R600_DEBUG="ps,vs,gs,vm"
Comment 68 David Beswick 2015-12-18 23:54:04 UTC
Thank you Nicolai, I was able to reproduce the hang using the "noflush" option. DDEBUG seemed to detect the hang and killed the process, but my machine still locked up as usual.
Comment 69 Nicolai Hähnle 2015-12-19 06:36:25 UTC
Thank you for the followup. Apparently, even the very first tracepoint is not processed. This is weird.
Comment 70 Kamil Páral 2015-12-19 14:37:12 UTC
Created attachment 120594 [details]
kernel messages from xcom hang

I'd like to help with resolving this. I added GALLIUM_DDEBUG=800 and checked via the process's environ file that it is applied. Interestingly, I see no performance degradation (am I doing something wrong?). I played XCOM for about 3 hours (which seems to be considerably longer than usual) before it got stuck. The screen went black for 10 seconds, then the image returned, and I could move the mouse pointer but do nothing else. I was able to switch to tty3, but when I switched back, it froze completely, and I had to use SysRq to reboot. The hang is visible in the attached kernel messages.

There is no /home/$username/dd_dumps/ directory. I wonder whether the GALLIUM_DDEBUG variable is having any effect? Also, I can't find any documentation to it, the only thing I was able to find is this:
https://patchwork.freedesktop.org/patch/57799/

Inspired by comment 66, I tried GALLIUM_DDEBUG=help (like "GALLIUM_DDEBUG=help glxgears"), but nothing is written to stdout.
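
The environ check mentioned above can be sketched like this. On Linux, /proc/PID/environ holds NUL-separated NAME=value pairs as they were when the process started; `environ_var` is a hypothetical helper name, and the process name in the usage line is taken from the paths quoted earlier in this report.

```shell
# Print NAME=value from a NUL-separated environ file, if NAME is set there.
environ_var() {
    file="$1"
    name="$2"
    tr '\0' '\n' < "$file" | grep "^${name}="
}

# Example usage (assumes exactly one matching process is running):
#   environ_var "/proc/$(pidof game.x86_64)/environ" GALLIUM_DDEBUG
```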

01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Curacao PRO [Radeon R7 370 / R9 270/370 OEM] [1002:6811]
kernel-4.2.8-300.fc23.x86_64
mesa-dri-drivers-11.0.6-1.20151122.fc23.x86_64
xorg-x11-drv-ati-7.6.0-0.4.20150729git5510cd6.fc23.x86_64
xorg-x11-server-Xorg-1.18.0-2.fc23.x86_64
llvm-libs-3.7.0-1.fc23.x86_64
Fedora 23
Comment 71 Nicolai Hähnle 2015-12-19 15:22:05 UTC
Hi Kamil, your version of Mesa is too old: it does not contain the GALLIUM_DDEBUG feature yet.

At this point, the most helpful thing for you to do would be to reproduce this with latest development versions (i.e. Git/SVN master) of Mesa and LLVM, and see if you can get an apitrace which reliably reproduces the lockup.
Comment 72 Edwin Smith (Feral Interactive) 2015-12-21 17:12:05 UTC
A Steam key has been given to Nicolai Hähnle from Feral to help with his investigations into the crash.
Comment 73 Kamil Páral 2015-12-21 22:33:43 UTC
I have managed to capture an apitrace while the radeon driver locked up *and* recovered afterwards, so that I was able to exit the game normally (this almost never happens). I was hopeful that in this case it could contain all the calls triggering the lockup, compared to the case where the system locks up completely and does not recover. I ran into issues replaying the trace: apitrace seems to crash for any XCOM trace I capture. I reported it here:
https://github.com/apitrace/apitrace/issues/407

Nevertheless after many attempts I managed to replay the trace twice in full length, and it did not lock up my system again, nor was the lockup visible in the replay itself (in reality my system got locked up for about 10 seconds, but in the replay it plays uninterrupted). So it seems that apitrace is not something that can be used reliably to reproduce this issue, unfortunately.
Comment 74 Kamil Páral 2015-12-21 23:03:10 UTC
Created attachment 120647 [details]
system journal containing gpu hang during apitrace replay

And, in accordance with Murphy's law, after I posted my previous comment, I replayed the trace once again and it crashed my computer. Voila! (I can't say it happened at the same point in the replay as the originally recorded hang, because I wasn't looking at it, but it happened).

The trace file is here (1.5 GB compressed, recorded hang at the very end right before quitting the game):
https://drive.google.com/file/d/0B0Opr_geiK5nWUMwVEhJVnZPR2s/view?usp=sharing

I attach my system journal related to this "replay crash". One thing got my interest. Look at this Xorg backtrace:

(EE) Backtrace:
(EE) 0: /usr/libexec/Xorg (OsLookupColor+0x139) [0x59afb9]
(EE) 1: /lib64/libc.so.6 (__restore_rt+0x0) [0x7fe7ee2ebb1f]
(EE) 2: /lib64/libc.so.6 (__memcpy_avx_unaligned+0x1ab) [0x7fe7ee3ffafb]
(EE) 3: /usr/lib64/dri/radeonsi_dri.so (__driDriverGetExtensions_vmwgfx+0x108dbe) [0x7fe7e6c33efe]
(EE) 4: /usr/lib64/dri/radeonsi_dri.so (__driDriverGetExtensions_vmwgfx+0x1091a3) [0x7fe7e6c345a3]
(EE) 5: /usr/lib64/dri/radeonsi_dri.so (__driDriverGetExtensions_vmwgfx+0x109c02) [0x7fe7e6c35a42]
(EE) 6: /usr/lib64/dri/radeonsi_dri.so (__driDriverGetExtensions_vmwgfx+0x163098) [0x7fe7e6ce84f8]
(EE) 7: /usr/lib64/dri/radeonsi_dri.so (__driDriverGetExtensions_vmwgfx+0xfacb7) [0x7fe7e6c17d27]
(EE) 8: /usr/lib64/dri/radeonsi_dri.so (__driDriverGetExtensions_vmwgfx+0xfaf23) [0x7fe7e6c181b3]
(EE) 9: /usr/lib64/dri/radeonsi_dri.so (__driDriverGetExtensions_vmwgfx+0xfb358) [0x7fe7e6c18b48]
(EE) 10: /usr/lib64/xorg/modules/libglamoregl.so (glamor_create_gc+0x168d9) [0x7fe7e81b0b69]
(EE) 11: /usr/lib64/xorg/modules/libglamoregl.so (glamor_create_gc+0x175d2) [0x7fe7e81b2372]
(EE) 12: /usr/lib64/xorg/modules/libglamoregl.so (glamor_create_gc+0x4237) [0x7fe7e818bf67]
(EE) 13: /usr/lib64/xorg/modules/libglamoregl.so (glamor_create_gc+0x80a3) [0x7fe7e8193be3]
(EE) 14: /usr/lib64/xorg/modules/libglamoregl.so (glamor_create_gc+0x9ec9) [0x7fe7e8197529]
(EE) 15: /usr/libexec/Xorg (DamageRegionAppend+0x621) [0x51eeb1]
(EE) 16: /usr/lib64/xorg/modules/libglamoregl.so (glamor_create_gc+0x1108a) [0x7fe7e81a5f7a]
(EE) 17: /usr/libexec/Xorg (AddTraps+0x4cf2) [0x519d82]
(EE) 18: /usr/libexec/Xorg (SendErrorToClient+0x2df) [0x4369bf]
(EE) 19: /usr/libexec/Xorg (remove_fs_handlers+0x453) [0x43a9e3]
(EE) 20: /lib64/libc.so.6 (__libc_start_main+0xf0) [0x7fe7ee2d7580]
(EE) 21: /usr/libexec/Xorg (_start+0x29) [0x424ce9]
(EE) 22: ? (?+0x29) [0x29]
(EE)
(EE) Bus error at address 0x7fe7e8b71000
(EE)
Fatal server error:
(EE) Caught signal 7 (Bus error). Server aborting


There's this line:
(EE) 2: /lib64/libc.so.6 (__memcpy_avx_unaligned+0x1ab) [0x7fe7ee3ffafb]
which is the same function that I reported to be crashing in apitrace:
https://github.com/apitrace/apitrace/issues/407

Is this just a coincidence, or are these two bugs related (or the very same)?


(I'm sorry, I still have the same month-old mesa as in comment 70, I didn't figure out how to update it easily before I started tinkering with the replays).
Comment 75 Michel Dänzer 2015-12-22 06:29:27 UTC
(In reply to Kamil Páral from comment #74)
> There's this line:
> (EE) 2: /lib64/libc.so.6 (__memcpy_avx_unaligned+0x1ab) [0x7fe7ee3ffafb]
> which is the same function that I reported to be crashing in apitrace:
> https://github.com/apitrace/apitrace/issues/407
> 
> Is this just a coincidence, or are these two bugs related (or the very same)?

Presumably coincidence. The apitrace crashes are segmentation faults, i.e. probably due to overrunning some buffer[0]. The Xorg crash is a bus error, which is probably fallout of the GPU hang.

That said, we could probably confirm either way if we could look at a gdb backtrace of the Xorg crash.

[0] FWIW, replaying the apitrace with valgrind on llvmpipe, I'm also seeing invalid memory access, so it might be a bug in apitrace or some shared Gallium / Mesa code rather than in the radeonsi driver.
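
The segmentation fault versus bus error distinction can be read straight from the log lines quoted in this thread. A minimal sketch in Python, assuming only the two message formats shown above ("caught signal N" from apitrace and "Caught signal N" from Xorg); the helper name `signal_from_log` is made up for illustration:

```python
import re
import signal

# Matches both "caught signal 11" (apitrace) and "Caught signal 7" (Xorg).
_SIGNAL_RE = re.compile(r"[Cc]aught signal (\d+)")


def signal_from_log(line):
    """Return (number, name) for a 'caught signal N' log line, or None."""
    match = _SIGNAL_RE.search(line)
    if match is None:
        return None
    number = int(match.group(1))
    try:
        # Signal numbers are platform specific; this resolves them for the
        # machine the script runs on, so run it on the system that crashed.
        name = signal.Signals(number).name
    except ValueError:
        name = "unknown"
    return number, name


if __name__ == "__main__":
    for line in (
        "apitrace: warning: caught signal 11",
        "(EE) Caught signal 7 (Bus error). Server aborting",
    ):
        print(line, "->", signal_from_log(line))
```

On x86-64 Linux, signal 11 is SIGSEGV and signal 7 is SIGBUS, which matches the two different failure modes described above.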
Comment 76 Kamil Páral 2015-12-22 08:06:28 UTC
> That said, we could probably confirm either way if we could look at a gdb backtrace of the Xorg crash.

Unfortunately I don't have it. ABRT deleted it due to some unfortunate circumstances. I could try to reproduce it by looping the replay over, if needed.
Comment 77 Michel Dänzer 2015-12-22 08:15:35 UTC
(In reply to Kamil Páral from comment #76)
> I could try to reproduce it by looping the replay over, if needed.

No need, it's not important. The assumption for now is that the Xorg crash is caused by the GPU hang. If you happen to get a gdb backtrace of it in the future, that will help verify this assumption, but it's no more than nice to have.
Comment 78 Daniel Exner 2015-12-22 09:25:22 UTC
I tried to replay the trace attached:

> glretrace game.x86_64.trace 

But all I get is this (many times repeated):

>10536: message: major api error 1: GL_INVALID_ENUM in glCompressedTexImage2D(internalFormat=0x8c4d)
>10536 @1 glCompressedTexImage2DARB(target = GL_TEXTURE_2D, level = 0, internalformat = GL_COMPRESSED_SRGB_ALPHA_S3TC_DXT1_EXT, width = 64, height = 64, border = 0, imageSize = 2048, data = blob(2048))
>10536: warning: glGetError(glCompressedTexImage2DARB) = GL_INVALID_ENUM

Mesa:
OpenGL core profile version string: 4.1 (Core Profile) Mesa 11.1.0

Hardware:
OpenGL renderer string: Gallium 0.4 on AMD PITCAIRN (DRM 2.43.0, LLVM 3.7.0)

Kernel:
Linux 4.4.0-rc6-00005-g9d951f9 #4 SMP PREEMPT Mon Dec 21 18:33:34 CET 2015 x86_64 x86_64 x86_64 GNU/Linux

Will try now with llvmpipe
Comment 79 Michel Dänzer 2015-12-22 09:36:58 UTC
(In reply to Daniel Exner from comment #78)
> >10536: message: major api error 1: GL_INVALID_ENUM in glCompressedTexImage2D(internalFormat=0x8c4d)
> >10536 @1 glCompressedTexImage2DARB(target = GL_TEXTURE_2D, level = 0, internalformat = GL_COMPRESSED_SRGB_ALPHA_S3TC_DXT1_EXT, width = 64, height = 64, border = 0, imageSize = 2048, data = blob(2048))
> >10536: warning: glGetError(glCompressedTexImage2DARB) = GL_INVALID_ENUM

Looks like you may be missing GL_EXT_texture_compression_s3tc. Do you have libtxc-dxtn(-s2tc) packages installed?
Comment 80 Daniel Exner 2015-12-22 10:11:06 UTC
Yes, you were right. Some guy at my distro removed it without telling anyone. It is back now.

Now I get (with radeonsi):

glretrace game.x86_64.trace 
apitrace: warning: caught signal 11
47062: error: caught an unhandled exception
glretrace+0x28d196
glretrace+0x28c92c
glretrace+0x289ccd
/lib/libpthread.so.0+0x10d3f
/usr/lib/libGL.so.1+0x48945
glretrace+0x2e6e0
glretrace+0x405d6
glretrace+0xffca0
glretrace+0x3a4a4
glretrace+0x2fb35
glretrace+0x3183e
glretrace+0x317a7
glretrace+0x2fc39
glretrace+0x33eb1
glretrace+0x33630
/lib/libpthread.so.0+0x7483
/lib/libc.so.6: clone+0x6c
?
apitrace: info: taking default action for signal 11
Comment 81 Daniel Exner 2015-12-22 10:27:05 UTC
Don't get this error with llvmpipe. But also no crash. (Still running, slow as hell).
Comment 82 Michael Eagle 2015-12-22 10:32:05 UTC
(In reply to Kamil Páral from comment #74)
> 
> (I'm sorry, I still have the same month-old mesa as in comment 70, I didn't
> figure out how to update it easily before I started tinkering with the
> replays).

Hello,

On Fedora 23 I'm using this copr:
https://copr.fedoraproject.org/coprs/griever/mesa-git/

It provides the following packages as of today:
[root@mike-laptop mike]# rpm -qa | grep mesa
mesa-filesystem-11.2.0-0.devel.22.ea8c0b1.fc23.x86_64
mesa-libGLES-11.2.0-0.devel.22.ea8c0b1.fc23.x86_64
mesa-libOSMesa-11.2.0-0.devel.22.ea8c0b1.fc23.x86_64
mesa-libgbm-11.2.0-0.devel.22.ea8c0b1.fc23.i686
mesa-libxatracker-11.2.0-0.devel.22.ea8c0b1.fc23.x86_64
mesa-libEGL-11.2.0-0.devel.22.ea8c0b1.fc23.i686
mesa-libGLU-9.0.0-9.fc23.x86_64
mesa-filesystem-11.2.0-0.devel.22.ea8c0b1.fc23.i686
mesa-libglapi-11.2.0-0.devel.22.ea8c0b1.fc23.x86_64
mesa-libglapi-11.2.0-0.devel.22.ea8c0b1.fc23.i686
mesa-libgbm-11.2.0-0.devel.22.ea8c0b1.fc23.x86_64
mesa-libGL-11.2.0-0.devel.22.ea8c0b1.fc23.i686
mesa-dri-drivers-11.2.0-0.devel.22.ea8c0b1.fc23.x86_64
mesa-libGL-11.2.0-0.devel.22.ea8c0b1.fc23.x86_64
mesa-dri-drivers-11.2.0-0.devel.22.ea8c0b1.fc23.i686
mesa-libEGL-11.2.0-0.devel.22.ea8c0b1.fc23.x86_64
mesa-libwayland-egl-11.2.0-0.devel.22.ea8c0b1.fc23.x86_64

Hope it helps.
Comment 83 Daniel Exner 2015-12-22 10:49:11 UTC
Managed to get some more info about the glretrace crash:

                Stack trace of thread 13986:
                #0  0x00007fca93b67945 loader_dri3_wait_gl (libGL.so.1)
                #1  0x000000000042e6e1 _ZN4glws11GlxDrawable6resizeEii (glretrace)
                #2  0x00000000004405d7 _ZN9glretrace14updateDrawableEii (glretrace)
                #3  0x00000000004ffca1 _ZL25retrace_glBlitFramebufferRN5trace4CallE (glretrace)
                #4  0x000000000043a4a5 _ZN7retrace8Retracer7retraceERN5trace4CallE (glretrace)
                #5  0x000000000042fb36 _ZN7retraceL11retraceCallEPN5trace4CallE (glretrace)
                #6  0x000000000043183f _ZN7retrace11RelayRunner6runLegEPN5trace4CallE (glretrace)
                #7  0x00000000004317a8 _ZN7retrace11RelayRunner7runRaceEv (glretrace)
                #8  0x000000000042fc3a _ZN7retrace11RelayRunner12runnerThreadEPS0_ (glretrace)
                #9  0x0000000000433eb2 _ZN2os6thread13CallbackParamIFvPN7retrace11RelayRunnerEES4_EclEv (glretrace)
                #10 0x0000000000433631 _ZN2os6thread9_callbackINS0_13CallbackParamIFvPN7retrace11RelayRunnerEES5_EEEEPvS8_ (glretrace)
                #11 0x00007fca95856484 start_thread (libpthread.so.0)
                #12 0x00007fca944bcaed __clone (libc.so.6)
                
                Stack trace of thread 13984:
                #0  0x00007fca9585c05f pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
                #1  0x0000000000430be1 _ZN2os18condition_variable4waitERNS_11unique_lockINS_5mutexEEE (glretrace)
                #2  0x0000000000431749 _ZN7retrace11RelayRunner7runRaceEv (glretrace)
                #3  0x000000000042fc3a _ZN7retrace11RelayRunner12runnerThreadEPS0_ (glretrace)
                #4  0x0000000000433eb2 _ZN2os6thread13CallbackParamIFvPN7retrace11RelayRunnerEES4_EclEv (glretrace)
                #5  0x0000000000433631 _ZN2os6thread9_callbackINS0_13CallbackParamIFvPN7retrace11RelayRunnerEES5_EEEEPvS8_ (glretrace)
                #6  0x00007fca95856484 start_thread (libpthread.so.0)
                #7  0x00007fca944bcaed __clone (libc.so.6)
                
                Stack trace of thread 13983:
                #0  0x00007fca9585c05f pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
                #1  0x0000000000430be1 _ZN2os18condition_variable4waitERNS_11unique_lockINS_5mutexEEE (glretrace)
                #2  0x0000000000431749 _ZN7retrace11RelayRunner7runRaceEv (glretrace)
                #3  0x000000000042fc3a _ZN7retrace11RelayRunner12runnerThreadEPS0_ (glretrace)
                #4  0x0000000000433eb2 _ZN2os6thread13CallbackParamIFvPN7retrace11RelayRunnerEES4_EclEv (glretrace)
                #5  0x0000000000433631 _ZN2os6thread9_callbackINS0_13CallbackParamIFvPN7retrace11RelayRunnerEES5_EEEEPvS8_ (glretrace)
                #6  0x00007fca95856484 start_thread (libpthread.so.0)
                #7  0x00007fca944bcaed __clone (libc.so.6)
                
                Stack trace of thread 13981:
                #0  0x00007fca9585c05f pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
                #1  0x0000000000430be1 _ZN2os18condition_variable4waitERNS_11unique_lockINS_5mutexEEE (glretrace)
                #2  0x0000000000431749 _ZN7retrace11RelayRunner7runRaceEv (glretrace)
                #3  0x000000000042fead _ZN7retrace9RelayRace3runEv (glretrace)
                #4  0x0000000000430073 _ZN7retraceL8mainLoopEv (glretrace)
                #5  0x0000000000430957 main (glretrace)
                #6  0x00007fca943f45e0 __libc_start_main (libc.so.6)
                #7  0x000000000042c5e9 _start (glretrace)
                
                Stack trace of thread 13982:
                #0  0x00007fca9585c05f pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
                #1  0x00007fca8ff35ba3 radeon_drm_cs_emit_ioctl (radeonsi_dri.so)
                #2  0x00007fca8ff353e7 impl_thrd_routine (radeonsi_dri.so)
                #3  0x00007fca95856484 start_thread (libpthread.so.0)
                #4  0x00007fca944bcaed __clone (libc.so.6)


loader_dri3_wait_gl looks suspicious. Will retry with DRI3 disabled.
Comment 84 Daniel Exner 2015-12-22 11:00:48 UTC
Disabled DRI3 and the trace ran through (without a GPU crash), but glretrace crashed:

Wow, such stacktrace many interesting:

                Stack trace of thread 26634:
                #0  0x00007fc683d6ce6e __memcpy_sse2_unaligned (libc.so.6)
                #1  0x00007fc67f565f5f u_upload_data (radeonsi_dri.so)
                #2  0x00007fc67f567c88 u_vbuf_draw_vbo (radeonsi_dri.so)
                #3  0x00007fc67f3cce57 st_draw_vbo (radeonsi_dri.so)
                #4  0x00007fc67f39dad4 vbo_validated_drawrangeelements (radeonsi_dri.so)
                #5  0x00007fc67f39ddd4 vbo_exec_DrawRangeElementsBaseVertex (radeonsi_dri.so)
                #6  0x00000000004fcaee _ZL37retrace_glDrawRangeElementsBaseVertexRN5trace4CallE (glretrace)
                #7  0x000000000043a4a5 _ZN7retrace8Retracer7retraceERN5trace4CallE (glretrace)
                #8  0x000000000042fb36 _ZN7retraceL11retraceCallEPN5trace4CallE (glretrace)
                #9  0x000000000043183f _ZN7retrace11RelayRunner6runLegEPN5trace4CallE (glretrace)
                #10 0x00000000004317a8 _ZN7retrace11RelayRunner7runRaceEv (glretrace)
                #11 0x000000000042fc3a _ZN7retrace11RelayRunner12runnerThreadEPS0_ (glretrace)
                #12 0x0000000000433eb2 _ZN2os6thread13CallbackParamIFvPN7retrace11RelayRunnerEES4_EclEv (glretrace)
                #13 0x0000000000433631 _ZN2os6thread9_callbackINS0_13CallbackParamIFvPN7retrace11RelayRunnerEES5_EEEEPvS8_ (glretrace)
                #14 0x00007fc68515e484 start_thread (libpthread.so.0)
                #15 0x00007fc683dc4aed __clone (libc.so.6)
                
                Stack trace of thread 26629:
                #0  0x00007fc68516405f pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
                #1  0x0000000000430be1 _ZN2os18condition_variable4waitERNS_11unique_lockINS_5mutexEEE (glretrace)
                #2  0x0000000000431749 _ZN7retrace11RelayRunner7runRaceEv (glretrace)
                #3  0x000000000042fead _ZN7retrace9RelayRace3runEv (glretrace)
                #4  0x0000000000430073 _ZN7retraceL8mainLoopEv (glretrace)
                #5  0x0000000000430957 main (glretrace)
                #6  0x00007fc683cfc5e0 __libc_start_main (libc.so.6)
                #7  0x000000000042c5e9 _start (glretrace)
                
                Stack trace of thread 26630:
                #0  0x00007fc68516405f pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
                #1  0x00007fc67f83dba3 radeon_drm_cs_emit_ioctl (radeonsi_dri.so)
                #2  0x00007fc67f83d3e7 impl_thrd_routine (radeonsi_dri.so)
                #3  0x00007fc68515e484 start_thread (libpthread.so.0)
                #4  0x00007fc683dc4aed __clone (libc.so.6)
                
                Stack trace of thread 26632:
                #0  0x00007fc68516405f pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
                #1  0x0000000000430be1 _ZN2os18condition_variable4waitERNS_11unique_lockINS_5mutexEEE (glretrace)
                #2  0x0000000000431749 _ZN7retrace11RelayRunner7runRaceEv (glretrace)
                #3  0x000000000042fc3a _ZN7retrace11RelayRunner12runnerThreadEPS0_ (glretrace)
                #4  0x0000000000433eb2 _ZN2os6thread13CallbackParamIFvPN7retrace11RelayRunnerEES4_EclEv (glretrace)
                #5  0x0000000000433631 _ZN2os6thread9_callbackINS0_13CallbackParamIFvPN7retrace11RelayRunnerEES5_EEEEPvS8_ (glretrace)
                #6  0x00007fc68515e484 start_thread (libpthread.so.0)
                #7  0x00007fc683dc4aed __clone (libc.so.6)
                
                Stack trace of thread 26633:
                #0  0x00007fc68516405f pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
                #1  0x0000000000430be1 _ZN2os18condition_variable4waitERNS_11unique_lockINS_5mutexEEE (glretrace)
                #2  0x0000000000431749 _ZN7retrace11RelayRunner7runRaceEv (glretrace)
                #3  0x000000000042fc3a _ZN7retrace11RelayRunner12runnerThreadEPS0_ (glretrace)
                #4  0x0000000000433eb2 _ZN2os6thread13CallbackParamIFvPN7retrace11RelayRunnerEES4_EclEv (glretrace)
                #5  0x0000000000433631 _ZN2os6thread9_callbackINS0_13CallbackParamIFvPN7retrace11RelayRunnerEES5_EEEEPvS8_ (glretrace)
                #6  0x00007fc68515e484 start_thread (libpthread.so.0)
                #7  0x00007fc683dc4aed __clone (libc.so.6)
Comment 85 Kamil Páral 2015-12-22 12:37:24 UTC
(In reply to Daniel Exner from comment #80)
> Now I get (with radeonsi):
> 
> glretrace game.x86_64.trace 
> apitrace: warning: caught signal 11
> 47062: error: caught an unhandled exception

Great to see it does not crash just for me :-)

(In reply to Michael Eagle from comment #82)
> On Fedora 23 I'm using this copr:
> https://copr.fedoraproject.org/coprs/griever/mesa-git/

Thanks, that helps a lot!

(In reply to Daniel Exner from comment #83)
> Managed to get some more info about the glretrace crash:

Please note I uploaded a gdb backtrace to the apitrace bug mentioned in comment 73. Now that I have tested the latest mesa git (ea8c0b1, 2015-12-21), I can confirm apitrace still crashes, and I uploaded an updated gdb backtrace:
https://github.com/apitrace/apitrace/files/69636/gdb.backtrace2.txt

However, the question is whether that apitrace crash is related to the gpu hang we see in XCOM. It certainly makes debugging harder.
Comment 86 Nicolai Hähnle 2015-12-22 19:48:11 UTC
Many thanks for your patience! While I still did not see a crash, valgrind does indeed report errors when playing back your crash. It is possible that such errors are indirectly related to the lockup, if they lead to bad data being sent to the card. Today's my last day before the holidays, but I'll look into this next week.
Comment 87 Kamil Páral 2015-12-22 22:38:04 UTC
Created attachment 120655 [details]
gdb backtrace from xorg when system locks up during glretrace replay

Thanks Nicolai and others for looking into this. I have some good news.

The apitrace developer patched glretrace, so that it no longer crashes during replay. He says it's a bug inside XCOM. He also says the problem might likely cause the radeonsi driver crash as well. See his comment here:
https://github.com/apitrace/apitrace/issues/407#issuecomment-166619366

(If this indeed turns out to be an XCOM bug, it would be nice if we could put some safeguards into the driver and didn't crash for invalid commands, but I'm saying that as someone who knows exactly zero about gpu driver programming).

Also, now that glretrace does not crash, I'm able to easily loop the trace until its very end. In just a handful of replays, my system locked up 3 times. I haven't seen all lockups, but at least once it occurred exactly at the point where the lockup occurred while recording (at the very end). So it seems I'm able to reproduce this with reasonable likelihood, and therefore can test some fixes if needed.

(In reply to Michel Dänzer from comment #77)
> No need, it's not important. The assumption for now is that the Xorg crash
> is caused by the GPU hang. If you happen to get a gdb backtrace of it in the
> future, that will help verify this assumption, but it's no more than nice to
> have.

I have it now, attaching.
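
For anyone else who wants to loop the replay in the same way, here is a minimal sketch in Python. It assumes nothing beyond the `glretrace game.x86_64.trace` invocation already used in this thread; the function name `loop_replay` and the log file name are made up. Because a hard lockup loses anything still buffered, each run is logged and synced to disk before it starts, so that after a reboot the log shows which run was in progress.

```python
import os
import subprocess


def loop_replay(command, runs, log_path):
    """Run `command` up to `runs` times, logging each start durably.

    Returns the list of exit codes of the runs that completed.
    """
    exit_codes = []
    with open(log_path, "a") as log:
        for run in range(1, runs + 1):
            # Flush and fsync before starting, so the entry survives a
            # hard lockup and shows which run was in progress.
            log.write(f"run {run} started\n")
            log.flush()
            os.fsync(log.fileno())
            result = subprocess.run(command)
            exit_codes.append(result.returncode)
            log.write(f"run {run} exited with {result.returncode}\n")
            log.flush()
            os.fsync(log.fileno())
    return exit_codes


# Example call (not executed here):
#   loop_replay(["glretrace", "game.x86_64.trace"], 10, "replay.log")
```

A run that has a "started" line but no "exited" line in the log is the one during which the machine locked up.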
Comment 88 Jose Fonseca 2015-12-22 22:52:25 UTC
(In reply to Kamil Páral from comment #87)
> The apitrace developer patched glretrace, so that it no longer crashes
> during replay. He says it's a bug inside XCOM. He also says the problem
> might likely cause the radeonsi driver crash as well. See his comment here:
> https://github.com/apitrace/apitrace/issues/407#issuecomment-166619366
> 
> (If this indeed turns out to be an XCOM bug, it would be nice if we could
> put some safeguards into the driver and didn't crash for invalid commands,
> but I'm saying that as someone who knows exactly zero about gpu driver
> programming).

Yep.  We could add a new drirc hack for ignoring the start/end params of glDrawRangeElementsBaseVertex for applications like XCOM, ie, assume 0..~0.

XCOM is not unique here -- we've seen this happening with Direct3D9 once at VMware.  It looks like some of the proprietary OpenGL/Direct3D drivers out there simply outright ignore the min/max index hints.
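
To make the "assume 0..~0" idea concrete, here is a rough sketch in Python of the decision it implies. This is only a model of the logic, not Mesa code: `ignore_range_hints` is a hypothetical per-application switch (no such drirc option is confirmed here), and falling back to scanning the index data is an assumption about what a driver would do once it has no usable hint.

```python
UINT32_MAX = 0xFFFFFFFF  # "~0" for a 32-bit GLuint


def effective_bounds(indices, start, end, ignore_range_hints):
    """Return the (min, max) vertex index a draw should be assumed to touch.

    `ignore_range_hints` is a hypothetical per-application switch.
    Returns None for an empty draw.
    """
    if not indices:
        return None
    if ignore_range_hints:
        # Pretend the application passed 0..~0, i.e. no usable hint.
        start, end = 0, UINT32_MAX
    if start == 0 and end == UINT32_MAX:
        # No usable hint: fall back to scanning the index data itself.
        return min(indices), max(indices)
    # Otherwise trust the application-supplied range.
    return start, end


if __name__ == "__main__":
    indices = [3, 40, 7]          # 40 lies outside the claimed range 0..15
    print(effective_bounds(indices, 0, 15, False))
    print(effective_bounds(indices, 0, 15, True))
```

The scan costs CPU time on every draw, which is why such a switch would only make sense for applications known to pass bad ranges.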
Comment 89 Michel Dänzer 2015-12-23 03:25:11 UTC
(In reply to Daniel Exner from comment #83)
> Managed to get some more info about the glretrace crash:
> 
>                 Stack trace of thread 13986:
>                 #0  0x00007fca93b67945 loader_dri3_wait_gl (libGL.so.1)

This crash is fixed in current Mesa Git master.

BTW, please create attachments for such large pieces of information.

(In reply to Kamil Páral from comment #87)
> > If you happen to get a gdb backtrace of it in the future, that will help
> > verify this assumption, but it's no more than nice to have.
> 
> I have it now, attaching.

Thanks, it confirms that the Xorg crash is caused by the GPU hang and not related to the apitrace crash.
Comment 90 Paulo Dias 2015-12-23 03:48:58 UTC
Just to let you guys know: with the latest LLVM 3.8 (256187) fixes and Mesa up to 50fc4a925644378c50282004304bc8fd64b95e3c, it takes much longer for XCOM: Enemy Unknown to crash the GPU. I played for two solid hours, so it's getting better. If someone wants, I can test the drirc workaround for glDrawRangeElementsBaseVertex; just put a .drirc here.
Comment 91 Daniel Exner 2015-12-23 13:37:26 UTC
Sorry for pasting instead of attaching.

The specs for glDrawRangeElementsBaseVertex [1] say this case (array out of bounds) should be handled like this:

"Index values lying outside the range [start, end] are treated in the same way as glDrawElementsBaseVertex. "

The specs for glDrawElementsBaseVertex [2] don't say anything about this case (obviously, since this function doesn't imply any size constraints for the array).

So it seems it is indeed a bug in the game to address this index element, but the operation should still not crash, and the resulting behaviour is unspecified.

Perhaps radeonsi should handle it the same way as other Mesa drivers do, for the sake of consistency.

[1] https://www.opengl.org/sdk/docs/man/html/glDrawRangeElementsBaseVertex.xhtml
[2] https://www.opengl.org/sdk/docs/man/html/glDrawElementsBaseVertex.xhtml
Comment 92 Jose Fonseca 2015-12-23 15:20:42 UTC
(In reply to Daniel Exner from comment #91)
> So it seems it is indeed a bug in the game to address this index element,
> but the operation should still not crash, and the resulting behaviour is
> unspecified.
> 
> Perhaps radeonsi should handle it the same way as other Mesa drivers do, for
> the sake of consistency.

Yes, crashing should be avoided.

But correct rendering, no, not generally.  Not unless it can be without performance impact (which is probably not the case.)

Otherwise it would be sacrificing the performance of correct GL apps, for the sake of buggy GL apps.  Which is rewarding the wrong behavior.


It's not that hard: the start/end parameters are hints precisely aimed at enabling the driver to do performance optimizations.  If the application developers can't get them right, they should just not set them to invalid values!  Use 0 / ~0, which is guaranteed to work.   This way the application developers who actually bothered to get them right don't get penalized.  Everybody's happy.


Maybe it would help if Mesa's KHR_debug / apitrace checked for this sort of error.
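
As an illustration of what such a check would have to do, here is a small Python sketch. It is not taken from Mesa or apitrace; the function name `out_of_range_indices` and the table `INDEX_FORMATS` are made up, and little-endian index data (as on x86) is assumed. It decodes an index blob and lists every index that falls outside the [start, end] range the application claimed.

```python
import struct

# struct format characters for the three GL index types.
INDEX_FORMATS = {
    "GL_UNSIGNED_BYTE": "B",
    "GL_UNSIGNED_SHORT": "H",
    "GL_UNSIGNED_INT": "I",
}


def out_of_range_indices(index_blob, index_type, start, end):
    """Return the indices in `index_blob` that lie outside [start, end]."""
    fmt = INDEX_FORMATS[index_type]
    # "<" = little-endian, as on x86; an assumption of this sketch.
    size = struct.calcsize("<" + fmt)
    count = len(index_blob) // size
    indices = struct.unpack(f"<{count}{fmt}", index_blob[: count * size])
    return [i for i in indices if not start <= i <= end]


if __name__ == "__main__":
    # Six 16-bit indices; the last one violates the claimed range 0..15.
    blob = struct.pack("<6H", 0, 1, 2, 2, 1, 20)
    bad = out_of_range_indices(blob, "GL_UNSIGNED_SHORT", 0, 15)
    if bad:
        print("indices outside [start, end]:", bad)
```

A debug layer could emit a warning whenever the returned list is non-empty; doing this in a release driver would mean scanning every index buffer, which is the cost the range hints exist to avoid.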
Comment 93 Nicolai Hähnle 2015-12-29 20:17:39 UTC
Brief update: The crashes and Valgrind errors when playing back the trace are almost certainly unrelated to the lockup. It turns out apitrace is too aggressive in trimming the client-side memory blobs during recording. (See https://github.com/apitrace/apitrace/issues/407#issuecomment-167866502)

On a more positive note, I am also seeing the lockup on Tonga from inside the game itself. Unfortunately, I cannot reproduce it reliably yet.
Comment 94 Roland Scheidegger 2015-12-30 03:48:16 UTC
Created attachment 120734 [details] [review]
apitrace patch for honoring range in DrawRangeElementsX commands

So you'd want something like this (totally untested) patch for apitrace?
Makes sense I guess that the specified range not only means the supplied indices have to be inside that range, but it also works the other way round (driver can rely on the specified range being accessible) - otherwise the driver would still need to scan the actual index buffer. (Albeit since for this app the ranges seem to be pretty bogus who knows if the memory inside the specified range but not used in the actual indices is really always accessible.)
I am actually wondering if it would be legal if the memory isn't accessible below the start value (apitrace surely couldn't handle that)...
Comment 95 Jose Fonseca 2015-12-30 22:44:14 UTC
(In reply to Roland Scheidegger from comment #94)
> Created attachment 120734 [details] [review] [review]
> apitrace patch for honoring range in DrawRangeElementsX commands
> 
> So you'd want something like this (totally untested) patch for apitrace?
> Makes sense I guess that the specified range not only means the supplied
> indices have to be inside that range, but it also works the other way round
> (driver can rely on the specified range being accessible) - otherwise the
> driver would still need to scan the actual index buffer. (Albeit since for
> this app the ranges seem to be pretty bogus who knows if the memory inside
> the specified range but not used in the actual indices is really always
> accessible.)

Yes, this does indeed make sense.

For VBOs, the start/end range is a hint (the OpenGL implementation shouldn't crash if the start/end range goes beyond the VBO size), but for user memory arrays, there's no reliable way to know where user memory is supposed to stop -- the start/end range is it.

> I am actually wondering if it would be legal if the memory isn't accessible
> below the start value (apitrace surely couldn't handle that)...

I don't think it's worth worrying about that.
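
A back-of-the-envelope model of why this matters for user memory arrays, written as a Python sketch. It is a simplification and an assumption, not the actual apitrace or Mesa logic: it takes basevertex as 0, assumes one interleaved attribute, and models a recorder that keeps only the vertices the indices actually reference while the reader copies everything up to `end`. All names and the numbers in the example are made up.

```python
def vertex_bytes_needed(last_index, stride, attrib_offset, attrib_size):
    """Bytes of a user array needed to read vertices 0..last_index."""
    return last_index * stride + attrib_offset + attrib_size


def replay_overrun(indices, end, stride, attrib_offset, attrib_size):
    """Bytes a range-trusting reader would read past a trimmed recording.

    Models a recorder that keeps only the vertices the indices actually
    reference, and a reader that copies everything up to `end`.
    """
    recorded = vertex_bytes_needed(max(indices), stride, attrib_offset, attrib_size)
    read = vertex_bytes_needed(end, stride, attrib_offset, attrib_size)
    return max(0, read - recorded)


if __name__ == "__main__":
    # Highest index used is 9, but the draw call claims end = 15.
    # With a 32-byte stride and a 12-byte attribute at offset 0:
    #   recorded = 9 * 32 + 12 = 300 bytes
    #   read     = 15 * 32 + 12 = 492 bytes
    print(replay_overrun([0, 4, 9], 15, 32, 0, 12))  # prints 192
```

In this model the reader walks 192 bytes past the recorded blob, which is the kind of out-of-bounds copy that shows up as a crash inside memcpy.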
Comment 96 Nicolai Hähnle 2015-12-31 02:22:58 UTC
Created attachment 120742 [details]
more conservative apitrace patch

For what it's worth, I've attached a modified version of Roland's patch that is slightly more conservative, guarding against some stupid end values and checking the indices. Not sure which patch is really better though, in the end it depends on how much broken software is out there. As far as I can tell, XCOM apitraces work with both variants.
Comment 97 Jose Fonseca 2016-01-04 20:00:47 UTC
I'm pushing a slightly modified version of Roland's patch.

(In reply to Nicolai Hähnle from comment #96)
> Created attachment 120742 [details]
> more conservative apitrace patch
> 
> For what it's worth, I've attached a modified version of Roland's patch that
> is slightly more conservative, guarding against some stupid end values and
> checking the indices. Not sure which patch is really better though, in the
> end it depends on how much broken software is out there. As far as I can
> tell, XCOM apitraces work with both variants.

Yes, I don't think this is necessary.  If apitrace needs more resiliency, then the best approach would be to set up a SEGV handler to cope with out-of-bounds reads.

This code path is only used for user arrays. (VBOs don't need this special treatment.)
Comment 98 Daniel Exner 2016-01-10 10:54:24 UTC
glretrace -v from apitrace (5e96ed318db1ba8037eb402724bc052240ac9e05) still crashes with the trace:

#0  0x00007f9d112fae6e __memcpy_sse2_unaligned (libc.so.6)
#1  0x00007f9d0caf3f5f u_upload_data (radeonsi_dri.so)
#2  0x00007f9d0caf5c88 u_vbuf_draw_vbo (radeonsi_dri.so)
#3  0x00007f9d0c95ae57 st_draw_vbo (radeonsi_dri.so)
#4  0x00007f9d0c92bad4 vbo_validated_drawrangeelements (radeonsi_dri.so)

Last output before crash:

965948 @3 glDrawRangeElementsBaseVertex(mode = GL_TRIANGLES, start = 0, end = 15, count = 30, type = GL_UNSIGNED_SHORT, indices = blob(60), basevertex = 0).

Kernel 4.4.0-rc8, Mesa 11.1.0

Game itself also still crashes.
Comment 99 Daniel Exner 2016-01-10 11:13:26 UTC
Played a bit with GALLIUM_HUD.
Using: GALLIUM_HUD="fps,cpu,ps-invocations+hs-invocations+ds-invocations+cs-invocations;num-compilations+num-shaders-created,draw-calls,buffer-wait-time;num-cs-flushes,num-bytes-moved;VRAM-usage+GTT-usage,GPU-load,temperature,shader-clock+memory-clock"

I can play for some minutes before the game crashes (but doesn't kill the box) and the trace looks like this:

#0  0x00007f22458745ac hud_draw_string (radeonsi_dri.so)
#1  0x00007f2245874e28 hud_draw (radeonsi_dri.so)

I guess this is a GALLIUM_HUD bug.

Any metric that might be of particular interest?
Comment 100 Nicolai Hähnle 2016-01-10 15:33:56 UTC
Hi Daniel,

re comment #98: That's to be expected. The problem was with *recording* the trace, not with playing it back. A trace recorded without the patch will crash when played back, whether the playback session has the patch or not.

re comment #99: Interesting, and yes, I'd say that's a HUD bug. Could you please file a separate bug report for that?

Those metrics are very unlikely to help, though. One of my crash reproductions had a VM fault where a page that should have been there (according to the radeonsi VM fault dump) apparently wasn't. There are some theories (like problems with the sdma ring that is used to update the page table entries), so it's likely some race condition is involved.
Comment 101 Daniel Exner 2016-01-10 16:26:13 UTC
I added a new bug report about GALLIUM_HUD bug.

Are those faults related to this bug?

Jan 10 12:04:15 Joshua kernel: radeon 0000:01:00.0: GPU fault detected: 146 0x0f653014
Jan 10 12:04:15 Joshua kernel: radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0001DF7B
Jan 10 12:04:15 Joshua kernel: radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x05030014
Jan 10 12:04:36 Joshua kernel: radeon 0000:01:00.0: GPU fault detected: 146 0x0a653014
Jan 10 12:04:36 Joshua kernel: radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00016F53
Jan 10 12:04:36 Joshua kernel: radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x05030014
Jan 10 12:04:56 Joshua kernel: radeon 0000:01:00.0: GPU fault detected: 146 0x0b853014
Jan 10 12:04:56 Joshua kernel: radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00016F5C
Jan 10 12:04:56 Joshua kernel: radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x05030014
Jan 10 12:06:39 Joshua kernel: radeon 0000:01:00.0: GPU fault detected: 146 0x06253014
Jan 10 12:06:39 Joshua kernel: radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00021CB1
Jan 10 12:06:39 Joshua kernel: radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x05030014
Comment 102 Nicolai Hähnle 2016-01-10 16:33:47 UTC
Yes, they're almost certainly related. A sequence of VM faults followed by a lockup is not an unusual symptom.
Comment 103 Kamil Páral 2016-01-11 16:31:03 UTC
(In reply to Nicolai Hähnle from comment #100)
> re comment #98: That's to be expected. The problem was with *recording* the
> trace, not with playing it back. A trace recorded without the patch will
> crash when played back, whether the playback session has the patch or not.

Will it help you if I try to capture another trace with fixed apitrace? If it will, can I simply grab apitrace git master (i.e. including https://github.com/apitrace/apitrace/commit/edc099cff55a6a3f9ad191acfbc8cc39f36228db ), or do I also need to apply the patch mentioned in https://github.com/apitrace/apitrace/issues/407#issuecomment-166619366 on top of that? (that patch was not pushed to git).
Comment 104 Nicolai Hähnle 2016-01-11 18:04:15 UTC
At this point, I can reproduce the lockup albeit not deterministically, so it's not really needed. If you are able to capture an apitrace that reproduces the lockup deterministically (even after a cold reboot!), then that would still be interesting - but I kind of doubt that that's possible.
Comment 105 darkm00n 2016-02-05 02:27:35 UTC
Created attachment 121531 [details]
journald xcom: ew lockup

Any news on this bug? I've just tried to play the game and my system froze after a couple of minutes in the first mission. I was able to move the cursor and the music was playing, but I couldn't do anything else. I had journalctl -f running on my laptop via SSH so I could get the error messages.

Arch Linux
Kernel 4.4.1
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Pitcairn XT [Radeon HD 7870 GHz Edition]
Using the latest mesa-git and llvm-libs from lcarlier repo
Comment 106 darkm00n 2016-02-05 02:29:58 UTC
Created attachment 121532 [details]
journald xcom: ew apitrace crash

Also tried to run apitrace after the lockup but it crashed the game when I clicked to start the mission on the loading screen. Not sure what I'm doing wrong and couldn't get apitrace past the loading screen. journald log attached.
Comment 107 Kamil Páral 2016-02-05 08:20:56 UTC
We have found out that XCOM issues invalid OpenGL commands. The purpose of this bug report is for the radeon driver to stop crashing when that happens. But I wonder if somebody contacted the XCOM developers and asked them to fix their bug? That could make the game work properly with the radeon driver, with no crashes. I know that XCOM devs chipped in here before, but they likely haven't followed the full discussion.
Comment 108 Edwin Smith (Feral Interactive) 2016-02-05 09:32:03 UTC
(In reply to Kamil Páral from comment #107)
> We have found out that XCOM issues invalid OpenGL commands. 

Have we confirmed that this is true, and if so, which call is supposedly incorrect? I have been following and have seen some speculation, but it seems that although there was some debate, no firm conclusion was reached on whether this is undefined behaviour that is open to interpretation or something definitely wrong with the game.

Once we get some information on the issue, we can then investigate at Feral whether the issue is indeed inside the game rather than inside the Mesa drivers.
Comment 109 Kamil Páral 2016-02-05 10:30:51 UTC
Hi Edwin, thanks for following this discussion! I got the impression that XCOM uses invalid or at least undefined behavior from comment 88, 91 and 92, and from the apitrace bug comments here:
https://github.com/apitrace/apitrace/issues/407#issuecomment-166619366
https://github.com/apitrace/apitrace/issues/407#issuecomment-166752457
https://github.com/apitrace/apitrace/issues/407#issuecomment-167866502

But I'm no OpenGL developer, so I'll let Nicolai or Jose or somebody else knowledgeable confirm or refute this :)
Comment 110 Edwin Smith (Feral Interactive) 2016-02-05 13:15:51 UTC
(In reply to Kamil Páral from comment #109)
> Hi Edwin, thanks for following this discussion! I got the impression that
> XCOM uses invalid or at least undefined behavior from comment 88, 91 and 92,
> and from the apitrace bug comments here:
> https://github.com/apitrace/apitrace/issues/407#issuecomment-166619366
> https://github.com/apitrace/apitrace/issues/407#issuecomment-166752457
> https://github.com/apitrace/apitrace/issues/407#issuecomment-167866502
> 
> But I'm no OpenGL developer, so I'll let Nicolai or Jose or somebody else
> knowledgeable confirm or refute this :)

:) No problem, we'll have a look at the Linux code the next time we patch the game; regardless of the reasons, it would be nice to get the game running as well as the other Feral games that run on Mesa. It's quite possible this issue is something in the original engine behaviour that needs correcting for the Mesa drivers.

Many thanks to everyone on the bug for their investigations.
Comment 111 Nicolai Hähnle 2016-02-05 14:14:07 UTC
The apitrace-related crash from those earlier comments is not an XCOM bug. However, if I recall correctly, XCOM issues DrawRangeElements calls with ranges that are larger than necessary. This hurts performance slightly when vertex data is not in VBOs; it is irrelevant when all vertex data is in VBOs.

I have been able to reproduce the lockup (thanks to Edwin), but it takes a fairly long time inside the game to happen for me. With everything else that is going on, I simply haven't yet had the time to collect enough information to really understand what's going on.
Comment 112 Kamil Páral 2016-02-05 14:21:37 UTC
(In reply to Nicolai Hähnle from comment #111)
> However, if I recall correctly, XCOM issues DrawRangeElements calls with
> ranges that are larger than necessary.

My understanding was that exactly the opposite was the issue - the game sends indices outside of the specified bounds. Quoting Jose [1]:
"the application is giving a hint to the OpenGL driver that indices are between 0 and 215, but in fact in call 21933, the index at position 164 is 216, and more later."
[1] https://github.com/apitrace/apitrace/issues/407#issuecomment-166619366
Comment 113 Nicolai Hähnle 2016-02-05 14:43:23 UTC
Perhaps it does, and that would be bad, but the particular apitrace crash was basically the following:

1) XCOM uses DrawRangeElements with an unnecessarily large range.
2) During tracing, apitrace scans the index/elements array to determine the range of vertices that is really being used.
3) Apitrace only stores this range of vertices.
4) During playback, apitrace would send the same range via DrawRangeElements, but provide vertex data only for the range that was determined to be really used.
5) The driver, on the other hand, relies on the entire range to be there and tries to upload it to the card. This is where the crash happened (and it also explains what I said before about XCOM being slightly inefficient here)
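As a rough illustration of steps 1 to 5, the following C sketch compares the vertex range an application declares with the range its indices really touch. The helper names (`scan_index_range`, `vertex_shortfall`) and the 16-bit index type are assumptions made for this example; they are not apitrace or Mesa functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive range of vertex indices. */
struct index_range {
    uint32_t min;
    uint32_t max;
};

/* Step 2: scan the index array for the range that is really used.
 * The caller must pass at least one index. */
static struct index_range scan_index_range(const uint16_t *indices, size_t count)
{
    assert(indices != NULL && count > 0);
    struct index_range r = { indices[0], indices[0] };
    for (size_t i = 1; i < count; i++) {
        if (indices[i] < r.min)
            r.min = indices[i];
        if (indices[i] > r.max)
            r.max = indices[i];
    }
    return r;
}

/* Steps 3-5: number of vertices inside the declared range [start, end]
 * that a trace storing only the used range does not contain.
 * Assumes start <= used.min and used.max <= end. */
static uint32_t vertex_shortfall(uint32_t start, uint32_t end,
                                 struct index_range used)
{
    uint32_t declared = end - start + 1;
    uint32_t stored = used.max - used.min + 1;
    return declared - stored;
}
```

For a draw that declares the range 0..99 but only uses indices 2..5, this reports 96 vertices that a driver reading the whole declared range would expect but a trimmed trace would not provide.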
Comment 114 Kamil Páral 2016-02-05 14:56:52 UTC
I see, thanks for the explanation. In that case I was wrong in automatically assuming there is a bug in XCOM. Sorry for the noise.
Comment 115 Jose Fonseca 2016-02-05 16:29:21 UTC
I think there are multiple issues being conflated here.

Yes, apitrace had some bugs, and they might have caused additional grief to Mesa drivers when replaying traces from XCOM.

But if the question is "does XCOM have a bug?", then the answer is IMO a definite "yes".

As explained in https://github.com/apitrace/apitrace/issues/407#issuecomment-166619366 the game is passing indices outside the start..end range, which is illegal per https://www.opengl.org/sdk/docs/man/html/glDrawRangeElementsBaseVertex.xhtml "all values in the array indices must lie between start and end, inclusive, prior to adding basevertex".

So if the XCOM developers are looking at this bug report, then please fix this issue.  Even if it's not the whole story here, it is a bug in its own right, which can and will cause rendering issues depending on the OpenGL driver implementation.
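The rule quoted above can be checked mechanically on the application side. The following is a minimal sketch, assuming 16-bit indices held in client memory; `first_out_of_range` is a hypothetical helper name, not an OpenGL or Mesa function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the position of the first index outside [start, end],
 * or -1 if every index respects the declared range. */
static long first_out_of_range(const uint16_t *indices, size_t count,
                               uint32_t start, uint32_t end)
{
    assert(indices != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (indices[i] < start || indices[i] > end)
            return (long)i;
    }
    return -1;
}
```

A debug build could run such a check before each range-limited draw call and log the offending position, similar to what the apitrace report describes for index 216 against a declared range of 0..215.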
Comment 116 Marek Olšák 2016-02-25 12:43:26 UTC
This thread is too long. Could someone please summarize the issues here? Also, how does the apitrace crash relate to the GPU hang?
Comment 117 Kamil Páral 2016-03-02 16:35:27 UTC
(In reply to Marek Olšák from comment #116)
> This thread is too long. Could someone please summarize the issues here?
> Also, how does the apitrace crash relate to the GPU hang?

I hoped somebody clever would respond, but it seems it'll have to be me. OpenGL layman alert.

There were multiple issues discovered in this report:
0. (the core issue) radeonsi hangs the system completely (or sometimes recovers) when playing XCOM, randomly (can be minutes, can be hours)
1. Jose claims XCOM has a bug, issuing invalid OpenGL commands (see comment 115).
2. apitrace is crashing while replaying almost any XCOM trace. My trace from comment 74 is affected by this, so you need a temporary fix from https://github.com/apitrace/apitrace/issues/407#issuecomment-166619366 to stop it from crashing. The fix is not mainlined, because Jose says the purpose of apitrace is to help discover problems, not hide them.
3. Another discovered issue was that apitrace was trimming vertex data too aggressively (comment 113, https://github.com/apitrace/apitrace/issues/407#issuecomment-167866502). That is now fixed in apitrace, but my trace is affected, I'd have to re-record it. The problem is that I was really lucky that I recorded it in the first place, my computer did not hang completely as in 99% of cases, but recovered, and therefore the trace was not cut short. I don't think I'd get that lucky again. Plus Nicolai said he does not need that. The trimmed vertex data does not seem to affect the replay in a negative way, but I assume it might complicate the debugging process.
4. When looping over my trace, I can reproduce the crash pretty quickly (i.e. my computer completely hangs), but not deterministically (it does not happen on every replay). But it seems there is no further info I could supply from my side to help you debug this.
5. Nicolai said he can reproduce it (probably by running the game, not replaying my trace), but it takes a long time, so he wasn't able to work on this too much.
Comment 118 Edwin Smith (Feral Interactive) 2016-03-02 17:17:27 UTC
In summary I think it was intimated that the issue might be caused by how XCOM deals with indices.

===
The game is passing indices outside start..end range, which is illegal per https://www.opengl.org/sdk/docs/man/html/glDrawRangeElementsBaseVertex.xhtml "all values in the array indices must lie between start and end, inclusive, prior to adding base vertex"
===

The Mesa Intel driver and the closed-source AMD and Nvidia drivers deal with this gracefully by ignoring the range hint if it is invalid; RadeonSi, however, does not, and can in some cases crash.

XCOM was originally designed for DirectX on Windows, where this behaviour is not a fatal error, and the other OpenGL drivers on Linux and Mac do not throw an error or warning either. Mesa RadeonSi was not a supported driver at the time of the original port, so no one saw the issue and it was overlooked.

This has already been fixed for our more recent games, as the Mesa AMD drivers now support most of the features needed for many games, so they are actively used and tested, and bugs are logged, at Feral.

We don't have any plans for a patch in the short term, but we'll definitely backport this fix into XCOM when we next patch it, so that we match the spec correctly.
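To illustrate what "ignoring the range hint if it is invalid" could look like, here is a hedged C sketch. It does not reproduce the code of any of the drivers named above; `resolve_bounds`, `struct index_bounds` and the 16-bit index type are assumptions chosen for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive vertex bounds for one draw call. */
struct index_bounds {
    uint32_t min;
    uint32_t max;
};

/* Returns the application's hint [hint_start, hint_end] when every index
 * respects it; otherwise falls back to the bounds found by scanning.
 * *hint_was_valid reports which case applied. count must be > 0. */
static struct index_bounds resolve_bounds(const uint16_t *indices, size_t count,
                                          uint32_t hint_start, uint32_t hint_end,
                                          bool *hint_was_valid)
{
    assert(indices != NULL && count > 0 && hint_was_valid != NULL);
    struct index_bounds scanned = { indices[0], indices[0] };
    for (size_t i = 1; i < count; i++) {
        if (indices[i] < scanned.min)
            scanned.min = indices[i];
        if (indices[i] > scanned.max)
            scanned.max = indices[i];
    }
    *hint_was_valid = scanned.min >= hint_start && scanned.max <= hint_end;
    if (*hint_was_valid) {
        struct index_bounds hint = { hint_start, hint_end };
        return hint;
    }
    return scanned;
}
```

The scan costs a pass over the index data, which is exactly the work the range hint is meant to save, so a real driver would more plausibly do this only in a debug or validation mode.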
Comment 119 Nicolai Hähnle 2016-03-14 12:57:38 UTC
For what it's worth, I'd encourage people to update their graphics stack to latest everything (including kernel + X.Org server) and see if they still get lockups. I haven't been able to reproduce this anymore in my last two attempts - I don't know if I've just been lucky, but it might have been fixed randomly. Unfortunately, too much has changed on my system between reproduction attempts to be able to say exactly what might have fixed it.
Comment 120 Daniel Exner 2016-03-17 21:30:49 UTC
Using xorg 1.18.2, Kernel 4.5 and

OpenGL renderer string: Gallium 0.4 on AMD PITCAIRN (DRM 2.43.0, LLVM 3.8.0)
OpenGL core profile version string: 4.1 (Core Profile) Mesa 11.3.0-devel (git-84b961d)

I was able to play for about 2 hours without a crash.

But that worked in the past also, at least sometimes. I'll see if I find some more testing time.
Comment 121 Davin McCall 2016-03-18 18:35:42 UTC
On a 7790 with latest versions of just about everything - kernel 4.5, mesa git head (11.3.0-devel), llvm 3.8, XOrg 1.18.2 - I still see this crash regularly. I'm happy to provide any further information/logs/traces if necessary.
Comment 122 Daniel Exner 2016-03-19 20:56:11 UTC
Tested again, same setup as before: crashed after 5 minutes.
Comment 123 Nicolai Hähnle 2016-03-22 04:15:44 UTC
Thanks for the re-test. It's odd that I couldn't reproduce it any more. It may be that I was just lucky.

However, it's worth noting that Daniel has a GCN 1.0 card and Davin has a GCN 1.1 card, i.e. both are running on the radeon kernel module and DDX. It's possible that an amdgpu-only change has randomly fixed this for my Tonga-based test setup.
Comment 124 Daniel Exner 2016-03-23 10:47:55 UTC
Well, there is this (still) highly experimental amdgpu for Southern Islands branch. Perhaps I'll try that if I feel very lucky.
Comment 125 Vladislav Kamenev 2016-03-23 17:34:55 UTC
How can I get a dmesg log during a lockup?
I got this with the r600 driver (AMD Radeon HD 6650M, TURKS).
Comment 126 Nicolai Hähnle 2016-03-23 18:46:03 UTC
That's interesting, because r600 is a different user space OpenGL driver. It might be an interaction with the DDX or kernel though.

If you cannot ssh from a different computer, you can still recover the log from e.g. /var/log/kern.log (the exact location may depend on the distribution).
Comment 127 Davin McCall 2016-03-30 14:40:42 UTC
The comment above from Jose Fonseca (and follow up from Feral's Edwin Smith) imply that the problem here is with the game calling glDrawRangeElementsBaseVertex with bad start/end values, resulting in the indices being out-of-range which is illegal per the GL spec, and implying that this is what causes the crash.

I have tried patching Mesa to effectively reduce glDrawRangeElementsBaseVertex calls to glDrawElementsBaseVertex (i.e. same method but with no start/end supplied). Patch follows below (inline because it is so short). However, I am sorry to say that this did *not* prevent the crashes. I conclude that there may still be a bug in Mesa and/or kernel space DRM (although it's possible my patch isn't having the effect I intended - I'm not familiar enough with Mesa code base to be sure).


diff --git a/src/mesa/vbo/vbo_exec_array.c b/src/mesa/vbo/vbo_exec_array.c
index f0245fd..d1f4ac6 100644
--- a/src/mesa/vbo/vbo_exec_array.c
+++ b/src/mesa/vbo/vbo_exec_array.c
@@ -935,7 +935,9 @@ vbo_exec_DrawRangeElementsBaseVertex(GLenum mode,
    (void) check_draw_elements_data;
 #endif
 
-   vbo_validated_drawrangeelements(ctx, mode, index_bounds_valid, start, end,
+   //vbo_validated_drawrangeelements(ctx, mode, index_bounds_valid, start, end,
+   //			   count, type, indices, basevertex, 1, 0);
+   vbo_validated_drawrangeelements(ctx, mode, GL_FALSE, ~0, ~0,
 				   count, type, indices, basevertex, 1, 0);
 }
Comment 128 Edwin Smith (Feral Interactive) 2016-04-25 09:40:47 UTC
(In reply to Davin McCall from comment #127)
> The comment above from Jose Fonseca (and follow up from Feral's Edwin Smith)
> imply that the problem here is with the game calling
> glDrawRangeElementsBaseVertex with bad start/end values, resulting in the
> indices being out-of-range which is illegal per the GL spec, and implying
> that this is what causes the crash.

It might be useful to see what the Intel and/or r600 series drivers do, as neither of these drivers exhibits this crash in similar circumstances. The Intel Mesa driver might be best, as this driver was fully supported from release without any reports of hangs post release. It's possible a comparison might help expose why RadeonSi behaves differently and narrow down the cause of the hang, as it could be that the root cause is hiding behind a more benign issue.
Comment 129 Marek Olšák 2016-04-28 18:27:52 UTC
(In reply to Davin McCall from comment #127)
> diff --git a/src/mesa/vbo/vbo_exec_array.c b/src/mesa/vbo/vbo_exec_array.c
> index f0245fd..d1f4ac6 100644
> --- a/src/mesa/vbo/vbo_exec_array.c
> +++ b/src/mesa/vbo/vbo_exec_array.c
> @@ -935,7 +935,9 @@ vbo_exec_DrawRangeElementsBaseVertex(GLenum mode,
>     (void) check_draw_elements_data;
>  #endif
>  
> -   vbo_validated_drawrangeelements(ctx, mode, index_bounds_valid, start, end,
> +   //vbo_validated_drawrangeelements(ctx, mode, index_bounds_valid, start, end,
> +   //			   count, type, indices, basevertex, 1, 0);
> +   vbo_validated_drawrangeelements(ctx, mode, GL_FALSE, ~0, ~0,
>  				   count, type, indices, basevertex, 1, 0);
>  }

That's not correct, but it's close. If you want the driver to ignore the app-supplied index bounds, simply change "index_bounds_valid" to "false".
Comment 130 Marek Olšák 2016-04-28 18:28:42 UTC
Oh I see you did that. Sorry for the noise.
Comment 131 Davin McCall 2016-05-14 11:20:56 UTC
Edwin Smith:

> It might be useful to see what the Intel and/or r600 series drivers do, as neither of these drivers exhibits this crash in similar circumstances.

I think you missed the significance of my second paragraph above - I have established that the hang is _not_ caused by the mis-use of the glDrawRangeElementsBaseVertex() function. So comparing how the radeonsi and Intel/r600 drivers handle this function is not likely to help in resolving this issue.

Davin
Comment 132 Edwin Smith (Feral Interactive) 2016-06-06 07:51:27 UTC
(In reply to Davin McCall from comment #131)
> I think you missed the significance of my second paragraph above - I have
> established that the hang is _not_ caused by the mis-use of the
> glDrawRangeElementsBaseVertex() function. So comparing how the radeonsi and
> Intel/r600 drivers handle this function is not likely to help in resolving
> this issue.

OK, I'll leave this one with you but if you need anything else from Feral let us know.
Comment 133 Daniel Exner 2016-06-12 18:12:37 UTC
Created attachment 124486 [details]
crash

I once again tried it:

Kernel: 4.7.0-rc2-00342-g8714f8f
Mesa: Dev git 54f755f
Llvm: Dev cd22fc5 (using llvm git mirror)

And it crashed again. The whole screen froze, went black, returned, went black, returned, and then DPMS kicked in. I had to reset the box.

Something else I noticed: the GPU temperature as reported by lm_sensors went up to 69°C while gaming. Normal with an idle Plasma DE is 44°C.

(Not sure if relevant)
Comment 134 Nicolai Hähnle 2016-06-13 08:28:55 UTC
Hi Daniel, curious, but I doubt the crash is related to the lockup. Most likely, buffer creation fails in radeon_cs_create_fence and then we get a NULL pointer dereference. If you could get a backtrace with line numbers to confirm that would be nice.

In any case, GPU lockups can only be caused by actually submitting something to the GPU, which we obviously don't do once the game process crashes... so more likely, the GPU lockup happens first and then causes the subsequent failure somehow.
Comment 135 Alassane Maiga 2016-07-01 17:07:24 UTC
I had something interesting happen with kernel 4.7 rc5 and mesa 12.1 git1591e66.
The system would lock up, and the console would fill with the stalling ring messages. But then it would not stay locked up and would recover somewhat. The game wouldn't crash either, allowing me to exit it cleanly. Plasmashell would crash though, and there was a lot of video corruption on the desktop afterwards until the reboot. This happened two times in a row.
Comment 136 Daniel Exner 2016-07-03 20:48:20 UTC
I tried with:

* Kernel 4.7.0-rc5-00309-gdbdc3bb
* LLVM: 274446
* Mesa: 01ccb0d

This time I could play for a solid hour without crash. Will try again later to confirm.

@Devs: any idea if there is something that might have fixed this? Or just luck?
Comment 137 Nicolai Hähnle 2016-07-04 07:45:26 UTC
Nothing that I'm aware of. May have been just luck...
Comment 138 Daniel Exner 2016-07-04 17:57:29 UTC
Created attachment 124896 [details]
crash journal kernel 4.7.0-rc6

Seems you were right: it crashed again today.

I managed to log some of it; perhaps it's of some use.
Comment 139 Nicolai Hähnle 2016-07-04 19:16:59 UTC
Are you running with some kind of virtualization enabled? I'm not familiar with these AMD-Vi messages.
Comment 140 Jan Vesely 2016-07-04 19:57:12 UTC
(In reply to Nicolai Hähnle from comment #139)
> Are you running with some kind of virtualization enabled? I'm not familiar
> with these AMD-Vi messages.

Those are printed by the IOMMU driver (device accessed unmapped page). Not necessarily virtualization related.
Comment 141 Alassane Maiga 2016-07-10 02:07:24 UTC
Well I can reliably make it stop locking the system completely now. I switch to console with ctrl+alt+f2 as soon as the game locks up and then back as soon as the ring stall messages start to fill up the screen. When I am back the game is not frozen anymore, but extremely slow. In fact all opengl related rendering becomes slow. And I get this message each time I switch to console :
"GPU fault detected 146 0x0904xxxc
VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00029E48
VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x040xxxxC"

How should I give more information?

If I restart, the performance comes back to normal.
Comment 142 Marek Olšák 2016-08-01 11:42:47 UTC
I've seen plenty of GPU hangs with XCOM: Enemy Within. It's basically the same game with a little more content, but not much.

The reproducibility is random. The hangs usually happen after anywhere between 1 minute and 8 hours.

In my case, this is the data I've been able to obtain:
- Reproduced on everything I was testing on: Hawaii (radeon), Tonga, Polaris11 (amdgpu)
- It occurs with many different shaders, among which there are a few very simple ones. (no scratch or spills, a few ifs, no loops)
- Disabling HyperZ has no effect.
- Disabling CE has no effect.
- VM faults never occur.
- The hangs don't seem to have anything in common.

Action items:
- Reproduce the hang and do a hardware scan dump (it can only be done in the AMD office AFAIK), and send it to hardware teams.
Comment 143 Amarildo 2016-09-04 05:00:17 UTC
(In reply to Alassane Maiga from comment #141)
> Well I can reliably make it stop locking the system completely now. I switch
> to console with ctrl+alt+f2 as soon as the game locks up and then back as
> soon as the ring stall messages start to fill up the screen. When I am back
> the game is not frozen anymore, but extremely slow. In fact all opengl
> related rendering becomes slow. And I get this message each time I switch to
> console :
> "GPU fault detected 146 0x0904xxxc
> VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00029E48
> VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x040xxxxC"
> 
> How should I give more information?
> 
> If I restart the performance comes back to normal

I think this is similar to the effect of disabling DPM (with 'radeon.dpm=0'). I'm interested to know whether DPM has a hand in this issue or not.

Here's a "git diff" between si_dpm.c from kernel 3.16 and 4.7.2: http://pastebin.com/raw/UzZFfYgp

Perhaps there are some hints in there.
Comment 144 Amarildo 2016-09-04 05:03:08 UTC
BTW, Team Fortress 2 also has hangups, and I was able to play it for +40 minutes without issues with "radeon.dpm=0".
Comment 145 pandiculationfinch 2016-09-12 22:33:08 UTC
With radeon.dpm disabled I was stable for 70 minutes with Netflix and Stellaris running. Usually I crash within 10-20 minutes.
Comment 146 Daniel Exner 2016-12-08 13:11:13 UTC
(In reply to Marek Olšák from comment #142)
[..]

> In my case, this is the data I've been able to obtain:
> - Reproduced on everything I was testing on: Hawaii (radeon), Tonga,
> Polaris11 (amdgpu)
Out of curiosity: what about r600?
 
> Action items:
> - Reproduce the hang and do a hardware scan dump (it can only be done in the
> AMD office AFAIK), and send it to hardware teams.
Any news here?

My best bet is still something in either DPM or LLVM.
The DPM theory is supported by this card being one of those that need a DPM quirk in si_dpm.c.
Comment 148 Daniel Exner 2016-12-08 22:11:53 UTC
(In reply to Marek Olšák from comment #147)
> Possible fix:
> https://cgit.freedesktop.org/mesa/mesa/commit/?id=6dc96de303290e8d1fc294da478c4f370be98dea

I wonder how _that_ could have not been found earlier.

Anyway, played for about an hour without crash using:

OpenGL renderer string: Gallium 0.4 on AMD PITCAIRN (DRM 2.48.0 / 4.9.0-rc8-dirty, LLVM 4.0.0)
OpenGL core profile version string: 4.5 (Core Profile) Mesa 13.1.0-devel (git-31f988a9d6).

Will try to crash it again tomorrow, but looks promising.
Comment 149 Daniel Exner 2016-12-09 20:41:27 UTC
(In reply to Daniel Exner from comment #148)
> Will try to crash it again tomorrow, but looks promising.

Played about 3h straight today. More than the whole past year!

Guess this can finally be closed.

Marek, if I ever meet you in person I owe you some beer (or whatever you prefer :)
Comment 150 ArneJ 2017-01-07 20:38:32 UTC
I also suffered a lot from this issue on my R9 270X.
I was never able to go past the first tutorial mission because my PC hung during that mission.

Now I tried it again with mesa 13.0.3 and I was finally able to finish the first mission and also played 3 more missions after that without any issues.
It looks like this issue is finally resolved.

Thanks a lot Marek!
Comment 151 Marek Olšák 2017-01-07 20:58:27 UTC
OK. The CSO fix did it. Thanks for info. Closing.

