Description
Ryan Williams
2014-06-23 18:30:19 UTC
Please attach /var/log/Xorg.0.log and the output of dmesg and glxinfo.
> AMD HD7770 2GB
> Mesa 10.1.3
Which version of LLVM? Might be worth trying Mesa 10.2 or even current Git master if possible.
Created attachment 101631 [details]
X log
Created attachment 101632 [details]
Dmesg
Created attachment 101633 [details]
glxinfo
> Which version of LLVM? Might be worth trying Mesa 10.2 or even current Git
> master if possible.
I'm using llvm 3.4 (default Ubuntu). Will try Mesa 10.2 as soon as possible.
Tried it with the Oibaf PPA, LLVM 3.4.2 and Mesa git, and the issue is still there. Seemed to lock up even quicker this time.

Any chance you could try an LLVM 3.5 snapshot?

Also, if you could log in via ssh and grab dmesg after the problem occurs, that might be interesting.

I have been playing XCOM: Enemy Within for a few hours on an up-to-date x86_64 Fedora Rawhide. No lock-up yet, but slow as hell. Tahiti XT.

(In reply to comment #8)
> I have been playing XCOM: Enemy Within for a few hours on an up-to-date
> x86_64 Fedora Rawhide.
> No lock-up yet, but slow as hell. Tahiti XT.

I can reproduce the lockup with Mesa 10.2.1 with LLVM 3.4.2, and with Mesa git with LLVM 3.5svn, on a Radeon PITCAIRN with kernel 3.15.1 and kernel 3.16rc1 (Arch Linux x86_64).

I stand corrected: got the lockup and crash in the middle of a mission. Up-to-date x86_64 Fedora Rawhide, Tahiti XT.

(In reply to comment #7)
> Any chance you could try an LLVM 3.5 snapshot?
>
> Also, if you could log in via ssh and grab dmesg after the problem occurs,
> that might be interesting.

I don't have a second machine of my own set up to try it from, but I'll see what I can do. For now, Xorg.0.log.old seems to better match up with the time of the error, and shows a number of EQ overflow errors that aren't in Xorg.0.log.

Created attachment 101680 [details]
X log.old
(In reply to comment #11)
> For now, Xorg.0.log.old seems to better match up with the time of the error,
> and shows a number of EQ overflow errors that aren't in Xorg.0.log.

Those are just symptoms of the GPU hang; they don't say anything about its cause. The corresponding dmesg output should be available in /var/log/kern.log* as well.

Also, I wonder if this is reproducible enough to create an apitrace reproducing it?

Created attachment 101844 [details]
Dmesg 2
SSH'd in and tried dmesg like you said; output attached. I've been trying apitrace, but the game depends on Steam and uses a separate launcher to start the game, and I can't figure out how to get a proper trace because of it. Launching what looks like the game binary directly (game.x86_64) still brings up the launcher.

Just this morning I was able to play without incident for ~1 hour for the first time, then it locked up almost immediately in a mission again after restarting.

(In reply to comment #15)
> SSH'd in and tried dmesg like you said; output attached.

Thanks, but this doesn't have anything about a GPU lockup or anything like that.

> Thanks, but this doesn't have anything about a GPU lockup or anything like that.

I looked through it and figured as much, but attached it anyway in case I just wasn't looking for the right thing. It's the output given when the game locked up the display, so I don't know what else to do besides apitrace.
In Steam, you can select the game, right-click on it, select Properties, then Launch Options, and put:

apitrace %command%

You can run the game with gdb, strace, etc. using this option. It will output to the console, so start Steam from a terminal if you need to see the output.

(In reply to comment #18)
> In Steam, you can select the game, right-click on it, select Properties,
> then Launch Options, and put:
>
> apitrace %command%
>
> You can run the game with gdb, strace, etc. using this option. It will
> output to the console, so start Steam from a terminal if you need to see
> the output.

Thanks, but the game simply segfaults when I do this:

Dumped crashlog to /home/ryan/.local/share/feral-interactive/XCOM/crashes//50524115-5e25-7dad-395afa0f-6f10b83e.dmp
/home/ryan/.local/share/Steam/SteamApps/common/XCom-Enemy-Unknown/binaries/linux/xcom.sh: line 39: 9211 Segmentation fault (core dumped) ${DEBUGGER} "${GAMEBINARY}" $@
Game removed: AppID 200510 "XCOM: Enemy Unknown", ProcID 9211

Feral have mentioned to me that they're willing to give Mesa/RadeonSI devs Steam keys to help find and fix the issue; I just need to email them (http://steamcommunity.com/app/200510/discussions/0/648811852226640080/#c522730701427327317). If you guys are willing, I'll do that.

My name is Edwin Smith and I work for Feral Interactive. We don't support the Mesa drivers due to the stability issues compared to the closed source drivers; however, we can pass on any crash logs and, if it helps, some complimentary XCOM keys to the members of the driver team to help with the debugging effort. The crash looks like a complete GPU hang that locks up the entire card. If it helps, I can look into getting exact instructions on how to attach apitrace to the Steam release.

(In reply to comment #20)
> The crash looks like a complete GPU hang that locks up the entire card. If
> it helps I can look into getting exact instructions on how to attach
> apitrace to the Steam release.

That would certainly help. Thanks for taking the time to do this.

To use apitrace on XCOM, follow these instructions:
1. In the Steam client library list, right-click the game
2. Select "Properties"
3. Switch to the "GENERAL" tab
4. Press "SET LAUNCH OPTIONS..."
5. Put this in the text box: DEBUGGER="apitrace trace" %command%
6. Press OK
7. Close the Properties window
8. Hit Play
9. Select the game to test
10. Find the trace file in the Steam library under common/XCom-Enemy-Unknown/game.x86_64.trace or common/XCom-Enemy-Unknown/xew/game.x86_64.trace, depending on which game was launched.

We've also seen the GPU hang using fglrx, but haven't reproduced a GPU hang with Intel graphics.

(In reply to comment #22)
> To use apitrace on XCOM, follow these instructions: [...]
>
> We've also seen the GPU hang using fglrx, but haven't reproduced a GPU hang
> with Intel graphics.

Thanks, but again it won't work.
This time it complains about missing libsteam_api.so with the message:

/home/ryan/.local/share/Steam/SteamApps/common/XCom-Enemy-Unknown/binaries/linux/../../binaries/linux/game.x86_64: error while loading shared libraries: libsteam_api.so: cannot open shared object file: No such file or directory

Appears to be something going on with xcom.sh?

(In reply to comment #23)
> Thanks, but again it won't work. This time it complains about missing
> libsteam_api.so

It's possible that running through apitrace is somehow losing the LD_LIBRARY_PATH working variable or the working directory. You can attach it with LD_PRELOAD, which should prevent this. Set the launch options to something like this:

LD_PRELOAD=/usr/local/lib/apitrace/wrappers/glxtrace.so:$LD_PRELOAD %command%

Adjust the path to the x86_64 glxtrace.so if necessary.

(In reply to comment #24)
> It's possible running through apitrace is somehow losing the LD_LIBRARY_PATH
> working variable

*Environment variable even.

(In reply to comment #24)
> It's possible running through apitrace is somehow losing the LD_LIBRARY_PATH
> working variable or working directory. You can attach it with LD_PRELOAD
> which should prevent this. Set the launch options to something like this:
> LD_PRELOAD=/usr/local/lib/apitrace/wrappers/glxtrace.so:$LD_PRELOAD %command%
> Adjust the path to the x86_64 glxtrace.so if necessary.

And now it segfaults again, just like before:

apitrace: redirecting dlopen("libGL.so.1", 0x102)
apitrace: tracing to /home/ryan/.local/share/Steam/SteamApps/common/XCom-Enemy-Unknown/game.x86_64.trace
Dumped crashlog to /home/ryan/.local/share/feral-interactive/XCOM/crashes//7f31c2e8-309f-1dae-547b3bc4-458a6b0d.dmp
/home/ryan/.local/share/Steam/SteamApps/common/XCom-Enemy-Unknown/binaries/linux/xcom.sh: line 39: 6587 Segmentation fault (core dumped) ${DEBUGGER} "${GAMEBINARY}" $@
Game removed: AppID 200510 "XCOM: Enemy Unknown", ProcID 6576

(In reply to comment #26)
> And now it segfaults again, just like before: [...]

In the ~/.local/share/Steam/SteamApps/common/XCom-Enemy-Unknown/ directory:
* make a backup of the xcom.sh file, then edit the file
* at line 79, change the line: eval "$GAMESCRIPT" $@ into: apitrace trace "$GAMESCRIPT" $@
* then save and launch the game; now the trace is properly generated

(In reply to comment #27)
> In ~/.local/share/Steam/SteamApps/common/XCom-Enemy-Unknown/ directory
> * make a backup of the xcom.sh file, then edit the file
> * at line 79, change the line: eval "$GAMESCRIPT" $@
>   into: apitrace trace "$GAMESCRIPT" $@
> * then save and launch the game; now the trace is properly generated

It continues to segfault immediately on starting, just after the initial XCOM splash screen, though there's now output from apitrace in the terminal and a trace. Is this just something wrong with my system? I've checked the game cache several times already to make sure there are no corrupted files, so there's nothing wrong there.

Created attachment 102029 [details]
XCOM apitrace Segfault Output
(In reply to comment #28)
> Continues to segfault immediately on starting, just after the initial XCOM
> splash screen, though there's now output from apitrace in the terminal and a
> trace. Is this just something wrong with my system? I've checked the game
> cache several times already to make sure there are no corrupted files, so
> there's nothing wrong there.

Have you tried with apitrace built from git?

(In reply to comment #30)
> Have you tried with apitrace built from git?

Git failed to build, but 5.0 built fine and worked! It's a 3.2 GB trace though; I will attach it once it's finished trimming.

FYI, I have been running XCOM happily using Mesa with Sandy Bridge (2500K, GPU slightly overclocked to 1.4). I think this just affects Radeon.

Hi, I tried to make a trace too, but I failed to build apitrace from git, and the version in Ubuntu 14.04, which is 3.0, segfaults with XCOM. :(

Does it help instead if I use R600_DUMP_SHADERS=1?

If yes, can someone explain how to redirect the output to a file, because >> doesn't work.

(In reply to comment #33)
> Does it help instead if I use R600_DUMP_SHADERS=1?

I'm afraid not.

(In reply to comment #33)
> Hi, I tried to make a trace too, but I failed to build apitrace from git,
> and the version in Ubuntu 14.04, which is 3.0, segfaults with XCOM. :(
>
> Does it help instead if I use R600_DUMP_SHADERS=1?
>
> If yes, can someone explain how to redirect the output to a file, because
> >> doesn't work.

Compile apitrace 5.0 from here: https://github.com/apitrace/apitrace/releases

(In reply to comment #35)
> Compile apitrace 5.0 from here:
>
> https://github.com/apitrace/apitrace/releases

Thanks, it works! I have put the file here: http://dl.free.fr/htMkkwBOy

I use a Radeon HD 4870 with r600g on Ubuntu 14.04, kernel 3.15 and the Oibaf PPA. The symptoms are that the game crashes and returns to the desktop; more details here: https://bugs.freedesktop.org/show_bug.cgi?id=80618

(In reply to comment #36)
> I have put the file here http://dl.free.fr/htMkkwBOy

That runs into what looks like bug 80673 to me.

I'm getting a related issue at bug 80922.

Created attachment 103904 [details]
dmesg xcom after lock-ups
ArchLinux x86-64; linux 3.16rc6; mesa git; llvm svn; Radeon HD 7950
Lock-ups while playing a tactical mission. After waiting several seconds, the game is playable again.
*** Bug 85334 has been marked as a duplicate of this bug. ***

*** Bug 81576 has been marked as a duplicate of this bug. ***

Created attachment 113261 [details]
dmesg during lockup
My system completely freezes while playing XCOM and I have to use SysRQ to reboot. Waiting does not help.
Radeon R9 270, Fedora 21, kernel-3.18.6-200.fc21.x86_64, mesa-dri-drivers-10.4.3-1.20150124.fc21.x86_64, xorg-x11-drv-ati-7.5.0-1.fc21.x86_64, llvm-libs-3.5.0-6.fc21.x86_64.
I would suggest updating to an LLVM 3.6-enabled Mesa (and even git LLVM 3.7). I had suffered from this lockup bug, but the game has been pretty stable lately.

I'm having the same lockup issue. About 15 minutes into a mission the game freezes, but I can move the mouse and can hear the music. After about 10 seconds of waiting the screen turns black and comes back, but now the mouse is frozen, there is no music, and the system is completely frozen.

AMD Radeon HD 7870 GHz Edition, Arch Linux x64, mesa 10.5.1-2, kernel 3.19.2-1-ARCH x86_64, xf86-video-ati 1:7.5.0-2, llvm-libs 3.6.0-3.

Created attachment 114671 [details]
dmesg during lockup 2
dmesg added
Tested on Ubuntu 15.04 with the Oibaf PPA, and the same thing happens...

mesa 10.6~git1504011930.5604d7~gd~v
xserver-xorg-video-ati 1:7.5.0-1ubuntu2
libllvm3.6 1:3.6-2ubuntu1
kernel 3.19.0.11.10

The saddest part is that the game runs much smoother with the Mesa driver than with fglrx.

Please test; I cannot reproduce the lockup with mesa-git 69411.05a1d84 and llvm-libs-svn 234894 (played 20 minutes).

Created attachment 115526 [details]
dmesg for Kaveri
Slightly different dmesg from an A10-7850K system, involving the IOMMU. Maybe because of the HSA feature?
Mesa: 10.5.4
Llvm: 3.6.0
Kernel: 4.0.0
Hello, I created an account because 2 days ago I was having the same issue and wanted to provide my dmesg. But when I tried to reproduce the issue yesterday, I couldn't... I played for hours and the performance was actually quite good. I am on Fedora 22 with KDE 5 now, and I did perform some updates before trying again... The other thing I did was disable the option to suspend compositing for full-screen windows. Not sure what fixed it, though.

Created attachment 116166 [details]
dmesg output
I was able to reproduce the bug. It happens much less frequently but it still does. My apitrace is huge though. How can I trim/compress it efficiently?
Here is my lspci -v output. I also included my dmesg:
05:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cape Verde XT [Radeon HD 7770/8760 / R7 250X] (prog-if 00 [VGA controller])
Subsystem: Diamond Multimedia Systems Device 7770
Flags: bus master, fast devsel, latency 0, IRQ 40
Memory at d0000000 (64-bit, prefetchable) [size=256M]
Memory at fbc80000 (64-bit, non-prefetchable) [size=256K]
I/O ports at c000 [size=256]
Expansion ROM at fbcc0000 [disabled] [size=128K]
Capabilities: [48] Vendor Specific Information: Len=08 <?>
Capabilities: [50] Power Management version 3
Capabilities: [58] Express Legacy Endpoint, MSI 00
Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
Capabilities: [150] Advanced Error Reporting
Capabilities: [270] #19
Kernel driver in use: radeon
Kernel modules: radeon
Created attachment 116529 [details]
Syslog excerpt showing the GPU stall and kernel backtrace

I can report a "me too". According to the syslog (see the attached file for more details), the crash happens in:

drivers/gpu/drm/radeon/radeon_object.c:84 radeon_ttm_bo_destroy+0x86/0x9d

with the stack detailed below. Let me know if you need something else to debug this.

My current stack (Debian testing as a base):
GPU: Hawaii PRO [Radeon R9 290] (ChipID = 0x67b1)
Mesa: Git:master/4d35eef326
libdrm: 2.4.60-3
LLVM: SVN:trunk/r239668 (3.7 devel)
X.Org: 2:1.17.1-2
Linux: 4.0.5
Firmware: <https://secure.freedesktop.org/~agd5f/radeon_ucode/hawaii/>
> 286640da3d90d7b51bdb038b65addc47 hawaii_ce.bin
> 161105a73f7dfb2fca513327491c32d6 hawaii_mc.bin
> d6195059ea724981c9acd3abd6ee5166 hawaii_me.bin
> ad511d31a4fe3147c8d80b8f6770b8d5 hawaii_mec.bin
> 63eae3f33c77aadbc6ed1a09a2aed81e hawaii_pfp.bin
> 5b72c73acf0cbd0cbb639302f65bc7dc hawaii_rlc.bin
> f00de91c24b3520197e1ddb85d99c34a hawaii_sdma1.bin
> 8e16f749d62b150d0d1f580d71bc4348 hawaii_sdma.bin
> 7b6ca5302b56bd35bf52804919d57e63 hawaii_smc.bin
> 9f2ba7e720e2af4d7605a9a4fd903513 hawaii_uvd.bin
> b0f2a043e72fbf265b2f858b8ddbdb09 hawaii_vce.bin
libclc: Git:master/5cd2688a9f
DDX: Git:master/d7c82731a8

*** Bug 80922 has been marked as a duplicate of this bug. ***

As written in bug 80922, comment 2, I can't seem to trigger this any longer with the stack detailed there. But it's probably best if others report in as well and I give it a bit more than 1.5 h to happen.

I am sorry to inform you that I still see this bug.
Radeon R270X
Mesa 11.0.0-devel git-5f247a9
xf86-video-ati git-09c7cdb
llvm svn-243977

This bug is still happening. Sometimes I can play 2 hours without a crash, sometimes 5 minutes.
Mesa 11.1.0-devel
kernel 4.2
llvm 3.8

Hi, I have an apitrace that hopefully shows the problem. It's 17 GB at the moment -- if someone would like to give me some pointers on cutting it down or where to upload it, then I can do that.

OpenGL core profile version string: 3.3 (Core Profile) Mesa 11.1.0-devel (git-511a863 2015-09-26 vivid-oibaf-ppa)

I'll continue testing and may generate further traces.

(In reply to David Beswick from comment #56)
> Hi, I have an apitrace that hopefully shows the problem.

Hopefully? Does it reproduce the problem for you?

I didn't think to try replaying the trace, as I assumed it would have the necessary data in it, apologies. I've done so now, but the system didn't hang. Is it possible that the last part of the trace never makes it into the file because the system locks up? If I switch to another TTY I do see "ring 0 stalled" as in the syslog excerpt attached to the bug. Anything else besides Alt+SysRq+REISUB that may help to capture the problematic part? Otherwise, I'll continue gathering traces and will let you know if I get one that a replay can trigger.

Just to update, I've captured three different traces, but none have been able to reproduce the problem on replay. I've also tried the following:
* Looping a trace replay over 24 hours continuously -- no repro
* Running with a -O0 Mesa build -- hang remains
* Going directly to fallback in all cases during si_dma_copy (wild guess based on code comments) -- hang remains

I don't think traces will be a fruitful method of debugging, unless someone can suggest something I'm doing wrong. I'm continuing to look at this. If anyone has a hypothesis and would like to send a patch, then I could build and test with it.
Forgot to add that I also tried replaying traces via Steam, in case the Steam overlay somehow had something to do with it. It doesn't seem to help, as I can't reproduce the problem via a trace that way either.

@David, have you tried with vsync on and off? I find that if I disable the SO sync and just leave the in-game one, the hangup takes much longer to occur; might be worth a shot?

Paulo, what's "the SO sync"?

Sorry, my bad, it was a typo; I meant the OS sync, like for example the KF5 (KDE 5.4) vsync setting.

FWIW, I noticed the game seems to freeze for a second when playing on Windows, and then comes back online. Could it be related? Maybe the game freezes the GPU on Windows too, but the Windows driver succeeds during the resume operation?

Hello everyone,

Just to keep you up to date, I switched tack and have been using the "GALLIUM_DDEBUG" environment variable to try and capture data about the crash. I found out about this option via a Google search. As I understand it, it creates fences around each GPU operation to detect when and if they complete, dumping the contents of the draw call if it doesn't finish in a timely way.

Unfortunately, I haven't been able to reproduce the crash with this variable enabled, despite playing for more than 5 hours (not consecutively). The GALLIUM_DDEBUG mode is certainly working, as performance is quite severely impacted. Maybe the way GALLIUM_DDEBUG is implemented unfortunately also prevents the issue from happening.

All I can say so far is that I suspect the problem is related to vertex buffer drawing. On one occasion I disabled the fences around vertex buffer drawing while enabling GALLIUM_DDEBUG (to try and get some more performance) and I did experience a hard lock as usual. I will continue running in this mode to see if it turns up a result.

If anyone else would like to try, you can modify the "~/.steam/steamapps/common/XCom-Enemy-Unknown/binaries/linux/xcom.sh" file. On a line before the call to "${GAMEBINARY}", write "export GALLIUM_DDEBUG=800". You can probably also set this environment variable before running Steam.

Thank you Paulo for your suggestion; I will try that if I have time, to see if it affects the frequency of the crashes.

The commit I tested with was 55365a7ad50c2e4547f58995a8e3411d8f2b00a2

Hi David, thank you for your efforts! Note that GALLIUM_DDEBUG=help explains a "noflush" option that you can also use. In any case, can you post the resulting file from ~/ddebug_dumps/?

Also, next time you do this experiment, please run with R600_DEBUG=ps,vs,gs,vm and post the output in addition to the ~/ddebug_dumps/.

Though frankly, the best chance of getting this fixed would be a way to reliably reproduce it on a developer's machine. It's a pity the apitraces seem to be unreliable.

Created attachment 120587 [details]
Output of GALLIUM_DDEBUG="800 noflush" and R600_DEBUG="ps,vs,gs,vm"
Thank you Nicolai, I was able to reproduce the hang using the "noflush" option. DDEBUG seemed to detect the hang and killed the process, but my machine still locked up as usual.

Thank you for the follow-up. Apparently, even the very first tracepoint is not processed. This is weird.

Created attachment 120594 [details]
kernel messages from xcom hang

I'd like to help with resolving this. I added GALLIUM_DDEBUG=800 and checked using the environ file that it is applied. Interestingly, I see no performance degradation (am I doing something wrong?). I played XCOM for about 3 hours (which seems to be considerably longer than usual), then it got stuck. The screen went black for 10 seconds, then the image returned, and I could move the mouse pointer, but do nothing else. I was able to switch to tty3, but when I switched back, it froze completely, and I had to use SysRq to reboot. The hang is visible in the kernel messages, which are attached.

There is no /home/$username/dd_dumps/ directory. I wonder whether the GALLIUM_DDEBUG variable is having any effect? Also, I can't find any documentation for it; the only thing I was able to find is this: https://patchwork.freedesktop.org/patch/57799/

Inspired by comment 66, I tried GALLIUM_DDEBUG=help (like "GALLIUM_DDEBUG=help glxgears"), but nothing is written to stdout.

01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Curacao PRO [Radeon R7 370 / R9 270/370 OEM] [1002:6811]
kernel-4.2.8-300.fc23.x86_64
mesa-dri-drivers-11.0.6-1.20151122.fc23.x86_64
xorg-x11-drv-ati-7.6.0-0.4.20150729git5510cd6.fc23.x86_64
xorg-x11-server-Xorg-1.18.0-2.fc23.x86_64
llvm-libs-3.7.0-1.fc23.x86_64
Fedora 23

Hi Kamil, your version of Mesa is too old: it does not contain the GALLIUM_DDEBUG feature yet. At this point, the most helpful thing for you to do would be to reproduce this with the latest development versions (i.e. Git/SVN master) of Mesa and LLVM, and see if you can get an apitrace which reliably reproduces the lockup.

A Steam key has been given to Nicolai Hähnle from Feral to help with his investigations into the crash.

I have managed to capture an apitrace while the radeon driver locked up *and* recovered afterwards, so that I was able to exit the game normally (this almost never happens). I was hopeful that in this case it could contain all the calls triggering the lockup, compared to the case where the system locks up completely and does not recover.

I've run into issues replaying the trace; apitrace seems to crash for any XCOM trace I capture. I reported it here: https://github.com/apitrace/apitrace/issues/407

Nevertheless, after many attempts I managed to replay the trace twice in full length, and it did not lock up my system again, nor was the lockup visible in the replay itself (in reality my system was locked up for about 10 seconds, but in the replay it plays uninterrupted). So it seems that apitrace is not something that can be used reliably to reproduce this issue, unfortunately.

Created attachment 120647 [details]
system journal containing gpu hang during apitrace replay

And according to Murphy's law, after I posted my previous comment, I replayed the trace once again and it crashed my computer. Voila! (I can't say whether it happened at the exact point of the replay as the original recorded hang, because I wasn't looking at it, but it happened.)
The trace file is here (1.5 GB compressed, recorded hang at the very end right before quitting the game): https://drive.google.com/file/d/0B0Opr_geiK5nWUMwVEhJVnZPR2s/view?usp=sharing

I attach my system journal related to this "replay crash". One thing caught my interest. Look at this Xorg backtrace:

(EE) Backtrace:
(EE) 0: /usr/libexec/Xorg (OsLookupColor+0x139) [0x59afb9]
(EE) 1: /lib64/libc.so.6 (__restore_rt+0x0) [0x7fe7ee2ebb1f]
(EE) 2: /lib64/libc.so.6 (__memcpy_avx_unaligned+0x1ab) [0x7fe7ee3ffafb]
(EE) 3: /usr/lib64/dri/radeonsi_dri.so (__driDriverGetExtensions_vmwgfx+0x108dbe) [0x7fe7e6c33efe]
(EE) 4: /usr/lib64/dri/radeonsi_dri.so (__driDriverGetExtensions_vmwgfx+0x1091a3) [0x7fe7e6c345a3]
(EE) 5: /usr/lib64/dri/radeonsi_dri.so (__driDriverGetExtensions_vmwgfx+0x109c02) [0x7fe7e6c35a42]
(EE) 6: /usr/lib64/dri/radeonsi_dri.so (__driDriverGetExtensions_vmwgfx+0x163098) [0x7fe7e6ce84f8]
(EE) 7: /usr/lib64/dri/radeonsi_dri.so (__driDriverGetExtensions_vmwgfx+0xfacb7) [0x7fe7e6c17d27]
(EE) 8: /usr/lib64/dri/radeonsi_dri.so (__driDriverGetExtensions_vmwgfx+0xfaf23) [0x7fe7e6c181b3]
(EE) 9: /usr/lib64/dri/radeonsi_dri.so (__driDriverGetExtensions_vmwgfx+0xfb358) [0x7fe7e6c18b48]
(EE) 10: /usr/lib64/xorg/modules/libglamoregl.so (glamor_create_gc+0x168d9) [0x7fe7e81b0b69]
(EE) 11: /usr/lib64/xorg/modules/libglamoregl.so (glamor_create_gc+0x175d2) [0x7fe7e81b2372]
(EE) 12: /usr/lib64/xorg/modules/libglamoregl.so (glamor_create_gc+0x4237) [0x7fe7e818bf67]
(EE) 13: /usr/lib64/xorg/modules/libglamoregl.so (glamor_create_gc+0x80a3) [0x7fe7e8193be3]
(EE) 14: /usr/lib64/xorg/modules/libglamoregl.so (glamor_create_gc+0x9ec9) [0x7fe7e8197529]
(EE) 15: /usr/libexec/Xorg (DamageRegionAppend+0x621) [0x51eeb1]
(EE) 16: /usr/lib64/xorg/modules/libglamoregl.so (glamor_create_gc+0x1108a) [0x7fe7e81a5f7a]
(EE) 17: /usr/libexec/Xorg (AddTraps+0x4cf2) [0x519d82]
(EE) 18: /usr/libexec/Xorg (SendErrorToClient+0x2df) [0x4369bf]
(EE) 19: /usr/libexec/Xorg (remove_fs_handlers+0x453) [0x43a9e3]
(EE) 20: /lib64/libc.so.6 (__libc_start_main+0xf0) [0x7fe7ee2d7580]
(EE) 21: /usr/libexec/Xorg (_start+0x29) [0x424ce9]
(EE) 22: ? (?+0x29) [0x29]
(EE)
(EE) Bus error at address 0x7fe7e8b71000
(EE) Fatal server error:
(EE) Caught signal 7 (Bus error). Server aborting

There's this line:

(EE) 2: /lib64/libc.so.6 (__memcpy_avx_unaligned+0x1ab) [0x7fe7ee3ffafb]

which is the same function that I reported to be crashing in apitrace: https://github.com/apitrace/apitrace/issues/407

Is this just a coincidence, or are these two bugs related (or the very same)?

(I'm sorry, I still have the same month-old Mesa as in comment 70; I didn't figure out how to update it easily before I started tinkering with the replays.)

(In reply to Kamil Páral from comment #74)
> There's this line:
> (EE) 2: /lib64/libc.so.6 (__memcpy_avx_unaligned+0x1ab) [0x7fe7ee3ffafb]
> which is the same function that I reported to be crashing in apitrace:
> https://github.com/apitrace/apitrace/issues/407
>
> Is this just a coincidence, or are these two bugs related (or the very same)?

Presumably coincidence. The apitrace crashes are segmentation faults, i.e. probably due to overrunning some buffer[0]. The Xorg crash is a bus error, which is probably fallout of the GPU hang.

That said, we could probably confirm either way if we could look at a gdb backtrace of the Xorg crash.
[0] FWIW, replaying the apitrace with valgrind on llvmpipe, I'm also seeing invalid memory access, so it might be a bug in apitrace or some shared Gallium / Mesa code rather than in the radeonsi driver. > That said, we could probably confirm either way if we could look at a gdb backtrace of the Xorg crash.
Unfortunately I don't have it. ABRT deleted it due to some unfortunate circumstances. I could try to reproduce it by looping the replay over, if needed.
(In reply to Kamil Páral from comment #76) > I could try to reproduce it by looping the replay over, if needed. No need, it's not important. The assumption for now is that the Xorg crash is caused by the GPU hang. If you happen to get a gdb backtrace of it in the future, that will help verify this assumption, but it's no more than nice to have. I tried to replay the trace attached: > glretrace game.x86_64.trace But all I get is this (many times repeated): >10536: message: major api error 1: GL_INVALID_ENUM in glCompressedTexImage2D(internalFormat=0x8c4d) >10536 @1 glCompressedTexImage2DARB(target = GL_TEXTURE_2D, level = 0, >internalformat = GL_COMPRESSED_SRGB_ALPHA_S3TC_DXT1_EXT, width = 64, height = >64, border = 0, imageSize = 2048, data = blob(2048)) >10536: warning: glGetError(glCompressedTexImage2DARB) = GL_INVALID_ENUM Mesa: OpenGL core profile version string: 4.1 (Core Profile) Mesa 11.1.0 Hardware: OpenGL renderer string: Gallium 0.4 on AMD PITCAIRN (DRM 2.43.0, LLVM 3.7.0) Kernel: Linux 4.4.0-rc6-00005-g9d951f9 #4 SMP PREEMPT Mon Dec 21 18:33:34 CET 2015 x86_64 x86_64 x86_64 GNU/Linux Will try now with llvmpipe (In reply to Daniel Exner from comment #78) > >10536: message: major api error 1: GL_INVALID_ENUM in glCompressedTexImage2D(internalFormat=0x8c4d) > >10536 @1 glCompressedTexImage2DARB(target = GL_TEXTURE_2D, level = 0, >internalformat = GL_COMPRESSED_SRGB_ALPHA_S3TC_DXT1_EXT, width = 64, height = >64, border = 0, imageSize = 2048, data = blob(2048)) > >10536: warning: glGetError(glCompressedTexImage2DARB) = GL_INVALID_ENUM Looks like you may be missing GL_EXT_texture_compression_s3tc. Do you have libtxc-dxtn(-s2tc) packages installed? Yes, you where right. Some guy at my distro removed it without telling anyone. Is back now. Now I get (with radeonsi): glretrace game.x86_64.trace apitrace: warning: caught signal 11 47062: error: caught an unhandled exception glretrace+0x28d196 glretrace+0x28c92c glretrace+0x289ccd /lib/libpthread.so.0+0x10d3f /usr/lib/libGL.so.1+0x48945 glretrace+0x2e6e0 glretrace+0x405d6 glretrace+0xffca0 glretrace+0x3a4a4 glretrace+0x2fb35 glretrace+0x3183e glretrace+0x317a7 glretrace+0x2fc39 glretrace+0x33eb1 glretrace+0x33630 /lib/libpthread.so.0+0x7483 /lib/libc.so.6: clone+0x6c ? apitrace: info: taking default action for signal 11 Don't get this error with llvmpipe. But also no crash. (Still running, slow as hell). (In reply to Kamil Páral from comment #74) > > (I'm sorry, I still have the same month-old mesa as in comment 70, I didn't > figure out how to update it easily before I started tinkering with the > replays). 
Hello, On Fedora 23 I'm using this copr: https://copr.fedoraproject.org/coprs/griever/mesa-git/ It provides the following packages as of today: [root@mike-laptop mike]# rpm -qa | grep mesa mesa-filesystem-11.2.0-0.devel.22.ea8c0b1.fc23.x86_64 mesa-libGLES-11.2.0-0.devel.22.ea8c0b1.fc23.x86_64 mesa-libOSMesa-11.2.0-0.devel.22.ea8c0b1.fc23.x86_64 mesa-libgbm-11.2.0-0.devel.22.ea8c0b1.fc23.i686 mesa-libxatracker-11.2.0-0.devel.22.ea8c0b1.fc23.x86_64 mesa-libEGL-11.2.0-0.devel.22.ea8c0b1.fc23.i686 mesa-libGLU-9.0.0-9.fc23.x86_64 mesa-filesystem-11.2.0-0.devel.22.ea8c0b1.fc23.i686 mesa-libglapi-11.2.0-0.devel.22.ea8c0b1.fc23.x86_64 mesa-libglapi-11.2.0-0.devel.22.ea8c0b1.fc23.i686 mesa-libgbm-11.2.0-0.devel.22.ea8c0b1.fc23.x86_64 mesa-libGL-11.2.0-0.devel.22.ea8c0b1.fc23.i686 mesa-dri-drivers-11.2.0-0.devel.22.ea8c0b1.fc23.x86_64 mesa-libGL-11.2.0-0.devel.22.ea8c0b1.fc23.x86_64 mesa-dri-drivers-11.2.0-0.devel.22.ea8c0b1.fc23.i686 mesa-libEGL-11.2.0-0.devel.22.ea8c0b1.fc23.x86_64 mesa-libwayland-egl-11.2.0-0.devel.22.ea8c0b1.fc23.x86_64 Hope it helps. Managed to get some more infos about glretrace crash: Stack trace of thread 13986: #0 0x00007fca93b67945 loader_dri3_wait_gl (libGL.so.1) #1 0x000000000042e6e1 _ZN4glws11GlxDrawable6resizeEii (glretrace) #2 0x00000000004405d7 _ZN9glretrace14updateDrawableEii (glretrace) #3 0x00000000004ffca1 _ZL25retrace_glBlitFramebufferRN5trace4CallE (glretrace) #4 0x000000000043a4a5 _ZN7retrace8Retracer7retraceERN5trace4CallE (glretrace) #5 0x000000000042fb36 _ZN7retraceL11retraceCallEPN5trace4CallE (glretrace) #6 0x000000000043183f _ZN7retrace11RelayRunner6runLegEPN5trace4CallE (glretrace) #7 0x00000000004317a8 _ZN7retrace11RelayRunner7runRaceEv (glretrace) #8 0x000000000042fc3a _ZN7retrace11RelayRunner12runnerThreadEPS0_ (glretrace) #9 0x0000000000433eb2 _ZN2os6thread13CallbackParamIFvPN7retrace11RelayRunnerEES4_EclEv (glretrace) #10 0x0000000000433631 _ZN2os6thread9_callbackINS0_13CallbackParamIFvPN7retrace11RelayRunnerEES5_EEEEPvS8_ (glretrace) #11 0x00007fca95856484 start_thread (libpthread.so.0) #12 0x00007fca944bcaed __clone (libc.so.6) Stack trace of thread 13984: #0 0x00007fca9585c05f pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0) #1 0x0000000000430be1 _ZN2os18condition_variable4waitERNS_11unique_lockINS_5mutexEEE (glretrace) #2 0x0000000000431749 _ZN7retrace11RelayRunner7runRaceEv (glretrace) #3 0x000000000042fc3a _ZN7retrace11RelayRunner12runnerThreadEPS0_ (glretrace) #4 0x0000000000433eb2 _ZN2os6thread13CallbackParamIFvPN7retrace11RelayRunnerEES4_EclEv (glretrace) #5 0x0000000000433631 _ZN2os6thread9_callbackINS0_13CallbackParamIFvPN7retrace11RelayRunnerEES5_EEEEPvS8_ (glretrace) #6 0x00007fca95856484 start_thread (libpthread.so.0) #7 0x00007fca944bcaed __clone (libc.so.6) Stack trace of thread 13983: #0 0x00007fca9585c05f pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0) #1 0x0000000000430be1 _ZN2os18condition_variable4waitERNS_11unique_lockINS_5mutexEEE (glretrace) #2 0x0000000000431749 _ZN7retrace11RelayRunner7runRaceEv (glretrace) #3 0x000000000042fc3a _ZN7retrace11RelayRunner12runnerThreadEPS0_ (glretrace) #4 0x0000000000433eb2 _ZN2os6thread13CallbackParamIFvPN7retrace11RelayRunnerEES4_EclEv (glretrace) #5 0x0000000000433631 _ZN2os6thread9_callbackINS0_13CallbackParamIFvPN7retrace11RelayRunnerEES5_EEEEPvS8_ (glretrace) #6 0x00007fca95856484 start_thread (libpthread.so.0) #7 0x00007fca944bcaed __clone (libc.so.6) Stack trace of thread 13981: #0 0x00007fca9585c05f pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0) #1 0x0000000000430be1 
_ZN2os18condition_variable4waitERNS_11unique_lockINS_5mutexEEE (glretrace) #2 0x0000000000431749 _ZN7retrace11RelayRunner7runRaceEv (glretrace) #3 0x000000000042fead _ZN7retrace9RelayRace3runEv (glretrace) #4 0x0000000000430073 _ZN7retraceL8mainLoopEv (glretrace) #5 0x0000000000430957 main (glretrace) #6 0x00007fca943f45e0 __libc_start_main (libc.so.6) #7 0x000000000042c5e9 _start (glretrace) Stack trace of thread 13982: #0 0x00007fca9585c05f pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0) #1 0x00007fca8ff35ba3 radeon_drm_cs_emit_ioctl (radeonsi_dri.so) #2 0x00007fca8ff353e7 impl_thrd_routine (radeonsi_dri.so) #3 0x00007fca95856484 start_thread (libpthread.so.0) #4 0x00007fca944bcaed __clone (libc.so.6) loader_dri3_wait_gl looks suspicious. Will retry with DRI3 disabled. Disabled DRI3 and the trace run through (without GPU crash) but glretrace crash: Wow, such stacktrace many interesting: Stack trace of thread 26634: #0 0x00007fc683d6ce6e __memcpy_sse2_unaligned (libc.so.6) #1 0x00007fc67f565f5f u_upload_data (radeonsi_dri.so) #2 0x00007fc67f567c88 u_vbuf_draw_vbo (radeonsi_dri.so) #3 0x00007fc67f3cce57 st_draw_vbo (radeonsi_dri.so) #4 0x00007fc67f39dad4 vbo_validated_drawrangeelements (radeonsi_dri.so) #5 0x00007fc67f39ddd4 vbo_exec_DrawRangeElementsBaseVertex (radeonsi_dri.so) #6 0x00000000004fcaee _ZL37retrace_glDrawRangeElementsBaseVertexRN5trace4CallE (glretrace) #7 0x000000000043a4a5 _ZN7retrace8Retracer7retraceERN5trace4CallE (glretrace) #8 0x000000000042fb36 _ZN7retraceL11retraceCallEPN5trace4CallE (glretrace) #9 0x000000000043183f _ZN7retrace11RelayRunner6runLegEPN5trace4CallE (glretrace) #10 0x00000000004317a8 _ZN7retrace11RelayRunner7runRaceEv (glretrace) #11 0x000000000042fc3a _ZN7retrace11RelayRunner12runnerThreadEPS0_ (glretrace) #12 0x0000000000433eb2 _ZN2os6thread13CallbackParamIFvPN7retrace11RelayRunnerEES4_EclEv (glretrace) #13 0x0000000000433631 _ZN2os6thread9_callbackINS0_13CallbackParamIFvPN7retrace11RelayRunnerEES5_EEEEPvS8_ (glretrace) #14 0x00007fc68515e484 start_thread (libpthread.so.0) #15 0x00007fc683dc4aed __clone (libc.so.6) Stack trace of thread 26629: #0 0x00007fc68516405f pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0) #1 0x0000000000430be1 _ZN2os18condition_variable4waitERNS_11unique_lockINS_5mutexEEE (glretrace) #2 0x0000000000431749 _ZN7retrace11RelayRunner7runRaceEv (glretrace) #3 0x000000000042fead _ZN7retrace9RelayRace3runEv (glretrace) #4 0x0000000000430073 _ZN7retraceL8mainLoopEv (glretrace) #5 0x0000000000430957 main (glretrace) #6 0x00007fc683cfc5e0 __libc_start_main (libc.so.6) #7 0x000000000042c5e9 _start (glretrace) Stack trace of thread 26630: #0 0x00007fc68516405f pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0) #1 0x00007fc67f83dba3 radeon_drm_cs_emit_ioctl (radeonsi_dri.so) #2 0x00007fc67f83d3e7 impl_thrd_routine (radeonsi_dri.so) #3 0x00007fc68515e484 start_thread (libpthread.so.0) #4 0x00007fc683dc4aed __clone (libc.so.6) Stack trace of thread 26632: #0 0x00007fc68516405f pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0) #1 0x0000000000430be1 _ZN2os18condition_variable4waitERNS_11unique_lockINS_5mutexEEE (glretrace) #2 0x0000000000431749 _ZN7retrace11RelayRunner7runRaceEv (glretrace) #3 0x000000000042fc3a _ZN7retrace11RelayRunner12runnerThreadEPS0_ (glretrace) #4 0x0000000000433eb2 _ZN2os6thread13CallbackParamIFvPN7retrace11RelayRunnerEES4_EclEv (glretrace) #5 0x0000000000433631 _ZN2os6thread9_callbackINS0_13CallbackParamIFvPN7retrace11RelayRunnerEES5_EEEEPvS8_ (glretrace) #6 0x00007fc68515e484 start_thread (libpthread.so.0) #7 
0x00007fc683dc4aed __clone (libc.so.6) Stack trace of thread 26633: #0 0x00007fc68516405f pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0) #1 0x0000000000430be1 _ZN2os18condition_variable4waitERNS_11unique_lockINS_5mutexEEE (glretrace) #2 0x0000000000431749 _ZN7retrace11RelayRunner7runRaceEv (glretrace) #3 0x000000000042fc3a _ZN7retrace11RelayRunner12runnerThreadEPS0_ (glretrace) #4 0x0000000000433eb2 _ZN2os6thread13CallbackParamIFvPN7retrace11RelayRunnerEES4_EclEv (glretrace) #5 0x0000000000433631 _ZN2os6thread9_callbackINS0_13CallbackParamIFvPN7retrace11RelayRunnerEES5_EEEEPvS8_ (glretrace) #6 0x00007fc68515e484 start_thread (libpthread.so.0) #7 0x00007fc683dc4aed __clone (libc.so.6) (In reply to Daniel Exner from comment #80) > Now I get (with radeonsi): > > glretrace game.x86_64.trace > apitrace: warning: caught signal 11 > 47062: error: caught an unhandled exception Great to see it does not crash just for me :-) (In reply to Michael Eagle from comment #82) > On Fedora 23 I'm using this copr: > https://copr.fedoraproject.org/coprs/griever/mesa-git/ Thanks, that helps a lot! (In reply to Daniel Exner from comment #83) > Managed to get some more infos about glretrace crash: Please note I uploaded a gdb backtrace to the apitrace bug mentioned in comment 73. Now that I tested latest mesa git (ea8c0b1, 2012-12-21), I can confirm apitrace still crashes, and I uploaded an updated gdb backtrace: https://github.com/apitrace/apitrace/files/69636/gdb.backtrace2.txt However, the question is whether that apitrace crash is related to the gpu hang we see in XCOM. It certainly makes debugging harder. Many thanks for your patience! While I still did not see a crash, valgrind does indeed report errors when playing back your crash. It is possible that such errors are indirectly related to the lockup, if they lead to bad data being sent to the card. Today's my last day before the holidays, but I'll look into this next week. Created attachment 120655 [details] gdb backtrace from xorg when system locks up during glretrace replay Thanks Nicolai and others for looking into this. I have some good news. The apitrace developer patched glretrace, so that it no longer crashes during replay. He says it's a bug inside XCOM. He also says the problem might likely cause the radeonsi driver crash as well. See his comment here: https://github.com/apitrace/apitrace/issues/407#issuecomment-166619366 (If this indeed turns out to be an XCOM bug, it would be nice if we could put some safeguards into the driver and didn't crash for invalid commands, but I'm saying that as someone who knows exactly zero about gpu driver programming). Also, now that glretrace does not crash, I'm able to easily loop the trace until its very end. In just a handful of replays, my system locked up 3 times. I haven't seen all lockups, but at least once it occurred exactly at the point where the lockup occurred while recording (at the very end). So it seems I'm able to reproduce this with reasonable likelihood, and therefore can test some fixes if needed. (In reply to Michel Dänzer from comment #77) > No need, it's not important. The assumption for now is that the Xorg crash > is caused by the GPU hang. If you happen to get a gdb backtrace of it in the > future, that will help verify this assumption, but it's no more than nice to > have. I have it now, attaching. (In reply to Kamil Páral from comment #87) > The apitrace developer patched glretrace, so that it no longer crashes > during replay. He says it's a bug inside XCOM. 
> He also says the problem might likely cause the radeonsi driver crash as well. See his comment here:
> https://github.com/apitrace/apitrace/issues/407#issuecomment-166619366
>
> (If this indeed turns out to be an XCOM bug, it would be nice if we could
> put some safeguards into the driver and didn't crash for invalid commands,
> but I'm saying that as someone who knows exactly zero about gpu driver
> programming).

Yep. We could add a new drirc hack for ignoring the start/end params of glDrawRangeElementsBaseVertex for applications like XCOM, i.e. assume 0..~0. XCOM is not unique here -- we've seen this happen with a Direct3D9 app once at VMware. It looks like some of the proprietary OpenGL/Direct3D drivers out there simply ignore the min/max index hints outright.

(In reply to Daniel Exner from comment #83)
> Managed to get some more info about the glretrace crash:
>
> Stack trace of thread 13986:
> #0 0x00007fca93b67945 loader_dri3_wait_gl (libGL.so.1)

This crash is fixed in current Mesa Git master. BTW, please create attachments for such large pieces of information.

(In reply to Kamil Páral from comment #87)
> > If you happen to get a gdb backtrace of it in the future, that will help
> > verify this assumption, but it's no more than nice to have.
>
> I have it now, attaching.

Thanks, it confirms that the Xorg crash is caused by the GPU hang and not related to the apitrace crash.

Just to let you guys know that with the latest LLVM 3.8 (256187) fixes and Mesa up to 50fc4a925644378c50282004304bc8fd64b95e3c, it takes much longer for XCOM: Enemy Unknown to crash the GPU; I played for two solid hours, so it's getting better. If someone wants, I can test the drirc workaround for glDrawRangeElementsBaseVertex; just put a .drirc here.

Sorry for pasting instead of an attachment.

The specs for glDrawRangeElementsBaseVertex [1] say this case (array out of bounds) should be handled like this: "Index values lying outside the range [start, end] are treated in the same way as glDrawElementsBaseVertex." The specs for glDrawElementsBaseVertex [2] don't say anything about this case (obviously, since this function doesn't imply any size constraints for the array).

So it seems like it is indeed a bug in the game to try to address this index element, but the operation also should not crash; it's unspecified behaviour. Perhaps radeonsi should handle it the same as the other Mesa drivers for the sake of consistency.

[1] https://www.opengl.org/sdk/docs/man/html/glDrawRangeElementsBaseVertex.xhtml
[2] https://www.opengl.org/sdk/docs/man/html/glDrawElementsBaseVertex.xhtml

(In reply to Daniel Exner from comment #91)
> So it seems like it is indeed a bug in the game to try to address this index
> element, but the operation also should not crash; it's unspecified
> behaviour.
>
> Perhaps radeonsi should handle it the same as the other Mesa drivers for the
> sake of consistency.

Yes, crashing should be avoided. But correct rendering, no, not generally. Not unless it can be done without performance impact (which is probably not the case). Otherwise it would be sacrificing the performance of correct GL apps for the sake of buggy GL apps, which is rewarding the wrong behavior.

It's not that hard: the start/end parameters are hints aimed precisely at enabling the driver to do performance optimizations. If the application developers can't get them right, they should just not set them to invalid values! Use 0 / ~0, which is guaranteed to work. This way the application developers that actually bothered to get them right don't get penalized. Everybody's happy.

Maybe it would help if Mesa's KHR_debug / apitrace checked for this sort of error.
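[Editorial illustration of the point above -- a minimal sketch of my own, not code from XCOM, Mesa, or apitrace. The helper name draw_mesh_range, the 16-bit client-side index array, and the use of libepoxy just to resolve the GL 3.2 entry point are all assumptions.]

```c
#include <epoxy/gl.h>   /* assumption: any loader exposing GL 3.2 entry points works */
#include <stddef.h>

/* Sketch: issue glDrawRangeElementsBaseVertex with a start/end hint that
 * really covers every index, as the spec quoted above requires. */
static void draw_mesh_range(GLenum mode, const GLushort *indices,
                            GLsizei count, GLint basevertex)
{
    if (count <= 0)
        return;

    GLuint start = ~0u, end = 0;
    for (GLsizei i = 0; i < count; i++) {   /* compute the real min/max index */
        if (indices[i] < start) start = indices[i];
        if (indices[i] > end)   end   = indices[i];
    }

    /* Every value in 'indices' now lies in [start, end], so a driver that
     * trusts the hint (e.g. to upload only that slice of a user vertex array)
     * never reads memory the application did not provide. When the range is
     * unknown, 0 / ~0 (or plain glDrawElementsBaseVertex) is the safe choice,
     * as suggested above. */
    glDrawRangeElementsBaseVertex(mode, start, end, count, GL_UNSIGNED_SHORT,
                                  (const void *)indices, basevertex);
}
```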
Brief update: the crashes and Valgrind errors when playing back the trace are almost certainly unrelated to the lockup. It turns out apitrace is too aggressive in trimming the client-side memory blobs during recording. (See https://github.com/apitrace/apitrace/issues/407#issuecomment-167866502)

On a more positive note, I am also seeing the lockup on Tonga from inside the game itself. Unfortunately, I cannot reproduce it reliably yet.

Created attachment 120734 [details] [review]
apitrace patch for honoring range in DrawRangeElementsX commands

So you'd want something like this (totally untested) patch for apitrace? It makes sense, I guess, that the specified range not only means the supplied indices have to be inside that range, but that it also works the other way round (the driver can rely on the specified range being accessible) - otherwise the driver would still need to scan the actual index buffer. (Albeit, since for this app the ranges seem to be pretty bogus, who knows if the memory inside the specified range but not used by the actual indices is really always accessible.)

I am actually wondering if it would be legal if the memory isn't accessible below the start value (apitrace surely couldn't handle that)...

(In reply to Roland Scheidegger from comment #94)
> So you'd want something like this (totally untested) patch for apitrace?
> [...]

Yes, this does indeed make sense. For VBOs, the start/end range is a hint (OpenGL shouldn't crash if the start/end range goes beyond the VBO size), but for user memory arrays, there's no reliable way to know where user memory is supposed to stop -- the start/end range is it.

> I am actually wondering if it would be legal if the memory isn't accessible
> below the start value (apitrace surely couldn't handle that)...

I don't think it's worth worrying about that.

Created attachment 120742 [details]
more conservative apitrace patch
For what it's worth, I've attached a modified version of Roland's patch that is slightly more conservative, guarding against some stupid end values and checking the indices. Not sure which patch is really better though, in the end it depends on how much broken software is out there. As far as I can tell, XCOM apitraces work with both variants.
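[Editorial note: the kind of index check discussed in the last few comments (and the KHR_debug-style warning wished for earlier) could look roughly like the sketch below. This is my own illustration under stated assumptions -- it is not the attached patch and not apitrace or Mesa code; 16-bit client-side indices are assumed.]

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Sketch: verify that every index of a DrawRangeElements-style call really
 * lies inside the declared [start, end] hint, and report the first violation. */
static int check_declared_range(const uint16_t *indices, size_t count,
                                uint32_t start, uint32_t end)
{
    for (size_t i = 0; i < count; i++) {
        if (indices[i] < start || indices[i] > end) {
            fprintf(stderr,
                    "index %zu is %u, outside the declared range [%u, %u]\n",
                    i, (unsigned)indices[i], start, end);
            return 0;   /* the call violates the glDrawRangeElements contract */
        }
    }
    return 1;
}
```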
I'm pushing a slightly modified version of Roland's patch.

(In reply to Nicolai Hähnle from comment #96)
> For what it's worth, I've attached a modified version of Roland's patch that
> is slightly more conservative, guarding against some stupid end values and
> checking the indices. Not sure which patch is really better though; in the
> end it depends on how much broken software is out there. As far as I can
> tell, XCOM apitraces work with both variants.

Yes, I don't think this is necessary. If apitrace needs more resiliency, then the best approach would be to set up a segv handler to cope with out-of-bounds reads. This code path is only used for user arrays. (VBOs don't need this special treatment.)
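[Editorial illustration of the segv-handler idea suggested above -- a rough POSIX sketch of my own, not apitrace code; safe_copy and its single-threaded use are assumptions.]

```c
#include <setjmp.h>
#include <signal.h>
#include <string.h>

static sigjmp_buf recover_point;

static void segv_handler(int sig)
{
    (void)sig;
    siglongjmp(recover_point, 1);   /* bail out of the faulting memcpy */
}

/* Copy n bytes from possibly short or unmapped client memory; returns 0 if a
 * fault occurred. A real implementation would also worry about threads and
 * about partially written destinations. */
static int safe_copy(void *dst, const void *src, size_t n)
{
    struct sigaction sa, old_sa;
    int ok = 1;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = segv_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old_sa);

    if (sigsetjmp(recover_point, 1) == 0)
        memcpy(dst, src, n);        /* a fault here jumps to the else branch */
    else
        ok = 0;

    sigaction(SIGSEGV, &old_sa, NULL);  /* restore the previous handler */
    return ok;
}
```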
glretrace -v from apitrace (5e96ed318db1ba8037eb402724bc052240ac9e05) still crashes with the trace:

#0 0x00007f9d112fae6e __memcpy_sse2_unaligned (libc.so.6)
#1 0x00007f9d0caf3f5f u_upload_data (radeonsi_dri.so)
#2 0x00007f9d0caf5c88 u_vbuf_draw_vbo (radeonsi_dri.so)
#3 0x00007f9d0c95ae57 st_draw_vbo (radeonsi_dri.so)
#4 0x00007f9d0c92bad4 vbo_validated_drawrangeelements (radeonsi_dri.so)

Last output before crash:

965948 @3 glDrawRangeElementsBaseVertex(mode = GL_TRIANGLES, start = 0, end = 15, count = 30, type = GL_UNSIGNED_SHORT, indices = blob(60), basevertex = 0).

Kernel 4.4.0-rc8, Mesa 11.1.0. The game itself also still crashes.

Played a bit with GALLIUM_HUD, using:

GALLIUM_HUD="fps,cpu,ps-invocations+hs-invocations+ds-invocations+cs-invocations;num-compilations+num-shaders-created,draw-calls,buffer-wait-time;num-cs-flushes,num-bytes-moved;VRAM-usage+GTT-usage,GPU-load,temperature,shader-clock+memory-clock"

I can play for some minutes before the game crashes (but it doesn't kill the box) and the trace looks like this:

#0 0x00007f22458745ac hud_draw_string (radeonsi_dri.so)
#1 0x00007f2245874e28 hud_draw (radeonsi_dri.so)

I guess this is a GALLIUM_HUD bug. Any metric that might be of particular interest?

Hi Daniel,

Re comment #98: that's to be expected. The problem was with *recording* the trace, not with playing it back. A trace recorded without the patch will crash when played back, whether the playback session has the patch or not.

Re comment #99: interesting, and yes, I'd say that's a HUD bug. Could you please file a separate bug report for that? Those metrics are very unlikely to help, though. One of my crash reproductions had a VM fault where a page that should have been there (according to the radeonsi VM fault dump) apparently wasn't. There are some theories (like problems with the SDMA ring that is used to update the page table entries), so it's likely some race condition is involved.

I added a new bug report about the GALLIUM_HUD bug. Are those faults related to this bug?

Jan 10 12:04:15 Joshua kernel: radeon 0000:01:00.0: GPU fault detected: 146 0x0f653014
Jan 10 12:04:15 Joshua kernel: radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0001DF7B
Jan 10 12:04:15 Joshua kernel: radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x05030014
Jan 10 12:04:36 Joshua kernel: radeon 0000:01:00.0: GPU fault detected: 146 0x0a653014
Jan 10 12:04:36 Joshua kernel: radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00016F53
Jan 10 12:04:36 Joshua kernel: radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x05030014
Jan 10 12:04:56 Joshua kernel: radeon 0000:01:00.0: GPU fault detected: 146 0x0b853014
Jan 10 12:04:56 Joshua kernel: radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00016F5C
Jan 10 12:04:56 Joshua kernel: radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x05030014
Jan 10 12:06:39 Joshua kernel: radeon 0000:01:00.0: GPU fault detected: 146 0x06253014
Jan 10 12:06:39 Joshua kernel: radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00021CB1
Jan 10 12:06:39 Joshua kernel: radeon 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x05030014

Yes, they're almost certainly related. A sequence of VM faults followed by a lockup is not an unusual symptom.

(In reply to Nicolai Hähnle from comment #100)
> re comment #98: That's to be expected...

Will it help you if I try to capture another trace with a fixed apitrace? If it will, can I simply grab apitrace git master (i.e. including https://github.com/apitrace/apitrace/commit/edc099cff55a6a3f9ad191acfbc8cc39f36228db ), or do I also need to apply the patch mentioned in https://github.com/apitrace/apitrace/issues/407#issuecomment-166619366 on top of that? (That patch was not pushed to git.)

At this point, I can reproduce the lockup, albeit not deterministically, so it's not really needed. If you are able to capture an apitrace that reproduces the lockup deterministically (even after a cold reboot!), then that would still be interesting - but I kind of doubt that that's possible.

Created attachment 121531 [details]
journald xcom: ew lockup
Any news on this bug? I've just tried to play the game and my system froze after a couple of minutes in the first mission; I was able to move the cursor and the music was playing, but I couldn't do anything else. I had journalctl -f running on my laptop via SSH so I could get the error messages.
Arch Linux
Kernel 4.4.1
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Pitcairn XT [Radeon HD 7870 GHz Edition]
Using the latest mesa-git and llvm-libs from lcarlier repo
Created attachment 121532 [details]
journald xcom: ew apitrace crash
Also tried to run apitrace after the lockup but it crashed the game when I clicked to start the mission on the loading screen. Not sure what I'm doing wrong and couldn't get apitrace past the loading screen. journald log attached.
We have found out that XCOM issues invalid OpenGL commands. The purpose of this bug report is for the radeon driver to stop crashing when that happens. But I wonder, has somebody contacted the XCOM developers and asked them to fix their bug? That could make the game work properly with the radeon driver, with no crashes. I know that the XCOM devs chipped in here before, but they likely haven't followed the full discussion.

(In reply to Kamil Páral from comment #107)
> We have found out that XCOM issues invalid OpenGL commands.

Have we confirmed that this is true, and if so, which call is supposedly incorrect? I've been following and seen some speculation, but it seems that although there was some debate, no firm decision was made on whether this was undefined behaviour that was open to interpretation or something definitely wrong with the game.

Once we get some information on the issue, we can then investigate at Feral whether the issue is indeed inside the game and not inside the Mesa drivers.

Hi Edwin, thanks for following this discussion! I got the impression that XCOM uses invalid or at least undefined behavior from comments 88, 91 and 92, and from the apitrace bug comments here:
https://github.com/apitrace/apitrace/issues/407#issuecomment-166619366
https://github.com/apitrace/apitrace/issues/407#issuecomment-166752457
https://github.com/apitrace/apitrace/issues/407#issuecomment-167866502

But I'm no OpenGL developer, so I'll let Nicolai or Jose or somebody else knowledgeable confirm or refute this :)

(In reply to Kamil Páral from comment #109)
> Hi Edwin, thanks for following this discussion! I got the impression that
> XCOM uses invalid or at least undefined behavior from comments 88, 91 and 92,
> and from the apitrace bug comments [...]

:) No problem, we'll have a look at the Linux code the next time we patch the game, as regardless of the reasons it would be nice to get the game running as well as other Feral games that run on Mesa. It's quite possible this issue is something in the original engine behaviour that needs correcting for the Mesa drivers. Many thanks to everyone on the bug for their investigations.

The apitrace-related crash from those earlier comments is not an XCOM bug.

However, if I recall correctly, XCOM issues DrawRangeElements calls with ranges that are larger than necessary. This hurts performance slightly when vertex data is not in VBOs; it is irrelevant when all vertex data is in VBOs.

I have been able to reproduce the lockup (thanks to Edwin), but it takes a fairly long time inside the game for it to happen for me. With everything else that is going on, I simply haven't yet had the time to collect enough information to really understand what's going on.

(In reply to Nicolai Hähnle from comment #111)
> However, if I recall correctly, XCOM issues DrawRangeElements calls with
> ranges that are larger than necessary.

My understanding was that exactly the opposite was the issue - the game sends indices outside of the specified bounds. Quoting Jose [1]: "the application is giving a hint to the OpenGL driver that indices are between 0 and 215, but in fact in call 21933, the index at position 164 is 216, and more later."
Perhaps it does, and that would be bad, but the particular apitrace crash was basically the following: 1) XCOM uses DrawRangeElements with an unnecessarily large range. 2) During tracing, apitrace scans the index/elements array to determine the range of vertices that is really being used. 3) Apitrace only stores this range of vertices. 4) During playback, apitrace would send the same range via DrawRangeElements, but provide vertex data only for the range that was determined to be really used. 5) The driver, on the other hand, relies on the entire range to be there and tries to upload it to the card. This is where the crash happened (and it also explains what I said before about XCOM being slightly inefficient here). I see, thanks for the explanation. In that case I was wrong in automatically assuming there is a bug in XCOM. Sorry for the noise. I think there are multiple issues being conflated here. Yes, apitrace had some bugs, and they might have caused additional grief to Mesa drivers when replaying traces from XCOM. But if the question is "does XCOM have a bug?", then the answer is IMO a definite "yes". As explained in https://github.com/apitrace/apitrace/issues/407#issuecomment-166619366 the game is passing indices outside the start..end range, which is illegal per https://www.opengl.org/sdk/docs/man/html/glDrawRangeElementsBaseVertex.xhtml "all values in the array indices must lie between start and end, inclusive, prior to adding basevertex". So if the XCOM developers are looking at this bug report, then please fix this issue. Even if it's not the whole story here, it is a bug in its own right, which can and will cause rendering issues depending on the OpenGL driver implementation. This thread is too long. Could someone please summarize the issues here? Also, how does the apitrace crash relate to the GPU hang? (In reply to Marek Olšák from comment #116) > This thread is too long. Could someone please summarize the issues here? > Also, how does the apitrace crash relate to the GPU hang? I hoped somebody clever would respond, but it seems it'll have to be me. OpenGL layman alert. There were multiple issues discovered in this report: 0. (the core issue) radeonsi hangs the system completely (or sometimes recovers) when playing XCOM, randomly (can be minutes, can be hours) 1. Jose claims XCOM has a bug, issuing invalid OpenGL commands (see comment 115). 2. apitrace is crashing while replaying almost any XCOM trace. My trace from comment 74 is affected by this, so you need a temporary fix from https://github.com/apitrace/apitrace/issues/407#issuecomment-166619366 to stop it from crashing. The fix is not mainlined, because Jose says the purpose of apitrace is to help discover problems, not hide them. 3. Another discovered issue was that apitrace was trimming vertex data too aggressively (comment 113, https://github.com/apitrace/apitrace/issues/407#issuecomment-167866502). That is now fixed in apitrace, but my trace is affected; I'd have to re-record it. The problem is that I was really lucky to record it in the first place: my computer did not hang completely as it does in 99% of cases, but recovered, so the trace was not cut short. I don't think I'd get that lucky again. Plus Nicolai said he does not need that. The trimmed vertex data does not seem to affect the replay in a negative way, but I assume it might complicate the debugging process. 4. When looping over my trace, I can reproduce the crash pretty quickly (i.e. 
my computer completely hangs), but not deterministically (it does not happen on every replay). But it seems there is no further info I can supply from my side to help you debug this. 5. Nicolai said he can reproduce it (probably by running the game, not replaying my trace), but it takes a long time, so he wasn't able to work on this too much. In summary, I think it was intimated that the issue might be caused by how XCOM deals with indices. === The game is passing indices outside start..end range, which is illegal per https://www.opengl.org/sdk/docs/man/html/glDrawRangeElementsBaseVertex.xhtml "all values in the array indices must lie between start and end, inclusive, prior to adding base vertex" === Mesa's Intel driver and the AMD/Nvidia closed-source drivers deal with this gracefully by ignoring the range hint if it is invalid; however, RadeonSI does not and can in some cases crash. Because XCOM was originally designed for DirectX on Windows, where this behaviour is not a fatal error, and because other OpenGL drivers on Linux & Mac also don't throw an error/warning, this issue was overlooked on the original port; Mesa RadeonSI was not a supported driver at the time, so no-one saw the issue. This has already been fixed for our more recent games, as the Mesa AMD drivers now support most of the features needed for many games, so they are actively used and tested and bugs are logged at Feral. We don't have any plans for a patch in the short term, but we'll definitely backport this fix into XCOM so we match the spec correctly when we next patch it. For what it's worth, I'd encourage people to update their graphics stack to the latest everything (including kernel + X.Org server) and see if they still get lockups. I haven't been able to reproduce this anymore in my last two attempts - I don't know if I've just been lucky, but it might have been fixed incidentally. Unfortunately, too much has changed on my system between reproduction attempts to be able to say exactly what might have fixed it. Using xorg 1.18.2, Kernel 4.5 and OpenGL renderer string: Gallium 0.4 on AMD PITCAIRN (DRM 2.43.0, LLVM 3.8.0) OpenGL core profile version string: 4.1 (Core Profile) Mesa 11.3.0-devel (git-84b961d) I was able to play about 2 hours without a crash. But that worked in the past also, at least sometimes. I'll see if I find some more testing time. On a 7790 with the latest versions of just about everything - kernel 4.5, mesa git head (11.3.0-devel), llvm 3.8, XOrg 1.18.2 - I still see this crash regularly. I'm happy to provide any further information/logs/traces if necessary. Tested again, same setup as before: crashed after 5 minutes. Thanks for the re-test. It's odd that I couldn't reproduce it any more. It may be that I was just lucky. However, it's worth noting that Daniel has a GCN 1.0 card and David has a GCN 1.1 card, i.e. running on the radeon kernel module and DDX. It's possible that an amdgpu-only change has incidentally fixed this for my Tonga-based test setup. Well, there is this (still) highly experimental amdgpu for southern islands branch. Perhaps I'll try that if I feel very lucky. How do I get a dmesg log during the lockup? I got this on the r600 driver (AMD Radeon HD 6650M, TURKS). That's interesting, because r600 is a different user-space OpenGL driver. It might be an interaction with the DDX or kernel though. If you cannot ssh from a different computer, you can still recover the log from e.g. /var/log/kern.log (the exact location may depend on the distribution). 
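As an aside, the "ignore the range hint if it is invalid" behaviour Edwin describes for the Intel and closed-source drivers amounts to recomputing the real bounds from the element array instead of trusting the caller. A rough sketch of that idea follows; this is my own illustration under that assumption, not the actual implementation of any of those drivers, and the function name is made up.

/* Sketch only: derive the real index bounds by scanning the element array
 * rather than trusting the application's start/end hint.  Not Mesa code. */
#define GL_GLEXT_PROTOTYPES
#include <GL/gl.h>
#include <GL/glext.h>

static void draw_with_recomputed_bounds(const GLushort *indices,
                                        GLsizei count, GLint basevertex)
{
   GLuint lo = ~0u, hi = 0;

   if (count <= 0)
      return;

   for (GLsizei i = 0; i < count; i++) {
      if (indices[i] < lo) lo = indices[i];
      if (indices[i] > hi) hi = indices[i];
   }

   /* Every index now provably lies in [lo, hi], so the range handed to the
    * driver is valid regardless of what the caller originally claimed. */
   glDrawRangeElementsBaseVertex(GL_TRIANGLES, lo, hi, count,
                                 GL_UNSIGNED_SHORT, indices, basevertex);
}

The obvious cost is an extra CPU pass over the indices on every draw, which is presumably why the spec lets applications supply the hint in the first place.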
The comments above from Jose Fonseca (and the follow-up from Feral's Edwin Smith) imply that the problem here is the game calling glDrawRangeElementsBaseVertex with bad start/end values, resulting in indices that are out of range, which is illegal per the GL spec, and that this is what causes the crash. I have tried patching Mesa to effectively reduce glDrawRangeElementsBaseVertex calls to glDrawElementsBaseVertex (i.e. the same method but with no start/end supplied). The patch follows below (inline because it is so short). However, I am sorry to say that this did *not* prevent the crashes. I conclude that there may still be a bug in Mesa and/or kernel-space DRM (although it's possible my patch isn't having the effect I intended - I'm not familiar enough with the Mesa code base to be sure).

diff --git a/src/mesa/vbo/vbo_exec_array.c b/src/mesa/vbo/vbo_exec_array.c
index f0245fd..d1f4ac6 100644
--- a/src/mesa/vbo/vbo_exec_array.c
+++ b/src/mesa/vbo/vbo_exec_array.c
@@ -935,7 +935,9 @@ vbo_exec_DrawRangeElementsBaseVertex(GLenum mode,
    (void) check_draw_elements_data;
 #endif
-   vbo_validated_drawrangeelements(ctx, mode, index_bounds_valid, start, end,
+   //vbo_validated_drawrangeelements(ctx, mode, index_bounds_valid, start, end,
+   //                                count, type, indices, basevertex, 1, 0);
+   vbo_validated_drawrangeelements(ctx, mode, GL_FALSE, ~0, ~0,
                                    count, type, indices, basevertex, 1, 0);
 }

(In reply to Davin McCall from comment #127) > The comments above from Jose Fonseca (and the follow-up from Feral's Edwin Smith) imply that the problem here is the game calling glDrawRangeElementsBaseVertex with bad start/end values, resulting in indices that are out of range, which is illegal per the GL spec, and that this is what causes the crash. It might be useful to see what the Intel and/or r600 series drivers do, as neither of these drivers exhibits this crash in similar circumstances. The Intel Mesa driver might be best, as this driver was fully supported from release without any reports of hangs post-release. It's possible a comparison might help expose why RadeonSI behaves differently and narrow down the cause of the hang, as it could be that the root cause is hiding behind a more benign issue. (In reply to Davin McCall from comment #127)
> diff --git a/src/mesa/vbo/vbo_exec_array.c b/src/mesa/vbo/vbo_exec_array.c
> index f0245fd..d1f4ac6 100644
> --- a/src/mesa/vbo/vbo_exec_array.c
> +++ b/src/mesa/vbo/vbo_exec_array.c
> @@ -935,7 +935,9 @@ vbo_exec_DrawRangeElementsBaseVertex(GLenum mode,
>     (void) check_draw_elements_data;
>  #endif
> -   vbo_validated_drawrangeelements(ctx, mode, index_bounds_valid, start, end,
> +   //vbo_validated_drawrangeelements(ctx, mode, index_bounds_valid, start, end,
> +   //                                count, type, indices, basevertex, 1, 0);
> +   vbo_validated_drawrangeelements(ctx, mode, GL_FALSE, ~0, ~0,
>                                     count, type, indices, basevertex, 1, 0);
>  }
That's not correct, but it's close. If you want the driver to ignore the app-supplied index bounds, simply change "index_bounds_valid" to "false". Oh, I see you did that. Sorry for the noise. Edwin Smith:
> It might be useful to see what the Intel and/or r600 series drivers do, as neither of these drivers exhibits this crash in similar circumstances.
I think you missed the significance of my second paragraph above - I have established that the hang is _not_ caused by the mis-use of the glDrawRangeElementsBaseVertex() function. So comparing how the radeonsi and Intel/r600 drivers handle this function is not likely to help in resolving this issue.
Davin
(In reply to Davin McCall from comment #131) > I think you missed the significance of my second paragraph above - I have > established that the hang is _not_ caused by the mis-use of the > glDrawRangeElementsBaseVertex() function. So comparing how the radeonsi and > Intel/r600 drivers handle this function is not likely to help in resolving > this issue. OK, I'll leave this one with you but if you need anything else from Feral let us know. Created attachment 124486 [details]
crash
I once again tried it:
Kernel: 4.7.0-rc2-00342-g8714f8f
Mesa: Dev git 54f755f
Llvm: Dev cd22fc5 (using llvm git mirror)
And it crashed again. The whole screen froze, went black, returned, went black, returned, DPMS kicked in. Had to reset the box.
Something else I noticed: the GPU temperature as reported by lm_sensors went up to 69°C while gaming. Normal with an idle Plasma DE is 44°C.
(Not sure if relevant)
Hi Daniel, that's curious, but I doubt the crash is related to the lockup. Most likely, buffer creation fails in radeon_cs_create_fence and then we get a NULL pointer dereference. If you could get a backtrace with line numbers to confirm, that would be nice. In any case, GPU lockups can only be caused by actually submitting something to the GPU, which we obviously don't do once the game process crashes... so more likely, the GPU lockup happens first and then causes the subsequent failure somehow. I had something interesting happen with kernel 4.7 rc5 and mesa 12.1 git1591e66. The system would lock up, and the console would fill with the stalling ring messages. But then it wouldn't stay locked up and would recover somewhat. The game wouldn't crash either, allowing me to exit it cleanly. Plasmashell would crash though, and there is a lot of video corruption on the desktop afterwards until the reboot. Happened two times in a row. I tried with: * Kernel 4.7.0-rc5-00309-gdbdc3bb * LLVM: 274446 * Mesa: 01ccb0d This time I could play for a solid hour without a crash. Will try again later to confirm. @Devs: any idea if there is something that might have fixed this? Or just luck? Nothing that I'm aware of. May have been just luck... Created attachment 124896 [details]
crash journal kernel 4.7.0-rc6
Seems you were right: it crashed again today.
I managed to log some of it; perhaps it's of some use.
Are you running with some kind of virtualization enabled? I'm not familiar with these AMD-Vi messages. (In reply to Nicolai Hähnle from comment #139) > Are you running with some kind of virtualization enabled? I'm not familiar > with these AMD-Vi messages. Those are printed by the IOMMU driver (a device accessed an unmapped page). Not necessarily virtualization related. Well, I can reliably make it stop locking the system completely now. I switch to the console with ctrl+alt+f2 as soon as the game locks up and then back as soon as the ring stall messages start to fill up the screen. When I am back, the game is not frozen anymore, but extremely slow. In fact, all OpenGL-related rendering becomes slow. And I get this message each time I switch to the console: "GPU fault detected 146 0x0904xxxc VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00029E48 VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x040xxxxC" How should I give more information? If I restart, the performance comes back to normal. I've seen plenty of GPU hangs with XCOM: Enemy Within. It's basically the same game with a little more content, but not much. The reproducibility is random. The hangs usually happen between 1 minute and 8 hours in. In my case, this is the data I've been able to obtain: - Reproduced on everything I was testing on: Hawaii (radeon), Tonga, Polaris11 (amdgpu) - It occurs with many different shaders, among which there are a few very simple ones (no scratch or spills, a few ifs, no loops). - Disabling HyperZ has no effect. - Disabling CE has no effect. - VM faults never occur. - The hangs don't seem to have anything in common. Action items: - Reproduce the hang and do a hardware scan dump (it can only be done in the AMD office AFAIK), and send it to hardware teams. (In reply to Alassane Maiga from comment #141) > Well, I can reliably make it stop locking the system completely now. I switch > to the console with ctrl+alt+f2 as soon as the game locks up and then back as > soon as the ring stall messages start to fill up the screen. When I am back, > the game is not frozen anymore, but extremely slow. In fact, all OpenGL-related > rendering becomes slow. And I get this message each time I switch to > the console: > "GPU fault detected 146 0x0904xxxc > VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00029E48 > VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x040xxxxC" > > How should I give more information? > > If I restart, the performance comes back to normal. I think it is a similar effect to when DPM is disabled (with 'radeon.dpm=0'). I'm interested to know whether DPM has a hand in this issue or not. Here's a "git diff" between si_dpm.c from kernel 3.16 and 4.7.2: http://pastebin.com/raw/UzZFfYgp Perhaps there are some hints in there. BTW, Team Fortress 2 also has hangups, and I was able to play it for 40+ minutes without issues with "radeon.dpm=0". Disabling radeon.dpm, I was stable for 70 minutes with Netflix and Stellaris running; usually I crash within 10-20 minutes. (In reply to Marek Olšák from comment #142) [..] > In my case, this is the data I've been able to obtain: > - Reproduced on everything I was testing on: Hawaii (radeon), Tonga, > Polaris11 (amdgpu) Out of curiosity: what about r600? > Action items: > - Reproduce the hang and do a hardware scan dump (it can only be done in the > AMD office AFAIK), and send it to hardware teams. Any news here? My best bet is still either something in DPM or LLVM. The DPM theory is supported by this card being one of those needing a DPM quirk in si_dpm.c. 
Possible fix: https://cgit.freedesktop.org/mesa/mesa/commit/?id=6dc96de303290e8d1fc294da478c4f370be98dea (In reply to Marek Olšák from comment #147) > Possible fix: > https://cgit.freedesktop.org/mesa/mesa/commit/ > ?id=6dc96de303290e8d1fc294da478c4f370be98dea I wonder how _that_ could have not been found earlier. Anyway, played for about an hour without a crash using: OpenGL renderer string: Gallium 0.4 on AMD PITCAIRN (DRM 2.48.0 / 4.9.0-rc8-dirty, LLVM 4.0.0) OpenGL core profile version string: 4.5 (Core Profile) Mesa 13.1.0-devel (git-31f988a9d6). Will try to crash it again tomorrow, but looks promising. (In reply to Daniel Exner from comment #148) > Will try to crash it again tomorrow, but looks promising. Played about 3h straight today. More than the whole past year! Guess this can finally be closed. Marek, if I ever meet you in person I owe you some beer (or whatever you prefer :) I also suffered a lot from this issue on my R9 270X. I was never able to go past the first tutorial mission because my PC hung during that mission. Now I tried it again with mesa 13.0.3 and I was finally able to finish the first mission and also played 3 more missions after that without any issues. It looks like this issue is finally resolved. Thanks a lot Marek! OK. The CSO fix did it. Thanks for the info. Closing. *** Bug 88925 has been marked as a duplicate of this bug. *** Why is this fix in older versions (like mesa 13.0) but not newer ones? Mesa 18.2.2 on Ubuntu 18.10 (cosmic) still needs manual patching. And upstream version 19 still excludes this fix. btw: I can confirm this bug is still present in Ubuntu 18.10 and the fix from [1] works fine. Without the fix XCOM crashes right after the intro; after applying the patch I can play the game. [1] https://cgit.freedesktop.org/mesa/mesa/commit/?id=6dc96de303290e8d1fc294da478c4f370be98dea The commit is present in all the branches, but it got reverted two weeks later: https://cgit.freedesktop.org/mesa/mesa/commit/?id=52098fada7e Where "the previous change" probably refers to: https://cgit.freedesktop.org/mesa/mesa/commit/?id=95eb5e4eed So if you can reliably reproduce a crash with vanilla Mesa and it works fine if you re-apply the patch, it means that Michel's fix doesn't work as intended. Your problem could be the same issue as reported here, or a different one. I'll reopen this issue to make sure that maintainers don't overlook this. Does the problem occur with Mesa built from https://cgit.freedesktop.org/mesa/mesa/commit/?id=52098fada7e? Picked up XCOM Complete for ~£3 @ Voidu. I've played it before of course, but thought I'd try out Feral's Linux version rather than playing the Windows-only GOG release through WINE. Debian Unstable https://pastebin.com/j90guU1W amd-staging-drm-next https://cgit.freedesktop.org/~agd5f/linux/commit/?h=amd-staging-drm-next&id=2acb851ad43b8f05b634bf6e70e4f859dfed9281 mesa-git with llvm8 https://cgit.freedesktop.org/mesa/mesa/commit/?id=250fffac152f3cbdbea505fc642e5f023c3f3b7e I can also replicate the hard lock issue, which occurs shortly after starting to play the vanilla game launched through Feral's launcher. The keyboard is unresponsive - unable to open another terminal to kill the first. 2/2 times it has seized up the computer completely, requiring a hard reboot. The first lockup was midway through the tutorial battle, the second was shortly afterwards. I've yet to experience a lockup when playing "Enemy Within", which I've also tried both times. It looks like the main game and expansion have separate binaries. 
I'll see if compiling mesa with the changes Michel linked fixes it. So I've had a go at compiling mesa with the fix applied. I'll list what I've done here to be absolutely clear.
Cloned a new copy of mesa
> git clone https://gitlab.freedesktop.org/mesa/mesa.git
Downloaded the patch file from the link Michel posted into the same directory. Patched the source file
> patch src/gallium/auxiliary/cso_cache/cso_cache.c mesa.patch
Compiled the driver with a handy script:
> https://pastebin.com/nmYaj2az
I override (but don't replace) Debian's drivers with the compiled mesa drivers:
> https://pastebin.com/VX8PUt5W
Rebooted and cleared out my mesa shader cache from ~/.cache and ~/.steam/steam/steamapps/shadercache/
The stability of the game is noticeably improved - I was able to play right through the first tutorial battle and fight several more battles, up to the point where you build Alien Containment. Must've played it for at least an hour. I've not tested whether Enemy Within or any other game is affected by the change, however. Hold on a minute! I've just applied that patch to the latest mesa - now correct me if I'm wrong - but the source already has that change applied: https://gitlab.freedesktop.org/mesa/mesa/blob/master/src/gallium/auxiliary/cso_cache/cso_cache.c
> cso_insert_state(struct cso_cache *sc,
>                  unsigned hash_key, enum cso_cache_type type,
>                  void *state)
> {
>    struct cso_hash *hash = _cso_hash_for_type(sc, type);
>    sanitize_hash(sc, hash, type, sc->max_size);
>
>    return cso_hash_insert(hash, hash_key, state);
> }
So how can it work perfectly for one hour or more and then crash and burn on two other occasions? I feel a bit daft now too =( I'll have to clear out everything (cache, mesa drivers in /opt/mesa), re-compile the latest and test it for even longer. Andrew, see comments 154 & 155. To clarify comment 155: If the problem happens with commit 52098fada7e, please bisect between 6dc96de30329 and that. Otherwise, please bisect between 52098fada7e and the current commit you were testing. -- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1210.