Summary:     radeonsi: 3D engines causing frequent GPU lockups
Product:     Mesa
Component:   Drivers/Gallium/radeonsi
Status:      RESOLVED FIXED
Severity:    major
Priority:    high
Version:     git
Hardware:    x86-64 (AMD64)
OS:          Linux (All)
Reporter:    MirceaKitsune <sonichedgehog_hyperblast00>
Assignee:    Default DRI bug account <dri-devel>
QA Contact:  Default DRI bug account <dri-devel>
CC:          pablodav, sonichedgehog_hyperblast00
See Also:    https://bugzilla.opensuse.org/show_bug.cgi?id=1046962
             https://bugs.freedesktop.org/show_bug.cgi?id=105425
Attachments: Xorg.0.log.old, Xorg.0.log, xsession-errors-:0, dmesg, lspci, journalctl,
             mesa_err.log, mesa_err_2.log, Screenshot of sys/kernel/debug/dri/0,
             watch-clients.txt, watch-gem_names.txt, watch-radeon_gtt_mm.txt,
             watch-radeon_vram_mm.txt, watch-ttm_dma_page_pool.txt, watch-ttm_page_pool.txt,
             Memtest86 screenshot
Description
MirceaKitsune
2017-07-02 00:22:48 UTC
Created attachment 132388 [details]
Xorg.0.log.old
Created attachment 132389 [details]
Xorg.0.log
Created attachment 132390 [details]
xsession-errors-:0
Created attachment 132391 [details]
dmesg
Created attachment 132392 [details]
lspci
Created attachment 132393 [details]
journalctl
Created attachment 132397 [details]
mesa_err.log
I was able to find an important clue while running Xonotic with the following environment variables:
MESA_DEBUG=1
MESA_LOG_FILE=/foo/bar/mesa_err.log
A log is generated and can be read after restarting the machine. It contains only one line, but that line looks like it might point to the cause:
Mesa: User error: GL_INVALID_OPERATION in glGetQueryObjectiv(out of bounds)
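For reference, the logging setup described above can be reproduced from a terminal roughly like this (a sketch; the log path and the Xonotic launch command are placeholders rather than values taken from this report):

    # Ask Mesa to record user errors such as GL_INVALID_OPERATION to a file.
    export MESA_DEBUG=1
    export MESA_LOG_FILE="$HOME/mesa_err.log"   # placeholder path
    xonotic-glx                                 # or whichever 3D application triggers the lockup

After the machine is rebooted, the log file can be read normally; in this report it contained the glGetQueryObjectiv error quoted above.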
Created attachment 132621 [details]
mesa_err_2.log

The freeze persists in Mesa 17.1.4. This comes as a surprise, since a GPU lockup was fixed in that version which should have been the same problem I'm getting: https://bugs.freedesktop.org/show_bug.cgi?id=101294

I'm attaching a new output generated by MESA_LOG_FILE; it appears to print different information after the update, and there are now a lot of messages about an out-of-memory error. As Xonotic uses texture compression and I've never had problems until last month, I don't suspect any actual assets are filling up the memory. Could we be talking about a VRAM memory leak?

Created attachment 132863 [details]
Screenshot of sys/kernel/debug/dri/0
The exact same story repeats with Mesa 17.1.5 as it did with 17.1.4: the release notes claim that a core crash has been fixed, yet this freeze inexplicably persists after updating to the latest version.
Something very unusual happened, however: I rebooted my machine and started testing in Xonotic. After about 10 minutes of playing I got my first freeze, but it did not block the machine; only Xonotic itself crashed (the image froze and the sound died), so I was able to alt-tab back to my desktop. The system detected that the process was unresponsive and killed it, and I noticed that it did NOT eat up any CPU or memory while it was frozen. I tested again, and after about 15 minutes I got another freeze... this time it froze the entire computer as usual (including taking down SSH).
I performed the suggested test of monitoring the files in /sys/kernel/debug/dri/0 through my SSH connection, to check whether this might be caused by a VRAM leak. The most relevant file in there was radeon_vram, which reports 2.0 GB at all times (which makes sense, as that's the amount of VRAM on my video card). I used the command "watch -n 1 cat /sys/kernel/debug/dri/0/radeon_vram" to monitor it, but it never showed any change in the file's contents. Adding a screenshot of that directory and its contents.
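As a sketch of how such debugfs readings could be preserved across a hard freeze, the files could be appended to timestamped logs on disk instead of only being watched interactively (this assumes root access, a mounted debugfs, and the radeon file names listed in this report; the output directory is a placeholder):

    #!/bin/sh
    # Snapshot radeon debugfs state once per second so the last readings
    # before a lockup remain on disk after a forced reboot.
    DRI=/sys/kernel/debug/dri/0
    OUT=/var/tmp/dri-watch          # placeholder output directory
    mkdir -p "$OUT"
    while true; do
        for f in clients gem_names radeon_gtt_mm radeon_vram_mm ttm_dma_page_pool ttm_page_pool; do
            { date; cat "$DRI/$f"; echo; } >> "$OUT/$f.log"
        done
        sync                        # flush to disk in case the machine freezes
        sleep 1
    done

Run over SSH (or inside a tmux/screen session), the tail of each log then shows the state captured just before the freeze.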
I was able to use parallel SSH sessions to monitor changes in the files suggested by Max Staudt, using the command:

    watch -n 1 cat filename

The relevant files that existed and that I was able to watch are:

    /sys/kernel/debug/dri/0/clients
    /sys/kernel/debug/dri/0/gem_names
    /sys/kernel/debug/dri/0/radeon_gtt_mm
    /sys/kernel/debug/dri/0/radeon_vram_mm
    /sys/kernel/debug/dri/0/ttm_dma_page_pool
    /sys/kernel/debug/dri/0/ttm_page_pool

I will attach the captures of each output, each showing its file <= 1 second before the freeze. I understand these files retain information about VRAM, which should indicate whether this could be a progressive memory leak.

Very important note: it has taken me hours to obtain those outputs, and for a while I thought the freeze had been fixed by an update altogether. For over 2 hours I was able to run all the game engines that produced this crash without getting any freeze whatsoever, which has never happened before! However, the freeze returned after I restarted my machine, meaning it's still present. I have no idea whether there's a switch in my system that causes it to happen only sometimes, but hopefully those files will say something.

Created attachment 133056 [details]
watch-clients.txt
Created attachment 133057 [details]
watch-gem_names.txt
Created attachment 133058 [details]
watch-radeon_gtt_mm.txt
Created attachment 133059 [details]
watch-radeon_vram_mm.txt
Created attachment 133060 [details]
watch-ttm_dma_page_pool.txt
Created attachment 133061 [details]
watch-ttm_page_pool.txt
My understanding is limited when it comes to drivers and core OpenGL components, so what I say might be completely irrelevant. Looking through the logs, I'm noticing something suspicious in ttm_dma_page_pool:

    pool      refills   pages freed   inuse   available   name
    wc        5008      0             3833    16199       radeon 0000:03:00.0
    cached    22077     83375         4929    4           radeon 0000:03:00.0

The 'available' field of the 'cached' line says 4, which seems like a very small value to me. radeon_gtt_mm and radeon_vram_mm also look like they keep adding more memory / pages than they free, though this might just be an impression. The others don't raise any suspicion with me, but hopefully someone more experienced can translate this data.

As a reminder, my video card has 2 GB of memory while my system has 24 GB of RAM and 8 GB of swap. The card is a Gigabyte Radeon R7 370 (Pitcairn, GCN 1.0, RadeonSI). As the amdgpu driver does not yet support my card, I'm still running on the radeon driver (no fglrx).

I used 'dmesg -w' via SSH to monitor dmesg output as the system froze. I have not seen anything of interest, and no new messages were printed before the crash took place. The only arguably suspicious line was:

    [ 1286.800069] perf: interrupt took too long (2502 > 2500), lowering kernel.perf_event_max_sample_rate to 79750

Nevertheless, I have discovered another important factor during my tests: I decided to look through my BIOS settings again, as I remembered I had left enabled a memory overclock setting called Performance Enhance. In the past, with a different set of memory modules, this option caused the exact same freeze while I was watching YouTube (1080p @ 60 fps videos). Later on I got new modules, and because of how my clocks are synced I'm running them at an underclocked (and therefore more stable) frequency, so I figured I could leave the option enabled without problems. The highly erratic timing of the freezes also threw me off (once it's after 10 minutes, then it's after 2 hours), whereas a crash this obvious would have been all over the bug tracker by now if it were Mesa. After disabling it, I no longer seem to get any immediate system freezes. It will however require more testing to confirm it was that option, so please give me a few more weeks before we close this. If my theory is proven wrong, I'll immediately post a new comment and let everyone know.

The freeze still happens with the Performance Enhance BIOS setting turned off, so the crash is not caused by my overclocking settings. It took 2 hours of playing Minecraft in a row for it to occur again.

I noticed an important clue: in the case of Minecraft, the system only seems to crash after mobs have loaded into view. If I only explore a world where no entities spawn (even one full of voxel geometry), the freeze has never happened so far. This made me realize that all the engines I noticed the freeze with have one thing in common: a skeletal mesh is loaded into view. Could this be an issue related to animated models, by chance? Note that I don't suspect Vertex Buffer Objects to be a cause: I once turned off VBOs in Minecraft, restarted the game, and still got a system freeze.

For the time being, I've decided to test whether this also happens with the RadeonSI scheduler. To make sure I'm applying it to all games across the system, I've added the following line to ~/.profile and restarted:

    export R600_DEBUG=sisched

I managed to play Minecraft for over an hour several times, including in areas with many mobs and therefore skeletal models in view... so far no freeze. However, it will take much more testing to be sure this makes a difference; so far there is no real verdict. I'll also follow the advice of testing with SuperTuxKart, which should be an easier test case for other developers. If the SI scheduler does turn out to fix the problem, it would mean this is a bug specific to the old scheduler (still the default, hence why that environment variable is needed to switch). That would make sense, since IIRC the scheduler influences how drawable items are queued and rendered, which is a likely candidate for something causing an error that freezes the system.

Created attachment 133244 [details]
Memtest86 screenshot
To rule out the possibility of a hardware issue, I ran two Memtest86 5.01 sessions from a Clonezilla bootable CD: the first during the day for 5 hours, the second during the night for over 10 hours. The program only registered 3 passes in total, but it did not find any errors. I'll attach a picture just in case any useful information is shown there.
Wasn't sure whether to bump this same bug report, as the original issue has clearly been fixed during nearly a year of countless kernel + Mesa + driver updates. Unfortunately I now experience a new issue acting just like what I described here at the time: when certain 3D engines are running, there is a chance that after a few minutes the machine instantly freezes and becomes fully unusable until powered off and back on. I don't know when the new crash was introduced, since I haven't played a lot of 3D games recently, but I'd assume somewhere within the last few months. I now have kernel 4.15.3 and Mesa 18.0.0. Again, my video card is a Radeon R7 370 from Gigabyte (RadeonSI, GCN 1.0, Pitcairn). I'm running the openSUSE Tumbleweed x86-64 rolling release distribution.

Can someone please explain a way to debug these instant system freezes as they appear in the system components? I can't get an output at the time of the crash, as the entire machine stops working and stays unresponsive until restarted (likely including SSH), but maybe I can make it log info that I can retrieve after I reboot? Any useful info will help, just please nothing dangerous that might permanently break my OS.

(In reply to MirceaKitsune from comment #22)
> Wasn't sure whether to bump this same bug report, as the original issue has
> clearly been fixed during nearly a year of countless kernel + Mesa + driver
> updates.

In this case please open up a new bug report, because the old logs are most likely completely useless for the new bug.

> Unfortunately I now experience a new issue acting just like what I
> described here at the time: when certain 3D engines are running, there is a
> chance that after a few minutes the machine instantly freezes and becomes
> fully unusable until powered off and back on. I don't know when the new
> crash was introduced, since I haven't played a lot of 3D games recently, but
> I'd assume somewhere within the last few months.

Any chance to narrow that down further?

I can confirm that the exact same system freeze happens with both the radeon and amdgpu drivers: using either module makes absolutely no difference. This is the last piece of confirmation I needed to conclude that what's happening must be deliberate malware: there is simply no way a GPU crash bug could behave 100% the same way on two entirely different video drivers. Furthermore, this freeze is completely identical to the one I initially reported here... which was obviously fixed, since there have been so many updates to every system component that it would have been solved by sheer chance at this point! Behavior like this should only be seen if someone is actively re-implementing the problem on top of updated system components, with the active intent of keeping its effects identical each time. It's possible that my machine is being used to test malware for shutting down Linux systems, in which case I need to find out where it's hidden and how it's bricking computers before it spreads.

This attack must exploit vulnerabilities that keep coming and going in X11, Mesa, or some other system component... those are hopefully holes which can be discovered and plugged to render the attacks impractical altogether. Again, I only know that it's triggered while certain 3D engines are running (possibly aimed primarily at gamers?) and has a random chance of happening roughly once every 30 minutes (likely to make testing harder and better hide the exploit).
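On the question above about retrieving information after a reboot: one possible approach, assuming a systemd-based distribution such as the openSUSE Tumbleweed install mentioned here, is to enable persistent journald storage so kernel messages from the boot that froze can be read after restarting (a sketch, not something taken from this report):

    # Keep the systemd journal on disk instead of in RAM (run as root).
    mkdir -p /var/log/journal
    systemctl restart systemd-journald

    # After the next freeze and reboot, read the kernel messages of the previous boot:
    journalctl -k -b -1

Note that a hard GPU lockup can prevent the final messages from ever reaching the disk, so this may still miss the moment of the freeze; netconsole or a serial console are the usual fallbacks in that case.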
(In reply to MirceaKitsune from comment #24)
> This is the last piece of confirmation I needed to conclude that what's
> happening must be deliberate malware: there is simply no way a GPU crash
> bug could behave 100% the same way on two entirely different video drivers.

The two driver stacks are mostly the same, only the kernel module differs.

> Furthermore, this freeze is completely identical to the one I initially
> reported here... which was obviously fixed, since there have been so many
> updates to every system component that it would have been solved by sheer
> chance at this point! Behavior like this should only be seen if someone is
> actively re-implementing the problem on top of updated system components,
> with the active intent of keeping its effects identical each time. It's
> possible that my machine is being used to test malware for shutting down
> Linux systems, in which case I need to find out where it's hidden and how
> it's bricking computers before it spreads.

LOL, well you are funny. What you are hitting here is just a random problem, most likely triggered by a bug in the userspace part of the driver, which is identical for both radeon and amdgpu.

> This attack must exploit vulnerabilities that keep coming and going in X11,
> Mesa, or some other system component... those are hopefully holes which can
> be discovered and plugged to render the attacks impractical altogether.
> Again, I only know that it's triggered while certain 3D engines are running
> (possibly aimed primarily at gamers?) and has a random chance of happening
> roughly once every 30 minutes (likely to make testing harder and better hide
> the exploit).

There is no exploit involved here. If it wants to produce a lockup, an application can simply send a shader with an infinite loop to the hardware. There is no way to prevent that (see the Turing halting problem on Wikipedia). What we can do is try to reset the hardware after some timeout.

Anyway, the logs attached to this bug report are completely useless for your new issue, so closing this one.

Moved to a new bug report which addresses the new issue with fresh logs: https://bugs.freedesktop.org/show_bug.cgi?id=105425
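As background for the timeout-based reset mentioned in the reply above, the radeon and amdgpu kernel modules expose a lockup timeout as a module parameter; one way to inspect it is sketched below (parameter names assume the in-tree drivers, so verify them with modinfo on your own kernel):

    # Show the GPU lockup timeout (in milliseconds) after which the driver
    # attempts a hardware reset.
    modinfo radeon | grep -i lockup_timeout
    modinfo amdgpu | grep -i lockup_timeout

    # A different value can be set on the kernel command line at boot, e.g.:
    #   radeon.lockup_timeout=5000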