Bug 101672

Summary:	radeonsi: 3D engines causing frequent GPU lockups
Product:	Mesa	Reporter:	MirceaKitsune <sonichedgehog_hyperblast00>
Component:	Drivers/Gallium/radeonsi	Assignee:	Default DRI bug account <dri-devel>
Status:	RESOLVED FIXED	QA Contact:	Default DRI bug account <dri-devel>
Severity:	major
Priority:	high	CC:	pablodav, sonichedgehog_hyperblast00
Version:	git
Hardware:	x86-64 (AMD64)
OS:	Linux (All)
See Also:	https://bugzilla.opensuse.org/show_bug.cgi?id=1046962 https://bugs.freedesktop.org/show_bug.cgi?id=105425
Whiteboard:
Attachments:	Xorg.0.log.old Xorg.0.log xsession-errors-:0 dmesg lspci journalctl mesa_err.log mesa_err_2.log Screenshot of sys/kernel/debug/dri/0 watch-clients.txt watch-gem_names.txt watch-radeon_gtt_mm.txt watch-radeon_vram_mm.txt watch-ttm_dma_page_pool.txt watch-ttm_page_pool.txt Memtest86 screenshot

Description MirceaKitsune 2017-07-02 00:22:48 UTC

A GPU lockup has once again been introduced in Mesa and / or the RadeonSI driver. As is usual with this sort of thing, the image immediately freezes in place while audio stops and every form of input becomes unresponsive (including the NumLock / CapsLock keyboard LEDs); the only option is to power the machine off and back on. I started noticing this crash roughly a month ago after a distribution upgrade (openSUSE Tumbleweed).

The crash only appears to be caused by 3D rendering. It's probabilistic but very frequent. It is triggered by a variety of game engines, and I've noticed it with at least the following ones:

- Blender 3D: When opening certain scenes in Blender and going into Weight Paint mode, the system is bound to crash in at most 5 minutes of usage.

- Second Life: Linux native viewers for Second Life also trigger this, by my estimate somewhere between 5 and 30 minutes into a session.

- Xonotic (Darkplaces engine): Starting a game will freeze the machine anywhere between instantly (the moment a game starts) and 30 minutes at most.

- The Dark Mod (idTech 4 engine): The same freeze will occur when playing TheDarkMod, anywhere between instantly and roughly 10 minutes at most.

- Minecraft: The native version of Minecraft can also trigger the crash after at most 1 hour of play, especially on servers with a lot of geometry.

My OS is openSUSE Tumbleweed x64. My current Mesa version is 17.1.3; I can confirm first noticing this in 17.1.1, but I don't know if the issue was introduced in 17.1.0 or earlier. My video card is a Radeon R7 370 (Gigabyte), Pitcairn GPU (Southern Islands family), GCN 1.0, RadeonSI. Official product page: http://www.gigabyte.com/products/product-page.aspx?pid=5469

Please help to address this issue soon: I am unable to use several applications due to the risk they pose to my computer! Logs will be attached soon.

Comment 1 MirceaKitsune 2017-07-02 00:23:43 UTC

Created attachment 132388 [details]
Xorg.0.log.old

Comment 2 MirceaKitsune 2017-07-02 00:24:14 UTC

Created attachment 132389 [details]
Xorg.0.log

Comment 3 MirceaKitsune 2017-07-02 00:24:49 UTC

Created attachment 132390 [details]
xsession-errors-:0

Comment 4 MirceaKitsune 2017-07-02 00:25:33 UTC

Created attachment 132391 [details]
dmesg

Comment 5 MirceaKitsune 2017-07-02 00:25:59 UTC

Created attachment 132392 [details]
lspci

Comment 6 MirceaKitsune 2017-07-02 00:26:29 UTC

Created attachment 132393 [details]
journalctl

Comment 7 MirceaKitsune 2017-07-02 15:24:40 UTC

Created attachment 132397 [details]
mesa_err.log

I was able to find an important clue, while running Xonotic using the following environment variables:

MESA_DEBUG=1
MESA_LOG_FILE=/foo/bar/mesa_err.log

A log is generated and readable after restarting the machine. It only contains one line, but that line looks like it might point to the cause:

Mesa: User error: GL_INVALID_OPERATION in glGetQueryObjectiv(out of bounds)
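
The two variables above can also be set for a single program rather than for the whole session. A minimal Python sketch, assuming the game is started from a script: the helper names mesa_debug_env and launch are hypothetical, while MESA_DEBUG and MESA_LOG_FILE are the variables used in this comment.

```python
import os
import subprocess


def mesa_debug_env(log_path, base_env=None):
    """Return a copy of the environment with Mesa debug logging enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["MESA_DEBUG"] = "1"          # value used in this report
    env["MESA_LOG_FILE"] = log_path  # file the Mesa messages are written to
    return env


def launch(command, log_path):
    """Start command (a list of arguments) with the debug environment set."""
    return subprocess.Popen(command, env=mesa_debug_env(log_path))
```

For example, launch(["xonotic-sdl"], "/foo/bar/mesa_err.log") would start the game with logging enabled; the executable name here is only a placeholder and may differ per distribution.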

Comment 8 MirceaKitsune 2017-07-11 22:13:01 UTC

Created attachment 132621 [details]
mesa_err_2.log

The freeze persists in Mesa 17.1.4. This comes as a surprise since a GPU lockup was fixed in this version, which should have been the same problem I'm getting:

https://bugs.freedesktop.org/show_bug.cgi?id=101294

I'm attaching a new output generated by MESA_LOG_FILE; It appears to print different information after the update, and there are now a lot of messages about an out of memory error.

As Xonotic uses texture compression and I never had problems until last month, I don't suspect that actual assets are filling up the memory. Could we be talking about a VRAM leak?

Comment 9 MirceaKitsune 2017-07-24 11:59:00 UTC

Created attachment 132863 [details]
Screenshot of sys/kernel/debug/dri/0

The exact same story repeats with Mesa 17.1.5 (as with 17.1.4); The release notes claim that a core crash has been fixed, yet inexplicably this freeze persists after updating to the latest version.

Something very unusual happened however: I rebooted my machine and started testing under Xonotic. After about 10 minutes of playing I got my first freeze, however it did not block the machine; Only Xonotic itself crashed (image froze and sound died), so I was able to alt-tab switch to my desktop... the system detected that the process was unresponsive and killed it, after which I could notice that it did NOT eat up any CPU or memory while it was frozen. I tested again and after about 15 minutes I got another freeze... this time though it froze the entire computer as usual (including taking down SSH).

I performed the suggested test of monitoring the files in /sys/kernel/debug/dri/0 through my SSH connection, to check whether this might be caused by a VRAM leak. The most relevant file in there was radeon_vram, which seems to report 2.0 GB at all times (this makes sense, as that's the amount of VRAM on my video card). I used the command "watch -n 1 cat /sys/kernel/debug/dri/0/radeon_vram" to monitor it, but it did not show any changes in the file itself. Adding a screenshot of that directory and its contents.

Comment 10 MirceaKitsune 2017-07-26 22:10:57 UTC

I was able to use parallel SSH sessions to monitor changes in the files suggested by Max Staudt, which I did by using the command:

watch -n 1 cat filename

The relevant files that existed and I was able to watch are:

/sys/kernel/debug/dri/0/clients
/sys/kernel/debug/dri/0/gem_names
/sys/kernel/debug/dri/0/radeon_gtt_mm
/sys/kernel/debug/dri/0/radeon_vram_mm
/sys/kernel/debug/dri/0/ttm_dma_page_pool
/sys/kernel/debug/dri/0/ttm_page_pool

I will attach the captures of each output, each showing its file at most 1 second before the freeze. I understand those files should retain information about VRAM, which may indicate whether this is a progressive memory leak.
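
Since the SSH sessions die together with the machine, the same polling could instead be done locally, writing each snapshot to disk so the last one can be read after a reboot. The following is only a sketch under assumptions: the function names and the log directory layout are hypothetical, reading debugfs normally requires root, and fsync cannot guarantee the final write lands if the machine freezes mid-write.

```python
import os
import time


def append_snapshot(src_path, log_path):
    """Append the current contents of src_path to log_path and sync to disk."""
    with open(src_path, "r", errors="replace") as src:
        data = src.read()
    with open(log_path, "a") as log:
        log.write(f"=== {time.strftime('%Y-%m-%d %H:%M:%S')} {src_path}\n")
        log.write(data)
        if not data.endswith("\n"):
            log.write("\n")
        log.flush()
        os.fsync(log.fileno())  # ask the kernel to write the data out now
    return len(data)


def watch(paths, log_dir, interval=1.0, rounds=None):
    """Snapshot every path each interval seconds; rounds=None loops forever."""
    done = 0
    while rounds is None or done < rounds:
        for path in paths:
            name = os.path.basename(path) + ".log"
            append_snapshot(path, os.path.join(log_dir, name))
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
```

Calling watch with the six paths listed above and an existing log directory would mirror the "watch -n 1 cat filename" sessions, with one log file per watched file.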

Very important note: It has taken me hours to obtain those outputs, and for a while I thought the freeze was fixed by an update altogether. For over 2 hours I was able to run all game engines that produced this crash without getting any freeze whatsoever, which has never happened before! However the freeze returned after I restarted my machine, meaning it's still present. I have no idea whether there's a switch in my system that causes it to happen only sometimes, but hopefully those files will say something.

Comment 11 MirceaKitsune 2017-07-26 22:11:51 UTC

Created attachment 133056 [details]
watch-clients.txt

Comment 12 MirceaKitsune 2017-07-26 22:12:27 UTC

Created attachment 133057 [details]
watch-gem_names.txt

Comment 13 MirceaKitsune 2017-07-26 22:12:53 UTC

Created attachment 133058 [details]
watch-radeon_gtt_mm.txt

Comment 14 MirceaKitsune 2017-07-26 22:13:29 UTC

Created attachment 133059 [details]
watch-radeon_vram_mm.txt

Comment 15 MirceaKitsune 2017-07-26 22:13:53 UTC

Created attachment 133060 [details]
watch-ttm_dma_page_pool.txt

Comment 16 MirceaKitsune 2017-07-26 22:14:16 UTC

Created attachment 133061 [details]
watch-ttm_page_pool.txt

Comment 17 MirceaKitsune 2017-07-26 22:39:28 UTC

My understanding is limited when it comes to drivers and core OpenGL components, so what I say might be completely irrelevant. Looking through the logs, I'm noticing something suspicious in ttm_dma_page_pool:

         pool      refills   pages freed    inuse available     name
           wc         5008             0     3833    16199 radeon 0000:03:00.0
       cached        22077         83375     4929        4 radeon 0000:03:00.0

The 'available' field of the 'cached' line says 4, which to me seems like a very small value. radeon_gtt_mm and radeon_vram_mm also seem to keep adding more memory / pages / whatever than they are freeing, though this might just be an impression. The other files aren't raising any suspicion with me, but hopefully someone more experienced can interpret this data.
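
To compare these counters across many captures, the table could be parsed rather than read by eye. A small sketch, assuming the column layout shown above stays the same; the function name and the dictionary keys are my own choices, not anything defined by the kernel.

```python
def parse_pool_table(text):
    """Parse ttm_dma_page_pool-style output into a list of dicts."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Data rows: pool name, four integers, then the device name.
        if len(parts) < 6 or not parts[1].isdigit():
            continue  # skips blank lines and the header row
        rows.append({
            "pool": parts[0],
            "refills": int(parts[1]),
            "pages_freed": int(parts[2]),
            "inuse": int(parts[3]),
            "available": int(parts[4]),
            "name": " ".join(parts[5:]),
        })
    return rows
```

Feeding it successive captures would make it easy to see whether 'inuse' only ever grows or whether 'available' stays near zero over time.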

As a reminder, my video card has 2 GB of memory while my system has 24 GB of RAM and 8 GB of swap. Radeon R7 370 Gigabyte / Pitcairn (Southern Islands) / GCN 1.0 / RadeonSI. As the amdgpu driver is not yet supported for my card, I'm still running on the radeon driver (no fglrx).

Comment 18 MirceaKitsune 2017-07-27 14:52:29 UTC

I used 'dmesg -w' via SSH to monitor dmesg output as the system froze. I have not seen anything of interest, and no new messages were printed before the crash took place. The only arguably suspicious line was:

[ 1286.800069] perf: interrupt took too long (2502 > 2500), lowering kernel.perf_event_max_sample_rate to 79750
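
Because output viewed over SSH is lost when the machine locks up, one option is to run the follower on the affected machine itself and sync every line to disk. A hedged sketch: persist_lines and follow_dmesg are hypothetical helper names, "dmesg -w" is the same command used above, and a hard freeze may still occur before the kernel has printed anything useful.

```python
import os
import subprocess


def persist_lines(lines, log_path):
    """Append each line to log_path, syncing to disk after every line."""
    count = 0
    with open(log_path, "a") as log:
        for line in lines:
            log.write(line if line.endswith("\n") else line + "\n")
            log.flush()
            os.fsync(log.fileno())  # push this line to disk immediately
            count += 1
    return count


def follow_dmesg(log_path):
    """Run dmesg -w and persist its output until it exits or is interrupted."""
    proc = subprocess.Popen(["dmesg", "-w"], stdout=subprocess.PIPE, text=True)
    try:
        return persist_lines(proc.stdout, log_path)
    finally:
        proc.terminate()
```

After a reboot, the tail of the log file would show the last kernel messages that reached the disk before the freeze.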

Nevertheless, I discovered another important factor during my tests. I decided to look through my BIOS settings again, as I remembered I had left a memory overclock setting called Performance Enhance enabled. In the past, when I had a different set of memory modules, this option caused the exact same freeze while I was watching YouTube (1080p @ 60fps videos). Later on I got new modules, and because of how my clocks are synced I'm running them at an underclocked (and therefore more stable) frequency, so I figured I could leave this enabled without problems. The highly erratic timing of the freezes threw me off (once it's after 10 minutes, then it's after 2 hours), whereas a crash this obvious would have been all over the bug tracker by now if it were caused by Mesa.

After disabling it, I no longer seem to get any immediate system freezes. It will however require more testing to confirm it was that option, so please give me a few more weeks before we close this. If my theory is proven wrong, I'll immediately post a new comment and let everyone know.

Comment 19 MirceaKitsune 2017-07-29 13:31:21 UTC

The freeze still happens with the Performance Enhance BIOS setting turned off, so the crash is not caused by my overclocking settings. It took 2 hours of playing Minecraft in a row for it to occur again.

I noticed an important clue: In the case of Minecraft, the system only seems to crash after mobs have loaded into view. If I only explore a world where no entities spawn (be it full of voxel geometry), the freeze has never happened thus far. This made me realize that all engines I noticed the freeze with have one thing in common: A skeletal mesh is loaded into view. Could this be an issue related to animated models by chance?

Note that I don't suspect Vertex Buffer Objects to be a cause: I once turned off VBO in Minecraft, restarted the game, and still got a system freeze.

Comment 20 MirceaKitsune 2017-08-01 12:17:55 UTC

For the time being, I've decided to test whether this also happens with the RadeonSI scheduler. To make sure I'm applying it to all games across the system, I've added the following line to ~/.profile and restarted:

export R600_DEBUG=sisched

I managed to play Minecraft for over an hour several times, including in areas with many mobs and therefore skeletal models in view... so far no freeze. However, it will take much more testing to be sure this makes a difference, so there is no real verdict yet. I'll also follow the advice of testing with SuperTuxKart, which should be an easier test case for other developers.

If the SI scheduler does turn out to fix the problem, it would mean this is a bug specific to the old scheduler (still default, hence why that environment variable is needed to switch). That would make sense since IIRC the scheduler influences how drawable items are queued and rendered, which is a likely candidate for something causing an error that freezes the system.

Comment 21 MirceaKitsune 2017-08-04 12:33:48 UTC

Created attachment 133244 [details]
Memtest86 screenshot

To rule out the possibility of a hardware issue, I ran two Memtest86 5.01 sessions from a Clonezilla bootable CD. The first was in the day for 5 hours, the second was during the night for over 10 hours: The program only registered 3 passes in total, but it did not find any errors. I'll attach a picture just in case any useful information is printed there.

Comment 22 MirceaKitsune 2018-02-22 00:06:11 UTC

Wasn't sure whether to bump this same bug report, as the original issue has clearly been fixed during nearly a year of countless Kernel + Mesa + driver updates. Unfortunately I now experience a new issue acting just like what I described here at the time: When certain 3D engines are running, there is a chance that after a few minutes the machine instantly freezes and becomes fully unusable until powered off and back on. I don't know when the new crash was implemented since I haven't played a lot of 3D games recently, but I'd assume somewhere within the last few months.

I now have Kernel 4.15.3 and Mesa 18.0.0. Again my video card is a Radeon R7 370 from Gigabyte (RadeonSI, GCN 1.0, AMD Pitcairn Islands). I'm running the openSUSE Tumbleweed x64 rolling release distribution.

Can someone please explain a way to debug those instant system freezes as they're added to the system components? I can't get an output at the time of the crash as the entire machine stops working and becomes bricked until restarted (likely including SSH), but maybe I can make it log info that I can retrieve after I reboot? Any useful info will help, just please nothing dangerous that might permanently break my OS.

Comment 23 Christian König 2018-02-22 07:34:03 UTC

(In reply to MirceaKitsune from comment #22)
> Wasn't sure whether to bump this same bug report, as the original issue has
> clearly been fixed during nearly a year of countless Kernel + Mesa + driver
> updates.

In this case please open a new bug report, because the old logs are most likely completely useless for the new bug.

> Unfortunately I now experience a new issue acting just like what I
> described here at the time: When certain 3D engines are running, there is a
> chance that after a few minutes the machine instantly freezes and becomes
> fully unusable until powered off and back on. I don't know when the new
> crash was implemented since I haven't played a lot of 3D games recently, but
> I'd assume somewhere within the last few months.

Any chance to narrow that down further?

Comment 24 MirceaKitsune 2018-02-26 21:28:00 UTC

I can confirm that the exact same system freeze happens with both the radeon and amdgpu driver: Using either module makes absolutely no difference.

This is the last piece of confirmation I needed to conclude that what's happening must be deliberate malware: There is simply no way a GPU crash bug could behave 100% the same way on two entirely different video drivers. Furthermore, this freeze is completely identical to the one I initially reported here... which was obviously fixed since there's been so many updates to every system component it would have been solved by sheer chance at this point! Functionality like this should only be seen if someone is actively re-implementing the problem on top of updated system components, with the active intent of keeping its effects identical each time. It's possible that my machine may be used to test malware usable to shut down Linux systems, in which case I need to find out where it's hidden and how it's bricking computers before it spreads.

This attack must exploit vulnerabilities that keep coming and going in X11, Mesa, or some other system component... those are hopefully holes which can be discovered and plugged to render the attacks impractical altogether. Again I only know that it's triggered while certain 3D engines are running (possibly aimed primarily at gamers?) and has a random chance of happening roughly once every 30 minutes (likely to make testing harder and better hide the exploit).

Comment 25 Christian König 2018-02-27 12:02:29 UTC

(In reply to MirceaKitsune from comment #24)
> This is the last piece of confirmation I needed to conclude that what's
> happening must be deliberate malware: There is simply no way a GPU crash
> bug could behave 100% the same way on two entirely different video drivers.

The two driver stacks are mostly the same, only the kernel module differs.

> Furthermore, this freeze is completely identical to the one I initially
> reported here... which was obviously fixed since there's been so many
> updates to every system component it would have been solved by sheer chance
> at this point! Functionality like this should only be seen if someone is
> actively re-implementing the problem on top of updated system components,
> with the active intent of keeping its effects identical each time. It's
> possible that my machine may be used to test malware usable to shut down
> Linux systems, in which case I need to find out where it's hidden and how
> it's bricking computers before it spreads.

LOL, well you are funny. What you are hitting here is just a random problem triggered most likely by a bug in the userspace part of the driver which is identical for both radeon and amdgpu.

> This attack must exploit vulnerabilities that keep coming and going in X11,
> Mesa, or some other system component... those are hopefully holes which can
> be discovered and plugged to render the attacks impractical altogether.
> Again I only know that it's triggered while certain 3D engines are running
> (possibly aimed primarily at gamers?) and has a random chance of happening
> roughly once every 30 minutes (likely to make testing harder and better hide
> the exploit).

Well there is no such thing as an exploit involved here.

If it wants to produce a lockup, an application can just send a shader with an infinite loop to the hardware. There is no way to prevent that (see Turing's halting problem on Wikipedia).

What we can do is try to reset the hardware after some timeout.
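
The idea of a timeout-based reset can be illustrated with a userspace analogy. This is not how the kernel driver recovers a hung GPU, only a sketch of the principle that one cannot know in advance whether a job will finish, so it is given a deadline and killed when the deadline passes. The function name run_with_watchdog is hypothetical.

```python
import subprocess
import sys


def run_with_watchdog(command, timeout_s):
    """Run command; return True if it finished, False if killed on timeout."""
    try:
        subprocess.run(command, timeout=timeout_s, check=False)
        return True
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child at this point.
        return False
```

For example, run_with_watchdog([sys.executable, "-c", "while True: pass"], 1.0) stands in for the infinite-loop shader: the child never returns on its own, so the watchdog ends it after one second.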

Anyway, the logs attached to this bug report are completely useless for your new issue, so I'm closing this one.

Comment 26 MirceaKitsune 2018-03-09 21:45:23 UTC

Moved to a new bug report which addresses the new issue with fresh logs:

https://bugs.freedesktop.org/show_bug.cgi?id=105425
