Bug 97856

Summary: Computer restart playing 3D games (possibly overheating)
Product: DRI
Reporter: Alex Henry <tukkek>
Component: DRM/Radeon
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED MOVED
QA Contact:
Severity: normal    
Priority: medium    
Version: XOrg git   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description: Zipped Xorg log file from last session (over 8MB uncompressed); Flags: none
Description: Xorg log and dmesg from a radeon.dpm=1 boot; Flags: none

Description Alex Henry 2016-09-18 23:50:56 UTC
Hello, sorry if I'm reporting this to the wrong product, but the bug reporting procedures on the wiki are pretty hard to understand: https://www.x.org/wiki/RadeonFeature/#index11h2

I have an onboard Radeon HD 3000 (ATI RS780L). I am primarily using Debian testing ("stretch"). I believe the open source drivers for it are provided by the following packages:

linux-image-4.6.0-1-amd64 (4.6.4-1, radeon.ko)
xserver-xorg-video-ati (1:7.7.0-1)
xserver-xorg-video-radeon (1:7.7.0-1, radeon_drv.so, due to update in a few days)

The problem is that when I play 3D games the computer randomly crashes and reboots, anywhere from 10 minutes to over an hour into the game (mostly after 10-30 minutes). It seems to be heat-related: crashes seem less likely when the weather is cooler, and when I start playing again right after a crash and reboot, the next crash seems to come sooner, maybe after 5 minutes or so of playing.

The crash seems to be some sort of hardware failure because there is no trace of it in the journald persistent logs that I can find. Because of this, I also can't be sure what is happening. Let me know if there is something I can do to help debug this.
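
For reference, one way to check a persistent journal for traces of a crash is to query the kernel messages of the boot that ended in the crash. The following is a minimal sketch, assuming systemd's `journalctl` is installed and persistent storage is enabled; the helper names are hypothetical and only long-form `journalctl` options are used.

```python
import subprocess


def previous_boot_kernel_log_cmd(boot_offset=-1, priority="warning"):
    """Build a journalctl command for kernel messages of an earlier boot."""
    return [
        "journalctl",
        "--no-pager",
        "--dmesg",                    # kernel messages only
        f"--boot={boot_offset}",      # -1 = the boot before the current one
        f"--priority={priority}",     # this priority and anything more severe
    ]


def tail_previous_boot(lines=50):
    """Return the last `lines` matching kernel messages of the previous boot."""
    cmd = previous_boot_kernel_log_cmd() + [f"--lines={lines}"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout
```

If the machine resets before the journal is flushed, the output simply ends at the last message that reached the disk, which is consistent with finding no trace of the crash itself.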

To verify whether the error was the driver's fault I installed a new Debian system (oldstable, Debian 7 "wheezy"), which allowed me to install the fglrx legacy driver with Radeon HD 3000 support. In this new system I haven't experienced a single reboot so far, which indicates that the cause isn't hardware alone but very likely a driver issue. Being an entirely new system means it could be something else too, but I have very frequent crashes/reboots on my primary system and none so far on the alternate system, even while frequently running games for an hour or so on a hot day. That would indicate that the fault is indeed coming from the Radeon open-source driver.

I hadn't done any 3D gaming on this computer until a couple of weeks ago, so I can't say whether this bug only happens on recent driver versions. Watching videos in a browser or in a video application (such as VLC), playing 2D games like http://littlewargame.com, or rendering videos (via kdenlive or similar) does not cause reboots, even though these can be relatively heavy on the GPU. I also haven't had any random crashes in a very long while except when doing 3D gaming. I've installed a few games to test this, and whenever I play in 3D the crashes happen frequently. Some of the games I've used to test this are Heroes of Newerth and Runescape (both free to download and play), which are very lightweight on low settings, so it shouldn't be a question of me stressing the card too much either (HoN, for example, works fine with the fglrx legacy driver on my alternate system).

I have run memtest86 and CPU and memory stress tests (stress-ng 0.06.15-1) in the hope of catching a temperature spike as the culprit for these random crashes, but I've found the temperature to be stable and low even under heavy load for a long time. I've also run the Geeks3D GpuTest, which puts a heavier load on the graphics card than these games do, but I haven't been able to cause a crash with it. In this case my tests haven't been extensive, though I can run them for longer if it would help debug the issue.

I understand that the latest Linux kernel release includes updates to the graphics drivers. I will try the new drivers as they come out, since it's somewhat of a bother having to reboot the computer (and maintain a legacy system) whenever I want to play 3D games, and if the problem is solved I'll report back here. If I don't comment on this issue in the near future, it's because the problem persists even with the new drivers.

My guess about what is happening: since the problem seems to be heat-related, maybe there is some sort of temperature sensor on my card that the open source driver isn't able to read (I was expecting to be able to see it using lmsensors, version 1:3.4.0-3), while fglrx may be able to read it and handle heat properly.
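
To see which temperature sensors the kernel exposes at all, the hwmon sysfs tree can be read directly. The following is a minimal sketch, assuming the standard layout of `/sys/class/hwmon` (one `hwmonN` directory per chip, with `tempN_input` files in millidegrees Celsius). The function name is hypothetical, and whether any sensor is exposed for the RS780L is not established in this report.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {"<chip>/<sensor file>": degrees Celsius} for every readable sensor."""
    temps = {}
    for chip in sorted(Path(root).glob("hwmon*")):
        name_file = chip / "name"
        # Fall back to the directory name if the chip does not report a name.
        chip_name = name_file.read_text().strip() if name_file.is_file() else chip.name
        for sensor in sorted(chip.glob("temp*_input")):
            try:
                millidegrees = int(sensor.read_text().strip())
            except (OSError, ValueError):
                continue  # sensor file present but unreadable; skip it
            temps[f"{chip_name}/{sensor.name}"] = millidegrees / 1000.0
    return temps
```

If no entry for the GPU shows up in the result, that would be consistent with the guess above, but it would not prove it.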

I don't usually report bugs to trackers that already have many reports open, but since I've spent a long time tracking down this issue and was able to work around it, I thought I should share all the information I've gathered in the hope that it's useful. It's an older onboard graphics model, probably nearing the end of its lifespan, but I hope this report is valuable in some way anyway. Thank you for the good work on these open drivers. If I were able to use them on my primary system I'd certainly do it, even if the fglrx legacy system is a little bit smoother, since it would be a lot more convenient than maintaining a separate gaming system on my machine. Thanks again for the contributions to the FOSS community and for the time spent reading this report!
Comment 1 Alex Henry 2016-09-18 23:58:07 UTC
Forgot to mention, if relevant: the fglrx driver I'm using on my alternate system is from the AMD website, not from the Debian repositories. The installer file is named amd-driver-installer-catalyst-13.1-legacy-linux-x86.x86_64.run .
Comment 2 Alex Deucher 2016-09-19 15:21:32 UTC
Please attach your xorg log and dmesg output.  Does setting radeon.dpm=1 on the kernel command line in grub help?
Comment 3 Alex Henry 2016-09-19 16:45:07 UTC
Hi Alex! As I've said there is no error log that I can find. I assume the hardware fails and reboots before it has any chance to write anything to disk. Or are you looking for some other type of information from the logs?
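
When a machine resets before buffered data reaches the disk, a logger that forces each line out with `fsync` gives the last reading a better chance of surviving. The following is a minimal sketch of that idea; the function names and the `sample` callable are hypothetical, and `sample` could be anything that returns a value worth recording, such as a temperature reading.

```python
import os
import time


def append_durably(path, line):
    """Append one line and force it to disk before returning."""
    with open(path, "a", encoding="utf-8") as log:
        log.write(line.rstrip("\n") + "\n")
        log.flush()             # push Python's buffer to the OS
        os.fsync(log.fileno())  # ask the OS to write it to the device


def log_samples(path, sample, count, interval=1.0):
    """Call sample() `count` times, appending a timestamped reading each time."""
    for _ in range(count):
        append_durably(path, f"{time.strftime('%Y-%m-%d %H:%M:%S')} {sample()}")
        time.sleep(interval)
```

Left running during a game, such a log would show the last value recorded before a reboot, at the cost of one synchronous write per sample.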

I did try manually adding radeon.dpm=1 as a kernel parameter in GRUB. I also tried setting the driver power profile to low to see if it would use less energy, enabling dynpm to try to produce less heat, and setting my BIOS fan speed control to turbo. None of that helped. On my alternate system fglrx works fine with my BIOS and the Catalyst control panel at their default options.
Comment 4 Alex Henry 2016-09-19 17:18:29 UTC
I've reproduced the bug again (this time it took around 5 minutes to crash my system after opening up a game) to make sure the xorg log had nothing of value, since I was not sure I had checked this file specifically. I'll attach my log file next. Of particular note are almost 80000 lines of output similar to this:

[  6115.586] (II) RADEON(0): EDID vendor "AOC", prod id 6512
[  6115.587] (II) RADEON(0): Printing DDC gathered Modelines:
[  6115.587] (II) RADEON(0): Modeline "1366x768"x0.0   85.50  1366 1436 1579 1792  768 771 774 798 +hsync +vsync (47.7 kHz eP)
[  6115.587] (II) RADEON(0): Modeline "1360x768"x0.0   85.50  1360 1424 1536 1792  768 771 777 795 +hsync +vsync (47.7 kHz e)
...
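
A log dominated by repeated blocks like the one above is easier to inspect once identical messages are counted. The following is a minimal sketch, assuming every line starts with a bracketed timestamp in the format shown in the excerpt; the function name is hypothetical.

```python
import re
from collections import Counter

# Xorg log lines start with a timestamp such as "[  6115.586] ".
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def count_repeated_messages(lines):
    """Count identical messages after removing the leading timestamp."""
    counts = Counter()
    for line in lines:
        message = TIMESTAMP.sub("", line.rstrip("\n"))
        if message:
            counts[message] += 1
    return counts
```

Passing an open log file to `count_repeated_messages` and calling `most_common(10)` on the result lists the ten most frequent messages with their counts.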

Also, I forgot to mention that there are two types of crashes: one in which the computer instantly reboots, and another in which the on-screen image freezes on the last drawn frame, the sound enters a short loop around half a second long (probably the sound buffer repeating its last contents), and the entire system becomes unresponsive, which forces a manual reboot.
Comment 5 Alex Henry 2016-09-19 17:22:04 UTC
Created attachment 126629
Zipped Xorg log file from last session (over 8MB uncompressed)

Obviously this is a text file. It can be opened with a text editor even if your system doesn't automatically recognize it as a text file due to the .old extension.
Comment 6 Alex Deucher 2016-09-19 17:30:57 UTC
Please attach your xorg log and dmesg from a regular boot and from booting with radeon.dpm=1.  It doesn't have to be after it crashes.
Comment 7 Alex Henry 2016-09-19 18:43:01 UTC
Created attachment 126631
Xorg log and dmesg from a radeon.dpm=1 boot

Rebooted my system as you've asked. /sys/class/drm/card0/device/power_method does indeed show "dpm" now (was "profile" last boot).
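
The same check can be scripted so that both the kernel parameter and the method reported by the driver are captured together. The following is a minimal sketch, assuming the `power_method` path quoted above and the standard `/proc/cmdline` file. The function names are hypothetical, and the last occurrence of a parameter is returned on the assumption that later values override earlier ones.

```python
from pathlib import Path


def kernel_param(cmdline, name):
    """Return the value of `name=value` in a kernel command line, or None."""
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == name and sep:
            value = val  # keep the last occurrence
    return value


def radeon_power_state(card="/sys/class/drm/card0/device", cmdline_path="/proc/cmdline"):
    """Collect the radeon.dpm parameter and the power_method the driver reports."""
    def read(path):
        try:
            return Path(path).read_text().strip()
        except OSError:
            return None  # file missing or unreadable

    cmdline = read(cmdline_path) or ""
    return {
        "radeon.dpm": kernel_param(cmdline, "radeon.dpm"),
        "power_method": read(Path(card) / "power_method"),
    }
```

On the boot described here, the expected result would be `radeon.dpm` equal to `"1"` and `power_method` equal to `"dpm"`.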

In my research so far, the only thing I've seen in the logs that could be related to this issue is the IOMMU warning in dmesg, but as far as I can tell this isn't supported by my motherboard (asus-m5a78l-m-lx), since I can't find anything related to it in the BIOS setup utility.
Comment 8 Martin Peres 2019-11-19 09:18:40 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/742.
