My Linux hangs randomly after I replaced my 2x Nvidia cards (ZaphodHeads, Xinerama) with Radeon 7870 (Gallium, Glamor).
The desktop hangs: I can move the pointer, but clicking has no effect. After ~3 seconds the screen goes blank. Sometimes (10%?) it recovers after ~15 seconds. In most cases it just hangs. Linux is totally dead: it doesn't respond to ping, SSH, or even Alt+PrtSc shortcuts.
If it recovers, I see this in the dmesg and this in the Xorg.0.log.
[ 389.821754] radeon 0000:01:00.0: GPU lockup CP stall for more than 10000msec
(EE) [mi] EQ overflowing. Additional events will be discarded until existing events are processed.
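A minimal sketch of how the dmesg message above could be picked out of saved log text automatically. The function name `find_gpu_lockups` and the regular expression are illustrative assumptions modelled only on the single line quoted in this report; other kernel versions may word the message differently.

```python
import re

# Pattern modelled on the one dmesg line quoted in this report.
# Assumption: other kernels may format the message differently.
LOCKUP_RE = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+radeon\s+(?P<pci>[0-9a-f:.]+):\s+"
    r"GPU lockup CP stall for more than (?P<msec>\d+)msec"
)


def find_gpu_lockups(dmesg_text):
    """Return (timestamp, pci_address, stall_msec) for each lockup line found."""
    hits = []
    for line in dmesg_text.splitlines():
        m = LOCKUP_RE.search(line)
        if m:
            hits.append((float(m.group("time")), m.group("pci"), int(m.group("msec"))))
    return hits


sample = "[ 389.821754] radeon 0000:01:00.0: GPU lockup CP stall for more than 10000msec"
print(find_gpu_lockups(sample))
```

Feeding it the output of `dmesg` saved to a file after a recovered hang would give the time and device of each reported stall.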
I am able to reproduce the bug using Linphone (it doesn't mean it hangs just in Linphone). I call my mobile. Screen goes blank right after connection is established (connection = when you hear beeping and you wait for the guy to answer the phone). Therefore, I can test if the changes you provide resolve the issue.
Using x86_64 Arch Linux 3.9.6-1-ARCH, radeon and glamor from git, KDE.
lspci: 01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Pitcairn XT [Radeon HD 7870 GHz Edition]
glxinfo: OpenGL renderer string: Gallium 0.4 on AMD PITCAIRN
When Linux recovers:
When Linux hangs (so the log is cut off):
http://upload.nowaker.net/nwkr/1371723428_Xorg.0.log.old - when Linux hangs
Tell me if you need more logs.
I will be happy to test any provided fixes.
Is this a regression? What version of mesa and llvm are you using? Also, in the future please attach your logs directly to the bug.
Thanks for taking a look at it. Package versions are as follows. I'm pasting the PKGBUILDs so you can see the compilation flags.
Let me know if you want me to install these from git as well.
My 7870 burned out a month ago. Now I have a new one so I can continue investigating.
In the meantime there was a kernel update - I now have 3.10.5-1-ARCH instead of 3.9.6-1-ARCH. There were some updates along the way as well, marked as [updated]. I currently have:
linux 3.10.5-1 [updated]
mesa 9.1.6-1 [updated]
libdrm 2.4.46-2 [updated]
ati-dri 9.1.6-1 [updated]
llvm-amdgpu-lib-snapshot 20130403-3 [the same]
xf86-video-ati-glamor-git 1:20120730-1 [the same]
> I am able to reproduce the bug using Linphone (it doesn't mean it hangs just in
> Linphone). I call my mobile. Screen goes blank right after connection is
> established (connection = when you hear beeping and you wait for the guy to
> answer the phone). Therefore, I can test if the changes you provide resolve the
> issue.
Fortunately, it works now. No hangs in Linphone after connection is established.
This has recently returned.
At random points my system just hangs. Everything dies, so even no SSH connection is possible to the machine. Today however display hung, CTRL+ALT+F1 was not usable as usual, but SSH worked. I was able to inspect dmesg (totally nothing related to the crash) and Xorg.0.log (error message and stacktrace!).
I am sure this is a regression. I never encountered this problem on mesa 10.1.3 and llvm 3.4. I kept these versions on my system for quite a long time and refrained from updating to the latest versions because of a problem with Steam. (FYI https://bbs.archlinux.org/viewtopic.php?pid=1432315#p1432315)
Right after I updated to mesa 10.2.2 and llvm 3.4.2 three days ago my system started hanging randomly from 2 to 5 times a day. I am 100% sure it's either mesa 10.1.3 -> 10.2.2 bump or llvm 3.4 -> 3.4.2.
In the meantime I will try installing various combinations of mesa (10.1.3, 10.1.4, 10.2.0rc1-5, 10.2.1, 10.2.2) and llvm (3.4, 3.4.1, 3.4.2) to find which package exactly triggers the problem.
Using Arch Linux 3.15.3.
(EE) [mi] EQ overflow continuing. 1000 events have been dropped.
(EE) [mi] No further overflow reports will be reported until the clog is cleared.
(EE) 0: /usr/bin/X (xorg_backtrace+0x56) [0x58f186]
(EE) 1: /usr/bin/X (QueuePointerEvents+0x52) [0x44e602]
(EE) 2: /usr/lib/xorg/modules/input/evdev_drv.so (0x7f27799c9000+0x60ba) [0x7f27799cf0ba]
(EE) 3: /usr/lib/xorg/modules/input/evdev_drv.so (0x7f27799c9000+0x657d) [0x7f27799cf57d]
(EE) 4: /usr/bin/X (0x400000+0x74d18) [0x474d18]
(EE) 5: /usr/bin/X (0x400000+0x9e5b9) [0x49e5b9]
(EE) 6: /usr/lib/libpthread.so.0 (0x7f2780855000+0xf4b0) [0x7f27808644b0]
(EE) 7: /usr/lib/libc.so.6 (ioctl+0x7) [0x7f277f583e47]
(EE) 8: /usr/lib/libdrm.so.2 (drmIoctl+0x28) [0x7f278064c9b8]
(EE) 9: /usr/lib/libdrm.so.2 (drmCommandWrite+0x1b) [0x7f278064ec2b]
(EE) 10: /usr/lib/xorg/modules/dri/radeonsi_dri.so (0x7f2778d4a000+0x69014) [0x7f2778db3014]
(EE) 11: /usr/lib/xorg/modules/dri/radeonsi_dri.so (0x7f2778d4a000+0x6a3c2) [0x7f2778db43c2]
(EE) 12: /usr/lib/xorg/modules/dri/radeonsi_dri.so (0x7f2778d4a000+0x5d431) [0x7f2778da7431]
(EE) 13: /usr/lib/xorg/modules/dri/radeonsi_dri.so (0x7f2778d4a000+0x19f0f9) [0x7f2778ee90f9]
(EE) 14: /usr/lib/xorg/modules/dri/radeonsi_dri.so (0x7f2778d4a000+0x11e560) [0x7f2778e68560]
(EE) 15: /usr/lib/xorg/modules/dri/radeonsi_dri.so (0x7f2778d4a000+0x1a14a5) [0x7f2778eeb4a5]
(EE) 16: /usr/lib/xorg/modules/dri/radeonsi_dri.so (0x7f2778d4a000+0x11f674) [0x7f2778e69674]
(EE) 17: /usr/lib/xorg/modules/dri/radeonsi_dri.so (0x7f2778d4a000+0x11f8d2) [0x7f2778e698d2]
(EE) 18: /usr/lib/libglamor.so.0 (0x7f277d11d000+0x2071b) [0x7f277d13d71b]
(EE) 19: /usr/lib/libglamor.so.0 (0x7f277d11d000+0x20ac8) [0x7f277d13dac8]
(EE) 20: /usr/lib/libglamor.so.0 (0x7f277d11d000+0x1ccab) [0x7f277d139cab]
(EE) 21: /usr/bin/X (0x400000+0x17fad8) [0x57fad8]
(EE) 22: /usr/bin/X (0x400000+0xc40da) [0x4c40da]
(EE) 23: /usr/bin/X (0x400000+0x33cdb) [0x433cdb]
(EE) 24: /usr/bin/X (0x400000+0x36b2f) [0x436b2f]
(EE) 25: /usr/bin/X (0x400000+0x3ad16) [0x43ad16]
(EE) 26: /usr/lib/libc.so.6 (__libc_start_main+0xf0) [0x7f277f4c2000]
(EE) 27: /usr/bin/X (0x400000+0x250fe) [0x4250fe]
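To make a trace like the one above easier to read, the frames can be collapsed into the sequence of binaries they belong to. This is only a sketch: the regular expression is fitted to the frame format shown in this log, and `module_chain` is a hypothetical helper name.

```python
import re
from itertools import groupby

# Matches frames such as "(EE) 8: /usr/lib/libdrm.so.2 (drmIoctl+0x28) [0x...]".
FRAME_RE = re.compile(r"\(EE\)\s+(\d+):\s+(\S+)\s+\(")


def module_chain(backtrace_text):
    """Collapse consecutive frames from the same binary, innermost frame first."""
    modules = []
    for line in backtrace_text.splitlines():
        m = FRAME_RE.search(line)
        if m:
            # Keep only the file name, e.g. "libdrm.so.2".
            modules.append(m.group(2).rsplit("/", 1)[-1])
    return [name for name, _ in groupby(modules)]


# Excerpt of frames 7-10 and 18 from the trace above.
excerpt = """\
(EE) 7: /usr/lib/libc.so.6 (ioctl+0x7) [0x7f277f583e47]
(EE) 8: /usr/lib/libdrm.so.2 (drmIoctl+0x28) [0x7f278064c9b8]
(EE) 9: /usr/lib/libdrm.so.2 (drmCommandWrite+0x1b) [0x7f278064ec2b]
(EE) 10: /usr/lib/xorg/modules/dri/radeonsi_dri.so (0x7f2778d4a000+0x69014) [0x7f2778db3014]
(EE) 18: /usr/lib/libglamor.so.0 (0x7f277d11d000+0x2071b) [0x7f277d13d71b]
"""
print(module_chain(excerpt))
```

Read this way, the excerpt suggests the X server was waiting in a DRM ioctl issued by radeonsi on behalf of glamor when the input event queue overflowed.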
Created attachment 102301 [details]
Not an LLVM issue, just hung with a downgraded LLVM 3.4. So the problem has to be in mesa. I will now try to find the latest working version of mesa. Starting from 10.1.4 for now.
Regarding the two recent hangs: one was total, and I had to hard reset. The other, which happened 10 minutes after the reboot, wasn't that bad. After `pkill -9 kdm` from SSH I was able to use CTRL+ALT+F1. The error was "EQ overflow" again, of course, but with a slightly different stack trace.
However, X wasn't able to start again. I had to reboot.
[ 480.971] (EE) RADEON(0): [drm] Failed to open DRM device for pci:0000:01:00.0: No such file or directory
Created attachment 102315 [details]
Xorg.0.log after the second crash
Created attachment 102316 [details]
Xorg.0.log after the second crash - trying to start X again
(In reply to comment #6)
> Not an LLVM issue, just hung with a downgraded LLVM 3.4.
Note that it's not really supported for the radeonsi driver to run against a version of LLVM older than the one it was built against. It might be good to confirm that old Mesa with LLVM 3.4.2 isn't affected by the problem.
> So the problem has to be in mesa. I will now try to find the latest working
> version of mesa. Starting from 10.1.4 for now.
Thanks, but FWIW, the most helpful thing you could do would be to bisect the problem from upstream Mesa Git.
I will bisect once I have the earliest version of mesa that causes the problem.
By the way, 10.1.4 is proven to work, and I'll be trying out some 10.2.x versions in the coming days.
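For readers unfamiliar with bisecting: it is a binary search between a known-good and a known-bad point in the history. The sketch below shows the idea only. The commit labels `c0` to `c15` and the "first bad" commit are made up for illustration and do not correspond to real Mesa commits; in practice each `is_bad` call means building that revision and using the machine until it hangs or clearly does not.

```python
def first_bad(versions, is_bad):
    """Binary search for the first bad entry.

    Assumes versions[0] is good, versions[-1] is bad, and that once the
    problem appears it stays present in every later entry.
    """
    lo, hi = 0, len(versions) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid
        else:
            lo = mid
    return versions[hi]


# Hypothetical history: c0 stands in for a known-good build, c15 for a
# known-bad one. The predicate is a stand-in for a manual test run.
commits = [f"c{i}" for i in range(16)]
tested = []


def is_bad(commit):
    tested.append(commit)
    return int(commit[1:]) >= 11  # made-up culprit, for illustration only


print(first_bad(commits, is_bad), "found after testing", tested)
```

The point of the approach is that the number of test runs grows with the logarithm of the number of candidates, which matters when each run can take hours.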
It started happening no more than two weeks ago (doing Mesa updates through Oibaf PPA almost daily). Happens with Kernel 3.15.6 and 3.16-RCx (up to latest RC7) on Ubuntu 14.04 with HD7770.
@Maciej, Please analyze dpkg logs and tell what Mesa version started to behave incorrectly for you.
@Michel, I haven't been able to try 10.2.* since I have been very busy recently and needed a non-hanging machine, hence I used 10.1.4 for the time being. When I'm less busy I will go back to the case and try out the next versions.
Before I saw your answer I did a full, fresh system reinstallation (because hangs started happening after a few minutes). Ubuntu was running fine for a few hours, so I added the Oibaf PPA and kernel 3.16-RC7. So far no hang. I'll report if it happens again and add some logs.
(In reply to comment #4)
> This has recently returned.
> Using Arch Linux 3.15.3.
Note that there are known stability issues in 3.15.y kernels which are fixed in 3.16-rc7 or later.
Thanks Michel. I'll keep that in mind. Once I have some time, I will just upgrade mesa to the latest version and downgrade the kernel and see if it helps.
Ok, it was fine during around 7 or 8 hours of usage (mostly browsing in Chrome, some gaming, a tv show with smplayer and vdpau). However today it started happening again.
To clarify, my system is (fresh installation made yesterday):
Ubuntu 14.04 64bit
Kernel 3.16-RC7 from mainline repo
Mesa git from Oibaf PPA
Switching to kernel 3.15.7 doesn't help.
This is very odd... I tried using Firefox instead of Chrome for a day and there was no hang, then I switched to Chrome and five minutes later I had to hard reboot. Shit happens with stable, beta and unstable channels (no custom tweaks to Chrome, all default).
@Damian, I had the same problems. My configuration is similar to yours: Radeon 7870, Linux 3.16.1, xorg 1.16, MESA 10.3 RC1 (fedora f21 branch).
After consulting https://wiki.archlinux.org/index.php/ATI I tried adding radeon.dpm=0 to my kernel command line as a workaround. It solved the problem. No more lockups. The system is now stable with the above configuration. Clearly this has something to do with the power management changes in MESA and the kernel (there were no lockups with earlier versions of MESA).
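When comparing reports it helps to confirm that the workaround is actually active on the running system. A small sketch, assuming the parameter is passed on the kernel command line as described above; the helper name `radeon_dpm_setting` and the sample command line are hypothetical.

```python
def radeon_dpm_setting(cmdline):
    """Return the value given for radeon.dpm on a kernel command line, or None."""
    value = None
    for token in cmdline.split():
        if token.startswith("radeon.dpm="):
            # Assumption: if the option is repeated, the last one is kept.
            value = token.split("=", 1)[1]
    return value


# Hypothetical command line; on a running Linux system the real one can be
# read with open("/proc/cmdline").read().
example = "BOOT_IMAGE=/vmlinuz-linux root=/dev/sda1 rw radeon.dpm=0"
print(radeon_dpm_setting(example))
```

A result of `None` means the option was not given at boot, so the driver default applies.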
(In reply to comment #19)
> @Damian, I had the same problems. My configuration is similar to yours:
> Radeon 7870, Linux 3.16.1, xorg 1.16, MESA 10.3 RC1 (fedora f21 branch).
> After consulting https://wiki.archlinux.org/index.php/ATI it tried adding
> radeon.dpm=0 to my kernel command line as a workaround. It solved the
> problem. No more lockups. The system is now stable with the above
> configuration. Clearly this has something to do with the powermanagement
> changes in MESA and the kernel (there were no lockups with earlier versions
> of MESA)
There is no power management code in mesa. Power management is completely self-contained in the kernel. If you are not getting lockups with dpm enabled and an older version of mesa, it may be a mesa issue.
(In reply to comment #20)
> (In reply to comment #19)
> > @Damian, I had the same problems. My configuration is similar to yours:
> > Radeon 7870, Linux 3.16.1, xorg 1.16, MESA 10.3 RC1 (fedora f21 branch).
> > After consulting https://wiki.archlinux.org/index.php/ATI it tried adding
> > radeon.dpm=0 to my kernel command line as a workaround. It solved the
> > problem. No more lockups. The system is now stable with the above
> > configuration. Clearly this has something to do with the powermanagement
> > changes in MESA and the kernel (there were no lockups with earlier versions
> > of MESA)
> There is no power management code in mesa. Power management is completely
> self-contained in the kernel. If you are not getting lockups with dpm
> enabled and an older version of mesa, it may be a mesa issue.
Hmm, sounds like newer versions of MESA trigger a dormant power management issue in the radeon kernel driver (with certain Pitcairn hardware) ...just speculating.
It was a while ago, so I can't tell the exact kernel and MESA versions when I first observed the behavior. All I can say for sure is that since using radeon.dpm=0 I have not experienced a single lockup (which is good enough for me).
Is there a way to build the radeon kernel driver with some additional debugging/logging that could help you to understand what's going on ?
For me it's actually not stable with 3.16 and 3.17 RC2. With or without radeon.dpm=0.
If I set "echo dynpm > /sys/class/drm/card0/device/power_method" (as root) it lasts a little longer but after a couple of hours it does crash again.
Radeon 7770 HD, Arch Linux.
General question: Don't you guys think this is the same bug as https://bugs.freedesktop.org/show_bug.cgi?id=79980?
I only know one thing: mesa 10.1.4 is the last unaffected version. Therefore, I still use it on my Arch Linux
@Damian Nowak: I would gladly use Mesa 10.1.4 but with this version I don't get any OpenGL.
(all from 06-08 via Arch Rollback Machine, the last day with 10.1.4*.)
+ recent X-Server 1.16 etc.
I consider this bug report the same as the one linked above, "Random radeonsi crashes". The crashes come in many shapes; lately my screen just freezes but the mouse pointer still moves... I can't switch to a virtual console or kill X. I think I will install Catalyst again, it's really sad. Catalyst doesn't even support Firefox's OMTC GPU-accelerated scrolling, so all scrolling stutters on my 2560x1440 monitor (if I set it to FHD the stuttering is acceptable, but of course I want full resolution...).
Max, you need to downgrade llvm too. Just install llvm from the same date as your mesa.
This should be a duplicate of my bug here:
Short story: the blank-screen crash (and rare recovery) is usually triggered for me by Chromium, VLC, and OpenGL games started while Chromium or VLC is open. It also seems to happen, rarely, on game startup in general. I believe it's a bug in Mesa, no clue where, but the apps that trigger it should all be pushing data to the GPU, so I'd imagine the fault is in that pipe/allocation.
Although this issue is older than #81644, the latter contains way more information. Closing this one.
*** This bug has been marked as a duplicate of bug 81644 ***
Because #81644 turns out to be a catch-all bug, I'm reopening this one.
Is this issue still happening on current Mesa git?
Well, I don't know! I just updated my rock-solid set of packages that has always been super stable. Let's see if I encounter this problem again.
lib32-llvm: (3.4.1-1 => 3.6.2-1)
lib32-llvm-libs: (3.4.1-1 => 3.6.2-1)
lib32-mesa: (10.1.4-1 => 10.6.3-1)
lib32-mesa-libgl: (10.1.4-1 => 10.6.3-1)
llvm: (3.4.1-2 => 3.6.2-2)
llvm-libs: (3.4.1-2 => 3.6.2-2)
mesa: (10.1.4-1 => 10.6.3-1)
mesa-libgl: (10.1.4-1 => 10.6.3-1)
xf86-input-keyboard: (1.8.0-3 => 1.8.1-1)
xf86-input-vmmouse: (13.0.0-5 => 13.1.0-1)
xf86-input-void: (1.4.0-7 => 1.4.1-1)
xf86-video-ark: (0.7.5-5 => 0.7.5-6)
xf86-video-ati: (1:7.4.0-3 => 1:7.5.0-2)
xf86-video-dummy: (0.3.7-3 => 0.3.7-4)
xf86-video-fbdev: (0.4.4-3 => 0.4.4-4)
xf86-video-glint: (1.2.8-5 => 1.2.8-6)
xf86-video-i128: (1.3.6-5 => 1.3.6-6)
xf86-video-mach64: (6.9.4-4 => 6.9.5-1)
xf86-video-neomagic: (1.2.8-3 => 1.2.9-1)
xf86-video-nv: (2.1.20-5 => 2.1.20-6)
xf86-video-openchrome: (0.3.3-4 => 0.3.3-5)
xf86-video-r128: (6.9.2-3 => 6.10.0-1)
xf86-video-savage: (2.3.7-3 => 2.3.8-1)
xf86-video-siliconmotion: (1.7.7-5 => 1.7.8-1)
xf86-video-sis: (0.10.7-6 => 0.10.7-7)
xf86-video-tdfx: (1.4.5-5 => 1.4.5-6)
xf86-video-trident: (1.3.6-6 => 1.3.7-1)
xf86-video-vesa: (2.3.2-5 => 2.3.4-1)
xf86-video-vmware: (13.1.0-1 => 13.1.0-2)
xf86-video-voodoo: (1.2.5-5 => 1.2.5-6)
xorg-server: (1.16.4-1 => 1.17.2-4)
xorg-server-common: (1.16.4-1 => 1.17.2-4)
xorg-server-devel: (1.16.4-1 => 1.17.2-4)
Well, so far so good! Haven't had any lock-up today. I'll keep testing. If nothing happens by the end of the week I'll close the ticket. Thanks.
Assuming fixed per comment 31, otherwise please reopen.