Bug 65963

Summary: screen goes blank, Linux hangs - Radeon 7870, Gallium, Glamor
Product: Mesa
Reporter: Damian Nowak <nowaker>
Component: Drivers/Gallium/radeonsi
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED
QA Contact:
Severity: blocker
Priority: medium
CC: mabo, nowaker, paul.woegerer
Version: 10.2
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard:
i915 platform:
i915 features:
Attachments: Xorg.0.log
Xorg.0.log after the second crash
Xorg.0.log after the second crash - trying to start X again

Description Damian Nowak 2013-06-20 10:41:16 UTC
My Linux hangs randomly after I replaced my 2x Nvidia cards (ZaphodHeads, Xinerama) with Radeon 7870 (Gallium, Glamor). 

The desktop hangs: I can move the pointer, but clicking has no effect. After ~3 seconds the screen goes blank. Sometimes (10%?) it recovers after ~15 seconds. In most cases it just hangs. Linux is totally dead and doesn't respond to ping, SSH, or even Alt+PrtSc shortcuts.

If it recovers, I see this[1] in the dmesg and this[2] in the Xorg.0.log.

[1]
[  389.821754] radeon 0000:01:00.0: GPU lockup CP stall for more than 10000msec

[2]
(EE) [mi] EQ overflowing.  Additional events will be discarded until existing events are processed.


I am able to reproduce the bug using Linphone (it doesn't mean it hangs just in Linphone). I call my mobile. Screen goes blank right after connection is established (connection = when you hear beeping and you wait for the guy to answer the phone). Therefore, I can test if the changes you provide resolve the issue.


Using x86_64  Arch Linux 3.9.6-1-ARCH, radeon and glamor from git, KDE.

lspci: 01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Pitcairn XT [Radeon HD 7870 GHz Edition]

glxinfo: OpenGL renderer string: Gallium 0.4 on AMD PITCAIRN



Full logs:
http://upload.nowaker.net/nwkr/1371723485_glxinfo
http://upload.nowaker.net/nwkr/1371724316_xrandr

When Linux recovers:
http://upload.nowaker.net/nwkr/1371723420_dmesg
http://upload.nowaker.net/nwkr/1371724751_Xorg.0.log

When Linux hangs (so the log is cut off):
http://upload.nowaker.net/nwkr/1371723428_Xorg.0.log.old


Tell me if you need more logs.
I will be happy to test any provided fixes.
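
For anyone checking their own dmesg for the same symptom, here is a minimal Python sketch that picks out lockup lines like [1] above. The pattern is derived only from the single line quoted in this report, so treat it as an assumption: other kernel versions may word the message differently. The function name `find_gpu_lockups` is made up for illustration.

```python
import re

# Pattern based on the one dmesg line quoted in this report (assumption:
# other kernels may phrase the lockup message differently).
LOCKUP_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+radeon\s+(?P<pci>[0-9a-f:.]+):\s+"
    r"GPU lockup CP stall for more than (?P<msec>\d+)msec"
)


def find_gpu_lockups(dmesg_text):
    """Return (timestamp_seconds, pci_address, stall_msec) for each lockup line."""
    hits = []
    for line in dmesg_text.splitlines():
        m = LOCKUP_RE.search(line)
        if m:
            hits.append((float(m.group("ts")), m.group("pci"), int(m.group("msec"))))
    return hits


sample = "[  389.821754] radeon 0000:01:00.0: GPU lockup CP stall for more than 10000msec"
print(find_gpu_lockups(sample))
```

In practice the input would be the saved output of `dmesg` rather than the inline sample string.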
Comment 1 Alex Deucher 2013-06-20 13:40:40 UTC
Is this a regression?  What version of mesa and llvm are you using?  Also, in the future please attach your logs directly to the bug.
Comment 2 Damian Nowak 2013-06-20 14:05:30 UTC
Thanks for taking a look at it. Package versions are as follows. I'm linking the PKGBUILDs so you can see the compilation flags.

extra/mesa 9.1.3-1
  https://projects.archlinux.org/svntogit/packages.git/tree/trunk/PKGBUILD?h=packages/mesa
extra/llvm-amdgpu-lib-snapshot 20130403-3
  https://projects.archlinux.org/svntogit/packages.git/tree/trunk/PKGBUILD?h=packages/llvm-amdgpu-snapshot

Let me know if you want me to install these from git as well.
Comment 3 Damian Nowak 2013-08-09 12:50:22 UTC
My 7870 burned out a month ago. Now I have a new one, so I can continue investigating.

In the meantime there was a kernel update - I now have 3.10.5-1-ARCH instead of 3.9.6-1-ARCH. There were some updates along the way as well, marked as [updated]. I currently have:

linux 3.10.5-1 [updated]
mesa 9.1.6-1 [updated]
libdrm 2.4.46-2 [updated]
ati-dri 9.1.6-1 [updated]
llvm-amdgpu-lib-snapshot 20130403-3 [the same]
xf86-video-ati-glamor-git 1:20120730-1 [the same]


> I am able to reproduce the bug using Linphone (it doesn't mean it hangs just in 
> Linphone). I call my mobile. Screen goes blank right after connection is 
> established (connection = when you hear beeping and you wait for the guy to 
> answer the phone). Therefore, I can test if the changes you provide resolve the 
> issue.

Fortunately, it works now. No hangs in Linphone after connection is established.
Comment 4 Damian Nowak 2014-07-05 12:50:15 UTC
This has recently returned.

At random points my system just hangs. Everything dies, so not even an SSH connection to the machine is possible. Today, however, the display hung and CTRL+ALT+F1 was unusable as usual, but SSH worked. I was able to inspect dmesg (nothing at all related to the crash) and Xorg.0.log (error message and stack trace!).

I am sure this is a regression. I never encountered this problem on mesa 10.1.3 and llvm 3.4. I kept these versions on my system for quite a long time and refrained from updating to the latest versions because of a problem with Steam. (FYI https://bbs.archlinux.org/viewtopic.php?pid=1432315#p1432315)

Right after I updated to mesa 10.2.2 and llvm 3.4.2 three days ago my system started hanging randomly from 2 to 5 times a day. I am 100% sure it's either mesa 10.1.3 -> 10.2.2 bump or llvm 3.4 -> 3.4.2.

In the meantime I will try installing various combinations of mesa (10.1.3, 10.1.4, 10.2.0rc1-5, 10.2.1, 10.2.2) and llvm (3.4, 3.4.1, 3.4.2) to find which package exactly triggers the problem.

Using Arch Linux 3.15.3.

(EE) [mi] EQ overflow continuing.  1000 events have been dropped.
(EE) [mi] No further overflow reports will be reported until the clog is cleared.
(EE) 
(EE) Backtrace:
(EE) 0: /usr/bin/X (xorg_backtrace+0x56) [0x58f186]
(EE) 1: /usr/bin/X (QueuePointerEvents+0x52) [0x44e602]
(EE) 2: /usr/lib/xorg/modules/input/evdev_drv.so (0x7f27799c9000+0x60ba) [0x7f27799cf0ba]
(EE) 3: /usr/lib/xorg/modules/input/evdev_drv.so (0x7f27799c9000+0x657d) [0x7f27799cf57d]
(EE) 4: /usr/bin/X (0x400000+0x74d18) [0x474d18]
(EE) 5: /usr/bin/X (0x400000+0x9e5b9) [0x49e5b9]
(EE) 6: /usr/lib/libpthread.so.0 (0x7f2780855000+0xf4b0) [0x7f27808644b0]
(EE) 7: /usr/lib/libc.so.6 (ioctl+0x7) [0x7f277f583e47]
(EE) 8: /usr/lib/libdrm.so.2 (drmIoctl+0x28) [0x7f278064c9b8]
(EE) 9: /usr/lib/libdrm.so.2 (drmCommandWrite+0x1b) [0x7f278064ec2b]
(EE) 10: /usr/lib/xorg/modules/dri/radeonsi_dri.so (0x7f2778d4a000+0x69014) [0x7f2778db3014]
(EE) 11: /usr/lib/xorg/modules/dri/radeonsi_dri.so (0x7f2778d4a000+0x6a3c2) [0x7f2778db43c2]
(EE) 12: /usr/lib/xorg/modules/dri/radeonsi_dri.so (0x7f2778d4a000+0x5d431) [0x7f2778da7431]
(EE) 13: /usr/lib/xorg/modules/dri/radeonsi_dri.so (0x7f2778d4a000+0x19f0f9) [0x7f2778ee90f9]
(EE) 14: /usr/lib/xorg/modules/dri/radeonsi_dri.so (0x7f2778d4a000+0x11e560) [0x7f2778e68560]
(EE) 15: /usr/lib/xorg/modules/dri/radeonsi_dri.so (0x7f2778d4a000+0x1a14a5) [0x7f2778eeb4a5]
(EE) 16: /usr/lib/xorg/modules/dri/radeonsi_dri.so (0x7f2778d4a000+0x11f674) [0x7f2778e69674]
(EE) 17: /usr/lib/xorg/modules/dri/radeonsi_dri.so (0x7f2778d4a000+0x11f8d2) [0x7f2778e698d2]
(EE) 18: /usr/lib/libglamor.so.0 (0x7f277d11d000+0x2071b) [0x7f277d13d71b]
(EE) 19: /usr/lib/libglamor.so.0 (0x7f277d11d000+0x20ac8) [0x7f277d13dac8]
(EE) 20: /usr/lib/libglamor.so.0 (0x7f277d11d000+0x1ccab) [0x7f277d139cab]
(EE) 21: /usr/bin/X (0x400000+0x17fad8) [0x57fad8]
(EE) 22: /usr/bin/X (0x400000+0xc40da) [0x4c40da]
(EE) 23: /usr/bin/X (0x400000+0x33cdb) [0x433cdb]
(EE) 24: /usr/bin/X (0x400000+0x36b2f) [0x436b2f]
(EE) 25: /usr/bin/X (0x400000+0x3ad16) [0x43ad16]
(EE) 26: /usr/lib/libc.so.6 (__libc_start_main+0xf0) [0x7f277f4c2000]
(EE) 27: /usr/bin/X (0x400000+0x250fe) [0x4250fe]
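
To make a trace like this easier to read, the frames can be grouped by module. The following is a minimal Python sketch, assuming the frame layout seen above (`(EE) N: path (location) [address]`); the helper names `parse_frames` and `frames_per_module` are hypothetical. In the trace above, frames 7 to 9 are the ioctl path through libdrm, called from radeonsi (frames 10 to 17), which in turn is called from glamor (frames 18 to 20).

```python
import re
from collections import Counter

# Matches frame lines such as:
#   (EE) 8: /usr/lib/libdrm.so.2 (drmIoctl+0x28) [0x7f278064c9b8]
FRAME_RE = re.compile(r"\(EE\)\s+(\d+):\s+(\S+)\s+\((.*?)\)\s+\[0x[0-9a-f]+\]")


def parse_frames(log_text):
    """Return a list of (frame_number, module_path, location) tuples."""
    frames = []
    for line in log_text.splitlines():
        m = FRAME_RE.search(line)
        if m:
            frames.append((int(m.group(1)), m.group(2), m.group(3)))
    return frames


def frames_per_module(frames):
    """Count how many frames belong to each module path."""
    return Counter(module for _, module, _ in frames)


# A few frames copied from the backtrace in this comment.
sample = """\
(EE) Backtrace:
(EE) 0: /usr/bin/X (xorg_backtrace+0x56) [0x58f186]
(EE) 7: /usr/lib/libc.so.6 (ioctl+0x7) [0x7f277f583e47]
(EE) 8: /usr/lib/libdrm.so.2 (drmIoctl+0x28) [0x7f278064c9b8]
(EE) 10: /usr/lib/xorg/modules/dri/radeonsi_dri.so (0x7f2778d4a000+0x69014) [0x7f2778db3014]
(EE) 11: /usr/lib/xorg/modules/dri/radeonsi_dri.so (0x7f2778d4a000+0x6a3c2) [0x7f2778db43c2]
"""

frames = parse_frames(sample)
counts = frames_per_module(frames)
for module, n in counts.most_common():
    print(n, module)
```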
Comment 5 Damian Nowak 2014-07-05 12:50:56 UTC
Created attachment 102301 [details]
Xorg.0.log
Comment 6 Damian Nowak 2014-07-06 01:15:23 UTC
Not an LLVM issue, just hung with a downgraded LLVM 3.4. So the problem has to be in mesa. I will now try to find the latest working version of mesa. Starting from 10.1.4 for now.

Regarding the two recent hangs: one was total, and I had to hard reset. The other, which happened 10 minutes after reboot, wasn't that bad. After `pkill -9 kdm` from SSH I was able to CTRL+ALT+F1. The error was "EQ overflow" again, of course, but with a slightly different stack trace.

However, X wasn't able to start again. I had to reboot.

[   480.971] (EE) RADEON(0): [drm] Failed to open DRM device for pci:0000:01:00.0: No such file or directory
Comment 7 Damian Nowak 2014-07-06 01:15:53 UTC
Created attachment 102315 [details]
Xorg.0.log after the second crash
Comment 8 Damian Nowak 2014-07-06 01:16:32 UTC
Created attachment 102316 [details]
Xorg.0.log after the second crash - trying to start X again
Comment 9 Michel Dänzer 2014-07-09 02:57:03 UTC
(In reply to comment #6)
> Not an LLVM issue, just hung with a downgraded LLVM 3.4.

Note that it's not really supported for the radeonsi driver to run against a version of LLVM older than the one it was built against. It might be good to confirm that old Mesa with LLVM 3.4.2 isn't affected by the problem.


> So the problem has to be in mesa. I will now try to find the latest working
> version of mesa. Starting from 10.1.4 for now.

Thanks, but FWIW, the most helpful thing you could do would be to bisect the problem from upstream Mesa Git.
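
The idea behind bisecting can be sketched in a few lines of Python. This is only an illustration of the search that `git bisect` automates, not a replacement for it. The predicate `is_bad` is hypothetical: in reality it means building and running a revision long enough to decide whether it hangs, which is unreliable for an intermittent hang. The "culprit" in the example is invented purely to exercise the code.

```python
def first_bad(revisions, is_bad):
    """Return the first revision for which is_bad() is True.

    Assumes revisions are ordered oldest to newest, the first one is good,
    the last one is bad, and every revision after the first bad one is bad.
    """
    lo, hi = 0, len(revisions) - 1  # invariant: revisions[lo] good, revisions[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(revisions[mid]):
            hi = mid
        else:
            lo = mid
    return revisions[hi]


# Release names mentioned earlier in this bug; the culprit is hypothetical.
revisions = ["10.1.3", "10.1.4", "10.2.0rc1", "10.2.1", "10.2.2"]
hypothetical_culprit = "10.2.0rc1"
tested = []


def is_bad(rev):
    tested.append(rev)
    return revisions.index(rev) >= revisions.index(hypothetical_culprit)


print(first_bad(revisions, is_bad), "after testing", tested)
```

Each test halves the remaining range, which is why bisecting Git commits directly usually costs only a handful of extra builds compared to testing releases first.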
Comment 10 Damian Nowak 2014-07-09 02:59:36 UTC
I will bisect once I have the earliest version of mesa that causes the problem.

By the way, 10.1.4 is proven to work, and I'll be trying out so 10.2.x versions in the next days.
Comment 11 Damian Nowak 2014-07-09 03:00:11 UTC
s/so//
Comment 12 Maciej 2014-07-29 09:25:11 UTC
It started happening no more than two weeks ago (doing Mesa updates through Oibaf PPA almost daily). Happens with Kernel 3.15.6 and 3.16-RCx (up to latest RC7) on Ubuntu 14.04 with HD7770.
Comment 13 Damian Nowak 2014-07-29 10:01:21 UTC
@Maciej, please analyze your dpkg logs and tell us which Mesa version started to behave incorrectly for you.

@Michel, I haven't been able to try 10.2.* since I have been very busy recently and needed a non-hanging machine, hence I used 10.1.4 for the time being. When I'm less busy I will go back to the case and try out the next versions.
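
As a rough aid for the dpkg log analysis, here is a minimal Python sketch that lists Mesa upgrades from a dpkg log. It assumes the usual `date time upgrade package:arch old-version new-version` line layout of `/var/log/dpkg.log`. The sample lines and version strings below are invented for illustration, and `mesa_upgrades` is a hypothetical helper name.

```python
def mesa_upgrades(dpkg_log_text, needle="mesa"):
    """Return (timestamp, package, old_version, new_version) for matching upgrades."""
    events = []
    for line in dpkg_log_text.splitlines():
        fields = line.split()
        # Upgrade lines have six fields: date, time, action, package, old, new.
        if len(fields) == 6 and fields[2] == "upgrade" and needle in fields[3]:
            date, time, _, package, old, new = fields
            events.append((f"{date} {time}", package, old, new))
    return events


# Invented sample lines in dpkg.log layout (not taken from a real system).
sample_log = """\
2014-07-15 08:12:33 upgrade libgl1-mesa-dri:amd64 10.3.0~git1 10.3.0~git2
2014-07-15 08:12:40 status installed libgl1-mesa-dri:amd64 10.3.0~git2
2014-07-15 08:13:02 upgrade firefox:amd64 30.0 31.0
"""

for event in mesa_upgrades(sample_log):
    print(event)
```

Comparing the timestamps of these upgrades with the date of the first hang narrows down which Mesa snapshot to suspect.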
Comment 14 Maciej 2014-07-29 14:40:00 UTC
@Damian Nowak 

Before I saw your answer I did a full, fresh system reinstallation (because hangs started happening after a few minutes). Ubuntu was running fine for a few hours, so I added the Oibaf PPA and kernel 3.16-RC7. So far no hang. I'll report if it happens again and add some logs.
Comment 15 Michel Dänzer 2014-07-30 09:08:48 UTC
(In reply to comment #4)
> This has recently returned.
[...]
> Using Arch Linux 3.15.3.

Note that there are known stability issues in 3.15.y kernels which are fixed in 3.16-rc7 or later.
Comment 16 Damian Nowak 2014-07-30 11:10:16 UTC
Thanks Michel. I'll keep that in mind. Once I have some time, I will just upgrade mesa to the latest version and downgrade the kernel and see if it helps.
Comment 17 Maciej 2014-07-30 11:21:14 UTC
Ok, it was fine during around 7 or 8 hours of usage (mostly browsing in Chrome, some gaming, a tv show with smplayer and vdpau). However today it started happening again.

To clarify, my system is (fresh installation made yesterday):

Ubuntu 14.04 64bit
Kernel 3.16-RC7 from mainline repo
Mesa git from Oibaf PPA
HD7770 1GB

Switching to kernel 3.15.7 doesn't help.
Comment 18 Maciej 2014-07-31 18:59:05 UTC
This is very odd... I tried using Firefox instead of Chrome for a day and there was no hang, then I switched to Chrome and five minutes later I had to hard reboot. Shit happens with stable, beta and unstable channels (no custom tweaks to Chrome, all default).
Comment 19 paul.woegerer 2014-08-27 13:00:39 UTC
@Damian, I had the same problems. My configuration is similar to yours: Radeon 7870, Linux 3.16.1, xorg 1.16, MESA 10.3 RC1 (fedora f21 branch).

After consulting https://wiki.archlinux.org/index.php/ATI I tried adding radeon.dpm=0 to my kernel command line as a workaround. It solved the problem. No more lockups. The system is now stable with the above configuration. Clearly this has something to do with the power management changes in MESA and the kernel (there were no lockups with earlier versions of MESA).
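
To confirm that the workaround is actually active after a reboot, the kernel command line can be inspected. Below is a minimal Python sketch that extracts the `radeon.dpm` value from a command-line string such as the contents of `/proc/cmdline`. The function name is hypothetical, the sample command line is invented, and the "last occurrence wins" rule is an assumption about how repeated options are treated.

```python
def radeon_dpm_setting(cmdline):
    """Return the value of radeon.dpm from a kernel command line, or None if unset."""
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == "radeon.dpm" and sep:
            # Assumption: a later occurrence overrides an earlier one.
            value = val
    return value


# Invented example command line; on a real system read /proc/cmdline instead.
sample_cmdline = "BOOT_IMAGE=/vmlinuz-linux root=/dev/sda1 rw quiet radeon.dpm=0"
print(radeon_dpm_setting(sample_cmdline))
```

On a running system the input would come from `open("/proc/cmdline").read()`; a result of `None` means the option was not passed and the driver default applies.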
Comment 20 Alex Deucher 2014-08-27 13:33:50 UTC
(In reply to comment #19)
> @Damian, I had the same problems. My configuration is similar to yours:
> Radeon 7870, Linux 3.16.1, xorg 1.16, MESA 10.3 RC1 (fedora f21 branch).
> 
> After consulting https://wiki.archlinux.org/index.php/ATI I tried adding
> radeon.dpm=0 to my kernel command line as a workaround. It solved the
> problem. No more lockups. The system is now stable with the above
> configuration. Clearly this has something to do with the powermanagement
> changes in MESA and the kernel (there were no lockups with earlier versions
> of MESA)

There is no power management code in mesa.  Power management is completely self-contained in the kernel.  If you are not getting lockups with dpm enabled and an older version of mesa, it may be a mesa issue.
Comment 21 paul.woegerer 2014-08-30 18:45:45 UTC
(In reply to comment #20)
> (In reply to comment #19)
> > @Damian, I had the same problems. My configuration is similar to yours:
> > Radeon 7870, Linux 3.16.1, xorg 1.16, MESA 10.3 RC1 (fedora f21 branch).
> > 
> > After consulting https://wiki.archlinux.org/index.php/ATI I tried adding
> > radeon.dpm=0 to my kernel command line as a workaround. It solved the
> > problem. No more lockups. The system is now stable with the above
> > configuration. Clearly this has something to do with the powermanagement
> > changes in MESA and the kernel (there were no lockups with earlier versions
> > of MESA)
> 
> There is no power management code in mesa.  Power management is completely
> self-contained in the kernel.  If you are not getting lockups with dpm
> enabled and an older version of mesa, it may be a mesa issue.

Hmm, sounds like newer versions of MESA trigger a dormant power management issue in the radeon kernel driver (with certain Pitcairn hardware) ...just speculating.

It was a while ago, so I can't tell the exact kernel and MESA versions when I first observed the behavior. All I can say for sure is that since I started using radeon.dpm=0 I have not experienced a single lockup (which is good enough for me).

Is there a way to build the radeon kernel driver with some additional debugging/logging that could help you to understand what's going on ?
Comment 22 Maximilian Böhm 2014-08-30 22:25:22 UTC
For me it's actually not stable with 3.16 and 3.17 RC2. With or without radeon.dpm=0.
If I set "echo dynpm > /sys/class/drm/card0/device/power_method" (as root) it lasts a little longer but after a couple of hours it does crash again.

Radeon 7770 HD, Arch Linux.
General question: Don't you guys think this is the same bug as https://bugs.freedesktop.org/show_bug.cgi?id=79980?
Comment 23 Damian Nowak 2014-08-30 22:27:27 UTC
I only know one thing: mesa 10.1.4 is the last unaffected version. Therefore, I still use it on my Arch Linux.
Comment 24 Maximilian Böhm 2014-08-31 13:45:05 UTC
@Damian Nowak: I would gladly use Mesa 10.1.4 but with this version I don't get any OpenGL.

Installed:
ati-dri-10.1.4-1
lib32-mesa-10.1.4-1
lib32-mesa-libgl-10.1.4-1
mesa-10.1.4-1
mesa-demos-8.1.0-2
mesa-libgl-10.1.4-1
(all from 06-08 via Arch Rollback Machine, the last day with 10.1.4*.)
+ recent X-Server 1.16 etc.

I see this bug report as the same as the one linked above, "Random radeonsi crashes". The crashes take many shapes; lately my screen just freezes but the mouse pointer still moves... I can't switch to a virtual console or kill X. I think I will install Catalyst again, which is really sad. Catalyst doesn't even support Firefox's OMTC GPU-accelerated scrolling, so all scrolling stutters on my 2560x1440 monitor (if I set it to FHD the stuttering is acceptable, but of course I want full resolution...).
Comment 25 Damian Nowak 2014-08-31 14:24:28 UTC
Max, you need to downgrade llvm too. Just install llvm from the same date as your mesa.
Comment 26 Aaron B 2014-09-01 01:43:26 UTC
This should be a duplicate of my bug here:

https://bugs.freedesktop.org/show_bug.cgi?id=81644

Short story: The blank screen crash (and rare recovery) is usually triggered for me by Chromium, VLC, and OpenGL games opened while Chromium or VLC is open. It also seems to happen, rarely, on game startup in general. I believe it's a bug in Mesa, no clue where, but the apps that trigger it should all be pushing data to the GPU, so I'd imagine it's a fault in that pipe/allocation.
Comment 27 Damian Nowak 2014-09-01 01:56:32 UTC
Although this issue is older than #81644, the latter contains way more information. Closing this one.

*** This bug has been marked as a duplicate of bug 81644 ***
Comment 28 Damian Nowak 2014-10-03 23:33:23 UTC
Because #81644 turns out to be a catch-all bug, I'm reopening this one.
Comment 29 Marek Olšák 2015-08-02 10:58:46 UTC
Is this issue still happening on current Mesa git?
Comment 30 Damian Nowak 2015-08-03 04:06:46 UTC
Well, I don't know! I just updated my rock-solid set of packages that has always been super stable. Let's see if I encounter this problem again.

lib32-llvm: (3.4.1-1 => 3.6.2-1)
lib32-llvm-libs: (3.4.1-1 => 3.6.2-1)
lib32-mesa: (10.1.4-1 => 10.6.3-1)
lib32-mesa-libgl: (10.1.4-1 => 10.6.3-1)
llvm: (3.4.1-2 => 3.6.2-2)
llvm-libs: (3.4.1-2 => 3.6.2-2)
mesa: (10.1.4-1 => 10.6.3-1)
mesa-libgl: (10.1.4-1 => 10.6.3-1)
xf86-input-keyboard: (1.8.0-3 => 1.8.1-1)
xf86-input-vmmouse: (13.0.0-5 => 13.1.0-1)
xf86-input-void: (1.4.0-7 => 1.4.1-1)
xf86-video-ark: (0.7.5-5 => 0.7.5-6)
xf86-video-ati: (1:7.4.0-3 => 1:7.5.0-2)
xf86-video-dummy: (0.3.7-3 => 0.3.7-4)
xf86-video-fbdev: (0.4.4-3 => 0.4.4-4)
xf86-video-glint: (1.2.8-5 => 1.2.8-6)
xf86-video-i128: (1.3.6-5 => 1.3.6-6)
xf86-video-mach64: (6.9.4-4 => 6.9.5-1)
xf86-video-neomagic: (1.2.8-3 => 1.2.9-1)
xf86-video-nv: (2.1.20-5 => 2.1.20-6)
xf86-video-openchrome: (0.3.3-4 => 0.3.3-5)
xf86-video-r128: (6.9.2-3 => 6.10.0-1)
xf86-video-savage: (2.3.7-3 => 2.3.8-1)
xf86-video-siliconmotion: (1.7.7-5 => 1.7.8-1)
xf86-video-sis: (0.10.7-6 => 0.10.7-7)
xf86-video-tdfx: (1.4.5-5 => 1.4.5-6)
xf86-video-trident: (1.3.6-6 => 1.3.7-1)
xf86-video-vesa: (2.3.2-5 => 2.3.4-1)
xf86-video-vmware: (13.1.0-1 => 13.1.0-2)
xf86-video-voodoo: (1.2.5-5 => 1.2.5-6)
xorg-server: (1.16.4-1 => 1.17.2-4)
xorg-server-common: (1.16.4-1 => 1.17.2-4)
xorg-server-devel: (1.16.4-1 => 1.17.2-4)
Comment 31 Damian Nowak 2015-08-05 04:50:05 UTC
Well, so far so good! Haven't had any lock-up today. I'll keep testing. If nothing happens by the end of the week I'll close the ticket. Thanks.
Comment 32 Michel Dänzer 2016-01-29 07:21:37 UTC
Assuming fixed per comment 31, otherwise please reopen.
