Bug 79696 - 10.2.x GPU stall & Xorg crash while using Geeqie
Status: RESOLVED FIXED
Product: Mesa
Classification: Unclassified
Component: Drivers/Gallium/radeonsi
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Default DRI bug account
 
Reported: 2014-06-05 18:34 UTC by Marti Raudsepp
Modified: 2014-12-11 20:17 UTC

Attachments
gdm-Xorg:0.log (110.19 KB, text/plain)
2014-06-05 18:34 UTC, Marti Raudsepp
kernel.log (130.35 KB, text/plain)
2014-06-05 18:35 UTC, Marti Raudsepp
gdb-backtrace2.txt (10.21 KB, text/plain)
2014-06-11 21:17 UTC, Marti Raudsepp

Description Marti Raudsepp 2014-06-05 18:34:56 UTC
Created attachment 100481 [details]
gdm-Xorg:0.log

I'm using the Arch Linux testing repositories with Mesa 10.2.0rc5.

After a rather long session of browsing pictures using the Geeqie image viewer, the display froze up, and after some time passed, Xorg crashed and restarted. The new Xorg had GPU acceleration disabled. I had done a suspend & resume cycle after the last boot, prior to the crash.

Versions:
ati-dri 10.2.0rc5
mesa 10.2.0rc5
xorg-server 1.15.1
kernel 3.14.4
xf86-video-ati 7.3.0

In the log files:
20:10:27 - resume from suspend
20:58:37 - kernel: 0000:01:00.0: GPU lockup CP stall for more than 10000msec
20:59:04 - Xorg: The kernel rejected CS, see dmesg for more information.
20:59:04 - Xorg segfault
20:59:05 - new Xorg: RADEON(0): Direct rendering disabled
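
The gaps between these log events can be worked out with a short sketch. It assumes all five timestamps fall on the same day, and the event labels are paraphrased from the list above rather than copied from the logs.

```python
from datetime import datetime, timedelta

# Timestamps copied from the log summary above (assumed to be on the same day).
EVENTS = [
    ("resume from suspend", "20:10:27"),
    ("kernel reports GPU lockup", "20:58:37"),
    ("Xorg segfault", "20:59:04"),
    ("new Xorg starts", "20:59:05"),
]

def gaps(events):
    """Return (from_label, to_label, timedelta) for each consecutive pair."""
    parsed = [(label, datetime.strptime(ts, "%H:%M:%S")) for label, ts in events]
    return [(a[0], b[0], b[1] - a[1]) for a, b in zip(parsed, parsed[1:])]

for start, end, delta in gaps(EVENTS):
    print(f"{start} -> {end}: {delta}")
```

Since the kernel message reports a stall of more than 10000 msec, the hang itself began at least 10 seconds before the 20:58:37 entry.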

Xorg backtrace:
0: /usr/bin/Xorg (xorg_backtrace+0x48) [0x584b08]
1: /usr/bin/Xorg (0x400000+0x1887f9) [0x5887f9]
2: /usr/lib/libpthread.so.0 (0x7fb78e9a4000+0xf4b0) [0x7fb78e9b34b0]
3: /usr/lib/xorg/modules/dri/radeonsi_dri.so (0x7fb786e9e000+0x68b7c) [0x7fb786f06b7c]
4: /usr/lib/xorg/modules/dri/radeonsi_dri.so (0x7fb786e9e000+0x6b401) [0x7fb786f09401]
5: /usr/lib/xorg/modules/dri/radeonsi_dri.so (0x7fb786e9e000+0x48884) [0x7fb786ee6884]
6: /usr/lib/xorg/modules/dri/radeonsi_dri.so (0x7fb786e9e000+0x19ac9b) [0x7fb787038c9b]
7: /usr/lib/xorg/modules/dri/radeonsi_dri.so (0x7fb786e9e000+0x1a2dc8) [0x7fb787040dc8]
8: /usr/lib/xorg/modules/dri/radeonsi_dri.so (0x7fb786e9e000+0x1a530c) [0x7fb78704330c]
9: /usr/lib/xorg/modules/dri/radeonsi_dri.so (0x7fb786e9e000+0x13b8e4) [0x7fb786fd98e4]
10: /usr/lib/xorg/modules/dri/radeonsi_dri.so (0x7fb786e9e000+0x13cc50) [0x7fb786fdac50]
11: /usr/lib/libglamor.so.0 (0x7fb78b26d000+0x1e2a7) [0x7fb78b28b2a7]
12: /usr/lib/libglamor.so.0 (0x7fb78b26d000+0x1e892) [0x7fb78b28b892]
13: /usr/lib/libglamor.so.0 (0x7fb78b26d000+0x1f184) [0x7fb78b28c184]
14: /usr/lib/libglamor.so.0 (0x7fb78b26d000+0x1fc70) [0x7fb78b28cc70]
15: /usr/lib/libglamor.so.0 (0x7fb78b26d000+0x524c) [0x7fb78b27224c]
16: /usr/lib/libglamor.so.0 (0x7fb78b26d000+0x5b74) [0x7fb78b272b74]
17: /usr/lib/libglamor.so.0 (0x7fb78b26d000+0x6bb5) [0x7fb78b273bb5]
18: /usr/lib/libglamor.so.0 (0x7fb78b26d000+0x6e9a) [0x7fb78b273e9a]
19: /usr/bin/Xorg (miCopyRegion+0x1ad) [0x564f7d]
20: /usr/bin/Xorg (miDoCopy+0x456) [0x565506]
21: /usr/lib/libglamor.so.0 (0x7fb78b26d000+0x6edd) [0x7fb78b273edd]
22: /usr/bin/Xorg (0x400000+0x112258) [0x512258]
23: /usr/bin/Xorg (0x400000+0xced0a) [0x4ced0a]
24: /usr/bin/Xorg (0x400000+0xd02e8) [0x4d02e8]
25: /usr/bin/Xorg (0x400000+0x35c8e) [0x435c8e]
26: /usr/bin/Xorg (0x400000+0x39aaa) [0x439aaa]
27: /usr/lib/libc.so.6 (__libc_start_main+0xf0) [0x7fb78d611000]
28: /usr/bin/Xorg (0x400000+0x2507e) [0x42507e]
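
The unsymbolized frames above have the form `(load base + offset) [absolute address]`. The sketch below parses one such frame, checks that base plus offset matches the address, and prints an `addr2line` command line for the offset. The regular expression and the `parse_frame` helper are illustrative, not part of Xorg. Whether the offset resolves to a source line depends on debugging symbols being installed for the module.

```python
import re

# Matches frames of the form:
#   N: /path/to/module (0xBASE+0xOFFSET) [0xADDRESS]
FRAME_RE = re.compile(
    r"^\s*(\d+): (\S+) \(0x([0-9a-f]+)\+0x([0-9a-f]+)\) \[0x([0-9a-f]+)\]$"
)

def parse_frame(line):
    """Return (index, module, base, offset, address), or None if the frame
    is already symbolized (e.g. "xorg_backtrace+0x48") or does not match."""
    m = FRAME_RE.match(line)
    if m is None:
        return None
    index, module, base, offset, address = m.groups()
    return int(index), module, int(base, 16), int(offset, 16), int(address, 16)

frame = parse_frame(
    "3: /usr/lib/xorg/modules/dri/radeonsi_dri.so "
    "(0x7fb786e9e000+0x68b7c) [0x7fb786f06b7c]"
)
index, module, base, offset, address = frame

# Sanity check: the bracketed address is the load base plus the offset.
assert base + offset == address

# For a shared object, the offset is usually the value to hand to addr2line.
print(f"addr2line -f -e {module} {offset:#x}")
```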
Comment 1 Marti Raudsepp 2014-06-05 18:35:38 UTC
Created attachment 100482 [details]
kernel.log
Comment 2 Michel Dänzer 2014-06-06 03:18:24 UTC
It might be worth getting the PITCAIRN_mc2.bin microcode and trying a 3.15 kernel.

Does this happen with particular pictures or randomly?
Comment 3 Marti Raudsepp 2014-06-06 10:57:29 UTC
> Does this happen with particular pictures or randomly?

I haven't tried to reproduce it yet, I will try to find the time for it.

> It might be worth getting the PITCAIRN_mc2.bin microcode and trying a 3.15 kernel.

I guess once I've managed to reproduce it, otherwise we don't know if it fixes anything.

PITCAIRN_mc2.bin will be included with the kernel 3.15 firmware package right?
Comment 4 Alex Deucher 2014-06-06 14:20:50 UTC
(In reply to comment #3)
> > Does this happen with particular pictures or randomly?
> 
> I haven't tried to reproduce it yet, I will try to find the time for it.
> 
> > It might be worth getting the PITCAIRN_mc2.bin microcode and trying a 3.15 kernel.
> 
> I guess once I've managed to reproduce it, otherwise we don't know if it
> fixes anything.
> 
> PITCAIRN_mc2.bin will be included with the kernel 3.15 firmware package
> right?

It's available in the linux-firmware tree:
http://git.kernel.org/cgit/linux/kernel/git/firmware/linux-firmware.git/commit/?id=4848a1e059de5aa723ff6627b90de8a5d9632626
Comment 5 Marti Raudsepp 2014-06-06 22:37:38 UTC
(In reply to comment #2)
> Does this happen with particular pictures or randomly?

Randomly. I can now reproduce this fairly reliably by enabling the slideshow in Geeqie, setting the delay to 0.2s, and letting it run for 15 minutes or so. If you need more information about the crash, I can provide it.

> It might be worth getting the PITCAIRN_mc2.bin microcode and trying a 3.15
> kernel.

Indeed, after upgrading to kernel 3.15rc8 and git firmware, I can no longer reproduce the issue. It's been running without problems for 45 minutes now.
Comment 6 Marti Raudsepp 2014-06-06 22:52:06 UTC
Just to be clear, there's still a bug in Xorg segfaulting, right? Should it be able to survive GPU hangs?
Comment 7 Michel Dänzer 2014-06-09 07:38:33 UTC
(In reply to comment #6)
> Just to be clear, there's still a bug in Xorg segfaulting, right? Should it
> be able to survive GPU hangs?

If you can get a gdb backtrace with debugging symbols for /usr/lib/xorg/modules/dri/radeonsi_dri.so and /usr/lib/libglamor.so.0, we can see what we can do. But realistically, it's not always possible to recover gracefully from GPU hangs.
Comment 8 Marti Raudsepp 2014-06-11 21:17:14 UTC
Created attachment 100902 [details]
gdb-backtrace2.txt

Here's a backtrace from a crashed Xorg (including "bt" and "bt full").
Comment 9 Michel Dänzer 2014-06-12 03:19:36 UTC
(In reply to comment #0)
> After a rather long session of browsing pictures using the Geeqie image
> viewer, the display froze up, and after some time passed, Xorg crashed and
> restarted. The new Xorg had GPU acceleration disabled.

(In reply to comment #8)
> Here's a backtrace from a crashed Xorg (including "bt" and "bt full").

Hmm. I think it should be possible to avoid that crash. However, given what you said above, the X server won't be able to actually draw anything anymore. Would that really be an improvement?
Comment 10 Marti Raudsepp 2014-06-12 09:31:11 UTC
(In reply to comment #9)
> the X server won't be able to actually draw anything
> anymore. Would that really be an improvement?

I don't know, that's your call to make.

Other drivers like Intel's are capable of recovering from GPU hangs (sometimes?) by resetting the hardware or whatnot. I wasn't sure whether Radeon does or not.
Comment 11 Alex Deucher 2014-06-12 13:44:57 UTC
(In reply to comment #10)
> (In reply to comment #9)
> > the X server won't be able to actually draw anything
> > anymore. Would that really be an improvement?
> 
> I don't know, that's your call to make.
> 
> Other drivers like Intel's are capable of recovering from GPU hangs
> (sometimes?) by resetting the hardware or whatnot. I wasn't sure whether
> Radeon does or not.

Radeon attempts to reset the hw as well.  It's not always successful.
Comment 12 Clemens Fruhwirth 2014-06-30 07:25:14 UTC
FYI I see similar GPU stalls/display blanks with an upgrade from 10.1.4 -> 10.2.1. I am on Verde with VERDE_mc2.bin.

I started bisecting, but it takes me about a day to judge good vs bad, and as I have around 1000 commits in front of me, this will take about 10 days.

I am also on Arch. xorg-server=1.15.1, xf86-video-ati=7.3.0. linux=3.15.2.
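
The 10-day estimate above follows from bisection halving the candidate range at each step. A minimal sketch of the arithmetic, using the two figures given in the comment (around 1000 commits, about one day per verdict):

```python
import math

def bisect_steps(num_commits):
    """Approximate number of test rounds a binary search over num_commits needs."""
    return math.ceil(math.log2(num_commits))

commits = 1000      # "around 1000 commits" from the comment
days_per_test = 1   # "about a day to judge good vs bad"

steps = bisect_steps(commits)
print(steps, "steps, about", steps * days_per_test, "days")
```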
Comment 13 Marek Olšák 2014-07-01 11:37:24 UTC
(In reply to comment #12)
> FYI I see similar GPU stalls/display blanks with an upgrade from 10.1.4 ->
> 10.2.1. I am on Verde with VERDE_mc2.bin.
> 
> I started bisecting, but it takes me about a day to judge good vs bad, and
> as I have around 1000 commits in front of me, this will take about 10 days.
> 
> I am also on Arch. xorg-server=1.15.1, xf86-video-ati=7.3.0. linux=3.15.2.

Kernel 3.15 is indeed unstable on Cape Verde. We know where the problem is. If kernel 3.14 doesn't hang the GPU for you, it's the same problem.
Comment 14 Clemens Fruhwirth 2014-07-01 12:33:53 UTC
(In reply to comment #13)

> Kernel 3.15 is indeed unstable on Cape Verde. We know where the problem is.
> If kernel 3.14 doesn't hang the GPU for you, it's the same problem.

That doesn't correlate well with my testing:

3.14.x + 10.1.4 = stable
3.15.x + 10.1.4 = stable
3.14.x + 10.2.1 = unstable
3.15.x + 10.2.1 = unstable

I assume that it's something different :/
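
The reasoning behind that conclusion can be sketched as a check of which component, taken alone, predicts the outcome in all four runs. The table is copied from the comment. The `determines_outcome` helper is illustrative only.

```python
# Results as reported in the comment: (kernel, mesa) -> stable?
RESULTS = {
    ("3.14.x", "10.1.4"): True,
    ("3.15.x", "10.1.4"): True,
    ("3.14.x", "10.2.1"): False,
    ("3.15.x", "10.2.1"): False,
}

def determines_outcome(results, position):
    """True if the component at `position` alone predicts stability in every run."""
    seen = {}
    for key, stable in results.items():
        # Remember the first outcome seen for this version; any later
        # disagreement means the version alone does not decide stability.
        if seen.setdefault(key[position], stable) != stable:
            return False
    return True

print("kernel alone predicts outcome:", determines_outcome(RESULTS, 0))
print("mesa alone predicts outcome:  ", determines_outcome(RESULTS, 1))
```

In these four runs the Mesa version tracks stability and the kernel version does not, which is why a kernel-only explanation does not fit this data.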
Comment 15 Marti Raudsepp 2014-07-01 13:55:32 UTC
(In reply to comment #12)
> FYI I see similar GPU stalls/display blanks with an upgrade from 10.1.4 ->
> 10.2.1. I am on Verde with VERDE_mc2.bin.

What's the workload that causes this issue for you? Perhaps, if it's the same bug, running Geeqie slideshow can reproduce it faster.
Comment 16 Clemens Fruhwirth 2014-07-04 09:10:47 UTC
My most stable test case is unfortunately the Mathematica welcome screen. After some 5 mins of work in Chrome followed by a couple of Mathematica restarts, like 5-10, the lockup or segfault happens. Geeqie didn't do anything for me.

I haven't made any good progress with bisecting other than reconfirming that 

http://cgit.freedesktop.org/mesa/mesa/commit/?id=cb4ad1368551b64756c7b6e2007588e34739b188

fixes a segfault. This fix uncovers the lockup behavior.

Maybe apitrace is a way forward to generate something more testable.

Marek: Can you provide more information on the other known problem?
Comment 17 Marek Olšák 2014-07-04 11:16:09 UTC
(In reply to comment #16)
> My most stable test case is unfortunately the Mathematica welcome screen.
> After some 5 mins of work in Chrome followed by a couple of Mathematica
> restarts, like 5-10, the lockup or segfault happens. Geeqie didn't do
> anything for me.
> 
> I haven't made any good progress with bisecting other than reconfirming that 
> 
> http://cgit.freedesktop.org/mesa/mesa/commit/?id=cb4ad1368551b64756c7b6e2007588e34739b188
> 
> fixes a segfault. This fix uncovers the lockup behavior.

The commit fixes code which isn't used by radeonsi. I guess the bisection went wrong.

> 
> Maybe apitrace is a way forward to generate something more testable.
> 
> Marek: Can you provide more information on the other known problem?

Kernel 3.15-rc1 and later locks up randomly. There is a memory corruption when page tables are moved.
Comment 18 Clemens Fruhwirth 2014-07-04 18:12:15 UTC
(In reply to comment #17)
> (In reply to comment #16)
> > I haven't made any good progress with bisecting other than reconfirming that 
> > 
> > http://cgit.freedesktop.org/mesa/mesa/commit/?id=cb4ad1368551b64756c7b6e2007588e34739b188
> > 
> > fixes a segfault. This fix uncovers the lockup behavior.
> 
> The commit fixes code which isn't used by radeonsi. I guess the bisection
> went wrong.

It's sad to hear that my doubts about my test cases were not unfounded.

> > Marek: Can you provide more information on the other known problem?
> 
> Kernel 3.15-rc1 and later locks up randomly. There is a memory corruption
> when page tables are moved.

Please ping the bug when there is a fix I can try.
Comment 19 Clemens Fruhwirth 2014-08-11 19:52:42 UTC
I still see crashes with linux-3.15.18 and ati-dri 10.2.5.
Comment 20 Clemens Fruhwirth 2014-08-31 19:37:45 UTC
I would appreciate an update to this bug as I am stuck on 10.1.4.
Comment 21 Michel Dänzer 2014-09-01 06:04:50 UTC
(In reply to comment #20)
> I would appreciate an update to this bug as I am stuck on 10.1.4.

I'm afraid bisecting is still the best bet. To avoid bisecting to a wrong commit, make sure to wait long enough before declaring any commit good.

Also, if you can't test some commits because of the segfault fixed by http://cgit.freedesktop.org/mesa/mesa/commit/?id=cb4ad1368551b64756c7b6e2007588e34739b188 , you can apply that manually for testing those commits.
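
To illustrate why the waiting time matters, here is a rough sketch under a loudly stated assumption: hangs on a bad commit arrive at random with a constant average rate (an exponential model). Nothing in this report establishes that model, and the function names are hypothetical. The mean time to hang would have to be estimated from observed failures on known-bad builds.

```python
import math

def false_good_probability(wait_hours, mean_hours_to_hang):
    """Chance a bad commit shows no hang within wait_hours.

    Assumes exponentially distributed time to hang (an assumption,
    not something measured in this bug report).
    """
    return math.exp(-wait_hours / mean_hours_to_hang)

def all_verdicts_reliable(steps, wait_hours, mean_hours_to_hang):
    """Chance that none of `steps` bad commits is mislabeled as good.

    This is a pessimistic figure: it treats every tested commit as bad,
    while good commits can never be mislabeled this way.
    """
    miss = false_good_probability(wait_hours, mean_hours_to_hang)
    return (1 - miss) ** steps
```

Under this model, a single mislabeled commit sends the bisection into the wrong half of the history, so the per-step error has to be small enough to survive all of the steps.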
Comment 22 Michel Dänzer 2014-09-01 06:17:31 UTC
BTW, I assume Mesa 10.1.5/6 and the current Git 10.1 branch are not affected?
Comment 23 Clemens Fruhwirth 2014-12-11 20:11:55 UTC
(In reply to Michel Dänzer from comment #22)
> BTW, I assume Mesa 10.1.5/6 and the current Git 10.1 branch are not affected?

I tried my best to bisect it down, but the bisection over the last month was again unstable and took a long time to qualify a result, all in the midst of other software changes.

I said screw it and upgraded everything to the latest stable, and there are no more lockups. From my side, this bug can be closed.

