Created attachment 100481 [details]
gdm-Xorg:0.log

I'm using Arch Linux testing repositories with Mesa 10.2.0rc5.

After a rather long session of browsing pictures using the Geeqie image viewer, the display froze up, and after some time passed, Xorg crashed and restarted. The new Xorg had GPU acceleration disabled. I had done a suspend & resume cycle after the last boot, prior to the crash.

Versions:
ati-dri 10.2.0rc5
mesa 10.2.0rc5
xorg-server 1.15.1
kernel 3.14.4
xf86-video-ati 7.3.0

In the log files:
20:10:27 - resume from suspend
20:58:37 - kernel: 0000:01:00.0: GPU lockup CP stall for more than 10000msec
20:59:04 - Xorg: The kernel rejected CS, see dmesg for more information.
20:59:04 - Xorg segfault
20:59:05 - new Xorg: RADEON(0): Direct rendering disabled

Xorg backtrace:
0: /usr/bin/Xorg (xorg_backtrace+0x48) [0x584b08]
1: /usr/bin/Xorg (0x400000+0x1887f9) [0x5887f9]
2: /usr/lib/libpthread.so.0 (0x7fb78e9a4000+0xf4b0) [0x7fb78e9b34b0]
3: /usr/lib/xorg/modules/dri/radeonsi_dri.so (0x7fb786e9e000+0x68b7c) [0x7fb786f06b7c]
4: /usr/lib/xorg/modules/dri/radeonsi_dri.so (0x7fb786e9e000+0x6b401) [0x7fb786f09401]
5: /usr/lib/xorg/modules/dri/radeonsi_dri.so (0x7fb786e9e000+0x48884) [0x7fb786ee6884]
6: /usr/lib/xorg/modules/dri/radeonsi_dri.so (0x7fb786e9e000+0x19ac9b) [0x7fb787038c9b]
7: /usr/lib/xorg/modules/dri/radeonsi_dri.so (0x7fb786e9e000+0x1a2dc8) [0x7fb787040dc8]
8: /usr/lib/xorg/modules/dri/radeonsi_dri.so (0x7fb786e9e000+0x1a530c) [0x7fb78704330c]
9: /usr/lib/xorg/modules/dri/radeonsi_dri.so (0x7fb786e9e000+0x13b8e4) [0x7fb786fd98e4]
10: /usr/lib/xorg/modules/dri/radeonsi_dri.so (0x7fb786e9e000+0x13cc50) [0x7fb786fdac50]
11: /usr/lib/libglamor.so.0 (0x7fb78b26d000+0x1e2a7) [0x7fb78b28b2a7]
12: /usr/lib/libglamor.so.0 (0x7fb78b26d000+0x1e892) [0x7fb78b28b892]
13: /usr/lib/libglamor.so.0 (0x7fb78b26d000+0x1f184) [0x7fb78b28c184]
14: /usr/lib/libglamor.so.0 (0x7fb78b26d000+0x1fc70) [0x7fb78b28cc70]
15: /usr/lib/libglamor.so.0 (0x7fb78b26d000+0x524c) [0x7fb78b27224c]
16: /usr/lib/libglamor.so.0 (0x7fb78b26d000+0x5b74) [0x7fb78b272b74]
17: /usr/lib/libglamor.so.0 (0x7fb78b26d000+0x6bb5) [0x7fb78b273bb5]
18: /usr/lib/libglamor.so.0 (0x7fb78b26d000+0x6e9a) [0x7fb78b273e9a]
19: /usr/bin/Xorg (miCopyRegion+0x1ad) [0x564f7d]
20: /usr/bin/Xorg (miDoCopy+0x456) [0x565506]
21: /usr/lib/libglamor.so.0 (0x7fb78b26d000+0x6edd) [0x7fb78b273edd]
22: /usr/bin/Xorg (0x400000+0x112258) [0x512258]
23: /usr/bin/Xorg (0x400000+0xced0a) [0x4ced0a]
24: /usr/bin/Xorg (0x400000+0xd02e8) [0x4d02e8]
25: /usr/bin/Xorg (0x400000+0x35c8e) [0x435c8e]
26: /usr/bin/Xorg (0x400000+0x39aaa) [0x439aaa]
27: /usr/lib/libc.so.6 (__libc_start_main+0xf0) [0x7fb78d611000]
28: /usr/bin/Xorg (0x400000+0x2507e) [0x42507e]
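The unsymbolized frames above each carry a load base, an offset, and an absolute address. As a rough sketch (not part of the original report), the following Python pulls the module-relative offsets out of such frames and checks that base + offset matches the logged address. Those offsets are what a symbolizer would need, assuming matching debug symbols for the modules are installed. The names `FRAME_RE`, `SAMPLE`, and `parse_frames` are made up for illustration, and the sample contains only three frames copied from the backtrace.

```python
import re

# Matches frames of the form "N: /path/module (0xBASE+0xOFFSET) [0xADDR]".
# Frames that already show a symbol name, e.g. "(miCopyRegion+0x1ad)", are skipped.
FRAME_RE = re.compile(
    r"(?P<index>\d+): (?P<module>\S+) "
    r"\((?P<base>0x[0-9a-f]+)\+(?P<offset>0x[0-9a-f]+)\) "
    r"\[(?P<addr>0x[0-9a-f]+)\]"
)

# Three frames copied from the backtrace in this report.
SAMPLE = """\
3: /usr/lib/xorg/modules/dri/radeonsi_dri.so (0x7fb786e9e000+0x68b7c) [0x7fb786f06b7c]
11: /usr/lib/libglamor.so.0 (0x7fb78b26d000+0x1e2a7) [0x7fb78b28b2a7]
19: /usr/bin/Xorg (miCopyRegion+0x1ad) [0x564f7d]
"""


def parse_frames(text):
    """Return one dict per unsymbolized frame found in `text`."""
    frames = []
    for match in FRAME_RE.finditer(text):
        base = int(match["base"], 16)
        offset = int(match["offset"], 16)
        addr = int(match["addr"], 16)
        frames.append(
            {
                "index": int(match["index"]),
                "module": match["module"],
                "offset": offset,
                # Sanity check: the logged address should equal base + offset.
                "consistent": base + offset == addr,
            }
        )
    return frames


for frame in parse_frames(SAMPLE):
    print(frame["index"], frame["module"], hex(frame["offset"]), frame["consistent"])
```

A gdb backtrace with debugging symbols remains the more useful artifact; this only helps confirm which module and offset each raw frame points at.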
Created attachment 100482 [details] kernel.log
It might be worth getting the PITCAIRN_mc2.bin microcode and trying a 3.15 kernel. Does this happen with particular pictures or randomly?
> Does this happen with particular pictures or randomly?

I haven't tried to reproduce it yet; I will try to find the time for it.

> It might be worth getting the PITCAIRN_mc2.bin microcode and trying a 3.15 kernel.

I guess I'll try that once I've managed to reproduce it; otherwise we won't know if it fixes anything.

PITCAIRN_mc2.bin will be included with the kernel 3.15 firmware package, right?
(In reply to comment #3)
> > Does this happen with particular pictures or randomly?
>
> I haven't tried to reproduce it yet, I will try to find the time for it.
>
> > It might be worth getting the PITCAIRN_mc2.bin microcode and trying a 3.15 kernel.
>
> I guess once I've managed to reproduce it, otherwise we don't know if it
> fixes anything.
>
> PITCAIRN_mc2.bin will be included with the kernel 3.15 firmware package
> right?

It's available in the linux-firmware tree:
http://git.kernel.org/cgit/linux/kernel/git/firmware/linux-firmware.git/commit/?id=4848a1e059de5aa723ff6627b90de8a5d9632626
(In reply to comment #2)
> Does this happen with particular pictures or randomly?

Randomly. I can reproduce this pretty reliably now by enabling the slideshow in Geeqie, setting the delay to 0.2s, and letting it run for 15 minutes or so. If you need more information about the crash, I can provide it.

> It might be worth getting the PITCAIRN_mc2.bin microcode and trying a 3.15
> kernel.

Indeed, after upgrading to kernel 3.15rc8 and git firmware, I can no longer reproduce the issue. It's been running without problems for 45 minutes now.
Just to be clear, there's still a bug in Xorg segfaulting, right? Should it be able to survive GPU hangs?
(In reply to comment #6)
> Just to be clear, there's still a bug in Xorg segfaulting, right? Should it
> be able to survive GPU hangs?

If you can get a gdb backtrace with debugging symbols for /usr/lib/xorg/modules/dri/radeonsi_dri.so and /usr/lib/libglamor.so.0, we can see what we can do. But realistically, it's not always possible to recover gracefully from GPU hangs.
Created attachment 100902 [details] gdb-backtrace2.txt Here's a backtrace from a crashed Xorg (including "bt" and "bt full").
(In reply to comment #0)
> After a rather long session of browsing pictures using the Geeqie image
> viewer, the display froze up, and after some time passed, Xorg crashed and
> restarted. The new Xorg had GPU acceleration disabled.

(In reply to comment #8)
> Here's a backtrace from a crashed Xorg (including "bt" and "bt full").

Hmm. I think it should be possible to avoid that crash. However, given what you said above, the X server won't be able to actually draw anything anymore. Would that really be an improvement?
(In reply to comment #9)
> the X server won't be able to actually draw anything
> anymore. Would that really be an improvement?

I don't know, that's your call to make.

Other drivers like Intel's are capable of recovering from GPU hangs (sometimes?) by resetting the hardware or whatnot. I wasn't sure whether Radeon does or not.
(In reply to comment #10)
> (In reply to comment #9)
> > the X server won't be able to actually draw anything
> > anymore. Would that really be an improvement?
>
> I don't know, that's your call to make.
>
> Other drivers like Intel's are capable of recovering from GPU hangs
> (sometimes?) by resetting the hardware or whatnot. I wasn't sure whether
> Radeon does or not.

Radeon attempts to reset the hw as well. It's not always successful.
FYI, I see similar GPU stalls/display blanks after an upgrade from 10.1.4 -> 10.2.1. I am on Verde with VERDE_mc2.bin.

I started bisecting, but it takes me about a day to judge good vs bad, and as I have around 1000 commits in front of me, this will take about 10 days.

I am also on Arch. xorg-server=1.15.1, xf86-video-ati=7.3.0. linux=3.15.2.
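The 10-day estimate follows from how bisection halves the candidate range at each step. A quick sketch of the arithmetic in Python (the function name `bisect_steps` is made up for illustration; the 1000 commits and one day per test are the figures given above):

```python
import math


def bisect_steps(commit_count):
    """Approximate number of test rounds a binary search over commits needs."""
    if commit_count < 1:
        raise ValueError("commit_count must be at least 1")
    return math.ceil(math.log2(commit_count))


commits = 1000       # rough size of the 10.1.4 -> 10.2.1 range, per the comment
days_per_test = 1    # about one day to judge a commit good or bad

steps = bisect_steps(commits)
print(steps, "steps, about", steps * days_per_test, "days")  # 10 steps, about 10 days
```

The estimate assumes every verdict is correct; a single wrong "good" sends the search into the wrong half, which is why an intermittent hang makes bisection costly.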
(In reply to comment #12)
> FYI I see a similar GPU stalls/display blanks with an upgrade from 10.1.4 ->
> 10.2.1. I am on Verde with VERDE_mc2.bin.
>
> I started bisecting, but it takes me about a day to judge good vs bad, and
> as I have around 1000 commits in front of me, this will take about 10 days.
>
> I am also on Arch. xorg-server=1.15.1, xf86-video-ati=7.3.0. linux=3.15.2.

Kernel 3.15 is indeed unstable on Cape Verde. We know where the problem is. If kernel 3.14 doesn't hang the GPU for you, it's the same problem.
(In reply to comment #13)
> Kernel 3.15 is indeed unstable on Cape Verde. We know where the problem is.
> If kernel 3.14 doesn't hang the GPU for you, it's the same problem.

That doesn't correlate well with my testing:

3.14.x + 10.1.4 = stable
3.15.x + 10.1.4 = stable
3.14.x + 10.2.1 = unstable
3.15.x + 10.2.1 = unstable

I assume that it's something different :/
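The four combinations above can be checked mechanically to see which component tracks the outcome. This is only a sketch over the results as reported in this comment; the names `results` and `factor_tracks_outcome` are made up for illustration.

```python
# (kernel, mesa) -> True if the combination was reported stable.
results = {
    ("3.14.x", "10.1.4"): True,
    ("3.15.x", "10.1.4"): True,
    ("3.14.x", "10.2.1"): False,
    ("3.15.x", "10.2.1"): False,
}


def factor_tracks_outcome(results, position):
    """True if the component at `position` alone determines stability.

    Every version of that component must always give the same outcome,
    and at least two versions must give different outcomes.
    """
    outcomes = {}
    for key, stable in results.items():
        outcomes.setdefault(key[position], set()).add(stable)
    each_value_consistent = all(len(seen) == 1 for seen in outcomes.values())
    values_differ = len({next(iter(seen)) for seen in outcomes.values()}) > 1
    return each_value_consistent and values_differ


print("kernel tracks outcome:", factor_tracks_outcome(results, 0))  # False
print("mesa tracks outcome:", factor_tracks_outcome(results, 1))    # True
```

With these four data points the Mesa version separates stable from unstable and the kernel version does not, which is the basis for suspecting a different problem from the known 3.15 kernel issue.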
(In reply to comment #12)
> FYI I see a similar GPU stalls/display blanks with an upgrade from 10.1.4 ->
> 10.2.1. I am on Verde with VERDE_mc2.bin.

What's the workload that causes this issue for you? Perhaps, if it's the same bug, running the Geeqie slideshow can reproduce it faster.
My most stable test case is unfortunately the Mathematica welcome screen. After some 5 mins of work in Chrome followed by a couple of Mathematica restarts, like 5-10, the lockup or segfault happens. Geeqie didn't do anything for me. I haven't made any good progress with bisecting other than reconfirming that http://cgit.freedesktop.org/mesa/mesa/commit/?id=cb4ad1368551b64756c7b6e2007588e34739b188 fixes a segfault. This fix uncovers the lockup behavior. Maybe apitrace is a way forward to generate something more testable. Marek: Can you provide more information on the other known problem?
(In reply to comment #16)
> My most stable test case is unfortunately the Mathematica welcome screen.
> After some 5 mins of work in Chrome followed by a couple of Mathematica
> restarts, like 5-10, the lockup or segfault happens. Geeqie didn't do
> anything for me.
>
> I haven't made any good progress with bisecting other than reconfirming that
>
> http://cgit.freedesktop.org/mesa/mesa/commit/?id=cb4ad1368551b64756c7b6e2007588e34739b188
>
> fixes a segfault. This fix uncovers the lockup behavior.

The commit fixes code which isn't used by radeonsi. I guess the bisection went wrong.

> Maybe apitrace is a way forward to generate something more testable.
>
> Marek: Can you provide more information on the other known problem?

Kernel 3.15-rc1 and later locks up randomly. There is a memory corruption when page tables are moved.
(In reply to comment #17)
> (In reply to comment #16)
> > I haven't made any good progress with bisecting other than reconfirming that
> >
> > http://cgit.freedesktop.org/mesa/mesa/commit/?id=cb4ad1368551b64756c7b6e2007588e34739b188
> >
> > fixes a segfault. This fix uncovers the lockup behavior.
>
> The commit fixes code which isn't used by radeonsi. I guess the bisection
> went wrong.

It's sad to hear that my doubts about my test cases were not unfounded.

> > Marek: Can you provide more information on the other known problem?
>
> Kernel 3.15-rc1 and later locks up randomly. There is a memory corruption
> when page tables are moved.

Please ping the bug when there is a fix I can try.
I still see crashes with linux-3.15.18 and ati-dri 10.2.5.
I would appreciate an update to this bug as I am stuck on 10.1.4.
(In reply to comment #20)
> I would appreciate an update to this bug as I am stuck on 10.1.4.

I'm afraid bisecting is still the best bet. To avoid bisecting to a wrong commit, make sure to wait long enough before declaring any commit good. Also, if you can't test some commits because of the segfault fixed by http://cgit.freedesktop.org/mesa/mesa/commit/?id=cb4ad1368551b64756c7b6e2007588e34739b188, you can apply that manually for testing those commits.
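To get a feel for what "long enough" means, here is a back-of-the-envelope sketch in Python. It assumes, purely for illustration, that hangs on a bad commit arrive at a constant average rate (an exponential waiting-time model), which nothing in this bug confirms. The function names and the 4-hour mean are hypothetical.

```python
import math


def false_good_probability(wait_hours, mean_hours_to_hang):
    """Chance that a bad commit survives `wait_hours` without hanging.

    Assumes exponentially distributed time to hang (an assumption, not
    something measured in this bug).
    """
    return math.exp(-wait_hours / mean_hours_to_hang)


def wait_for_confidence(mean_hours_to_hang, max_false_good):
    """Hours to wait so a bad commit is mislabeled good at most `max_false_good` of the time."""
    return -mean_hours_to_hang * math.log(max_false_good)


mean_hours = 4.0  # hypothetical average time until a hang on a bad commit
print(false_good_probability(mean_hours, mean_hours))
print(wait_for_confidence(mean_hours, 0.05))
```

Under this model, waiting only as long as the average time to hang still mislabels a bad commit as good more than a third of the time, so the wait per commit needs to be several times the typical time to failure.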
BTW, I assume Mesa 10.1.5/6 and the current Git 10.1 branch are not affected?
(In reply to Michel Dänzer from comment #22)
> BTW, I assume Mesa 10.1.5/6 and the current Git 10.1 branch are not affected?

I tried my best to bisect it down, but the bisection over the last month was again unstable and took a long time to qualify a result, all in the midst of other software changes. I said screw it and upgraded everything to the latest stable, and there are no more lockups.

From my side, this bug can be closed.