A different bug in kernel 3.3-rc1 causes GPU lockups to permanently disable the X server, which is how I noticed that my system has been logging GPU lockups in its syslog since November last year. Until 3.3-rc1 it always recovered, so I did not notice. The first lockup coincided with upgrading from 6.14.2 to 6.14.3, so I used this span of changes to search for the cause. With the commit "r5xx+: Fix vline setup with crtc offsets" (linked in the URL field) reverted from 6.14.3, I no longer see any lockups.

Kernel: first seen with 3.1, still happened with 3.3-rc1
Xserver: 1.11.1 ... 1.11.3

XRandR says:
Screen 0: minimum 320 x 200, current 2304 x 1024, maximum 8192 x 8192
DVI-1 connected 1280x1024+0+0 (normal left inverted right x axis y axis) 338mm x 270mm
   1280x1024      60.0*+   75.0
   1280x960       60.0
   1152x864       75.0
   1024x768       75.1     70.1     60.0
   832x624        74.6
   800x600        72.2     75.0     60.3     56.2
   640x480        72.8     75.0     66.7     60.0
   720x400        70.1
   640x400        70.0
DVI-0 connected 1024x768+1280+80 (normal left inverted right x axis y axis) 307mm x 230mm
   1024x768       60.0*+
   800x600        60.3
   640x480        60.0
DIN disconnected (normal left inverted right x axis y axis)

Example lockup:
Jan 27 21:41:01 thoregon kernel: [275709.590135] radeon 0000:07:00.0: GPU lockup CP stall for more than 10000msec
Jan 27 21:41:01 thoregon kernel: [275709.590143] GPU lockup (waiting for 0x0071970B last fence id 0x0071970A)
Jan 27 21:41:01 thoregon kernel: [275709.606356] radeon 0000:07:00.0: GPU softreset
Jan 27 21:41:01 thoregon kernel: [275709.606362] radeon 0000:07:00.0: R_008010_GRBM_STATUS=0xA0003028
Jan 27 21:41:01 thoregon kernel: [275709.606368] radeon 0000:07:00.0: R_008014_GRBM_STATUS2=0x00000002
Jan 27 21:41:01 thoregon kernel: [275709.606374] radeon 0000:07:00.0: R_000E50_SRBM_STATUS=0x200000C0
Jan 27 21:41:01 thoregon kernel: [275709.606385] radeon 0000:07:00.0: R_008020_GRBM_SOFT_RESET=0x00007FEE
Jan 27 21:41:01 thoregon kernel: [275709.621394] radeon 0000:07:00.0: R_008020_GRBM_SOFT_RESET=0x00000001
Jan 27 21:41:01 thoregon kernel: [275709.637404] radeon 0000:07:00.0: R_008010_GRBM_STATUS=0x00003028
Jan 27 21:41:01 thoregon kernel: [275709.637410] radeon 0000:07:00.0: R_008014_GRBM_STATUS2=0x00000002
Jan 27 21:41:01 thoregon kernel: [275709.637415] radeon 0000:07:00.0: R_000E50_SRBM_STATUS=0x200000C0
Jan 27 21:41:01 thoregon kernel: [275709.638421] radeon 0000:07:00.0: GPU reset succeed
Jan 27 21:41:01 thoregon kernel: [275709.643301] [drm] PCIE GART of 512M enabled (table at 0x0000000000040000).
Jan 27 21:41:01 thoregon kernel: [275709.643336] radeon 0000:07:00.0: WB enabled
Jan 27 21:41:01 thoregon kernel: [275709.689612] [drm] ring test succeeded in 1 usecs
Jan 27 21:41:01 thoregon kernel: [275709.689625] [drm] ib test succeeded in 1 usecs
Please attach your xorg log and dmesg output.
Created attachment 56258 [details]
dmesg from 3.2.0

As noted, I have seen these lockups with every kernel I have used underneath xf86-video-ati 6.14.3 (3.1.0 to 3.3-rc1).
Created attachment 56259 [details]
current xorg log

My current xorg log, with the patched 6.14.3 (with 3b9fdc807dd7e52af0576299cefba596040f6f2f reverted).

Lockups happen with server 1.11.1 (installed when I upgraded to 6.1.4.2), 1.11.2 and 1.11.3 (currently installed, as visible in the attached log). I do not have old Xorg logs from older kernels...
> ...upgraded to 6.1.4.2

That should have been "...upgraded to 6.14.3". Sorry for the spam.
Created attachment 56332 [details] [review]
Some vline wait debugging output

Can you run the driver with this patch and attach a snapshot of Xorg.0.log from when a lockup occurred? Beware that this might generate a lot of output.
I tried your debugging patch; at the point of the lockup the xorg log was around 550k, but the relevant part is rather short.

syslog, for comparing the timestamps of the lockups:

[ 1508.670067] radeon 0000:07:00.0: GPU lockup CP stall for more than 10020msec
[ 1508.670070] GPU lockup (waiting for 0x0000316B last fence id 0x0000316A)
[ 1508.686127] radeon 0000:07:00.0: GPU softreset
[ 1508.686129] radeon 0000:07:00.0: R_008010_GRBM_STATUS=0xA0003028
[ 1508.686131] radeon 0000:07:00.0: R_008014_GRBM_STATUS2=0x00000002
[ 1508.686133] radeon 0000:07:00.0: R_000E50_SRBM_STATUS=0x200000C0
[ 1508.686142] radeon 0000:07:00.0: R_008020_GRBM_SOFT_RESET=0x00007FEE
[ 1508.701160] radeon 0000:07:00.0: R_008020_GRBM_SOFT_RESET=0x00000001
[ 1508.717146] radeon 0000:07:00.0: R_008010_GRBM_STATUS=0x00003028
[ 1508.717148] radeon 0000:07:00.0: R_008014_GRBM_STATUS2=0x00000002
[ 1508.717150] radeon 0000:07:00.0: R_000E50_SRBM_STATUS=0x200000C0
[ 1508.718149] radeon 0000:07:00.0: GPU reset succeed
[ 1508.720959] [drm] PCIE GART of 512M enabled (table at 0x0000000000040000).
[ 1508.720977] radeon 0000:07:00.0: WB enabled
[ 1508.767090] [drm] ring test succeeded in 1 usecs
[ 1508.767097] [drm] ib test succeeded in 1 usecs
[ 1520.080077] radeon 0000:07:00.0: GPU lockup CP stall for more than 10040msec
[ 1520.080079] GPU lockup (waiting for 0x0000317E last fence id 0x0000317D)
[ 1520.096133] radeon 0000:07:00.0: GPU softreset
[ 1520.096135] radeon 0000:07:00.0: R_008010_GRBM_STATUS=0xA0003028
[ 1520.096137] radeon 0000:07:00.0: R_008014_GRBM_STATUS2=0x00000002
[ 1520.096139] radeon 0000:07:00.0: R_000E50_SRBM_STATUS=0x200000C0
[ 1520.096147] radeon 0000:07:00.0: R_008020_GRBM_SOFT_RESET=0x00007FEE
[ 1520.111165] radeon 0000:07:00.0: R_008020_GRBM_SOFT_RESET=0x00000001
[ 1520.127151] radeon 0000:07:00.0: R_008010_GRBM_STATUS=0x00003028
[ 1520.127152] radeon 0000:07:00.0: R_008014_GRBM_STATUS2=0x00000002
[ 1520.127154] radeon 0000:07:00.0: R_000E50_SRBM_STATUS=0x200000C0
[ 1520.128153] radeon 0000:07:00.0: GPU reset succeed
[ 1520.130943] [drm] PCIE GART of 512M enabled (table at 0x0000000000040000).
[ 1520.130960] radeon 0000:07:00.0: WB enabled
[ 1520.177065] [drm] ring test succeeded in 1 usecs
[ 1520.177071] [drm] ib test succeeded in 1 usecs

The corresponding part of Xorg.0.log:

[ 888.770] (start, stop) before clamping: (997, 1024)
[ 888.771] (start, stop) after clamping: (997, 1024), crtc->mode.VDisplay=1024
[ 888.771] (start, stop) after translation: (997, 1024)
[ 888.854] (start, stop) before clamping: (997, 1024)
[ 888.854] (start, stop) after clamping: (997, 1024), crtc->mode.VDisplay=1024
[ 888.854] (start, stop) after translation: (997, 1024)
[ 889.920] (start, stop) before clamping: (793, 807)
[ 889.920] (start, stop) after clamping: (793, 768), crtc->mode.VDisplay=768
[ 1191.324] (start, stop) before clamping: (997, 1024)
[ 1191.324] (start, stop) after clamping: (997, 1024), crtc->mode.VDisplay=1024
[ 1191.325] (start, stop) after translation: (997, 1024)
[ 1491.730] (start, stop) before clamping: (80, 848)
[ 1491.731] (start, stop) after clamping: (80, 768), crtc->mode.VDisplay=768
[ 1491.731] (start, stop) after translation: (160, 848)
[ 1491.731] (start, stop) before clamping: (997, 1024)
[ 1491.731] (start, stop) after clamping: (997, 1024), crtc->mode.VDisplay=1024
[ 1491.731] (start, stop) after translation: (997, 1024)
[ 1509.311] (start, stop) before clamping: (80, 848)
[ 1509.311] (start, stop) after clamping: (80, 768), crtc->mode.VDisplay=768
[ 1509.311] (start, stop) after translation: (160, 848)
[ 1509.311] (start, stop) before clamping: (997, 1024)
[ 1509.311] (start, stop) after clamping: (997, 1024), crtc->mode.VDisplay=1024
[ 1509.311] (start, stop) after translation: (997, 1024)
[ 1520.608] (start, stop) before clamping: (793, 807)
[ 1520.609] (start, stop) after clamping: (793, 768), crtc->mode.VDisplay=768
[ 1520.609] (start, stop) before clamping: (997, 1024)
[ 1520.609] (start, stop) after clamping: (997, 1024), crtc->mode.VDisplay=1024
[ 1520.609] (start, stop) after translation: (997, 1024)
[ 2154.377] (start, stop) before clamping: (793, 807)
[ 2154.377] (start, stop) after clamping: (793, 768), crtc->mode.VDisplay=768
[ 2154.377] (start, stop) before clamping: (997, 1024)
[ 2154.377] (start, stop) after clamping: (997, 1024), crtc->mode.VDisplay=1024
[ 2154.377] (start, stop) after translation: (997, 1024)
[ 2154.393] (start, stop) before clamping: (80, 848)
[ 2154.393] (start, stop) after clamping: (80, 768), crtc->mode.VDisplay=768
[ 2154.393] (start, stop) after translation: (160, 848)
[ 2154.393] (start, stop) before clamping: (997, 1024)
[ 2154.393] (start, stop) after clamping: (997, 1024), crtc->mode.VDisplay=1024
[ 2154.393] (start, stop) after translation: (997, 1024)
[ 2154.826] (start, stop) before clamping: (793, 807)
[ 2154.826] (start, stop) after clamping: (793, 768), crtc->mode.VDisplay=768
[ 2155.326] (start, stop) before clamping: (997, 1024)
[ 2155.326] (start, stop) after clamping: (997, 1024), crtc->mode.VDisplay=1024
[ 2155.326] (start, stop) after translation: (997, 1024)
[ 2157.325] (start, stop) before clamping: (997, 1024)
[ 2157.325] (start, stop) after clamping: (997, 1024), crtc->mode.VDisplay=1024
[ 2157.325] (start, stop) after translation: (997, 1024)

(I kept the complete log and can attach it, if you need it...)
Created attachment 56365 [details] [review]
Correctly initialize DESKTOP_HEIGHT register

AFAICT this is actually a kernel bug, which was merely exposed by the (also not quite correct) X driver change. This patch should fix it.
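For readers without the attachment, a hypothetical sketch (not the actual patch) of the kind of per-CRTC register initialization this is about. WREG32, AVIVO_D1MODE_DESKTOP_HEIGHT and radeon_crtc->crtc_offset do exist in the radeon KMS driver, but the function name, the call site and the value written here are assumptions made purely for illustration:

/* Hypothetical sketch, not the attached patch: program the per-CRTC
 * DESKTOP_HEIGHT register during display setup.  Whether the correct
 * value is the framebuffer height or the mode height is exactly what
 * the real fix decides, so treat this only as an outline. */
static void sketch_init_desktop_height(struct radeon_device *rdev,
                                        struct radeon_crtc *radeon_crtc,
                                        const struct drm_framebuffer *fb)
{
        WREG32(AVIVO_D1MODE_DESKTOP_HEIGHT + radeon_crtc->crtc_offset,
               fb->height);
}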
Created attachment 56367 [details] [review]
vline coordinate space fixes

If the kernel patch doesn't fix the lockups, please attach the debugging output from this X driver patch.
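To make the "before clamping / after clamping / after translation" lines in the debugging output above concrete, here is a small self-contained sketch of that kind of coordinate handling. The struct and function names are made up for illustration and the attached patch is authoritative:

/* Illustrative only: clamp a requested vline range against the CRTC's
 * visible height (crtc->mode.VDisplay in the log) and translate it by
 * the CRTC's vertical offset within the framebuffer. */
#include <stdio.h>

struct crtc_info {
        int y_offset;   /* CRTC position within the framebuffer, e.g. +80 */
        int vdisplay;   /* crtc->mode.VDisplay, e.g. 768 or 1024 */
};

static void vline_range(const struct crtc_info *crtc, int start, int stop,
                        int *out_start, int *out_stop)
{
        /* Clamp to the CRTC's visible scanout range. */
        if (start < 0)
                start = 0;
        if (stop > crtc->vdisplay)
                stop = crtc->vdisplay;

        /* Translate back into framebuffer coordinates. */
        *out_start = start + crtc->y_offset;
        *out_stop = stop + crtc->y_offset;
}

int main(void)
{
        /* Mirrors one of the logged cases: CRTC at +80 with VDisplay=768. */
        struct crtc_info dvi0 = { .y_offset = 80, .vdisplay = 768 };
        int s, e;

        vline_range(&dvi0, 80, 848, &s, &e);
        printf("(start, stop) after clamping and translation: (%d, %d)\n", s, e);
        return 0;
}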
That patch (attachment 56365 [details] [review]) seems to work. The patched xf86-video-ati (6.14.3 + your first debugging patch) did not cause any GPU lockups with kernel 3.3-rc1 + Jerome's mutex fix for the GPU reset path + your DESKTOP_HEIGHT patch.

While I do have a sure way to see whether the bug has been triggered (GPU lockup messages in the syslog), I do not have a reliable way to see whether the trigger was there but did not lead to a lockup. I'm rather sure that it is fixed, but lacking a reliable way to trigger it, it is impossible to confirm this 100%. I will keep trying to trigger it and will definitely let you know if that lockup returns.

But I think I'm starting to lose the overview of how many interconnected bugs are in play:

Bug 1) The kernel causes a GPU lockup on invalid requests -> I think that one should be fixed by attachment 56365 [details] [review].

Bug 2) Regression(?) from 6.14.2 to 6.14.3 that causes the driver to send these invalid requests. Is attachment 56367 [details] [review] supposed to fix this? Do you want me to try that patch against vanilla 3.2 to see if there are still GPU lockups logged?

Bug 3) Regression from 3.2 to 3.3-rc1 with the recursive mutex lock on GPU reset -> That one has been fixed in mainline with Jerome's patch.

Bug 4) Regression from 3.2 to 3.3-rc1 where the X server fails to recover from a GPU lockup even with Bug 3 fixed -> Do you think it is worth trying to find/fix this, even though with Bug 1 fixed these lockups can't be triggered anymore?
(In reply to comment #9)
> That patch (attachment 56365 [details] [review]) seems to work.

Great, I'll submit the fix.

> Bug 1) The kernel causes a GPU lockup on invalid requests -> I think that one
> should be fixed by attachment 56365 [details] [review].

Not (only) on invalid requests, but potentially even on valid requests, if the vertical CRTC scanout start is larger than the vertical blank period.

> Bug 2) Regression(?) from 6.14.2 to 6.14.3 that causes the driver to send these
> invalid requests. Is attachment 56367 [details] [review] supposed to fix this?

It does fix a bug in 6.14.3, but that bug doesn't result in invalid requests to the kernel, merely in vline ranges which start too far towards the bottom.

> Do you want me to try that patch against vanilla 3.2 to see if there are
> still GPU lockups logged?

The X driver fix can't avoid the kernel bug. I'll submit it to the xorg-driver-ati mailing list for review, but it's not directly relevant for this bug; I just attached it for the sake of completeness and potential debugging information.

I think 4) means 3) isn't completely fixed yet.
(In reply to comment #10)
> Great, I'll submit the fix.

Thanks for your fix. :-) And your explanations.

> I think 4) means 3) isn't completely fixed yet.

I suspect 3) and 4) really are different bugs, because the lockup on the mutex was so obvious, and after Jerome's patch it obviously no longer happens. But I think 3+4 are better handled at https://bugzilla.kernel.org/show_bug.cgi?id=42678, so I will bother the kernel people there and keep that bug open even after your patch for 1) is in the kernel.

But it should be correct to close this bug once your patch is in.
Dave applied the fix to the drm-fixes tree. We should get a notification here when it hits mainline; we can resolve this as fixed then.
I had seen Dave's pull request, and now the fix is in mainline:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=1b61925061660009f5b8047f93c5297e04541273

OK, I'm closing this as fixed. :-)