Bug 45329 - [RV730]vline setup changes in xf86-video-ati-6.14.3 causes GPU lockups
Summary: [RV730]vline setup changes in xf86-video-ati-6.14.3 causes GPU lockups
Status: RESOLVED FIXED
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/Radeon
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: xf86-video-ati maintainers
QA Contact: Xorg Project Team
URL: http://cgit.freedesktop.org/xorg/driv...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-01-28 02:10 UTC by Torsten Kaiser
Modified: 2012-02-02 12:15 UTC (History)
0 users



Attachments
dmesg from 3.2.0 (63.97 KB, text/plain)
2012-01-28 07:59 UTC, Torsten Kaiser
current xorg log (47.52 KB, text/plain)
2012-01-28 08:06 UTC, Torsten Kaiser
Some vline wait debugging output (865 bytes, patch)
2012-01-30 08:21 UTC, Michel Dänzer
Correctly initialize DESKTOP_HEIGHT register (916 bytes, patch)
2012-01-31 03:23 UTC, Michel Dänzer
vline coordinate space fixes (2.90 KB, patch)
2012-01-31 03:25 UTC, Michel Dänzer

Description Torsten Kaiser 2012-01-28 02:10:30 UTC
A different bug in kernel 3.3-rc1 causes GPU lockups to permanently disable the X server, which is how I noticed that my system has been logging GPU lockups in its syslog since November last year. Until 3.3-rc1 it always recovered, so I had not noticed.

The first lockup coincided with upgrading from 6.14.2 to 6.14.3, so I used this span of changes to search for the cause.

With the commit "r5xx+: Fix vline setup with crtc offsets" (linked in URL field) reverted from 6.14.3 I no longer see any lockups.

Kernel: first seen with 3.1, still happens with 3.3-rc1
Xserver: 1.11.1 ... 1.11.3

XRandR says:
Screen 0: minimum 320 x 200, current 2304 x 1024, maximum 8192 x 8192
DVI-1 connected 1280x1024+0+0 (normal left inverted right x axis y axis) 338mm x 270mm
   1280x1024      60.0*+   75.0  
   1280x960       60.0  
   1152x864       75.0  
   1024x768       75.1     70.1     60.0  
   832x624        74.6  
   800x600        72.2     75.0     60.3     56.2  
   640x480        72.8     75.0     66.7     60.0  
   720x400        70.1  
   640x400        70.0  
DVI-0 connected 1024x768+1280+80 (normal left inverted right x axis y axis) 307mm x 230mm
   1024x768       60.0*+
   800x600        60.3  
   640x480        60.0  
DIN disconnected (normal left inverted right x axis y axis)

example lockup:
Jan 27 21:41:01 thoregon kernel: [275709.590135] radeon 0000:07:00.0: GPU lockup CP stall for more than 10000msec
Jan 27 21:41:01 thoregon kernel: [275709.590143] GPU lockup (waiting for 0x0071970B last fence id 0x0071970A)
Jan 27 21:41:01 thoregon kernel: [275709.606356] radeon 0000:07:00.0: GPU softreset 
Jan 27 21:41:01 thoregon kernel: [275709.606362] radeon 0000:07:00.0:   R_008010_GRBM_STATUS=0xA0003028
Jan 27 21:41:01 thoregon kernel: [275709.606368] radeon 0000:07:00.0:   R_008014_GRBM_STATUS2=0x00000002
Jan 27 21:41:01 thoregon kernel: [275709.606374] radeon 0000:07:00.0:   R_000E50_SRBM_STATUS=0x200000C0
Jan 27 21:41:01 thoregon kernel: [275709.606385] radeon 0000:07:00.0:   R_008020_GRBM_SOFT_RESET=0x00007FEE
Jan 27 21:41:01 thoregon kernel: [275709.621394] radeon 0000:07:00.0: R_008020_GRBM_SOFT_RESET=0x00000001
Jan 27 21:41:01 thoregon kernel: [275709.637404] radeon 0000:07:00.0:   R_008010_GRBM_STATUS=0x00003028
Jan 27 21:41:01 thoregon kernel: [275709.637410] radeon 0000:07:00.0:   R_008014_GRBM_STATUS2=0x00000002
Jan 27 21:41:01 thoregon kernel: [275709.637415] radeon 0000:07:00.0:   R_000E50_SRBM_STATUS=0x200000C0
Jan 27 21:41:01 thoregon kernel: [275709.638421] radeon 0000:07:00.0: GPU reset succeed
Jan 27 21:41:01 thoregon kernel: [275709.643301] [drm] PCIE GART of 512M enabled (table at 0x0000000000040000).
Jan 27 21:41:01 thoregon kernel: [275709.643336] radeon 0000:07:00.0: WB enabled
Jan 27 21:41:01 thoregon kernel: [275709.689612] [drm] ring test succeeded in 1 usecs
Jan 27 21:41:01 thoregon kernel: [275709.689625] [drm] ib test succeeded in 1 usecs
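
Since these syslog messages are the only indicator of the problem mentioned in this report, here is a minimal sketch for picking the fence IDs out of such a line. It assumes the message format shown above; `parse_lockup_fences` is a hypothetical helper, not part of the kernel or the X driver. In every lockup quoted in this report, the fence being waited for is exactly one past the last signalled fence.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not driver or kernel code): extracts the two fence IDs
 * from a "GPU lockup (waiting for 0x... last fence id 0x...)" syslog line.
 * Returns 1 if the line matches, 0 otherwise. */
int parse_lockup_fences(const char *line,
                        unsigned long *waiting, unsigned long *last)
{
    assert(line != NULL && waiting != NULL && last != NULL);

    /* Skip the syslog prefix (date, host, kernel timestamp). */
    const char *p = strstr(line, "GPU lockup (waiting for ");
    if (p == NULL)
        return 0;

    /* %lx accepts the optional "0x" prefix used in the log. */
    return sscanf(p, "GPU lockup (waiting for %lx last fence id %lx",
                  waiting, last) == 2;
}
```

The "CP stall" line and the register dump lines do not match, so only the fence line of each lockup is counted.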
Comment 1 Alex Deucher 2012-01-28 06:49:00 UTC
Please attach your xorg log and dmesg output.
Comment 2 Torsten Kaiser 2012-01-28 07:59:59 UTC
Created attachment 56258
dmesg from 3.2.0

dmesg from 3.2.0
As noted, I have seen these lockups with every kernel that I used beneath xf86-video-ati-6.14.3 (3.1.0 to 3.3-rc1).
Comment 3 Torsten Kaiser 2012-01-28 08:06:07 UTC
Created attachment 56259
current xorg log

My current xorg log with the patched 6.14.3 (with 3b9fdc807dd7e52af0576299cefba596040f6f2f reverted).

Lockups happen with server 1.11.1 (installed when I upgraded to 6.1.4.2), 1.11.2 and 1.11.3 (currently installed, as visible in the attached log).

I do not have old Xorg logs from older kernels...
Comment 4 Torsten Kaiser 2012-01-28 08:08:32 UTC
> ...upgraded to 6.1.4.2

That should have been "...upgraded to 6.14.3". Sorry for the spam.
Comment 5 Michel Dänzer 2012-01-30 08:21:13 UTC
Created attachment 56332
Some vline wait debugging output

Can you run the driver with this patch and attach a snapshot of Xorg.0.log from when a lockup occurred? Beware that this might generate a lot of output.
Comment 6 Torsten Kaiser 2012-01-30 11:28:15 UTC
I tried your debugging patch. At the point of the lockup the xorg log was around 550k, but the relevant part is rather short.

syslog for comparison of the timestamps of the lockups:
[ 1508.670067] radeon 0000:07:00.0: GPU lockup CP stall for more than 10020msec
[ 1508.670070] GPU lockup (waiting for 0x0000316B last fence id 0x0000316A)
[ 1508.686127] radeon 0000:07:00.0: GPU softreset
[ 1508.686129] radeon 0000:07:00.0:   R_008010_GRBM_STATUS=0xA0003028
[ 1508.686131] radeon 0000:07:00.0:   R_008014_GRBM_STATUS2=0x00000002
[ 1508.686133] radeon 0000:07:00.0:   R_000E50_SRBM_STATUS=0x200000C0
[ 1508.686142] radeon 0000:07:00.0:   R_008020_GRBM_SOFT_RESET=0x00007FEE
[ 1508.701160] radeon 0000:07:00.0: R_008020_GRBM_SOFT_RESET=0x00000001
[ 1508.717146] radeon 0000:07:00.0:   R_008010_GRBM_STATUS=0x00003028
[ 1508.717148] radeon 0000:07:00.0:   R_008014_GRBM_STATUS2=0x00000002
[ 1508.717150] radeon 0000:07:00.0:   R_000E50_SRBM_STATUS=0x200000C0
[ 1508.718149] radeon 0000:07:00.0: GPU reset succeed
[ 1508.720959] [drm] PCIE GART of 512M enabled (table at 0x0000000000040000).
[ 1508.720977] radeon 0000:07:00.0: WB enabled
[ 1508.767090] [drm] ring test succeeded in 1 usecs
[ 1508.767097] [drm] ib test succeeded in 1 usecs
[ 1520.080077] radeon 0000:07:00.0: GPU lockup CP stall for more than 10040msec
[ 1520.080079] GPU lockup (waiting for 0x0000317E last fence id 0x0000317D)
[ 1520.096133] radeon 0000:07:00.0: GPU softreset
[ 1520.096135] radeon 0000:07:00.0:   R_008010_GRBM_STATUS=0xA0003028
[ 1520.096137] radeon 0000:07:00.0:   R_008014_GRBM_STATUS2=0x00000002
[ 1520.096139] radeon 0000:07:00.0:   R_000E50_SRBM_STATUS=0x200000C0
[ 1520.096147] radeon 0000:07:00.0:   R_008020_GRBM_SOFT_RESET=0x00007FEE
[ 1520.111165] radeon 0000:07:00.0: R_008020_GRBM_SOFT_RESET=0x00000001
[ 1520.127151] radeon 0000:07:00.0:   R_008010_GRBM_STATUS=0x00003028
[ 1520.127152] radeon 0000:07:00.0:   R_008014_GRBM_STATUS2=0x00000002
[ 1520.127154] radeon 0000:07:00.0:   R_000E50_SRBM_STATUS=0x200000C0
[ 1520.128153] radeon 0000:07:00.0: GPU reset succeed
[ 1520.130943] [drm] PCIE GART of 512M enabled (table at 0x0000000000040000).
[ 1520.130960] radeon 0000:07:00.0: WB enabled
[ 1520.177065] [drm] ring test succeeded in 1 usecs
[ 1520.177071] [drm] ib test succeeded in 1 usecs

The corresponding part of Xorg.0.log:
[   888.770] (start, stop) before clamping: (997, 1024)
[   888.771] (start, stop) after clamping: (997, 1024), crtc->mode.VDisplay=1024
[   888.771] (start, stop) after translation: (997, 1024)
[   888.854] (start, stop) before clamping: (997, 1024)
[   888.854] (start, stop) after clamping: (997, 1024), crtc->mode.VDisplay=1024
[   888.854] (start, stop) after translation: (997, 1024)
[   889.920] (start, stop) before clamping: (793, 807)
[   889.920] (start, stop) after clamping: (793, 768), crtc->mode.VDisplay=768
[  1191.324] (start, stop) before clamping: (997, 1024)
[  1191.324] (start, stop) after clamping: (997, 1024), crtc->mode.VDisplay=1024
[  1191.325] (start, stop) after translation: (997, 1024)
[  1491.730] (start, stop) before clamping: (80, 848)
[  1491.731] (start, stop) after clamping: (80, 768), crtc->mode.VDisplay=768
[  1491.731] (start, stop) after translation: (160, 848)
[  1491.731] (start, stop) before clamping: (997, 1024)
[  1491.731] (start, stop) after clamping: (997, 1024), crtc->mode.VDisplay=1024
[  1491.731] (start, stop) after translation: (997, 1024)
[  1509.311] (start, stop) before clamping: (80, 848)
[  1509.311] (start, stop) after clamping: (80, 768), crtc->mode.VDisplay=768
[  1509.311] (start, stop) after translation: (160, 848)
[  1509.311] (start, stop) before clamping: (997, 1024)
[  1509.311] (start, stop) after clamping: (997, 1024), crtc->mode.VDisplay=1024
[  1509.311] (start, stop) after translation: (997, 1024)
[  1520.608] (start, stop) before clamping: (793, 807)
[  1520.609] (start, stop) after clamping: (793, 768), crtc->mode.VDisplay=768
[  1520.609] (start, stop) before clamping: (997, 1024)
[  1520.609] (start, stop) after clamping: (997, 1024), crtc->mode.VDisplay=1024
[  1520.609] (start, stop) after translation: (997, 1024)
[  2154.377] (start, stop) before clamping: (793, 807)
[  2154.377] (start, stop) after clamping: (793, 768), crtc->mode.VDisplay=768
[  2154.377] (start, stop) before clamping: (997, 1024)
[  2154.377] (start, stop) after clamping: (997, 1024), crtc->mode.VDisplay=1024
[  2154.377] (start, stop) after translation: (997, 1024)
[  2154.393] (start, stop) before clamping: (80, 848)
[  2154.393] (start, stop) after clamping: (80, 768), crtc->mode.VDisplay=768
[  2154.393] (start, stop) after translation: (160, 848)
[  2154.393] (start, stop) before clamping: (997, 1024)
[  2154.393] (start, stop) after clamping: (997, 1024), crtc->mode.VDisplay=1024
[  2154.393] (start, stop) after translation: (997, 1024)
[  2154.826] (start, stop) before clamping: (793, 807)
[  2154.826] (start, stop) after clamping: (793, 768), crtc->mode.VDisplay=768
[  2155.326] (start, stop) before clamping: (997, 1024)
[  2155.326] (start, stop) after clamping: (997, 1024), crtc->mode.VDisplay=1024
[  2155.326] (start, stop) after translation: (997, 1024)
[  2157.325] (start, stop) before clamping: (997, 1024)
[  2157.325] (start, stop) after clamping: (997, 1024), crtc->mode.VDisplay=1024
[  2157.325] (start, stop) after translation: (997, 1024)

(I kept the complete log and can attach it, if you need it...)
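
The arithmetic visible in the debug output above can be reproduced with a short sketch. This is inferred from the logged numbers only and is not the driver source: `vline_as_logged` and `struct vline_range` are hypothetical names, the lower clamp on `start` is an assumption, and the offset of 80 is assumed to be the vertical position of DVI-0 from the XRandR output (1024x768+1280+80).

```c
#include <assert.h>
#include <stddef.h>

struct vline_range {
    int start;
    int stop;
};

/* Mirrors what the debug log shows for 6.14.3 (an inference, not driver code):
 * clamp the range to the mode height first, then add the CRTC's vertical
 * offset. Returns 1 and fills *out if a wait would be set up, 0 if the
 * clamped range is empty (no "after translation" line in the log). */
int vline_as_logged(int start, int stop, int vdisplay, int crtc_y,
                    struct vline_range *out)
{
    assert(out != NULL);

    if (start < 0)          /* assumed lower clamp, not visible in the log */
        start = 0;
    if (stop > vdisplay)    /* e.g. 848 -> 768 */
        stop = vdisplay;
    if (start >= stop)      /* e.g. (793, 768): nothing to wait for */
        return 0;

    out->start = start + crtc_y;   /* e.g. 80 + 80 = 160 */
    out->stop = stop + crtc_y;     /* e.g. 768 + 80 = 848 */
    return 1;
}
```

With `vdisplay = 768` and `crtc_y = 80`, the input (80, 848) yields (160, 848) and (793, 807) yields no wait, matching the log. With `vdisplay = 1024` and `crtc_y = 0`, (997, 1024) passes through unchanged.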
Comment 7 Michel Dänzer 2012-01-31 03:23:32 UTC
Created attachment 56365
Correctly initialize DESKTOP_HEIGHT register

AFAICT this is actually a kernel bug, which was merely exposed by the (also not quite correct) X driver change. This patch should fix it.
Comment 8 Michel Dänzer 2012-01-31 03:25:39 UTC
Created attachment 56367
vline coordinate space fixes

If the kernel patch doesn't fix the lockups, please attach the debugging output from this X driver patch.
Comment 9 Torsten Kaiser 2012-01-31 10:55:48 UTC
That patch (attachment 56365) seems to work.
The patched xf86-video-ati (6.14.3 + your first debugging patch) did not cause any GPU lockups with kernel 3.3-rc1 + Jerome's mutex fix for the GPU reset path + your DESKTOP_HEIGHT patch.

While I do have a sure way to see if the bug has been triggered (GPU lockup messages in the syslog), I do not have a reliable way to see if the trigger was there but without a lockup. I'm rather sure that it is fixed, but lacking a reliable way to trigger it, it is impossible to confirm this 100%. I will keep trying to trigger this and will definitely let you know if that lockup returns.


But I think I'm starting to lose track of how many interconnected bugs are in play.

Bug 1) Kernel causes a GPU lockup on invalid requests -> I think that one should be fixed by attachment 56365.
Bug 2) Regression(?) from 6.14.2 to 6.14.3 that causes the driver to send these invalid requests. Is attachment 56367 supposed to fix this? Do you want me to try that patch against vanilla 3.2 to see if there are still GPU lockups logged?
Bug 3) Regression from 3.2 to 3.3-rc1 with the recursive mutex lock on GPU reset -> That one has been fixed in mainline with Jerome's patch.
Bug 4) Regression from 3.2 to 3.3-rc1 where the X server fails to recover from a GPU lockup even with Bug 3 fixed -> Do you think that it's worth trying to find/fix this, even if with Bug 1 fixed these lockups can't be triggered anymore?
Comment 10 Michel Dänzer 2012-01-31 11:31:41 UTC
(In reply to comment #9)
> That patch (attachment 56365) seems to work.

Great, I'll submit the fix.


> Bug 1) Kernel causes a GPU lockup on invalid requests -> I think that one
> should be fixed by attachment 56365.

Not (only) on invalid requests, but even potentially on valid requests, if the vertical CRTC scanout start is larger than the vertical blank period.


> Bug 2) Regression(?) from 6.14.2 to 6.14.3 that causes the driver to send these
> invalid requests. Is attachment 56367 supposed to fix this?

It does fix a bug in 6.14.3, but that doesn't result in invalid requests to the kernel but merely in vline ranges which start too far towards the bottom.

> Do you want me to try that patch against vanilla 3.2 to see if there are
> still GPU lockups logged?

The X driver fix can't avoid the kernel bug. I'll submit it to the xorg-driver-ati mailing list for review, but it's not directly relevant to this bug; I just attached it for the sake of completeness and potential debugging information.


I think 4) means 3) isn't completely fixed yet.
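
To illustrate what "start too far towards the bottom" can mean, here is a sketch of one consistent way to handle the coordinate spaces. It is not the attached patch, which is not reproduced in this report and may differ. It assumes the requested range arrives in screen coordinates and that the hardware expects the CRTC's vertical offset (`crtc_y`, assumed 80 for DVI-0) to be included in the programmed values; `vline_consistent` and `struct vline_range` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

struct vline_range {
    int start;
    int stop;
};

/* Sketch under stated assumptions (not the attached patch): convert the
 * screen-space range to CRTC-relative lines, clamp it against the mode
 * height in that space, then translate back by the same offset.
 * Returns 1 and fills *out on success, 0 if the range misses the CRTC. */
int vline_consistent(int start, int stop, int vdisplay, int crtc_y,
                     struct vline_range *out)
{
    assert(out != NULL);

    /* Screen space -> CRTC space. */
    start -= crtc_y;
    stop -= crtc_y;

    /* Clamp in CRTC space, where [0, vdisplay] is the visible range. */
    if (start < 0)
        start = 0;
    if (stop > vdisplay)
        stop = vdisplay;
    if (start >= stop)
        return 0;

    /* CRTC space -> offset-inclusive values (assumed hardware convention). */
    out->start = start + crtc_y;
    out->stop = stop + crtc_y;
    return 1;
}
```

Under these assumptions, the request (80, 848) on the 768-line CRTC at offset 80 stays (80, 848) rather than becoming (160, 848), and (793, 807) is kept instead of being discarded.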
Comment 11 Torsten Kaiser 2012-01-31 11:57:19 UTC
(In reply to comment #10)
> Great, I'll submit the fix.

Thanks for your fix. :-)
And your explanations.

> I think 4) means 3) isn't completely fixed yet.

I suspect 3) and 4) really are different bugs, because the lockup on the mutex was so obvious and, after Jerome's patch, is obviously no longer happening.

But I think 3+4 are better handled at https://bugzilla.kernel.org/show_bug.cgi?id=42678, so I will bother the kernel people there and keep that bug open even after your patch for 1) is in the kernel.

But it should be correct to close this bug when your patch is in.
Comment 12 Michel Dänzer 2012-02-02 01:21:47 UTC
Dave applied the fix to the drm-fixes tree. We should get a notification here when it hits mainline; we can resolve as fixed then.
Comment 13 Torsten Kaiser 2012-02-02 12:15:40 UTC
I had seen Dave's pull request and now the fix is in mainline:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=1b61925061660009f5b8047f93c5297e04541273

OK, I'm closing this as fixed. :-)

