Bug 54129 - [bisected] Kernel 3.5.0 breaks KMS on Radeon RV250
Status: RESOLVED FIXED
Product: DRI
Classification: Unclassified
Component: DRM/Radeon
Version: unspecified
Hardware: x86 (IA32) Linux (All)
Importance: medium major
Assigned To: Christian König
Depends on:
Blocks:
Reported: 2012-08-27 19:38 UTC by Andrea
Modified: 2012-10-03 10:50 UTC (History)


Attachments
Screenshot of bad rendering (80.34 KB, image/jpeg)
2012-08-27 19:38 UTC, Andrea
Other example of bad rendering (57.74 KB, image/jpeg)
2012-08-27 19:38 UTC, Andrea
Possible fix (1.48 KB, patch)
2012-09-09 09:47 UTC, Christian König
dmesg with debug patch, with xorg conf setting NoAccel=TRUE (65.53 KB, text/plain)
2012-09-11 21:30 UTC, Simon Kitching
dmesg with debug patch, with "text" appended to kernel commandline (64.01 KB, text/plain)
2012-09-11 21:31 UTC, Simon Kitching
dmesg output with debug patch. normal login into KDE (81.30 KB, text/plain)
2012-09-11 21:37 UTC, Andrea

Description Andrea 2012-08-27 19:38:00 UTC
Created attachment 66187 [details]
Screenshot of bad rendering

I run Fedora 17, and since it started shipping 3.5.x kernels I get a lot of artefacts when I log in to KDE.

Kernel 3.4.6 works ok.

My hardware is a ThinkPad laptop with a

01:00.0 VGA compatible controller: ATI Technologies Inc Radeon RV250 [Mobility FireGL 9000] (rev 02)

and I load the R200 microcode.

Basically, as soon as I log in to KDE, a lot of rectangular areas are left black.
They are 100% reproducible, always with the same pattern (at least in the few seconds before I log off again), and they move around when I click or when windows are displayed.

If I pass the option radeon.modeset=0 to the kernel (in grub) there are no artefacts, but then XV support is unavailable, so this is not really an option because video players struggle a lot.

I managed to bisect the issue to the following commits:

bad ========= 3b7a2b2 drm/radeon: rework fence handling, drop fence list v7
skip ======== bb63556 drm/radeon: convert fence to uint64_t v4
good ======== d6999bc drm/radeon: replace the per ring mutex with a global one

"skip" here means that the kernel does not boot: after the Linux penguin logo is displayed in the top left of the screen, nothing else happens, although I can still reboot by pressing Ctrl-Alt-Del.

So either of 2 commits (bb63556 or 3b7a2b2) could be responsible.

Please let me know if there is anything else I can provide.
Comment 1 Andrea 2012-08-27 19:38:34 UTC
Created attachment 66188 [details]
Other example of bad rendering
Comment 2 Jerome Glisse 2012-08-27 20:19:04 UTC
Does this kernel patch help?

http://people.freedesktop.org/~glisse/0001-drm-radeon-extra-type-safe-for-fence-emission.patch
Comment 3 Jerome Glisse 2012-08-27 20:26:33 UTC
Also, can you test whether booting with radeon.no_wb=1 fixes the issue?
Comment 4 Andrea 2012-08-28 20:26:30 UTC
(In reply to comment #3)
> Also can you test if booting with radeon.no_wb=1 fix the issue ?

This did not make any difference
(tested on v3.6-rc3, where the problem still exists).
Comment 5 Andrea 2012-08-28 21:06:03 UTC
(In reply to comment #2)
> Does kernel patch :
> 
> http://people.freedesktop.org/~glisse/0001-drm-radeon-extra-type-safe-for-fence-emission.patch
> 
> Helps ?

no difference.
Comment 6 Simon Kitching 2012-09-03 15:17:38 UTC
I also had graphics corruption with a Radeon Mobility X1600 (RV530), and bisected to exactly the same two patches:
  bb63556 -- hangs on start of Plymouth
  3b7a2b2 -- plymouth works again, but graphics corrupted.

The symptoms were somewhat different than described here: I got shimmery 70s paisley patterns rather than "black rectangles".


The problem continues up to 3.5.3.
However 3.6.0-rc4 fixes the issue - graphics appear to work fine again.
Comment 7 Alex Deucher 2012-09-03 15:22:43 UTC
(In reply to comment #6)
> The problem continues up to 3.5.3.
> However 3.6.0-rc4 fixes the issue - graphics appear to work fine again.

Can you bisect to see what commit fixed the issue?
Comment 8 Andrea 2012-09-03 20:19:45 UTC
(In reply to comment #6)
> I also had graphics corruption with a Radeon Mobility X1600 (RV350), and
> bisected to exactly the same two patches:
>   bb63556 -- hangs on start of Plymouth
>   3b7a2b2 -- plymouth works again, but graphics corrupted.
> 
> The symptoms were somewhat different than described here: I got shimmery 70s
> paisley patterns rather than "black rectangles".
> 
> 
> The problem continues up to 3.5.3.
> However 3.6.0-rc4 fixes the issue - graphics appear to work fine again.

Not here.

I've just tried 3.6-rc4 and I get the same corruptions.
Comment 9 Simon Kitching 2012-09-08 06:48:22 UTC
Sorry, have to take back that comment about 3.6-rc4+ working; I'm now getting the "black screen" problem consistently. I was definitely running the right kernel ("uname -a" was reporting 3.6-rc4+) and can only think that I accidentally fat-fingered the keyboard and selected the grub "recovery" option (ie with "nomodeset").

In short: 3.6-rc4+ just boots to a totally black screen for me, due to something merged in the 3.6-rc series. I've bisected this, and raised a separate bug (54662) for it. Interestingly, that commit is *also* about "radeon fence" handling.

I presume that this bug (54129) is still also present and lurking underneath the black screen - but obviously I can't test that.
Comment 10 Alex Deucher 2012-09-08 15:39:11 UTC
Does X load ok if you disable acceleration:

Option "NoAccel" "TRUE"

in the device section of your xorg.conf?
Comment 11 Christian König 2012-09-09 09:47:55 UTC
Created attachment 66876 [details] [review]
Possible fix

Please try the attached patch.

Also please supply the output of "sudo cat /sys/kernel/debug/dri/0/radeon_fence_info" with and without this patch.

Thx,
Christian.
Comment 12 Simon Kitching 2012-09-09 13:33:05 UTC
> Please try the attached patch.
> 
> Also please supply the output of "sudo cat
> /sys/kernel/debug/dri/0/radeon_fence_info" with and without this patch.
> 

Ok, good news - the patch resolves both this bug and #54662.

* radeon_fence_info output from standard ubuntu kernel (3.2.0-29):

Last signaled fence 0x000037E7

* radeon_fence_info output from version 876dc9f3^ (ie last version showing the "corrupted graphics" output, before I hit the patch that just makes the screen go black):

--- ring 0 ---
Last signaled fence 0x00000001000000cd
Last emitted  0x00000000000000cd


* radeon_fence_info output from current head version (3.6.0-rc5+) with patch "make 64bit fences more robust" applied:

--- ring 0 ---
Last signaled fence 0x0000000100000cea
Last emitted        0x0000000100000cea

Note that the patch does not apply to 3.5.3, nor 876dc9f3^ so I didn't test it against anything but current master head.


Alex: sorry, but Ubuntu doesn't usually have an xorg.conf file anymore AFAIK. I tried to generate one with "sudo Xorg -configure" but that just reported "Fatal server error: Server is already active", and I'm not sure how else to generate a base xorg.conf file to then modify.
Comment 13 Christian König 2012-09-09 17:44:03 UTC
(In reply to comment #12)
> Ok, good news - the patch resolves both this bug and #54662.

Having a workaround for the problem doesn't explain why the heck the counter is going backwards!

Either I'm missing something important in the algorithm, or gcc is strangely shuffling the code around. Maybe we should add a read memory barrier in radeon_fence_read.

Just in case: You're not working on a time machine or encountered a temporal anomaly recently?

Christian.
Comment 14 Simon Kitching 2012-09-09 18:01:25 UTC
> 
> Having a workaround for the problem doesn't explain why the heck the counter is
> going backwards!
> 
> Just in case: You're not working on a time machine or encountered a temporal
> anomaly recently?

No temporal problems here - breakfast, lunch, dinner still occurring in the regular order :-).

Isn't the problem simply that the top 32 bits of the emitted counter are being discarded on this 32-bit machine, ie that "signalled" is 64-bits, but pre-patch "emitted" was having its upper 32-bits cleared?
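
The hypothesis above can be illustrated in isolation. The following is only a minimal C sketch, not code from the radeon driver: `store_in_32bit_slot` and `signaled_ahead_of_emitted` are hypothetical names, and the sketch assumes that the emitted counter passes through a 32-bit-wide variable somewhere on a 32-bit machine.

```c
#include <stdint.h>

/* Hypothetical helper: models a 64-bit fence sequence number passing
 * through a 32-bit-wide variable (an assumption, not driver code). */
static uint64_t store_in_32bit_slot(uint64_t seq)
{
    uint32_t slot = (uint32_t)seq;   /* upper 32 bits are discarded here */
    return (uint64_t)slot;
}

/* Returns 1 if the signaled value is larger than the emitted value,
 * i.e. a fence appears to have signaled before it was ever emitted. */
static int signaled_ahead_of_emitted(uint64_t signaled, uint64_t emitted)
{
    return signaled > emitted;
}
```

Feeding 0x00000001000000cd through `store_in_32bit_slot` yields 0x00000000000000cd, the same pair of values as in the radeon_fence_info output in comment 12. This shows that the hypothesis is consistent with the output, not that it is the actual cause.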
Comment 15 Jerome Glisse 2012-09-10 17:10:54 UTC
http://people.freedesktop.org/~glisse/0001-debug-fence-emission-reception.patch

Could you please boot with the patch linked above applied and without the fixing patch? Just boot into runlevel 3 so that only plymouth runs. Then save the dmesg output (dmesg > fencedebug.txt) and attach it to the bug. It will help us understand what's going on.
Comment 16 Jerome Glisse 2012-09-10 17:30:59 UTC
http://people.freedesktop.org/~glisse/0001-debug-fence-emission-reception.patch

Could you please boot with the patch linked above applied and without the fixing patch? Just boot into runlevel 3 so that only plymouth runs. Then save the dmesg output (dmesg > fencedebug.txt) and attach it to the bug. It will help us understand what's going on.
Comment 17 Tormod Volden 2012-09-10 20:31:40 UTC
Simon, BTW, you can make a minimal /etc/X11/xorg.conf like this:

Section "Device"
	Identifier "my-radeon-card"
	Option "NoAccel" "TRUE"
EndSection
Comment 18 Simon Kitching 2012-09-11 21:27:48 UTC
Ok, results of testing the 0001-debug-fence patch are as follows.

The kernel built is the master branch of Linus' tree, commit 55d512e2 (3.6.0-rc5), plus *only* the debug patch.

== test 1
Booting with 
* /usr/share/X11/xorg.conf.d/50-mydevice.conf setting NoAccel to TRUE
* grub kernel commandline of "root=UUID=... ro quiet splash $vt_handoff"
resulted in a working graphical system and the full dmesg output is attached as file "dmesg-debug-noaccel.txt". However the important bits appear to be:

[    2.578335] [drm] rfence(R) 0x7557effd
[    2.597213] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000010000000 and cpu addr 0xffca8000
[    2.682994] [drm] efence 0x00000001
[    2.976116] [drm] rfence(M) 0x00000001
[   16.799294] [drm] efence 0x00000002
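
The excerpt above shows the fence value reading 0x7557effd before the first fence (0x00000001) is emitted. One possible reading, which is an assumption and not something confirmed in this report, is that code widening a 32-bit hardware counter to 64 bits would mistake the drop from 0x7557effd to 0x00000001 for a counter wraparound. The C sketch below models that case; `extend_seq` is a hypothetical name and the code is not taken from the radeon driver.

```c
#include <stdint.h>

/* Hypothetical sketch: widen a 32-bit hardware sequence number to 64 bits.
 * Any value lower than the last one seen is treated as a 32-bit wraparound. */
static uint64_t extend_seq(uint64_t last_seq, uint32_t hw_seq)
{
    /* Keep the upper half of the last known value, take the lower half
     * from the hardware. */
    uint64_t seq = (last_seq & 0xffffffff00000000ULL) | hw_seq;

    if (seq < last_seq)
        seq += 0x100000000ULL;   /* assume the 32-bit counter wrapped */
    return seq;
}
```

With a last seen value of 0x7557effd and a hardware value of 0x00000001, `extend_seq` returns 0x0000000100000001: the upper word becomes 1 although only one fence has been emitted.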

Note that Ubuntu runlevel 2 by default boots to graphics mode, and levels 3..5 are identical to level 2. See: http://www.debianadmin.com/debian-and-ubuntu-linux-run-levels.html

Note also that there were no further "[drm]" messages in dmesg even after using the system for a few minutes. 

== test 2
Booting *without* the xorg.conf.d/50-mydevice.conf file (ie *without* overriding NoAccel) and with the above kernel commandline resulted in a black screen.

== test 3
Booting *without* the xorg.conf.d/50-mydevice.conf file (ie *without* overriding NoAccel) and with "3" appended to the kernel commandline also resulted in a black screen.

== test 4
Booting with "text" appended to the kernel commandline resulted in plymouth completing and then switching to a working text-mode system. The dmesg output is attached as file "dmesg-debug-text.txt". The important bits are similar to the "noaccel" case.

== test 5
Booting with "nomodeset" resulted in a working graphics system. The dmesg output had no "drm" entries, and did not have any of the added "fence" debug output.

I hope this sheds some light - and thanks for looking into this issue!
Comment 19 Simon Kitching 2012-09-11 21:30:08 UTC
Created attachment 67002 [details]
dmesg with debug patch, with xorg conf setting NoAccel=TRUE
Comment 20 Simon Kitching 2012-09-11 21:31:25 UTC
Created attachment 67003 [details]
dmesg with debug patch, with "text" appended to kernel commandline
Comment 21 Andrea 2012-09-11 21:37:25 UTC
Created attachment 67005 [details]
dmesg output with debug patch. normal login into KDE

(In reply to comment #16)
> http://people.freedesktop.org/~glisse/0001-debug-fence-emission-reception.patch
> 
> Could you please boot with attached patch and without the fixing patch. Just
> boot in runlevel 3 so you only have plymouth. Then save dmesg, dmesg >
> fencedebug.txt and attach it to the bug. It will help to understand what's
> going on.

Here is my dmesg.
It was generated in Fedora 17 after logging into KDE and noticing the usual black artefacts (no runlevel 3).

The patch has been applied to v3.6-rc5 without the other patch at comment 11.
BTW, if I apply the patch at comment 11, everything seems to work properly.
Comment 22 Simon Kitching 2012-09-15 21:04:34 UTC
A patch has been merged into Linus' tree for 3.6-rc5+:

commit f492c171a38d77fc13a8998a0721f2da50835224
Author: Christian König <deathsimple@vodafone.de>
Date:   Thu Sep 13 10:33:47 2012 +0200

    drm/radeon: make 64bit fences more robust v3
...    
    The intention of this patch is to make fences as robust as
    they where before introducing 64bit fences. This is
    necessary because on older systems it looks like the fence
    value gets corrupted on initialization.
    
    Fixes:
    https://bugs.freedesktop.org/show_bug.cgi?id=51344
    
    Should also fix:
    https://bugs.freedesktop.org/show_bug.cgi?id=54129
    https://bugs.freedesktop.org/show_bug.cgi?id=54662

It does indeed seem to resolve this bug 54129 (and 54662) for me. System boots fine (without messing with NoAccel=TRUE or nomodeset). Suspend/resume also work fine. And dmesg output looks fine.
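
The commit message quoted above does not spell out the mechanism (part of it is elided). Purely as a hypothetical illustration of what a more robust widening of a 32-bit fence value could look like, the C sketch below takes the upper half from the last emitted fence whenever the hardware value appears to have gone backwards, so the result is tied to what was actually emitted. `extend_seq_bounded` and its parameters are assumed names; this is not the content of commit f492c171.

```c
#include <stdint.h>

/* Hypothetical sketch: widen a 32-bit hardware sequence number to 64 bits
 * without blindly adding 2^32 when the value appears to go backwards. */
static uint64_t extend_seq_bounded(uint64_t last_seq, uint64_t last_emitted,
                                   uint32_t hw_seq)
{
    uint64_t seq = (last_seq & 0xffffffff00000000ULL) | hw_seq;

    if (seq < last_seq) {
        /* Either a real wraparound or a corrupted starting value: take the
         * upper half from the last fence that was actually emitted. */
        seq = (last_emitted & 0xffffffff00000000ULL) | hw_seq;
    }
    return seq;
}
```

With a stale starting value of 0x7557effd, a last emitted fence of 0x1 and a hardware value of 0x1, this returns 0x1 instead of 0x100000001. A genuine wraparound (last seen 0xfffffffe, last emitted 0x100000003, hardware value 0x2) still yields 0x100000002.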

Thanks Christian/Jerome!
Comment 23 Andrea 2012-09-17 19:14:11 UTC
So far so good.

Thank you guys.
Comment 24 Christian König 2012-10-03 10:50:34 UTC
Any objections to closing this bug now?