Bug 9252

Summary: complete lockups with radeon 9600 XT
Product: DRI Reporter: Xavier Bestel <xavier.bestel>
Component: DRM/Radeon Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED WORKSFORME QA Contact:
Severity: normal    
Priority: high CC: auxsvr, glisse, simone, z3ro.geek
Version: unspecified   
Hardware: x86 (IA32)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description                               Flags
syslog, 71M unzipped, with drm debug=1    none
Xorg log                                  none
Patch, probably not correct!              none
r300_scratch is broken.                   none
r300_scratch is broken.                   none
If we can't idle unlock. Lock when we should...    none
Updated patch to Rune's comment.          none
DO_USLEEP if it is supported              none
Xorg.log (without xorg.conf)              none

Description Xavier Bestel 2006-12-05 05:05:58 UTC
Hi,

I'm using debian/sid, currently with Xorg 7.1, ati 6.6.3, kernel 2.6.18, libdrm
2.0.2, mesa 6.5.1. DRI is enabled, I'm using compiz on a Radeon 9600 XT. The big
problem is that there are random lockups - complete lockups, like not even ping
is working, nothing in the logs after reboot. Looks like the PCI bus is hosed,
but I'm not an expert.

Here is an excerpt of my config file:

Section "Device"
        Identifier      "ATI Radeon 9600XT"
        Driver          "radeon"
        VendorName      "ATI"
        Option          "accel"
        Option          "AccelMethod"           "xaa"
        Option          "RenderAccel"           "true"
        Option          "EnablePageFlip"        "true"
        Option          "ColorTiling"           "true"
        Option          "AccelDFS"              "true"
        Option          "XAANoOffscreenPixmaps"
        Option          "GARTSize"              "64"
        Option          "DDCMode"               "true"
EndSection
Comment 1 Jerome Glisse 2006-12-05 05:12:47 UTC
Could you disable page flip and render accel, attach your Xorg log,
and try enabling debug output of the radeon & drm modules (you should
mount your log partition with the sync option to get something
usable).
Comment 2 Xavier Bestel 2006-12-05 05:49:09 UTC
How do I enable debug output? Something like "drm debug=1" in /etc/modules?

Thanks,
       Xav
Comment 3 Jerome Glisse 2006-12-05 05:53:35 UTC
radeon debug=1 would be more helpful (my fault). No need to restart:
you could stop the X server, rmmod radeon and modprobe radeon debug=1.
Don't forget to mount the partition where your log files are with the
sync option.

Did you try disabling pageflip and/or color tiling?
Comment 4 Xavier Bestel 2006-12-05 05:59:31 UTC
Thanks, I'll do that.
I think I already tried various combinations of various configs (BTW, AGP x8
with writecache locks up right at start), but I'll try that specifically when I
get back home.

        Xav
Comment 5 Michel Dänzer 2006-12-05 06:13:50 UTC
Guys, no need to even reload the kernel module to enable debugging output...
That's what /sys/module/drm/parameters/debug is for.
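
A minimal sketch of the approach described in this comment, not taken from the thread: it writes the debug level to the sysfs parameter file instead of reloading the module. The helper name `set_drm_debug` and the `DRM_DEBUG_PARAM` override are hypothetical, added only so the snippet can be tried against a scratch file; the sysfs path is the one quoted above and normally needs root to write.

```shell
#!/bin/sh
# Path of the drm debug parameter, as named in the comment above.
# Overridable (hypothetical convenience) so the helper can be tried on a scratch file.
DRM_DEBUG_PARAM="${DRM_DEBUG_PARAM:-/sys/module/drm/parameters/debug}"

set_drm_debug() {
    # $1 is the debug level: 1 enables debug output, 0 disables it.
    level="$1"
    if [ ! -w "$DRM_DEBUG_PARAM" ]; then
        echo "cannot write $DRM_DEBUG_PARAM (not root, or drm not loaded?)" >&2
        return 1
    fi
    echo "$level" > "$DRM_DEBUG_PARAM"
}

# Typical use, as root:
#   set_drm_debug 1    # reproduce the lockup, then collect syslog
#   set_drm_debug 0    # switch the extra output off again
```

Debug output at this level is very verbose (the syslog attached later in this report is 71M unzipped), so it is probably worth switching it off again once the lockup has been captured.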
Comment 6 Xavier Bestel 2006-12-05 11:46:46 UTC
Here's a lockup log (syslog + X log). I couldn't find a debug parameter for
radeon, I used the drm debug param.
Comment 7 Xavier Bestel 2006-12-05 11:49:45 UTC
Created attachment 7967 [details]
syslog, 71M unzipped, with drm debug=1
Comment 8 Xavier Bestel 2006-12-05 11:52:11 UTC
Created attachment 7968 [details]
Xorg log
Comment 9 Xavier Bestel 2007-02-23 03:19:04 UTC
Just FYI, the lockups are still present with debian experimental's xorg 1.2

        Xav
Comment 10 Jerome Glisse 2007-02-23 05:05:23 UTC
*** Bug 8833 has been marked as a duplicate of this bug. ***
Comment 11 Jerome Glisse 2007-02-23 05:13:12 UTC
Did you remount your /var partition with sync? I don't see
any lockup situation in the log (I might be wrong, but I think
once a lockup happens we shouldn't see any more cmd buffer submits).
IIRC the command to remount with sync should look like this:
mount -o remount,sync <varpartitionmountpoint>
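
To confirm that the remount took effect, something like the following could be used. This is a sketch under assumptions, not part of the original comment: `is_sync_mounted` is a hypothetical helper, and it assumes the usual /proc/mounts layout (device, mount point, filesystem type, comma-separated options).

```shell
#!/bin/sh
is_sync_mounted() {
    # $1: mount point to check, e.g. /var
    # $2: mounts table to read (defaults to /proc/mounts)
    mountpoint="$1"
    mounts_file="${2:-/proc/mounts}"
    awk -v mp="$mountpoint" '
        $2 == mp {
            # The last entry for a mount point wins, so reset on every match.
            found = 0
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "sync") found = 1
        }
        END { exit(found ? 0 : 1) }
    ' "$mounts_file"
}

# Typical use:
#   is_sync_mounted /var && echo "/var is mounted sync"
```

The function returns success only when the mount point is listed and its options include `sync`; an unlisted mount point counts as not sync.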
Comment 12 Xavier Bestel 2007-02-25 09:46:04 UTC
Yes, I did. But when my machine freezes, it looks like a PCI bus freeze, so everything is stuck, and I don't think it's still able to write something to the logfiles (even if mounted -o sync).
Comment 13 Oliver McFadden 2007-02-26 20:06:10 UTC
I think that I have a similar problem; actually, I get a couple of different
lock up types, but this is one of them.

I also haven't been able to get anything to disk yet, but I'm going to try with a
kernel serial console that should (hopefully) get all the DRM messages.

I'm also going to try with the R300 ring buffer debug patch, originally posted
at http://marc.theaimsgroup.com/?l=dri-devel&m=111736048825917&w=2

I modified it slightly so it applies cleanly to Git master.
http://z3ro.name/r300_ring_buffer.patch

Just some random ideas that may or may not help you.
Comment 14 Papadakos Panagiotis 2007-02-27 02:35:58 UTC
Created attachment 8875 [details] [review]
Patch, probably not correct!

I don't think it is correct, but it helps me here with my X700 to avoid some lockups with beryl.
Comment 15 Oliver McFadden 2007-02-27 03:47:16 UTC
Could you explain some of your reasoning for that patch? I'm just wondering if I should test this, but I'm not sure why you would disable that lock? 
Comment 16 Papadakos Panagiotis 2007-02-27 10:46:47 UTC
(In reply to comment #15)
> Could you explain some of your reasoning for that patch? I'm just wondering if
> I should test this, but I'm not sure why you would disable that lock? 
> 
Ignore the patch. With it, the lockups on my desktop with beryl came a bit later, which is why I posted it, just to see whether others' problems were affected too.

P.S.
By the way I think that there is something wrong with r300_scratch in r300_cmdbuf.c in the drm module. I created a patch, for which I am not sure if it is correct, although I think it should be. I think r300_scratch was totally broken.
Comment 17 Papadakos Panagiotis 2007-02-27 10:49:05 UTC
Created attachment 8889 [details] [review]
r300_scratch is broken.
Comment 18 Papadakos Panagiotis 2007-02-27 10:53:35 UTC
Created attachment 8893 [details] [review]
r300_scratch is broken.

Correct one.
Comment 19 Oliver McFadden 2007-02-27 11:14:48 UTC
So the R300 scratch patch is related to the lockups, or another problem? 
Comment 20 Papadakos Panagiotis 2007-02-27 11:26:32 UTC
(In reply to comment #19)
> So the R300 scratch patch is related to the lockups, or another problem? 
> 
I think it fixes my lockups with beryl, (up to now of course).
Why don't you try it and post your comments?
Comment 21 Oliver McFadden 2007-02-27 11:46:24 UTC
I still get lockups with that patch, but if it solves your problem then commit it of course. 
Comment 22 Papadakos Panagiotis 2007-02-28 12:24:41 UTC
Created attachment 8909 [details] [review]
If we can't idle unlock. Lock when we should...

Well I think that this patch is correct and should probably help.
Please try it.
Comment 23 Rune Petersen 2007-02-28 12:48:05 UTC
A comment on the patch:
You only need to retake the lock if you previously released it.

I would personally prefer if you moved LOCK_HARDWARE() in after UNLOCK_HARDWARE() and DO_SLEEP().

Though on a functional level it shouldn't make a difference.
Comment 24 Papadakos Panagiotis 2007-02-28 12:55:25 UTC
Created attachment 8910 [details] [review]
Updated patch to Rune's comment.

Thanks Rune. You are correct! Updated patch.
Comment 25 Papadakos Panagiotis 2007-02-28 13:48:08 UTC
(In reply to comment #24)
> Created an attachment (id=8910) [details]
> Updated patch to Rune's comment.
> 
> Thanks Rune. You are correct! Updated patch.
> 
Unfortunately there still seems to be a problem. I get the same
lockups (trying to rotate the cube, it locks up when I stop rotating) if I restart beryl! What could be happening?
Comment 26 Papadakos Panagiotis 2007-02-28 14:16:16 UTC
Created attachment 8913 [details] [review]
DO_USLEEP if it is supported

Still lockups the second time I run beryl.
Comment 27 Michel Dänzer 2007-03-02 03:19:09 UTC
(In reply to comment #26)
> Created an attachment (id=8913) [details]
> DO_USLEEP if it is supported

What exactly is the problem addressed by this patch? While it may make sense to drop the lock while sleeping (but it's not obvious why not doing so could cause lockups, except by changing the timing maybe), I don't think making the sleep conditional on the do_usleeps configuration makes sense because it's not directly related to this situation. Its purpose is to allow sleeps when the CPU gets too far ahead of the GPU, but here it would instead greatly change the time out waiting for the GPU to go idle.

> Still lockups the second time I run beryl.

So apparently, that's not related to your patches at all? What are the symptoms? E.g., is the X server still running? If not, any hints in its log file or stderr  output? ...
Comment 28 Oliver McFadden 2007-03-07 07:17:35 UTC
Could you explain the r300_scratch patch? How was it broken?
Comment 29 Papadakos Panagiotis 2007-03-07 13:44:47 UTC
(In reply to comment #28)
> Could you explain the r300_scratch patch? How was it broken?
> 
According to Aapo Tahkola, it is not broken. See http://archive.netbsd.se/?ml=dri-devel&a=2007-02&t=3223874.

Although I think this code could be cleaner.

P.S.
What is the u in the drm_r300_cmd_header_t union?
Comment 30 Papadakos Panagiotis 2007-03-07 13:54:26 UTC
(In reply to comment #27)

> So apparently, that's not related to your patches at all? What are the
> symptoms? E.g., is the X server still running? If not, any hints in its log
> file or stderr  output? ...
> 
I think not. This patch just changed the behaviour of the lockup. The lockup
can be reproduced by running beryl (I use the latest SVN version) with AIGLX and rotating the cube: when you let go of the cube, the system locks up. No hints in the logs.

With the patch I sent, the first time I ran beryl and rotated the cube, then let go of it, I could see the cube vanish and reappear while it was returning to its initial position, at the moment it would have locked up without the patch. If I then rotated the cube again, everything worked just fine, with no vanishing and reappearing. If I restarted beryl and tried the same things, my system would lock up.

P.S.
I hope you understand what I wrote. Sorry for my English.

Comment 31 Simone Lazzaris 2007-06-04 08:26:00 UTC
I seem to experience the same bug here:
arch linux, kernel 2.6.21, xorg 1.3.0, mesa 6.5.3, xf86-video-ati-6.6.192

As soon as I start compiz, the system freezes irremediably; not even ping or ACPI shutdown works. Nothing is recorded in the logs after reboot.
This is systematic, and it's been like that for a few months.
Before that (with older versions of kernel, xorg, ati driver, everything) I was able to use compiz at decent speed.
Other OpenGL applications don't (apparently) hit the bug. Specifically, I use Google Earth without problems.

Comment 32 Michel Dänzer 2007-06-09 04:39:19 UTC
Trying to recover from the attempted hijacks and get back to the original problem... Xavier, if you can still reproduce this, does Option "BusType" "PCI" help?
Comment 33 Alex Deucher 2009-04-02 06:52:36 UTC
closing due to lack of feedback
Comment 34 Xavier Bestel 2009-04-02 08:33:07 UTC
Sorry, I continued discussion on debian #515326, at http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=515326 ...

In a nutshell:

- "BusType" "PCI" gives me a "white screen of death" and that log:
(**) RADEON(0): Forced into PCI mode
(EE) RADEON(0): [pci] Out of memory (-12)
(EE) RADEON(0): [pci] PCI failed to initialize. Disabling the DRI.
(II) RADEON(0): [drm] removed 1 reserved context for kernel
(II) RADEON(0): [drm] unmapping 8192 bytes of SAREA 0xf82c6000 at 0xb79fc000
(II) RADEON(0): [drm] Closed DRM master.
(WW) RADEON(0): Direct rendering disabled

- I can ssh my machine sometimes (with 6.6 I couldn't), and an attach+backtrace on Xorg gives that:
(gdb) bt
#0  0xb7f22424 in __kernel_vsyscall ()
#1  0xb7bb0b29 in ioctl () from /lib/i686/cmov/libc.so.6
#2  0xb79c0bed in drmDMA (fd=10, request=0xbff3dcfc) at ../../libdrm/xf86drm.c:1266
#3  0xb79474c7 in RADEONCPGetBuffer (pScrn=0x9a83f88) at ../../src/radeon_accel.c:594
#4  0xb7999823 in RADEONPrepareSolidCP (pPix=0x9db03d0, alu=3, pm=4294967295, fg=0) at ../../src/radeon_exa_funcs.c:92
#5  0xb777d44a in exaFillRegionSolid (pDrawable=0x9db03d0, pRegion=0x9db2448, pixel=0, planemask=4294967295, alu=<value optimized out>) at ../../exa/exa_accel.c:1072
#6  0xb777edf2 in exaPolyFillRect (pDrawable=0x9db03d0, pGC=0x9d377d0, nrect=1, prect=0x9d4b51c) at ../../exa/exa_accel.c:751
#7  0x0817aad4 in damagePolyFillRect (pDrawable=0x9db03d0, pGC=0x9d377d0, nRects=1, pRects=0x9d4b51c) at ../../../miext/damage/damage.c:1404
#8  0x08089490 in ProcPolyFillRectangle (client=0x9d4b328) at ../../dix/dispatch.c:1769
#9  0x0808c51f in Dispatch () at ../../dix/dispatch.c:437
#10 0x080716f5 in main (argc=9, argv=0xbff3e064, envp=Cannot access memory at address 0xc0286431) at ../../dix/main.c:397
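
For reference, a backtrace like the one above can also be captured non-interactively over ssh. A hedged sketch, not taken from the thread: `backtrace_of` is a hypothetical wrapper, and it assumes gdb is installed, that it is run as root, and that debugging symbols for the X server and the driver are available.

```shell
#!/bin/sh
backtrace_of() {
    # $1: PID of the (possibly hung) process to inspect.
    pid="$1"
    case "$pid" in
        ''|*[!0-9]*)
            echo "usage: backtrace_of PID" >&2
            return 2
            ;;
    esac
    # Attach, print the backtrace, then detach and exit without prompting.
    gdb -p "$pid" -batch -ex "bt"
}

# Typical use, assuming a single Xorg process:
#   backtrace_of "$(pidof Xorg)" > /tmp/xorg-backtrace.txt
```

The PID check is only there so that an empty or multi-valued `pidof` result fails early instead of being passed to gdb.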


Ah, and FWIW, the deadlocks disappeared with 6.10.0 and reappeared around 6.10.99 IIRC.
Comment 35 Alex Deucher 2009-04-02 08:41:34 UTC
(In reply to comment #34)

> 
> Ah, and FWIW, the deadlocks disappeared with 6.10.0 and reappeared around
> 6.10.99 IIRC.
> 

Any chance you could bisect between those releases and find the bad commit?
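
A rough sketch of how such a bisect could be started, not taken from the thread. `start_bisect` is a hypothetical helper, and the tag names in the usage comment are assumptions about how the xf86-video-ati repository labels its releases; `git tag` in the checkout shows the real ones.

```shell
#!/bin/sh
start_bisect() {
    # $1: last revision known to be good, $2: first revision known to be bad.
    # Must be run inside a checkout of the driver's git repository.
    good="$1"
    bad="$2"
    if [ -z "$good" ] || [ -z "$bad" ]; then
        echo "usage: start_bisect GOOD_REV BAD_REV" >&2
        return 2
    fi
    # git bisect start takes the bad revision first, then the good one.
    git bisect start "$bad" "$good"
}

# Typical use (tag names are assumed, check `git tag`):
#   start_bisect xf86-video-ati-6.10.0 xf86-video-ati-6.10.99.0
# Then build, install, restart X and try to reproduce the lockup, and report:
#   git bisect good    # no lockup with this build
#   git bisect bad     # lockup reproduced
# Repeat until git names the first bad commit, then run: git bisect reset
```

Because the lockups are random, a build probably needs to run for a while before it can safely be marked good.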
Comment 36 Xavier Bestel 2009-04-02 08:45:08 UTC
I'd very much like to do that (and more), but I really don't have time for this right now. Sorry for this.
Comment 37 Xavier Bestel 2009-05-08 06:29:00 UTC
Nowadays the behavior is different: I have a segfault during startup (I think during compiz startup). I tried removing xorg.conf, same result.
Comment 38 Xavier Bestel 2009-05-08 06:30:16 UTC
Created attachment 25640 [details]
Xorg.log (without xorg.conf)
Comment 39 Maciej Cencora 2009-05-08 06:57:31 UTC
(In reply to comment #37)
> Nowadays the behavior is different: I have a segfault during startup (I think
> during compiz startup). I tried removing xorg.conf, same result.
> 

Yes, now it is the 3D driver crashing. Could you disable compiz for now and try reproducing it with a simpler 3D app?
It would be great if you could provide us with a backtrace with mesa debugging symbols.
Comment 40 Michel Dänzer 2009-05-08 07:43:17 UTC
This may be fixed in mesa Git master and mesa_7_[45]_branch already.

Anyway, it's a separate issue so it should have been a new report and this one only reopened if the original problem reported here still happens.
Comment 41 Matt Turner 2010-12-02 19:47:36 UTC
Closing due to inactivity and because the reported issue seems to have changed.
