Bug 15203 - r300 lockup
Summary: r300 lockup
Status: RESOLVED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/other
Version: XOrg git
Hardware: Other All
Priority: medium
Severity: normal
Assignee: Default DRI bug account
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-03-25 13:11 UTC by Markus Amsler
Modified: 2008-05-18 18:18 UTC (History)
3 users

See Also:


Attachments
Test case (1.11 KB, text/plain)
2008-03-25 13:11 UTC, Markus Amsler
drm log where system didn't freeze. (104.28 KB, text/plain)
2008-03-25 13:17 UTC, Markus Amsler
drm log with RADEON_FIFO_DEBUG (313.40 KB, text/plain)
2008-03-26 06:17 UTC, Markus Amsler
mesa patch (2.60 KB, patch)
2008-03-26 17:37 UTC, Markus Amsler
r300: Correctly translate the value for the R300_CMD_WAIT command. (2.28 KB, patch)
2008-03-29 10:53 UTC, Oliver McFadden
Emitting wait (2.67 KB, text/plain)
2008-03-29 15:32 UTC, Markus Amsler
Added masks for CP_RB_CNTL (1.01 KB, patch)
2008-03-31 20:31 UTC, Oliver McFadden
Wait forever fix (2.31 KB, patch)
2008-04-02 03:34 UTC, Markus Amsler
Handle error in BEGIN_RING macro (8.10 KB, patch)
2008-04-02 06:01 UTC, Markus Amsler

Description Markus Amsler 2008-03-25 13:11:50 UTC
Created attachment 15455 [details]
Test case

The attached test locks up my RV370 [Radeon X300SE] within 1 second or less.
It locks up in 3 possible ways:
  - machine up, xorg 100%, hw cursor works
  - machine up, xorg 100%, no hw cursor
  - system freeze
If I export RADEON_DEBUG=all, the lockup goes away.
Also attached is a drm log from a run where the system didn't freeze.
It only locks up if the tris variable in the test case is over 300.
I run everything from git (xorg, drm, mesa, xf86-video-ati)
Kernel: stock debian unstable 2.6.24-1-686

Compile test with
  gcc -std=c99 texsub.c -lglut -o texsub

I tried to fix it myself without any luck.
Comment 1 Markus Amsler 2008-03-25 13:17:05 UTC
Created attachment 15456 [details]
drm log where system didn't freeze.
Comment 2 Oliver McFadden 2008-03-26 04:51:08 UTC
I just tested this on my development box (r300-vertprog-branch, ATI Technologies
Inc RV410 [Radeon X700 Pro (PCIE)]) and I can reproduce the lockup.

The first run executed fine, but the second hard-locked. I looked at your
debug log and it looks like it dies somewhere during processing of an indirect
buffer, then just sits in a loop waiting for the CP to idle. Probably that
indirect buffer does something it shouldn't...

Mar 25 15:16:53 debian kernel: [drm:drm_unlocked_ioctl] pid=3090, cmd=0xc010644d, nr=0x4d, dev 0xe200, auth=1
Mar 25 15:16:53 debian kernel: [drm:radeon_cp_indirect] idx=11 s=0 e=112 d=0
Mar 25 15:16:53 debian kernel: [drm:radeon_cp_dispatch_indirect] buf=11 s=0x0 e=0x70
Mar 25 15:16:53 debian kernel: [drm:drm_unlocked_ioctl] pid=3090, cmd=0x6444, nr=0x44, dev 0xe200, auth=1
Mar 25 15:16:53 debian kernel: [drm:radeon_cp_idle]
Mar 25 15:16:53 debian kernel: [drm:radeon_do_cp_idle]
Mar 25 15:16:53 debian kernel: [drm:drm_unlocked_ioctl] ret = -16
Mar 25 15:16:53 debian kernel: [drm:drm_unlocked_ioctl] pid=3090, cmd=0x6444, nr=0x44, dev 0xe200, auth=1
...

Defining RADEON_FIFO_DEBUG to 1 at the top of radeon_cp.c will give some more
information, though not a lot; still useful though.
Comment 3 Markus Amsler 2008-03-26 06:17:14 UTC
Created attachment 15472 [details]
drm log with RADEON_FIFO_DEBUG
Comment 4 Markus Amsler 2008-03-26 17:37:02 UTC
Created attachment 15493 [details] [review]
mesa patch

This patch makes the lockup go away. I have absolutely no clue why. I just knew RADEON_DEBUG=sync fixes the lockup, so I peppered the source with radeonWaitForIdle calls until I found this hack.
Comment 5 Jerome Glisse 2008-03-27 07:39:36 UTC
(In reply to comment #4)
> Created an attachment (id=15493) [details]
> mesa patch
> 
> This patch makes the lockup gone. I have absolutely no clue why. I just knew
> RADEON_DEBUG=sync fixes the lockup. So I peppered the source with
> radeonWaitForIdle until I found this hack
> 

This means the lockup is due to a sync problem. My wild guess is that we write a register that shouldn't be written before a flush (like changing the vertex program
while there are still fragments being rendered with the old vertex program). I think we could apply the patch you posted as a temporary fix.
Comment 6 Markus Amsler 2008-03-27 10:52:37 UTC
It has a 30% performance impact on Wow, but only 1% on glxgears. It also doesn't fix the Wow lockup I was after in the first place. It's also possible that it doesn't fix the bug but only changes the timing, so the bug doesn't get triggered. I'm going to investigate this further.
Comment 7 Michel Dänzer 2008-03-27 11:19:26 UTC
Synchronously waiting for idle on each flush is a pretty large hammer... does just emitting wait for idle commands to the CP instead help as well?
Comment 8 Oliver McFadden 2008-03-27 22:23:03 UTC
I would rather *NOT* see this patch committed to Mesa, as it really just hides
the issue: we're doing something wrong, which should be fixed.

This must be a sync problem because more aggressive flushing can solve the
problem, though it's not the correct solution.

A discussion on IRC between MrCooper and glisse mentioned the NQWAIT_UNTIL
(0xe50) register. I've seen the blob write this register in Revenge dumps, but
it's never used by our drivers.

I wouldn't mind getting some comments from AMD describing the precisely correct
way to flush and synchronize the ASIC. We're obviously missing something. I
guess tcore might provide this information once it's released...
Comment 9 Markus Amsler 2008-03-28 13:56:19 UTC
Emitting wait doesn't help.
Comment 10 Michel Dänzer 2008-03-28 15:11:24 UTC
(In reply to comment #9)
> Emitting wait doesn't help.

What exactly did you try?
Comment 11 Oliver McFadden 2008-03-29 10:53:25 UTC
Created attachment 15563 [details] [review]
r300: Correctly translate the value for the R300_CMD_WAIT command.

This patch fixes the lockup caused by the test case on this bug. I've explained
it in the commit log of the patch, but I'd like to get a couple of reviews
before pushing.

Markus, could you please test this patch and report your results? Thanks.
Comment 12 Markus Amsler 2008-03-29 15:32:57 UTC
Created attachment 15568 [details]
Emitting wait

(In reply to comment #10)
> (In reply to comment #9)
> > Emitting wait doesn't help.
> 
> What exactly did you try?
> 
That.

Lockups are still present here with git drm.

drm/r300: fix wait interface mixup
r300: Correctly translate the value for the R300_CMD_WAIT command.

Had no noticeable effects here :(
Comment 13 Oliver McFadden 2008-03-30 09:32:23 UTC
Are you sure you're using the latest DRM from Git? I've already merged the
R300_CMD_WAIT patch, so you don't need to worry about manually patching anymore;
just compile and install DRM from Git.

This allowed me to run your "texsub" test-lockup program several times, even
for several minutes, without any lockups whatsoever. I also tried with glxgears
running in the background.
Comment 14 Markus Amsler 2008-03-30 15:55:11 UTC
Yeah, I'm sure; I checked twice. The lockup always hits within 2-3 seconds.
Currently my favorite theory is that it has to do with the command buffer running out of space. That would explain the 2-3 seconds (the time to fill the command buffer) and why it only occurs on certain systems (a fast CPU with a slow GPU should be more likely to freeze).
Comment 15 Michel Dänzer 2008-03-31 04:04:58 UTC
(In reply to comment #14)
> Currently my favorite theory is that it has to do with the command buffer
> running out of space.

If you mean the ring buffer, I think that's unlikely for various reasons, but it should be easy to verify either way by enabling/adding appropriate debugging output in the DRM.
Comment 16 Markus Amsler 2008-03-31 18:53:10 UTC
Yeah, I meant the ring buffer. My theory wasn't that bad after all :)

It's definitely a regression introduced with

commit 31815730732a5d2a446aa316a5b4d837766762e6
Author: Oliver McFadden <z3ro.geek@gmail.com>
Date:   Fri Jun 8 19:40:57 2007 +0000

    r300: Added the CP maximum fetch size and ring rptr update variables.

Comment 17 Oliver McFadden 2008-03-31 20:30:22 UTC
Sorry about that; the patch was made before we had the documentation. You can
try the patch I'm about to attach, which adds the correct masks, but it might
not fix this. It's possible we're not calculating the values correctly, but I
need to take a closer look at the documentation for that.
Comment 18 Oliver McFadden 2008-03-31 20:31:27 UTC
Created attachment 15593 [details] [review]
Added masks for CP_RB_CNTL
Comment 19 Oliver McFadden 2008-03-31 21:04:01 UTC
Btw, I just did another test with the test program from this bug report, and
I've found something interesting.

I'm using a GARTSize of 64 and I've reverted my CP_RB_CNTL commit to the DRM. I
can run the demo perfectly fine with "tris" set to 10 or 100, but when I set it
to 1000 (the default) the demo will only run for a few seconds before it locks
up with (what looks like) a flash of a random red colored triangle.

Anyway, I have no problem reverting my CP_RB_CNTL commit since it seems to break
some things, but even with this commit reverted, there still seems to be some
bug...
Comment 20 Oliver McFadden 2008-03-31 21:11:52 UTC
Another thing I just noticed: on fglrx this demo appears to render a single
solid triangle (e.g., there isn't any depth fighting) but on R300 DRI there is a
lot of depth fighting. So clearly we're not disabling the Z block correctly...
Comment 21 Markus Amsler 2008-04-01 01:15:59 UTC
The mask patch didn't help.
With 318157.. reverted, texsub runs fine with tris=1000. I notice the pattern in the triangles too.
I looked at RADEON_CP_RB_CNTL:
With 318157..    : 0x00040911
Without 318157.. : 0x00000011

I searched the documentation for RADEON_CP_RB_CNTL but didn't find it. In fact, I didn't find any ring-buffer-related registers. I'm also wondering where the shift/mask values come from.

Comment 22 Roland Scheidegger 2008-04-01 05:23:51 UTC
I fail to see what's wrong with this commit. Clearly we shouldn't update the ring pointer for every command read (which just causes almost as much data traffic for writing the ring pointer as for reading commands at least in some cases), and increasing fetch size should be fine too - I'm sure fglrx does that too. Maybe you could try doing only one or the other to see what's problematic. I suspect though the reason this commit breaks things is somewhat more complicated (e.g. the small delays introduced by writing ring ptr could fix things).
RADEON_CP_RB_CNTL is documented in older docs - I guess it's not in the new docs because it's not a 3d reg. There is a section on ring buffer management in the r5xx docs (which is likely exactly the same as for r3xx), though without the register defines it seems. Masks should be unnecessary (but certainly don't hurt) since the values calculated don't exceed the maximum anyway.
Comment 23 Markus Amsler 2008-04-01 07:49:50 UTC
You're right, it's the ring pointer update size. Values less than or equal to 8 work fine; 9 or greater causes crashes. Perhaps increasing it further would cause crashes on other systems too.
The fetch size has no effect on crashes.
Comment 24 Markus Amsler 2008-04-01 09:42:10 UTC
Disabling writeback with no_wb=1 also makes the lockup disappear. And no, it's not an April Fools' joke :).
Resolving this should be easy now.
Comment 25 Markus Amsler 2008-04-01 12:27:49 UTC
I now know what's happening. If the ring buffer gets low on space, radeon_wait_ring is called. If writeback is enabled, radeon_wait_ring can run into the -EBUSY timeout, because it's possible that during the timeout the CP won't update the ring pointer (a few slow GPU commands in the ring buffer). The -EBUSY timeout isn't handled at all, so the ring buffer gets overwritten, with undefined results.
I'm still thinking of a clever way to fix this.
Comment 26 Roland Scheidegger 2008-04-01 13:08:32 UTC
(In reply to comment #25)
> I know now whats happening. If the ring buffer gets low on space
> radeon_wait_ring is called. If writeback is enabled radeon_wait_ring can run
> into the -EBUSY timeout, because it's possible that during the timeout the CP
> won't update the ring pointer (few slow gpu commands in the ring buffer). Now
> the -EBUSY timeout isn't handled at all, so the ring buffer gets overwritten,
> with undefined result.
> I'm still thinking of a clever way to fix this.
Ahhh, timeouts... This is actually tricky. Just ignoring the error and overwriting the ring is clearly a no-no; maybe we should just repeat until there's no error (though not quite indefinitely, as the chip may have locked up). Or we could just increase the timeout, but that's ugly too.
Note that you could hit the same error case even if you updated the ring ptr for every quadword read, since a single command can run for a very, very long time (certainly with indirect buffers). Or if you wanted to submit a large chunk at once and the ring ptr simply didn't advance enough before the timeout hit. (That's about what I had in mind when I mentioned it might just hide a real bug...)
Comment 27 Oliver McFadden 2008-04-01 22:19:10 UTC
I would rather we just disable the writeback than add some workaround (increasing the timeout, etc.); we've got enough problems that I'd rather not add more workarounds.
Comment 28 Michel Dänzer 2008-04-02 01:48:07 UTC
(In reply to comment #27)
> I would rather we just disable the writeback than adding some work around[...]

Disabling writeback is just a workaround...
Comment 29 Michel Dänzer 2008-04-02 02:14:48 UTC
I think the best intermediate solution would be to increase the timeout. In the long run, the error should be propagated and handled properly everywhere. The lockup detection should also be improved to watch for progress of the ring and indirect buffers though to avoid false positives (and resetting the CP or whatever, which usually just makes things worse).
Comment 30 Markus Amsler 2008-04-02 03:34:50 UTC
Created attachment 15627 [details] [review]
Wait forever fix

Currently I have this patch in my tree. It's not intended for inclusion, just a hack so I can play Wow :)
I'm a bit surprised that the error handling is in such bad shape, at least in the radeon code. Looks like a good janitorial task for someone like me. In most places you just have to check for an error and pass it up.
Comment 31 Markus Amsler 2008-04-02 06:01:37 UTC
Created attachment 15628 [details] [review]
Handle error in BEGIN_RING macro

Is this an acceptable patch? Returning from within the BEGIN_RING macro is a bit ugly, but simple. Instead of lockups I now get

drmRadeonCmdBuffer: -16

in the shell and 

[drm:r300_emit_vpu] *ERROR* radeon_wait_ring failed in r300_emit_vpu:387
[drm:r300_do_cp_cmdbuf] *ERROR* r300_emit_vpu failed

in dmesg. It works even though in many places the error is not propagated; that's probably because the error gets caught by a subsequent BEGIN_RING.
Comment 32 Markus Amsler 2008-04-03 06:28:18 UTC
Just did some extensive testing with the timeout increased by a factor of 10. No crashes whatsoever. It looks like all the Wow-related lockups on my machine were because of this bug. I also think many bug reports about r300 lockups are dupes of this one (easy to check: set no_wb=1).
Could someone with commit rights in xf86-video-ati increase the default timeout from 10000 to 100000?
Comment 33 Michel Dänzer 2008-04-07 09:27:22 UTC
(In reply to comment #32)
> Could someone with commit right in xf86-video-ati increase the default timeout
> from 10000 to 100000?

Done.


On the DRM side, I now think the best next step would be to fix the timeout loops to never spuriously time out while the GPU is actually making progress on the ring buffer or indirect buffer.
Comment 34 Markus Amsler 2008-04-07 15:38:50 UTC
Thanks. Lockup is gone.

(In reply to comment #33)
> On the DRM side, I now think the best next step would be to fix the timeout
> loops to never spuriously time out while the GPU is actually making progress on
> the ring buffer or indirect buffer.

Or improve the error handling.
Comment 35 Michel Dänzer 2008-04-08 01:03:34 UTC
(In reply to comment #34)
> Or improve the error handling.

That's tricky to get right though, and can cause different problems. E.g. if something gets an error return code and then tries to reset the CP while it's actually still chugging along happily, that can itself cause a lockup or other problems. But if the GPU is still making progress, there's no reason to return an error in the first place.
Comment 36 Arren Lex 2008-05-18 18:17:47 UTC
Interestingly enough, I don't experience a lockup when running this test case on an X300SE, either before or after this patch (commit 0a96173cc38e506728d4c3f2dd383ba56e856578). However, the patch for this bug causes the mouse cursor to lag and jump like crazy, whereas it was smooth before (even though the rest of my system became significantly less responsive). Does this matter?