Bug 24535 - [KMS] KDE 4.3 causes lock up on RV620 (M82)
Summary: [KMS] KDE 4.3 causes lock up on RV620 (M82)
Status: RESOLVED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Radeon (show other bugs)
Version: unspecified
Hardware: Other All
: medium normal
Assignee: Default DRI bug account
QA Contact:
URL:
Whiteboard:
Keywords:
: 24587 (view as bug list)
Depends on:
Blocks:
 
Reported: 2009-10-14 14:34 UTC by Rafał Miłecki
Modified: 2009-11-04 15:23 UTC (History)
2 users (show)

See Also:
i915 platform:
i915 features:


Attachments
make vtx buffer 32k like non-kms (670 bytes, patch)
2009-10-19 16:14 UTC, Alex Deucher
no flags Details | Splinter Review
Dumps of ring in several lock up states (10.79 KB, application/bzip2)
2009-10-21 08:49 UTC, Rafał Miłecki
no flags Details
dmesg from clean drm-next (53.37 KB, text/plain)
2009-11-02 05:55 UTC, Rafał Miłecki
no flags Details
dmesg with glisse's patch applied (53.58 KB, text/plain)
2009-11-02 05:56 UTC, Rafał Miłecki
no flags Details
r600_ring_info from debugfs with glisse's patch (3.32 KB, text/plain)
2009-11-02 06:02 UTC, Rafał Miłecki
no flags Details
dmesg with hacked v2 of glisse's patch (53.58 KB, text/plain)
2009-11-02 06:34 UTC, Rafał Miłecki
no flags Details
dmesg with v2 of glisse's patch (54.00 KB, text/plain)
2009-11-02 08:17 UTC, Rafał Miłecki
no flags Details
dmesg with v4 of glisse's patch (54.22 KB, text/plain)
2009-11-02 09:18 UTC, Rafał Miłecki
no flags Details
Alex's patch little modified (1.31 KB, text/plain)
2009-11-02 11:53 UTC, Rafał Miłecki
no flags Details
dmesg with v5 of glisse's patch (53.84 KB, text/plain)
2009-11-02 12:03 UTC, Rafał Miłecki
no flags Details
don't RMW cp_rb_cntl (1.41 KB, patch)
2009-11-02 12:41 UTC, Alex Deucher
no flags Details | Splinter Review
patch for kernel (2.73 KB, patch)
2009-11-02 13:21 UTC, Alex Deucher
no flags Details | Splinter Review

Description Rafał Miłecki 2009-10-14 14:34:10 UTC
First I reported but #24467 but it was incorrect.

I was successfully using openSUSE 11.1 (KDE 3.5.x) with KMS. Probably about 30 KDE 3.5 sessions without any lock up. Then I switched to openSUSE 11.2 milestone 6 (KDE 4.3.0) and run *one* KMS X session, it was fine. Next switched to 11.2 milestone 8 and experienced lock up.

First I thought it was due to regressions in drm/radeon and even tried to trace it in bug #24467. But that was just matter of (un)luck. One KDE 4.3 sessions for 20 tries works fine. I even reinstalled milestone 6 and confirmed it's freezes. Also tried going back to commits tested with openSUSE 11.1 and no luck here.

So finally, it seems KDE 4.3 (both .0 and .1) do some operation that locks up my machine. This is not likely to be Mesa's r600 related, as I removed this driver.

Also I've tried starting plain X, playing movie with mplayer in it, starting "konsole" (KDE4's app), xclock and xeyes. It was working fine, no lock up. Then I started "kwin" (KDE4's window manager/decorator) and that causes lock up seconds after starting it.
Comment 1 Rafał Miłecki 2009-10-19 15:58:14 UTC
I've tried disabling Composite, DFS and UTS, but didn't help.

Then I've hacked xf86-video-ati to disable Solid (return FALSE in R600PrepareSolid). No more lock ups with this hack (checked with 3 reboots).
Comment 2 Alex Deucher 2009-10-19 16:14:45 UTC
Created attachment 30577 [details] [review]
make vtx buffer 32k like non-kms

Does this patch help at all?  I suspect there's a problem with the default state and prepare state when we run out of vtx buffer space.  This should match things up more closely with non-kms and help id the problem, but isn't a real solution.
Comment 3 Rafał Miłecki 2009-10-19 22:26:43 UTC
(In reply to comment #2)
> Created an attachment (id=30577) [details]
> make vtx buffer 32k like non-kms
> 
> Does this patch help at all?  I suspect there's a problem with the default
> state and prepare state when we run out of vtx buffer space.

It does not :/
Comment 4 Rafał Miłecki 2009-10-21 08:48:13 UTC
I rebooted once more day later and I got locks up again. Even with Solid disabled :| That's seems I got luck again when testing this in 3 reboots in row.

So turning off solid does not fix issue.
Comment 5 Rafał Miłecki 2009-10-21 08:49:43 UTC
Created attachment 30600 [details]
Dumps of ring in several lock up states

I've made some dumps of rings using debugfs (and ssh).
Comment 6 Rafał Miłecki 2009-10-21 08:55:27 UTC
The weird thing about some dumps is that ring seems to be... smaller than normally. AFAIK this never should happen. I copy head of each ring log from attached archive:

==> ring1.txt <==             
CP_STAT 0x80008241            
CP_RB_WPTR 0x00000010         
CP_RB_RPTR 0x00040790         
1920 free dwords in ring      
1920 dwords in ring           
r[1936]=0x0000002b            
r[1937]=0x80000000            
r[1938]=0x80000000            

==> ring2.txt <==
CP_STAT 0x80000241
CP_RB_WPTR 0x00000010
CP_RB_RPTR 0x000403d8
968 free dwords in ring
968 dwords in ring     
r[0984]=0x00000256     
r[0985]=0x00000011     
r[0986]=0xc0002a00     

==> ring3.txt <==
CP_STAT 0x80000241
CP_RB_WPTR 0x00000010
CP_RB_RPTR 0x000400b8
168 free dwords in ring
168 dwords in ring     
r[0184]=0x00000256     
r[0185]=0x00000011     
r[0186]=0xc0002a00     

==> ring4.txt <==
CP_STAT 0x800280c1
CP_RB_WPTR 0x00000010
CP_RB_RPTR 0x00040538
1320 free dwords in ring
1320 dwords in ring
r[1336]=0xe400000c
r[1337]=0x40240054
r[1338]=0x00000000

==> ring5.txt <==
CP_STAT 0x00000000
CP_RB_WPTR 0x00000000
CP_RB_RPTR 0x00000000
262128 free dwords in ring
0 dwords in ring
r[0000]=0xc0023200

==> ring6.txt <==
CP_STAT 0x00000000
CP_RB_WPTR 0x00000000
CP_RB_RPTR 0x00000000
262128 free dwords in ring
0 dwords in ring
r[0000]=0xc0023200

==> ring7.txt <==
CP_STAT 0x00000000
CP_RB_WPTR 0x00000000
CP_RB_RPTR 0x00000000
262128 free dwords in ring
0 dwords in ring
r[0000]=0xc0023200

==> ring8.txt <==
CP_STAT 0xffffffff
CP_RB_WPTR 0xffffffff
CP_RB_RPTR 0xffffffff
262127 free dwords in ring
0 dwords in ring
r[262143]=0x80000000

==> ring.normal.txt <==
CP_STAT 0x00000000
CP_RB_WPTR 0x00000530
CP_RB_RPTR 0x00000530
262144 free dwords in ring
0 dwords in ring
r[1328]=0x24242c23

In first logs you can see decreased size of ring and equal amount of dwords in ring and free dwords. Next weird thing.

I did not have time yet to check GRBM_STATUS register on GPU (as suggested by Alex). I was told to check it's value and search for it in "r600_reg* files in ddx and also r600_demo". Purpose is "to see what blocks are busy when it hangs".
Comment 7 Stefano Carignano 2009-10-22 04:42:21 UTC
Out of curiosity, are you using a preemptive kernel ? Do you observe a change in the behavior/survival rate when enabling preempt ? As I mentioned in bug #24587, enabling a preemptive kernel for me reduces the chance of survival to ~ 5% (I won't say 0 because I managed to have a working session with it once, out of 10-15 tries)

Also, does suspend/resume work for you ? Or you too upon resume get the same lockup after a few seconds ?

Is there something I could do to help debugging this thing a bit? 
Comment 8 Rafał Miłecki 2009-10-28 13:00:09 UTC
(In reply to comment #7)
> Out of curiosity, are you using a preemptive kernel ? Do you observe a change
> in the behavior/survival rate when enabling preempt ?

I do:
> grep "PREEMPT" .config
# CONFIG_PREEMPT_RCU is not set
# CONFIG_PREEMPT_RCU_TRACE is not set
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
# CONFIG_DEBUG_PREEMPT is not set
# CONFIG_PREEMPT_TRACER is not set
didn't try without PREEMPT yet. Thanks for tip.


> Also, does suspend/resume work for you ? Or you too upon resume get the same
> lockup after a few seconds ?

I didn't manage to have X running long enough to try that.


> Is there something I could do to help debugging this thing a bit? 

That's probably question to developers (Alex maybe?)... Maybe you could dump value of mentioner register while lock up, no more ideas from me.
Comment 9 Rafał Miłecki 2009-10-28 14:43:10 UTC
Before lock up (in console and in X - the same):
0x8010: 0x00003030
0x8014: 0x00000003

While lock up:
0x8010: 0xA0003030
0x8014: 0x00000003
Comment 10 Rafał Miłecki 2009-10-28 15:36:30 UTC
Huh, noticed something new, I am sure we will solve that issue soon now :)

In my case rdev->cp.ring_size is 1'048'576. That means 1'048'576/4=262'144 dwords. And... 262'144 == 0x40000!

As you can see in ring info dumps lock up often happens when CP_RB_RPTR just over 0x00040000 (like 0x00040790).

There are reasons why this may happen according to my current knowledge/understanding:
1) GPU thinks ring is bigged and keeps reading over 0x00040000 while it should not
2) Driver thinks ring is 0x00040000 while it's bigger. As the result we leave some ring space over 0x00040000 untouched (with garbages maybe) and start writing from 0x0 again. GPU sees it didn't achieve read pointer and tries to read more... from space over 0x00040000 which contains garbages
Comment 11 Rafał Miłecki 2009-10-28 17:24:20 UTC
Another interesting thing:

# cat r600_ring_info
CP_STAT 0x00000000
CP_RB_WPTR 0x00000000 0x00000010
CP_RB_RPTR 0x00000000 0x00000000
262128 free dwords in ring
16 dwords in ring
r[0000]=0xc0023200
r[0001]=0x101e1000
r[0002]=0x00000000
r[0003]=0x000002a0
r[0004]=0xc0016800
r[0005]=0x00000140
r[0006]=0x00002f10
r[0007]=0x80000000
r[0008]=0x80000000
r[0009]=0x80000000
r[0010]=0x80000000
r[0011]=0x80000000
r[0012]=0x80000000
r[0013]=0x80000000
r[0014]=0x80000000
r[0015]=0x80000000
r[0016]=0xc0016800
r[0017]=0x00000141
r[0018]=0xdeadbeef
...

show I ever see 0xdeadbeef? Isn't this some test value of writeback of sth?
Comment 12 Rafał Miłecki 2009-11-01 05:32:49 UTC
I can confirm lock up happens only when GPU reads commands about 0x00040000 position. Moreover when it do one round of reading (0x0 → 0x40000) without lock up, this won't lock up in any further reading of ~0x40000. Rendering is stable then.
Comment 13 Rafał Miłecki 2009-11-02 05:55:33 UTC
Created attachment 30913 [details]
dmesg from clean drm-next
Comment 14 Rafał Miłecki 2009-11-02 05:56:26 UTC
Created attachment 30914 [details]
dmesg with glisse's patch applied

I've applied http://people.freedesktop.org/~glisse/0001-r600-hack-to-test-ring-buffer.patch

[drm] RPTR/WPTR before schedule 0x000000FE 0x000000FE
[drm:r600_ib_test] *ERROR* radeon: fence wait failed (-16).
[drm:r600_init] *ERROR* radeon: failled testing IB (-16).
Comment 15 Rafał Miłecki 2009-11-02 06:02:40 UTC
Created attachment 30915 [details]
r600_ring_info from debugfs with glisse's patch
Comment 16 Rafał Miłecki 2009-11-02 06:34:47 UTC
Created attachment 30916 [details]
dmesg with hacked v2 of glisse's patch

Added registers dumps on "radeon: fence wait failed".
Comment 17 Rafał Miłecki 2009-11-02 08:17:38 UTC
Created attachment 30919 [details]
dmesg with v2 of glisse's patch

[drm] GART: num cpu pages 131072, num gpu pages 131072
[drm] 1530 0x00000000 CP_RB_CNTL  (1024 ring size 256 dw)
[drm] 1534 0x00000000 CP_RB_CNTL  (1024 ring size 256 dw) mask: 0x000000FF
[drm] 1282 0x08000F03 CP_RB_CNTL  (1024 ring size 256 dw)
[drm] ring test succeeded in 1 usecs
[drm] radeon: ib pool ready.
[drm] RPTR/WPTR before schedule 0x000000FE 0x000000FE
phy0: Selected rate control algorithm 'iwl-agn-rs'
[drm:r600_ib_test] *ERROR* radeon: fence wait failed (-16).
[drm] 1788 0x08000F03 CP_RB_CNTL  (1024 ring size 256 dw)
REGISTER: CP_STAT : 0x80000645
REGISTER: CP_RB_RPTR : 0x000001B0
REGISTER: CP_RB_WPTR : 0x00000005
REGISTER: CP_RB_CNTL : 0x08000F03
[drm:r600_init] *ERROR* radeon: failled testing IB (-16).
Comment 18 Rafał Miłecki 2009-11-02 08:45:33 UTC
(In reply to comment #17)
> Created an attachment (id=30919) [details]
> dmesg with v2 of glisse's patch

I've meant v3 here, sorry.
Comment 19 Rafał Miłecki 2009-11-02 09:18:04 UTC
Created attachment 30920 [details]
dmesg with v4 of glisse's patch

[drm] GART: num cpu pages 131072, num gpu pages 131072
[drm] 1530 0x00000000 CP_RB_CNTL  (1024 ring size 256 dw)
[drm] 1534 0x00000000 CP_RB_CNTL  (1024 ring size 256 dw) mask: 0x000000FF
[drm] 1282 0x08000F03 CP_RB_CNTL  (1024 ring size 256 dw)
[drm] ring test succeeded in 1 usecs
[drm] radeon: ib pool ready.
[drm] RPTR/WPTR before schedule 0x000000FE 0x000000FE
[drm:r600_ib_test] *ERROR* radeon: fence wait failed (-16).
[drm] 1788 0x08000F03 CP_RB_CNTL  (1024 ring size 256 dw)
REGISTER: CP_STAT : 0x80000241
REGISTER: CP_RB_RPTR : 0x000001B0
REGISTER: CP_RB_WPTR : 0x00000005
REGISTER: CP_RB_CNTL : 0x08000F03
[drm:r600_init] *ERROR* radeon: failled testing IB (-16).
Comment 20 Rafał Miłecki 2009-11-02 11:53:58 UTC
Created attachment 30923 [details]
Alex's patch little modified

I've modified http://www.botchco.com/alex/xorg/dump_rb_cntl_and_reset.diff a little. Result:

[drm] GART: num cpu pages 131072, num gpu pages 131072
[drm]             predicted: 0x08000911
[drm]                actual: 0x08000f03
[drm] i'll write: (0x08000000 | 0x00000900 | 0x00000011)
[drm]  actual after writing: 0x08000911
[drm]    actual after reset: 0x08000911
[drm] actual after CP start: 0x08000911

So we write value to CNTL and it looks alright when we read it back. But it seems the lacking part was CP reset. With patch applied it worked in 4 boots in a row. 2 warm, 2 cold.
Comment 21 Rafał Miłecki 2009-11-02 12:03:29 UTC
Created attachment 30925 [details]
dmesg with v5 of glisse's patch

[drm] GART: num cpu pages 131072, num gpu pages 131072
[drm] 1533 0x00000000 CP_RB_CNTL  (1024 ring size 256 dw)
[drm] 1537 0x00000000 CP_RB_CNTL  (1024 ring size 256 dw) mask: 0x000000FF
[drm] 1271 0x08000F03 CP_RB_CNTL  (1024 ring size 256 dw) rbbufsz 7
[drm] 1279 0x08000907 CP_RB_CNTL  (1024 ring size 256 dw)
[drm] ring test succeeded in 1 usecs
[drm] radeon: ib pool ready.
[drm] RPTR/WPTR before schedule 0x000000FE 0x000000FE
Comment 22 Alex Deucher 2009-11-02 12:41:03 UTC
Created attachment 30926 [details] [review]
don't RMW cp_rb_cntl

As pointed out by Andre and IRC, we don't RMW CP_RB_CNTL in the non-kms path.
Comment 23 Rafał Miłecki 2009-11-02 13:12:59 UTC
(In reply to comment #22)
> Created an attachment (id=30926) [details]
> don't RMW cp_rb_cntl
> 
> As pointed out by Andre and IRC, we don't RMW CP_RB_CNTL in the non-kms path.

Seems to work, tested this in three boots.
Comment 24 Rafał Miłecki 2009-11-02 13:16:57 UTC
OK, so to sum up this. Reading CP_RB_CNTL just after writing it is a bad idea. Last patch from Alex avoids that and it works fine with it applied.

I've also tried to put "mdelay(1)" after writing to CP_RB_CNTL and it worked as well.
Comment 25 Alex Deucher 2009-11-02 13:21:52 UTC
Created attachment 30930 [details] [review]
patch for kernel

The attached patch is what I'm pushing upstream.  It covers the r1xx-r5xx paths as well.
Comment 26 Rafał Miłecki 2009-11-04 15:19:13 UTC
Patch hit drm-next:

commit d6f28938d9426d12eea1578949f1d73d24ad37ec
Author: Alex Deucher <alexdeucher@gmail.com>
Date:   Mon Nov 2 16:01:27 2009 -0500

    drm/radeon/kms: Don't RMW CP_RB_CNTL
Comment 27 Rafał Miłecki 2009-11-04 15:23:18 UTC
*** Bug 24587 has been marked as a duplicate of this bug. ***


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.