Bug 51344 - massive corruption on RV410
Summary: massive corruption on RV410
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Radeon
Version: XOrg git
Hardware: x86 (IA32) Linux (All)
Importance: medium normal
Assignee: Christian König
Reported: 2012-06-22 11:14 UTC by Tormod Volden
Modified: 2012-10-15 20:47 UTC

Attachments
Xorg.0.log (42.71 KB, text/x-log)
2012-06-22 11:14 UTC, Tormod Volden
dmesg output (52.20 KB, text/plain)
2012-06-22 11:22 UTC, Tormod Volden
screenshot (no xorg.conf options) (75.23 KB, image/png)
2012-06-22 11:50 UTC, Tormod Volden
backport of Christian's patch (1.59 KB, patch)
2012-09-10 20:27 UTC, Tormod Volden
backport of Christian's v2 patch (1.43 KB, patch)
2012-09-11 18:42 UTC, Tormod Volden
Possible fix (1.76 KB, patch)
2012-09-12 11:49 UTC, Christian König

Description Tormod Volden 2012-06-22 11:14:45 UTC
Created attachment 63356
Xorg.0.log

This regression appeared in early May on drm-next, somewhere in the range 4f256e8..d3029b4, and is still present in 3.5-rc3 (and in current drm-next).

On-screen content is smeared out vertically, although the desktop background does not appear to be corrupted. Turning off "EXABitmaps" reduces the corruption.

I haven't done a git bisect, only a "download bisect" using the builds from http://kernel.ubuntu.com/~kernel-ppa/mainline/drm-next/: v3.4-rc6-295-g4f256e8 from May 8th was good and v3.4-rc6-315-gd3029b4 from May 10th was bad. Unfortunately the build from May 9th has since been deleted, so I cannot narrow it down further this way. The commits in question should therefore be:

d3029b4 drm/radeon/kms: fix warning on 32-bit in atomic fence printing
f2e3922 drm/radeon: make the ib an inline object
f237750 drm/radeon: remove r600 blit mutex v2
68470ae drm/radeon: move the semaphore from the fence into the ib
7c0d409 drm/radeon: immediately free ttm-move semaphore
c507f7e drm/radeon: rip out the ib pool
a8c0594 drm/radeon: simplify semaphore handling v2
c3b7fe8 drm/radeon: multiple ring allocator v3
0085c950 drm/radeon: use one wait queue for all rings add fence_wait_any v2
557017a drm/radeon: define new SA interface v3
2e0d991 drm/radeon: make sa bo a stand alone object
e6661a9 drm/radeon: keep start and end offset in the SA
711a972 drm/radeon: add sub allocator debugfs file
a651c55 drm/radeon: add proper locking to the SA v3
dd8bea2 drm/radeon: use inline functions to calc sa_bo addr
8a47cc9 drm/radeon: rework locking ring emission mutex in fence deadlock detection v2
3b7a2b2 drm/radeon: rework fence handling, drop fence list v7
bb63556 drm/radeon: convert fence to uint64_t v4
d6999bc drm/radeon: replace the per ring mutex with a global one
133f4cb drm/radeon: fix possible lack of synchronization btw ttm and other ring

01:00.0 VGA compatible controller [0300]: Advanced Micro Devices [AMD] nee ATI Radeon Mobility X700 (PCIE) [1002:5653]
Comment 1 Tormod Volden 2012-06-22 11:22:39 UTC
Created attachment 63357
dmesg output
Comment 2 Tormod Volden 2012-06-22 11:50:37 UTC
Created attachment 63359
screenshot (no xorg.conf options)
Comment 3 Tom Stellard 2012-06-24 10:57:40 UTC
Can you try to bisect this using git bisect and find the first bad commit?
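
For a sense of the effort involved: the description lists 20 candidate commits, so a bisect needs at most five test builds. The following is a minimal Python model of the binary search that git bisect performs, not git itself. It assumes the commits form a linear history listed newest first in the description, and that every commit after the first bad one is also bad. The `is_bad` predicate is hypothetical and stands in for "build this kernel, boot it, and look for corruption".

```python
# Candidate commits from the description, reordered oldest to newest
# (the report lists them newest first; linear history is an assumption).
CANDIDATES = [
    "133f4cb", "d6999bc", "bb63556", "3b7a2b2", "8a47cc9",
    "dd8bea2", "a651c55", "711a972", "e6661a9", "2e0d991",
    "557017a", "0085c950", "c3b7fe8", "a8c0594", "c507f7e",
    "7c0d409", "68470ae", "f237750", "f2e3922", "d3029b4",
]


def find_first_bad(commits, is_bad):
    """Binary search for the first bad commit.

    `is_bad` is a hypothetical callback standing in for a manual test.
    Returns (first_bad_commit, number_of_tests_performed).
    """
    lo, hi = 0, len(commits) - 1  # the first bad commit lies in [lo, hi]
    tests = 0
    while lo < hi:
        mid = (lo + hi) // 2
        tests += 1
        if is_bad(commits[mid]):
            hi = mid          # first bad commit is mid or older
        else:
            lo = mid + 1      # first bad commit is newer than mid
    return commits[lo], tests


# Worst case over every possible culprit in the list.
worst_case = max(
    find_first_bad(CANDIDATES, lambda c, k=k: CANDIDATES.index(c) >= k)[1]
    for k in range(len(CANDIDATES))
)
print(worst_case)  # 5
```

With git, the same search is driven by `git bisect start`, `git bisect good` and `git bisect bad`, with one kernel build and boot per step.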
Comment 4 Tormod Volden 2012-06-27 12:33:37 UTC
Sorry, I don't know when I will have time to do that. I'll try harder if the bug can be confirmed by other people too. Maybe the right developer can make an educated guess as to whether it's limited to this card.
Comment 5 Andrea 2012-08-27 20:00:41 UTC
Hi guys,

Could this be related to https://bugs.freedesktop.org/show_bug.cgi?id=54129?

I ended up in the same area of the git log.
Comment 6 Jerome Glisse 2012-08-27 20:26:28 UTC
Also, can you test whether booting with radeon.no_wb=1 fixes the issue?
Comment 7 Tormod Volden 2012-08-28 07:22:14 UTC
Thanks, will test this later. BTW I already tried http://people.freedesktop.org/~glisse/0001-drm-radeon-extra-type-safe-for-fence-emission.patch which came up on the dri-devel list, but that did not fix it.
Comment 8 Tormod Volden 2012-08-28 18:29:07 UTC
No, booting with radeon.no_wb=1 didn't help.
Comment 9 Tormod Volden 2012-09-10 20:27:54 UTC
Created attachment 66942
backport of Christian's patch

I tried backporting Christian's patch from https://bugs.freedesktop.org/show_bug.cgi?id=54129#c11 but it did not help either. I suppose the following /sys/kernel/debug/dri/0/radeon_fence_info output indicates that the patch took effect, since the emitted numbers are above 0x100000000LL?

--- ring 0 ---
Last signaled fence 0x000000020000149f
Last emitted  0x0000000100001a9a

--- ring 0 ---
Last signaled fence 0x000000020000149f
Last emitted  0x0000000100002041

--- ring 0 ---
Last signaled fence 0x000000020000149f
Last emitted  0x000000010000294a
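
One way to read these dumps is to split each 64-bit value into its upper and lower 32-bit halves and compare the two counters. The sketch below is an illustrative Python helper, not part of any radeon tooling. It assumes the debugfs text has exactly the layout shown in this comment, and the function names are made up for the example. In the first sample the last signaled value is larger than the last emitted one, which should not happen if fences only signal after being emitted, so the output above may point to inconsistent bookkeeping rather than a working fix.

```python
import re

# First sample from this comment, copied verbatim.
SAMPLE = """\
--- ring 0 ---
Last signaled fence 0x000000020000149f
Last emitted  0x0000000100001a9a
"""

# Assumed layout of one radeon_fence_info record (taken from the dump above).
PATTERN = re.compile(
    r"--- ring (\d+) ---\s+"
    r"Last signaled fence (0x[0-9a-fA-F]+)\s+"
    r"Last emitted\s+(0x[0-9a-fA-F]+)"
)


def parse_fence_info(text):
    """Return a list of (ring, last_signaled, last_emitted) tuples."""
    return [(int(r), int(s, 16), int(e, 16)) for r, s, e in PATTERN.findall(text)]


def describe(signaled, emitted):
    """Split both counters into 32-bit halves and flag an impossible ordering."""
    return {
        "signaled_high": signaled >> 32,
        "signaled_low": signaled & 0xFFFFFFFF,
        "emitted_high": emitted >> 32,
        "emitted_low": emitted & 0xFFFFFFFF,
        "signaled_ahead_of_emitted": signaled > emitted,
    }


for ring, signaled, emitted in parse_fence_info(SAMPLE):
    print(ring, describe(signaled, emitted))
```

For the sample above, the upper halves are 2 (signaled) and 1 (emitted), and the lower halves are 0x149f and 0x1a9a.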
Comment 10 Tormod Volden 2012-09-11 18:42:32 UTC
Created attachment 66986
backport of Christian's v2 patch

I tried backporting the v2 patch from http://lists.freedesktop.org/archives/dri-devel/2012-September/027608.html to kernel 3.5.2, see attached, but it did not help either. Maybe my card has another issue?

Output from /sys/kernel/debug/dri/0/radeon_fence_info

--- ring 0 ---
Last signaled fence 0x00000000deadbeef
Last emitted  0x0000000000000670

--- ring 0 ---
Last signaled fence 0x00000000deadbeef
Last emitted  0x0000000000000c44
Comment 11 Christian König 2012-09-12 09:55:57 UTC
(In reply to comment #10)
> Output from /sys/kernel/debug/dri/0/radeon_fence_info
> 
> --- ring 0 ---
> Last signaled fence 0x00000000deadbeef
> Last emitted  0x0000000000000670
> 
> --- ring 0 ---
> Last signaled fence 0x00000000deadbeef
> Last emitted  0x0000000000000c44

WTF? Well, that's very interesting information you've got us here, thanks a lot.

"deadbeef" is a pattern we usually use for ring and IB tests, and I have no idea how that ended up as the last signaled fence value.

Could you try Jerome's debugging patch (http://people.freedesktop.org/~glisse/0001-debug-fence-emission-reception.patch) and attach the resulting output?

Thx,
Christian.
Comment 12 Christian König 2012-09-12 11:49:46 UTC
Created attachment 67047
Possible fix
Comment 13 Christian König 2012-09-12 11:50:58 UTC
Please give the attached v3 version of the patch a try. It adds the last emitted fence as an upper limit, so it should be able to handle even "deadbeef" values.

Cheers,
Christian.
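
The upper-limit idea described in this comment can be modelled roughly as follows. This is an illustrative Python sketch under stated assumptions, not the kernel patch: it assumes the hardware reports only the lower 32 bits of the fence sequence, that the driver extends it to 64 bits using the previously signaled value, and that a reading above the last emitted sequence is simply ignored. The function name and the `use_upper_limit` switch are hypothetical and exist only to contrast the two behaviours.

```python
LOW32 = 0x00000000FFFFFFFF
HIGH32 = 0xFFFFFFFF00000000


def next_signaled(hw_value, last_signaled, last_emitted, use_upper_limit=True):
    """Extend a 32-bit hardware fence reading to 64 bits.

    Returns the new "last signaled" sequence, or `last_signaled` unchanged
    when the reading is rejected as implausible.
    """
    # Combine the 32-bit reading with the upper half of the last known value.
    seq = (hw_value & LOW32) | (last_signaled & HIGH32)
    if seq < last_signaled:
        # Assume the 32-bit counter wrapped: take the upper half from the
        # last emitted sequence instead.
        seq = (seq & LOW32) | (last_emitted & HIGH32)
    if seq <= last_signaled:
        return last_signaled  # nothing new has signaled
    if use_upper_limit and seq > last_emitted:
        return last_signaled  # cannot have signaled a fence never emitted
    return seq


# Values from comment 10: the hardware location reads 0xdeadbeef while only
# 0x670 fences have been emitted.
print(hex(next_signaled(0xDEADBEEF, 0, 0x670)))                         # 0x0
print(hex(next_signaled(0xDEADBEEF, 0, 0x670, use_upper_limit=False)))  # 0xdeadbeef
```

Without the upper limit, the bogus reading is accepted and every fence up to 0xdeadbeef looks signaled, which is consistent with the radeon_fence_info output in comment 10.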
Comment 14 Tormod Volden 2012-09-12 18:18:32 UTC
Yes, v3 works! I applied it to 3.5.2 by replacing rdev->fence_drv[ring].sync_seq[ring] with rdev->fence_drv[ring].seq, and there is no more corruption. The /sys/kernel/debug/dri/0/radeon_fence_info output is now in sync, or off by one:

--- ring 0 ---
Last signaled fence 0x0000000000002651
Last emitted  0x0000000000002652
--- ring 0 ---
Last signaled fence 0x0000000000002703
Last emitted  0x0000000000002704

Do you still want me to run the debug patch? It seems you are not sure where the 0xdeadbeef comes from, so there could be other bugs?
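
The "in sync, or off by one" observation can be checked numerically by subtracting the last signaled value from the last emitted one. The snippet below is a small Python sketch using the values quoted in this report. The helper names and the `max_pending` threshold are arbitrary choices for illustration, not driver constants.

```python
# (last signaled, last emitted) pairs copied from the outputs in this report.
AFTER_V3 = [(0x2651, 0x2652), (0x2703, 0x2704)]        # comment 14
BEFORE_V3 = [(0xDEADBEEF, 0x670), (0xDEADBEEF, 0xC44)]  # comment 10


def pending(signaled, emitted):
    """Fences emitted but not yet signaled; a negative result is inconsistent."""
    return emitted - signaled


def looks_consistent(signaled, emitted, max_pending=16):
    # max_pending is an arbitrary illustrative threshold.
    return 0 <= pending(signaled, emitted) <= max_pending


for signaled, emitted in AFTER_V3 + BEFORE_V3:
    print(hex(signaled), hex(emitted), looks_consistent(signaled, emitted))
```

Both samples taken with the v3 patch have exactly one fence pending, while both samples from comment 10 give a negative difference.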
Comment 15 Fabio Pedretti 2012-10-02 19:49:52 UTC
Applied in 3.5.5.
Comment 16 Christian König 2012-10-03 10:46:55 UTC
Great, sounds like we can close the bug now.
Comment 17 Florian Mickler 2012-10-15 20:47:38 UTC
A patch referencing this bug report has been merged in Linux v3.6-rc6:

commit f492c171a38d77fc13a8998a0721f2da50835224
Author: Christian König <deathsimple@vodafone.de>
Date:   Thu Sep 13 10:33:47 2012 +0200

    drm/radeon: make 64bit fences more robust v3

