Bug 76501

Summary: fences regression
Product: DRI
Reporter: Ortwin Glück <odi>
Component: DRM/Radeon
Assignee: Default DRI bug account <dri-devel>
Status: CLOSED FIXED
QA Contact:
Severity: normal
Priority: medium
CC: julien.isorce
Version: XOrg git
Hardware: x86-64 (AMD64)
OS: Linux (All)
Attachments:
Description        Flags
dmesg 3.12         none
dmesg 3.14         none
last good commit   none
first bad commit   none
Possible fix       none

Description Ortwin Glück 2014-03-23 09:47:02 UTC
Created attachment 96233 [details]
dmesg 3.12

I am seeing a GPU lockup from any v3.13 up to 3.14-rc7, which basically renders my computer unusable under recent kernels :-(

[   55.762710] radeon 0000:01:00.0: GPU lockup CP stall for more than 10000msec
[   55.762715] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000000004 last fence id 0x000000000000000 on ring 5)
[   55.762717] [drm:uvd_v1_0_ib_test] *ERROR* radeon: fence wait failed (-35).
[   55.762720] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on ring 5 (-35).
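
For anyone comparing the attached logs, here is a minimal sketch that pulls the fence values out of a "GPU lockup (waiting for ...)" line. The regular expression and the function name `parse_lockup` are assumptions made for illustration, based only on the message shapes quoted in this report (with and without the "on ring N" suffix); they are not taken from the kernel source.

```python
import re

# Pattern modelled on the lockup lines quoted in this report.
# The "on ring N" part is optional because not every log shown here has it.
LOCKUP_RE = re.compile(
    r"GPU lockup \(waiting for (0x[0-9a-fA-F]+) "
    r"last fence id (0x[0-9a-fA-F]+)(?: on ring (\d+))?\)"
)


def parse_lockup(line):
    """Return the fence fields of a lockup line, or None if it does not match."""
    match = LOCKUP_RE.search(line)
    if match is None:
        return None
    waiting, last, ring = match.groups()
    return {
        "waiting_for": int(waiting, 16),   # fence sequence number being waited on
        "last_fence": int(last, 16),       # last fence sequence number reported
        "ring": int(ring) if ring is not None else None,
    }


if __name__ == "__main__":
    sample = (
        "[   55.762715] radeon 0000:01:00.0: GPU lockup (waiting for "
        "0x0000000000000004 last fence id 0x000000000000000 on ring 5)"
    )
    print(parse_lockup(sample))
```

A last fence id of 0 next to a non-zero awaited value would suggest that no fence was ever signalled on that ring.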

Hardware is an iMac 11,2 with a Radeon 4670 M96XT (RV730), 256MB GDDR3. 

Working up to 3.12, broken as of 3.13.

Xorg comes up after some delays with a mostly black screen, some colored rectangular artifacts where the login fields are, and a working mouse cursor.

Console fb still works.


Bisected to this commit:

commit f9eaf9ae782d6480f179850e27e6f4911ac10227
Author: Christian König <christian.koenig@amd.com>
Date:   Tue Oct 29 20:14:47 2013 +0100

    drm/radeon: rework and fix reset detection v2

    Stop fiddling with jiffies, always wait for RADEON_FENCE_JIFFIES_TIMEOUT.
    Consolidate the two wait sequence implementations into just one function.
    Activate all waiters and remember if the reset was already done instead of
    trying to reset from only one thread.

    v2: clear reset flag earlier to avoid timeout in IB test
Comment 1 Ortwin Glück 2014-03-23 09:47:28 UTC
Created attachment 96234 [details]
dmesg 3.14
Comment 2 Ortwin Glück 2014-03-23 10:04:30 UTC
NB: the UVD init does not occur each time. But the "GPU lockup" message does.
Comment 3 Christian König 2014-03-23 15:29:58 UTC
Please provide a dmesg from commit f9eaf9ae782d6480f179850e27e6f4911ac10227 and from 1dac28eb726109e7ac256051b157baf60b21a5f7 as well.

Thanks in advance,
Christian.
Comment 4 Ortwin Glück 2014-03-24 20:43:31 UTC
Created attachment 96315 [details]
last good commit
Comment 5 Ortwin Glück 2014-03-24 20:43:58 UTC
Created attachment 96316 [details]
first bad commit
Comment 6 Ortwin Glück 2014-03-24 20:55:52 UTC
Interestingly, the last good commit also produces the following log:
[    7.573975] [drm] UVD initialized successfully.
[    7.574210] [drm] Enabling audio 0 support
[    7.574240] [drm] ib test on ring 0 succeeded in 0 usecs
[    7.574263] [drm] ib test on ring 3 succeeded in 0 usecs
[   17.730386] radeon 0000:01:00.0: GPU lockup CP stall for more than 10000msec
[   17.730390] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000000002 last fence id 0x0000000000000000)
[   17.730393] [drm:uvd_v1_0_ib_test] *ERROR* radeon: fence wait failed (-35).
[   17.730397] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on ring 5 (-35).


So that seems unrelated to the issue at hand.
Comment 7 Christian König 2014-03-25 10:44:41 UTC
Created attachment 96360 [details] [review]
Possible fix
Comment 8 Ortwin Glück 2014-03-25 10:49:44 UTC
Thanks, I will test the patch tonight.

I will also bisect to find the first commit that produces the GPU lockup (without visible artifacts), as that seems to me to be the real problem. Commit f9eaf9 probably only makes that bug visible.
Comment 9 Christian König 2014-03-25 10:52:27 UTC
(In reply to comment #6)
> Interestingly, the last good commit also produces the following log:
> [    7.573975] [drm] UVD initialized successfully.
> [    7.574210] [drm] Enabling audio 0 support
> [    7.574240] [drm] ib test on ring 0 succeeded in 0 usecs
> [    7.574263] [drm] ib test on ring 3 succeeded in 0 usecs
> [   17.730386] radeon 0000:01:00.0: GPU lockup CP stall for more than 10000msec
> [   17.730390] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000000002 last fence id 0x0000000000000000)
> [   17.730393] [drm:uvd_v1_0_ib_test] *ERROR* radeon: fence wait failed (-35).
> [   17.730397] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on ring 5 (-35).
> 
> 
> So that seems unrelated to the issue at hand.

Actually it is related, and now the behaviour makes perfect sense.

Somewhere between 3.12 and your "last good" commit we have a patch that breaks UVD IB testing. That isn't critical (3D still works fine) until the reset detection rework, because after that one we try to get the UVD ring working again with each new IOCTL made to the card.

Please give the attached patch a try. It clears the "needs_reset" flag if the IB test failed for some reason, so that if the initial bringup fails we won't try to get it working over and over again.

In addition, please bisect which commit breaks UVD IB testing between 3.12 and the "last good" commit and open a new bug report for that issue.

Thanks for the help,
Christian.
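
The behaviour described in comment 9 can be illustrated with a toy model. This is only a sketch of the explanation given in that comment, not the radeon driver code: the class `ToyDevice`, its methods, and the `clear_flag_on_failed_test` switch are all hypothetical names invented for this example.

```python
class ToyDevice:
    """Toy model of the "needs_reset" behaviour described in comment 9.

    All names here are hypothetical; this is not the radeon driver.
    """

    def __init__(self, uvd_ib_test_passes, clear_flag_on_failed_test):
        self.uvd_ib_test_passes = uvd_ib_test_passes
        self.clear_flag_on_failed_test = clear_flag_on_failed_test
        self.needs_reset = False
        self.reset_attempts = 0

    def ib_test(self):
        """Run the IB test; a failure is detected as a lockup and flags a reset."""
        if self.uvd_ib_test_passes:
            return True
        self.needs_reset = True
        if self.clear_flag_on_failed_test:
            # Idea of the "Possible fix": drop the flag after a failed test.
            self.needs_reset = False
        return False

    def bringup(self):
        """Initial bringup runs the IB test once."""
        self.ib_test()

    def ioctl(self):
        """Each IOCTL first checks whether a reset is pending."""
        if self.needs_reset:
            self.reset_attempts += 1
            self.needs_reset = False   # the reset itself clears the flag ...
            self.ib_test()             # ... but the re-run test may set it again
            return "stalled"
        return "ok"


if __name__ == "__main__":
    for patched in (False, True):
        dev = ToyDevice(uvd_ib_test_passes=False, clear_flag_on_failed_test=patched)
        dev.bringup()
        results = [dev.ioctl() for _ in range(3)]
        print("patched" if patched else "unpatched", results, dev.reset_attempts)
```

In this model the unpatched device attempts a reset on every IOCTL because the failing test keeps setting the flag again, while the patched device fails the test once at bringup and then serves IOCTLs normally.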
Comment 10 Ortwin Glück 2014-03-25 19:42:42 UTC
I confirm that the patch fixes the screen output (3D). The GPU lockup is still present in dmesg, as expected. Bisecting now and will open a new bug report for it.
Comment 11 Christian König 2014-03-26 13:53:30 UTC
Perfect, thanks for the help.

The patch is on its way upstream, so any objections to closing this bug then?
Comment 12 Ortwin Glück 2014-03-26 14:10:30 UTC
OK to close.

The other problem has resolved itself, by the way. For convenience I had always booted these kernels via kexec, which was the reason for the GPU lockup. After a normal warm boot the problem went away.
Comment 13 Christian König 2014-03-26 14:13:32 UTC
Ah, OK. That makes sense, because kexec and UVD are known not to work together by design.

Closing this.
Comment 14 Shawn Starr 2014-04-03 16:31:51 UTC
Does this fix the issues so that X is able to resume when the GPU locks up? When X resumes for me, I cannot keep using the display server; most of the time the GPU just wedges and then I need to do a reset.
