Bug 66425

Summary:

"failed testing IB on ring 5" when suspending to disk

Product:

DRI

Reporter:

Austin Lund <austin.lund>

Component:

DRM/Radeon

Assignee:

Default DRI bug account <dri-devel>

Status:

CLOSED FIXED

QA Contact:

Severity:

normal

Priority:

high

CC:

bruce, h.judt

Version:

DRI git

Hardware:

x86 (IA32)

OS:

Linux (All)

Whiteboard:

i915 platform:

i915 features:

Attachments:

Description	Flags
Full log from suspend test	none
netconsole.log	none
dmesg.out	none
dmesg-suspend-resume.out	none
dmesg-after-vanilla-kernel-hibernate.out	none
Debugging patch	none
dmesg-hibernate.out	none
Possible fix.	none
dmesg 3.11rc5	none

Description Austin Lund 2013-07-01 01:19:05 UTC

Created attachment 81770 [details]
Full log from suspend test

With kernel 3.10, suspend to disk seems to cause a problem with my GPU.  I did this to test suspend:

echo devices > /sys/power/pm_test
echo disk > /sys/power/state

It takes quite a while to return to the console and the system becomes unstable.  Strangely suspend to ram doesn't seem to have any problems.

The relevant log lines seem to be:

PM: Allocated 2886788 kbytes in 0.39 seconds (7402.02 MB/s)
Freezing remaining freezable tasks ... (elapsed 0.01 seconds) done.
Suspending console(s) (use no_console_suspend to debug)
apple-gmux 00:07: System wakeup disabled by ACPI
radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000174118 and cpu addr 0xffffc9001d233118
PM: freeze of devices complete after 292.114 msecs
hibernation debug: Waiting for 5 seconds.
[drm] Wrong MCH_SSKPD value: 0x16040307
[drm] This can cause pipe underruns and display issues.
[drm] Please upgrade your BIOS to fix this.
[drm] PCIE gen 2 link speeds already enabled
[drm] PCIE GART of 512M enabled (table at 0x0000000000142000).
radeon 0000:01:00.0: WB enabled
radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000040000c00 and cpu addr 0xffff88025f328c00
radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000040000c0c and cpu addr 0xffff88025f328c0c
radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000a9d118 and cpu addr 0xffffc9001e3b2118
[drm] ring test on 0 succeeded in 2 usecs
[drm] ring test on 3 succeeded in 1 usecs
[drm] ring test on 5 succeeded in 1 usecs
[drm] UVD initialized successfully.
[drm] ib test on ring 0 succeeded in 0 usecs
[drm] ib test on ring 3 succeeded in 1 usecs
radeon 0000:01:00.0: GPU lockup CP stall for more than 10000msec
radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000000004 last fence id 0x0000000000000002)
[drm:r600_uvd_ib_test] *ERROR* radeon: fence wait failed (-35).
[drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on ring 5 (-35).

Full log attached.
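
For anyone digging through a long dmesg for the same failure, here is a minimal sketch that pulls out the ring number and error code. It assumes the message format shown in the excerpt above; the function name `find_ib_failures` is made up for illustration and is not part of any tool. On Linux, errno 35 is EDEADLK, which matches the lockup reported just before the failed test.

```python
import re

# Matches lines such as:
#   [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on ring 5 (-35).
# The pattern is an assumption based only on the log excerpt in this report.
IB_FAIL = re.compile(r"failed testing IB on ring (\d+) \((-?\d+)\)")


def find_ib_failures(log_text):
    """Return a list of (ring, error_code) pairs, one per failed IB test."""
    return [(int(m.group(1)), int(m.group(2))) for m in IB_FAIL.finditer(log_text)]
```

Feeding it the saved output of dmesg (for example the contents of the attached log read into a string) gives one pair per failure, so a clean resume yields an empty list.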

$ uname -a
Linux lund-macbookpro 3.10.0+ #14 SMP PREEMPT Mon Jul 1 09:12:13 EST 2013 x86_64 GNU/Linux

(+ due to two patches which are unrelated to this driver, but otherwise vanilla 3.10)

$ sudo lspci -v -s 01:00.0
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Whistler [Radeon HD 6630M/6650M/6750M/7670M/7690M] (prog-if 00 [VGA controller])
	Subsystem: Apple Inc. MacBookPro8,2 [Core i7, 15", Late 2011]
	Flags: bus master, fast devsel, latency 0, IRQ 49
	Memory at 90000000 (64-bit, prefetchable) [size=256M]
	Memory at b0800000 (64-bit, non-prefetchable) [size=128K]
	I/O ports at 2000 [size=256]
	Expansion ROM at b0820000 [disabled] [size=128K]
	Capabilities: [50] Power Management version 3
	Capabilities: [58] Express Legacy Endpoint, MSI 00
	Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [150] Advanced Error Reporting
	Kernel driver in use: radeon
	Kernel modules: radeon

Comment 1 Alex Deucher 2013-07-01 13:24:08 UTC

This is a mac?

Comment 2 Austin Lund 2013-07-01 21:35:32 UTC

(In reply to comment #1)
> This is a mac?

Yes.  Macbookpro8,2

Also, this doesn't happen with a 3.9.7 kernel.

It seems to be related to the UVD stuff that was added to 3.10.  Ring 5 appears to be related to this and it doesn't appear in the 3.9 kernels.

Comment 3 Christian König 2013-07-02 09:00:39 UTC

(In reply to comment #2)
> (In reply to comment #1)
> > This is a mac?
> 
> Yes.  Macbookpro8,2

What else should it be? I'm really wondering if we shouldn't just disable UVD on Macs (with an option to override it of course).

> 
> Also, this doesn't happen with a 3.9.7 kernel.
> 
> It seems to be related to the UVD stuff that was added to 3.10.  Ring 5
> appears to be related to this and it doesn't appear in the 3.9 kernels.

Yes, indeed UVD is ring 5 and that is not present in older kernels. Are you sure that the system instability is related to this? Cause except for non working UVD it shouldn't affect the driver at all.

Comment 4 Austin Lund 2013-07-02 11:23:51 UTC

(In reply to comment #3)
> 
> Yes, indeed UVD is ring 5 and that is not present in older kernels. Are you
> sure that the system instability is related to this? Cause except for non
> working UVD it shouldn't affect the driver at all.

Cannot say about the instability.  Maybe it's not related but hard to debug as the system just stalls soon after the screen gets back (after the fence timeout) and needs a reset and the logs are gone.

Comment 5 Austin Lund 2013-07-03 04:56:01 UTC

Pretty sure the instability has nothing to do with this.

So I guess this bug is about the failing IB test and long delay to resume the display.

As far as I can tell, this would happen whenever the system suspend fails after deactivating the drivers and the PM system restarts everything when the system hasn't actually suspended.  The "pm_test" file just seems to cause an error return value at an appropriate point in the suspend code.  When the system actually sleeps the uvd suspend code seems fine, but if it doesn't sleep then there is this delay.

I'm not sure if this would help in making a work-around.
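
In case someone wants to repeat the test, the two echo commands from the description can be wrapped so that pm_test is always switched back to "none" afterwards. This is only a sketch: the helper name `run_pm_test` and the `power_dir` parameter are hypothetical (the parameter exists so the function can be tried against a scratch directory), and on a real system it needs root and will actually start the suspend sequence.

```python
from pathlib import Path

# Test levels accepted by /sys/power/pm_test per the kernel's PM debugging docs.
PM_TEST_LEVELS = ("none", "core", "processors", "platform", "devices", "freezer")


def run_pm_test(level="devices", state="disk", power_dir="/sys/power"):
    """Select a pm_test level, trigger the given sleep state, then reset pm_test."""
    if level not in PM_TEST_LEVELS:
        raise ValueError(f"unknown pm_test level: {level!r}")
    power = Path(power_dir)
    # Equivalent of: echo devices > /sys/power/pm_test
    (power / "pm_test").write_text(level)
    try:
        # Equivalent of: echo disk > /sys/power/state
        # On a real system this write does not return until the test cycle is over.
        (power / "state").write_text(state)
    finally:
        # Leave test mode so a later suspend behaves normally.
        (power / "pm_test").write_text("none")
```

With level "devices" the drivers are frozen and then restarted without the machine ever powering down, which is exactly the path that triggers the delay here.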

Comment 6 Christian König 2013-07-03 13:55:01 UTC

(In reply to comment #5)
> Pretty sure the instability has nothing to do with this.

Good to know, well at least this bug loses a bit priority then.

> As far as I can tell, this would happen whenever the system suspend fails
> after deactivating the drivers and the PM system restarts everything when
> the system hasn't actually suspended.  The "pm_test" file just seems to
> cause an error return value at an appropriate point in the suspend code.
> When the system actually sleeps the uvd suspend code seems fine, but if it
> doesn't sleep then there is this delay.

Oh! Do I get this right that it only happens when you try to suspend the system but then doesn't really do the power cycle (for whatever reason)?

Well that would explain it, cause this case isn't really supported by the hardware. A complete manual reset of the UVD block (without an external power cycle) is somewhere between very very tricky and impossible.

> I'm not sure if this would help in making a work-around.

At least it explains the behavior. We could try to get it working by playing around with the different soft reset methods, but I have my doubts that this will ever work correctly.

Comment 7 Harald Judt 2013-07-08 14:46:20 UTC

Created attachment 82190 [details]
netconsole.log

This is definitely not Mac-only. Behold dmesg (console.log) on Cayman HD6950. This is on a standard PC and the problem occurs on resume from hibernation. Resuming takes ages compared to 3.8.13 which is without UVD, one could almost think it failed, but then the screen comes online again, and... the computer fails and tries miserably to restore functionality.

Comment 8 Harald Judt 2013-07-08 14:59:04 UTC

> Oh! Do I get this right that it only happens when you try to suspend
> the system but then doesn't really do the power cycle (for whatever
> reason)?

Note that at least in my case, the system does the power cycle, because I hibernated/resumed it for sure. I also don't get that "fix your BIOS" error message. Maybe this is unrelated, should I report a separate bug?

BTW: 3.8.13 with the unofficial UVD patch showed similar issues, though I didn't test that very thoroughly.

Comment 9 Christian König 2013-07-08 16:46:27 UTC

Can you attach the full dmesg output and not only the messages related to suspend/resume?

Comment 10 Harald Judt 2013-07-08 17:32:18 UTC

Created attachment 82194 [details]
dmesg.out

Here is dmesg output captured after a clean boot.

Comment 11 Harald Judt 2013-07-08 18:11:34 UTC

Created attachment 82196 [details]
dmesg-suspend-resume.out

Strangely, it only happens on hibernate-resume, not on suspend-resume (attachment shows clean suspend-resume cycle).

Since I'm using the tuxonice patch, I'll retry with really clean vanilla. Although I have tried with the patch applied but without tuxonice enabled and though the only thing that is different is that it freezes more kernel threads than in-kernel suspend, there is a chance that there's something wrong with it. So just to be sure...

Comment 12 Harald Judt 2013-07-08 18:41:04 UTC

Created attachment 82198 [details]
dmesg-after-vanilla-kernel-hibernate.out

A bit different with in-kernel hibernate (vanilla kernel, current git), using the same config.

Hibernate worked the first time. Again, it took a long time hanging at a blank screen and the computer did not shut off. Yet, after a hard reset, it resumed. There were no messages in dmesg about ring 5 this time (see attachment). However, the second hibernation attempt failed. It hung there immediately with a blank screen, and no image was written. This was with a 3.11-pre-rc1 kernel, maybe I'd better retest with the 3.10 release...

Comment 13 Harald Judt 2013-07-08 18:49:38 UTC

It is the same with 3.10 vanilla release as described in comment #12. It takes a long time at the start of suspend and resume hanging at a blank screen, image is written the first time but computer doesn't turn off. I did not try a second hibernation/resume cycle, it is clear that something goes wrong here.

Comment 14 Austin Lund 2013-07-08 22:23:04 UTC

I don't get how this could be found in 3.8 when the patch for the uvd functions (according to my git log history) was added during 3.9-rc6 -> 3.9-rc7.

I'm unable to actually hibernate my machine fully due to some other bug, which I haven't tracked down yet (hence why I am using pm_test).

Comment 15 Harald Judt 2013-07-09 07:55:40 UTC

(In reply to comment #14)
> I don't get how this could be found in 3.8 when the patch for the uvd
> functions (according to my git log history) was added during 3.9-rc6 ->
> 3.9-rc7.

Read comment #8. Unofficial patch for backporting to 3.8. Not supported, and maybe it does not include all bug fixes that went into git since then. It's just an observation that I've made. You can grab them here: http://chithanh.blogspot.co.at/2013/04/new-mesa-features-for-adventurous.html

> I'm unable to actually hibernate my machine fully due to some other bug,
> which I haven't tracked down yet (hence why I am using pm_test).

Yes, maybe the problem is only with tuxonice then because the ring 5 messages did not occur with vanilla. That may be coincidence however and it would need more testing to be sure about it. I'll see when I can get to it. Although I deem it necessary to note that I have not seen any reports about problems with tuxonice and 3.10 yet. What's more, there is still at least one problem with hibernating with the latest vanilla kernel.

Comment 16 Christian König 2013-07-09 11:38:07 UTC

Created attachment 82226 [details] [review]
Debugging patch

It's just a temporary hack, but please test the attached patch if it changes anything.

Thanks in advance,
Christian.

Comment 17 Austin Lund 2013-07-10 01:29:13 UTC

(In reply to comment #16)
> Created attachment 82226 [details] [review]
> Debugging patch
> 
> It's just a temporary hack, but please test the attached patch if it changes
> anything.

Tried this with "echo devices > /sys/power/pm_test". Makes things hard lock up and all fans go to full power.  I have to force a shutdown and reboot, then nothing in the logs. :(

Comment 18 Harald Judt 2013-07-10 22:23:57 UTC

Created attachment 82304 [details]
dmesg-hibernate.out

Ok, this is not a problem with tuxonice. I tried hibernating with current vanilla git again, and indeed the error message occurred when resuming from disk:

[  168.118207] radeon 0000:01:00.0: GPU lockup CP stall for more than 10000msec
[  168.118208] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000000006 last fence id 0x0000000000000004)
[  168.118209] [drm:r600_uvd_ib_test] *ERROR* radeon: fence wait failed (-35).
[  168.118212] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on ring 5 (-35).

Full dmesg attached. The computer resumed but would crash soon afterwards. Crashes persisted until reboot.

The long delay when suspending was caused by the serial console I had attached for debugging purposes; it really took a long time to resume but worked eventually.

Comment 19 Harald Judt 2013-07-10 22:37:19 UTC

I guess your debugging patch is only for rv770 and not for cayman? I applied it nevertheless, and it had no effect. The machine hibernated fine the first time, but then the error message occurred the second time. The error message was exactly the same as before.

radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000000006 last fence id 0x0000000000000004)

Comment 20 Christian König 2013-07-11 09:33:54 UTC

Created attachment 82325 [details] [review]
Possible fix.

I was able to reproduce the problem, and this patch (only a slightly modified version of the old one) seems to fix it for me.

Please retest and provide new dmesg logs (as far as that is possible).

Also please try it a couple of times, cause at least on my test system suspend/resume on 3.10 seems to be a bit unstable (even without the radeon driver).

Comment 21 Austin Lund 2013-07-11 09:51:24 UTC

(In reply to comment #20)
> Created attachment 82325 [details] [review]
> Possible fix.
> 
> I was able to reproduce the problem, and this patch (only a slightly
> modified version of the old one) seems to fix it for me.
> 
> Please retest and provide new dmesg logs (as far as that is possible).
> 
> Also please try it a couple of times, cause at least on my test system
> suspend/resume on 3.10 seems to be a bit unstable (even without the radeon
> driver).

I got this compile warning: 

/home/lund/src/linux/drivers/gpu/drm/radeon/radeon_uvd.c: In function ‘radeon_uvd_fini’:
/home/lund/src/linux/drivers/gpu/drm/radeon/radeon_uvd.c:170:3: warning: ‘return’ with a value, in function returning void [enabled by default]
   return 0;
   ^

Haven't had a chance to test just yet.  Will report back as soon as possible.

Comment 22 Christian König 2013-07-11 12:59:33 UTC

(In reply to comment #21)
> I got this compile warning: 
> 
> /home/lund/src/linux/drivers/gpu/drm/radeon/radeon_uvd.c: In function
> ‘radeon_uvd_fini’:
> /home/lund/src/linux/drivers/gpu/drm/radeon/radeon_uvd.c:170:3: warning:
> ‘return’ with a value, in function returning void [enabled by default]
>    return 0;
>    ^

Just a stupid typo, going to fix that before I send it out to the list.

> Haven't had a chance to test just yet.  Will report back as soon as possible.

That would be great, cause it's actually a quite serious bug.

I'm currently also looking into the other stability issues with 3.10, but can't (yet) say if it's radeon's fault or not.

Comment 23 Harald Judt 2013-07-11 18:58:47 UTC

Thanks, I confirm that the patch fixes the problem!

I've tested this at least 5 times with both the vanilla and the tuxonice hibernation, and both now work pretty stable with 3.10. (As a side note: The BFQ IO scheduler patch makes my system hang when suspending, but that is a different issue and really not a concern for this bug report.)

Now I'm still plagued by bug #44772, which is similar in that it only happens when resuming from hibernation, not when suspending, and it seems to occur much more often with 3.10 with pm_async=0 than before.

As far as my machine is concerned, I consider this solved and 3.10 has become usable for me. Thanks!

Comment 24 Austin Lund 2013-07-14 08:23:15 UTC

Patch tested and works on my machine.  I now have problems for "processors" when doing pm_test, so I still cannot actually test this on a full resume from disk, but at least pm_test with "devices" works.

Comment 25 Christian König 2013-07-14 08:37:31 UTC

Sounds like we can mark this as resolved now.

Comment 26 wuruxu 2013-08-12 22:17:46 UTC

Hi, I found that after enabling radeon.dpm, the message [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on ring 5 (-35). always shows up after suspending the system to RAM. About a minute later, the X server crashes. The attachment is the output of dmesg.

Comment 27 wuruxu 2013-08-12 22:18:59 UTC

Hi, I found that after enabling radeon.dpm, the message [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on ring 5 (-35). always shows up after suspending the system to RAM. About a minute later, the X server crashes. The attachment is the output of dmesg. I tested with a Radeon HD6310, Linux 3.11-rc5, and Mesa 9.2 git.

Comment 28 Alex Deucher 2013-08-12 22:21:35 UTC

(In reply to comment #27)
> Hi I found that after enable radeon.dpm, this message
> [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on ring 5
> (-35). always show after suspend system to RAM. wait a minute, the X Server
> crash.  the attachment is output of dmesg. I test with radeon HD6310, linux
> 3.11 rc5, mesa9.2 git,

If you are having problems with dpm enabled, please open a new bug as it may be a different issue.

Comment 29 Alex Deucher 2013-08-12 22:23:56 UTC

(In reply to comment #28)
> If you are having problems with dpm enabled, please open a new bug as it may
> be a different issue.


Also check to see if you can reproduce the problem with dpm disabled.
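
A quick way to see what was requested at boot is to look for radeon.dpm on the kernel command line. The sketch below only parses a command-line string; the function name `radeon_dpm_setting` is hypothetical, keeping the last occurrence is an assumption, and a value set through modprobe options instead of the boot line would not show up here.

```python
def radeon_dpm_setting(cmdline):
    """Return the value given to radeon.dpm on a kernel command line, or None."""
    value = None
    for token in cmdline.split():
        if token.startswith("radeon.dpm="):
            # Assumption: if the option is repeated, keep the last occurrence.
            value = token.split("=", 1)[1]
    return value
```

On a running system the string would come from /proc/cmdline. A result of None means the option was not passed on the boot line at all.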

Comment 30 wuruxu 2013-08-12 22:50:47 UTC

Created attachment 83989 [details]
dmesg 3.11rc5

[  129.095684] [drm:r600_uvd_ib_test] *ERROR* radeon: fence wait failed (-35).
[  129.115566] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on ring 5 (-35).

Comment 31 wuruxu 2013-08-13 13:16:18 UTC

Hi Alex Deucher 

  After disabling dpm, there is no such error message in dmesg, so that bug should be fixed.
But with dpm enabled, resume can't work correctly.

Comment 32 Alex Deucher 2013-08-13 13:21:39 UTC

(In reply to comment #31)
> Hi Alex Deucher 
> 
>   after disable dpm, no such error message in dmesg, that bug should be
> fixed.
> but with dpm is enabled, resume can't work correctly.

Please file a new bug for that.

Comment 33 Christian König 2013-08-13 13:28:43 UTC

(In reply to comment #32)
> (In reply to comment #31)
> > Hi Alex Deucher 
> > 
> >   after disable dpm, no such error message in dmesg, that bug should be
> > fixed.
> > but with dpm is enabled, resume can't work correctly.
> 
> Please file a new bug for that.

Totally agree on that, UVD/DPM interaction seems to be more tricky than we thought.

So let's close this bug and please open up a new one.
