Summary: | "failed testing IB on ring 5" when suspending to disk | | |
---|---|---|---|
Product: | DRI | Reporter: | Austin Lund <austin.lund> |
Component: | DRM/Radeon | Assignee: | Default DRI bug account <dri-devel> |
Status: | CLOSED FIXED | QA Contact: | |
Severity: | normal | | |
Priority: | high | CC: | bruce, h.judt |
Version: | DRI git | | |
Hardware: | x86 (IA32) | | |
OS: | Linux (All) | | |
Whiteboard: | | | |
i915 platform: | | i915 features: | |
This is a mac?

(In reply to comment #1)
> This is a mac?

Yes. Macbookpro8,2

Also, this doesn't happen with a 3.9.7 kernel.

It seems to be related to the UVD stuff that was added to 3.10. Ring 5 appears to be related to this and it doesn't appear in the 3.9 kernels.

(In reply to comment #2)
> (In reply to comment #1)
> > This is a mac?
>
> Yes. Macbookpro8,2

What else should it be? I'm really wondering if we shouldn't just disable UVD on Macs (with an option to override it of course).

> Also, this doesn't happen with a 3.9.7 kernel.
>
> It seems to be related to the UVD stuff that was added to 3.10. Ring 5
> appears to be related to this and it doesn't appear in the 3.9 kernels.

Yes, indeed UVD is ring 5 and that is not present in older kernels. Are you sure that the system instability is related to this? Cause except for non working UVD it shouldn't affect the driver at all.

(In reply to comment #3)
> Yes, indeed UVD is ring 5 and that is not present in older kernels. Are you
> sure that the system instability is related to this? Cause except for non
> working UVD it shouldn't affect the driver at all.

Cannot say about the instability. Maybe it's not related but hard to debug as the system just stalls soon after the screen gets back (after the fence timeout) and needs a reset and the logs are gone.

Pretty sure the instability has nothing to do with this. So I guess this bug is about the failing IB test and long delay to resume the display.

As far as I can tell, this would happen whenever the system suspend fails after deactivating the drivers and the PM system restarts everything when the system hasn't actually suspended. The "pm_test" file just seems to cause an error return value at an appropriate point in the suspend code. When the system actually sleeps the uvd suspend code seems fine, but if it doesn't sleep then there is this delay.

I'm not sure if this would help in making a work-around.
(In reply to comment #5)
> Pretty sure the instability has nothing to do with this.

Good to know, well at least this bug loses a bit of priority then.

> As far as I can tell, this would happen whenever the system suspend fails
> after deactivating the drivers and the PM system restarts everything when
> the system hasn't actually suspended. The "pm_test" file just seems to
> cause an error return value at an appropriate point in the suspend code.
> When the system actually sleeps the uvd suspend code seems fine, but if it
> doesn't sleep then there is this delay.

Oh! Do I get this right that it only happens when you try to suspend the system but then doesn't really do the power cycle (for whatever reason)? Well that would explain it, because this case isn't really supported by the hardware. A complete manual reset of the UVD block (without an external power cycle) is somewhere between very very tricky and impossible.

> I'm not sure if this would help in making a work-around.

At least it explains the behavior. We could try to get it working by playing around with the different soft reset methods, but I have my doubts that this will ever work correctly.

Created attachment 82190 [details]
netconsole.log
This is definitely not Mac-only. Behold dmesg (console.log) on Cayman HD6950. This is on a standard PC and the problem occurs on resume from hibernation. Resuming takes ages compared to 3.8.13 which is without UVD, one could almost think it failed, but then the screen comes online again, and... the computer fails and tries miserably to restore functionality.
> Oh! Do I get this right that it only happens when you try to suspend
> the system but then doesn't really do the power cycle (for whatever
> reason)?
Note that at least in my case, the system does the power cycle, because I hibernated/resumed it for sure. I also don't get that "fix your BIOS" error message. Maybe this is unrelated, should I report a separate bug?
BTW: 3.8.13 with the unofficial UVD patch showed similar issues, though I didn't test that very thoroughly.
Can you attach the full dmesg output and not only the messages related to suspend/resume?

Created attachment 82194 [details]
dmesg.out
Here is dmesg output captured after a clean boot.
Created attachment 82196 [details]
dmesg-suspend-resume.out
Strangely, it only happens on hibernate-resume, not on suspend-resume (attachment shows clean suspend-resume cycle).
Since I'm using the tuxonice patch, I'll retry with a really clean vanilla kernel. Although I have tried with the patch applied but without tuxonice enabled, and although the only difference is that it freezes more kernel threads than in-kernel suspend, there is a chance that something is wrong with it. So just to be sure...
Created attachment 82198 [details]
dmesg-after-vanilla-kernel-hibernate.out
A bit different with in-kernel hibernate (vanilla kernel, current git), using the same config.
Hibernate worked the first time. Again, it took a long time hanging at a blank screen and the computer did not shut off. Yet, after a hard reset, it resumed. There were no messages in dmesg about ring 5 this time (see attachment). However, the second hibernation attempt failed. It hung there immediately with a blank screen, and no image was written. This was with a 3.11-pre-rc1 kernel, maybe I'd better retest with the 3.10 release...
It is the same with the 3.10 vanilla release as described in comment #12. It takes a long time at the start of suspend and resume hanging at a blank screen, the image is written the first time but the computer doesn't turn off. I did not try a second hibernation/resume cycle, it is clear that something goes wrong here.

I don't get how this could be found in 3.8 when the patch for the uvd functions (according to my git log history) was added during 3.9-rc6 -> 3.9-rc7.

I'm unable to actually hibernate my machine fully due to some other bug, which I haven't tracked down yet (hence why I am using pm_test).

(In reply to comment #14)
> I don't get how this could be found in 3.8 when the patch for the uvd
> functions (according to my git log history) was added during 3.9-rc6 ->
> 3.9-rc7.

Read comment #8. Unofficial patch for backporting to 3.8. Not supported, and maybe it does not include all bug fixes that went into git since then. It's just an observation that I've made. You can grab them here: http://chithanh.blogspot.co.at/2013/04/new-mesa-features-for-adventurous.html

> I'm unable to actually hibernate my machine fully due to some other bug,
> which I haven't tracked down yet (hence why I am using pm_test).

Yes, maybe the problem is only with tuxonice then because the ring 5 messages did not occur with vanilla. That may be coincidence however and it would need more testing to be sure about it. I'll see when I can get to it. Although I deem it necessary to note that I have not seen any reports about problems with tuxonice and 3.10 yet. What's more, there is still at least one problem with hibernating with the latest vanilla kernel.

Created attachment 82226 [details] [review]
Debugging patch

It's just a temporary hack, but please test the attached patch if it changes anything.

Thanks in advance,
Christian.
(In reply to comment #16)
> Created attachment 82226 [details] [review]
> Debugging patch
>
> It's just a temporary hack, but please test the attached patch if it changes
> anything.

Tried this with "echo devices > /sys/power/pm_test". Makes things hard lock up and all fans go to full power. I have to force a shutdown and reboot, then nothing in the logs. :(

Created attachment 82304 [details]
dmesg-hibernate.out
Ok, this is not a problem with tuxonice. I tried hibernating with current vanilla git again, and indeed the error message occurred when resuming from disk:
[ 168.118207] radeon 0000:01:00.0: GPU lockup CP stall for more than 10000msec
[ 168.118208] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000000006 last fence id 0x0000000000000004)
[ 168.118209] [drm:r600_uvd_ib_test] *ERROR* radeon: fence wait failed (-35).
[ 168.118212] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on ring 5 (-35).
Full dmesg attached. The computer resumed but would crash soon afterwards. Crashes persisted until reboot.
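The failure signature quoted above can be picked out of a saved dmesg file mechanically, which helps when comparing several hibernate cycles. The following is only a minimal sketch: `find_ib_failures`, `IbFailure` and `ERRNO_NAMES` are hypothetical names made up for illustration, the regular expressions are written against the exact message wording seen in this thread, and reading -35 as EDEADLK assumes the usual Linux x86 errno numbering.

```python
import re
from typing import Dict, List, NamedTuple, Optional

# Assumption: Linux x86 errno numbering, where 35 is EDEADLK.
# Hard-coded on purpose so the result does not depend on the host OS.
ERRNO_NAMES: Dict[int, str] = {35: "EDEADLK"}

# Patterns match the message wording quoted in this bug report only.
IB_FAIL_RE = re.compile(
    r"\[drm:radeon_ib_ring_tests\] \*ERROR\* radeon: "
    r"failed testing IB on ring (\d+) \((-?\d+)\)"
)
FENCE_RE = re.compile(
    r"GPU lockup \(waiting for (0x[0-9a-fA-F]+) last fence id (0x[0-9a-fA-F]+)\)"
)


class IbFailure(NamedTuple):
    ring: int                 # ring index reported by the driver (5 is UVD here)
    code: int                 # raw error code as logged, e.g. -35
    errno_name: str           # symbolic name, or "unknown"
    fence_gap: Optional[int]  # awaited fence id minus last signalled fence id


def find_ib_failures(dmesg_text: str) -> List[IbFailure]:
    """Collect IB test failures, pairing each with the preceding lockup line."""
    failures: List[IbFailure] = []
    fence_gap: Optional[int] = None
    for line in dmesg_text.splitlines():
        fence = FENCE_RE.search(line)
        if fence:
            waiting, last = (int(value, 16) for value in fence.groups())
            fence_gap = waiting - last
            continue
        fail = IB_FAIL_RE.search(line)
        if fail:
            ring, code = int(fail.group(1)), int(fail.group(2))
            name = ERRNO_NAMES.get(abs(code), "unknown")
            failures.append(IbFailure(ring, code, name, fence_gap))
            fence_gap = None  # a lockup line is only paired with one failure
    return failures


# Log excerpt copied from the comment above.
SAMPLE = """\
[ 168.118207] radeon 0000:01:00.0: GPU lockup CP stall for more than 10000msec
[ 168.118208] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000000006 last fence id 0x0000000000000004)
[ 168.118209] [drm:r600_uvd_ib_test] *ERROR* radeon: fence wait failed (-35).
[ 168.118212] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on ring 5 (-35).
"""

if __name__ == "__main__":
    for failure in find_ib_failures(SAMPLE):
        print(failure)
```

An empty result for a given log means none of these exact messages were present; it does not by itself show that the resume was clean.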
The long delay when suspending was caused by the serial console I had attached for debugging purposes. It really took a long time to resume, but it worked eventually.
I guess your debugging patch is only for rv770 and not for cayman? I applied it nevertheless and it had no effect. The machine hibernated fine the first time, but then the error message occurred the second time. The error message was exactly the same as before.

radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000000006 last fence id 0x0000000000000004)

Created attachment 82325 [details] [review]
Possible fix.

I was able to reproduce the problem, and this patch (only a slightly modified version of the old one) seems to fix it for me.

Please retest and provide new dmesg logs (as far as that is possible).

Also please try it a couple of times, cause at least on my test system suspend/resume on 3.10 seems to be a bit unstable (even without the radeon driver).

(In reply to comment #20)
> Created attachment 82325 [details] [review]
> Possible fix.
>
> I was able to reproduce the problem, and this patch (only a slightly
> modified version of the old one) seems to fix it for me.
>
> Please retest and provide new dmesg logs (as far as that is possible).
>
> Also please try it a couple of times, cause at least on my test system
> suspend/resume on 3.10 seems to be a bit unstable (even without the radeon
> driver).

I got this compile warning:

/home/lund/src/linux/drivers/gpu/drm/radeon/radeon_uvd.c: In function ‘radeon_uvd_fini’:
/home/lund/src/linux/drivers/gpu/drm/radeon/radeon_uvd.c:170:3: warning: ‘return’ with a value, in function returning void [enabled by default]
return 0;
^

Haven't had a chance to test just yet. Will report back as soon as possible.

(In reply to comment #21)
> I got this compile warning:
>
> /home/lund/src/linux/drivers/gpu/drm/radeon/radeon_uvd.c: In function
> ‘radeon_uvd_fini’:
> /home/lund/src/linux/drivers/gpu/drm/radeon/radeon_uvd.c:170:3: warning:
> ‘return’ with a value, in function returning void [enabled by default]
> return 0;
> ^

Just a stupid typo, going to fix that before I send it out to the list.
> Haven't had a chance to test just yet. Will report back as soon as possible.

That would be great, because it's actually quite a serious bug. I'm currently also looking into the other stability issues with 3.10, but can't (yet) say if it's radeon's fault or not.

Thanks, I confirm that the patch fixes the problem! I've tested this at least 5 times with both the vanilla and the tuxonice hibernation, and both now work pretty stably with 3.10. (As a side note: The BFQ IO scheduler patch makes my system hang when suspending, but that is a different issue and really not a concern for this bug report.)

Now I'm still plagued by bug #44772, which is similar in that it only happens when resuming from hibernation, not when suspending, and it seems to occur much more often with 3.10 with pm_async=0 than before.

As far as my machine is concerned, I consider this solved and 3.10 has become usable for me. Thanks!

Patch tested and works on my machine. I now have problems with "processors" when doing pm_test, so I still cannot actually test this on a full resume from disk, but at least pm_test with "devices" works.

Sounds like we can mark this as resolved now.

Hi, I found that after enabling radeon.dpm, this message always shows up after suspending the system to RAM:

[drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on ring 5 (-35).

After waiting a minute, the X server crashes. The attachment is the output of dmesg. I tested with a Radeon HD6310, Linux 3.11-rc5, Mesa 9.2 git.

(In reply to comment #27)
> Hi I found that after enable radeon.dpm, this message
> [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on ring 5
> (-35). always show after suspend system to RAM. wait a minute, the X Server
> crash. the attachment is output of dmesg. I test with radeon HD6310, linux
> 3.11 rc5, mesa9.2 git,

If you are having problems with dpm enabled, please open a new bug as it may be a different issue.

(In reply to comment #28)
> If you are having problems with dpm enabled, please open a new bug as it may
> be a different issue.

Also check to see if you can reproduce the problem with dpm disabled.

Created attachment 83989 [details]
dmesg 3.11rc5
[ 129.095684] [drm:r600_uvd_ib_test] *ERROR* radeon: fence wait failed (-35).
[ 129.115566] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on ring 5 (-35).
Hi Alex Deucher

After disabling dpm, there is no such error message in dmesg, so that bug should be fixed. But with dpm enabled, resume can't work correctly.

(In reply to comment #31)
> Hi Alex Deucher
>
> after disable dpm, no such error message in dmesg, that bug should be
> fixed.
> but with dpm is enabled, resume cann't work correctly.

Please file a new bug for that.

(In reply to comment #32)
> (In reply to comment #31)
> > Hi Alex Deucher
> >
> > after disable dpm, no such error message in dmesg, that bug should be
> > fixed.
> > but with dpm is enabled, resume cann't work correctly.
>
> Please file a new bug for that.

Totally agree on that, UVD/DPM interaction seems to be more tricky than we thought. So let's close this bug and please open up a new one.
Created attachment 81770 [details]
Full log from suspend test

With kernel 3.10, suspend to disk seems to cause a problem with my GPU. I did this to test suspend:

echo devices > /sys/power/pm_test
echo disk > /sys/power/state

It takes quite a while to return to the console and the system becomes unstable. Strangely, suspend to RAM doesn't seem to have any problems.

The relevant log lines seem to be:

PM: Allocated 2886788 kbytes in 0.39 seconds (7402.02 MB/s)
Freezing remaining freezable tasks ... (elapsed 0.01 seconds) done.
Suspending console(s) (use no_console_suspend to debug)
apple-gmux 00:07: System wakeup disabled by ACPI
radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000174118 and cpu addr 0xffffc9001d233118
PM: freeze of devices complete after 292.114 msecs
hibernation debug: Waiting for 5 seconds.
[drm] Wrong MCH_SSKPD value: 0x16040307
[drm] This can cause pipe underruns and display issues.
[drm] Please upgrade your BIOS to fix this.
[drm] PCIE gen 2 link speeds already enabled
[drm] PCIE GART of 512M enabled (table at 0x0000000000142000).
radeon 0000:01:00.0: WB enabled
radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000040000c00 and cpu addr 0xffff88025f328c00
radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000040000c0c and cpu addr 0xffff88025f328c0c
radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000a9d118 and cpu addr 0xffffc9001e3b2118
[drm] ring test on 0 succeeded in 2 usecs
[drm] ring test on 3 succeeded in 1 usecs
[drm] ring test on 5 succeeded in 1 usecs
[drm] UVD initialized successfully.
[drm] ib test on ring 0 succeeded in 0 usecs
[drm] ib test on ring 3 succeeded in 1 usecs
radeon 0000:01:00.0: GPU lockup CP stall for more than 10000msec
radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000000004 last fence id 0x0000000000000002)
[drm:r600_uvd_ib_test] *ERROR* radeon: fence wait failed (-35).
[drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on ring 5 (-35).
Full log attached.

$ uname -a
Linux lund-macbookpro 3.10.0+ #14 SMP PREEMPT Mon Jul 1 09:12:13 EST 2013 x86_64 GNU/Linux

(+ due to two patches which are unrelated to this driver, but otherwise vanilla 3.10)

$ sudo lspci -v -s 01:00.0
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Whistler [Radeon HD 6630M/6650M/6750M/7670M/7690M] (prog-if 00 [VGA controller])
Subsystem: Apple Inc. MacBookPro8,2 [Core i7, 15", Late 2011]
Flags: bus master, fast devsel, latency 0, IRQ 49
Memory at 90000000 (64-bit, prefetchable) [size=256M]
Memory at b0800000 (64-bit, non-prefetchable) [size=128K]
I/O ports at 2000 [size=256]
Expansion ROM at b0820000 [disabled] [size=128K]
Capabilities: [50] Power Management version 3
Capabilities: [58] Express Legacy Endpoint, MSI 00
Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
Capabilities: [150] Advanced Error Reporting
Kernel driver in use: radeon
Kernel modules: radeon
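For anyone who wants to repeat the pm_test procedure from the report several times, the two `echo` commands can be wrapped in a small script. This is a rough sketch rather than a recommended tool: `run_pm_test` and its `power_dir` parameter are hypothetical names introduced here, the list of accepted levels is taken from the kernel's PM debugging documentation as I understand it, and resetting `pm_test` to `none` afterwards is an addition that the original report does not do. On a real system it must run as root, and `/sys/power/pm_test` is only present on kernels built with PM debugging support.

```python
from pathlib import Path

# pm_test levels as described in the kernel's PM debugging documentation
# (assumption: the set has not changed for the kernel under test).
PM_TEST_LEVELS = ("none", "core", "processors", "platform", "devices", "freezer")


def run_pm_test(level: str = "devices", state: str = "disk",
                power_dir: Path = Path("/sys/power")) -> None:
    """Run one test cycle: select a pm_test level, then request a sleep state.

    Equivalent to:
        echo <level> > /sys/power/pm_test
        echo <state> > /sys/power/state
    power_dir is a parameter only so the function can be tried on a scratch
    directory instead of the real sysfs tree.
    """
    if level not in PM_TEST_LEVELS:
        raise ValueError(f"unknown pm_test level: {level!r}")
    pm_test = power_dir / "pm_test"
    pm_test.write_text(level + "\n")
    try:
        # On real sysfs this write blocks until the test cycle has finished.
        (power_dir / "state").write_text(state + "\n")
    finally:
        # Not part of the original report: restore normal behaviour so a
        # later real suspend is not silently turned into a test run.
        pm_test.write_text("none\n")


if __name__ == "__main__":
    run_pm_test("devices", "disk")
```

With the `devices` level the kernel freezes tasks, suspends devices, waits a few seconds and resumes them without powering off, which is the path that triggered the ring 5 IB test failure in the log above.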