Summary: | Radeon HD6950: UVD not responding, trying to reset the VCPU | |
---|---|---|---
Product: | DRI | Reporter: | Harald Judt <h.judt>
Component: | DRM/Radeon | Assignee: | Default DRI bug account <dri-devel>
Status: | CLOSED FIXED | QA Contact: |
Severity: | normal | |
Priority: | medium | CC: | bastian.triller, egorov_egor, vmerlet
Version: | XOrg git | |
Hardware: | x86-64 (AMD64) | |
OS: | Linux (All) | |
Description
Harald Judt
2013-07-22 19:18:54 UTC
Well, that's interesting. According to your logs it worked the first three times, then it failed twelve times, and then the VCPU boots again but the IB test fails... Did you try playing any video in between?
That is likely, but I'm not sure about the influence of this and will investigate whether there is any connection between using vdpau/uvd and suspend/resume.
Ok, good news first: It doesn't seem to have anything to do with playing videos using UVD. In fact, playing videos works fine and doesn't have any bad effect on suspending or hibernating.
But there are worse troubles now. The reason the error messages started to appear was that I set /sys/power/pm_async to 1 (to make it faster), albeit for suspend/resume only. This explains why they appeared only for some cycles. I've always set pm_async to 0 because of my hibernate/resume problems described in bug #44772. Before the introduction of UVD support, suspend/resume worked reliably with pm_async=1. Now that's not the case anymore and pm_async=0 has become mandatory.
But what's worse, even with pm_async set to 0, suspend/resume doesn't work reliably anymore. Sometimes the monitor simply turns off when suspending, and the system hangs. Previously, that was only the case with hibernate/resume at resume time.
Now I wonder... Maybe the card simply takes more time to initialize? At least the error message states UVD failed to initialize... Is there some way to add a delay at resuming/initializing? Maybe this would help.
There is some inconsistency in my testing. As one can deduce from dmesg.out, I was using tuxonice hybrid suspend/resume (note the alternating S4/S3 lines) and that might have a bad influence on in-kernel suspend. Tuxonice or in-kernel hibernation might be broken; I have seen other issues. I will test this again properly this week or next, using a vanilla kernel, to remove all uncertainties and try to narrow it down a bit more.
Created attachment 83279: dmesg.out
Ok, I gave it a try with new vanilla 3.11-rc3. It ends with the following error message, though I'm not sure it isn't simply a follow-up problem caused by previous errors:
[ 2203.028013] [drm:r600_ring_test] *ERROR* radeon: ring 0 test failed (scratch(0x8504)=0xCAFEDEAD)
As you can deduce from the dmesg.out file attached to the bug report, the first few tries work ok. The problem seems to be more easily reproducible when pm_async is enabled, though it may not be restricted to that. I'm not yet sure it happens with pm_async off, but from experience it very likely does.
What is a bit strange is that the message "entering sleep state S4" never appears in the dmesg.out. This _could_ be because I issued "echo reboot > /sys/power/disk". However, this is about hibernate/resume, as you can see when you look at the pm messages.
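The two sysfs settings discussed here, /sys/power/pm_async and /sys/power/disk, can be checked from a script before each test cycle so that the log records which configuration was active. The following is only a minimal Python sketch: the function names are made up for illustration, and it assumes that pm_async contains "0" or "1" and that the disk file lists the available hibernation modes with the selected one in square brackets.

```python
import re
from pathlib import Path

# Paths of the sysfs files mentioned in this report.
PM_ASYNC = Path("/sys/power/pm_async")
DISK = Path("/sys/power/disk")


def pm_async_enabled(path=PM_ASYNC):
    """Return True if the pm_async file contains "1" (hypothetical helper)."""
    return Path(path).read_text().strip() == "1"


def selected_disk_mode(path=DISK):
    """Return the hibernation mode shown in square brackets, or None.

    Assumes content such as "platform shutdown [reboot]".
    """
    match = re.search(r"\[(\w+)\]", Path(path).read_text())
    return match.group(1) if match else None
```

Both helpers only read the files; changing the values still requires root and a write such as the "echo reboot > /sys/power/disk" mentioned above.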
I believe the main problem occurs already at suspend time, because when it happens, suspend always takes longer than usual, hanging a bit with the monitor turned off. However, there are no error messages to be found in the log at that point, only when resuming. Perhaps I should use no_console_suspend and/or netconsole.
Also note how the UVD initialization fails, then the machine continues booting and later you can hibernate again, which will restore UVD but cause the ring 0 failure with CAFEDEAD. At first, the screen is blank, then shows an animated nostalgic noise picture similar to analog tv that has no reception. The machine is still responsive and one can ssh in.
Is there perhaps an option that enables more debugging messages and could give more insight?
Note that the cayman "suspending"/"suspend complete" and "resuming"/"resume complete" messages are only simple printks I added to the beginning and end of the cayman_suspend/cayman_resume functions.
Further tests show:
* the 0xCAFEDEAD seems to have been a one-time error that usually does not occur
* in case of problems, "UVD not responding" does not always appear and UVD is sometimes initialized successfully, but the ring 5 error usually does appear:
  radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000000008 last fence id 0x0000000000000006)
  [drm:r600_uvd_ib_test] *ERROR* radeon: fence wait failed (-35).
  [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on ring 5 (-35).
* it does not always hang when suspending, yet resume might still fail
* enabling pm_async makes suspend/resume fail sometimes with "uvd not responding", while the process appears to be quite stable with pm_async deactivated (may only be harder to reproduce, still needs more testing)
* hibernate/resume will fail with pm_async deactivated sometimes, and pretty reliably with pm_async enabled
Pretty weird...
Finally! I've pulled http://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-fixes-3.11 for rc3 and reliability of suspending and hibernating has improved *a lot*. The machine will even hibernate and resume successfully with pm_async=1 now. A similar error message occurred only once in 18 attempts, this time slightly different from the other errors:
[ 4379.361691] radeon 0000:01:00.0: GPU lockup CP stall for more than 10000msec
[ 4379.361692] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000000026 last fence id 0x0000000000000024)
[ 4379.361693] [drm:r600_uvd_ib_test] *ERROR* radeon: fence wait failed (-35).
[ 4379.361695] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on ring 5 (-35).
Let's see how often this will occur. Apart from this one failure (I hope it was only bad luck) it works very well now. Thanks for the improvements.
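When comparing many suspend/hibernate cycles, it can help to tally the error signatures quoted in this report across a saved dmesg dump. The following is a minimal Python sketch, not part of any radeon tooling: the function name and the category labels are invented for illustration, and the patterns are simply substrings of the messages shown above and in the bug summary.

```python
import re
from collections import Counter

# Substrings taken from the kernel messages quoted in this report.
# The dictionary keys are illustrative labels, not kernel identifiers.
SIGNATURES = {
    "uvd_not_responding": re.compile(r"UVD not responding"),
    "ring_test_failed": re.compile(r"radeon: ring \d+ test failed"),
    "gpu_lockup": re.compile(r"GPU lockup"),
    "fence_wait_failed": re.compile(r"radeon: fence wait failed"),
    "ib_test_failed": re.compile(r"radeon: failed testing IB on ring \d+"),
}


def count_radeon_errors(lines):
    """Count how many log lines match each error signature."""
    counts = Counter()
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

A possible use is `count_radeon_errors(open("dmesg.out"))`, run once per saved log. Note that a single lockup can produce two "GPU lockup" lines, as in the excerpt above, so the counts are per line rather than per incident.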
Ok, it happened a second time with a similar message; this seems to occur randomly and unpredictably as always.
It is definitely a UVD-only issue. I've applied the radeon.no_uvd patch that someone posted here: http://pastebin.com/0mRGb224. Now with UVD turned off, 32 out of 32 attempts have been successful. No more hibernate problems, but UVD doesn't work of course.
I'm constantly improving the UVD support for 3.11 (and backporting the fixes to 3.10). So please retest with each new release candidate as long as those symptoms remain.
*** Bug 67723 has been marked as a duplicate of this bug. ***
I've run the tests again with the latest commits pulled from ~agd5f/linux, and hibernate and suspend/resume seem to work fine now with UVD enabled! That's 12 out of 12 successful attempts so far, so please let's leave this open for a while until I have properly tested this in production for three or four days.
Resume from suspend only works when I have the power plug attached. It doesn't matter if I suspended on AC or not, as long as I have it attached on resume.
Created attachment 83867: dmesg
There was an error on the first suspend:
...
[drm:rv770_stop_dpm] *ERROR* Could not force DPM to low.
(In reply to comment #13)
> Resume from suspend only works when I have the power plug attached. It
> doesn't matter if I suspended on AC or not, as long as I have it attached
> on resume.
Does resume work with dpm disabled?
This bug is specifically about UVD. If there is an issue with dpm and resume, it should be filed separately.
Just as I wanted to write that everything works ok with UVD enabled and this bug can be closed, after the 19th suspend/hibernate attempt the machine failed to resume, with the screen simply going black and reporting no signal :-( I've tried to resume from the same image 5 times but to no avail, so I'm back now with radeon.no_uvd=1 for a while to give it a test with a real-world usage pattern, just to be sure that this works in all cases too.
Ok, status update. After 29 successful attempts and an uptime of approx. 9.5 days with radeon.no_uvd=1, I am quite confident that the suspend/hibernate/resume problems can be attributed to the UVD code. While I am thankful for the newly gained stability, I will continue testing newer release candidates with UVD enabled the next time I have to reboot my machine for a system backup.
Another status update: I've re-enabled UVD after updating to kernel 3.11.4 and mesa git, and now hibernating and resuming worked without problems for 25 attempts. No more ring 5 test errors. Let's finally declare this fixed, many thanks!
BTW: UVD works fine, except that sometimes there seems to be some kind of problem with initializing; green or pink blocks or strange white/black-patterned lines appear regardless of the video played, even with videos that have played flawlessly before. This issue goes away after a while. I will collect more information and file a separate bug for this.
Sounds good, please try the newest 3.12 rc first before opening another bug report about the playback artifacts. I've fixed a couple of different bugs in this release and I'm not sure they have all been backported yet.