Bug 64867 - Hangs on Cayman (HD6950) when watching flash/using vdpau
Summary: Hangs on Cayman (HD6950) when watching flash/using vdpau
Status: RESOLVED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/Gallium/r600
Version: git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Default DRI bug account
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-05-22 14:13 UTC by Martin Bednar
Modified: 2015-01-27 10:27 UTC
CC List: 3 users

See Also:
i915 platform:
i915 features:


Attachments
dmesg after X freeze (343.92 KB, text/plain)
2013-05-22 14:13 UTC, Martin Bednar
output after hang. (1.62 MB, image/jpeg)
2013-05-22 14:26 UTC, Martin Bednar
netconsole.log (4 bytes, text/plain)
2013-06-16 18:36 UTC, Harald Judt
netconsole.log (18.36 KB, text/plain)
2013-06-16 18:39 UTC, Harald Judt

Description Martin Bednar 2013-05-22 14:13:03 UTC
Created attachment 79658
dmesg after X freeze

When watching a Flash video (Opera + flash-11.2.202.262), the kernel log starts filling up with the following messages, and sometimes the GPU eventually hangs:
[ 7009.603310] radeon 0000:01:00.0: GPU fault detected: 146 0x0e677004
[ 7009.603313] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 7009.603316] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000

Once these messages start appearing, there is graphics corruption everywhere. I also managed to reproduce this when playing back a video with mplayer using the VDPAU backend. (EnableLinuxHWVideoDecode=1 is commented out in /etc/adobe/mms.cfg.)

I managed to capture the dmesg output after the GPU hang (attached). I then tried to save the contents of /sys/kernel/debug/dri/0, but since I didn't know which files were important, I dumped them all in a bash for loop, and that caused a complete hang.

linux-3.10-rc1
mesa, libdrm, radeon ddx from git.

This might be related to https://bugs.freedesktop.org/show_bug.cgi?id=62959, but piglit just finished fine (i.e. it didn't hang this time, although the logs still contain a bunch of "radeon_gem_object_create:69 alloc size 1365Mb bigger than 256Mb limit" messages).
Additional info: I have a dual-screen setup.

Anything more you need, I'll be happy to provide.
Comment 1 Martin Bednar 2013-05-22 14:26:41 UTC
Created attachment 79661
output after hang.

The output I got when trying to cat /sys/kernel/debug/dri/0/*
Sorry for the bad quality.
Comment 2 Harald Judt 2013-06-03 23:33:16 UTC
I also get system hangs when watching a Flash video in Firefox (linux-3.8.13; libdrm, mesa, etc. from git). The screen simply goes black (no signal) and the machine is dead, leaving a hard reset as the only option. dmesg is flooded with the following lines:

radeon 0000:01:00.0: GPU fault detected: 147 0x0d859002
radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x000012D8
radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x05090002
[...] repeated a hundred times with only the first line changing a bit [...]

then:
radeon 0000:01:00.0: GPU fault detected: 146 0x07151004
radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[...] repeated a hundred times with only the first line changing a bit [...]

The timestamps indicate this goes on for approximately two minutes before the hang.
Comment 3 Alex Deucher 2013-06-04 00:27:36 UTC
(In reply to comment #2)
> I too get system hangs when watching a flash video in firefox. linux-3.8.13,
> libdrm, mesa etc. git. Screen simply becomes black (no signal) and machine
> is dead, leaving a hard reset as the only option. The dmesg is flooded with
> the following lines:
> 
> radeon 0000:01:00.0: GPU fault detected: 147 0x0d859002
> radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x000012D8
> radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x05090002
> [...] repeated a hundred times with only the first line changing a bit [...]

Something in the mesa drivers is emitting a command buffer without a proper virtual address for CB5.
Comment 4 Harald Judt 2013-06-06 19:42:42 UTC
Hoping that it would be a workaround, I've applied the following patch from another bug report:

diff --git a/drivers/gpu/drm/radeon/radeon_cs.c b/drivers/gpu/drm/radeon/radeon_cs.c
index 5407459..959e7cf 100644
--- a/drivers/gpu/drm/radeon/radeon_cs.c
+++ b/drivers/gpu/drm/radeon/radeon_cs.c
@@ -477,6 +477,7 @@ static int radeon_cs_ib_vm_chunk(struct radeon_device *rdev,
        if (r) {
                goto out;
        }
+    radeon_fence_wait(vm->fence, false);
        radeon_cs_sync_rings(parser);
        radeon_cs_sync_to(parser, vm->fence);
        radeon_cs_sync_to(parser, radeon_vm_grab_id(rdev, vm, parser->ring));


The hang happened again while playing a Flash video (I'll try to reproduce it somehow), but this time I was able to switch VTs. X was killed, and the following additional lines were appended to dmesg:

[30243.510949] radeon 0000:01:00.0: GPU lockup CP stall for more than 10000msec
[30243.510951] radeon 0000:01:00.0: GPU lockup (waiting for 0x00000000005c90bd last fence id 0x00000000005c90b8)
[30243.510952] radeon 0000:01:00.0: couldn't schedule ib
[30243.510973] radeon 0000:01:00.0: Trying to sync to a disabled ring!
[30243.511047] radeon 0000:01:00.0: couldn't schedule ib
[30243.511048] [drm:radeon_cs_ib_chunk] *ERROR* Failed to schedule IB !
[30243.511197] radeon 0000:01:00.0: couldn't schedule ib
[30243.511198] [drm:radeon_cs_ib_chunk] *ERROR* Failed to schedule IB !
[30243.512323] radeon 0000:01:00.0: couldn't schedule ib
[30243.512324] [drm:radeon_cs_ib_chunk] *ERROR* Failed to schedule IB !
[30243.512851] radeon 0000:01:00.0: couldn't schedule ib
[30243.512852] [drm:radeon_cs_ib_chunk] *ERROR* Failed to schedule IB !
[30254.004957] radeon 0000:01:00.0: GPU lockup CP stall for more than 10000msec
[30254.004959] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000000334 last fence id 0x0000000000000333)
[30254.004961] radeon 0000:01:00.0: couldn't schedule ib
[30254.005052] radeon 0000:01:00.0: couldn't schedule ib
[30254.005064] radeon 0000:01:00.0: couldn't schedule ib
[30254.005070] radeon 0000:01:00.0: couldn't schedule ib
[30254.005084] radeon 0000:01:00.0: couldn't schedule ib
[30254.005092] radeon 0000:01:00.0: couldn't schedule ib
[30254.005097] radeon 0000:01:00.0: Trying to sync to a disabled ring!
[...]
[30254.012901] [drm:radeon_cs_ib_chunk] *ERROR* Failed to schedule IB !
[...] many similar repeated lines about IB [...]
[30264.498754] radeon 0000:01:00.0: GPU lockup CP stall for more than 10000msec
[30264.498759] radeon 0000:01:00.0: GPU lockup (waiting for 0x00000000000090e5 last fence id 0x00000000000090e3)

Trying to restart X didn't work (X crashed), so I had to reboot the machine. I'm not sure whether this reveals anything. Is there anything I could do to provide more information the next time this happens?
Comment 5 Harald Judt 2013-06-06 20:58:57 UTC
This doesn't seem to be specific to playing Flash videos. A few moments ago, parts of the screen started looking corrupted; when I checked, dmesg was flooded again, and I could only reboot the machine over ssh. Judging by the Cayman bug reports, many people seem to have the same or similar problems. I'll try a 3.7 or maybe even a 3.6 kernel; perhaps that works reliably.
Comment 6 Martin Bednar 2013-06-06 21:11:10 UTC
Adding R600_DEBUG=nodma to my environment makes the problem go away. Not pretty, but it's a workaround. Same question, though: how could I help debug this?
Comment 7 Harald Judt 2013-06-16 18:36:58 UTC
Created attachment 80920
netconsole.log

I was able to capture more complete output using netconsole. I'm not sure whether it helps, but here it is.

Here are the steps to reproduce the crash:

1) Go to youtube.com, start playing a video.
   => This prints these lines:
      radeon 0000:01:00.0: GPU fault detected: 146 0x0b95e004
      radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x000010B9
      radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x050E0004
2) Close the web browser and activate xscreensaver with an OpenGL screensaver such as photopile.
   => GPU lockup CP stall
Comment 8 Harald Judt 2013-06-16 18:39:05 UTC
Created attachment 80921
netconsole.log

Last upload didn't work.
Comment 9 Harald Judt 2013-06-16 18:47:46 UTC
> adding R600_DEBUG=nodma to my environment makes the problem go away...
> Not pretty, but a workaround. Same question though : how could I
> help debugging this?

I can confirm this helps: not against the GPU faults while playing the video, but against the crashes/hangs when an OpenGL xscreensaver (or similar) activates. Thanks for mentioning the workaround.

I've also applied https://bugs.freedesktop.org/attachment.cgi?id=72794, but it doesn't help.
Comment 10 Harald Judt 2013-09-14 02:42:29 UTC
With current up-to-date git versions of libdrm, mesa, xorg-server and xf86-video-ati, the R600_DEBUG=nodma hack no longer seems necessary (linux-3.11.0-rc6 with UVD disabled); the GPU faults have vanished and the system is stable.

