Bug 29556

Summary: [rv620] GPU reset followed by black screen
Product: DRI
Reporter: Stefano Carignano <scary.moo>
Component: DRM/Radeon
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED
QA Contact:
Severity: normal
Priority: medium
CC: marvin24
Version: XOrg git
Keywords: patch
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard:
i915 platform:
i915 features:
Attachments:
Description                              Flags
system dmesg                             none
Xorg log                                 none
rebased V2 blit patch from dri-devel     none

Description Stefano Carignano 2010-08-13 04:01:14 UTC
Created attachment 37840 [details]
system dmesg

Using the latest git (as of 12/08/2010) of libdrm, mesa (classic), xf86-video-ati and drm-radeon-testing (commit "drm/radeon/kms: enable writeback on remaing asics").
The GPU is an HD3470 Mobility (rv620), forced to the lowest power state
(echo "low" > /sys/class/drm/card0/device/power_profile).
During normal web browsing, possibly while playing a Flash video, the mouse cursor suddenly stops and immediately afterwards I get a black screen from which I cannot recover unless I sysrq-reboot (I haven't tried ssh).
After the reboot, the system log shows:

[ 2325.450063] radeon 0000:01:00.0: GPU lockup CP stall for more than 1000msec
[ 2325.450066] ------------[ cut here ]------------
[ 2325.450083] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:239 radeon_fence_wait+0x35b/0x3c0 [radeon]()
[ 2325.450085] Hardware name: Satellite A300
[ 2325.450087] GPU lockup (waiting for 0x0000BDBC last fence id 0x0000BDBA)
[ 2325.450089] Modules linked in: radeon ttm ath5k drm_kms_helper cfbcopyarea cfbimgblt cfbfillrect i2c_i801 ath
[ 2325.450099] Pid: 1954, comm: X Not tainted 2.6.35+ #19
[ 2325.450101] Call Trace:
[ 2325.450108]  [<ffffffff8103a04a>] warn_slowpath_common+0x7a/0xb0
[ 2325.450111]  [<ffffffff8103a121>] warn_slowpath_fmt+0x41/0x50
[ 2325.450120]  [<ffffffffa009ba1b>] radeon_fence_wait+0x35b/0x3c0 [radeon]
[ 2325.450125]  [<ffffffff810521f0>] ? autoremove_wake_function+0x0/0x40
[ 2325.450134]  [<ffffffffa009c1fc>] radeon_sync_obj_wait+0xc/0x10 [radeon]
[ 2325.450139]  [<ffffffffa005ad69>] ttm_bo_wait+0xf9/0x1b0 [ttm]
[ 2325.450144]  [<ffffffffa005e11f>] ttm_bo_move_accel_cleanup+0x9f/0x2e0 [ttm]
[ 2325.450153]  [<ffffffffa009c32f>] radeon_move_blit+0x11f/0x180 [radeon]
[ 2325.450162]  [<ffffffffa009c786>] radeon_bo_move+0xb6/0x1e0 [radeon]
[ 2325.450166]  [<ffffffffa005b1a5>] ttm_bo_handle_move_mem+0x135/0x410 [ttm]
[ 2325.450170]  [<ffffffffa005d2c9>] ttm_bo_evict+0x1b9/0x3f0 [ttm]
[ 2325.450175]  [<ffffffff81090001>] ? __isolate_lru_page+0x81/0xa0
[ 2325.450179]  [<ffffffffa005c6f7>] ttm_mem_evict_first+0x147/0x1e0 [ttm]
[ 2325.450183]  [<ffffffffa005d059>] ttm_bo_mem_space+0x3e9/0x4a0 [ttm]
[ 2325.450187]  [<ffffffffa005d5e7>] ttm_bo_move_buffer+0xe7/0x160 [ttm]
[ 2325.450192]  [<ffffffff81260028>] ? drm_mapbufs+0x318/0x340
[ 2325.450196]  [<ffffffffa005d6f6>] ttm_bo_validate+0x96/0x120 [ttm]
[ 2325.450199]  [<ffffffffa005db35>] ttm_bo_init+0x2e5/0x340 [ttm]
[ 2325.450209]  [<ffffffffa009d198>] radeon_bo_create+0x128/0x220 [radeon]
[ 2325.450218]  [<ffffffffa009cf10>] ? radeon_ttm_bo_destroy+0x0/0xc0 [radeon]
[ 2325.450228]  [<ffffffffa00b1aa4>] radeon_gem_object_create+0x84/0x100 [radeon]
[ 2325.450232]  [<ffffffff810c9030>] ? pollwake+0x0/0x60
[ 2325.450242]  [<ffffffffa00b1f1f>] radeon_gem_create_ioctl+0x4f/0xe0 [radeon]
[ 2325.450246]  [<ffffffff81398e94>] ? sock_aio_read+0x134/0x150
[ 2325.450249]  [<ffffffff8126138c>] drm_ioctl+0x33c/0x410
[ 2325.450259]  [<ffffffffa00b1ed0>] ? radeon_gem_create_ioctl+0x0/0xe0 [radeon]
[ 2325.450262]  [<ffffffff810b8bf2>] ? do_sync_read+0xd2/0x110
[ 2325.450266]  [<ffffffff810c7b2c>] vfs_ioctl+0x3c/0xd0
[ 2325.450268]  [<ffffffff810c812c>] do_vfs_ioctl+0x7c/0x520
[ 2325.450271]  [<ffffffff810b9345>] ? vfs_read+0x105/0x140
[ 2325.450274]  [<ffffffff810c861a>] sys_ioctl+0x4a/0x80
[ 2325.450277]  [<ffffffff81004669>] ? do_device_not_available+0x9/0x10
[ 2325.450280]  [<ffffffff8100256b>] system_call_fastpath+0x16/0x1b
[ 2325.450283] ---[ end trace b2d00ea6bab57761 ]---
[ 2325.450295] [drm] Disabling audio support
[ 2325.451428] radeon 0000:01:00.0: GPU softreset 
[ 2325.451431] radeon 0000:01:00.0:   R_008010_GRBM_STATUS=0xA0003030
[ 2325.451435] radeon 0000:01:00.0:   R_008014_GRBM_STATUS2=0x00000003
[ 2325.451438] radeon 0000:01:00.0:   R_000E50_SRBM_STATUS=0x200010C0
[ 2325.451452] radeon 0000:01:00.0:   R_008020_GRBM_SOFT_RESET=0x00007FEE
[ 2325.468635] radeon 0000:01:00.0: R_008020_GRBM_SOFT_RESET=0x00000001
[ 2325.484646] radeon 0000:01:00.0:   R_008010_GRBM_STATUS=0x00003030
[ 2325.484651] radeon 0000:01:00.0:   R_008014_GRBM_STATUS2=0x00000003
[ 2325.484654] radeon 0000:01:00.0:   R_000E50_SRBM_STATUS=0x200000C0
[ 2325.485660] radeon 0000:01:00.0: GPU reset succeed
[ 2325.506653] [drm] Clocks initialized !
[ 2382.495138] SysRq : Emergency Sync
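
The log above carries two pieces of numeric information: how far the awaited fence is ahead of the last signaled one, and how the status registers differ before and after the soft reset. A minimal sketch for pulling both out of a saved dmesg follows. The function names and the SAMPLE string are hypothetical helpers for illustration, not part of any radeon tooling, and mapping the changed bits to engine names would need the register definitions from the driver source, which this sketch does not attempt.

```python
import re

# Matches the radeon_fence_wait warning line seen in the dmesg above.
LOCKUP_RE = re.compile(
    r"GPU lockup \(waiting for (0x[0-9A-Fa-f]+) last fence id (0x[0-9A-Fa-f]+)\)"
)
# Matches register dumps such as R_008010_GRBM_STATUS=0xA0003030.
REG_RE = re.compile(r"(R_[0-9A-F]+_\w+)=(0x[0-9A-Fa-f]+)")

# A few lines copied from the report (timestamps trimmed).
SAMPLE = """\
GPU lockup (waiting for 0x0000BDBC last fence id 0x0000BDBA)
radeon 0000:01:00.0:   R_008010_GRBM_STATUS=0xA0003030
radeon 0000:01:00.0:   R_008014_GRBM_STATUS2=0x00000003
radeon 0000:01:00.0:   R_000E50_SRBM_STATUS=0x200010C0
radeon 0000:01:00.0:   R_008010_GRBM_STATUS=0x00003030
radeon 0000:01:00.0:   R_008014_GRBM_STATUS2=0x00000003
radeon 0000:01:00.0:   R_000E50_SRBM_STATUS=0x200000C0
"""


def fence_gap(log_text):
    """Return (waited_for, last_signaled, gap) from the lockup line, or None."""
    m = LOCKUP_RE.search(log_text)
    if m is None:
        return None
    waited, last = int(m.group(1), 16), int(m.group(2), 16)
    return waited, last, waited - last


def register_changes(log_text):
    """Map register name -> (first value, last value, XOR of the two)."""
    seen = {}
    for name, value in REG_RE.findall(log_text):
        seen.setdefault(name, []).append(int(value, 16))
    return {
        name: (vals[0], vals[-1], vals[0] ^ vals[-1])
        for name, vals in seen.items()
        if len(vals) > 1  # only registers dumped more than once
    }


if __name__ == "__main__":
    print(fence_gap(SAMPLE))
    for name, (before, after, diff) in register_changes(SAMPLE).items():
        print(f"{name}: {before:#010x} -> {after:#010x} (changed bits {diff:#010x})")
```

For this report the awaited fence is two sequence numbers ahead of the last signaled one, and GRBM_STATUS2 is unchanged by the reset while GRBM_STATUS and SRBM_STATUS each lose some set bits.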
Comment 1 Stefano Carignano 2010-08-13 04:03:30 UTC
Created attachment 37841 [details]
Xorg log
Comment 2 Marc Dietrich 2010-08-13 05:14:33 UTC
Created attachment 37845 [details] [review]
rebased V2 blit patch from dri-devel

Can you try this patch on top of d-r-t (drm-radeon-testing)?
Comment 3 Marc Dietrich 2010-08-15 04:45:10 UTC
With the above patch the problem is not really cured here; it just happens less often. So there is still something wrong with the blit code in d-r-t.
Comment 4 Stefano Carignano 2010-08-15 07:02:15 UTC
(In reply to comment #3)
> with the above patch the problem got not really cured here. It just happens
> more seldom. So there is still something wrong with the blit code in d-r-t.

Oh, that's too bad. I have been trying the patch for a couple of days now and it did seem to improve things: I haven't managed to crash the system since. (I'm not using it heavily, though. Is this related to GPU load, or is it somewhat random?)
Comment 5 Marc Dietrich 2010-08-16 10:40:17 UTC
In fact, the bug is a little different now. Instead of a GPU hang, Xorg just blocks in D state, with no dmesg output. I ran cat /proc/`pidof X`/stack and got:

[<ffffffffa01ac7b2>] radeon_fence_wait+0x1d1/0x2ea [radeon]
[<ffffffffa01acf41>] radeon_sync_obj_wait+0x11/0x13 [radeon]
[<ffffffffa009295c>] ttm_bo_wait+0xbe/0x153 [ttm]
[<ffffffffa0095b54>] ttm_bo_move_accel_cleanup+0x8b/0x29f [ttm]
[<ffffffffa01ad07d>] radeon_move_blit+0x12a/0x148 [radeon]
[<ffffffffa01ad420>] radeon_bo_move+0x114/0x13c [radeon]
[<ffffffffa0092da9>] ttm_bo_handle_move_mem+0x1b6/0x2b1 [ttm]
[<ffffffffa009449e>] ttm_bo_evict+0x2e1/0x34a [ttm]
[<ffffffffa009467d>] ttm_mem_evict_first+0x176/0x1a4 [ttm]
[<ffffffffa0094141>] ttm_bo_mem_space+0x3fd/0x479 [ttm]
[<ffffffffa0094b6e>] ttm_bo_move_buffer+0xb3/0x11b [ttm]
[<ffffffffa0094c83>] ttm_bo_validate+0xad/0xf6 [ttm]
[<ffffffffa0094ffe>] ttm_bo_init+0x332/0x36b [ttm]
[<ffffffffa01ae8e9>] radeon_bo_create+0x17f/0x246 [radeon]
[<ffffffffa01beac8>] radeon_gem_object_create+0x7d/0xda [radeon]
[<ffffffffa01beb72>] radeon_gem_create_ioctl+0x4d/0xab [radeon]
[<ffffffffa002543c>] drm_ioctl+0x255/0x34d [drm]
[<ffffffff810f719c>] vfs_ioctl+0x32/0xa6
[<ffffffff810f7aba>] do_vfs_ioctl+0x46a/0x4a3
[<ffffffff810f7b49>] sys_ioctl+0x56/0x79
[<ffffffff81002b9b>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
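
A stack dump like the one above can be checked mechanically, which helps when a hang has to be caught repeatedly. The following is a rough sketch, assuming the /proc/<pid>/stack line format shown in this comment; parse_stack, blocked_in and read_stack are hypothetical helper names, and reading the file usually requires root.

```python
import re

# One frame per line: [<address>] symbol+0xoff/0xlen [module]
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)


def parse_stack(text):
    """Return a list of (symbol, module_or_None), innermost frame first."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.search(line)
        if m:  # the trailing 0xffffffffffffffff marker line is skipped
            frames.append((m.group(1), m.group(2)))
    return frames


def blocked_in(text, symbol="radeon_fence_wait"):
    """True if the innermost frame of the dump is the given symbol."""
    frames = parse_stack(text)
    return bool(frames) and frames[0][0] == symbol


def read_stack(pid):
    """Read the kernel stack of a task; typically needs root privileges."""
    with open(f"/proc/{pid}/stack") as fh:
        return fh.read()
```

Calling blocked_in(read_stack(pid)) with the X server's PID would report whether the task is currently sitting in the fence wait, as it was in the trace above.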
Comment 6 Marc Dietrich 2010-08-16 10:43:03 UTC
By the way, this often happens when displaying images.google.com (with some images) in Firefox and trying to scroll. My screen has a 1920x1200 resolution, but my computer at work also crashed today at 1280x1024. Maybe this is not relevant, but just in case...
Comment 7 Alex Deucher 2010-08-16 11:02:01 UTC
Are you getting these issues specific to d-r-t or are you seeing them on 2.6.36-rc1 or drm-core-next?
Comment 8 Marc Dietrich 2010-08-16 14:13:39 UTC
It happens on d-r-t since the blit cleanup http://git.kernel.org/?p=linux/kernel/git/airlied/drm-2.6.git;a=commit;h=36dff284447cfd7dce032b760842952eefa7bddf
I also have V2 of the cleanup (see comment #2) applied (and 2.6.35.2 as well).
Comment 9 Alex Deucher 2010-08-16 14:44:17 UTC
(In reply to comment #8)
> it happens on d-r-t since the blit the cleanup
> http://git.kernel.org/?p=linux/kernel/git/airlied/drm-2.6.git;a=commit;h=36dff284447cfd7dce032b760842952eefa7bddf
> I also have V2 of the cleanup (see comment #2) applied (and 2.6.35.2 also).

That patch is currently busted as is.  You need to either revert it, or apply v2 that I posted on dri-devel.
Comment 10 Marc Dietrich 2010-08-17 01:39:36 UTC
Well, that's what I did. Basically, the patch in comment #2 is an interdiff of blit_V1 and blit_V2, so it should produce the same result as unapplying V1 and applying V2, correct?

output of interdiff:

diff -u b/drivers/gpu/drm/radeon/r600_blit_kms.c b/drivers/gpu/drm/radeon/r600_blit_kms.c
--- b/drivers/gpu/drm/radeon/r600_blit_kms.c
+++ b/drivers/gpu/drm/radeon/r600_blit_kms.c
@@ -448,19 +448,8 @@
        int num_packet2s = 0;
 
        /* pin copy shader into vram if already initialized */
-       if (rdev->r600_blit.shader_obj) {
-               r = radeon_bo_reserve(rdev->r600_blit.shader_obj, false);
-               if (unlikely(r != 0))
-                       return r;
-               r = radeon_bo_pin(rdev->r600_blit.shader_obj, RADEON_GEM_DOMAIN_VRAM,
-                                 &rdev->r600_blit.shader_gpu_addr);
-               radeon_bo_unreserve(rdev->r600_blit.shader_obj);
-               if (r) {
-                       dev_err(rdev->dev, "(%d) pin blit object failed\n", r);
-                       return r;
-               }
-               return 0;
-       }
+       if (rdev->r600_blit.shader_obj)
+               goto done;
 
        mutex_init(&rdev->r600_blit.mutex);
        rdev->r600_blit.state_offset = 0;
@@ -519,6 +508,18 @@
        memcpy(ptr + rdev->r600_blit.ps_offset, r6xx_ps, r6xx_ps_size * 4);
        radeon_bo_kunmap(rdev->r600_blit.shader_obj);
        radeon_bo_unreserve(rdev->r600_blit.shader_obj);
+
+done:
+       r = radeon_bo_reserve(rdev->r600_blit.shader_obj, false);
+       if (unlikely(r != 0))
+               return r;
+       r = radeon_bo_pin(rdev->r600_blit.shader_obj, RADEON_GEM_DOMAIN_VRAM,
+                         &rdev->r600_blit.shader_gpu_addr);
+       radeon_bo_unreserve(rdev->r600_blit.shader_obj);
+       if (r) {
+               dev_err(rdev->dev, "(%d) pin blit object failed\n", r);
+               return r;
+       }
        return 0;
 }
Comment 11 Alex Deucher 2010-08-17 07:44:50 UTC
is this still a problem with the current d-r-t?
Comment 12 Marc Dietrich 2010-08-17 12:12:50 UTC
yes it does
Comment 13 Marc Dietrich 2010-08-17 12:14:24 UTC
(In reply to comment #12)
> yes it does

eh - is!
Comment 14 Alex Deucher 2010-08-17 12:38:46 UTC
Can you bisect to see what commit is causing the problem?
Comment 15 Marc Dietrich 2010-08-17 14:05:34 UTC
The bug is hard to trigger (10 minutes of scrolling with Firefox), so bisecting will take a lot of time. I booted with no_wb=1 just for testing and now it seems to work fine. So maybe the blit and writeback changes have some unhealthy relationship. Also, the git history somehow got changed... I'm sure the writeback changes were there many days ago, before the blit change. It may also help to know that the original reporter (Stefano) has an rv620 chip, which is (AFAIK) similar to the rs780/785 chips.
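
When comparing runs with and without no_wb=1, it helps to record which value the running kernel actually picked up. The sketch below reads a module parameter from sysfs. It assumes the radeon module exposes no_wb under /sys/module/radeon/parameters/, which depends on how the parameter is declared in the kernel being tested; read_module_param is a hypothetical helper name.

```python
from pathlib import Path


def read_module_param(module, param, sysfs_root="/sys/module"):
    """Return the parameter value as a string, or None if it is not readable.

    Assumption: the parameter is exported at
    <sysfs_root>/<module>/parameters/<param>.
    """
    path = Path(sysfs_root) / module / "parameters" / param
    try:
        return path.read_text().strip()
    except OSError:
        # Module not loaded, parameter not exported, or no permission.
        return None


if __name__ == "__main__":
    print("radeon no_wb =", read_module_param("radeon", "no_wb"))
```

A result of None only means the value could not be read from sysfs; it does not say whether writeback is enabled.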
Comment 16 Alex Deucher 2010-08-17 14:09:48 UTC
(In reply to comment #15)
> the bug is hard to trigger (10 min scrolling with firefox), so bisecting will
> take a lot of time. I booted with no_wb=1 just for testing and now it seems to
> work fine. So maybe the blit and the writeback change have some unhealthy

Writeback has nothing to do with the blit, but it might cause problems on its own. If no_wb=1 fixes the issue, then writeback might not work well on your system.

> relationship. Also somehow the git history got changed... I'm sure the
> writeback changes where there many days ago and before the blit change. Maybe

The branch was rebased.
Comment 17 Andy Furniss 2010-08-18 03:20:48 UTC
(In reply to comment #0)
> Created an attachment (id=37840) [details]
> system dmesg
> 
> Using latest git (as of 12/08/2010) of libdrm, mesa(classic),xf86-video-ati and
> drm-radeon-testing (commit drm/radeon/kms: enable writeback on remaing asics ), 
> gpu is a hd3470 mobility (rv620), forced to lowest power state 
> (echo "low" > /sys/class/drm/card0/device/power_profile).

I managed to get the same with the last d-r-t, using low power + gits like you, but this was on a rv790.

It seems like seamonkey was involved, but just to confuse the issue I wasn't running a clean d-r-t or ddx - which may well be irrelevant but -

d-r-t had tiling fixed + 2 cs parser fixes from the list, ddx had wait for vline FALSE and dri2 sync was off in drirc.

Had tested without issue many games, mplayer and mesa demos over the day.

The lockup was triggered when I found a seamonkey bug that makes it spawn a new window every 1/2-1 sec. While this was happening, X was unusable due to the constant new windows, so I was switching back and forth between vt2 and vt7. Then it locked up and I didn't get the screen back. After a sysrq reboot the card was still in a bad state: alsa failed to probe the hardware. I carried on into X and looked at the kern log in an xterm without problems, but as soon as I started seamonkey it locked up again.

The reboot went OK this time and, try as hard as I could (triggering the seamonkey bug and switching VTs), I couldn't reproduce it.

Now running current vanilla d-r-t and ddx I have so far failed to trigger it, but then I ran the other d-r-t for days OK.
Comment 18 Marc Dietrich 2010-09-05 10:30:34 UTC
seems the bug I was seeing (see comment #5) is fixed by v3 fencing patch, so for me it is ready to be closed...
Comment 19 Andy Furniss 2010-09-05 15:45:22 UTC
(In reply to comment #18)
> seems the bug I was seeing (see comment #5) is fixed by v3 fencing patch, so
> for me it is ready to be closed...

I have failed to reproduce the one lockup I had with various d-r-ts, and now am running d-r-t + v3 fence.
