Bug 25717 - Radeon: system locks up when sending many vertices.
Summary: Radeon: system locks up when sending many vertices.
Status: RESOLVED NOTABUG
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Radeon
Version: unspecified
Hardware: Other All
Importance: medium critical
Assignee: Default DRI bug account
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-12-18 12:34 UTC by Rafael Monica
Modified: 2014-01-23 11:31 UTC


Attachments
program that locks up the system (1.06 KB, patch)
2009-12-18 12:34 UTC, Rafael Monica

Description Rafael Monica 2009-12-18 12:34:31 UTC
Created attachment 32184
program that locks up the system

Some programs cause a complete system lock-up here, most notably the program makehuman, and progs/demos/gltestperf. I suspect this is because the GPU or the drm has a problem handling too many vertices. I created a little test program that causes a guaranteed system lock-up here.

This is on a RS780, drm-next and mesa master from today.
Comment 1 Andy Furniss 2010-03-08 07:28:51 UTC
(In reply to comment #0)
> Created an attachment (id=32184)
> program that locks up the system
> 
> Some programs cause a complete system lock-up here, most notably the program
> makehuman, and progs/demos/gltestperf. I suspect this is because the GPU or the
> drm has a problem handling too many vertices. I created a little test program
> that causes a guaranteed system lock-up here.
> 
> This is on a RS780, drm-next and mesa master from today.
> 

I can confirm a gltestperf GPU lockup (monitor off, no SysRq, nothing logged, reset button won't recover, soft power cycle will) on my AGP RV670 running current git of d-r-t, mesa, ddx and xorg.

It only happens with KMS and is also present with a Jan 27th d-r-t kernel I tried.

agpmode=-1 doesn't help: it hangs almost instantly with that, whereas with AGP GART I can get to benchmark 2 or 3 before it locks up.

Running under UMS I get many instances of this error:

bo(0x83f97a0, 65536) is mapped (-1) can't valide it.
invalid bo(0x83f97a0) [0xC1CDC000, 0xC1CEC000]

gltestperf will eventually hang, but only needs <ctrl><c> to kill it.
Comment 2 Andy Furniss 2010-03-15 07:57:52 UTC
(In reply to comment #1)

Running today's d-r-t, things have changed a bit:

With AGP GART I can now run gltestperf OK, though I suspect this is luck: the "glxgears type perf" is a bit lower than with recent kernels, at about 800 fps, so maybe the (poor) caching is slowing things down just enough to avoid the lockup.

With PCIE GART, where glxgears is at 1200 fps, I still hit the lockup, but now I am saved by the new GPU reset code.

Mar 15 14:29:37 nf7 kernel: radeon 0000:02:00.0: GPU lockup CP stall for more than 1000msec
Mar 15 14:29:37 nf7 kernel: ------------[ cut here ]------------
Mar 15 14:29:37 nf7 kernel: WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:234 radeon_fence_wait+0x262/0x2c0 [radeon]()
Mar 15 14:29:37 nf7 kernel: Hardware name:  
Mar 15 14:29:37 nf7 kernel: GPU lockup (waiting for 0x000017D3 last fence id 0x000017CB)
Mar 15 14:29:37 nf7 kernel: Modules linked in: radeon ttm drm_kms_helper drm i2c_algo_bit cfbcopyarea cfbimgblt cfbfillrect fbcon font bitblit softcursor fb mt352 saa7134_dvb videobuf_dvb dvb_core mt20xx tea5767 tda9887 tda8290 tuner saa7134 v4l2_common videodev v4l1_compat videobuf_dma_sg videobuf_core tveeprom i2c_core nvidia_agp ehci_hcd agpgart ohci_hcd usbhid usbcore snd_intel8x0 snd_ac97_codec ac97_bus forcedeth
Mar 15 14:29:37 nf7 kernel: Pid: 2526, comm: gltestperf Tainted: G        W  2.6.33-50657-g5894684 #1
Mar 15 14:29:37 nf7 kernel: Call Trace:
Mar 15 14:29:37 nf7 kernel:  [<fa1164a2>] ? radeon_fence_wait+0x262/0x2c0 [radeon]
Mar 15 14:29:37 nf7 kernel:  [<fa1164a2>] ? radeon_fence_wait+0x262/0x2c0 [radeon]
Mar 15 14:29:37 nf7 kernel:  [<c102f811>] warn_slowpath_common+0x81/0xa0
Mar 15 14:29:37 nf7 kernel:  [<fa1164a2>] ? radeon_fence_wait+0x262/0x2c0 [radeon]
Mar 15 14:29:37 nf7 kernel:  [<c102f87b>] warn_slowpath_fmt+0x2b/0x30
Mar 15 14:29:37 nf7 kernel:  [<fa1164a2>] radeon_fence_wait+0x262/0x2c0 [radeon]
Mar 15 14:29:37 nf7 kernel:  [<c104b950>] ? autoremove_wake_function+0x0/0x40
Mar 15 14:29:37 nf7 kernel:  [<fa117230>] ? radeon_sync_obj_wait+0x0/0x20 [radeon]
Mar 15 14:29:37 nf7 kernel:  [<fa117241>] radeon_sync_obj_wait+0x11/0x20 [radeon]
Mar 15 14:29:37 nf7 kernel:  [<f96b6e82>] ttm_bo_wait+0xf2/0x1d0 [ttm]
Mar 15 14:29:37 nf7 kernel:  [<fa12bb34>] radeon_gem_wait_idle_ioctl+0x84/0x120 [radeon]
Mar 15 14:29:37 nf7 kernel:  [<f954f719>] drm_ioctl+0x259/0x3e0 [drm]
Mar 15 14:29:37 nf7 kernel:  [<fa12bab0>] ? radeon_gem_wait_idle_ioctl+0x0/0x120 [radeon]
Mar 15 14:29:37 nf7 kernel:  [<c10c05e9>] ? do_sync_read+0xb9/0xf0
Mar 15 14:29:37 nf7 kernel:  [<c10edd6e>] ? __fsnotify_parent+0xe/0x110
Mar 15 14:29:37 nf7 kernel:  [<f954f4c0>] ? drm_ioctl+0x0/0x3e0 [drm]
Mar 15 14:29:37 nf7 kernel:  [<c10ce35d>] vfs_ioctl+0x2d/0xa0
Mar 15 14:29:37 nf7 kernel:  [<c10ce51a>] do_vfs_ioctl+0x6a/0x560
Mar 15 14:29:37 nf7 kernel:  [<c10edb0e>] ? fsnotify+0xe/0x120
Mar 15 14:29:37 nf7 kernel:  [<c10edd6e>] ? __fsnotify_parent+0xe/0x110
Mar 15 14:29:37 nf7 kernel:  [<c10c0634>] ? rw_verify_area+0x14/0xc0
Mar 15 14:29:37 nf7 kernel:  [<c10c0e46>] ? vfs_read+0x146/0x150
Mar 15 14:29:37 nf7 kernel:  [<c10c1b44>] ? fget_light+0x14/0xe0
Mar 15 14:29:37 nf7 kernel:  [<c10c1b44>] ? fget_light+0x14/0xe0
Mar 15 14:29:37 nf7 kernel:  [<c10cea4e>] sys_ioctl+0x3e/0x60
Mar 15 14:29:37 nf7 kernel:  [<c1002c0c>] sysenter_do_call+0x12/0x22
Mar 15 14:29:37 nf7 kernel: ---[ end trace eae1ac44941f0643 ]---
Mar 15 14:29:37 nf7 kernel: [drm] Disabling audio support
Mar 15 14:29:37 nf7 kernel: radeon 0000:02:00.0: f72844e0 unpin not necessary
Mar 15 14:29:37 nf7 kernel: radeon 0000:02:00.0: GPU softreset 
Mar 15 14:29:37 nf7 kernel: radeon 0000:02:00.0:   R_008010_GRBM_STATUS=0xE7731CE0
Mar 15 14:29:37 nf7 kernel: radeon 0000:02:00.0:   R_008014_GRBM_STATUS2=0x00880103
Mar 15 14:29:37 nf7 kernel: radeon 0000:02:00.0:   R_000E50_SRBM_STATUS=0x200018C0
Mar 15 14:29:37 nf7 kernel: radeon 0000:02:00.0:   R_008020_GRBM_SOFT_RESET=0x00007FEE
Mar 15 14:29:37 nf7 kernel: radeon 0000:02:00.0: R_008020_GRBM_SOFT_RESET=0x00000001
Mar 15 14:29:37 nf7 kernel: radeon 0000:02:00.0:   R_008010_GRBM_STATUS=0xA0003030
Mar 15 14:29:37 nf7 kernel: radeon 0000:02:00.0:   R_008014_GRBM_STATUS2=0x00000003
Mar 15 14:29:37 nf7 kernel: radeon 0000:02:00.0:   R_000E50_SRBM_STATUS=0x200080C0
Mar 15 14:29:37 nf7 kernel: radeon 0000:02:00.0: GPU reset succeed
Comment 3 Andy Furniss 2010-08-01 04:10:58 UTC
(In reply to comment #2)

I've now got a new PC with a PCIE rv770 and running current d-r-t and ddx/mesa gits can still reproduce this GPU reset.

I also notice that after a GPU reset power management no longer works and I am stuck with full clock speed.

radeon 0000:01:00.0: GPU lockup CP stall for more than 1000msec
------------[ cut here ]------------
WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:235 radeon_fence_wait+0x249/0x2b0 [radeon]()
Hardware name: System Product Name
GPU lockup (waiting for 0x0002AC4B last fence id 0x0002AC48)
Modules linked in: radeon ttm drm_kms_helper drm i2c_algo_bit fbcon tileblit font bitblit fbcon_rotate fbcon_cw fbcon_ud fbcon_ccw softcursor powernow_k8 mperf cpufreq_ondemand saa7134_alsa mt352 saa7134_dvb videobuf_dvb dvb_core mt20xx tea5767 tda9887 tda8290 tuner snd_hda_codec_atihdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec saa7134 v4l2_common videodev v4l1_compat videobuf_dma_sg videobuf_core ir_common firewire_ohci ir_core snd_pcsp snd_hwdep snd_pcm tveeprom firewire_core option snd_timer pata_atiixp usb_wwan pata_acpi r8169 usbserial snd i2c_piix4 serio_raw i2c_core crc_itu_t soundcore snd_page_alloc xhci_hcd asus_atk0110 mii ata_generic k10temp pata_jmicron
Pid: 1970, comm: gltestperf Not tainted 2.6.35-rc4-10804-gbffec3b #2
Call Trace:
 [<faceae89>] ? radeon_fence_wait+0x249/0x2b0 [radeon]
 [<c0444cac>] warn_slowpath_common+0x7c/0xa0
 [<faceae89>] ? radeon_fence_wait+0x249/0x2b0 [radeon]
 [<c0444d4e>] warn_slowpath_fmt+0x2e/0x30
 [<faceae89>] radeon_fence_wait+0x249/0x2b0 [radeon]
 [<c045e6c0>] ? autoremove_wake_function+0x0/0x40
 [<faceb760>] ? radeon_sync_obj_wait+0x0/0x10 [radeon]
 [<faceb76c>] radeon_sync_obj_wait+0xc/0x10 [radeon]
 [<fa168a34>] ttm_bo_wait+0xc4/0x150 [ttm]
 [<fad0034f>] radeon_gem_wait_idle_ioctl+0x7f/0x100 [radeon]
 [<f9f9cc19>] drm_ioctl+0x269/0x400 [drm]
 [<fad002d0>] ? radeon_gem_wait_idle_ioctl+0x0/0x100 [radeon]
 [<c04f8081>] ? do_sync_read+0xb1/0xf0
 [<f9f9c9b0>] ? drm_ioctl+0x0/0x400 [drm]
 [<c0504ed8>] vfs_ioctl+0x28/0xa0
 [<c050561a>] do_vfs_ioctl+0x6a/0x570
 [<c04f8c11>] ? vfs_read+0x171/0x180
 [<c0505b59>] sys_ioctl+0x39/0x60
 [<c040369f>] sysenter_do_call+0x12/0x28
---[ end trace 6ec64b6e53c6d554 ]---
[drm] Disabling audio support
radeon 0000:01:00.0: GPU softreset 
radeon 0000:01:00.0:   R_008010_GRBM_STATUS=0xE77304A4
radeon 0000:01:00.0:   R_008014_GRBM_STATUS2=0x00FF0F02
radeon 0000:01:00.0:   R_000E50_SRBM_STATUS=0x20003EC0
radeon 0000:01:00.0:   R_008020_GRBM_SOFT_RESET=0x00007FEE
radeon 0000:01:00.0: R_008020_GRBM_SOFT_RESET=0x00000001
radeon 0000:01:00.0:   R_008010_GRBM_STATUS=0x00003028
radeon 0000:01:00.0:   R_008014_GRBM_STATUS2=0x00000002
radeon 0000:01:00.0:   R_000E50_SRBM_STATUS=0x200000C0
radeon 0000:01:00.0: GPU reset succeed
[drm] Clocks initialized !
[drm] ring test succeeded in 1 usecs
[drm] ib test succeeded in 1 usecs
[drm] Enabling audio support
Comment 4 Andy Furniss 2010-09-23 03:19:14 UTC
(In reply to comment #3)
> (In reply to comment #2)
> 
> I've now got a new PC with a PCIE rv770 and running current d-r-t and ddx/mesa
> gits can still reproduce this GPU reset.

I can no longer trigger this with gltestperf or the test app running the current versions of d-r-t, mesa and ddx as long as I use r600c.

I can with r600g - but then it's early days for that and maybe a separate issue.
Comment 5 Jerome Glisse 2011-02-09 07:48:42 UTC
Works here, closing
Comment 6 Nicolas Kaiser 2011-02-09 08:37:27 UTC
Running gltestperf with r600g locks up hard at my place in benchmark 3.

System environment:
-- system architecture: amd64
-- Linux distribution: Gentoo
-- GPU: RS780G
-- Model: ATI Radeon HD 3200
-- Display connector: VGA
-- xf86-video-ati: 6.14.0
-- xserver: 1.9.3.902
-- mesa: c26478680989bd3d7303c5d772f7fb2a76045191
-- drm: 550fe2ca3b29ad2191eab4fdfbed9ed21e25492d
-- kernel: 2.6.38-rc3
Comment 7 Marek Olšák 2011-02-09 09:25:52 UTC
Maybe gltestperf doesn't lock up, it's just slow. On my lappy, it takes 9 seconds to complete. If it takes more than 10 seconds, kernel will consider it a lock-up, even though it's not.
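The timeout behaviour described above can be sketched in a few lines of C. This is only an illustration, not the driver's code: `struct cp_watch`, `cp_looks_locked_up` and the `timeout_ms` parameter are hypothetical names, and the "has the read pointer moved" check is an assumption about how progress is detected. The only detail taken from this thread is the comparison of elapsed time against a 10000 ms threshold, which appears in the r100.c hunk quoted in a later comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the driver's lockup bookkeeping. */
struct cp_watch {
    uint32_t last_rptr; /* ring read pointer seen at the last progress */
    uint64_t last_ms;   /* timestamp (ms) of the last observed progress */
};

/*
 * Returns true when the read pointer has not moved for at least
 * timeout_ms. Any movement of the pointer counts as progress and
 * restarts the clock. A long-running but healthy command stream that
 * does not advance the pointer is indistinguishable from a real hang.
 */
bool cp_looks_locked_up(struct cp_watch *w, uint32_t rptr,
                        uint64_t now_ms, uint64_t timeout_ms)
{
    assert(now_ms >= w->last_ms); /* time must not run backwards */

    if (rptr != w->last_rptr) {
        w->last_rptr = rptr;
        w->last_ms = now_ms;
        return false;
    }
    return (now_ms - w->last_ms) >= timeout_ms;
}
```

With a 10000 ms timeout, a job that stalls the pointer for 9 seconds passes unnoticed, while one that needs 13 seconds is reported as a lockup even though it would have finished on its own.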
Comment 8 Nicolas Kaiser 2011-02-09 10:44:27 UTC
(In reply to comment #7)
> Maybe gltestperf doesn't lock up, it's just slow. On my lappy, it takes 9
> seconds to complete. If it takes more than 10 seconds, kernel will consider it
> a lock-up, even though it's not.

Yes, indeed. Now I disabled lockups like this ...

 drivers/gpu/drm/radeon/r100.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/gpu/drm/radeon/r100.c b/drivers/gpu/drm/radeon/r100.c
index 5f15820..ce5ea3d 100644
--- a/drivers/gpu/drm/radeon/r100.c
+++ b/drivers/gpu/drm/radeon/r100.c
@@ -2036,7 +2036,7 @@ bool r100_gpu_cp_is_lockup(struct radeon_device *rdev, struct r100_gpu_lockup *l
 	elapsed = jiffies_to_msecs(cjiffies - lockup->last_jiffies);
 	if (elapsed >= 10000) {
 		dev_err(rdev->dev, "GPU lockup CP stall for more than %lumsec\n", elapsed);
-		return true;
+//		return true;
 	}
 	/* give a chance to the GPU ... */
 	return false;


... and gltestperf finished successfully.
According to the log below, it takes more than 13 seconds here:

Feb  9 19:32:40 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 10000msec
Feb  9 19:32:40 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 10500msec
Feb  9 19:32:41 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 11000msec
Feb  9 19:32:41 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 11500msec
Feb  9 19:32:42 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 12000msec
Feb  9 19:32:42 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 12500msec
Feb  9 19:32:43 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 13000msec
Feb  9 19:32:54 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 10000msec
Feb  9 19:32:55 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 10500msec
Feb  9 19:32:55 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 11000msec
Feb  9 19:32:56 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 11500msec
Feb  9 19:32:56 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 12000msec
Feb  9 19:32:57 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 12500msec
Feb  9 19:32:57 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 13000msec
Feb  9 19:33:09 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 10000msec
Feb  9 19:33:10 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 10500msec
Feb  9 19:33:10 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 11000msec
Feb  9 19:33:10 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 11500msec
Feb  9 19:33:11 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 12000msec
Feb  9 19:33:12 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 12500msec
Feb  9 19:33:12 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 13000msec
Feb  9 19:33:24 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 10000msec
Feb  9 19:33:24 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 10500msec
Feb  9 19:33:25 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 11000msec
Feb  9 19:33:25 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 11500msec
Feb  9 19:33:26 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 12000msec
Feb  9 19:33:26 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 12500msec
Feb  9 19:33:27 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 13000msec
Feb  9 19:33:38 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 10000msec
Feb  9 19:33:39 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 10500msec
Feb  9 19:33:39 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 11000msec
Feb  9 19:33:40 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 11500msec
Feb  9 19:33:40 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 12000msec
Feb  9 19:33:41 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 12500msec
Feb  9 19:33:41 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 13000msec
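The "more than 13 seconds" figure is simply the largest stall value reported in each burst of messages above. A minimal sketch of extracting it from such a log, assuming the message format shown here; `parse_cp_stall_ms` and `max_cp_stall_ms` are hypothetical helper names, not part of any kernel or distribution tool:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Returns the stall duration in milliseconds from one log line,
 * or -1 if the line is not a "CP stall" message.
 */
long parse_cp_stall_ms(const char *line)
{
    static const char marker[] = "GPU lockup CP stall for more than ";
    const char *p;
    long ms;

    assert(line != NULL);

    p = strstr(line, marker);
    if (p == NULL)
        return -1;
    /* The number is followed directly by the literal "msec". */
    if (sscanf(p + strlen(marker), "%ldmsec", &ms) != 1)
        return -1;
    return ms;
}

/* Returns the largest stall found in n lines, or -1 if there is none. */
long max_cp_stall_ms(const char *const *lines, size_t n)
{
    long best = -1;
    size_t i;

    for (i = 0; i < n; i++) {
        long ms = parse_cp_stall_ms(lines[i]);
        if (ms > best)
            best = ms;
    }
    return best;
}
```

Because the driver only prints a lower bound ("more than N msec"), the maximum found this way is itself a lower bound on how long the benchmark step actually took.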
Comment 9 Marek Olšák 2014-01-23 11:31:12 UTC
(In reply to comment #8)
> (In reply to comment #7)
> > Maybe gltestperf doesn't lock up, it's just slow. On my lappy, it takes 9
> > seconds to complete. If it takes more than 10 seconds, kernel will consider it
> > a lock-up, even though it's not.
> 
> Yes, indeed. Now I disabled lockups like this ...

> ... and gltestperf finished successfully.

Closing, as it's not a lockup.

