| Summary: | Radeon: system locks up when sending many vertices. | | |
|---|---|---|---|
| Product: | DRI | Reporter: | Rafael Monica <monraaf> |
| Component: | DRM/Radeon | Assignee: | Default DRI bug account <dri-devel> |
| Status: | RESOLVED NOTABUG | QA Contact: | |
| Severity: | critical | | |
| Priority: | medium | CC: | adf.lists, nikai, virtuousfox, zajec5 |
| Version: | unspecified | | |
| Hardware: | Other | | |
| OS: | All | | |

Description
Rafael Monica
2009-12-18 12:34:31 UTC
(In reply to comment #0)
> Created an attachment (id=32184)
> program that locks up the system
>
> Some programs cause a complete system lock-up here, most notably the program
> makehuman, and progs/demos/gltestperf. I suspect this is because the GPU or the
> drm has a problem handling too many vertices. I created a little test program
> that causes a guaranteed system lock-up here.
>
> This is on a RS780, drm-next and mesa master from today.
>

I can confirm a gltestperf GPU lockup (monitor off, no SysRq, nothing logged, reset button won't recover, soft power cycle will) on my AGP RV670 running current gits drt,mesa,ddx,xorg.

It only happens with KMS and is also present with a Jan27th drt kernel I tried. agpmode=-1 doesn't help - it hangs almost instantly with that, with AGP gart I can get to benchmark 2 or 3 before it locks.

Running under UMS I get lots of the error -

bo(0x83f97a0, 65536) is mapped (-1) can't valide it.
invalid bo(0x83f97a0) [0xC1CDC000, 0xC1CEC000]

gltestperf will eventually hang, but only needs <ctrl><c> to kill it.

(In reply to comment #1)

Running todays drt things have changed a bit -

With AGP gart I can now run gltestperf OK, though I suspect this could be by luck as the "glxgears type perf" is a bit lower than recent kernels @ 800fps so maybe the (poor) caching is slowing things down a bit and avoiding it.

With PCIE gart, where glxgears is 1200, I still hit the lockup, but now I am saved by the new GPU reset code.

Mar 15 14:29:37 nf7 kernel: radeon 0000:02:00.0: GPU lockup CP stall for more than 1000msec
Mar 15 14:29:37 nf7 kernel: ------------[ cut here ]------------
Mar 15 14:29:37 nf7 kernel: WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:234 radeon_fence_wait+0x262/0x2c0 [radeon]()
Mar 15 14:29:37 nf7 kernel: Hardware name:
Mar 15 14:29:37 nf7 kernel: GPU lockup (waiting for 0x000017D3 last fence id 0x000017CB)
Mar 15 14:29:37 nf7 kernel: Modules linked in: radeon ttm drm_kms_helper drm i2c_algo_bit cfbcopyarea cfbimgblt cfbfillrect fbcon font bitblit softcursor fb mt352 saa7134_dvb videobuf_dvb dvb_core mt20xx tea5767 tda9887 tda8290 tuner saa7134 v4l2_common videodev v4l1_compat videobuf_dma_sg videobuf_core tveeprom i2c_core nvidia_agp ehci_hcd agpgart ohci_hcd usbhid usbcore snd_intel8x0 snd_ac97_codec ac97_bus forcedeth
Mar 15 14:29:37 nf7 kernel: Pid: 2526, comm: gltestperf Tainted: G W 2.6.33-50657-g5894684 #1
Mar 15 14:29:37 nf7 kernel: Call Trace:
Mar 15 14:29:37 nf7 kernel: [<fa1164a2>] ? radeon_fence_wait+0x262/0x2c0 [radeon]
Mar 15 14:29:37 nf7 kernel: [<fa1164a2>] ? radeon_fence_wait+0x262/0x2c0 [radeon]
Mar 15 14:29:37 nf7 kernel: [<c102f811>] warn_slowpath_common+0x81/0xa0
Mar 15 14:29:37 nf7 kernel: [<fa1164a2>] ? radeon_fence_wait+0x262/0x2c0 [radeon]
Mar 15 14:29:37 nf7 kernel: [<c102f87b>] warn_slowpath_fmt+0x2b/0x30
Mar 15 14:29:37 nf7 kernel: [<fa1164a2>] radeon_fence_wait+0x262/0x2c0 [radeon]
Mar 15 14:29:37 nf7 kernel: [<c104b950>] ? autoremove_wake_function+0x0/0x40
Mar 15 14:29:37 nf7 kernel: [<fa117230>] ? radeon_sync_obj_wait+0x0/0x20 [radeon]
Mar 15 14:29:37 nf7 kernel: [<fa117241>] radeon_sync_obj_wait+0x11/0x20 [radeon]
Mar 15 14:29:37 nf7 kernel: [<f96b6e82>] ttm_bo_wait+0xf2/0x1d0 [ttm]
Mar 15 14:29:37 nf7 kernel: [<fa12bb34>] radeon_gem_wait_idle_ioctl+0x84/0x120 [radeon]
Mar 15 14:29:37 nf7 kernel: [<f954f719>] drm_ioctl+0x259/0x3e0 [drm]
Mar 15 14:29:37 nf7 kernel: [<fa12bab0>] ? radeon_gem_wait_idle_ioctl+0x0/0x120 [radeon]
Mar 15 14:29:37 nf7 kernel: [<c10c05e9>] ? do_sync_read+0xb9/0xf0
Mar 15 14:29:37 nf7 kernel: [<c10edd6e>] ? __fsnotify_parent+0xe/0x110
Mar 15 14:29:37 nf7 kernel: [<f954f4c0>] ? drm_ioctl+0x0/0x3e0 [drm]
Mar 15 14:29:37 nf7 kernel: [<c10ce35d>] vfs_ioctl+0x2d/0xa0
Mar 15 14:29:37 nf7 kernel: [<c10ce51a>] do_vfs_ioctl+0x6a/0x560
Mar 15 14:29:37 nf7 kernel: [<c10edb0e>] ? fsnotify+0xe/0x120
Mar 15 14:29:37 nf7 kernel: [<c10edd6e>] ? __fsnotify_parent+0xe/0x110
Mar 15 14:29:37 nf7 kernel: [<c10c0634>] ? rw_verify_area+0x14/0xc0
Mar 15 14:29:37 nf7 kernel: [<c10c0e46>] ? vfs_read+0x146/0x150
Mar 15 14:29:37 nf7 kernel: [<c10c1b44>] ? fget_light+0x14/0xe0
Mar 15 14:29:37 nf7 kernel: [<c10c1b44>] ? fget_light+0x14/0xe0
Mar 15 14:29:37 nf7 kernel: [<c10cea4e>] sys_ioctl+0x3e/0x60
Mar 15 14:29:37 nf7 kernel: [<c1002c0c>] sysenter_do_call+0x12/0x22
Mar 15 14:29:37 nf7 kernel: ---[ end trace eae1ac44941f0643 ]---
Mar 15 14:29:37 nf7 kernel: [drm] Disabling audio support
Mar 15 14:29:37 nf7 kernel: radeon 0000:02:00.0: f72844e0 unpin not necessary
Mar 15 14:29:37 nf7 kernel: radeon 0000:02:00.0: GPU softreset
Mar 15 14:29:37 nf7 kernel: radeon 0000:02:00.0: R_008010_GRBM_STATUS=0xE7731CE0
Mar 15 14:29:37 nf7 kernel: radeon 0000:02:00.0: R_008014_GRBM_STATUS2=0x00880103
Mar 15 14:29:37 nf7 kernel: radeon 0000:02:00.0: R_000E50_SRBM_STATUS=0x200018C0
Mar 15 14:29:37 nf7 kernel: radeon 0000:02:00.0: R_008020_GRBM_SOFT_RESET=0x00007FEE
Mar 15 14:29:37 nf7 kernel: radeon 0000:02:00.0: R_008020_GRBM_SOFT_RESET=0x00000001
Mar 15 14:29:37 nf7 kernel: radeon 0000:02:00.0: R_008010_GRBM_STATUS=0xA0003030
Mar 15 14:29:37 nf7 kernel: radeon 0000:02:00.0: R_008014_GRBM_STATUS2=0x00000003
Mar 15 14:29:37 nf7 kernel: radeon 0000:02:00.0: R_000E50_SRBM_STATUS=0x200080C0
Mar 15 14:29:37 nf7 kernel: radeon 0000:02:00.0: GPU reset succeed

(In reply to comment #2)

I've now got a new PC with a PCIE rv770 and running current d-r-t and ddx/mesa gits can still reproduce this GPU reset.

I also notice that after a GPU reset power management no longer works and I am stuck with full clock speed.

radeon 0000:01:00.0: GPU lockup CP stall for more than 1000msec
------------[ cut here ]------------
WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:235 radeon_fence_wait+0x249/0x2b0 [radeon]()
Hardware name: System Product Name
GPU lockup (waiting for 0x0002AC4B last fence id 0x0002AC48)
Modules linked in: radeon ttm drm_kms_helper drm i2c_algo_bit fbcon tileblit font bitblit fbcon_rotate fbcon_cw fbcon_ud fbcon_ccw softcursor powernow_k8 mperf cpufreq_ondemand saa7134_alsa mt352 saa7134_dvb videobuf_dvb dvb_core mt20xx tea5767 tda9887 tda8290 tuner snd_hda_codec_atihdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec saa7134 v4l2_common videodev v4l1_compat videobuf_dma_sg videobuf_core ir_common firewire_ohci ir_core snd_pcsp snd_hwdep snd_pcm tveeprom firewire_core option snd_timer pata_atiixp usb_wwan pata_acpi r8169 usbserial snd i2c_piix4 serio_raw i2c_core crc_itu_t soundcore snd_page_alloc xhci_hcd asus_atk0110 mii ata_generic k10temp pata_jmicron
Pid: 1970, comm: gltestperf Not tainted 2.6.35-rc4-10804-gbffec3b #2
Call Trace:
[<faceae89>] ? radeon_fence_wait+0x249/0x2b0 [radeon]
[<c0444cac>] warn_slowpath_common+0x7c/0xa0
[<faceae89>] ? radeon_fence_wait+0x249/0x2b0 [radeon]
[<c0444d4e>] warn_slowpath_fmt+0x2e/0x30
[<faceae89>] radeon_fence_wait+0x249/0x2b0 [radeon]
[<c045e6c0>] ? autoremove_wake_function+0x0/0x40
[<faceb760>] ? radeon_sync_obj_wait+0x0/0x10 [radeon]
[<faceb76c>] radeon_sync_obj_wait+0xc/0x10 [radeon]
[<fa168a34>] ttm_bo_wait+0xc4/0x150 [ttm]
[<fad0034f>] radeon_gem_wait_idle_ioctl+0x7f/0x100 [radeon]
[<f9f9cc19>] drm_ioctl+0x269/0x400 [drm]
[<fad002d0>] ? radeon_gem_wait_idle_ioctl+0x0/0x100 [radeon]
[<c04f8081>] ? do_sync_read+0xb1/0xf0
[<f9f9c9b0>] ? drm_ioctl+0x0/0x400 [drm]
[<c0504ed8>] vfs_ioctl+0x28/0xa0
[<c050561a>] do_vfs_ioctl+0x6a/0x570
[<c04f8c11>] ? vfs_read+0x171/0x180
[<c0505b59>] sys_ioctl+0x39/0x60
[<c040369f>] sysenter_do_call+0x12/0x28
---[ end trace 6ec64b6e53c6d554 ]---
[drm] Disabling audio support
radeon 0000:01:00.0: GPU softreset
radeon 0000:01:00.0: R_008010_GRBM_STATUS=0xE77304A4
radeon 0000:01:00.0: R_008014_GRBM_STATUS2=0x00FF0F02
radeon 0000:01:00.0: R_000E50_SRBM_STATUS=0x20003EC0
radeon 0000:01:00.0: R_008020_GRBM_SOFT_RESET=0x00007FEE
radeon 0000:01:00.0: R_008020_GRBM_SOFT_RESET=0x00000001
radeon 0000:01:00.0: R_008010_GRBM_STATUS=0x00003028
radeon 0000:01:00.0: R_008014_GRBM_STATUS2=0x00000002
radeon 0000:01:00.0: R_000E50_SRBM_STATUS=0x200000C0
radeon 0000:01:00.0: GPU reset succeed
[drm] Clocks initialized !
[drm] ring test succeeded in 1 usecs
[drm] ib test succeeded in 1 usecs
[drm] Enabling audio support

(In reply to comment #3)
> (In reply to comment #2)
>
> I've now got a new PC with a PCIE rv770 and running current d-r-t and ddx/mesa
> gits can still reproduce this GPU reset.

I can no longer trigger this with gltestperf or the test app running the current versions of d-r-t, mesa and ddx as long as I use r600c.

I can with r600g - but then it's early days for that and maybe a separate issue.

Works here, closing

Running gltestperf with r600g locks up hard at my place in benchmark 3.

System environment:
-- system architecture: amd64
-- Linux distribution: Gentoo
-- GPU: RS780G
-- Model: ATI Radeon HD 3200
-- Display connector: VGA
-- xf86-video-ati: 6.14.0
-- xserver: 1.9.3.902
-- mesa: c26478680989bd3d7303c5d772f7fb2a76045191
-- drm: 550fe2ca3b29ad2191eab4fdfbed9ed21e25492d
-- kernel: 2.6.38-rc3

Maybe gltestperf doesn't lock up, it's just slow. On my lappy, it takes 9 seconds to complete. If it takes more than 10 seconds, kernel will consider it a lock-up, even though it's not.

(In reply to comment #7)
> Maybe gltestperf doesn't lock up, it's just slow. On my lappy, it takes 9
> seconds to complete.
> If it takes more than 10 seconds, kernel will consider it
> a lock-up, even though it's not.

Yes, indeed. Now I disabled lockups like this ...

 drivers/gpu/drm/radeon/r100.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/gpu/drm/radeon/r100.c b/drivers/gpu/drm/radeon/r100.c
index 5f15820..ce5ea3d 100644
--- a/drivers/gpu/drm/radeon/r100.c
+++ b/drivers/gpu/drm/radeon/r100.c
@@ -2036,7 +2036,7 @@ bool r100_gpu_cp_is_lockup(struct radeon_device *rdev, struct r100_gpu_lockup *l
 	elapsed = jiffies_to_msecs(cjiffies - lockup->last_jiffies);
 	if (elapsed >= 10000) {
 		dev_err(rdev->dev, "GPU lockup CP stall for more than %lumsec\n", elapsed);
-		return true;
+//		return true;
 	}
 	/* give a chance to the GPU ... */
 	return false;

... and gltestperf finished successfully. According to the below results, it takes more than 13 seconds at my place:

Feb 9 19:32:40 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 10000msec
Feb 9 19:32:40 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 10500msec
Feb 9 19:32:41 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 11000msec
Feb 9 19:32:41 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 11500msec
Feb 9 19:32:42 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 12000msec
Feb 9 19:32:42 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 12500msec
Feb 9 19:32:43 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 13000msec
Feb 9 19:32:54 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 10000msec
Feb 9 19:32:55 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 10500msec
Feb 9 19:32:55 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 11000msec
Feb 9 19:32:56 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 11500msec
Feb 9 19:32:56 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 12000msec
Feb 9 19:32:57 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 12500msec
Feb 9 19:32:57 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 13000msec
Feb 9 19:33:09 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 10000msec
Feb 9 19:33:10 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 10500msec
Feb 9 19:33:10 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 11000msec
Feb 9 19:33:10 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 11500msec
Feb 9 19:33:11 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 12000msec
Feb 9 19:33:12 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 12500msec
Feb 9 19:33:12 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 13000msec
Feb 9 19:33:24 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 10000msec
Feb 9 19:33:24 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 10500msec
Feb 9 19:33:25 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 11000msec
Feb 9 19:33:25 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 11500msec
Feb 9 19:33:26 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 12000msec
Feb 9 19:33:26 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 12500msec
Feb 9 19:33:27 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 13000msec
Feb 9 19:33:38 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 10000msec
Feb 9 19:33:39 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 10500msec
Feb 9 19:33:39 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 11000msec
Feb 9 19:33:40 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 11500msec
Feb 9 19:33:40 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 12000msec
Feb 9 19:33:41 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 12500msec
Feb 9 19:33:41 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 13000msec

(In reply to comment #8)
> (In reply to comment #7)
> > Maybe gltestperf doesn't lock up, it's just slow. On my lappy, it takes 9
> > seconds to complete. If it takes more than 10 seconds, kernel will consider it
> > a lock-up, even though it's not.
>
> Yes, indeed. Now I disabled lockups like this ...
> ... and gltestperf finished successfully.

Closing, as it's not a lockup.
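
For readers following the thread, the false positive can be illustrated with a small user-space sketch of a time-based stall check. This is not the driver's code: only the 10000 ms threshold comes from the patch quoted in this report, while the struct, the function name, and the idea of tracking a read pointer as the sign of progress are assumptions made for illustration.

```c
#include <stdbool.h>

/* Threshold in milliseconds; the value 10000 is taken from the quoted patch. */
#define STALL_LIMIT_MS 10000UL

/* Hypothetical bookkeeping: when was progress last observed? */
struct stall_tracker {
    unsigned long last_progress_ms; /* time of the last observed progress */
    unsigned int  last_read_ptr;    /* read pointer value seen at that time */
};

/*
 * Returns true when the read pointer has not moved for STALL_LIMIT_MS or
 * longer. A single command that legitimately needs more time than the limit
 * is indistinguishable from a hang under this heuristic.
 */
static bool cp_looks_locked_up(struct stall_tracker *t,
                               unsigned int read_ptr, unsigned long now_ms)
{
    if (read_ptr != t->last_read_ptr) {
        /* Progress was made: remember it and restart the clock. */
        t->last_read_ptr = read_ptr;
        t->last_progress_ms = now_ms;
        return false;
    }
    return (now_ms - t->last_progress_ms) >= STALL_LIMIT_MS;
}
```

With a tracker initialised to `{ 0, 0 }`, a call at 9000 ms with an unchanged read pointer returns false, which matches the 9 second run reported above, while a call at 13000 ms returns true even though the work would have finished on its own. Once the read pointer advances, the check returns false again and the clock restarts.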