Bug 29978

Summary: [RADEON:KMS:RS780:R600G] r600c r600g GPU lockup loading savefile in vegastrike
Product: Mesa Reporter: Nicolas Kaiser <nikai>
Component: Drivers/DRI/R600 Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED QA Contact:
Severity: normal    
Priority: medium    
Version: git   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Attachments: GPU lockup call trace
output of MESA_GLSL=dump
working output of MESA_GLSL=dump

Description Nicolas Kaiser 2010-09-02 12:31:51 UTC
Created attachment 38388 [details]
GPU lockup call trace

GPU lockup loading savefile in Vega Strike (SVN), with r600 classic driver.
Bisected to mesa: 5ad74779cea07cc6a19a52874cdaef8b018e2f1b
 (Eric Anholt, ir_to_mesa: Load all the STATE_VAR elements of a builtin uniform to a temp.)
Verified by reverting the above commit against mesa head (6e3cbeb3614152ea3aa188666d6166b484ee3f56).

System environment:
-- system architecture: amd64
-- Linux distribution: Gentoo
-- GPU: RS780G
-- Model: ATI Radeon HD 3200
-- Display connector: VGA
-- xf86-video-ati: 2b98ec1f7e931019a4ab699a56d5dfaa395946fb
-- xserver: 1.9.0
-- mesa: git
-- drm: 8a76244a0fd09d0e3298fe68af812d7eaa4dbcb5
-- kernel: 2.6.35.3
Comment 1 Andre Maasikas 2010-09-03 08:43:33 UTC
From testing with piglit glsl-vs-texturematrix2, the likely cause
seems to be that we now use more temp registers in the VS than we've programmed
into SQ_GPR_RESOURCE_MGMT. The commit above increased the register usage,
but we should not hang the card in this case anyway.

I don't know how to fix this, though. SQ_GPR_RESOURCE_MGMT is hardcoded for now,
and to fix this correctly we should probably set the values to the maximum we will
use in a CS. However, there is no infrastructure in the driver to change values that were already emitted (at the beginning) of the CS.
Also, I don't know the exact constraints/dependencies for the RESOURCE_MGMT settings.
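
To illustrate the idea of sizing the GPR allocation to the maximum used within one CS, here is a minimal sketch. It is not driver code: the `Shader` type, the `gpr_budget_for_cs` helper, and the `total_gprs` pool size are hypothetical names and values chosen for illustration, and the real constraints on SQ_GPR_RESOURCE_MGMT are not modeled.

```python
from dataclasses import dataclass


@dataclass
class Shader:
    stage: str     # "vs" or "ps" (hypothetical stage labels)
    num_gprs: int  # temporaries the compiled shader needs


def gpr_budget_for_cs(shaders, total_gprs):
    """Return the per-stage GPR maximum over all shaders in one CS.

    Raises ValueError if the per-stage maxima do not fit into the
    (assumed) shared pool of total_gprs registers.
    """
    needed = {}
    for sh in shaders:
        needed[sh.stage] = max(needed.get(sh.stage, 0), sh.num_gprs)
    if sum(needed.values()) > total_gprs:
        raise ValueError(
            f"CS needs {sum(needed.values())} GPRs, only {total_gprs} available"
        )
    return needed
```

A caller would collect every shader bound during the CS, compute the budget once, and program it before the first draw; if the budget cannot be met, the draw should be refused rather than submitted.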
Comment 2 Nicolas Kaiser 2010-09-03 12:36:16 UTC
Narrowing down, as of mesa commit f061524f0737bf59dad6ab9bb2e0015df804e4b5:
This GPU lockup doesn't appear with shaders disabled.
It appears with shaders set to "simplest shaders", and it can be reproduced by starting a new campaign.

"Simplest shaders" are defined in
http://vegastrike.svn.sourceforge.net/viewvc/vegastrike/trunk/data/techniques/2_ps1.4/default_simple.technique?revision=12869&view=markup
which points to
http://vegastrike.svn.sourceforge.net/viewvc/vegastrike/trunk/data/programs/fixed5.vp?revision=12671&view=markup
Comment 3 Nicolas Kaiser 2010-09-04 16:47:34 UTC
For now, fixed with 280665be7026c978acead9713c10271c36a571ee
Comment 4 Eric Anholt 2010-09-07 15:11:40 UTC
OK, with the new version after the revert, this should (still) be fixed.  If it isn't, could you attach the output of MESA_GLSL=dump? (In general, this is probably a good idea for shader-related issues)
Comment 5 Nicolas Kaiser 2010-09-07 17:17:08 UTC
Created attachment 38535 [details]
output of MESA_GLSL=dump

Locking up again. I'm using mesa-git at commit a09a8ec12d76e1fb1583fa99cf9f48246c108d7b.

This is the output of "MESA_GLSL=dump vegastrike -j nothing.mission". This avoids the game menu in order to minimize output.

According to /var/log/messages, there were 26 lockups in a row.
I'm not sure how useful this output is - my impression was that the lockups started right after initialisation, but at the end of the dump I still see initialisation messages.
Comment 6 Nicolas Kaiser 2010-09-08 01:55:05 UTC
Created attachment 38540 [details]
working output of MESA_GLSL=dump

For comparison purposes, this is the same output with mesa-git a09a8ec12d76e1fb1583fa99cf9f48246c108d7b and reverting
acd7c21541110d7ae6b9e63647391f65946e5c5d and
6c0ba32fd1466e8c1700acab3003dc1fe1deb337.

Working nicely. Apparently, there's not much output after the final initialisation messages.
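
For comparing a failing dump against a working one, a small script along these lines may help. It is only a sketch using Python's standard `difflib`; the `diff_dumps` helper name is made up here, and the dump texts are assumed to have been read from the two attachments beforehand.

```python
import difflib


def diff_dumps(failing_text, working_text):
    """Return unified-diff lines going from the working dump to the failing one."""
    return list(
        difflib.unified_diff(
            working_text.splitlines(),
            failing_text.splitlines(),
            fromfile="working",
            tofile="failing",
            lineterm="",
        )
    )
```

Lines prefixed with `+` appear only in the failing dump, lines prefixed with `-` only in the working one, which narrows the search to the shaders that actually differ.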
Comment 7 Nicolas Kaiser 2010-09-09 04:10:53 UTC
(Still) locking up using mesa-git at commit
777f352e6087e3ef05f7a88232f23e4f971bc5a0
Comment 8 Nicolas Kaiser 2010-09-14 09:42:52 UTC
I noticed that I can make some other applications lock up the GPU in a quite similar way, using MESA_GLSL=nopt:

Celestia locks up on start, but not immediately: the Sun gets displayed without a problem, but the GPU locks up as soon as Earth should get rendered.
Oolite also locks up on start, as soon as a spaceship should get rendered.

Currently I'm using mesa 9476efe77ff196993937c3aa2e5bca725ceb0b41 and kernel 2.6.35.4 (with Marek Olšák's 10 seconds patch http://comments.gmane.org/gmane.comp.video.dri.devel/49821 ).

Without MESA_GLSL=nopt both are working without a problem. However, with it both also lock up using mesa
a09a8ec12d76e1fb1583fa99cf9f48246c108d7b and reverting
acd7c21541110d7ae6b9e63647391f65946e5c5d and
6c0ba32fd1466e8c1700acab3003dc1fe1deb337, as in comment #6.
Comment 9 Eric Anholt 2010-09-21 10:29:29 UTC
Removing myself from CC.  This bug is that the radeon driver needs to cleanly reject, at link time, shaders that it can't handle.  (Incidentally, i915 has the same issue, and i965 to some extent.)
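
As a rough illustration of what "cleanly reject at link time" could look like, the sketch below fails the link with a message instead of letting an oversized shader reach the hardware. The `link_program` function and both of its parameters are hypothetical and do not correspond to Mesa's actual interfaces.

```python
def link_program(vs_temps, max_vs_temps):
    """Return (ok, info_log) for a hypothetical link step.

    vs_temps:     temporaries the compiled vertex shader needs
    max_vs_temps: limit the driver can actually provide (assumed known)
    """
    if vs_temps > max_vs_temps:
        return False, (
            f"vertex shader uses {vs_temps} temporaries, "
            f"driver limit is {max_vs_temps}"
        )
    return True, ""
```

In OpenGL terms, the application would then see a failed link through GL_LINK_STATUS and could read the reason from the program info log, rather than triggering a GPU lockup.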
Comment 10 Nicolas Kaiser 2010-12-04 12:09:12 UTC
Still the same. Here's a GPU lockup with r600g:

System environment:
-- system architecture: amd64
-- Linux distribution: Gentoo
-- GPU: RS780G
-- Model: ATI Radeon HD 3200
-- Display connector: VGA
-- xf86-video-ati: f9bbb26dd97254b66de11bb2abd821aa293ecba5
-- xserver: 1.9.2.901
-- mesa: 859106f196ade77f59f8787b071739901cd1a843
-- drm: 8420743301a36dc1316fadf53bf8e1478068400a
-- kernel: 2.6.37-rc4-next-20101203-03967-g43cebba

Dec  4 20:38:23 absol kernel: radeon 0000:01:05.0: GPU lockup CP stall for more than 10000msec
Dec  4 20:38:23 absol kernel: ------------[ cut here ]------------
Dec  4 20:38:23 absol kernel: WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:244 radeon_fence_wait+0x235/0x2d3 [radeon]()
Dec  4 20:38:23 absol kernel: Hardware name: P-M3A3200
Dec  4 20:38:23 absol kernel: GPU lockup (waiting for 0x0001BF3E last fence id 0x0001BF3D)
Dec  4 20:38:23 absol kernel: Modules linked in: ipt_MASQUERADE iptable_nat nf_nat ipt_REJECT xt_limit ipt_ULOG xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter xt_multiport xt_iprange xt_mark ip_tables x_tables dm_mod hid_logitech usbhid radeon ttm drm_kms_helper drm i2c_piix4 i2c_algo_bit cfbcopyarea cfbimgblt cfbfillrect speedtch usbatm [last unloaded: usb_storage]
Dec  4 20:38:23 absol kernel: Pid: 14067, comm: vegastrike Not tainted 2.6.37-rc4-next-20101203-03967-g43cebba #115
Dec  4 20:38:23 absol kernel: Call Trace:
Dec  4 20:38:23 absol kernel: [<ffffffff81033180>] ? warn_slowpath_common+0x78/0x8c
Dec  4 20:38:23 absol kernel: [<ffffffff81033233>] ? warn_slowpath_fmt+0x45/0x4a
Dec  4 20:38:23 absol kernel: [<ffffffffa00d1829>] ? radeon_fence_wait+0x235/0x2d3 [radeon]
Dec  4 20:38:23 absol kernel: [<ffffffff81048b69>] ? autoremove_wake_function+0x0/0x2a
Dec  4 20:38:23 absol kernel: [<ffffffffa00987a8>] ? ttm_bo_wait+0xca/0x171 [ttm]
Dec  4 20:38:23 absol kernel: [<ffffffffa00e3e7a>] ? radeon_gem_wait_idle_ioctl+0x7d/0xe9 [radeon]
Dec  4 20:38:23 absol kernel: [<ffffffffa00470c4>] ? drm_ioctl+0x236/0x2ea [drm]
Dec  4 20:38:23 absol kernel: [<ffffffffa00e3dfd>] ? radeon_gem_wait_idle_ioctl+0x0/0xe9 [radeon]
Dec  4 20:38:23 absol kernel: [<ffffffff8102244d>] ? do_page_fault+0x306/0x33f
Dec  4 20:38:23 absol kernel: [<ffffffff81089dc1>] ? mmap_region+0x3a7/0x4bc
Dec  4 20:38:23 absol kernel: [<ffffffff810abe64>] ? do_vfs_ioctl+0x3f3/0x440
Dec  4 20:38:23 absol kernel: [<ffffffff810abeed>] ? sys_ioctl+0x3c/0x5c
Dec  4 20:38:23 absol kernel: [<ffffffff81001f3b>] ? system_call_fastpath+0x16/0x1b
Dec  4 20:38:23 absol kernel: ---[ end trace 9edad903af395b1b ]---
Dec  4 20:38:23 absol kernel: radeon 0000:01:05.0: GPU softreset 
Dec  4 20:38:23 absol kernel: radeon 0000:01:05.0:   R_008010_GRBM_STATUS=0xA25334E0
Dec  4 20:38:23 absol kernel: radeon 0000:01:05.0:   R_008014_GRBM_STATUS2=0x00000103
Dec  4 20:38:23 absol kernel: radeon 0000:01:05.0:   R_000E50_SRBM_STATUS=0x20000040
Dec  4 20:38:23 absol kernel: radeon 0000:01:05.0:   R_008020_GRBM_SOFT_RESET=0x00007FEE
Dec  4 20:38:23 absol kernel: radeon 0000:01:05.0: R_008020_GRBM_SOFT_RESET=0x00000001
Dec  4 20:38:24 absol kernel: radeon 0000:01:05.0:   R_008010_GRBM_STATUS=0xA0003030
Dec  4 20:38:24 absol kernel: radeon 0000:01:05.0:   R_008014_GRBM_STATUS2=0x00000003
Dec  4 20:38:24 absol kernel: radeon 0000:01:05.0:   R_000E50_SRBM_STATUS=0x20008040
Dec  4 20:38:24 absol kernel: radeon 0000:01:05.0: GPU reset succeed
Dec  4 20:38:24 absol kernel: radeon 0000:01:05.0: WB enabled
Dec  4 20:38:24 absol kernel: [drm] ring test succeeded in 1 usecs
Dec  4 20:38:24 absol kernel: [drm] ib test succeeded in 1 usecs
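
When reading a log like the one above, it can help to pull out the register values logged before and after the soft reset and see which bits changed. The following is a minimal sketch; the helper names are made up, the regular expression is written against the line format shown in this log only, and no meaning is assigned to individual bits.

```python
import re

# Matches e.g. "R_008010_GRBM_STATUS=0xA25334E0" as printed in the log above.
REG_RE = re.compile(r"(R_[0-9A-F]{6}_\w+)=0x([0-9A-Fa-f]{8})")

# Two lines copied from the log in this comment.
SAMPLE = [
    "Dec  4 20:38:23 absol kernel: radeon 0000:01:05.0:   R_008010_GRBM_STATUS=0xA25334E0",
    "Dec  4 20:38:24 absol kernel: radeon 0000:01:05.0:   R_008010_GRBM_STATUS=0xA0003030",
]


def register_snapshots(log_lines):
    """Collect every value logged for each register, in order of appearance."""
    regs = {}
    for line in log_lines:
        m = REG_RE.search(line)
        if m:
            regs.setdefault(m.group(1), []).append(int(m.group(2), 16))
    return regs


def changed_bits(values):
    """XOR of the first and last logged value: set bits are the ones that differ."""
    return values[0] ^ values[-1]


if __name__ == "__main__":
    for name, values in register_snapshots(SAMPLE).items():
        print(f"{name}: changed bits 0x{changed_bits(values):08X}")
```

Feeding the full kernel log through `register_snapshots` gives one list per register, so the pre-reset and post-reset state can be compared across several lockups.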
Comment 11 Nicolas Kaiser 2011-02-17 10:05:58 UTC
Yay! One week ago it was still locking up, but now it's fixed.
No lock-up any more, working with r600g as well as r600c.
Thanks!

System environment:
-- system architecture: amd64
-- Linux distribution: Gentoo
-- GPU: RS780
-- Model: ATI Radeon HD 3200 (780G)
-- Display connector: VGA
-- xf86-video-ati: 6.14.0
-- xserver: 1.9.4
-- mesa: 0adeaf00e6c4592e78cca36c3b365110b83c965d
-- drm: 550fe2ca3b29ad2191eab4fdfbed9ed21e25492d
-- kernel: 2.6.38-rc5
