Bug 1003

Summary:	running two instances of glsnake locks X
Product:	Mesa	Reporter:	Nicholas Miell <nmiell>
Component:	Drivers/DRI/r200	Assignee:	Default DRI bug account <dri-devel>
Status:	RESOLVED FIXED	QA Contact:
Severity:	normal
Priority:	high
Version:	unspecified
Hardware:	x86 (IA32)
OS:	Linux (All)
Whiteboard:
i915 platform:		i915 features:
Bug Depends on:
Bug Blocks:	269
Attachments:	drm_options=debug output of X and glsnake*2

Description Nicholas Miell 2004-08-06 13:07:39 UTC

Running two instances of glsnake (either the version of glsnake that comes with
xscreensaver, or the original version from http://spacepants.org/src/glsnake)
causes an X lockup. The mouse still moves, but no screen updates occur.

Connecting with ssh from another box, I see that X is using 99% of a CPU and
that one of the glsnake instances is still running.

The other instance exited as follows:
ioctl(4, 0x6444, 0)                     = -1 EBUSY (Device or resource busy)
nanosleep({0, 1000}, NULL)              = 0
sched_yield()                           = 0
[... lots more of these repeated here ...]
ioctl(4, 0x6444, 0)                     = -1 EBUSY (Device or resource busy)
nanosleep({0, 1000}, NULL)              = 0
sched_yield()                           = 0
ioctl(4, 0x4008642b, 0x7fbfffee70)      = 0
write(2, "Error: R200 timed out... exiting"..., 33) = 33
exit_group(-1)
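
For anyone reading the trace, the two ioctl request numbers can be split into their fields using the usual Linux request-number layout. The sketch below assumes the common x86 encoding (8-bit command number, 8-bit type, 14-bit size, 2-bit direction); the struct and function names are made up for illustration and do not come from any DRM header. Decoded this way, 0x6444 is type 'd' (0x64), command 0x44, with no argument data, and 0x4008642b is type 'd', command 0x2b, with an 8-byte argument passed to the kernel. These look like a driver-specific radeon call and the generic DRM unlock call, but that mapping is an assumption that should be checked against drm.h and radeon_drm.h.

```c
#include <stdint.h>

/* Hypothetical helper type for illustration; not part of any DRM header. */
struct ioc_fields {
    unsigned dir;   /* 0 = no data, 1 = write (user to kernel), 2 = read */
    unsigned size;  /* size of the argument in bytes */
    unsigned type;  /* "magic" character identifying the subsystem */
    unsigned nr;    /* command number within that subsystem */
};

/* Split a request number into its fields, assuming the common
 * x86 layout: bits 0-7 nr, 8-15 type, 16-29 size, 30-31 direction. */
static struct ioc_fields ioc_decode(uint32_t req)
{
    struct ioc_fields f;
    f.nr   = req & 0xffu;
    f.type = (req >> 8) & 0xffu;
    f.size = (req >> 16) & 0x3fffu;
    f.dir  = (req >> 30) & 0x3u;
    return f;
}
```

Read with this in mind, the trace shows the client polling one argument-less DRM call that keeps returning EBUSY, then issuing a single 8-byte write call before printing the timeout message and exiting.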

Attempting (as root) to attach strace or gdb to X fails.

ATI FireGL 8800 128MB, FC2/AMD64, xorg-x11-6.7.0-5

Comment 1 Nicholas Miell 2004-08-06 13:25:49 UTC

Actually, on further inspection of the strace log from the first instance, it
wasn't so much "still running" as it was "stuck in the kernel in a DRI ioctl".
It received a SIGTERM just fine, though.

On another note, as well as being untraceable, X also ignores SIGKILL.

Comment 2 Roland Scheidegger 2004-08-06 15:15:50 UTC

This is not restricted to glsnake. Multiple simultaneous DRI clients (it basically
doesn't matter which ones) are known to lock up the GPU with the r200 DRI driver.
I have tried some very experimental driver hacks. They help quite a bit, but not
always, so it looks like the real cause is still undetermined.

Comment 3 Andreas Stenglein 2004-08-22 01:23:08 UTC

Could you try again with current Mesa CVS?
Eric Anholt committed a fix for locking issues with r100/r200.
I tried it on r100 and it helped: running glxgears and some instances of ipers at
the same time worked well, with no "data/vertex-migration" between contexts. q3a and
glxgears at the same time seem to be OK now, too.

Comment 4 Bernhard Kaindl 2004-08-24 14:18:24 UTC

Andreas, thanks for your last comment. I'm using an ATI FireGL 8800 (R200)
on Xorg X11 6.8.0 RC2, but games like gl-117 and trackballs lock the X server
hard after just a few seconds (with only one application running).

I've also seen Bug 814 which is about hard locks with R200:

http://freedesktop.org/bugzilla/show_bug.cgi?id=814

So maybe all this could be fixed by this patch? (I think so)

I did a Mesa CVS checkout as described on http://www.mesa3d.org/cvs_access.html
and generated a changelog with cvs2cl from http://www.red-bean.com/cvs2cl/

In it I find two commits:

2004-08-17 22:10  anholt

        * src/mesa/drivers/dri/: r200/r200_ioctl.c, r200/r200_lock.h,
          radeon/radeon_ioctl.c, radeon/radeon_lock.h: Revert the move of
          lost_context setting to UNLOCK_HARDWARE that was done in the last
          commit.  I've been convinced by keithw that it's sufficient, and
          put a note in the code about it.

          Close another race for state in the Clear functions.  I made the
          situation worse in my last commit, but this should fix things.
          Might be a slight performance hit, which could be regained by
          splitting the R*_FIREVERTICES calls in r*Clear up so that the
          EmitState doesn't happen in a separate new cmdbuf.


2004-08-17 03:41  anholt 

        * src/mesa/drivers/dri/: radeon/radeon_context.h,
          radeon/radeon_ioctl.c, radeon/radeon_ioctl.h,
          radeon/radeon_lock.h, radeon/radeon_state_init.c,
          radeon/radeon_swtcl.c, radeon/radeon_tcl.c, r200/r200_cmdbuf.c,
          r200/r200_context.h, r200/r200_ioctl.c, r200/r200_ioctl.h,
          r200/r200_lock.h, r200/r200_state_init.c, r200/r200_swtcl.c,
          r200/r200_tcl.c: Close some races with locking on R100 and R200
          which could manifest as rendering errors on r100 and rendering
          errors and hangs on r200 (same for R100 without OLD_PACKETS).

          If a command buffer filled after some state (EmitState or a
          VBPNTR write) was emitted, the lock was grabbed, the buffer
          flushed, a new buffer prepared, and the lock dropped.  Another
          client could come in, set its own state as part of rendering, and
          when the first client flushed the rendering commands depending on
          the previous state, it got the 2nd client's state.  This is fixed
          by checking for enough space before beginning a set of state
          emits and rendering, and flushing the buffer first if so.  This
          guarantees that the buffer won't wrap.

          Also, move the "lost_context = 1" from the end of cmdbuf flushing
          to UNLOCK_HARDWARE for clarity (at a minimum) that any time the
          lock is dropped, state may get overwritten.  We don't have enough
          information at the point of the LOCK_HARDWARE to reset our state
          to the last UNLOCK_HARDWARE point in the case that we did lose
          our context, but saving the information to rebuild that state may
          be a useful optimization (ipers data suggests up to 5%).
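
The fix described in the 03:41 log entry (reserve enough room before starting a set of state emits and rendering, and flush first if there is not enough) can be sketched roughly as below. This is a toy model and not the Mesa code: the buffer size, the struct, and all function names are invented for illustration, and the "flush" only stands in for taking the lock, submitting the buffer, and dropping the lock.

```c
#include <assert.h>
#include <stddef.h>

#define CMDBUF_CAPACITY 16   /* toy size, in "dwords" */

/* Hypothetical toy command buffer; none of these names come from Mesa. */
struct cmdbuf {
    int    data[CMDBUF_CAPACITY];
    size_t used;
    int    flushes;   /* how many times the buffer was submitted */
};

/* Stand-in for: take the hardware lock, submit, drop the lock.
 * After this point another client may overwrite hardware state. */
static void cmdbuf_flush(struct cmdbuf *b)
{
    b->used = 0;
    b->flushes++;
}

/* Reserve room for a whole state+render sequence up front,
 * flushing first if the sequence would not fit. */
static void cmdbuf_ensure_space(struct cmdbuf *b, size_t needed)
{
    assert(needed <= CMDBUF_CAPACITY);
    if (b->used + needed > CMDBUF_CAPACITY)
        cmdbuf_flush(b);
}

/* Append one dword; never flushes on its own. */
static void cmdbuf_emit(struct cmdbuf *b, int dword)
{
    assert(b->used < CMDBUF_CAPACITY);
    b->data[b->used++] = dword;
}

/* Emit state and the rendering that depends on it as one unit.
 * Returns the number of flushes between the first state dword and
 * the last render dword. Because the space is reserved beforehand
 * and cmdbuf_emit never flushes, this is always 0; it is returned
 * only to make that invariant visible. */
static int emit_state_and_render(struct cmdbuf *b,
                                 size_t n_state, size_t n_render)
{
    size_t i;
    int flushes_at_start;

    cmdbuf_ensure_space(b, n_state + n_render);
    flushes_at_start = b->flushes;
    for (i = 0; i < n_state; i++)
        cmdbuf_emit(b, 0x5747);   /* placeholder state dword */
    for (i = 0; i < n_render; i++)
        cmdbuf_emit(b, 0x0d0a);   /* placeholder render dword */
    return b->flushes - flushes_at_start;
}
```

The point of the pattern is that the only place the lock can be dropped is before the first state dword, so the rendering commands are always submitted in the same buffer as the state they depend on.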

I've found that these two patches are included in X.Org CVS release 04-08-22.

Versions before this worked somewhat, but this one locks up completely.

One of the changes applied to the tree lately even made things
worse. However, I also rebooted into a newer kernel with new DRM modules
built from Xorg.

Comment 5 Nicholas Miell 2004-09-04 18:09:57 UTC

Created attachment 822
drm_options=debug output of X and glsnake*2

This is the DRM debug output from loading radeon.ko, starting X, and running
two instances of glsnake until the crash.

Comment 6 Dave Airlie 2004-09-04 19:06:26 UTC

looks like the chip locked up...

Comment 7 Dieter Nützel 2004-09-05 09:24:51 UTC

I've stopped after 10 parallel instances...;-) 
 
System: 
dual Athlon MP 1900+ 
MSI K7D Master-L (aka AMD 768MPX) 
1 GB DDR266 RAM, CL2 (2x 512 MB) 
ATI Radeon 8500 QL, AGP (r200), 64 MB DDR RAM 
all U160/320 ;-) 
 
SuSE 9.0 
* with latest 2.6.5-104 (official released 9.1) kernel 
* glibc-2.3.3-73 (NPTL) 
based on SuSE's 9.0 XFree86 4.3.0.1-46 
 
DRM CVS 
DRI CVS 
Mesa CVS 
+ r200-maybe-flush-less-3.diff 
+ DRI-TLS-01.patch & Mesa-TLS-01.patch 
+ r200_vertex.patch (test.patch by Philipp Klaus Krause) 
 
SOURCE/glsnake-0.8.9>
[9]    Done                          ./glsnake
SOURCE/glsnake-0.8.9>
[3]    Done                          ./glsnake
SOURCE/glsnake-0.8.9>
[8]    Done                          ./glsnake
SOURCE/glsnake-0.8.9>
[7]    Done                          ./glsnake
SOURCE/glsnake-0.8.9>
[2]    Done                          ./glsnake
SOURCE/glsnake-0.8.9>
[5]    Done                          ./glsnake
SOURCE/glsnake-0.8.9>
[10]   Done                          ./glsnake
SOURCE/glsnake-0.8.9>
[4]    Done                          ./glsnake
SOURCE/glsnake-0.8.9>
[1]    Done                          ./glsnake
SOURCE/glsnake-0.8.9>
[6]    Done                          ./glsnake
 
Cheers, 
    Dieter
