Immediately after opening its window, the SecondLife client hangs and freezes the X server (the mouse keeps moving, though). Killing the process from a remote session makes the X server usable again. To reproduce, just download the SecondLife client for Linux and run it on a system with a Radeon r300 card. My Device section does not contain any fancy options. Maybe relevant: this is an x86_64 system.
Ah, and other GL applications generally work, including Quake3.
Which version of Mesa are you using? There was a bug which caused applications that don't call glViewport initially to start with a blank window, which was only fixed recently in git. If this is the problem, you can work around it by changing the window size.
I realized I wasn't running the latest git snapshot of the 32-bit version of Mesa. Updating improved the situation somewhat: now SecondLife hangs just after drawing some text on screen; before, the text was unreadable. I tried attaching gdb to the process, but this managed to hang the system completely (no ping, etc.). With the above evidence, is it correct to say that we have a locking bug in the DRM module rather than in userspace? I could retry with the latest DRM modules if you think it could help.
(In reply to comment #3) > I tried attaching gdb to the process, but this manage to > hang the system completely (no ping, etc.). Are you trying this via ssh? You can't do it from the same X display because gdb may interrupt the process while it's holding the DRM hardware lock, so the X server can't take it to handle the interaction with gdb. > I could retry with the latest drm modules if you think it > could help. Probably can't hurt.
(In reply to comment #4) > Are you trying this via ssh? You can't do it from the same X display because > gdb may interrupt the process while it's holding the DRM hardware lock, so the > X server can't take it to handle the interaction with gdb. Yes, I'm on ssh. Strangely, gdb hangs as soon as it tries to attach to the SecondLife client. And the same happens if I try to attach to the X server. With other processes, gdb works fine. > > I could retry with the latest drm modules if you think it > > could help. > > Probably can't hurt. Same result.
Since gdb is unworkable, I chose another route and analyzed strace output. It's pretty clear what's going on:

futex(0x9b29a38, FUTEX_WAKE, 2147483647) = 0
ioctl(13, 0x4008642a, 0xffeda0b4) = 0
ioctl(13, 0x40106450, 0xffed9a8c) = 0
ioctl(13, 0xc018644e, 0xffed9a90) = -1 EAGAIN (Resource temporarily unavailable)
nanosleep({0, 1000}, NULL) = 0
[last two lines repeated forever]

Decoding the ioctl numbers (I wonder why strace doesn't do it already) leads us to this:

- 2a -> DRM_IOCTL_LOCK
- 50 -> DRM_RADEON_CMDBUF
- 4e -> DRM_RADEON_TEXTURE

I checked, and all sizes match the expected structures (32-bit versions). So we should end up in radeon_ioc32.c:compat_radeon_cp_texture(), then radeon_cp_texture(), and finally radeon_cp_dispatch_texture(), where the fun starts. The EAGAIN return code most probably comes from here:

	buf = radeon_freelist_get(dev);
	[...]
	if (!buf) {
		DRM_DEBUG("radeon_cp_dispatch_texture: EAGAIN\n");
		if (DRM_COPY_TO_USER(tex->image, image, sizeof(*image)))
			return DRM_ERR(EFAULT);
		return DRM_ERR(EAGAIN);
	}

Now the question is: how can radeon_freelist_get() return NULL? And another good question is: why does the client need to retry allocating a texture in an infinite polling loop?
(In reply to comment #6) > Since gdb is unworkable, I chose another route and analyzed strace output. That's rarely very useful for this kind of problem, unfortunately. > Now the question is: how can radeon_freelist_get() return NULL? It means there are currently no indirect buffers available. This can happen intermittently during normal operation, but when it persists, that probably means the card has locked up. Unfortunately, the symptoms of a lockup don't really say anything about its cause. Can you still kill the secondlife process and get X working again? > And another good question is: why does the client need to > retry allocating a texture in an infinite polling loop? That's not quite what happens. The ioctl transfers texture data from userspace memory to video RAM using indirect buffers. As an indirect buffer is only 64KB, it usually has to split up the data across several buffers, and when all of them are in use, it returns to userspace so it can later pick up where it left off.
(In reply to comment #7) > Can you still kill the secondlife process and get X back working? Yes, but I need to use kill -9. The process does not even react to SIGSTOP/SIGCONT (^Z in the shell). > That's not quite what happens. The ioctl transfers texture data from userspace > memory to video RAM using indirect buffers. As an indirect buffer is only 64KB, > it usually has to split up the data across several buffers, and when all of > them are in use, it returns to userspace so it can later pick up where it left > off. But there's that 1ms pause between calls to ioctl(). And the arguments are always the same. Doesn't that mean it's a retry loop?

ioctl(13, 0xc018644e, 0xffed9a90) = -1 EAGAIN (Resource temporarily unavailable)
nanosleep({0, 1000}, NULL) = 0
ioctl(13, 0xc018644e, 0xffed9a90) = -1 EAGAIN (Resource temporarily unavailable)
nanosleep({0, 1000}, NULL) = 0
ioctl(13, 0xc018644e, 0xffed9a90) = -1 EAGAIN (Resource temporarily unavailable)
nanosleep({0, 1000}, NULL) = 0
...
> It means there are currently no indirect buffers available. This can happen
> intermittently during normal operation, but when it persists that probably
> means the card has locked up. Unfortunately, the symptoms of a lockup don't
> really say anything about its cause.

The last DRM commands I see in the strace output are:

ioctl(13, 0x4008642a, 0xffeda0b4) = 0
gettimeofday({1176052195, 567600}, NULL) = 0
ioctl(13, 0x40106450, 0xffeda04c) = 0
ioctl(13, 0xc0086451, 0xffeda06c) = 0
ioctl(13, 0xc010643a, 0xffeda06c) = 0
ioctl(13, 0x6447, 0) = 0

Which translates to:

DRM_IOCTL_LOCK
DRM_RADEON_CMDBUF
DRM_RADEON_GETPARAM
DRM_IOCTL_WAIT_VBLANK
DRM_RADEON_SWAP

There's no UNLOCK! Is it implicit in the swap operation?
First of all, does this also happen with a 32 bit kernel? (In reply to comment #8) > > > Can you still kill the secondlife process and get X back working? > > Yes, I need to use kill -9. The process does not even react to > SIGSTOP/SIGCONT (^Z in the shell). Probably because it's holding the hardware lock. > But there's that 1ms pause between calls to ioctl(). That's up to the Mesa driver. I'm not sure the sleep is really useful, I think it should at least drop the hardware lock in addition to sleeping, if not instead of it. I can whip up a patch for that. > And the arguments are always the same. Doesn't it mean it's a retry loop? Sure, didn't (mean to) say it isn't.
Created attachment 9520 [details] [review] Don't sleep between texture upload ioctls, drop hardware lock instead Does this patch help for interrupting the process, attaching gdb to it, or possibly even the hang?
(In reply to comment #9) > There's no UNLOCK! Is it implicit in the swap operation? The locking fastpaths work without ioctls; the ioctls are only needed when a client is trying to grab the lock while another one is holding it.
I'm not sure it's relevant: there are no other GL clients running at the same time. But the SecondLife client spawns multiple threads. Maybe it's hitting a multithreading bug in libGL.
(In reply to comment #11) > Created an attachment (id=9520) [details] > Don't sleep between texture upload ioctls, drop hardware lock instead > > Does this patch help for interrupting the process, attaching gdb to it, or > possibly even the hang? I still need to kill it with -9, but: I can now attach gdb and the X server doesn't hang anymore! I enabled the diagnostic output and I see an infinite succession of "DRM_RADEON_TEXTURE: again!".
(In reply to comment #13) > But the SecondLife client spawns multiple threads. > Maybe it's hitting a multithreading bug in libGL. And this is the strace output with threads included (-f):

[pid 29333] write(3, "2007-04-09T12:35:05Z INFO: login"..., 55) = 55
[pid 29333] gettimeofday({1176122105, 902851}, NULL) = 0
[pid 29333] gettimeofday({1176122105, 902888}, NULL) = 0
[pid 29333] brk(0xcc51000) = 0xcc51000
[pid 29333] brk(0xcc75000) = 0xcc75000
[pid 29333] futex(0x9b29a38, FUTEX_WAKE, 2147483647) = 0
[pid 29333] ioctl(13, 0x4008642a, 0xff94ab24) = 0
[pid 29333] ioctl(13, 0x40106450, 0xff94a4fc) = 0
[pid 29333] ioctl(13, 0xc018644e <unfinished ...>
[pid 29340] <... nanosleep resumed> NULL) = 0
[pid 29340] gettimeofday({1176122105, 913238}, NULL) = 0
[pid 29340] gettimeofday({1176122105, 913269}, NULL) = 0
[pid 29340] nanosleep({0, 0}, NULL) = 0
[pid 29340] futex(0xc80b140, FUTEX_WAKE, 1) = 0
[pid 29340] nanosleep({0, 10000000}, <unfinished ...>
[pid 29333] <... ioctl resumed> , 0xff94a500) = -1 EAGAIN (Resource temporarily unavailable)
[pid 29333] write(2, "DRM_RADEON_TEXTURE: again!\n", 28DRM_RADEON_TEXTURE: again!

As you can see, two threads are fiddling with GL at the same time. Is this supposed to work?
(In reply to comment #15) > As you can see, two threads are fiddling with GL at the same time. Actually, I only see one thread dealing with the DRM file descriptor. Am I missing something? A couple of random ideas to try and find out more about what's going on: * Try returning 0 instead of -EAGAIN or even immediately in radeon_cp_dispatch_texture(). This will cause textures to be corrupted, but does it avoid the hang? * Try adding debugging output in the DRM and/or Mesa code to see how it's (not) advancing on the texture upload.
I enabled RADEON_DEBUG=allmsg and these are the last lines of the output before hanging:

r300Enable( GL_TEXTURE_2D = GL_TRUE )
r300Enable( GL_COLOR_MATERIAL = GL_TRUE )
r300Enable( GL_DITHER = GL_FALSE )
r300Enable( GL_DEPTH_TEST = GL_TRUE )
r300Clear
Reemit state after flush (from r300EmitClearState)
r300EmitState
Begin reemit state
Begin dirty state
r300ClearBuffer: back buffer (4,52 800x600)
r300Enable( GL_BLEND = GL_TRUE )
r300Enable( GL_ALPHA_TEST = GL_TRUE )
r300Enable( GL_DEPTH_TEST = GL_FALSE )
r300Enable( GL_TEXTURE_2D = GL_FALSE )
r300_run_tcl_render
r300_run_vb_render
r300IsGartMemory( 0xb352140 ) : 0
emit_vector count 4 size 2 stride 8
r300AllocDmaRegion 32
emit_vec8 count 4 stride 8
r300IsGartMemory( 0xb3198f8 ) : 0
emit_vector count 4 size 4 stride 0
r300AllocDmaRegion 16
emit_vec16 count 1 stride 0
mtu=8
TX_ENABLE: 00000000
last_hw_tmu=-1
r300EmitState
Begin dirty state
r300EmitAOS: nr=2, ofs=0x00000000
r300ReleaseDmaRegion from r300ReleaseArrays
r300ReleaseDmaRegion from r300ReleaseArrays
r300ReleaseDmaRegion from r300ReleaseArrays
r300Enable( GL_TEXTURE_2D = GL_TRUE )
r300BindTexture( 0xca4e808 ) unit=0
r300_run_tcl_render
pc=3*************************************
Mesa program:
-------------
# Fragment Program/Shader
  0: TXP TEMP[0] (cond???..xxxx), INPUT[4], texture[0], 2D;
  1: MUL OUTPUT[0].xyz (cond???..xxxx), TEMP[0], INPUT[1];
  2: MOV OUTPUT[0].w (cond???..xxxx), INPUT[1];
  3: END
Hardware program
----------------
NODE 0: alu_offset: 0, tex_offset: 0, alu_end: 0, tex_end: 0
  TEX:
    TXP t2, t0, texture[0] (00018080)
  0: xyz: t2 t1 c0 -> o0.xyz (1c020042) w: t1 c0 c0 -> o0.w (01020801)
     xyz: t2.xyz t1.xyz 0.0 op: 00050200
     w: t1.w 1.0 0.0 op: 00040889
r300_run_vb_render
r300IsGartMemory( 0xb352140 ) : 0
emit_vector count 4 size 2 stride 16
r300AllocDmaRegion 32
emit_vec8 count 4 stride 16
r300IsGartMemory( 0xb3198f8 ) : 0
emit_vector count 4 size 4 stride 0
r300AllocDmaRegion 16
emit_vec16 count 1 stride 0
r300IsGartMemory( 0xb352148 ) : 0
emit_vector count 4 size 2 stride 16
r300AllocDmaRegion 32
emit_vec8 count 4 stride 16
r300Flush
r300FlushCmdBufLocked
from r300Flush - 1 cliprects
r300UploadTexImages( 0xb318be0, 0xca4e808 ) sz=4096 lvls=0-0
uploadSubImage( 0xc9328c0, 0xca4e808 ) level/width/height/face = 0/32/32/0
uploadSubImage: image data is in normal memory
upload image: 32,32 at 0,0
upload blit: 32,32 at 0,0
blit ofs: 0xe3866000 level: 0/0
mtu=8
Activating texture unit 0
TX_ENABLE: 00000001
last_hw_tmu=0
Reemit state after flush (from r300_run_vb_render)
r300EmitState
Begin reemit state
Begin dirty state
r300EmitState
r300EmitAOS: nr=3, ofs=0x00000000
r300ReleaseDmaRegion from r300ReleaseArrays
r300ReleaseDmaRegion from r300ReleaseArrays
r300ReleaseDmaRegion from r300ReleaseArrays
r300ReleaseDmaRegion from r300ReleaseArrays
r300Enable( GL_ALPHA_TEST = GL_FALSE )
r300BindTexture( 0xc591c78 ) unit=0
r300_run_tcl_render
r300_run_vb_render
r300IsGartMemory( 0xb352140 ) : 0
emit_vector count 4 size 2 stride 16
r300AllocDmaRegion 32
emit_vec8 count 4 stride 16
r300IsGartMemory( 0xb3198f8 ) : 0
emit_vector count 4 size 4 stride 0
r300AllocDmaRegion 16
emit_vec16 count 1 stride 0
r300IsGartMemory( 0xb352148 ) : 0
emit_vector count 4 size 2 stride 16
r300AllocDmaRegion 32
emit_vec8 count 4 stride 16
r300Flush
r300FlushCmdBufLocked
from r300Flush - 1 cliprects
r300UploadTexImages( 0xb318be0, 0xc591c78 ) sz=4194304 lvls=0-0
uploadSubImage( 0xc6d5be8, 0xc591c78 ) level/width/height/face = 0/1024/1024/0
uploadSubImage: image data is in normal memory
upload image: 1024,1024 at 0,0
upload blit: 1024,1024 at 0,0
blit ofs: 0xe3867000 level: 0/0
DRM_RADEON_TEXTURE: again!
DRM_RADEON_TEXTURE: again!
DRM_RADEON_TEXTURE: again!
[...]
(In reply to comment #16) > (In reply to comment #15) > > As you can see, two threads are fiddling with GL at the same time. > > Actually, I only see one thread dealing with the DRM file descriptor. Am I > missing something? No, you're right... I'm no longer sure I've actually seen it... will try harder. > * Try returning 0 instead of -EAGAIN or even immediately in > radeon_cp_dispatch_texture(). This will cause textures to be corrupted, but > does it avoid the hang? I added a retry count in the while loop, and now the client exits with an error. The log looks like this:

uploadSubImage( 0xdfb09828, 0xdb1b2808 ) level/width/height/face = 0/1024/1024/0
uploadSubImage: image data is in normal memory
upload image: 1024,1024 at 0,0
upload blit: 1024,1024 at 0,0
blit ofs: 0xe3866000 level: 0/0
DRM_RADEON_TEXTURE: again!
DRM_RADEON_TEXTURE: again!
DRM_RADEON_TEXTURE: again!
DRM_RADEON_TEXTURE: return = -11
   offset=0xe3866000
   image width=1024 height=1024
   blit width=1024 height=1024
   data=0xd9fff200

BTW, isn't 1024x1024 too big for a texture??? > * Try adding debugging output in the DRM and/or Mesa code to see how it's (not) advancing on the texture upload. Is the log attached to the previous comment useful? Or do you need me to add a printf in a more specific place?
By the way, can anybody try running the SecondLife client on their systems to see if the bug can be reproduced? It's here: http://secondlife.com/community/linux-alpha.php
(In reply to comment #18) > I added a retry count in the while loop, and now the client exits with an > error. Not what I asked for... > BTW, isn't 1024x1024 too big for a texture??? No, see glxinfo -l|grep MAX_TEXTURE_SIZE . Does the Mesa texdown demo also hang with 1024x1024? > Is the log attached to the previous comment useful? Not really. > Or do you need me to add a printf in a more specific place? Indeed. E.g., do the DRM and client side have a consistent view of the ioctl arguments over the repeated calls? Does the DRM side return -EAGAIN immediately in the end or does it process some buffers before? Does it make any progress before returning -EAGAIN for the first time? ...
(In reply to comment #19) > By the way, can anybody try running the SecondLife client on > their systems to see if the bug can be reproduced? I can't reproduce it on my PowerBook. Can you try a 32 bit kernel?
(In reply to comment #21) > I can't reproduce it on my PowerBook. Can you try a 32 bit kernel? Sorry, I can't. The userland is 64bit. Any other difference? Do you have any fancy option in xorg.conf? Is your card AGP or PCI-E? Mine is PCI-E. I'm using today's versions of dri, mesa and xserver.
I can get lockups with R300 and my own code, too. It seems to be similar to your lockup, but it's just when the buffers run out. I think the no-usleep patch should be committed as it helps to get gdb working. I also wonder if it's worth looking for places where we can get into an infinite loop and adding a timeout of X iterations... At least Second Life can reliably reproduce the lockup; my code doesn't seem to reproduce it 100% of the time. It does seem to happen quicker at higher resolutions though; probably because the card must work (much) harder then.
(In reply to comment #23) > I think the no-usleep patch should be committed as it helps to get gdb > working. Unfortunately, I realized in the meantime it doesn't work in general. The world can literally change between dropping the lock and taking it again - in particular, the video RAM reserved for the texture being uploaded can be reclaimed by another client. It's really just a bug that a texture can't be uploaded in finite time. Now, if we could get answers to the remaining open questions... > I also wonder if it's worth looking for places where we can get into an > infinite loop and adding a timeout of X iterations... Maybe, but note that as killing the process resumes normal operation, this is not really a lockup but more like a deadlock.
I really think we need some solid way to debug lockups; they are really annoying and hard to debug right now. I think Nicolai Haehnle has done some patches for this. I'm not sure if they work or not (they didn't work for my lockups) but maybe we can work out some way to reliably debug the lockups. It would be excellent if we could even narrow it down to the specific buffer that caused the lockup. I think it should be possible to do this. Maybe we should throw some ideas around.
The following patch fixed SecondLife for me by giving up after a few retries. The client uploads this 1024x1024 texture several times per second (but not every frame, I think). So to get a decent frame rate I actually lowered the retry count to 1 and removed the usleep() in my copy of Mesa. Despite missing such a big texture, the game looks fine except for a tiny line of garbled pixels over the button bar. Of course this isn't a suggested fix. Maybe the retry thing should go in for robustness, but the real bug is obviously somewhere else.

diff --git a/src/mesa/drivers/dri/r300/r300_texmem.c b/src/mesa/drivers/dri/r300/r300_texmem.c
index c527677..241ec1f 100644
--- a/src/mesa/drivers/dri/r300/r300_texmem.c
+++ b/src/mesa/drivers/dri/r300/r300_texmem.c
@@ -43,7 +43,6 @@ SOFTWARE.
 #include "colormac.h"
 #include "macros.h"
 #include "simple_list.h"
-#include "radeon_reg.h"	/* gets definition for usleep */
 #include "r300_context.h"
 #include "r300_state.h"
 #include "r300_cmdbuf.h"
@@ -464,18 +463,22 @@ static void uploadSubImage( r300ContextPtr rmesa, r300TexObjPtr t,
 	}
 
 	LOCK_HARDWARE( &rmesa->radeon );
+	int retries = 3;
 	do {
 		ret = drmCommandWriteRead( rmesa->radeon.dri.fd, DRM_RADEON_TEXTURE,
 					   &tex, sizeof(drm_radeon_texture_t) );
 		if (ret) {
-			if (RADEON_DEBUG & DEBUG_IOCTL)
-				fprintf(stderr, "DRM_RADEON_TEXTURE: again!\n");
-			usleep(1);
+			if (RADEON_DEBUG & DEBUG_IOCTL)
+				fprintf(stderr, "DRM_RADEON_TEXTURE: again!\n");
+			usleep(1);
 		}
-	} while ( ret == -EAGAIN );
+	} while (ret == -EAGAIN && --retries);
 	UNLOCK_HARDWARE( &rmesa->radeon );
 
+	if (ret == -EAGAIN)
+		return;
+
 	if ( ret ) {
 		fprintf( stderr, "DRM_RADEON_TEXTURE: return = %d\n", ret );
 		fprintf( stderr, "   offset=0x%08x\n",
(In reply to comment #26) > Maybe the retry thing should go in for robustness, but the real bug is > obviously somewhere else. Right, and even retries should only be counted when they're 'unsuccessful', i.e. when the ioctl arguments don't change significantly or at all. A 32 bit 2048x2048 texture may legitimately take 8 or more calls. PS: Please attach files instead of pasting content whenever reasonably possible to prevent clutter.
Looks like compat_radeon_cp_texture() is simply missing the code to update the 32 bit ioctl parameters before returning. So probably this will happen with any 32 bit app on a 64 bit kernel whenever radeon_cp_dispatch_texture() returns -EAGAIN.
Can you reproduce it with current master or radeon-rewrite branch of mesa?
(In reply to comment #29) > Can you reproduce it with current master or radeon-rewrite branch of mesa? Sorry, I don't have access to a machine equipped with an r300 at this time.
(In reply to comment #29) > Can you reproduce it with current master or radeon-rewrite branch of mesa? Mesa doesn't matter for this... Looking at compat_radeon_cp_texture() in drivers/gpu/drm/radeon/radeon_ioc32.c in a recent kernel tree, there still isn't anything after the drm_ioctl() call, so this still hasn't been fixed.
Mass version move, cvs -> git
The classic r300 driver has been abandoned long ago. It was replaced by the Gallium driver r300g. If you have issues with r300g please file a new bug report with component Drivers/Gallium/r300 Thanks.