Bug 9528

Summary: client hangs when using xcb-enabled libX11 locally
Product: XCB
Component: Library
Status: RESOLVED FIXED
Severity: major
Priority: high
Version: 1.0
Hardware: x86 (IA32)
OS: Linux (All)
Reporter: Arkadiusz Miskiewicz <arekm>
Assignee: Jamey Sharp <jamey>
QA Contact: xcb mailing list dummy <xcb>
CC: joe, rafael
URL: http://lists.freedesktop.org/archives/xcb/2007-August/002961.html
Attachments:
  8283: copy of http://rafb.net/p/T6Hh8b60.html
  53778: Just removed the code which caused problems and everything started working

Description Arkadiusz Miskiewicz 2007-01-03 08:24:32 UTC
Copy from mailing list post:

I was tracking down a nasty bug in the cinelerra.org software: the cinelerra GUI
hangs when running locally. When I ssh to localhost and run cinelerra via X11
forwarding, the problem disappears.

Tracking it down reveals that the bug appears only if libX11 is compiled
with xcb enabled.

After rebuilding libX11 with xcb disabled, the problem disappears.

Problem is described here:
http://bugs.cinelerra.org/show_bug.cgi?id=383
the interesting gdb backtrace is here:
http://rafb.net/p/T6Hh8b60.html

I'm using PLD/Linux with the latest software available (all of xorg is at the latest releases).
The other reporter (see cinelerra bugzilla) was using Arch/Linux.
Comment 1 Arkadiusz Miskiewicz 2007-01-03 08:25:53 UTC
Created attachment 8283 [details]
copy of http://rafb.net/p/T6Hh8b60.html

Copy of backtrace from pastebin like site.
Comment 2 Arkadiusz Miskiewicz 2007-11-04 12:50:01 UTC
No longer happens for me.

xorg-xserver-server-1.4-5.i686
xorg-driver-video-ati-6.7.195-2.i686
Mesa-libGL-7.0.1-2.i686
libxcb-1.0-4.i686
xorg-lib-libX11-1.1.3-1.i686

cinelerra is in the same version as before.
Comment 3 Jamey Sharp 2007-11-04 14:29:44 UTC
Per the original poster's report, I'm resolving this bug as not reproducible (any more). Thanks for the followup!

Calling it "INVALID" seems harsh and wrong, but that's the closest resolution I can find in Bugzilla.
Comment 4 Joe Drew 2008-01-29 13:23:39 UTC
This is very definitely still a problem for multiple people on many different platforms. I personally have been trying to track down a bug in Houdini (www.sidefx.com) which causes hangs and deadlocks, and it looks to be caused by this xlib-xcb hang. Simply upgrading my xlib from the regular version to the xcb version started the hanging.

At the URL I posted there is a bunch more analysis about this bug.
Comment 5 Joe Drew 2008-01-29 13:36:46 UTC
Some further comments: Not using threads with X causes this problem to disappear (this is probably unsurprising). Houdini, the 3D software package available for free download at www.sidefx.com, triggers this bug. You can vary Houdini's use of threads at runtime by using the environment variable HOUDINI_ENABLE_LINUX_THREADED_UI.

I am able to help with all manner of debugging, but I just don't know enough about libx11 and XCB to do this all myself.
Comment 6 Josh Triplett 2008-03-15 22:12:27 UTC
Jamey and I just announced a set of changes to XCB and Xlib/XCB which, among other things, should address all the outstanding synchronization problems that we know of.  Could anyone experiencing this bug please build XCB and Xlib with the patches found at http://lists.freedesktop.org/archives/xcb/2008-March/003347.html and retest?
Comment 7 Rafael Diniz 2008-05-02 13:24:01 UTC
Are the patches published here:
http://lists.freedesktop.org/archives/xcb/2008-March/003347.html
already applied?

Comment 8 James Braid 2008-11-27 10:12:32 UTC
(In reply to comment #6)
> Jamey and I just announced a set of changes to XCB and Xlib/XCB which, among
> other things, should address all the outstanding synchronization problems that
> we know of.  Could anyone experiencing this bug please build XCB and Xlib with
> the patches found at
> http://lists.freedesktop.org/archives/xcb/2008-March/003347.html and retest?

We still hit this even with the latest xcb and xlib releases (which include the patches as far as I can tell). The only fix I've found is disabling xcb support in xlib. If there are any other patches available, I can test them out, no problem. Let me know if you need any more information.

We primarily hit this with Shake (commercial 2D compositing package) and Houdini as mentioned above.
Comment 9 Jamey Sharp 2009-10-09 12:32:22 UTC
(In reply to comment #8)
> (In reply to comment #6)
> > Jamey and I just announced a set of changes to XCB and Xlib/XCB which, among
> > other things, should address all the outstanding synchronization problems that
> > we know of.
> 
> We still hit this even with the latest xcb and xlib releases (which include the
> patches as far as I can tell). The only fix I've found is disabling xcb support
> in xlib. If there are any other patches available, I can test them out, no
> problem. Let me know if you need any more information.
> 
> We primarily hit this with Shake (commercial 2D compositing package) and
> Houdini as mentioned above. 

Although you believe you had the socket-handoff version when you last tested, I can't help hoping the socket-handoff work fixed the problem for you, because I don't know of any hangs we haven't fixed already. So could you confirm that you can still reproduce the bug?

I tried to download Houdini to test it myself, but a 200MB download, all of it closed-source, is just too much.

So if you can still reproduce the problem, would you start with "thread apply all bt full" in gdb, to show us what was going on when it hung? Preferably, have debug symbols for libX11 and libxcb installed first.
Comment 10 Jamey Sharp 2011-03-14 17:53:20 UTC
Since I last asked for testing (and nobody responded), we've released a much-improved version of libX11 with regards to multi-threaded applications. There's one known issue (#30450), but I don't know whether it affects this particular application. Please re-test with Xlib 1.3.4/1.4.0 or newer.
Comment 11 Arkadiusz Miskiewicz 2011-03-14 22:33:58 UTC
I'm using latest releases of xserver, libs etc from freedesktop and cinelerra works fine.
Comment 12 Jamey Sharp 2011-03-15 15:39:22 UTC
Well, let's try closing this bug again and see if anybody who can reproduce it with current library versions shows up again. Thanks Arkadiusz.

If you do want to re-open this bug: Please make sure you attach the output of "thread apply all bt" from gdb, like Arkadiusz did. There's some general documentation on doing that at http://wiki.debian.org/HowToGetABacktrace .
Comment 13 ThomasBub 2011-11-21 05:56:26 UTC
(In reply to comment #10)
> Since I last asked for testing (and nobody responded), we've released a
> much-improved version of libX11 with regards to multi-threaded applications.
> There's one known issue (#30450), but I don't know whether it affects this
> particular application. Please re-test with Xlib 1.3.4/1.4.0 or newer.

How and where can I obtain this improved version of libX11?
We are using Suse Linux Enterprise Server 11 SP1.
Any help/hint welcome on how to get this integrated.
Comment 14 ThomasBub 2011-11-22 07:13:12 UTC
Tried it out today under Suse Linux Enterprise Server (SLES) 11 SP1.
Installed

  xcb-1.7
  X11-1.4.4

which needed

  xcb-proto-1.6
  xproto-7.0.22

Same effect as with plain SLES11 SP1.
The glthreads example from:

http://www.opensource.apple.com/source/X11server/X11server-85/mesa/Mesa-7.2/progs/xdemos/glthreads.c

still locks up when not using the -l option.
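
Assuming -l is the switch that turns on glthreads' application-side locking, the idea is that the program itself serializes every call on the shared display connection with one mutex, instead of relying on Xlib's internal locking. A minimal sketch of that idea in plain C follows; fake_display, draw_worker and run_locked_workers are hypothetical names, and a counter stands in for real Xlib/GLX calls.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>

#define NUM_THREADS      4
#define CALLS_PER_THREAD 10000

/* Hypothetical stand-in for a shared Display: just a request counter. */
struct fake_display {
    pthread_mutex_t app_lock;  /* the application's own lock */
    unsigned long   requests;  /* must only be touched with app_lock held */
};

/* Each worker wraps every "Xlib call" in the application lock. */
static void *draw_worker(void *arg)
{
    struct fake_display *dpy = arg;
    int i;

    for (i = 0; i < CALLS_PER_THREAD; i++) {
        pthread_mutex_lock(&dpy->app_lock);
        dpy->requests++;  /* stands in for a call on the shared connection */
        pthread_mutex_unlock(&dpy->app_lock);
    }
    return NULL;
}

/* Runs the workers and returns the total number of serialized calls. */
unsigned long run_locked_workers(void)
{
    struct fake_display dpy;
    pthread_t threads[NUM_THREADS];
    int i, rc;

    dpy.requests = 0;
    rc = pthread_mutex_init(&dpy.app_lock, NULL);
    assert(rc == 0);

    for (i = 0; i < NUM_THREADS; i++) {
        rc = pthread_create(&threads[i], NULL, draw_worker, &dpy);
        assert(rc == 0);
    }
    for (i = 0; i < NUM_THREADS; i++) {
        rc = pthread_join(threads[i], NULL);
        assert(rc == 0);
    }

    pthread_mutex_destroy(&dpy.app_lock);
    (void)rc;  /* silence "unused" warnings when asserts are compiled out */
    return dpy.requests;
}
```

run_locked_workers() returns 40000 (4 threads x 10000 calls), since no increment is lost while the lock is held. With such a lock in place the threads never enter the library concurrently, which would be consistent with the lockup showing up only without -l.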
Comment 15 Uli Schlachter 2011-11-22 09:40:41 UTC
This isn't an xcb bug, but rather a bug in libX11. gdb says:

(gdb) thread apply all bt

Thread 2 (Thread 0x7ffff3bf2700 (LWP 32070)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007ffff7ae6795 in _XReply (dpy=0x605070, rep=0x7ffff3bf1c90, extra=0, discard=0) at xcb_io.c:623
#2  0x00007ffff788a712 in ?? () from /usr/lib/x86_64-linux-gnu/libGL.so.1
#3  0x00007ffff788866c in ?? () from /usr/lib/x86_64-linux-gnu/libGL.so.1
#4  0x00007ffff4ef8987 in ?? () from /usr/lib/x86_64-linux-gnu/dri/i965_dri.so
#5  0x00007ffff4ef9705 in ?? () from /usr/lib/x86_64-linux-gnu/dri/i965_dri.so
#6  0x00007ffff4ef7c95 in ?? () from /usr/lib/x86_64-linux-gnu/dri/i965_dri.so
#7  0x0000000000401b11 in draw_loop (wt=0x6033c0) at glthreads.c:186
#8  0x0000000000402198 in thread_function (p=0x6033c0) at glthreads.c:384
#9  0x00007ffff7633b40 in start_thread (arg=<optimized out>) at pthread_create.c:304
#10 0x00007ffff737e36d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#11 0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7ffff7fd7720 (LWP 32067)):
#0  0x00007ffff7373723 in *__GI___poll (fds=<optimized out>, nfds=<optimized out>, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:87
#1  0x00007ffff7094fa2 in _xcb_conn_wait (c=0x606780, cond=<optimized out>, vector=0x0, count=0x0) at xcb_conn.c:369
#2  0x00007ffff70966ef in xcb_wait_for_event (c=0x606780) at xcb_in.c:522
#3  0x00007ffff7ae6328 in _XReadEvents (dpy=0x605070) at xcb_io.c:400
#4  0x00007ffff7ad5688 in XNextEvent (dpy=0x605070, event=0x7fffffffe0c0) at NextEvent.c:50
#5  0x0000000000401cae in event_loop (dpy=0x605070) at glthreads.c:238
#6  0x000000000040278d in main (argc=3, argv=0x7fffffffe2c8) at glthreads.c:527


Now if we look at xcb_io.c, shortly before line 623:

	/* FIXME: That event might be after this reply,
	 * and might never even come--or there might be
	 * multiple threads trying to get events. */

I think we are hitting the "might never come"-case and I also think that there are other bug reports which hit that case, so this might be a dupe now.

"Patch" which "fixes" this issue for me is attached, but I guess there is a reason why Xlib tries to do what it is doing, so just removing this code won't help. (What is that reason? How badly would stuff break? Which stuff would break? Could that other stuff be fixed so that this code can be removed?)

BTW this hangs in libxcb if I use the libxcb version from debian instead of the current git version, so there really was a bug in xcb, but that one definitely was already fixed.
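
To make the wait described in this comment concrete, here is a minimal, self-contained C sketch of the pattern; it is a model, not the libX11 code. It assumes, based on the backtrace and the FIXME, that the thread in _XReply waits on a condition variable until the thread that is reading events returns. struct conn, event_reader, send_fake_event and reply_wait_stalls are made-up names, a pipe stands in for the X connection, and the timeout exists only so the demo terminates.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <pthread.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical stand-in for per-display state; not libX11's real struct. */
struct conn {
    pthread_mutex_t lock;
    pthread_cond_t  event_notify;  /* signalled when the event reader returns */
    int             event_waiter;  /* 1 while a thread is blocked reading events */
    int             event_fd;      /* read end of a pipe, standing in for the X socket */
};

/* Plays Thread 1 from the backtrace: blocks until an "event" arrives. */
static void *event_reader(void *arg)
{
    struct conn *c = arg;
    struct pollfd pfd = { .fd = c->event_fd, .events = POLLIN };

    while (poll(&pfd, 1, -1) < 0 && errno == EINTR)
        ;  /* retry if interrupted by a signal */

    pthread_mutex_lock(&c->lock);
    c->event_waiter = 0;
    pthread_cond_broadcast(&c->event_notify);
    pthread_mutex_unlock(&c->lock);
    return NULL;
}

static void send_fake_event(int fd)
{
    ssize_t n = write(fd, "e", 1);
    assert(n == 1);
    (void)n;
}

/* Plays Thread 2: waits for the event reader to finish. Returns 1 if the
 * waiter is still blocked after timeout_ms, 0 if it was woken in time. */
int reply_wait_stalls(int timeout_ms, int send_event)
{
    struct conn c;
    struct timespec deadline;
    pthread_t reader;
    int fds[2], rc, stalled = 0;

    rc = pipe(fds);
    assert(rc == 0);
    pthread_mutex_init(&c.lock, NULL);
    pthread_cond_init(&c.event_notify, NULL);
    c.event_waiter = 1;  /* the reader is about to block in poll() */
    c.event_fd = fds[0];
    rc = pthread_create(&reader, NULL, event_reader, &c);
    assert(rc == 0);

    if (send_event)
        send_fake_event(fds[1]);

    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (long)(timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&c.lock);
    while (c.event_waiter) {
        /* The modelled wait has no timeout; the deadline is for the demo only. */
        rc = pthread_cond_timedwait(&c.event_notify, &c.lock, &deadline);
        if (rc == ETIMEDOUT) {
            stalled = c.event_waiter;
            break;
        }
    }
    pthread_mutex_unlock(&c.lock);

    if (!send_event)
        send_fake_event(fds[1]);  /* release the reader so we can clean up */
    pthread_join(reader, NULL);
    close(fds[0]);
    close(fds[1]);
    pthread_cond_destroy(&c.event_notify);
    pthread_mutex_destroy(&c.lock);
    (void)rc;
    return stalled;
}
```

reply_wait_stalls(100, 0) returns 1: with no event arriving, the waiter is still blocked when the deadline passes, and without a deadline it would block forever. reply_wait_stalls(2000, 1) returns 0, because the event releases the reader, which then wakes the waiter.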
Comment 16 Uli Schlachter 2011-11-22 09:41:36 UTC
Created attachment 53778 [details] [review]
Just removed the code which caused problems and everything started working
Comment 17 ThomasBub 2011-11-23 07:51:52 UTC
(In reply to comment #16)
> Created attachment 53778 [details] [review]
> Just removed the code which caused problems and everything started working

Didn't work for me.
The X-server didn't even show up anymore, while it did before removing the suggested lines.
As said before I'm trying this under SLES 11 SP1, maybe this is the reason why it does not work for me.
Comment 18 ThomasBub 2011-11-29 07:35:00 UTC
Re-ran my test today with the latest openSUSE 12.1 and libX11-1.4.4, with the same bad results as stated in my previous reply.
Comment 19 ThomasBub 2011-12-09 01:04:47 UTC
Fixed with the patch for bug 40372, which hasn't been released in an official version of libxcb yet.
