When using libxcb (1.0) for building libX11 (1.1.3) some common signals get ignored. See Novell Bugzilla for more details. https://bugzilla.novell.com/show_bug.cgi?id=283914 This includes results of investigations Egbert already did. Unfortunately he doesn't have the time to investigate this any further at the moment. This happens at least on xorg-server 1.3.0.0.
I've managed to reproduce the bug, at least, but I really am pretty swamped right now and really don't have time to debug it myself. I note that xv makes a bunch of Xlib API calls from its signal handler. As far as I can recall this is technically illegal in Xlib; furthermore I'm pretty sure it will cause trouble in Xlib/XCB. My guess is that the malloc pool is getting corrupted: Xlib/XCB uses malloc pretty heavily, and Xlib basically doesn't use it, so this may be the difference?
I checked with Keith Packard, and he says that he is surprised old Xlib would put up with the calls from an interrupt handler either---this definitely isn't supposed to work. This sure seems likely to be an xv bug to me, triggered by the greater use of malloc in Xlib/XCB. I'm marking it NOTOURBUG until someone convinces me it is, I'm afraid.
Given that your assumptions about xv are correct it's still an dissatisfying situation. xv is not developed active anymore and therefore it's unlikely to see a fixed xv version anytime soon. It doesn't increase the acceptance of libxcb either considering all these libxcb assertions due to unbalanced use of XLockDisplay/XUnlockdisplay (Java, etc.). xv is still commonly in use by a lot of users and therefore another reason for a Linux distribution not to switch to libX11/libxcb combo. Well, openSUSE 10.3 did. Unfortunately not with the best results. :-(
Created attachment 19289 [details] test case of bad code using XNextEvent and a signal handler I ran into this myself with some old window manager code, and made a minimal test case. There's some commentary inside, but to summarize, just trying to make an Xlib request is enough to trigger the bug. I don't malloc in it at all. The backtrace looks like: #0 0x00007f4560277f50 in pthread_cond_wait () from /lib/libc.so.6 #1 0x00007f455fd89b78 in ?? () from /usr/lib/libxcb.so.1 #2 0x00007f455fd8b575 in xcb_wait_for_reply () from /usr/lib/libxcb.so.1 #3 0x00007f456053f0ce in _XReply () from /usr/lib/libX11.so.6 #4 0x00007f4560532ce3 in XSync () from /usr/lib/libX11.so.6 #5 0x00007f4560511b10 in XCloseDisplay () from /usr/lib/libX11.so.6 #6 <signal handler called> #7 0x00007f4560265ec3 in select () from /lib/libc.so.6 #8 0x00007f455fd899e6 in ?? () from /usr/lib/libxcb.so.1 #9 0x00007f455fd8b242 in xcb_wait_for_event () from /usr/lib/libxcb.so.1 #10 0x00007f456053eb47 in ?? () from /usr/lib/libX11.so.6 #11 0x00007f456053eeb5 in ?? () from /usr/lib/libX11.so.6 #12 0x00007f4560525898 in XNextEvent () from /usr/lib/libX11.so.6 So I would theorize that xcb_wait_for_reply falls over if xcb_wait_for_event is in the middle of executing at the same time (this is just a guess). Now, trying to do something from the signal handler is wrong, so I wouldn't demand kludging that to work again, but it would be nice if the failure mode was "crash with an assertion shaming the client author" instead of "hang". Hopefully it wouldn't be too difficult to detect this sort of call chain? I'm not familiar with the XCB internals though. If that is not worth the effort/complication, well, here's some proof that the signal handler is to blame, at least. :-) I'm also curious how anyone else here would prefer to rewrite this test case so that it is correct and works on both XCB/Xlib and the historical Xlib implementation.
(In reply to comment #4) > Created an attachment (id=19289) [details] > test case of bad code using XNextEvent and a signal handler > > I ran into this myself with some old window manager code, and made a minimal > test case. There's some commentary inside, but to summarize, just trying to > make an Xlib request is enough to trigger the bug. I don't malloc in it at > all. Thanks much for the nice bug report and test case! > So I would theorize that xcb_wait_for_reply falls over if > xcb_wait_for_event is > in the middle of executing at the same time (this is just a guess). It's a reasonable guess. The only really tricky code in XCB is the code that tries to make sure that there's always exactly one reader blocked on the server connection, and that it's the right one. It's worth noting that xcb_disconnect() actually does free some memory before it tries to close the file descriptor, so it's not inconceivable that it's still partly a heap trashing thing. I moved an allocation, but I doubt it will fix it. > Now, trying to do something from the signal handler is wrong, so I wouldn't > demand kludging that to work again, but it would be nice if the failure mode > was "crash with an assertion shaming the client author" instead of "hang". > Hopefully it wouldn't be too difficult to detect this sort of call chain? I'm > not familiar with the XCB internals though. It seems like we could have an assertion here, or just have the second thing trying to wait fail. > If that is not worth the effort/complication, well, here's some proof that the > signal handler is to blame, at least. :-) I'm also curious how anyone > else here > would prefer to rewrite this test case so that it is correct and works on both > XCB/Xlib and the historical Xlib implementation. It's hard, but the interrupt handler really shouldn't try to "clean up", either. Just get rid of the XCloseDisplay() and the whole signal handler; slamming the connection shut on interrupt will be fine. You should only close the display in response to the event indicating the client has asked it to be closed. I've reopened this bug, but honestly it probably doesn't have such high priority for us right now. For one thing, the "mythical handoff code" might fix it; I'll have to think about that. For another thing, the only actual app I've ever seen reported to have it is xv, and I use this app succesfully every day. I can reproduce the bug, but it's easy enough to just hit ^Z and then kill xv at that point, and I have to work to have it happen anyhow. In any case, I'm poking at the code right now. If I find an easy fix, I'll post it.
Probably something we should continue to look at.
Created attachment 19293 [details] Test code for signal handling bug Fixed to actually compile under gcc on my Linux box. Compile with -lX11 -lXext
Our xv maintainer (Werner Fink) fixed the issue in xv. I'll attach the patch.
Created attachment 20510 [details] [review] xv-3.10a-signal.dif Fix for xv.
Closing as fixed.
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.