If you do: - send a void request (no reply expected) with checked() - receive an event (which will have sequence number of void request) - DO NOT send another request - block for the request cookie Then things will hang. What's supposed to happen (unless I'm confused) is that xcb_request_check() does a GetInputFocus because the sequence number you're checking is newer than the expected or completed: if(XCB_SEQUENCE_COMPARE_32(cookie.sequence,>,c->in.request_expected) && XCB_SEQUENCE_COMPARE_32(cookie.sequence,>,c->in.request_completed)) { free(xcb_get_input_focus_reply(c, xcb_get_input_focus(c), &ret)); This then gets us a reply that proves the void request is complete and read_packet can mark us complete. What actually happens is this, in read_packet(): if(XCB_SEQUENCE_COMPARE(c->in.request_read, >, c->in.request_expected)) c->in.request_expected = c->in.request_read; In this case the request_read sequence number matches that of the void request, BUT it's just coming from an event, since events have the sequence number of the last request. xcb_request_check() now thinks a reply is coming for the void request, and hangs forever. I'm not sure of the exact correct fix; it worked for me avoid updating request_expected if genrep.response_type == XCB_REPLY but I can also imagine that it's correct to never update request_expected in this spot in read_packet. Apps can work around the hang by manually doing their own GetInputFocus.
"avoid updating request_expected if genrep.response_type == XCB_REPLY" obviously should be "only update request_expected if ..."
Created attachment 37905 [details] [review] improve xcb_request_check docs I think xcb_request_check() docs assumed a bit too much, here's a patch to spell things out more.
Huh, yeah, that isn't quite right, is it? (In reply to comment #0) > if(XCB_SEQUENCE_COMPARE_32(cookie.sequence,>,c->in.request_expected) > && XCB_SEQUENCE_COMPARE_32(cookie.sequence,>,c->in.request_completed)) I think it should be correct and sufficient to change this check to be if the cookie is greater than *or equal to* request_expected. Since we have a void cookie, request_expected can't have been set equal to this sequence number when the request was sent; it can only have become equal due to the arrival of an event or error. If it became equal due to an event then we still need to sync. If it became equal due to an error, then request_completed will have been updated, which means we correctly won't sync. I don't want to change read_packet's handling of request_expected, because that's for deferring automatic syncs for 16-bit sequence number wrap as long as possible, so we'd wind up sending more syncs if we ignored events. Could you test that changing the first > to >= in xcb_request_check fixes the bug for you?
That still hangs in my test, in the backtrace looking at the connection struct, gdb shows request_expected = 17161, request_read = 17160, request_completed = 17159 At this point we are blocking for 17160 and stuck. A theory on this is that request_expected was < 17160 going into xcb_request_check, so we didn't GetInputFocus. Then we had sent some other void request, which resulted in an event with sequence 17161 while we were spinning in xcb_wait_for_reply(). This is too late to trigger the GetInputFocus.
btw, I guess it's too late to change the API and this could be an API change, but I think a nicer way for things to work might be: - send a fire-and-forget (discard reply) GetInputFocus anytime we flush and there has not been a non-void request since the most recent void-and-checked request. That way, the normal sane usage pattern would Just Work: send_nonvoid_request() send_void_and_checked_request() send_void_request() xcb_flush() /* sends GetInputFocus because void_and_checked with no nonvoid */ note that on flush, we would not *block* on the GetInputFocus reply, of course. We don't care about the reply except to update request_read later. the GetInputFocus would also be sent on implicit flush (say on xcb_take_socket or buffer filling up or whatever), so it would always happen eventually. might send a few more GetInputFocus but that's far less of a problem than GetInputFocus *round trips* in xcb_request_check another sane approach I think would just require the app to send a getinputfocus and warn/complain on stderr if xcb_request_check is invoked without a non-void request since the void request was sent. i.e. make this the app's problem and declare it a bug if we get to xcb_request_check and it's going to hang. I guess my point is, the blocking GetInputFocus round trip in xcb_request_check is never really desirable and apps really ought to avoid it or XCB should avoid it for them. But, all this is more future enhancement, I know there's a much smaller fix to the immediate bug here.
Created attachment 38126 [details] minimal test case This test app reproduces the hang you described. However, the patch I proposed does fix this test case (so I pushed that patch to master). Would you post a minimal test case that still fails on libxcb git master please?
(In reply to comment #4) > That still hangs in my test, in the backtrace looking at the connection struct, > gdb shows > request_expected = 17161, request_read = 17160, request_completed = 17159 > > At this point we are blocking for 17160 and stuck. I assume you mean that cookie.sequence was 17160. > A theory on this is that request_expected was < 17160 going into > xcb_request_check, so we didn't GetInputFocus. The condition was cookie.sequence > request_expected, so you've stated a case where it *would* sync, unless cookie.sequence wasn't also greater than request_completed at the time. > Then we had sent some other void > request, which resulted in an event with sequence 17161 while we were spinning > in xcb_wait_for_reply(). This is too late to trigger the GetInputFocus. That theory can't explain the sequence numbers you observed: If you got an event with sequence 17161, then request_read would be 17161 and request_completed would be 17160. Instead I think you must have gotten an event or reply (not an error) for 17160, and also sent a non-void request for 17161. Which raises the question of how you could be blocked indefinitely... Perhaps that non-void request didn't get flushed? Maybe xcb_request_check needs to flush when it doesn't sync? Also: Is your test app making requests from multiple threads? That's supposed to Just Work, but it would be important to know. BTW, I want to think about your proposed API changes after this bug is sorted out; but I'm ignoring them until then. :-)
Looking at a sample hang again, and I happen to have XCB full of debug logging... during the hang, request_expected = 8493, request_read = 8492, request_completed = 8491 3 0x00a42752 in xcb_wait_for_reply (c=0x9d0a198, request=8492, e=0xbfb6419c) at xcb_in.c:512 i.e. wait_for_reply is blocking on 8492 some of my debug logging was: PAINT: GOT X REPLY 8486 PAINT: EXPECTING X REPLY SEQ 8488 PAINT: 0 bytes buffered out PAINT: GOT X REPLY 8488 PAINT: xcb_writev request 8489 PAINT: first of 1 requests opcode 148 minor_opcode 1 PAINT: 0 bytes buffered out PAINT: xcb_writev request 8490 PAINT: first of 1 requests opcode 148 minor_opcode 142 PAINT: 20464 bytes buffered out PAINT: EXPECTING X REPLY SEQ 8491 PAINT: EXPECTING X REPLY SEQ 8492 PAINT: 216 bytes buffered out PAINT: GOT X REPLY 8491 ignore "PAINT" that is just what I'm debugging. "bytes buffered out" is number of bytes in the outgoing kernel buffer on the socket, ioctl(SIOCOUTQ). GOT X REPLY is in read_packet. EXPECTING X REPLY is _xcb_in_expect_reply. I'll add some logging to flush and at the start of request_check and see if I can reproduce again. (My test case is "run huge app and do stuff for a while" so it's a little tricky.)
Here's a better log. PAINT: GOT X REPLY 2870 PAINT: _xcb_wait_for_reply(2866) expected=2870 read=2870 completed=2869 PAINT: _xcb_request_check(2867) expected=2870 read=2870 completed=2869 PAINT: _xcb_wait_for_reply(2867) expected=2870 read=2870 completed=2869 PAINT: xcb_writev request 2871 PAINT: first of 1 requests opcode 148 minor_opcode 1 PAINT: _xcb_conn_wait expected=2870 read=2870 completed=2869 PAINT: 0 bytes buffered out PAINT: xcb_writev request 2872 PAINT: first of 1 requests opcode 148 minor_opcode 142 PAINT: _xcb_conn_wait expected=2870 read=2870 completed=2869 PAINT: 0 bytes buffered out PAINT: EXPECTING X REPLY SEQ 2873 PAINT: EXPECTING X REPLY SEQ 2874 PAINT: _xcb_out_flush_to(2874) expected=2873 read=2870 completed=2869 PAINT: _xcb_conn_wait expected=2873 read=2870 completed=2869 PAINT: 216 bytes buffered out PAINT: _xcb_out_flush_to(2874) expected=2873 read=2870 completed=2869 PAINT: _xcb_wait_for_reply(2873) expected=2875 read=2870 completed=2869 PAINT: _xcb_out_flush_to(2873) expected=2875 read=2870 completed=2869 PAINT: _xcb_conn_wait expected=2875 read=2870 completed=2869 PAINT: GOT X REPLY 2873 PAINT: _xcb_request_check(2874) expected=2875 read=2873 completed=2872 PAINT: _xcb_wait_for_reply(2874) expected=2875 read=2873 completed=2872 PAINT: _xcb_out_flush_to(2874) expected=2875 read=2873 completed=2872 PAINT: _xcb_conn_wait expected=2875 read=2873 completed=2872 PAINT: _xcb_conn_wait expected=2875 read=2874 completed=2873 PAINT: _xcb_conn_wait expected=2875 read=2874 completed=2873 PAINT: _xcb_conn_wait expected=2875 read=2874 completed=2873 PAINT: _xcb_conn_wait expected=2875 read=2874 completed=2873 PAINT: _xcb_conn_wait expected=2875 read=2874 completed=2873 [at this point we're spinning forever]
Even more detail. It looks like there's one unflushed request, which was your theory. PAINT: xcb_writev request 1217 PAINT: first of 1 requests opcode 148 minor_opcode 142 PAINT: START _xcb_out_send expected=1215 read=1215 completed=1214 out.request=1217 out.written=1216 PAINT: _xcb_conn_wait expected=1215 read=1215 completed=1214 PAINT: 133944 bytes buffered out PAINT: END _xcb_out_send expected=1215 read=1215 completed=1214 out.request=1217 out.written=1217 PAINT: _xcb_send_request expected=1215 read=1215 completed=1214 PAINT: get_socket_back expected=1215 read=1215 completed=1214 PAINT: _xcb_send_request assigned request=1218 PAINT: EXPECTING X REPLY SEQ 1218 PAINT: _xcb_send_request expected=1218 read=1215 completed=1214 PAINT: get_socket_back expected=1218 read=1215 completed=1214 PAINT: _xcb_send_request assigned request=1219 PAINT: EXPECTING X REPLY SEQ 1219 PAINT: _xcb_out_flush_to(1219) expected=1218 read=1215 completed=1214 PAINT: START _xcb_out_send expected=1218 read=1215 completed=1214 out.request=1219 out.written=1217 PAINT: _xcb_conn_wait expected=1218 read=1215 completed=1214 PAINT: 216 bytes buffered out PAINT: END _xcb_out_send expected=1218 read=1215 completed=1214 out.request=1219 out.written=1219 PAINT: xcb_take_socket expected=1218 read=1215 completed=1214 PAINT: get_socket_back expected=1218 read=1215 completed=1214 PAINT: _xcb_out_flush_to(1219) expected=1218 read=1215 completed=1214 PAINT: _xcb_send_request expected=1218 read=1215 completed=1214 PAINT: get_socket_back expected=1218 read=1215 completed=1214 PAINT: _xcb_send_request assigned request=1220 PAINT: _xcb_wait_for_reply(1218) expected=1220 read=1215 completed=1214 PAINT: _xcb_out_flush_to(1218) expected=1220 read=1215 completed=1214 PAINT: _xcb_conn_wait expected=1220 read=1215 completed=1214 PAINT: GOT X REPLY 1218 PAINT: _xcb_request_check(1219) expected=1220 read=1218 completed=1217 PAINT: _xcb_wait_for_reply(1219) expected=1220 read=1218 completed=1217 PAINT: _xcb_out_flush_to(1219) expected=1220 read=1218 completed=1217 PAINT: _xcb_conn_wait expected=1220 read=1218 completed=1217 PAINT: _xcb_conn_wait expected=1220 read=1219 completed=1218 PAINT: _xcb_conn_wait expected=1220 read=1219 completed=1218 PAINT: _xcb_conn_wait expected=1220 read=1219 completed=1218 PAINT: _xcb_conn_wait expected=1220 read=1219 completed=1218 PAINT: _xcb_conn_wait expected=1220 read=1219 completed=1218 PAINT: _xcb_conn_wait expected=1220 read=1219 completed=1218
I'm stupid. All those logs lack the fix to do >=request_expected.
A log WITH the fix (still seems to be missing flush issue) PAINT: xcb_writev request 13972 PAINT: first of 1 requests opcode 148 minor_opcode 142 PAINT: START _xcb_out_send expected=13970 read=13970 completed=13969 out.request=13972 out.written=13971 PAINT: _xcb_conn_wait expected=13970 read=13970 completed=13969 PAINT: 0 bytes buffered out PAINT: END _xcb_out_send expected=13970 read=13970 completed=13969 out.request=13972 out.written=13972 PAINT: _xcb_send_request expected=13970 read=13970 completed=13969 PAINT: get_socket_back expected=13970 read=13970 completed=13969 PAINT: _xcb_send_request assigned request=13973 PAINT: EXPECTING X REPLY SEQ 13973 PAINT: _xcb_send_request expected=13973 read=13970 completed=13969 PAINT: get_socket_back expected=13973 read=13970 completed=13969 PAINT: _xcb_send_request assigned request=13974 PAINT: EXPECTING X REPLY SEQ 13974 PAINT: _xcb_out_flush_to(13974) expected=13973 read=13970 completed=13969 PAINT: START _xcb_out_send expected=13973 read=13970 completed=13969 out.request=13974 out.written=13972 PAINT: _xcb_conn_wait expected=13973 read=13970 completed=13969 PAINT: 216 bytes buffered out PAINT: END _xcb_out_send expected=13973 read=13970 completed=13969 out.request=13974 out.written=13974 PAINT: xcb_take_socket expected=13973 read=13970 completed=13969 PAINT: get_socket_back expected=13973 read=13970 completed=13969 PAINT: _xcb_out_flush_to(13974) expected=13973 read=13970 completed=13969 PAINT: _xcb_send_request expected=13973 read=13970 completed=13969 PAINT: get_socket_back expected=13973 read=13970 completed=13969 PAINT: _xcb_send_request assigned request=13975 PAINT: _xcb_wait_for_reply(13973) expected=13975 read=13970 completed=13969 PAINT: _xcb_out_flush_to(13973) expected=13975 read=13970 completed=13969 PAINT: _xcb_conn_wait expected=13975 read=13970 completed=13969 PAINT: GOT X REPLY 13973 PAINT: _xcb_request_check(13974) expected=13975 read=13973 completed=13972 PAINT: _xcb_wait_for_reply(13974) expected=13975 read=13973 completed=13972 PAINT: _xcb_out_flush_to(13974) expected=13975 read=13973 completed=13972 PAINT: _xcb_conn_wait expected=13975 read=13973 completed=13972 PAINT: _xcb_conn_wait expected=13975 read=13974 completed=13973
... and if I add xcb_flush() to xcb_request_check(), if not doing GetInputFocus, then I haven't reproduced the hang after a good bit longer than it usually takes. So that looks like it. Re: API ideas, just throwing ideas out there. Feel free to ignore or maybe you want to open a new bug if you decide you want to make a change. Thanks
Hey Havoc, sorry I dropped this. Since your last report, I've cleaned up xcb_request_check to hold locks appropriately and use XCB_REQUEST_DISCARD_REPLY on syncs, and Rami noticed some issues with threads not always being awakened as soon as they could. And as far as I can tell, all necessary requests are flushed before it waits for the reply. I don't currently see any way your tests could be failing. :-) So could you re-test against libxcb git master and report whether all is well now? Thanks in advance.
The test setup here was a bunch of local hacks to a large codebase that's since been changed all around, so it would probably take me hours to figure it out again (at least). If you think it's fixed we should just go with that probably.
> If you think it's fixed we should just go with that probably. Yeah.
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.