| Summary: | hang in xcb_request_check() due to expecting a reply on void request | ||
|---|---|---|---|
| Product: | XCB | Reporter: | Havoc Pennington <hp> |
| Component: | Library | Assignee: | Jamey Sharp <jamey> |
| Status: | RESOLVED FIXED | QA Contact: | xcb mailing list dummy <xcb> |
| Severity: | normal | ||
| Priority: | medium | ||
| Version: | unspecified | ||
| Hardware: | All | ||
| OS: | All | ||
| Whiteboard: | |||
| i915 platform: | i915 features: | ||
| Attachments: |
improve xcb_request_check docs
minimal test case |
||
"avoid updating request_expected if genrep.response_type == XCB_REPLY" obviously should be "only update request_expected if ..." Created attachment 37905 [details] [review] improve xcb_request_check docs I think xcb_request_check() docs assumed a bit too much, here's a patch to spell things out more. Huh, yeah, that isn't quite right, is it? (In reply to comment #0) > if(XCB_SEQUENCE_COMPARE_32(cookie.sequence,>,c->in.request_expected) > && XCB_SEQUENCE_COMPARE_32(cookie.sequence,>,c->in.request_completed)) I think it should be correct and sufficient to change this check to be if the cookie is greater than *or equal to* request_expected. Since we have a void cookie, request_expected can't have been set equal to this sequence number when the request was sent; it can only have become equal due to the arrival of an event or error. If it became equal due to an event then we still need to sync. If it became equal due to an error, then request_completed will have been updated, which means we correctly won't sync. I don't want to change read_packet's handling of request_expected, because that's for deferring automatic syncs for 16-bit sequence number wrap as long as possible, so we'd wind up sending more syncs if we ignored events. Could you test that changing the first > to >= in xcb_request_check fixes the bug for you? That still hangs in my test, in the backtrace looking at the connection struct, gdb shows request_expected = 17161, request_read = 17160, request_completed = 17159 At this point we are blocking for 17160 and stuck. A theory on this is that request_expected was < 17160 going into xcb_request_check, so we didn't GetInputFocus. Then we had sent some other void request, which resulted in an event with sequence 17161 while we were spinning in xcb_wait_for_reply(). This is too late to trigger the GetInputFocus. btw, I guess it's too late to change the API and this could be an API change, but I think a nicer way for things to work might be: - send a fire-and-forget (discard reply) GetInputFocus anytime we flush and there has not been a non-void request since the most recent void-and-checked request. That way, the normal sane usage pattern would Just Work: send_nonvoid_request() send_void_and_checked_request() send_void_request() xcb_flush() /* sends GetInputFocus because void_and_checked with no nonvoid */ note that on flush, we would not *block* on the GetInputFocus reply, of course. We don't care about the reply except to update request_read later. the GetInputFocus would also be sent on implicit flush (say on xcb_take_socket or buffer filling up or whatever), so it would always happen eventually. might send a few more GetInputFocus but that's far less of a problem than GetInputFocus *round trips* in xcb_request_check another sane approach I think would just require the app to send a getinputfocus and warn/complain on stderr if xcb_request_check is invoked without a non-void request since the void request was sent. i.e. make this the app's problem and declare it a bug if we get to xcb_request_check and it's going to hang. I guess my point is, the blocking GetInputFocus round trip in xcb_request_check is never really desirable and apps really ought to avoid it or XCB should avoid it for them. But, all this is more future enhancement, I know there's a much smaller fix to the immediate bug here. Created attachment 38126 [details]
minimal test case
This test app reproduces the hang you described. However, the patch I proposed does fix this test case (so I pushed that patch to master).
Would you post a minimal test case that still fails on libxcb git master please?
(In reply to comment #4) > That still hangs in my test, in the backtrace looking at the connection struct, > gdb shows > request_expected = 17161, request_read = 17160, request_completed = 17159 > > At this point we are blocking for 17160 and stuck. I assume you mean that cookie.sequence was 17160. > A theory on this is that request_expected was < 17160 going into > xcb_request_check, so we didn't GetInputFocus. The condition was cookie.sequence > request_expected, so you've stated a case where it *would* sync, unless cookie.sequence wasn't also greater than request_completed at the time. > Then we had sent some other void > request, which resulted in an event with sequence 17161 while we were spinning > in xcb_wait_for_reply(). This is too late to trigger the GetInputFocus. That theory can't explain the sequence numbers you observed: If you got an event with sequence 17161, then request_read would be 17161 and request_completed would be 17160. Instead I think you must have gotten an event or reply (not an error) for 17160, and also sent a non-void request for 17161. Which raises the question of how you could be blocked indefinitely... Perhaps that non-void request didn't get flushed? Maybe xcb_request_check needs to flush when it doesn't sync? Also: Is your test app making requests from multiple threads? That's supposed to Just Work, but it would be important to know. BTW, I want to think about your proposed API changes after this bug is sorted out; but I'm ignoring them until then. :-) Looking at a sample hang again, and I happen to have XCB full of debug logging...
during the hang,
request_expected = 8493, request_read = 8492, request_completed = 8491
3 0x00a42752 in xcb_wait_for_reply (c=0x9d0a198, request=8492, e=0xbfb6419c) at xcb_in.c:512
i.e. wait_for_reply is blocking on 8492
some of my debug logging was:
PAINT: GOT X REPLY 8486
PAINT: EXPECTING X REPLY SEQ 8488
PAINT: 0 bytes buffered out
PAINT: GOT X REPLY 8488
PAINT: xcb_writev request 8489
PAINT: first of 1 requests opcode 148 minor_opcode 1
PAINT: 0 bytes buffered out
PAINT: xcb_writev request 8490
PAINT: first of 1 requests opcode 148 minor_opcode 142
PAINT: 20464 bytes buffered out
PAINT: EXPECTING X REPLY SEQ 8491
PAINT: EXPECTING X REPLY SEQ 8492
PAINT: 216 bytes buffered out
PAINT: GOT X REPLY 8491
ignore "PAINT" that is just what I'm debugging. "bytes buffered out" is number of bytes in the outgoing kernel buffer on the socket, ioctl(SIOCOUTQ). GOT X REPLY is in read_packet. EXPECTING X REPLY is _xcb_in_expect_reply.
I'll add some logging to flush and at the start of request_check and see if I can reproduce again. (My test case is "run huge app and do stuff for a while" so it's a little tricky.)
Here's a better log.
PAINT: GOT X REPLY 2870
PAINT: _xcb_wait_for_reply(2866) expected=2870 read=2870 completed=2869
PAINT: _xcb_request_check(2867) expected=2870 read=2870 completed=2869
PAINT: _xcb_wait_for_reply(2867) expected=2870 read=2870 completed=2869
PAINT: xcb_writev request 2871
PAINT: first of 1 requests opcode 148 minor_opcode 1
PAINT: _xcb_conn_wait expected=2870 read=2870 completed=2869
PAINT: 0 bytes buffered out
PAINT: xcb_writev request 2872
PAINT: first of 1 requests opcode 148 minor_opcode 142
PAINT: _xcb_conn_wait expected=2870 read=2870 completed=2869
PAINT: 0 bytes buffered out
PAINT: EXPECTING X REPLY SEQ 2873
PAINT: EXPECTING X REPLY SEQ 2874
PAINT: _xcb_out_flush_to(2874) expected=2873 read=2870 completed=2869
PAINT: _xcb_conn_wait expected=2873 read=2870 completed=2869
PAINT: 216 bytes buffered out
PAINT: _xcb_out_flush_to(2874) expected=2873 read=2870 completed=2869
PAINT: _xcb_wait_for_reply(2873) expected=2875 read=2870 completed=2869
PAINT: _xcb_out_flush_to(2873) expected=2875 read=2870 completed=2869
PAINT: _xcb_conn_wait expected=2875 read=2870 completed=2869
PAINT: GOT X REPLY 2873
PAINT: _xcb_request_check(2874) expected=2875 read=2873 completed=2872
PAINT: _xcb_wait_for_reply(2874) expected=2875 read=2873 completed=2872
PAINT: _xcb_out_flush_to(2874) expected=2875 read=2873 completed=2872
PAINT: _xcb_conn_wait expected=2875 read=2873 completed=2872
PAINT: _xcb_conn_wait expected=2875 read=2874 completed=2873
PAINT: _xcb_conn_wait expected=2875 read=2874 completed=2873
PAINT: _xcb_conn_wait expected=2875 read=2874 completed=2873
PAINT: _xcb_conn_wait expected=2875 read=2874 completed=2873
PAINT: _xcb_conn_wait expected=2875 read=2874 completed=2873
[at this point we're spinning forever]
Even more detail. It looks like there's one unflushed request, which was your theory.
PAINT: xcb_writev request 1217
PAINT: first of 1 requests opcode 148 minor_opcode 142
PAINT: START _xcb_out_send expected=1215 read=1215 completed=1214 out.request=1217 out.written=1216
PAINT: _xcb_conn_wait expected=1215 read=1215 completed=1214
PAINT: 133944 bytes buffered out
PAINT: END _xcb_out_send expected=1215 read=1215 completed=1214 out.request=1217 out.written=1217
PAINT: _xcb_send_request expected=1215 read=1215 completed=1214
PAINT: get_socket_back expected=1215 read=1215 completed=1214
PAINT: _xcb_send_request assigned request=1218
PAINT: EXPECTING X REPLY SEQ 1218
PAINT: _xcb_send_request expected=1218 read=1215 completed=1214
PAINT: get_socket_back expected=1218 read=1215 completed=1214
PAINT: _xcb_send_request assigned request=1219
PAINT: EXPECTING X REPLY SEQ 1219
PAINT: _xcb_out_flush_to(1219) expected=1218 read=1215 completed=1214
PAINT: START _xcb_out_send expected=1218 read=1215 completed=1214 out.request=1219 out.written=1217
PAINT: _xcb_conn_wait expected=1218 read=1215 completed=1214
PAINT: 216 bytes buffered out
PAINT: END _xcb_out_send expected=1218 read=1215 completed=1214 out.request=1219 out.written=1219
PAINT: xcb_take_socket expected=1218 read=1215 completed=1214
PAINT: get_socket_back expected=1218 read=1215 completed=1214
PAINT: _xcb_out_flush_to(1219) expected=1218 read=1215 completed=1214
PAINT: _xcb_send_request expected=1218 read=1215 completed=1214
PAINT: get_socket_back expected=1218 read=1215 completed=1214
PAINT: _xcb_send_request assigned request=1220
PAINT: _xcb_wait_for_reply(1218) expected=1220 read=1215 completed=1214
PAINT: _xcb_out_flush_to(1218) expected=1220 read=1215 completed=1214
PAINT: _xcb_conn_wait expected=1220 read=1215 completed=1214
PAINT: GOT X REPLY 1218
PAINT: _xcb_request_check(1219) expected=1220 read=1218 completed=1217
PAINT: _xcb_wait_for_reply(1219) expected=1220 read=1218 completed=1217
PAINT: _xcb_out_flush_to(1219) expected=1220 read=1218 completed=1217
PAINT: _xcb_conn_wait expected=1220 read=1218 completed=1217
PAINT: _xcb_conn_wait expected=1220 read=1219 completed=1218
PAINT: _xcb_conn_wait expected=1220 read=1219 completed=1218
PAINT: _xcb_conn_wait expected=1220 read=1219 completed=1218
PAINT: _xcb_conn_wait expected=1220 read=1219 completed=1218
PAINT: _xcb_conn_wait expected=1220 read=1219 completed=1218
PAINT: _xcb_conn_wait expected=1220 read=1219 completed=1218
I'm stupid. All those logs lack the fix to do >=request_expected. A log WITH the fix (still seems to be missing flush issue)
PAINT: xcb_writev request 13972
PAINT: first of 1 requests opcode 148 minor_opcode 142
PAINT: START _xcb_out_send expected=13970 read=13970 completed=13969 out.request=13972 out.written=13971
PAINT: _xcb_conn_wait expected=13970 read=13970 completed=13969
PAINT: 0 bytes buffered out
PAINT: END _xcb_out_send expected=13970 read=13970 completed=13969 out.request=13972 out.written=13972
PAINT: _xcb_send_request expected=13970 read=13970 completed=13969
PAINT: get_socket_back expected=13970 read=13970 completed=13969
PAINT: _xcb_send_request assigned request=13973
PAINT: EXPECTING X REPLY SEQ 13973
PAINT: _xcb_send_request expected=13973 read=13970 completed=13969
PAINT: get_socket_back expected=13973 read=13970 completed=13969
PAINT: _xcb_send_request assigned request=13974
PAINT: EXPECTING X REPLY SEQ 13974
PAINT: _xcb_out_flush_to(13974) expected=13973 read=13970 completed=13969
PAINT: START _xcb_out_send expected=13973 read=13970 completed=13969 out.request=13974 out.written=13972
PAINT: _xcb_conn_wait expected=13973 read=13970 completed=13969
PAINT: 216 bytes buffered out
PAINT: END _xcb_out_send expected=13973 read=13970 completed=13969 out.request=13974 out.written=13974
PAINT: xcb_take_socket expected=13973 read=13970 completed=13969
PAINT: get_socket_back expected=13973 read=13970 completed=13969
PAINT: _xcb_out_flush_to(13974) expected=13973 read=13970 completed=13969
PAINT: _xcb_send_request expected=13973 read=13970 completed=13969
PAINT: get_socket_back expected=13973 read=13970 completed=13969
PAINT: _xcb_send_request assigned request=13975
PAINT: _xcb_wait_for_reply(13973) expected=13975 read=13970 completed=13969
PAINT: _xcb_out_flush_to(13973) expected=13975 read=13970 completed=13969
PAINT: _xcb_conn_wait expected=13975 read=13970 completed=13969
PAINT: GOT X REPLY 13973
PAINT: _xcb_request_check(13974) expected=13975 read=13973 completed=13972
PAINT: _xcb_wait_for_reply(13974) expected=13975 read=13973 completed=13972
PAINT: _xcb_out_flush_to(13974) expected=13975 read=13973 completed=13972
PAINT: _xcb_conn_wait expected=13975 read=13973 completed=13972
PAINT: _xcb_conn_wait expected=13975 read=13974 completed=13973
... and if I add xcb_flush() to xcb_request_check(), if not doing GetInputFocus, then I haven't reproduced the hang after a good bit longer than it usually takes. So that looks like it. Re: API ideas, just throwing ideas out there. Feel free to ignore or maybe you want to open a new bug if you decide you want to make a change. Thanks Hey Havoc, sorry I dropped this. Since your last report, I've cleaned up xcb_request_check to hold locks appropriately and use XCB_REQUEST_DISCARD_REPLY on syncs, and Rami noticed some issues with threads not always being awakened as soon as they could. And as far as I can tell, all necessary requests are flushed before it waits for the reply. I don't currently see any way your tests could be failing. :-) So could you re-test against libxcb git master and report whether all is well now? Thanks in advance. The test setup here was a bunch of local hacks to a large codebase that's since been changed all around, so it would probably take me hours to figure it out again (at least). If you think it's fixed we should just go with that probably. > If you think it's fixed we should just go with that probably.
Yeah.
|
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.
If you do: - send a void request (no reply expected) with checked() - receive an event (which will have sequence number of void request) - DO NOT send another request - block for the request cookie Then things will hang. What's supposed to happen (unless I'm confused) is that xcb_request_check() does a GetInputFocus because the sequence number you're checking is newer than the expected or completed: if(XCB_SEQUENCE_COMPARE_32(cookie.sequence,>,c->in.request_expected) && XCB_SEQUENCE_COMPARE_32(cookie.sequence,>,c->in.request_completed)) { free(xcb_get_input_focus_reply(c, xcb_get_input_focus(c), &ret)); This then gets us a reply that proves the void request is complete and read_packet can mark us complete. What actually happens is this, in read_packet(): if(XCB_SEQUENCE_COMPARE(c->in.request_read, >, c->in.request_expected)) c->in.request_expected = c->in.request_read; In this case the request_read sequence number matches that of the void request, BUT it's just coming from an event, since events have the sequence number of the last request. xcb_request_check() now thinks a reply is coming for the void request, and hangs forever. I'm not sure of the exact correct fix; it worked for me avoid updating request_expected if genrep.response_type == XCB_REPLY but I can also imagine that it's correct to never update request_expected in this spot in read_packet. Apps can work around the hang by manually doing their own GetInputFocus.