Bug 29599 - hang in xcb_request_check() due to expecting a reply on void request
Summary: hang in xcb_request_check() due to expecting a reply on void request
Status: ASSIGNED
Alias: None
Product: XCB
Classification: Unclassified
Component: Library (show other bugs)
Version: unspecified
Hardware: All All
: medium normal
Assignee: Jamey Sharp
QA Contact: xcb mailing list dummy
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-08-16 20:25 UTC by Havoc Pennington
Modified: 2011-03-25 12:10 UTC (History)
0 users

See Also:
i915 platform:
i915 features:


Attachments
improve xcb_request_check docs (2.11 KB, patch)
2010-08-16 21:11 UTC, Havoc Pennington
Details | Splinter Review
minimal test case (1.98 KB, text/x-csrc)
2010-08-24 09:32 UTC, Jamey Sharp
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Havoc Pennington 2010-08-16 20:25:14 UTC
If you do:

 - send a void request (no reply expected) with checked()
 - receive an event (which will have sequence number of void request)
 - DO NOT send another request
 - block for the request cookie

Then things will hang. What's supposed to happen (unless I'm confused) is that xcb_request_check() does a GetInputFocus because the sequence number you're checking is newer than the expected or completed:

 if(XCB_SEQUENCE_COMPARE_32(cookie.sequence,>,c->in.request_expected)
       && XCB_SEQUENCE_COMPARE_32(cookie.sequence,>,c->in.request_completed))
    {
        free(xcb_get_input_focus_reply(c, xcb_get_input_focus(c), &ret));

This then gets us a reply that proves the void request is complete and read_packet can mark us complete.

What actually happens is this, in read_packet():
    if(XCB_SEQUENCE_COMPARE(c->in.request_read, >, c->in.request_expected))
            c->in.request_expected = c->in.request_read;

In this case the request_read sequence number matches that of the void request, BUT it's just coming from an event, since events have the sequence number of the last request.

xcb_request_check() now thinks a reply is coming for the void request, and hangs forever.

I'm not sure of the exact correct fix; it worked for me avoid updating request_expected if genrep.response_type == XCB_REPLY but I can also imagine that it's correct to never update request_expected in this spot in read_packet.

Apps can work around the hang by manually doing their own GetInputFocus.
Comment 1 Havoc Pennington 2010-08-16 20:31:26 UTC
"avoid updating request_expected if genrep.response_type == XCB_REPLY" obviously should be "only update request_expected if ..."
Comment 2 Havoc Pennington 2010-08-16 21:11:02 UTC
Created attachment 37905 [details] [review]
improve xcb_request_check docs

I think xcb_request_check() docs assumed a bit too much, here's a patch to spell things out more.
Comment 3 Jamey Sharp 2010-08-17 13:36:44 UTC
Huh, yeah, that isn't quite right, is it?

(In reply to comment #0)
>  if(XCB_SEQUENCE_COMPARE_32(cookie.sequence,>,c->in.request_expected)
>        && XCB_SEQUENCE_COMPARE_32(cookie.sequence,>,c->in.request_completed))

I think it should be correct and sufficient to change this check to be if the cookie is greater than *or equal to* request_expected.

Since we have a void cookie, request_expected can't have been set equal to this sequence number when the request was sent; it can only have become equal due to the arrival of an event or error. If it became equal due to an event then we still need to sync. If it became equal due to an error, then request_completed will have been updated, which means we correctly won't sync.

I don't want to change read_packet's handling of request_expected, because that's for deferring automatic syncs for 16-bit sequence number wrap as long as possible, so we'd wind up sending more syncs if we ignored events.

Could you test that changing the first > to >= in xcb_request_check fixes the bug for you?
Comment 4 Havoc Pennington 2010-08-18 14:23:37 UTC
That still hangs in my test, in the backtrace looking at the connection struct, gdb shows
 request_expected = 17161, request_read = 17160, request_completed = 17159

At this point we are blocking for 17160 and stuck.

A theory on this is that request_expected was < 17160 going into xcb_request_check, so we didn't GetInputFocus. Then we had sent some other void request, which resulted in an event with sequence 17161 while we were spinning in xcb_wait_for_reply(). This is too late to trigger the GetInputFocus.
Comment 5 Havoc Pennington 2010-08-18 14:38:06 UTC
btw, I guess it's too late to change the API and this could be an API change, but I think a nicer way for things to work might be:

 - send a fire-and-forget (discard reply) GetInputFocus anytime we
   flush and there has not been a non-void request since the most
   recent void-and-checked request.

That way, the normal sane usage pattern would Just Work:

 send_nonvoid_request()
 send_void_and_checked_request()
 send_void_request()
 xcb_flush()  /* sends GetInputFocus because void_and_checked with no nonvoid */

note that on flush, we would not *block* on the GetInputFocus reply, of course.
We don't care about the reply except to update request_read later.

the GetInputFocus would also be sent on implicit flush (say on xcb_take_socket or buffer filling up or whatever), so it would always happen eventually.

might send a few more GetInputFocus but that's far less of a problem than GetInputFocus *round trips* in xcb_request_check

another sane approach I think would just require the app to send a getinputfocus and warn/complain on stderr if xcb_request_check is invoked without a non-void request since the void request was sent. i.e. make this the app's problem and declare it a bug if we get to xcb_request_check and it's going to hang.

I guess my point is, the blocking GetInputFocus round trip in xcb_request_check is never really desirable and apps really ought to avoid it or XCB should avoid it for them.

But, all this is more future enhancement, I know there's a much smaller fix to the immediate bug here.
Comment 6 Jamey Sharp 2010-08-24 09:32:05 UTC
Created attachment 38126 [details]
minimal test case

This test app reproduces the hang you described. However, the patch I proposed does fix this test case (so I pushed that patch to master).

Would you post a minimal test case that still fails on libxcb git master please?
Comment 7 Jamey Sharp 2010-08-24 09:48:50 UTC
(In reply to comment #4)
> That still hangs in my test, in the backtrace looking at the connection struct,
> gdb shows
>  request_expected = 17161, request_read = 17160, request_completed = 17159
> 
> At this point we are blocking for 17160 and stuck.

I assume you mean that cookie.sequence was 17160.

> A theory on this is that request_expected was < 17160 going into
> xcb_request_check, so we didn't GetInputFocus.

The condition was cookie.sequence > request_expected, so you've stated a case where it *would* sync, unless cookie.sequence wasn't also greater than request_completed at the time.

> Then we had sent some other void
> request, which resulted in an event with sequence 17161 while we were spinning
> in xcb_wait_for_reply(). This is too late to trigger the GetInputFocus.

That theory can't explain the sequence numbers you observed: If you got an event with sequence 17161, then request_read would be 17161 and request_completed would be 17160. Instead I think you must have gotten an event or reply (not an error) for 17160, and also sent a non-void request for 17161. Which raises the question of how you could be blocked indefinitely...

Perhaps that non-void request didn't get flushed? Maybe xcb_request_check needs to flush when it doesn't sync?

Also: Is your test app making requests from multiple threads? That's supposed to Just Work, but it would be important to know.

BTW, I want to think about your proposed API changes after this bug is sorted out; but I'm ignoring them until then. :-)
Comment 8 Havoc Pennington 2010-08-24 12:59:51 UTC
Looking at a sample hang again, and I happen to have XCB full of debug logging...

during the hang,

request_expected = 8493, request_read = 8492, request_completed = 8491

3  0x00a42752 in xcb_wait_for_reply (c=0x9d0a198, request=8492, e=0xbfb6419c) at xcb_in.c:512

i.e. wait_for_reply is blocking on 8492

some of my debug logging was:
       PAINT: GOT X REPLY 8486
       PAINT: EXPECTING X REPLY SEQ 8488
       PAINT: 0 bytes buffered out
       PAINT: GOT X REPLY 8488
       PAINT: xcb_writev request 8489
       PAINT: first of 1 requests opcode 148 minor_opcode 1
       PAINT: 0 bytes buffered out
       PAINT: xcb_writev request 8490
       PAINT: first of 1 requests opcode 148 minor_opcode 142
       PAINT: 20464 bytes buffered out
       PAINT: EXPECTING X REPLY SEQ 8491
       PAINT: EXPECTING X REPLY SEQ 8492
       PAINT: 216 bytes buffered out
       PAINT: GOT X REPLY 8491

ignore "PAINT" that is just what I'm debugging. "bytes buffered out" is number of bytes in the outgoing kernel buffer on the socket, ioctl(SIOCOUTQ). GOT X REPLY is in read_packet. EXPECTING X REPLY is _xcb_in_expect_reply.

I'll add some logging to flush and at the start of request_check and see if I can reproduce again. (My test case is "run huge app and do stuff for a while" so it's a little tricky.)
Comment 9 Havoc Pennington 2010-08-24 13:10:41 UTC
Here's a better log.

      PAINT: GOT X REPLY 2870
       PAINT: _xcb_wait_for_reply(2866) expected=2870 read=2870 completed=2869
       PAINT: _xcb_request_check(2867) expected=2870 read=2870 completed=2869
       PAINT: _xcb_wait_for_reply(2867) expected=2870 read=2870 completed=2869
       PAINT: xcb_writev request 2871
       PAINT: first of 1 requests opcode 148 minor_opcode 1
       PAINT: _xcb_conn_wait expected=2870 read=2870 completed=2869
       PAINT: 0 bytes buffered out
       PAINT: xcb_writev request 2872
       PAINT: first of 1 requests opcode 148 minor_opcode 142
       PAINT: _xcb_conn_wait expected=2870 read=2870 completed=2869
       PAINT: 0 bytes buffered out
       PAINT: EXPECTING X REPLY SEQ 2873
       PAINT: EXPECTING X REPLY SEQ 2874
       PAINT: _xcb_out_flush_to(2874) expected=2873 read=2870 completed=2869
       PAINT: _xcb_conn_wait expected=2873 read=2870 completed=2869
       PAINT: 216 bytes buffered out
       PAINT: _xcb_out_flush_to(2874) expected=2873 read=2870 completed=2869
       PAINT: _xcb_wait_for_reply(2873) expected=2875 read=2870 completed=2869
       PAINT: _xcb_out_flush_to(2873) expected=2875 read=2870 completed=2869
       PAINT: _xcb_conn_wait expected=2875 read=2870 completed=2869
       PAINT: GOT X REPLY 2873
       PAINT: _xcb_request_check(2874) expected=2875 read=2873 completed=2872
       PAINT: _xcb_wait_for_reply(2874) expected=2875 read=2873 completed=2872
       PAINT: _xcb_out_flush_to(2874) expected=2875 read=2873 completed=2872
       PAINT: _xcb_conn_wait expected=2875 read=2873 completed=2872
       PAINT: _xcb_conn_wait expected=2875 read=2874 completed=2873
       PAINT: _xcb_conn_wait expected=2875 read=2874 completed=2873
       PAINT: _xcb_conn_wait expected=2875 read=2874 completed=2873
       PAINT: _xcb_conn_wait expected=2875 read=2874 completed=2873
       PAINT: _xcb_conn_wait expected=2875 read=2874 completed=2873
[at this point we're spinning forever]
Comment 10 Havoc Pennington 2010-08-24 13:21:13 UTC
Even more detail. It looks like there's one unflushed request, which was your theory.

       PAINT: xcb_writev request 1217
       PAINT: first of 1 requests opcode 148 minor_opcode 142
       PAINT: START _xcb_out_send expected=1215 read=1215 completed=1214 out.request=1217 out.written=1216
       PAINT: _xcb_conn_wait expected=1215 read=1215 completed=1214
       PAINT: 133944 bytes buffered out
       PAINT: END   _xcb_out_send expected=1215 read=1215 completed=1214 out.request=1217 out.written=1217
       PAINT: _xcb_send_request expected=1215 read=1215 completed=1214
       PAINT: get_socket_back expected=1215 read=1215 completed=1214
       PAINT: _xcb_send_request assigned request=1218
       PAINT: EXPECTING X REPLY SEQ 1218
       PAINT: _xcb_send_request expected=1218 read=1215 completed=1214
       PAINT: get_socket_back expected=1218 read=1215 completed=1214
       PAINT: _xcb_send_request assigned request=1219
       PAINT: EXPECTING X REPLY SEQ 1219
       PAINT: _xcb_out_flush_to(1219) expected=1218 read=1215 completed=1214
       PAINT: START _xcb_out_send expected=1218 read=1215 completed=1214 out.request=1219 out.written=1217
       PAINT: _xcb_conn_wait expected=1218 read=1215 completed=1214
       PAINT: 216 bytes buffered out
       PAINT: END   _xcb_out_send expected=1218 read=1215 completed=1214 out.request=1219 out.written=1219
       PAINT: xcb_take_socket expected=1218 read=1215 completed=1214
       PAINT: get_socket_back expected=1218 read=1215 completed=1214
       PAINT: _xcb_out_flush_to(1219) expected=1218 read=1215 completed=1214
       PAINT: _xcb_send_request expected=1218 read=1215 completed=1214
       PAINT: get_socket_back expected=1218 read=1215 completed=1214
       PAINT: _xcb_send_request assigned request=1220
       PAINT: _xcb_wait_for_reply(1218) expected=1220 read=1215 completed=1214
       PAINT: _xcb_out_flush_to(1218) expected=1220 read=1215 completed=1214
       PAINT: _xcb_conn_wait expected=1220 read=1215 completed=1214
       PAINT: GOT X REPLY 1218
       PAINT: _xcb_request_check(1219) expected=1220 read=1218 completed=1217
       PAINT: _xcb_wait_for_reply(1219) expected=1220 read=1218 completed=1217
       PAINT: _xcb_out_flush_to(1219) expected=1220 read=1218 completed=1217
       PAINT: _xcb_conn_wait expected=1220 read=1218 completed=1217
       PAINT: _xcb_conn_wait expected=1220 read=1219 completed=1218
       PAINT: _xcb_conn_wait expected=1220 read=1219 completed=1218
       PAINT: _xcb_conn_wait expected=1220 read=1219 completed=1218
       PAINT: _xcb_conn_wait expected=1220 read=1219 completed=1218
       PAINT: _xcb_conn_wait expected=1220 read=1219 completed=1218
       PAINT: _xcb_conn_wait expected=1220 read=1219 completed=1218
Comment 11 Havoc Pennington 2010-08-24 13:26:57 UTC
I'm stupid. All those logs lack the fix to do >=request_expected.
Comment 12 Havoc Pennington 2010-08-24 13:28:36 UTC
A log WITH the fix (still seems to be missing flush issue)

       PAINT: xcb_writev request 13972
       PAINT: first of 1 requests opcode 148 minor_opcode 142
       PAINT: START _xcb_out_send expected=13970 read=13970 completed=13969 out.request=13972 out.written=13971
       PAINT: _xcb_conn_wait expected=13970 read=13970 completed=13969
       PAINT: 0 bytes buffered out
       PAINT: END   _xcb_out_send expected=13970 read=13970 completed=13969 out.request=13972 out.written=13972
       PAINT: _xcb_send_request expected=13970 read=13970 completed=13969
       PAINT: get_socket_back expected=13970 read=13970 completed=13969
       PAINT: _xcb_send_request assigned request=13973
       PAINT: EXPECTING X REPLY SEQ 13973
       PAINT: _xcb_send_request expected=13973 read=13970 completed=13969
       PAINT: get_socket_back expected=13973 read=13970 completed=13969
       PAINT: _xcb_send_request assigned request=13974
       PAINT: EXPECTING X REPLY SEQ 13974
       PAINT: _xcb_out_flush_to(13974) expected=13973 read=13970 completed=13969
       PAINT: START _xcb_out_send expected=13973 read=13970 completed=13969 out.request=13974 out.written=13972
       PAINT: _xcb_conn_wait expected=13973 read=13970 completed=13969
       PAINT: 216 bytes buffered out
       PAINT: END   _xcb_out_send expected=13973 read=13970 completed=13969 out.request=13974 out.written=13974
       PAINT: xcb_take_socket expected=13973 read=13970 completed=13969
       PAINT: get_socket_back expected=13973 read=13970 completed=13969
       PAINT: _xcb_out_flush_to(13974) expected=13973 read=13970 completed=13969
       PAINT: _xcb_send_request expected=13973 read=13970 completed=13969
       PAINT: get_socket_back expected=13973 read=13970 completed=13969
       PAINT: _xcb_send_request assigned request=13975
       PAINT: _xcb_wait_for_reply(13973) expected=13975 read=13970 completed=13969
       PAINT: _xcb_out_flush_to(13973) expected=13975 read=13970 completed=13969
       PAINT: _xcb_conn_wait expected=13975 read=13970 completed=13969
       PAINT: GOT X REPLY 13973
       PAINT: _xcb_request_check(13974) expected=13975 read=13973 completed=13972
       PAINT: _xcb_wait_for_reply(13974) expected=13975 read=13973 completed=13972
       PAINT: _xcb_out_flush_to(13974) expected=13975 read=13973 completed=13972
       PAINT: _xcb_conn_wait expected=13975 read=13973 completed=13972
       PAINT: _xcb_conn_wait expected=13975 read=13974 completed=13973
Comment 13 Havoc Pennington 2010-08-24 13:32:30 UTC
... and if I add xcb_flush() to xcb_request_check(), if not doing GetInputFocus, then I haven't reproduced the hang after a good bit longer than it usually takes.
So that looks like it.

Re: API ideas, just throwing ideas out there. Feel free to ignore or maybe you want to open a new bug if you decide you want to make a change.

Thanks
Comment 14 Jamey Sharp 2011-03-18 23:32:11 UTC
Hey Havoc, sorry I dropped this. Since your last report, I've cleaned up xcb_request_check to hold locks appropriately and use XCB_REQUEST_DISCARD_REPLY on syncs, and Rami noticed some issues with threads not always being awakened as soon as they could. And as far as I can tell, all necessary requests are flushed before it waits for the reply. I don't currently see any way your tests could be failing. :-)

So could you re-test against libxcb git master and report whether all is well now? Thanks in advance.
Comment 15 Havoc Pennington 2011-03-25 12:10:00 UTC
The test setup here was a bunch of local hacks to a large codebase that's since been changed all around, so it would probably take me hours to figure it out again (at least). If you think it's fixed we should just go with that probably.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct.