Bug 71338

Summary: Application crashes after many hours of drawing
Product: xorg Reporter: alexander.elbs
Component: Lib/XlibAssignee: Xorg Project Team <xorg-team>
Status: RESOLVED FIXED QA Contact: Xorg Project Team <xorg-team>
Severity: normal    
Priority: medium CC: jh4913895, jnsptrsn1, me4xorg, psychon, xcb
Version: unspecified   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description Flags
Reproducer
none
patch to fix
none
temporary traces for libxcb to track down bug none

Description alexander.elbs 2013-11-07 09:17:13 UTC
We have an application that does a lot of XDrawString and XDrawLine. After several hours the application is exited by an XIOError.

The XIOError is called in libX11 in the file xcb_io.c, function _XReply.
It didn't get a response from xcb_wait_for_reply.

libxcb 1.5 is fine, libxcb 1.8.1 is not.
Bisecting libxcb points to this commit:

  commit ed37b087519ecb9e74412e4df8f8a217ab6d12a9
  Author: Jamey Sharp <jamey@minilop.net>
  Date:   Sat Oct 9 17:13:45 2010 -0700

    xcb_in: Use 64-bit sequence numbers internally everywhere.
    
    Widen sequence numbers on entry to those public APIs that still take
    32-bit sequence numbers.
    
    Signed-off-by: Jamey Sharp <jamey@minilop.net>

Reverting it on top of 1.8.1 helps.

Adding traces to libxcb I found that the last request numbers used for xcb_wait_for_reply are these: 4294900463 and 4294965487 (two calls in the while loop of the _XReply function), half a second later: 63215 (then XIOError is called).
The widen_request is also 63215, I would have expected 63215+2^32.
Therefore it seems that the request is not correctly widened.

The commit above also changed the compares in poll_for_reply from XCB_SEQUENCE_COMPARE_32 to XCB_SEQUENCE_COMPARE.
Maybe the widening never worked correctly, but it was never observed, because only the lower 32bits were compared.
Comment 1 Uli Schlachter 2013-11-10 21:13:28 UTC
This could be a bug in libX11. There is a patch series titled "[PATCH libX11 0/2] Xlib 32-bit sequence number wrap bug(fix)" on xorg-devel right now. Sadly, fdo seems to be having problems and the mail archive is down.
Comment 2 alexander.elbs 2013-11-11 08:28:31 UTC
Hello,

yes this is exactly my problem. By now I also wrote a tiny reproducer that just does:

  for(;;) {
    XDrawLine(dpy, w, gc, 10, 60, 180, 20);
    XFlush(dpy);
  }

To fix _XSend I changed the requests type from uint64_t to unsigned long. I didn't find the second location though.

Regards,
Alexander
Comment 3 alexander.elbs 2013-11-11 08:29:48 UTC
Created attachment 88996 [details]
Reproducer

On my machine it takes ~2h to reproduce the issue. Only 32bit platforms are affected.
Comment 4 alexander.elbs 2013-11-11 08:34:58 UTC
Technical explanation:
http://lists.x.org/archives/xorg-devel/2013-October/038370.html
Comment 5 alexander.elbs 2013-11-11 08:54:10 UTC
Created attachment 89001 [details]
patch to fix

This patch is simpler than what is proposed by Jonas Petersen. But both should fix the issue.
Comment 6 Uli Schlachter 2013-11-11 13:36:18 UTC
(In reply to comment #3)
> Created attachment 88996 [details]
> Reproducer
> 
> On my machine it takes ~2h to reproduce the issue. Only 32bit platforms are
> affected.

I think that this becomes a lot faster if you use XNoOp() instead of XDrawLine() and remove the call to XFlush().

However, I couldn't reproduce this issue on a FreeBSD system (but that might be due to too old library versions).
Comment 7 alexander.elbs 2013-11-11 13:56:30 UTC
Created attachment 89026 [details]
temporary traces for libxcb to track down bug

I used these traces to track down the bug.
Comment 8 alexander.elbs 2013-11-11 14:08:24 UTC
So far I think these conditions are needed to reproduce:
- libxcb >= 1.8 (i.e. includes the commit ed37b08)
- compiled with 32bit: gcc -m32 -lX11 -o xdraw xdraw.c
- the sequence counter wraps. With the additional traces from the libxcb patch I just added you should see something like that:

[...removed ~45MB...]
xcb_wait_for_reply: request=4294900234, c->out.request=4294965258 widen_request=4294900234
wait_for_reply: request=4294900234
insert_reader: request=4294900234
poll_for_reply: request < c->in.request_read: 4294900234 < 4294965258
poll_for_reply: return 1
remove_reader: reader->request=4294900234
xcb_wait_for_reply: ret=NULL
xcb_wait_for_reply: request=4294965258, c->out.request=4294965258 widen_request=4294965258
wait_for_reply: request=4294965258
insert_reader: request=4294965258
poll_for_reply: request == c->in.request_read && c->in.current_reply: 4294965258 == 4294965258 && 0
poll_for_reply: return 1
remove_reader: reader->request=4294965258
xcb_wait_for_reply: ret.response_type=1, ret.sequence=63498
xcb_writev: requests=18446744069414584321, c->out.request=0, old out.request=4294967295
backtrace() returned 8 addresses
/tmp/lib_draw/libxcb.so.1(printBacktrace+0x2b) [0x55866302]
/tmp/lib_draw/libxcb.so.1(xcb_writev+0xde) [0x558656a3]
/usr/lib/libX11.so.6(_XSend+0x151) [0x555c5bd1]
/usr/lib/libX11.so.6(_XFlush+0x3a) [0x555c661a]
/usr/lib/libX11.so.6(XFlush+0x32) [0x555a6812]
/net/homes/elbs/work/xdraw() [0x8048b4a]
/lib/libc.so.6(__libc_start_main+0xe6) [0x556dbce6]
/net/homes/elbs/work/xdraw() [0x80487a1]
poll_for_reply: request=18446744073709549578
backtrace() returned 10 addresses
/tmp/lib_draw/libxcb.so.1(printBacktrace+0x2b) [0x55866302]
/tmp/lib_draw/libxcb.so.1(+0xa3e7) [0x558663e7]
/tmp/lib_draw/libxcb.so.1(xcb_poll_for_reply+0xae) [0x55866d34]
/usr/lib/libX11.so.6(+0x3a250) [0x555c6250]
/usr/lib/libX11.so.6(_XEventsQueued+0x7f) [0x555c659f]
/usr/lib/libX11.so.6(_XFlush+0x4a) [0x555c662a]
/usr/lib/libX11.so.6(XFlush+0x32) [0x555a6812]
/net/homes/elbs/work/xdraw() [0x8048b4a]
/lib/libc.so.6(__libc_start_main+0xe6) [0x556dbce6]
/net/homes/elbs/work/xdraw() [0x80487a1]
poll_for_reply: request < c->in.request_read: 18446744073709549578 < 4294965258
poll_for_reply: return 1
xcb_wait_for_reply: request=62986, c->out.request=62986 widen_request=62986
wait_for_reply: request=62986
insert_reader: request=62986
poll_for_reply: request < c->in.request_read: 62986 < 4294965258
poll_for_reply: return 1
remove_reader: reader->request=62986
xcb_wait_for_reply: ret=NULL
ERROR Received a X IO error on display=8cc9008.
backtrace() returned 10 addresses
/net/homes/elbs/work/xdraw() [0x8048853]
/net/homes/elbs/work/xdraw() [0x80488f4]
/usr/lib/libX11.so.6(_XIOError+0x57) [0x555c77b7]
/usr/lib/libX11.so.6(_XReply+0x3d5) [0x555c6075]
/usr/lib/libX11.so.6(+0x3c36f) [0x555c836f]
/usr/lib/libX11.so.6(+0x3c485) [0x555c8485]
/usr/lib/libX11.so.6(XDrawLine+0x9b) [0x555a439b]
/net/homes/elbs/work/xdraw() [0x8048b3b]
/lib/libc.so.6(__libc_start_main+0xe6) [0x556dbce6]
/net/homes/elbs/work/xdraw() [0x80487a1]
Comment 9 alexander.elbs 2013-11-11 14:19:34 UTC
yes, with a for loop like this:

  for(;;) {
    XNoOp(dpy);
  }

the crash happens after a few minutes.
Comment 10 Franz Reinhardt 2013-12-04 15:03:09 UTC
*** Bug 29144 has been marked as a duplicate of this bug. ***
Comment 11 Franz Reinhardt 2013-12-04 15:51:22 UTC
(In reply to comment #4)
> Technical explanation:
> http://lists.x.org/archives/xorg-devel/2013-October/038370.html

http://lists.x.org/archives/xorg-devel/2013-November/039007.html

Where is the final patch available ?
Comment 12 Franz Reinhardt 2013-12-05 07:48:50 UTC
ok I found them.
http://patchwork.freedesktop.org/project/Xorg/list/
Comment 13 alexander.elbs 2013-12-05 13:27:11 UTC
Hello,

you could also have a look at the fix attached to this bug report (second attachment). It is simpler (no need for a temporary variable).
But I haven't submitted it to 'patchwork' -- didn't know about that.

Regards,
Alexander
Comment 14 Franz Reinhardt 2013-12-09 09:54:35 UTC
(In reply to comment #13)
> Hello,
> 
> you could also have a look at the fix attached to this bug report (second
> attachment). It is simpler (no need for a temporary variable).
> But I haven't submitted it to 'patchwork' -- didn't know about that.
> 
> Regards,
> Alexander

Hi Alexander,

the truth is that I don't completely understand the design in relation to this issue. For me, it's all mixed up with unsigned long and uint64. And should work on 32 and 64 bit systems as well. So why not just use 64 bit for all platforms ?
The perfect solution would be, if someone really understands this piece of code, would provide a proper solution for this.

Best regards,

Franz
Comment 15 Franz Reinhardt 2013-12-09 16:42:38 UTC
(In reply to comment #12)
> ok I found them.
> http://patchwork.freedesktop.org/project/Xorg/list/

http://patchwork.freedesktop.org/patch/16297/
Comment 16 Jonas Petersen 2013-12-10 00:48:05 UTC
Hi,

the patch by Alexander has a nice twist by using unsinged long for the 'request' variable. This indeed does the job for one of the two affected locations. It accomplishes that (on 32 bit systems) by wrapping an unsigned 64 below 0 and then truncating it with a cast to an unsigned 32. It does the math, not sure if this is good practice or a reliable technique on all platforms though. Also, maybe it's not so intuitive by obfuscating some logic? Just some guesses. Might be the way to go.

But as I can see, the patch also has a flaw. Simply casting 'sequence' to unsigned long in the for-loop condition won't be sufficient. The loop (and comment) implies that 'request' might be 2 or more ahead of 'last_flushed'. Now assume that 'last_flushed' is 0xfffffffe and 'request' has just wrapped to 0. The loop will start with 'sequence = last_flushed + 1' which is 0xffffffff. The cast to unsigned long will have no effect now.  'sequence <= request' will be false and the body of the for-loop is executed not even once. Even though we have two pending requests.

Looking at it, the variable 'request' itself is actually unnecessary (then again this might be one of those spots where it gets optimized out by the compiler anyway...). There would even be a way without any of those variables ('requests' and 'unwrapped_request'), but this would require putting the unwrapping into the for-loop condition, plus optionally the above "trick". This could save a variable at the cost of some efficiency.

Anyway, maybe we should see what the list thinks about the casting logic. If that's a valid option then the temporary (unwrapped) variable could be placed into the for-loop only, possibly optimizing the likely path a bit.

Regarding the mix up of 32 and 64 bit:
Xlib completely uses unsigned long internally. Explicit 64-bit is used only in the XCB connection tier (which is limited to xcb_io.c and Xxcbint.h). Maybe it's a compatibility thing, or it has more impact than we could think of right now?

Regards,
Jonas
Comment 17 dave.kinsell 2015-01-07 21:23:04 UTC
I ran into this on a product we sell, normally running on RedHat or CentOS.  With a fully patched RHEL6u6 (64bit), it's still there for 32 bits executables,  using libxcb-1.9.1-2.  One of our end customers created a program just displaying the time in a window, in a tight loop, which caused crashes in about 32 hrs.  Testing   while (1) XNoOp(dsp);   it dies in less than five minutes on modest hardware.

For a bug that apparently has been around since 2006, it's still causing headaches in 2015.  Are the vendors ever going to push a fix out for this thing?
Comment 18 Uli Schlachter 2015-05-14 07:14:58 UTC
Some new API was added to libxcb for this:

commit cb621341a62e6d2233db3e337611f6fdd4f675a6
Author: Christian Linhart <chris@demorecorder.com>
Date:   Wed Apr 29 09:11:37 2015 +0200

    expose 64-bit sequence numbers for XLib
    
    While XCB uses 64-bit sequence number internally, it only exposes
    "unsigned int" so that, on 32-bit architecture, Xlib based applications
    may see their sequence number wrap which causes the connection to the X
    server to be lost.
    
    Expose 64-bit sequence number from XCB API so that Xlib and others can
    use it even on 32-bit environment.
    
    This implies the following API addition:
    
      xcb_send_request64()
      xcb_discard_reply64()
      xcb_wait_for_reply64()
      xcb_poll_for_reply64()
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=71338
    
    Reviewed-by: Uli Schlachter <psychon@znc.in>
    Signed-off-by: Christian Linhart <chris@demorecorder.com>
    Signed-off-by: Olivier Fourdan <ofourdan@redhat.com>
Comment 19 jh4913895 2015-06-10 19:39:29 UTC
Our QA team ran into this issue recently.  Upon further testing, the XNoOp test still causes the IO error for a 32-bit program with libxcb 1.10-3 on Debian 8 and libxcb 1.10-2 on Ubuntu 14.04 LTS.  I don't see any mention of a fix related to this issue in the libxcb 1.11 announcement.

Are there any plans to make a fix for this?
Comment 20 Olivier Fourdan 2015-06-11 07:45:30 UTC
(In reply to jh4913895 from comment #19)
> Our QA team ran into this issue recently.  Upon further testing, the XNoOp
> test still causes the IO error for a 32-bit program with libxcb 1.10-3 on
> Debian 8 and libxcb 1.10-2 on Ubuntu 14.04 LTS.  I don't see any mention of
> a fix related to this issue in the libxcb 1.11 announcement.

Because libxcb-1.11 predates the fix, see comment #18
 
> Are there any plans to make a fix for this?

The fix is made of 2 different patches for 2 different packages.

1. XCB needs to expose the 64bit sequence number by an API addition, this is comment #18

2. Xlib needs to make use of it, that's a separate (bigger) patch for libX11.

But for #2 to happen, it requires an official release of libxcb that includes #1 (not 1.11) so that libX11 can depend on the appropriate XCB version that exposes the 64bit API.
Comment 21 Olivier Fourdan 2015-09-22 12:04:39 UTC
Both fixes for libxcb and libX11 have been merged:

http://cgit.freedesktop.org/xcb/libxcb/commit/?id=0ba6fe9

http://cgit.freedesktop.org/xorg/lib/libX11/commit/?id=a72d2d0

Requires libxcb-1.11.1

http://cgit.freedesktop.org/xcb/libxcb/tag/?id=1.11.1

=> Closing

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.