Summary: | Application crashes after many hours of drawing | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Product: | xorg | Reporter: | alexander.elbs | ||||||||
Component: | Lib/Xlib | Assignee: | Xorg Project Team <xorg-team> | ||||||||
Status: | RESOLVED FIXED | QA Contact: | Xorg Project Team <xorg-team> | ||||||||
Severity: | normal | ||||||||||
Priority: | medium | CC: | jh4913895, jnsptrsn1, me4xorg, psychon, xcb | ||||||||
Version: | unspecified | ||||||||||
Hardware: | x86-64 (AMD64) | ||||||||||
OS: | Linux (All) | ||||||||||
Whiteboard: | |||||||||||
i915 platform: | i915 features: | ||||||||||
Attachments: |
|
Description
alexander.elbs
2013-11-07 09:17:13 UTC
This could be a bug in libX11. There is a patch series titled "[PATCH libX11 0/2] Xlib 32-bit sequence number wrap bug(fix)" on xorg-devel right now. Sadly, fdo seems to be having problems and the mail archive is down. Hello, yes this is exactly my problem. By now I also wrote a tiny reproducer that just does: for(;;) { XDrawLine(dpy, w, gc, 10, 60, 180, 20); XFlush(dpy); } To fix _XSend I changed the requests type from uint64_t to unsigned long. I didn't find the second location though. Regards, Alexander Created attachment 88996 [details]
Reproducer
On my machine it takes ~2h to reproduce the issue. Only 32bit platforms are affected.
Technical explanation: http://lists.x.org/archives/xorg-devel/2013-October/038370.html Created attachment 89001 [details]
patch to fix
This patch is simpler than what is proposed by Jonas Petersen. But both should fix the issue.
(In reply to comment #3) > Created attachment 88996 [details] > Reproducer > > On my machine it takes ~2h to reproduce the issue. Only 32bit platforms are > affected. I think that this becomes a lot faster if you use XNoOp() instead of XDrawLine() and remove the call to XFlush(). However, I couldn't reproduce this issue on a FreeBSD system (but that might be due to too old library versions). Created attachment 89026 [details]
temporary traces for libxcb to track down bug
I used these traces to track down the bug.
So far I think these conditions are needed to reproduce: - libxcb >= 1.8 (i.e. includes the commit ed37b08) - compiled with 32bit: gcc -m32 -lX11 -o xdraw xdraw.c - the sequence counter wraps. With the additional traces from the libxcb patch I just added you should see something like that: [...removed ~45MB...] xcb_wait_for_reply: request=4294900234, c->out.request=4294965258 widen_request=4294900234 wait_for_reply: request=4294900234 insert_reader: request=4294900234 poll_for_reply: request < c->in.request_read: 4294900234 < 4294965258 poll_for_reply: return 1 remove_reader: reader->request=4294900234 xcb_wait_for_reply: ret=NULL xcb_wait_for_reply: request=4294965258, c->out.request=4294965258 widen_request=4294965258 wait_for_reply: request=4294965258 insert_reader: request=4294965258 poll_for_reply: request == c->in.request_read && c->in.current_reply: 4294965258 == 4294965258 && 0 poll_for_reply: return 1 remove_reader: reader->request=4294965258 xcb_wait_for_reply: ret.response_type=1, ret.sequence=63498 xcb_writev: requests=18446744069414584321, c->out.request=0, old out.request=4294967295 backtrace() returned 8 addresses /tmp/lib_draw/libxcb.so.1(printBacktrace+0x2b) [0x55866302] /tmp/lib_draw/libxcb.so.1(xcb_writev+0xde) [0x558656a3] /usr/lib/libX11.so.6(_XSend+0x151) [0x555c5bd1] /usr/lib/libX11.so.6(_XFlush+0x3a) [0x555c661a] /usr/lib/libX11.so.6(XFlush+0x32) [0x555a6812] /net/homes/elbs/work/xdraw() [0x8048b4a] /lib/libc.so.6(__libc_start_main+0xe6) [0x556dbce6] /net/homes/elbs/work/xdraw() [0x80487a1] poll_for_reply: request=18446744073709549578 backtrace() returned 10 addresses /tmp/lib_draw/libxcb.so.1(printBacktrace+0x2b) [0x55866302] /tmp/lib_draw/libxcb.so.1(+0xa3e7) [0x558663e7] /tmp/lib_draw/libxcb.so.1(xcb_poll_for_reply+0xae) [0x55866d34] /usr/lib/libX11.so.6(+0x3a250) [0x555c6250] /usr/lib/libX11.so.6(_XEventsQueued+0x7f) [0x555c659f] /usr/lib/libX11.so.6(_XFlush+0x4a) [0x555c662a] /usr/lib/libX11.so.6(XFlush+0x32) [0x555a6812] /net/homes/elbs/work/xdraw() [0x8048b4a] /lib/libc.so.6(__libc_start_main+0xe6) [0x556dbce6] /net/homes/elbs/work/xdraw() [0x80487a1] poll_for_reply: request < c->in.request_read: 18446744073709549578 < 4294965258 poll_for_reply: return 1 xcb_wait_for_reply: request=62986, c->out.request=62986 widen_request=62986 wait_for_reply: request=62986 insert_reader: request=62986 poll_for_reply: request < c->in.request_read: 62986 < 4294965258 poll_for_reply: return 1 remove_reader: reader->request=62986 xcb_wait_for_reply: ret=NULL ERROR Received a X IO error on display=8cc9008. backtrace() returned 10 addresses /net/homes/elbs/work/xdraw() [0x8048853] /net/homes/elbs/work/xdraw() [0x80488f4] /usr/lib/libX11.so.6(_XIOError+0x57) [0x555c77b7] /usr/lib/libX11.so.6(_XReply+0x3d5) [0x555c6075] /usr/lib/libX11.so.6(+0x3c36f) [0x555c836f] /usr/lib/libX11.so.6(+0x3c485) [0x555c8485] /usr/lib/libX11.so.6(XDrawLine+0x9b) [0x555a439b] /net/homes/elbs/work/xdraw() [0x8048b3b] /lib/libc.so.6(__libc_start_main+0xe6) [0x556dbce6] /net/homes/elbs/work/xdraw() [0x80487a1] yes, with a for loop like this: for(;;) { XNoOp(dpy); } the crash happens after a few minutes. *** Bug 29144 has been marked as a duplicate of this bug. *** (In reply to comment #4) > Technical explanation: > http://lists.x.org/archives/xorg-devel/2013-October/038370.html http://lists.x.org/archives/xorg-devel/2013-November/039007.html Where is the final patch available ? ok I found them. http://patchwork.freedesktop.org/project/Xorg/list/ Hello, you could also have a look at the fix attached to this bug report (second attachment). It is simpler (no need for a temporary variable). But I haven't submitted it to 'patchwork' -- didn't know about that. Regards, Alexander (In reply to comment #13) > Hello, > > you could also have a look at the fix attached to this bug report (second > attachment). It is simpler (no need for a temporary variable). > But I haven't submitted it to 'patchwork' -- didn't know about that. > > Regards, > Alexander Hi Alexander, the truth is that I don't completely understand the design in relation to this issue. For me, it's all mixed up with unsigned long and uint64. And should work on 32 and 64 bit systems as well. So why not just use 64 bit for all platforms ? The perfect solution would be, if someone really understands this piece of code, would provide a proper solution for this. Best regards, Franz (In reply to comment #12) > ok I found them. > http://patchwork.freedesktop.org/project/Xorg/list/ http://patchwork.freedesktop.org/patch/16297/ Hi, the patch by Alexander has a nice twist by using unsinged long for the 'request' variable. This indeed does the job for one of the two affected locations. It accomplishes that (on 32 bit systems) by wrapping an unsigned 64 below 0 and then truncating it with a cast to an unsigned 32. It does the math, not sure if this is good practice or a reliable technique on all platforms though. Also, maybe it's not so intuitive by obfuscating some logic? Just some guesses. Might be the way to go. But as I can see, the patch also has a flaw. Simply casting 'sequence' to unsigned long in the for-loop condition won't be sufficient. The loop (and comment) implies that 'request' might be 2 or more ahead of 'last_flushed'. Now assume that 'last_flushed' is 0xfffffffe and 'request' has just wrapped to 0. The loop will start with 'sequence = last_flushed + 1' which is 0xffffffff. The cast to unsigned long will have no effect now. 'sequence <= request' will be false and the body of the for-loop is executed not even once. Even though we have two pending requests. Looking at it, the variable 'request' itself is actually unnecessary (then again this might be one of those spots where it gets optimized out by the compiler anyway...). There would even be a way without any of those variables ('requests' and 'unwrapped_request'), but this would require putting the unwrapping into the for-loop condition, plus optionally the above "trick". This could save a variable at the cost of some efficiency. Anyway, maybe we should see what the list thinks about the casting logic. If that's a valid option then the temporary (unwrapped) variable could be placed into the for-loop only, possibly optimizing the likely path a bit. Regarding the mix up of 32 and 64 bit: Xlib completely uses unsigned long internally. Explicit 64-bit is used only in the XCB connection tier (which is limited to xcb_io.c and Xxcbint.h). Maybe it's a compatibility thing, or it has more impact than we could think of right now? Regards, Jonas I ran into this on a product we sell, normally running on RedHat or CentOS. With a fully patched RHEL6u6 (64bit), it's still there for 32 bits executables, using libxcb-1.9.1-2. One of our end customers created a program just displaying the time in a window, in a tight loop, which caused crashes in about 32 hrs. Testing while (1) XNoOp(dsp); it dies in less than five minutes on modest hardware. For a bug that apparently has been around since 2006, it's still causing headaches in 2015. Are the vendors ever going to push a fix out for this thing? Some new API was added to libxcb for this: commit cb621341a62e6d2233db3e337611f6fdd4f675a6 Author: Christian Linhart <chris@demorecorder.com> Date: Wed Apr 29 09:11:37 2015 +0200 expose 64-bit sequence numbers for XLib While XCB uses 64-bit sequence number internally, it only exposes "unsigned int" so that, on 32-bit architecture, Xlib based applications may see their sequence number wrap which causes the connection to the X server to be lost. Expose 64-bit sequence number from XCB API so that Xlib and others can use it even on 32-bit environment. This implies the following API addition: xcb_send_request64() xcb_discard_reply64() xcb_wait_for_reply64() xcb_poll_for_reply64() Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=71338 Reviewed-by: Uli Schlachter <psychon@znc.in> Signed-off-by: Christian Linhart <chris@demorecorder.com> Signed-off-by: Olivier Fourdan <ofourdan@redhat.com> Our QA team ran into this issue recently. Upon further testing, the XNoOp test still causes the IO error for a 32-bit program with libxcb 1.10-3 on Debian 8 and libxcb 1.10-2 on Ubuntu 14.04 LTS. I don't see any mention of a fix related to this issue in the libxcb 1.11 announcement. Are there any plans to make a fix for this? (In reply to jh4913895 from comment #19) > Our QA team ran into this issue recently. Upon further testing, the XNoOp > test still causes the IO error for a 32-bit program with libxcb 1.10-3 on > Debian 8 and libxcb 1.10-2 on Ubuntu 14.04 LTS. I don't see any mention of > a fix related to this issue in the libxcb 1.11 announcement. Because libxcb-1.11 predates the fix, see comment #18 > Are there any plans to make a fix for this? The fix is made of 2 different patches for 2 different packages. 1. XCB needs to expose the 64bit sequence number by an API addition, this is comment #18 2. Xlib needs to make use of it, that's a separate (bigger) patch for libX11. But for #2 to happen, it requires an official release of libxcb that includes #1 (not 1.11) so that libX11 can depend on the appropriate XCB version that exposes the 64bit API. Both fixes for libxcb and libX11 have been merged: http://cgit.freedesktop.org/xcb/libxcb/commit/?id=0ba6fe9 http://cgit.freedesktop.org/xorg/lib/libX11/commit/?id=a72d2d0 Requires libxcb-1.11.1 http://cgit.freedesktop.org/xcb/libxcb/tag/?id=1.11.1 => Closing |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.