Bug 27552

Summary: Lots of processes crash in XCreatePixmap() with _XAllocID: Assertion `ret != inval_id' failed
Product: xorg Reporter: Martin Pitt <martin.pitt>
Component: Lib/XlibAssignee: Jamey Sharp <jamey>
Status: NEW --- QA Contact: Xorg Project Team <xorg-team>
Severity: normal    
Priority: medium CC: degeneracypressure, jeremyhu, joris.guisson, xcb
Version: unspecified   
Hardware: All   
OS: All   
Whiteboard: 2011BRB_Reviewed

Description Martin Pitt 2010-04-09 06:02:34 UTC
libx11, 1.3.2 (sorry, I didn't see a libX11 component)

We are getting tons of crash reports in Ubuntu (https://launchpad.net/bugs/507062) about programs crashing with

  _XAllocID: Assertion `ret != inval_id' failed

At first we thought this would be the same problem that was recently discussed and fixed in libXext:

http://lists.freedesktop.org/archives/xcb/2009-October/005102.html
http://cgit.freedesktop.org/xorg/lib/libXext/commit/?id=956fd30e1046e5779ac0b6c07ec4f0e87250869a

However, we already have that fixed version, and the stack traces of above bug reports does not go through XShmAttach(), so it does seem to be a different cause.

They all have this piece in common:

#4  0xb74d7199 in _XAllocID (dpy=0x8116770) at ../../src/xcb_io.c:378
	ret = 4294967295
	__PRETTY_FUNCTION__ = "_XAllocID"
#5  0xb74ad048 in XCreatePixmap (dpy=0x8116770, d=265, width=24, height=24, 
    depth=32) at ../../src/CrPixmap.c:58

i. e. they all come through XCreatePixmap() (which is called from various functions in the duplicate bugs, like XcursorImageLoadCursor(),  _cairo_xlib_surface_create_similar_with_format(), etc.)

I checked that the current libX11's XCreatePixmap() already calls _XAllocID() in a LockDisplay() block, so it's not the same cause as the recent libXext fix.

Beyond that I'm afraid I don't know enough about this API to be able to continue debugging on my own. Obviously nothing must call _XAllocID() two times in succession without an _XIDHandler() in between (the only other place where next_xid is set is _XConnectXCB(), but that's only called on program initialization through XOpenDisplay(), right?)

Do you have some further hints how to debug this, or what could go wrong here?

For reference, here are some links to the full stack traces:
http://launchpadlibrarian.net/23717282/Stacktrace.txt
http://launchpadlibrarian.net/35702381/Stacktrace.txt
http://launchpadlibrarian.net/39577843/Stacktrace.txt
http://launchpadlibrarian.net/37855566/Stacktrace.txt

Thanks!
Comment 1 Jamey Sharp 2010-04-09 17:35:03 UTC
It would be really nice to know where the last call to XIDAlloc was before the one that fails, and for that I'd love to have a minimal test case.

Any chance these applications are using libX11 from more than one thread? I notice in some stack traces, like <http://launchpadlibrarian.net/31837770/ThreadStacktrace.txt> that a second thread is waiting in libpulse; and in <http://launchpadlibrarian.net/31032259/ThreadStacktrace.txt>, g_main_loop_run is in a different thread than the cairo call that precedes the assert.

I could believe there's a race in how Xlib asks XCB for XIDs, but I'd like to know if that would be a plausible explanation for the reports you're actually seeing. It looks like a fair number of the reported stack traces are single-threaded, making that an unlikely explanation for those reports. :-) It's possible there are two separate bugs here, I suppose...

In lieu of a minimal test, perhaps you can tell me which X extensions these applications have in common? Perhaps the client library for some extension allocates an XID but is missing a SyncHandle call; if the next X request also tries to allocate an XID you'd get this assertion failure.

I've just reviewed everything in xorg/lib, just in case, for bad use of SyncHandle, XAllocID, or XAllocIDs. I found quite a few things to quibble with, but nothing that seems likely to have caused these reports. I've pushed fixes in these modules, so you can check whether anything there seems plausible.

libXcomposite/
libXdamage/
libXext/
libXfixes/
libXi/
libXrender/
libXTrap/

In short, I don't have a good hypothesis yet.

> http://launchpadlibrarian.net/23717282/Stacktrace.txt

I'm ignoring this one: the application was exiting, and a destructor tried to access the Display to free a resource. I'd guess the display was already gone at that point. I've seen that kind of pattern in quite a few Qt stack traces.
Comment 2 Martin Pitt 2010-04-15 05:48:01 UTC
(In reply to comment #1)
> It would be really nice to know where the last call to XIDAlloc was before the
> one that fails, and for that I'd love to have a minimal test case.

Indeed; I'm afraid I can't say more about this myself, but I asked for a reproduction recipe on the downstream bug a few days ago. Once we have a way to reproduce this, we can strip that down to a test case.

> Any chance these applications are using libX11 from more than one thread?

Some certainly do, but amongst the many duplicates are a lot of progams which are single-threaded; for example services-admin (from gnome-system-tools, LP#432122) or gigolo (LP#468544)

> It looks like a fair number of the reported stack traces are
> single-threaded, making that an unlikely explanation for those reports. :-)
> It's possible there are two separate bugs here, I suppose...

Possibly. It seems that about half of the reports are in gnome-panel and firefox, but since they are also much more popular than the apps mentioned above that kind of bias is expected.

> In lieu of a minimal test, perhaps you can tell me which X extensions these
> applications have in common? Perhaps the client library for some extension
> allocates an XID but is missing a SyncHandle call; if the next X request also
> tries to allocate an XID you'd get this assertion failure.

The bugs unfortunately don't have Xorg logs, but I don't think that they will differ that much between different installations. I searched Launchpad for a few Xorg logs for various drivers, and in fact for a given graphics driver the extension lists were exactly identical (I took samples of about 5 per driver):

common for intel, ati, fglrx, nvidia: MIT-SCREEN-SAVER XFree86-VidModeExtension XFree86-DGA DPMS XVideo XVideo-MotionCompensation X-Resource DOUBLE-BUFFER GLX RECORD XFree86-DRI DRI2

extra extensions for fglrx: ATIFGLRXDRI FGLRXEXTENSION GLESX AMDXVOPL (I suppose those are uninteresting)

I asked in the downstram bug for a few Xorg logs from those people who can reproduce the crash. Perhaps it will turn out that they are all using the same graphics driver.
Comment 3 Jamey Sharp 2010-04-15 09:58:31 UTC
(In reply to comment #2)
> > In lieu of a minimal test, perhaps you can tell me which X extensions these
> > applications have in common? Perhaps the client library for some extension
> > allocates an XID but is missing a SyncHandle call; if the next X request also
> > tries to allocate an XID you'd get this assertion failure.
> 
> I asked in the downstram bug for a few Xorg logs from those people who can
> reproduce the crash. Perhaps it will turn out that they are all using the same
> graphics driver.

I don't expect it to be a server-side issue, but then there was a recent report that turned out to be a byte-swapping bug in the server, and I didn't expect it then either... :-)

I meant, which extensions are these applications actually using? I'm hoping to pin the blame on a client-side extension library rather than libX11 itself. ;-) The list of loaded libraries from GDB would be a start. Is there some Launchpad way to download all the attachments from all the public duplicate bugs at once?

I don't have very high hopes though, especially after I went through checking for SyncHandle bugs in every extension library I could find, last week.

Regardless, I hope your effort to get more information from downstream bug reporters works out.
Comment 4 Martin Pitt 2010-04-16 01:58:10 UTC
Just to collect more puzzle pieces, I got three replies by now about my "which driver" question, and these three are all using the proprietary NVIDIA driver. It might be just pure luck with a sample size of three, of course, so I asked whether anyone who got this crash has a different card/driver.

(In reply to comment #3)

> I meant, which extensions are these applications actually using? I'm hoping to
> pin the blame on a client-side extension library rather than libX11 itself. ;-)
> The list of loaded libraries from GDB would be a start.

We routinely collect /proc/<pid>/maps in crash reports, they all have a "ProcMaps.txt" attachment. However, this just says which libraries the program dynamically links against, not which extensions  it's actually using, right?

Is there a better way to see which extensions a program is currently using? Some command that I could ask the bug reporters to run? We could even add that to our standard Apport hooks, so that it's collected for all crash reports.

> Is there some Launchpad way to download all the attachments from all the public duplicate bugs at once?

Bryce Harrington (CC added) has put together a lot of scripts for these purposes (X.org bug triaging). Bryce, do you have a script like that?

Anyway, it's really easy using the Python API, so for this particular purpose I wrote a quick hack which downloads a particular attachment name from a bug and all its duplicates and dumps them to stdout:

$ ./cat-bugattachment.py 507062 ProcMaps.txt 
---- 507062 ---
00110000-001ae000 r-xp 00000000 08:05 526394     /usr/lib/libgdk-x11-2.0.so.0.1902.0
[...]
---- 340441 ---
00400000-00408000 r-xp 00000000 09:01 94936                              /usr/lib/firefox-3.0.7/firefox
[...]

and so on. The script takes the master bug number and an attachment name, I put it on http://people.canonical.com/~pitti/scripts/cat-bugattachment.py for now.

When I do some simple post-processing of the output of above command (which I directed into file "out") I get this:

$ grep lib/libX out | awk '{print $6}' | sort -u
/usr/lib/libX11.so.6.2.0
/usr/lib/libX11.so.6.3.0
/usr/lib/libXau.so.6.0.0
/usr/lib/libXcomposite.so.1.0.0
/usr/lib/libXcursor.so.1.0.2
/usr/lib/libXdamage.so.1.1.0
/usr/lib/libXdmcp.so.6.0.0
/usr/lib/libXext.so.6.4.0
/usr/lib/libXfixes.so.3.1.0
/usr/lib/libXft.so.2.1.13
/usr/lib/libXinerama.so.1.0.0
/usr/lib/libXi.so.6.0.0
/usr/lib/libXi.so.6.1.0
/usr/lib/libXmu.so.6.2.0
/usr/lib/libXrandr.so.2.2.0
/usr/lib/libXrender.so.1.3.0
/usr/lib/libXRes.so.1.0.0
/usr/lib/libXss.so.1.0.0
/usr/lib/libXt.so.6.0.0
/usr/lib/libXtst.so.6.1.0
/usr/lib/libXxf86misc.so.1.1.0
/usr/lib/libXxf86vm.so.1.0.0

Not sure how helpful that actually is, though, since those are still a fair number of extensions.

Thank you!
Comment 5 Martin Pitt 2010-04-16 02:09:29 UTC
(In reply to comment #4)
> Just to collect more puzzle pieces, I got three replies by now about my "which
> driver" question, and these three are all using the proprietary NVIDIA driver.
> It might be just pure luck with a sample size of three, of course, so I asked
> whether anyone who got this crash has a different card/driver.

Got another reply that someone is using Intel, so let's discard the "driver dependent" theory for now.

> $ grep lib/libX out | awk '{print $6}' | sort -u

Ah, BS, sorry. We want the intersection, not the union.

$ sed -n '/^-/ p; /lib\/libX/ { s/^.*usr.lib.//; s/\.so\..*$//; p}' out|uniq|less

By inspection, it seems that those bugs have only a very small set:

---- 528503 ---
libXfixes
libXrender
libXcursor
libXdmcp
libXau
libX11
---- 558407 ---
libXt
libXext
libX11
libXmu
libXau
libXdmcp

So I believe the common set is libXdmcp, libXau, and libX11 itself.
Comment 6 Joris Guisson 2010-06-21 09:13:45 UTC
The same problem has been reported for ktorrent:

https://bugs.kde.org/show_bug.cgi?id=242306
Comment 7 Jamey Sharp 2010-06-21 11:19:06 UTC
(In reply to comment #6)
> The same problem has been reported for ktorrent:
> 
> https://bugs.kde.org/show_bug.cgi?id=242306

In that bug report, you wrote, "Nothing we can do about, the X people have to figure this one out." You could try providing information I've asked for in this bug. Above all, if I have a minimal test application that demonstrates the bug, I'm likely to be able to fix it quickly.

As things stand now, I can't reproduce the problem and I have no reasonable hypotheses about its cause, so it is unlikely that I'll be able to fix it.

One thing I hadn't asked for yet that might help: Run a failing application under `xtrace` and attach the log to this bug.
Comment 8 Pauli 2010-06-21 23:35:57 UTC
Has anyone tried to run any of multi-threaded applications that are crashing in hellgrind? That would give pretty fast conclusion if the bug is about locking problems in those cases.
Comment 9 hules 2010-08-12 03:47:00 UTC
I have a small application that manages to reproduce that crash quite easily on ubuntu 10.04, but only when running under valgrind. But that application itself behaves quite badly as it calls xlib from different threads using the same "Display*" struct for all threads, so I'm not sure if that might be of any use for tracking this issue. However I think it calls XInitThreads() on startup, so that should allow it to share the same display for all threads, isn't it ?

hellgrind outputs loads of stuff like:

==2950== Possible data race during write of size 4 at 0x44bd06c by thread #1
==2950==    at 0x4084764: ??? (in /usr/lib/libX11.so.6.3.0)
==2950==    by 0x4069352: XGetWindowProperty (in /usr/lib/libX11.so.6.3.0)
==2950==    by 0x81E0F46: juce::LinuxComponentPeer::isMinimised() const (juce_linux_Windowing.cpp:902)
==2950==    by 0x8164890: juce::ComponentPeer::handleMovedOrResized() (juce_amalgamated.cpp:71167)
==2950==    by 0x81E296B: juce::LinuxComponentPeer::handleWindowMessage(_XEvent*) (juce_linux_Windowing.cpp:1579)
==2950==    by 0x81E549F: juce::juce_windowMessageReceive(_XEvent*) (juce_linux_Windowing.cpp:2617)
==2950==    by 0x81DE71F: juce::juce_dispatchNextXEvent() (juce_linux_Messaging.cpp:384)
==2950==    by 0x81DE899: juce::juce_dispatchNextMessageOnSystemQueue(bool) (juce_linux_Messaging.cpp:431)
==2950==    by 0x8137170: juce::MessageManager::dispatchNextMessage(bool, bool*) (juce_amalgamated.cpp:34719)
==2950==    by 0x8137204: juce::MessageManager::runDispatchLoop() (juce_amalgamated.cpp:34763)
==2950==    by 0x81348A1: juce::JUCEApplication::main(juce::StringArray&, juce::JUCEApplication*) (juce_amalgamated.cpp:14770)
==2950==    by 0x8134A7B: juce::JUCEApplication::main(int, char**, juce::JUCEApplication*) (juce_amalgamated.cpp:14859)
==2950==  This conflicts with a previous read of size 4 by thread #6
==2950==    at 0x4082F0C: _XSetPrivSyncFunction (in /usr/lib/libX11.so.6.3.0)
==2950==    by 0x408BBEF: _XAllocID (in /usr/lib/libX11.so.6.3.0)
==2950==    by 0x40611A7: XCreatePixmap (in /usr/lib/libX11.so.6.3.0)
==2950==    by 0x50BF8ED: XcursorImageLoadCursor (in /usr/lib/libXcursor.so.1.0.2)
==2950==    by 0x50C0765: XcursorImagesLoadCursors (in /usr/lib/libXcursor.so.1.0.2)
==2950==    by 0x50C0886: XcursorImagesLoadCursor (in /usr/lib/libXcursor.so.1.0.2)
==2950==    by 0x50C4187: XcursorTryShapeCursor (in /usr/lib/libXcursor.so.1.0.2)
==2950==    by 0x4060C5C: XCreateGlyphCursor (in /usr/lib/libX11.so.6.3.0)

before the failed assertion.
Comment 10 Jamey Sharp 2010-08-12 09:10:59 UTC
(In reply to comment #9)
> I have a small application that manages to reproduce that crash quite easily on
> ubuntu 10.04, but only when running under valgrind.

Cool, what's the application? Can I get the source somewhere? I see it's using a toolkit (juce) that I hadn't heard of before though, so I'm not sure it's close to a minimal test case yet...

> hellgrind outputs loads of stuff like:
> 
> ==2950== Possible data race during write of size 4 at 0x44bd06c by thread #1
> ==2950==    at 0x4084764: ??? (in /usr/lib/libX11.so.6.3.0)
> ==2950==    by 0x4069352: XGetWindowProperty (in /usr/lib/libX11.so.6.3.0)
> ...
> ==2950==  This conflicts with a previous read of size 4 by thread #6
> ==2950==    at 0x4082F0C: _XSetPrivSyncFunction (in /usr/lib/libX11.so.6.3.0)
> ==2950==    by 0x408BBEF: _XAllocID (in /usr/lib/libX11.so.6.3.0)
> ==2950==    by 0x40611A7: XCreatePixmap (in /usr/lib/libX11.so.6.3.0)
> ...
> 
> before the failed assertion.

I'm guessing that's the bug we fixed in git commit a6d974dc59f2722b36e2df9d4f07aeee4f83ce43, which is included in libX11 1.3.4 and later. What version of libX11 do you have?

> But that application itself
> behaves quite badly as it calls xlib from different threads using the same
> "Display*" struct for all threads, so I'm not sure if that might be of any use
> for tracking this issue. However I think it calls XInitThreads() on startup, so
> that should allow it to share the same display for all threads, isn't it ?

Yes, that's supposed to be safe. However, I introduced a variety of bugs for multi-threaded apps some time ago, and only got most of them fixed in libX11 1.3.4. You should definitely re-test with that version or later if you don't already have it.
Comment 11 hules 2010-08-13 02:52:51 UTC
I have put the sources for this test here:
http://hules.free.fr/xcbbug.tbz2

JUCE is a small cross-platform toolkit, but this is definitively not a minimal test case as there are ~250000 lines of code.. The test application is not using it properly with respect to multithreading so if you run it in helgrind it will spit a lot of warnings. But when run in valgrind or helgrind, it ends up 75% of time with the _XallocID assert failed.

I have the libx11 of ubuntu 10.04, that is libx11 1.3.2. I'm not sure to proceed to test 1.3.4
Comment 12 Jeremy Huddleston 2011-10-08 20:06:02 UTC
Dupe of #23690?