Bug 21583

Summary: overeager xid reuse
Product: xorg
Reporter: Matthias Clasen <mclasen>
Component: Server/General
Assignee: Xorg Project Team <xorg-team>
Status: RESOLVED MOVED
QA Contact: Xorg Project Team <xorg-team>
Severity: major
Priority: high
CC: basic, bgamari, bugs.freedesktop, bugs, bunk, davidsboogs, freedesktop, hugh, jamey, jeremyhu, jmuizelaar, mozilla_bugs, no.tellin, remi, robertojimenoca, slomo, tomi
Version: unspecified
Keywords: love
Hardware: Other
OS: All
Whiteboard: 2012BRB_Reviewed
i915 platform:
i915 features:
Bug Depends on:
Bug Blocks: 44202
Attachments:
Description: Firefox backtrace with RenderBadPicture (Flags: none)
Description: xtrace of a typical crash (Flags: none)
Description: firefox crash and gdb of corpse (Flags: none)

Description Matthias Clasen 2009-05-05 16:14:42 UTC
http://bugzilla.gnome.org/show_bug.cgi?id=581526 describes a scenario where XCreateWindow appears to reuse an XID while the DestroyNotify for the previous owner of that XID is still sitting in the event queue. This causes GDK to get confused, and things go downhill from there. 

It seems unreasonable to demand that clients peek the queue for pending destroy notifies whenever they want to create a window, in particular since this problem does not occur without the resource-reusing extension.
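
The sequence described above can be sketched as a toy model. This is only an illustration of the ordering problem, not GDK or Xlib code; the `ToyClient` class, its fields, and the XID value are all hypothetical.

```python
from collections import deque


class ToyClient:
    """Toy model of a client-side XID -> window table (hypothetical, not GDK code)."""

    def __init__(self):
        self.windows = {}      # XID -> window name
        self.events = deque()  # pending events, oldest first
        self.warnings = []

    def create_window(self, xid, name):
        if xid in self.windows:
            # The table still holds the previous owner, whose
            # DestroyNotify has not been processed yet.
            self.warnings.append(f"XID collision on {xid:#x}")
        self.windows[xid] = name

    def process_events(self):
        while self.events:
            kind, xid = self.events.popleft()
            if kind == "DestroyNotify":
                # The event is about the old window, but the lookup
                # finds whatever currently owns the XID.
                self.windows.pop(xid, None)


client = ToyClient()
client.create_window(0x400001, "old")
client.events.append(("DestroyNotify", 0x400001))  # server destroyed "old"
client.create_window(0x400001, "new")              # XID reused before the event is read
client.process_events()                            # stale event removes "new"
```

In this model the stale DestroyNotify ends up removing the table entry for the new window, which is the kind of confusion the report describes.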
Comment 1 Karl Tomlinson 2009-07-05 16:50:42 UTC
I can't see a reasonable way for either Xlib or the X server to guarantee that
XIDs in clients' event queues are unique.

The X server has handed off the DestroyNotify event, so it thinks it has
finished with the event.

Xlib could avoid allocating an XID that is referenced in its own event queue
(for known event types), but it wouldn't know what other clients might have a
reference to a candidate XID sitting in their event queues.

If the server were to keep XIDs of destroyed windows allocated until clients
have processed events on that window, it would need to know when the events in
Xlib's queue have been processed.  I can't see how the Xserver can know this
(without some change in protocol).

The other way of looking at this is that the events are a history of what has
happened and need to be interpreted in the context of when they happened.
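
For illustration, the client-side check mentioned above might look roughly like the sketch below. It is an assumption-laden model, not Xlib code: `pick_xid` and the `(kind, xid)` event tuples are hypothetical, and as noted it only covers the calling client's own queue.

```python
def pick_xid(free_xids, pending_events):
    """Return the first free XID not named by any queued event, else None."""
    # Collect every XID mentioned by an event this client has not processed yet.
    queued = {xid for _kind, xid in pending_events}
    for xid in free_xids:
        if xid not in queued:
            return xid
    return None


pending = [("DestroyNotify", 0x400001), ("Expose", 0x400007)]
chosen = pick_xid([0x400001, 0x400002], pending)
```

Here `0x400001` is skipped because a DestroyNotify for it is still queued. Nothing in this check can see events queued in other clients, which is the limitation raised in this comment.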
Comment 2 David 2009-09-13 15:28:51 UTC
This bug causes serious problems for some of us.  In my case (after bug 20254 was fixed), I believe it is the cause of most of my Firefox sessions terminating, sometimes after producing the disembodied windows mentioned in the first comment at https://bugzilla.gnome.org/show_bug.cgi?id=581526

So.  Even if it's not possible to completely prevent an XID from being reused before the events that mention it have been processed, perhaps reuse could be made so unlikely that it won't happen in reasonable circumstances?  I am thinking of the way process IDs work: each one is higher than the previous one assigned until the counter hits an integer limit and wraps back to 0, and by then any freed XIDs that old would hopefully no longer be sitting in event queues.

I tried to take a look at the code but quickly came to the conclusion that this isn't something I personally could just jump into.  So I don't know if it's a feasible suggestion or not.  If not, perhaps there could be some similar workaround to delay a given ID's reuse until it's simply unlikely to be a problem.
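
To make the PID-style idea concrete, here is a minimal sketch. It is not based on the actual server or Xlib allocator; the `SequentialXidAllocator` class and the tiny ID space of 8 are assumptions chosen only to show the wraparound.

```python
class SequentialXidAllocator:
    """Hands out IDs in increasing order and wraps around, like PID allocation."""

    def __init__(self, size):
        self.size = size      # number of IDs in the (toy) ID space
        self.next = 0         # next candidate to try
        self.in_use = set()

    def alloc(self):
        # Try each ID at most once, starting from the last position.
        for _ in range(self.size):
            xid = self.next
            self.next = (self.next + 1) % self.size
            if xid not in self.in_use:
                self.in_use.add(xid)
                return xid
        raise RuntimeError("no free XIDs")

    def free(self, xid):
        self.in_use.discard(xid)


alloc = SequentialXidAllocator(size=8)
a = alloc.alloc()
alloc.free(a)
b = alloc.alloc()  # moves on to the next ID instead of reusing the freed one
```

A freed ID only comes back after the counter has gone all the way around, so how long reuse is delayed depends on how large the ID space is and how fast IDs are consumed.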
Comment 3 Karl Tomlinson 2009-09-13 16:11:40 UTC
Improving the algorithm that provides the XID range so that it hands out a larger range where possible would make this less likely (though it could still happen, less often, in reasonable circumstances).

Keeping a buffer of a certain number of recently released XIDs is another possibility.

Or perhaps calculating the range in advance, so that the range used is a range of XIDs that were available (but not advertised) at the time of a previous range request.
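
As a rough sketch of the "buffer of recently released XIDs" option: released IDs sit in a fixed-size quarantine and only become reusable once enough newer releases have pushed them out. The `QuarantineAllocator` class, its parameters, and the starting ID are hypothetical, and the sketch ignores the fact that the real XID space is finite.

```python
from collections import deque


class QuarantineAllocator:
    """Keeps the N most recently released IDs out of circulation."""

    def __init__(self, first_xid, quarantine_size):
        self.next_fresh = first_xid
        self.quarantine_size = quarantine_size
        self.quarantine = deque()  # recently released, oldest first
        self.reusable = deque()    # released long enough ago to hand out again

    def free(self, xid):
        self.quarantine.append(xid)
        # Anything beyond the newest N releases becomes reusable.
        while len(self.quarantine) > self.quarantine_size:
            self.reusable.append(self.quarantine.popleft())

    def alloc(self):
        if self.reusable:
            return self.reusable.popleft()
        xid = self.next_fresh  # nothing safe to reuse: take a never-used ID
        self.next_fresh += 1
        return xid


q = QuarantineAllocator(first_xid=100, quarantine_size=2)
ids = [q.alloc() for _ in range(3)]
for xid in ids:
    q.free(xid)
n = q.alloc()  # oldest released ID, already pushed out of quarantine
m = q.alloc()  # quarantine still holds the other two, so a fresh ID is used
```

The buffer size trades ID-space consumption against how long a released ID stays unused; it reduces the chance of a collision but does not rule one out.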
Comment 4 D. Hugh Redelmeier 2009-09-14 17:12:32 UTC
Reducing the frequency of the problem would provide relief.  In my (possibly naive) opinion it is the wrong approach: the design flaw needs to be fixed.  Perhaps that requires an API change.
Comment 5 Ben Gamari 2009-09-22 19:19:43 UTC
This seems to be biting me too, on the order of once every 15 minutes (closing a firefox tab has by my estimate a 10% chance of crashing the firefox process). Meanwhile, .xsession-errors is flooded with messages from GDK warning of XID collisions.

I run most of the Xorg stack from git and interestingly enough, this behavior started a few weeks ago. I haven't had a chance to try bisecting yet, but as soon as I get a chance I'll drop a note.
Comment 6 Ben Gamari 2009-09-25 14:06:11 UTC
Created attachment 29852
Firefox backtrace with RenderBadPicture

It seems that Google Maps serves as an excellent reproduction case for the Firefox crash. Opening Google Maps in a tab and closing it will almost always result in a RenderBadPicture within 3 attempts. Attached is a backtrace from doing just that. Is it possible that this backtrace is caused by aggressive XID reuse?
Comment 7 Karl Tomlinson 2009-09-27 23:25:59 UTC
Comment on attachment 29852
Firefox backtrace with RenderBadPicture

(In reply to comment #6)
> Is it possible that this backtrace is caused by aggressive XID reuse?

I wouldn't have expected RenderBadPicture from this bug.  If you can get a stack when running Firefox with --sync, then it would be best to file a bug at https://bugzilla.mozilla.org/ under Core -> Widget: Gtk
Comment 8 Søren Sandmann Pedersen 2009-09-28 07:00:23 UTC
The RenderBadPicture may be caused by running cairo master. If you are, try downgrading to 1.8.8.
Comment 9 Ben Gamari 2009-09-28 10:21:12 UTC
(In reply to comment #8)
> The RenderBadPicture may be caused by running cairo master. If you are, try
> downgrading to 1.8.8.
> 

Yep, indeed I am running cairo from master. I just reverted and the usual reproduction cases seem to be stable. This is evidently a known issue? Has a bug been opened for it? Can I do anything to help? Thanks a ton for your comment. I've been passively scratching my head over this for weeks now.
Comment 10 Søren Sandmann Pedersen 2009-09-28 12:37:17 UTC
I don't know if a bug has been filed, but I do know that it has been talked about on the #cairo IRC channel, and that at least Chris Wilson is aware of it.

I'm sure they'd appreciate a bisecting, although that's a bit painful to do because the bug isn't 100% reproducible. 
Comment 11 Ben Gamari 2009-09-28 12:42:33 UTC
(In reply to comment #10)
> I don't know if a bug has been filed, but I do know that it has been talked
> about on the #cairo IRC channel, and that at least Chris Wilson is aware of it.
> 
Yeah, Chris and I talked briefly on #intel-gfx.

> I'm sure they'd appreciate a bisecting, although that's a bit painful to do
> because the bug isn't 100% reproducible. 
> 
I actually tried but it looks like the bug predates 1.8.8. Arg!
Comment 12 Søren Sandmann Pedersen 2009-09-29 09:49:33 UTC
Note that if you install 1.8.8 on top of a 1.9 installation, you'll need to delete the existing libcairo.so, or it won't take effect.
Comment 13 Ben Gamari 2009-09-29 09:56:06 UTC
(In reply to comment #12)
> Note that if you install 1.8.8 on top of a 1.9 installation, you'll need to
> delete the existing libcairo.so, or it won't take effect.
> 

Yep, restarted my Xorg session in between tests which I thought should be sufficient. Moreover, I'm fairly certain the newly installed libraries did take effect after the restart as a scaling bug seen in firefox in 1.8.8 reared its head again. So anyways, I'm fairly confident that I did in fact establish that the bug predates 1.8.8, although it strikes me as odd that it's not seen by more people.
Comment 14 Chris Wilson 2009-09-29 10:01:47 UTC
My analysis of this bug indicates that the RenderBadPicture results from a delayed cairo_surface_destroy() after firefox has called XDestroyWindow() on the *parent* Window. In this situation firefox should be calling cairo_surface_finish(), or cairo_surface_destroy() and disposing of the cairo_surface_t, on the destroyed hierarchy.

So the RenderBadPicture is a separate bug (and not ours! ;-) from the XID reuse.
Comment 15 Søren Sandmann Pedersen 2009-10-04 13:44:05 UTC
Well, I haven't looked into this bug, but for me, it is definitely the case that it happens with cairo master and not with 1.8.8.

Comment 16 Chris Wilson 2009-10-08 12:25:00 UTC
Created attachment 30180
xtrace of a typical crash

Note that cairo calls RenderFreePicture (4ebda) immediately upon the cairo_surface_finish() [which presumably is actually triggered by the final cairo_surface_destroy() and is not being manually called], but the drawable was destroyed much earlier (the DestroyNotify arrives at 47608). Note also that the drawable is never explicitly destroyed but is reaped along with its parent (475f7).

The full trace is available at http://people.freedesktop.org/~ickle/ff.crash.log
Comment 17 Roberto Jimeno 2009-10-17 06:57:36 UTC
I saw a way to reproduce this bug in Firefox at:
https://bugzilla.mozilla.org/show_bug.cgi?id=522635
I can confirm it gets reliably triggered with cairo 1.9.4 but not with cairo 1.8.8
Comment 18 D. Hugh Redelmeier 2009-11-21 22:36:25 UTC
Created attachment 31379
firefox crash and gdb of corpse
Comment 19 D. Hugh Redelmeier 2009-11-21 22:37:45 UTC
I still get crashes from Firefox every few days.
Before each crash, I see one or more messages like this:
 (firefox:5290): Gdk-WARNING **: XID collision, trouble ahead

The actual crash is usually a SEGV.  I think that it is a null pointer dereference but I cannot be sure because GDB is unreliable with optimized code.  (I have an example where gdb prints 0 for a pointer variable but when I look at the assembly code I see that that variable is not represented at that point in the code.)

I don't think my problem has anything to do with cairo because I don't find RenderBadPicture in any of the tracebacks.  Am I being naive?  Should I look for something else?  I'm using an up-to-date Fedora 11 on x86-64; cairo-1.8.8-1.fc11.x86_64; no flash plugin.

I think that the Cairo problems are a different bug and should have a different bugzilla entry.

The original posting in this bugzilla entry describes a bug that I still think is real.  I imagine that this is the bug that is afflicting me.

I'm attaching a very long typescript of a firefox session that failed and a gdb of the resulting core file.  Perhaps someone could tell from this if what I've said in this comment is wrong.
Comment 20 No Tellin 2009-12-01 14:48:24 UTC
You may want to view the following video here:
http://www.youtube.com/watch?v=fwIwZazMTgM

I created this video to clearly demonstrate at least one trigger for the XID
Collision message. I believe there are at least two triggers and that both
triggers are adobe flash 10 related.

You can see from the video that you should have reproducible real-life test
cases for this problem.

I run a Gentoo installation. 

For those familiar with Gentoo, at the end of the video, I run: 

emerge -epv mozilla-firefox | less
emerge --info

I have saved the output of these to text files if anyone is interested. Just
contact me.

The reason is that the emerge -epv mozilla-firefox command will display every
package and dependency required for mozilla-firefox. For the record, prior to
creating the video, I actually did re-compile every package in this list
(emerge -e mozilla-firefox) in order to ensure a clean run.

In the video, the left part of the screen is a konsole terminal window. The
right part of the screen is firefox. I start firefox with the command "firefox
-sync" in the terminal window.

I have FF set up to start with a number of tabs. As I change focus from tab to
tab, watch the terminal window. There are two tabs where changing focus causes
XID Collision messages to appear. It is particularly obvious that the error
messages are generated during flash activity. Note especially the generation of
messages as the flash window controls autohide and then re-appear. It's not
clear to me in the second tab (The Daily Show) what kind of flash control is
causing the messages. However, that site never seems to stop loading flash
objects. Or rather, my patience runs out before the flash downloads can
complete.

My reading of other people's problems suggests that x86 (i386) based systems
don't have this problem but please regard this as an unconfirmed data point.

In this thread in the Gentoo forums, I am 'dufeu':
http://forums.gentoo.org/viewtopic-t-788609-highlight-.html

The video is best viewed in HD on a screen 1384x768 or larger. (full screen mode)

Thank you all for your time and patience!

BTW - I did understand the discussion of asynchronous ID assignment and release. However, while the problem seems to be properly identified, I'm not sure that the exact trigger for invoking the problem has been properly identified. I hope the video will be helpful. Unless I (as an end-user) have completely misunderstood what I see, it seems clear that the actual trigger is probably flash 10.

Disclaimer: I am only an end user. I am not a programmer.
Comment 21 Jeremy Huddleston Sequoia 2011-10-08 19:45:53 UTC
This seems more like a server issue.  I think it could easily be possible for 
the server to guarantee that XIDs are not reused within a certain time period 
since it issued a DestroyNotify.  That won't guarantee that clients are happy, 
but it can certainly help.  We just need to store a timestamp of the time the 
XID was destroyed and if the head of the recycle queue is too recent, we 
allocate a new XID rather than recycling.

Tracking for 1.12, but I'd consider this for 1.11.x if the change is simple 
enough.
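
A minimal sketch of the timestamp idea, for discussion only. It does not reflect the actual server resource code: the `TimedRecycler` class, the `min_age` parameter, and passing the clock in explicitly as `now` are all assumptions made to keep the example self-contained.

```python
from collections import deque


class TimedRecycler:
    """Recycles an ID only once it has been destroyed for at least min_age."""

    def __init__(self, first_xid, min_age):
        self.next_fresh = first_xid
        self.min_age = min_age
        self.recycle = deque()  # (destroyed_at, xid), oldest first

    def free(self, xid, now):
        # Remember when the ID was destroyed.
        self.recycle.append((now, xid))

    def alloc(self, now):
        # Recycle only if the head of the queue is old enough.
        if self.recycle and now - self.recycle[0][0] >= self.min_age:
            return self.recycle.popleft()[1]
        xid = self.next_fresh  # head too recent (or queue empty): new ID
        self.next_fresh += 1
        return xid


r = TimedRecycler(first_xid=100, min_age=5.0)
a = r.alloc(now=0.0)
r.free(a, now=1.0)
b = r.alloc(now=2.0)  # head was destroyed 1s ago: too recent, allocate fresh
c = r.alloc(now=6.0)  # head is now 5s old: recycle it
```

Because entries are queued in destruction order, checking only the head is enough. As said above, this would not guarantee that clients are happy, since a slow client could still have an older event pending.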
Comment 22 GitLab Migration User 2018-12-13 22:21:13 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/xorg/xserver/issues/380.