Bug 98883

Summary: BadAccess errors in ShmAttach due to thread races with XNextRequest() usage in cairo-xlib-surface-shm.c
Product: cairo Reporter: Karl Tomlinson <bugs.freedesktop>
Component: xlib backendAssignee: Chris Wilson <chris>
Status: RESOLVED MOVED QA Contact: cairo-bugs mailing list <cairo-bugs>
Severity: normal    
Priority: medium CC: daniel.van.vugt
Version: 1.12.16   
Hardware: All   
OS: All   
Whiteboard:
i915 platform: i915 features:
Attachments: Handle-XShmAttach-failures-v1.patch

Description Karl Tomlinson 2016-11-28 02:03:49 UTC
_cairo_xlib_display_fini_shm sets pool->attached to XNextRequest() assuming
the approaching XShmAttach() will be the next request.
https://cgit.freedesktop.org/cairo/tree/src/cairo-xlib-surface-shm.c?id=3f1a6f7225e31057a8af9313f051a1d311df0c69#n602

This assumption can be invalid when another request is performed on another
thread before the XShmAttach() reads |request| from the display.

An |attached| sequence number that is too old means that 
_cairo_xlib_shm_pool_cleanup() can call _cairo_xlib_display_shm_pool_destroy()
and so shmdt() before the server processes the ShmAttach request, resulting in
BadAccess errors.

Similarly _cairo_xlib_shm_surface_mark_active() is called and uses
XNextRequest() before the corresponding request, leading to similar races
affecting _cairo_xlib_shm_surface_flush() and get_compositor() and
_cairo_xlib_shm_info_cleanup().  I assume _cairo_xlib_shm_surface_get_obdata()
has similar issues.
Comment 1 Dave Gilbert 2016-11-28 19:40:47 UTC
I wonder whether the fix of setting the pool->attached = XNextRequest(dpy)
after the XShmAttach would work.  That's effectively delaying the shmdt until after whatever the next X request is after the attach has completed; who knows what that is, but as long as there is one (ahem, that might be a bit of an assumption; could we just add a dummy?) then it should complete.  Not got a clue what the perf would be.
Comment 2 Karl Tomlinson 2016-11-28 20:31:20 UTC
pool->attached = XNextRequest(dpy) - 1;
after the XShmAttach would work, yes.
It would then have a sequence number either equal to that of the XShmAttach request or to that of a subsequent request, if there was one.
Comment 3 Uli Schlachter 2016-12-02 17:01:40 UTC
Correct me if I'm wrong, but XLockDisplay() says that there is no other thread that could "steal the next request", right?
Also, feel free to correct me, but cairo should be calling XLockDisplay() before doing "SHM stuff".
Comment 4 Bill Spitzak 2016-12-07 17:09:35 UTC
XLockDisplay only works if XInitThreads was done, and that his highly not recommended as it slows Xlib down enormously (because it was very badly implemented with fine-grained locking).
Comment 5 Karl Tomlinson 2016-12-07 23:52:04 UTC
XLockDisplay() would work, and there is no bug if there is only one thread (i.e. when XInitThreads() is not called), but XLockDisplay() is not necessary.
Comment 6 Daniel van Vugt 2018-05-03 10:27:18 UTC
Created attachment 139297 [details] [review]
Handle-XShmAttach-failures-v1.patch

I found myself amongst this code in the process of trying to fix a crash in the Ubuntu 18.04 graphical installer (https://launchpad.net/bugs/1751252).

Here's a patch to fix that, and hopefully this bug too.
Comment 7 Sebastien Bacher 2018-07-25 19:11:33 UTC
Could a maintainer review the patch there?
Comment 8 Uli Schlachter 2018-07-25 20:23:34 UTC
@Sebastian: I don't know who you count as a maintainer, but I'll take a look...

So... this bug is about _cairo_xlib_shm_pool_create() getting the sequence number for pool->attached wrong, resulting in cairo possibly destroying the shared memory segment to early. The discussion concluded that setting pool->attached = XNextRequest(dpy)-1 after the request is the right fix.
In comment #3 I wondered about XLockDisplay(), but that seems unrelated. I was possibly confused since there are two places calling XShmAttach() and one of them locks the display while the other one does not.

The patch in comment #6 says it is about https://bugs.launchpad.net/ubuntu/+source/ubiquity/+bug/1751252 instead of this bug. https://bugs.launchpad.net/ubuntu/+source/ubiquity/+bug/1751252 must be a different bug since it also happens with GDK_SYNCHRONIZE=1, which means that the race that this bug is about cannot happen.

What the patch does is to make XShmAttach synchronous (aka "slow"). I do not know why, but nothing in this bug report suggests that this is necessary. The description of the patch does not say anything about this either. XShmAttach() should not just randomly fail, so why does the patch try to handle random failures?
Comment 9 Bill Spitzak 2018-07-25 22:22:44 UTC
The XLockDisplay is needed so that an error in another thread does not get reported as a failure of XShmAttach.
Comment 10 Karl Tomlinson 2018-07-26 01:07:43 UTC
XSetErrorHandler() is not appropriate for use with multi-thread code.
Comment 11 Daniel van Vugt 2018-07-26 01:53:30 UTC
It's been a few months so my memory is not completely clear. I do however think Uli's statement is incorrect in the final two sentences:

> What the patch does is to make XShmAttach synchronous (aka "slow"). I do not
> know why, but nothing in this bug report suggests that this is necessary.
> The description of the patch does not say anything about this either.
> XShmAttach() should not just randomly fail, so why does the patch try to
> handle random failures?

attachment 139297 [details] [review] mentions multiple times "why" it is necessary to make it synchronous. Because it's impossible to detect errors in XShmAttach otherwise.

It's possible we're using to the wrong bug here, but I believe the patch is still correct and we have no choice but to make XShmAttach slow. You can only choose between maximum performance or reliable error handling, and not have both. The reasons are detailed in attachment 139297 [details] [review].
Comment 12 Daniel van Vugt 2018-07-26 01:58:51 UTC
So I think reliability should be the first priority.

Also, I don't *think* the affected function would usually be used in any performance-sensitive location so the slowdown required to catch errors is probably never noticeable. And it's less undesirable than the crashes we get from not catching them at all.
Comment 13 Karl Tomlinson 2018-07-26 02:10:08 UTC
(In reply to Daniel van Vugt from comment #11)
> The reasons are detailed in attachment 139297 [details] [review].

I don't see an explanation why XShmAttach is failing.
Comment 14 Daniel van Vugt 2018-07-26 02:12:00 UTC
The patch says:

"XShmAttach returns True, even on error."

and

"XShmAttach is hard-coded in libXext to return True, even on failure."

You can dig through the related source code to verify that.
Comment 15 Karl Tomlinson 2018-07-26 02:16:39 UTC
(In reply to Daniel van Vugt from comment #11)
> attachment 139297 [details] [review] [review] mentions multiple times "why" it is
> necessary to make it synchronous. Because it's impossible to detect errors
> in XShmAttach otherwise.

It is not impossible to detect errors without XSync.

Display::async_handlers can be used.  The result would be async and 
the patch comment claims that _cairo_xlib_shm_pool_create must return
a reliable status synchronously.  I haven't check that claim, but if so, I
wonder whether it would be possible to take a fallback path until the pool is
successfully created.
Comment 16 Daniel van Vugt 2018-07-26 02:20:55 UTC
Great. I would be happy to see alternative solutions but also don't need to be involved at all...

In Ubuntu we are not dependent on this patch because we have instead modified other components to indirectly avoid triggering the failure in Cairo. Effectively we have a permanent workaround already.
Comment 17 Karl Tomlinson 2018-07-26 02:29:41 UTC
(In reply to Daniel van Vugt from comment #14)
> The patch says:
> 
> "XShmAttach returns True, even on error."
> 
> and
> 
> "XShmAttach is hard-coded in libXext to return True, even on failure."

The question was what error or failure, and why is that happening.

XShmAttach returns True to indicate it successfully queued the request.
It does not fail to queue the request.

If the server fails to process the request successfully, then it sends an
error response.

It would be helpful to understand why you are expecting an error response.

The best clue appears to be the patch at the end of
https://code.launchpad.net/~azzar1/ubiquity/+git/ubiquity/+merge/345056
Comment 18 Daniel van Vugt 2018-07-26 02:40:07 UTC
> XShmAttach returns True to indicate it successfully queued the request.
> It does not fail to queue the request.
> 
> If the server fails to process the request successfully, then it sends an
> error response.
> 
> It would be helpful to understand why you are expecting an error response.

Essentially we have the Xorg server process having its privileges demoted and re-promoted. If the XShmAttach request comes in while it is demoted then it can't complete the request and returns BadAccess to the client. The client initiated the request from _cairo_xlib_shm_pool_create but has since (asynchronously) moved on to other code. So it crashes in an unpredictable location when the BadAccess eventually comes in from the server.

I am happy with your explanation that "XShmAttach returns True to indicate it successfully queued the request". However I don't believe the same applies to _cairo_xlib_shm_pool_create, so I think any final fix would have to happen in that function still.
Comment 19 GitLab Migration User 2018-08-25 13:31:43 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/cairo/cairo/issues/49.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.