Bug 91967

Summary:	Assertion "(_cairo_atomic_int_get (&(&surface->ref_count)->ref_count) > 0)"
Product:	cairo	Reporter:	Alberts Muktupāvels <alberts.muktupavels>
Component:	xlib backend	Assignee:	Uli Schlachter <psychon>
Status:	RESOLVED FIXED	QA Contact:	cairo-bugs mailing list <cairo-bugs>
Severity:	major
Priority:	medium	CC:	chat-to-me, chris, jskarvad, monsta
Version:	unspecified
Hardware:	Other
OS:	All
See Also:	https://launchpad.net/bugs/1571250 https://bugzilla.redhat.com/show_bug.cgi?id=1331021
Whiteboard:
i915 platform:		i915 features:
Attachments:	cairo-xlib-surface: set image to NULL after destroying it stacktrace of mate applet crash Reproducer Second file of reproducer Proposed fix Proposed fix Proposed fix

Description Alberts Muktupāvels 2015-09-10 21:47:31 UTC

In _get_image_surface function I get assertion:
Assertion "(_cairo_atomic_int_get (&(&surface->ref_count)->ref_count) > 0)" failed.

1) image is created in "if (try_shm && pixman_format)", line 785
2) then image->base is destroyed in same if, line 809
3) then cairo enters "if (ximage == NULL)", line 836
4) then it fails to get ximage and enters "if (ximage == NULL)", line 879
5) that moves to BAIL where cairo_surface_destroy is called on destroyed? surface.

Application crash. I tried to add "if (cairo_surface_get_reference_count (&image->base) != )" and it "fixed" assertion/crash. But I guess that is not right way to fix it.

Comment 1 Uli Schlachter 2015-09-11 15:34:59 UTC

Do you happen to have a test case that reproduces this problem or are you just staring at the code?

It looks like this just should do `image = NULL;` after it called cairo_surface_destroy() (line 809), no?

And for everyone who would like to see a file name: cairo-xlib-surface.c :-)

Comment 2 Alberts Muktupāvels 2015-09-11 15:50:13 UTC

(In reply to Uli Schlachter from comment #1)
> Do you happen to have a test case that reproduces this problem or are you
> just staring at the code?

I got this in gnome-panel - out-process applets crashes with that assertion when panel size is changing very fast.

> It looks like this just should do `image = NULL;` after it called
> cairo_surface_destroy() (line 809), no?

Nice. :) This fixes problem. Now someone just needs to push this simple fix.

> And for everyone who would like to see a file name: cairo-xlib-surface.c :-)

Comment 3 Chris Wilson 2015-09-11 16:00:27 UTC

You may have fixed a crash, but for _get_image_surface() to fail is nigh on catastrophic (i.e. instead of a crash we now lose rendering). About the only valid error I can think of is BadDrawable. Any idea how you hit this failure path?

Comment 4 Alberts Muktupāvels 2015-09-11 16:17:31 UTC

(In reply to Chris Wilson from comment #3)
> You may have fixed a crash, but for _get_image_surface() to fail is nigh on
> catastrophic (i.e. instead of a crash we now lose rendering).

Did not understand... Is it wrong to set image to NULL when image->base is destroyed? It is recreated later. Anyway I have no idea how cairo works. That is why there was no patch from me.

> About the only valid error I can think of is BadDrawable. Any idea how you
> hit this failure path?

No idea, but cairo should not try to free/destroy image->base twice.

Comment 5 Alberts Muktupāvels 2015-09-17 12:35:13 UTC

Created attachment 118325 [details] [review]
cairo-xlib-surface: set image to NULL after destroying it

Please consider applying this simple patch.

Trying to destroy surface for second time results in assertion which ends with crash. And that is serious problem.

Image at that point is already destroyed and not valid so setting it to NULL is safe. That will prevent from trying to destroy surface for second time and more importantly I can handle this error now in code: I can use cairo_surface_status and/or cairo_status to detect this failure.

Comment 6 Bryce Harrington 2015-10-15 20:12:43 UTC

A couple things...

First, if we set image to NULL and it hits BAIL, then doesn't it just crash unreferencing it here?

    if (unlikely (status)) {
        cairo_surface_destroy (&image->base);
        return _cairo_surface_create_in_error (status);
    }


If image can be NULL, then there are a few places in the function where &image->base is used that would need NULL ptr checks else they might segfault.

As well, I would think the image object would need to be deliberately destroyed, to avoid memory leak.  I.e.:

...
        cairo_surface_destroy (&image->base);
        cairo_surface_destroy (image);
        image = NULL;
    }
...

However, backing up a bit... if XShmGetImage() has failed, such that we're destroying &image->base, is any of the rest of the function still valid?  I notice from your trace that it's just hitting another error and bouncing down to BAIL, where the crash occurs.  It looks like most of the BAIL logic doesn't apply here since the objects its freeing hadn't been created.  Which makes me wonder if when this error situation is hit, if it'd be better to just exit the function?  Then we can avoid worrying about NULL image pointers and what's happening in BAIL.

...
        cairo_surface_destroy (&image->base);
        cairo_surface_destroy (image);
        return _cairo_surface_create_in_error (CAIRO_STATUS_INVALID_VISUAL);
    }
...

Maybe the error code there should be CAIRO_STATUS_NO_MEMORY instead.

Comment 7 Alberts Muktupāvels 2015-10-16 00:17:11 UTC

(In reply to Bryce Harrington from comment #6)
> A couple things...
> 
> First, if we set image to NULL and it hits BAIL, then doesn't it just crash
> unreferencing it here?
> 
>     if (unlikely (status)) {
>         cairo_surface_destroy (&image->base);
>         return _cairo_surface_create_in_error (status);
>     }

No. Without NULL it crash here because it tries to destroy image->base for second time...

> If image can be NULL, then there are a few places in the function where
> &image->base is used that would need NULL ptr checks else they might
> segfault.

If I know how to read code then this is not going to happen. When image->base is used again it is first re-created. Line 896 and 969. I fail to see how this could happen.

> As well, I would think the image object would need to be deliberately
> destroyed, to avoid memory leak.  I.e.:
> 
> ...
>         cairo_surface_destroy (&image->base);
>         cairo_surface_destroy (image);
>         image = NULL;
>     }
> ...

Since I have no idea how cairo works I decided to test this. This just causes another assert with ref_count > 0 failed. So this is wrong. Setting it to NULL is enough.

> However, backing up a bit... if XShmGetImage() has failed, such that we're
> destroying &image->base, is any of the rest of the function still valid?  I

Yes function is still valid.

> notice from your trace that it's just hitting another error and bouncing
> down to BAIL, where the crash occurs.  It looks like most of the BAIL logic
> doesn't apply here since the objects its freeing hadn't been created. Which

Wrong. It does not try to free object that hadn't been created. It tries to free object that was created and then freed.

> makes me wonder if when this error situation is hit, if it'd be better to
> just exit the function?  Then we can avoid worrying about NULL image
> pointers and what's happening in BAIL.
> 
> ...
>         cairo_surface_destroy (&image->base);
>         cairo_surface_destroy (image);
>         return _cairo_surface_create_in_error (CAIRO_STATUS_INVALID_VISUAL);
>     }
> ...
> 
> Maybe the error code there should be CAIRO_STATUS_NO_MEMORY instead.

Again, here is what happens:
1) 'if (try_shm && pixman_format)' block creates image, but XShmGetImage fails so we destroy it. Now image points to invalid / freed object.
2) It ties to get/create ximage, but it fails too so we end up in 'if (ximage == NULL)' block (line 880), it sets status to NO_MEMORY and goto BAIL. (If ximage != NULL then image will be re-created, bet that is not my case.)
3) Now we at line 1008 (BAIL). We have set error so we enter 'if (unlikely (status))'. And what we try to do here? We try to destroy already destroyed image->base. And thats is my problem.

Since your suggestion to use cairo_surface_destroy (image); ends up with other ref_count assert (basically double free) I assume that there is no memory leaks and it is properly freed. So ONLY problem is that image points to destroyed object.

Reading code I see one more 'goto BAIL' (line 967) that could lead to same problem/error/crash. In all other cases image is recreated.

Do you need more info and/or want that I test any other possible solutions? I think now it should be clear what happens and that setting image to NULL is safe (at least I fail to see how that could cause problem).

Comment 8 Bryce Harrington 2015-10-27 02:00:18 UTC

You've not really answered my questions.  Just stating I'm wrong doesn't clarify anything; I'm married, so I already know I'm always wrong.  ;-)

Seriously though, I think either you've misunderstood me or are not explaining things properly, because I still think your proposed change isn't correctly implemented for Cairo.

Comment 9 Alberts Muktupāvels 2015-11-02 14:26:02 UTC

(In reply to Bryce Harrington from comment #8)
> You've not really answered my questions.  Just stating I'm wrong doesn't
> clarify anything; I'm married, so I already know I'm always wrong.  ;-)

Which questions?

1. "First, if we set image to NULL and it hits BAIL, then doesn't it just crash unreferencing it here?"

No, it does not crash.

2. "However, backing up a bit... if XShmGetImage() has failed, such that we're destroying &image->base, is any of the rest of the function still valid?"

Yes, it is still valid... because it will try to get image differently.

3. "Which makes me wonder if when this error situation is hit, if it'd be better to just exit the function?"

No, because then we loose another chance of getting image in different way.

> Seriously though, I think either you've misunderstood me or are not
> explaining things properly, because I still think your proposed change isn't
> correctly implemented for Cairo.

Yeah, sounds like I don't understand you or you don't understand me, or maybe both... :(

Not correctly in what way? Then please attach patch that is supposed to be correct and I will test it. I can reproduce this, also I am not only one who have hit this bug/problem...

Bug report for same problem:
https://github.com/mate-desktop/mate-applets/issues/137

Comment 10 Wolfgang Ulbrich 2016-01-04 12:30:01 UTC

Hi,
i'm from Mate desktop upstream and our mate-panel is also affected by this bug.
See, https://github.com/mate-desktop/mate-panel/issues/369
The attached patch from reporter fixes the issue.
So, can you please merge them, or find a better solution to fix the issue?
Our report is one major blocker for the transition of Mate desktop to gtk3.
Thank you

Comment 11 Wolfgang Ulbrich 2016-03-23 21:04:53 UTC

Is there any chance to get your attention after 3 months?
Or is cairo unmaintained nowadays?
This is a serious showstopper for us and we need a solution.
Please help us to find a fix.

Thank you

Comment 12 Dmitry Shachnev 2016-04-24 11:33:24 UTC

I have got a bug report in Ubuntu about the same issue, and the attached patch by Alberts seems to solve it.

Dear Bryce, can you please apply it? It is just one line and fixes a crash that is very annoying for users…

Comment 13 Wolfgang Ulbrich 2016-04-27 17:12:40 UTC

emerald (gtk3) is also affected by this bug.
See rhbz report.

Comment 14 Wolfgang Ulbrich 2016-05-07 08:42:29 UTC

Ok, i will try to explain what did crash in MATE desktop (gtk3 version).
Maybe this helps to get your attention.
Steps to reproduce:
1. Installing Mate desktop gtk3 version from repos for fedora or archlinux.
https://copr.fedorainfracloud.org/coprs/raveit65/Mate-GTK3/
Don't update cairo here because this repo use a patched cairo version to avoid this annoying issue.

2. Load random applets into mate-panel.

3. enable compositor in marco (our WM) if it isn't already enabled.

4. Disable compositor via GUI or gsettings.
gsettings set org.mate.Marco.general compositing-manager false

5 Now you will see that several panel applets are crashing and the panel tries to reload them.

This isn't a randomly failure and is reproducible all the time.

Crash in stacktrace from our clock-applet (full version i will attach):
Thread 1 (Thread 0x7f9fff7ae980 (LWP 27490)):
#0  0x00007f9ffbd629c8 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:55
        resultvar = 0
        pid = 27490
        selftid = 27490
#1  0x00007f9ffbd6465a in __GI_abort () at abort.c:89
        save_stage = 2
        act = {__sigaction_handler = {sa_handler = 0x4, sa_sigaction = 0x4}, sa_mask = {__val = {140731922981920, 18793946, 140731922982000, 47270444160, 0, 0, 0, 21474836480, 140325103031070, 140731922982152, 140325162516673, 22530432, 140325162516673, 24234832, 140324717641728, 140325121230504}}, sa_flags = 953, sa_restorer = 0x7f9ffd010af0 <__PRETTY_FUNCTION__.11261>}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#2  0x00007f9ffbd5b187 in __assert_fail_base (fmt=<optimized out>, assertion=assertion@entry=0x7f9ffd0106a8 "((*&(&surface->ref_count)->ref_count) > 0)", file=file@entry=0x7f9ffd0105a0 "cairo-surface.c", line=line@entry=953, function=function@entry=0x7f9ffd010af0 <__PRETTY_FUNCTION__.11261> "cairo_surface_destroy") at assert.c:92
        str = 0x171bed0 " \b\212\001"
        total = 4096
> cut

With the patch from reporter this does not happen and we can can happily switch the compositor off/on.
I use the fix since several month for myself and for the Mat gtk3 repo, and i didn't saw any collateral damage.

Dear cairo devs,
hope this is enough info and please take a look at it again.

Comment 15 Wolfgang Ulbrich 2016-05-07 08:45:22 UTC

Created attachment 123533 [details]
stacktrace of mate applet crash

Comment 16 Jaroslav Škarvada 2016-05-24 08:25:28 UTC

This problem is easy to hit with libwnck3 (because libwnck3 uses cairo, but AFAIK libwnck2 didn't). It is reproducible if the application is quickly changing icon. Then there is a race condition when cairo calls XShmGetImage in cairo-xlib-surface.c:797 but the icon pixmap it is trying to get may not exist in this time. So the XShmGetImage returns error invalid pixmap, then the &image->base is destroyed on line 809. So far so good, but the &image->base is then destroyed again on line 1014 which triggers the assert in cairo_surface_destroy, because the reference count is 0 (so it would cause double free). And the application linking with libwnck3 core dumps, here is example backtrace:

Program terminated with signal SIGABRT, Aborted.
#0  0x00007f67f0e00a98 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:55
55	  return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
[Current thread is 1 (Thread 0x7f67f485aa00 (LWP 9784))]
(gdb) bt
#0  0x00007f67f0e00a98 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007f67f0e0269a in __GI_abort () at abort.c:89
#2  0x00007f67f0df9227 in __assert_fail_base (fmt=<optimized out>, assertion=assertion@entry=0x7f67f19997a8 "((*&(&surface->ref_count)->ref_count) > 0)", file=file@entry=0x7f67f19996a0 "cairo-surface.c", line=line@entry=953, function=function@entry=0x7f67f1999bf0 <__PRETTY_FUNCTION__.11260> "cairo_surface_destroy") at assert.c:92
#3  0x00007f67f0df92d2 in __GI___assert_fail (assertion=assertion@entry=0x7f67f19997a8 "((*&(&surface->ref_count)->ref_count) > 0)", file=file@entry=0x7f67f19996a0 "cairo-surface.c", line=line@entry=953, function=function@entry=0x7f67f1999bf0 <__PRETTY_FUNCTION__.11260> "cairo_surface_destroy") at assert.c:101
#4  0x00007f67f191ee12 in INT_cairo_surface_destroy (surface=surface@entry=0x5618d0ca78f0) at cairo-surface.c:953
#5  0x00007f67f194e000 in _get_image_surface (surface=surface@entry=0x5618d0c7cff0, extents=extents@entry=0x7ffff2922d40, try_shm=try_shm@entry=1) at cairo-xlib-surface.c:1014
#6  0x00007f67f194ec73 in _cairo_xlib_surface_acquire_source_image (abstract_surface=0x5618d0c7cff0, image_out=0x7ffff2922e00, image_extra=<optimized out>) at cairo-xlib-surface.c:1403
#7  0x00007f67f191f6d4 in _cairo_surface_acquire_source_image (surface=0x5618d0c7cff0, image_out=<optimized out>, image_extra=<optimized out>) at cairo-surface.c:1973
#8  0x00007f67f18e7e52 in _pixman_image_for_pattern (iy=0x7ffff2922fd0, ix=0x7ffff2922fc0, sample=0x7ffff2922fd0, extents=0x7ffff292372c, is_mask=-225298644, pattern=0x7ffff2923770, dst=0xf568971204090700) at cairo-image-source.c:1377
#9  0x00007f67f18e7e52 in _pixman_image_for_pattern (dst=dst@entry=0x5618d0ca7760, pattern=pattern@entry=0x7ffff2923770, is_mask=is_mask@entry=0, extents=extents@entry=0x7ffff292372c, sample=sample@entry=0x7ffff2923750, tx=tx@entry=0x7ffff2922fc0, ty=0x7ffff2922fd0) at cairo-image-source.c:1538
#10 0x00007f67f18e893e in _cairo_image_source_create_for_pattern (dst=0x5618d0ca7760, pattern=0x7ffff2923770, is_mask=0, extents=0x7ffff292372c, sample=0x7ffff2923750, src_x=0x7ffff2922fc0, src_y=0x7ffff2922fd0) at cairo-image-source.c:1583
#11 0x00007f67f191c151 in clip_and_composite_boxes (boxes=0x7ffff2923460, extents=0x7ffff29236f0, compositor=0x7f67f1bd6b60 <spans>) at cairo-spans-compositor.c:678
#12 0x00007f67f191c151 in clip_and_composite_boxes (compositor=compositor@entry=0x7f67f1bd6b60 <spans>, extents=extents@entry=0x7ffff29236f0, boxes=boxes@entry=0x7ffff2923460)
    at cairo-spans-compositor.c:882
#13 0x00007f67f191c75e in clip_and_composite_boxes (compositor=0x7f67f1bd6b60 <spans>, extents=0x7ffff29236f0, boxes=0x7ffff2923460) at cairo-spans-compositor.c:901
#14 0x00007f67f191ca79 in _cairo_spans_compositor_mask (_compositor=0x7f67f1bd6b60 <spans>, extents=0x7ffff29236f0) at cairo-spans-compositor.c:999
#15 0x00007f67f18d7429 in _cairo_compositor_paint (compositor=0x7f67f1bd6b60 <spans>, surface=0x5618d0ca7760, op=<optimized out>, source=<optimized out>, clip=<optimized out>)
    at cairo-compositor.c:65
#16 0x00007f67f191f8b1 in _cairo_surface_paint (surface=0x5618d0ca7760, op=CAIRO_OPERATOR_OVER, source=0x7ffff2923a30, clip=0x0) at cairo-surface.c:2117
#17 0x00007f67f18df285 in _cairo_gstate_paint (gstate=0x5618d0a72e30) at cairo-gstate.c:1067
#18 0x00007f67f18d1ea5 in INT_cairo_paint (cr=<optimized out>) at cairo.c:2003
#19 0x00007f67f4308ad0 in try_pixmap_and_mask (screen=screen@entry=0x5618d08ba8b0, src_pixmap=src_pixmap@entry=48251092, src_mask=src_mask@entry=48251093, iconp=iconp@entry=0x7ffff2923cd8, ideal_width=ideal_width@entry=32, ideal_height=ideal_height@entry=32, mini_iconp=0x7ffff2923ce0, ideal_mini_width=16, ideal_mini_height=16) at xutils.c:1832
#20 0x00007f67f4309fa4 in _wnck_read_icons (ideal_mini_height=16, ideal_mini_width=16, mini_iconp=0x7ffff2923ce0, ideal_height=32, ideal_width=32, iconp=0x7ffff2923cd8, src_mask=48251093, src_pixmap=48251092, screen=0x5618d08ba8b0) at xutils.c:2228
#21 0x00007f67f4309fa4 in _wnck_read_icons (screen=0x5618d08ba8b0, xwindow=xwindow@entry=48235980, icon_cache=icon_cache@entry=0x5618d0a90340, iconp=iconp@entry=0x7ffff2923cd8, ideal_width=ideal_width@entry=32, ideal_height=ideal_height@entry=32, mini_iconp=0x7ffff2923ce0, ideal_mini_width=16, ideal_mini_height=16) at xutils.c:2232
#22 0x00007f67f42ff90f in get_icons (window=window@entry=0x5618d0b18100 [WnckWindow]) at window.c:2109
#23 0x00007f67f43004af in force_update_now (window=0x5618d0b18100 [WnckWindow]) at window.c:3273
#24 0x00007f67f430175a in update_idle (data=0x5618d0b18100) at window.c:3301
#25 0x00007f67f1c22e3a in g_main_context_dispatch (context=0x5618d08bf460) at gmain.c:3154
#26 0x00007f67f1c22e3a in g_main_context_dispatch (context=context@entry=0x5618d08bf460) at gmain.c:3769
#27 0x00007f67f1c231d0 in g_main_context_iterate (context=0x5618d08bf460, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at gmain.c:3840
#28 0x00007f67f1c234f2 in g_main_loop_run (loop=0x5618d0a896d0) at gmain.c:4034
#29 0x00007f67f3bc4325 in gtk_main () at gtkmain.c:1241
#30 0x00005618cfe959ef in main (argc=2, argv=0x7ffff2923ff8) at main.c:6027

Comment 17 Jaroslav Škarvada 2016-05-24 08:27:18 UTC

Created attachment 124014 [details]
Reproducer

Reproducing the problem is easy, attaching reproducer:

$ ./test3.py &
$ ./test4.py

And the race sooner (usually very quickly) or later triggers.

Comment 18 Jaroslav Škarvada 2016-05-24 08:27:45 UTC

Created attachment 124015 [details]
Second file of reproducer

Comment 19 Jaroslav Škarvada 2016-05-24 08:32:44 UTC

Adding image = NULL fixes the problem, because the second free doesn't happen, i.e. the following check in cairo_surface.c:949 returns

if (surface == NULL ||
  CAIRO_REFERENCE_COUNT_IS_INVALID (&surface->ref_count))
return

AFAICS the &image->base is pointer to the same memory as image, it's just different pointer type. Maybe there is a better fix, e.g. to just BAIL or return some error, but this problem needs definitely to be fixed. Just ignoring it will not help anyone.

Comment 20 Alberts Muktupāvels 2016-05-24 12:14:11 UTC

(In reply to Jaroslav Škarvada from comment #19)
> AFAICS the &image->base is pointer to the same memory as image, it's just
> different pointer type. Maybe there is a better fix, e.g. to just BAIL or
> return some error, but this problem needs definitely to be fixed. Just
> ignoring it will not help anyone.

I think that BAIL-ing out is not solution...

Looking at code it looks like it was intention to try with shm first and if that fails try with other methods. BAIL-ing out we will lose chance to get image surface with other methods.

Basically this is very simple bug - double free with very simple fix. Surface was destroyed, pointer now is invalid. Setting it to NULL makes sense.

Comment 21 Jaroslav Škarvada 2016-05-24 12:19:53 UTC

(In reply to Alberts Muktupāvels from comment #20)
> (In reply to Jaroslav Škarvada from comment #19)
> > AFAICS the &image->base is pointer to the same memory as image, it's just
> > different pointer type. Maybe there is a better fix, e.g. to just BAIL or
> > return some error, but this problem needs definitely to be fixed. Just
> > ignoring it will not help anyone.
> 
> I think that BAIL-ing out is not solution...
> 
> Looking at code it looks like it was intention to try with shm first and if
> that fails try with other methods. BAIL-ing out we will lose chance to get
> image surface with other methods.
> 
> Basically this is very simple bug - double free with very simple fix.
> Surface was destroyed, pointer now is invalid. Setting it to NULL makes
> sense.

In this case no other method will succeed, because the pixmap doesn't exist.

Comment 22 Alberts Muktupāvels 2016-05-24 12:23:03 UTC

(In reply to Jaroslav Škarvada from comment #21)
> (In reply to Alberts Muktupāvels from comment #20)
> > (In reply to Jaroslav Škarvada from comment #19)
> > > AFAICS the &image->base is pointer to the same memory as image, it's just
> > > different pointer type. Maybe there is a better fix, e.g. to just BAIL or
> > > return some error, but this problem needs definitely to be fixed. Just
> > > ignoring it will not help anyone.
> > 
> > I think that BAIL-ing out is not solution...
> > 
> > Looking at code it looks like it was intention to try with shm first and if
> > that fails try with other methods. BAIL-ing out we will lose chance to get
> > image surface with other methods.
> > 
> > Basically this is very simple bug - double free with very simple fix.
> > Surface was destroyed, pointer now is invalid. Setting it to NULL makes
> > sense.
> 
> In this case no other method will succeed, because the pixmap doesn't exist.

Is this only case when XShmGetImage can fail?

Comment 23 Jaroslav Škarvada 2016-05-24 12:34:02 UTC

(In reply to Alberts Muktupāvels from comment #22)
> (In reply to Jaroslav Škarvada from comment #21)
> > (In reply to Alberts Muktupāvels from comment #20)
> > > (In reply to Jaroslav Škarvada from comment #19)
> > > > AFAICS the &image->base is pointer to the same memory as image, it's just
> > > > different pointer type. Maybe there is a better fix, e.g. to just BAIL or
> > > > return some error, but this problem needs definitely to be fixed. Just
> > > > ignoring it will not help anyone.
> > > 
> > > I think that BAIL-ing out is not solution...
> > > 
> > > Looking at code it looks like it was intention to try with shm first and if
> > > that fails try with other methods. BAIL-ing out we will lose chance to get
> > > image surface with other methods.
> > > 
> > > Basically this is very simple bug - double free with very simple fix.
> > > Surface was destroyed, pointer now is invalid. Setting it to NULL makes
> > > sense.
> > 
> > In this case no other method will succeed, because the pixmap doesn't exist.
> 
> Is this only case when XShmGetImage can fail?

I guess it can also fail if there is no MIT-SHM extension and maybe in other cases. These are cases not causing the double free, because some other method probably steps in and the pixmap is valid in such cases.

But I think the proposed fix is dirty. It relies on the safety check inside the cairo_surface_destroy. Cleanly written code shouldn't do this. The control flow should never get into the cairo_surface_destroy for the second time, that's why I wrote "maybe there is a better fix". But this is definitely question for the upstream maintainers.

Comment 24 Alberts Muktupāvels 2016-05-24 12:49:07 UTC

(In reply to Jaroslav Škarvada from comment #23)
> But I think the proposed fix is dirty. It relies on the safety check inside
> the cairo_surface_destroy. Cleanly written code shouldn't do this. The
> control flow should never get into the cairo_surface_destroy for the second
> time, that's why I wrote "maybe there is a better fix". But this is
> definitely question for the upstream maintainers.

I don't want to agree on this.

Check what could happen if it is called this way:
_get_image_surface (..., ..., FALSE);

I am too lazy to count, but there are definitely multiple paths that could end up calling cairo_surface_destroy (&image->base); when image is still NULL.

Think about this way - setting it to NULL after destroying is same as if that function would have been called with try_shm = FALSE.

Comment 25 Jaroslav Škarvada 2016-05-24 12:58:09 UTC

(In reply to Alberts Muktupāvels from comment #24)
> (In reply to Jaroslav Škarvada from comment #23)
> > But I think the proposed fix is dirty. It relies on the safety check inside
> > the cairo_surface_destroy. Cleanly written code shouldn't do this. The
> > control flow should never get into the cairo_surface_destroy for the second
> > time, that's why I wrote "maybe there is a better fix". But this is
> > definitely question for the upstream maintainers.
> 
> I don't want to agree on this.
> 
> Check what could happen if it is called this way:
> _get_image_surface (..., ..., FALSE);
> 
> I am too lazy to count, but there are definitely multiple paths that could
> end up calling cairo_surface_destroy (&image->base); when image is still
> NULL.
> 
> Think about this way - setting it to NULL after destroying is same as if
> that function would have been called with try_shm = FALSE.

See it this way:

- what about setting the surface NULL in the cairo_surface_destroy after destroying it? It will fix all these issues, but it is apparently not the right way how to fix such bugs.

I think that all the wrong paths leading to the second call of the cairo_surface_destroy should be fixed/cleaned. But I am not upstream, so it's irrelevant what I am thinking about it.

Comment 26 Alberts Muktupāvels 2016-05-24 13:22:38 UTC

(In reply to Jaroslav Škarvada from comment #25)
> (In reply to Alberts Muktupāvels from comment #24)
> > (In reply to Jaroslav Škarvada from comment #23)
> > > But I think the proposed fix is dirty. It relies on the safety check inside
> > > the cairo_surface_destroy. Cleanly written code shouldn't do this. The
> > > control flow should never get into the cairo_surface_destroy for the second
> > > time, that's why I wrote "maybe there is a better fix". But this is
> > > definitely question for the upstream maintainers.
> > 
> > I don't want to agree on this.
> > 
> > Check what could happen if it is called this way:
> > _get_image_surface (..., ..., FALSE);
> > 
> > I am too lazy to count, but there are definitely multiple paths that could
> > end up calling cairo_surface_destroy (&image->base); when image is still
> > NULL.
> > 
> > Think about this way - setting it to NULL after destroying is same as if
> > that function would have been called with try_shm = FALSE.
> 
> See it this way:
> 
> - what about setting the surface NULL in the cairo_surface_destroy after
> destroying it? It will fix all these issues, but it is apparently not the
> right way how to fix such bugs.
> 
> I think that all the wrong paths leading to the second call of the
> cairo_surface_destroy should be fixed/cleaned. But I am not upstream, so
> it's irrelevant what I am thinking about it.

When try_shm == FALSE then cairo_surface_destroy still can be called when image == NULL and it will be first call...

Safety check in cairo_surface_destroy most likely is to save code / lines. It is easier to write simply cairo_surface_destory (surface); then if (surface) cairo_surface_destroy (surface);

Since cairo_surface_destroy is created NULL safe the problem is not that it is called second time. Problem is that it is called with already destroyed surface.

Comment 27 Jaroslav Škarvada 2016-05-24 13:35:23 UTC

(In reply to Alberts Muktupāvels from comment #26)
> (In reply to Jaroslav Škarvada from comment #25)
> > (In reply to Alberts Muktupāvels from comment #24)
> > > (In reply to Jaroslav Škarvada from comment #23)
> > > > But I think the proposed fix is dirty. It relies on the safety check inside
> > > > the cairo_surface_destroy. Cleanly written code shouldn't do this. The
> > > > control flow should never get into the cairo_surface_destroy for the second
> > > > time, that's why I wrote "maybe there is a better fix". But this is
> > > > definitely question for the upstream maintainers.
> > > 
> > > I don't want to agree on this.
> > > 
> > > Check what could happen if it is called this way:
> > > _get_image_surface (..., ..., FALSE);
> > > 
> > > I am too lazy to count, but there are definitely multiple paths that could
> > > end up calling cairo_surface_destroy (&image->base); when image is still
> > > NULL.
> > > 
> > > Think about this way - setting it to NULL after destroying is same as if
> > > that function would have been called with try_shm = FALSE.
> > 
> > See it this way:
> > 
> > - what about setting the surface NULL in the cairo_surface_destroy after
> > destroying it? It will fix all these issues, but it is apparently not the
> > right way how to fix such bugs.
> > 
> > I think that all the wrong paths leading to the second call of the
> > cairo_surface_destroy should be fixed/cleaned. But I am not upstream, so
> > it's irrelevant what I am thinking about it.
> 
> When try_shm == FALSE then cairo_surface_destroy still can be called when
> image == NULL and it will be first call...
> 
> Safety check in cairo_surface_destroy most likely is to save code / lines.
> It is easier to write simply cairo_surface_destory (surface); then if
> (surface) cairo_surface_destroy (surface);
> 
> Since cairo_surface_destroy is created NULL safe the problem is not that it
> is called second time. Problem is that it is called with already destroyed
> surface.

I am afraid you don't understand what I tried to express. And I am also afraid that just repeating arguments will not move this bugzilla further.

For upstream maintainers: there are backtraces, reproducer and clear insight what's going there. Could you take a look?

Comment 28 Jaroslav Škarvada 2016-05-31 13:37:51 UTC

Created attachment 124212 [details] [review]
Proposed fix

Drawable may be destroyed/invalidated asynchronously before control flow reaches _get_image_surface function or at any time the code from _get_image_surface is executed - this may lead to undesired effects like double free (which is prevented by assert), X server errors, crashes. The failing path is e.g. the following:

- XShmGetImage @797 fails because the drawable doesn't exist
- cairo_surface_destroy @809 destroys the image (now there is no image)
- it tries XGetImage @818, but fails so it sets surface->use_pixmap = CAIRO_ASSUME_PIXMAP (it wrongly suppose the surface is a window)
- now it tries to create pixmap @827, but it fails because the drawable doesn't exist
- the check @879 passes and sets the error to CAIRO_STATUS_NO_MEMORY (which is probably not the correct error, but that's not the main problem)
- the control flow is redirected to the BAIL @881
- the BAIL code started @1007
- it checks ximage and destroys it if it exists @1008
- the status is CAIRO_STATUS_NO_MEMORY so the check @1013 passed
- boom! the image is destroyed second time @1014

I think the main problem is that the two BAIL jumps (@848 and @881) doesn't count with the fact that there is no image at the moment. The rest of the BAIL jumps are OK, because there is already some image created.

The attached patch is my attempt to fix the problem. Not the most elegant approach, but it seems to work for me. It also adds X error NOOP handler to X calls which may fail due to the non-existent drawable/pixmap not to SIGTRAP the code. With the patch applied the attached reproducer doesn't fail.

Comment 29 Jaroslav Škarvada 2016-06-10 12:41:44 UTC

Sorry, is there any "special way" how to get feedback for the patch from comment 28? I am running patched cairo for more than a month and haven't recognized any problem. I am also ready to rewrite the patch according new comments.

Comment 30 Wolfgang Ulbrich 2016-06-13 19:33:33 UTC

Thanks Jaroslav,
i can confirm that your 'proposed fix' fixes the issue too.
With cairo-1.14.6 rebuild with your patch mate-applets (gtk3 version) don't crash if i switch the compositor off.
Tested with fedora 24.

Dear upstream, again, please take a look at this report again.
You have a POC file and 2 solutions which fixes the issue.
Maybe you think we should file out more downstream reports?
MATE desktop will switch to gtk+3 in all major distros (fedora, ubuntu, debian, OpenSuse, etc.), in result downstream cairo maintainer will get tons of reports about this issue.
Do you really want to blame cairo downstream maintainer?

Comment 31 Jaroslav Škarvada 2016-06-14 08:41:36 UTC

(In reply to Wolfgang Ulbrich from comment #30)
NP. The second patch is also more robust. Try the stress reproducer from comment 17 and comment 18, the "proposed fix" should withstand the test without crash at least.

BTW I am still running the patched cairo on production machines without problem.

I wonder what we did wrong. There is reproducer, analysis of the problem and patch.

Comment 32 Bryce Harrington 2016-06-16 17:48:20 UTC

I can quibble over patch style but the fix seems on the right path - don't destroy image->base if it wasn't created or was already destroyed.

What I'm not sure about is adding a second goto bail out point in the code.  There are situations where bail outs are appropriate and add clarity to the code, but this routine is so convoluted already I don't think that would be the case here.

What about setting "image = NULL" after the destruction and then just adding an "if (image)" before the cairo_surface_destroy call in BAIL?

Comment 33 Alberts Muktupāvels 2016-06-16 18:44:24 UTC

(In reply to Bryce Harrington from comment #32)
> What about setting "image = NULL" after the destruction and then just adding
> an "if (image)" before the cairo_surface_destroy call in BAIL?

? I guess this is exactly what my patch does, no? cairo_surface_destroy is NULL safe so there is no need to add 'if (image)', but if you want I can update patch.

Comment 34 Jaroslav Škarvada 2016-06-17 09:45:09 UTC

(In reply to Alberts Muktupāvels from comment #33)
> (In reply to Bryce Harrington from comment #32)
> > What about setting "image = NULL" after the destruction and then just adding
> > an "if (image)" before the cairo_surface_destroy call in BAIL?
> 
> ? I guess this is exactly what my patch does, no? cairo_surface_destroy is
> NULL safe so there is no need to add 'if (image)', but if you want I can
> update patch.

Yup, we are at the beginning. The 'image = NULL' will do the job as well as stated several times in this BZ. But from the comments above and code inspection (you mostly don't set the pointers to NULLs after destruction, but rather use assertion) I understand that you want the code path/algorithm clean and that that's the reason of the assertion - to detect and clean all wrong paths.

But even with the image = NULL, you still need to add NOOP error handlers to X calls, because they may still fail with 'invalid pixmap'. The X error handling is asynchronous but mostly with assertion somewhere, so without the NOOP error handlers it can also coredump sooner or later.

Comment 35 Jaroslav Škarvada 2016-06-17 14:42:38 UTC

Created attachment 124576 [details] [review]
Proposed fix

I received report, that the provided reproducer can still crash cairo.

The problem is that cairo doesn't count that the drawable can disappear any time during processing. So it can call X calls on invalid drawable. Which is beginning of all troubles. The attached patch added more NOOP handlers to X calls which can be called with invalid drawable.

I am not sure whether it is the right way to go or whether it's known cairo limitation that the drawable mustn't disappear. Maybe this should be fixed in wnck. But it needs to be fixed somewhere. The problem is very easy to trigger - it's enough if an X application just removes or simply changes it's icon in a wrong time.

Comment 36 Bill Spitzak 2016-06-17 17:21:11 UTC

Temporarily setting the X error handler like that does not work. The value is used when the error event is received, there is no indicator as to what error handler was in effect when the message that caused the error was queued.

Comment 37 Alberts Muktupāvels 2016-06-17 17:31:29 UTC

This bug is/was about assert that is caused by calling cairo_surface_destroy on already destroyed image. Setting image to NULL after destroying it is enough to fix this.

This is already tested by several users/developers. Can someone of cairo developers apply this fix? Please!

Comment 38 Wolfgang Ulbrich 2016-06-17 19:04:25 UTC

(In reply to Alberts Muktupāvels from comment #37)
> This bug is/was about assert that is caused by calling cairo_surface_destroy
> on already destroyed image. Setting image to NULL after destroying it is
> enough to fix this.
> 
> This is already tested by several users/developers. Can someone of cairo
> developers apply this fix? Please!

Sorry Albert,
did you ever try the reproducer?
Cairo crashed at several places with your patch.
Also libwnck3-3.18.0 is in stacktraces.
Simply applying your patch is not enough.

Comment 39 Alberts Muktupāvels 2016-06-17 21:13:15 UTC

(In reply to Wolfgang Ulbrich from comment #38)
> (In reply to Alberts Muktupāvels from comment #37)
> > This bug is/was about assert that is caused by calling cairo_surface_destroy
> > on already destroyed image. Setting image to NULL after destroying it is
> > enough to fix this.
> > 
> > This is already tested by several users/developers. Can someone of cairo
> > developers apply this fix? Please!
> 
> Sorry Albert,
> did you ever try the reproducer?

No, but just did... Without patch I lost whole session, crashed both gnome-panel and metacity.

With patch only gnome-panel crashed with different error.

> Cairo crashed at several places with your patch.
> Also libwnck3-3.18.0 is in stacktraces.
> Simply applying your patch is not enough.

It is enough to fix bug that I reported. This bug is about assert not about possible x errors. Lets not mix separate bugs in one report.

X errors can be handled in applications, if you create xlib cairo surface you can add error traps to fix problem - handle error or simply ignore. This might be or might not be problem in cairo, but probably can be fixed in applications. Anyway this is already different problem and probably should be reported as separate bug.

Comment 40 Alberts Muktupāvels 2016-06-17 21:45:00 UTC

(In reply to Wolfgang Ulbrich from comment #38)
> Also libwnck3-3.18.0 is in stacktraces.

https://bugzilla.gnome.org/show_bug.cgi?id=767803
Patch attached to this bug report fixed crash in gnome-panel that was caused by running test3.py reproducer.

Comment 41 Wolfgang Ulbrich 2016-06-18 10:30:07 UTC

Just tested on fedora 24 with
- cairo-1.14.6 +  'cairo-xlib-surface: set image to NULL after destroying it' patch
- libwnck-3.18.0 + 'add error trap in try_pixmap_and_mask' patch from libwnck report
- running test3.py reproducer

Mate wnck applets or mate-panel don't crash any more.
But the whole desktop can easily freeze if i try to open another window if test3.py is running.
This didn't happen without 'add error trap in try_pixmap_and_mask'.

Comment 42 Alberts Muktupāvels 2016-06-18 11:43:41 UTC

(In reply to Wolfgang Ulbrich from comment #41)
> Just tested on fedora 24 with
> - cairo-1.14.6 +  'cairo-xlib-surface: set image to NULL after destroying
> it' patch
> - libwnck-3.18.0 + 'add error trap in try_pixmap_and_mask' patch from
> libwnck report
> - running test3.py reproducer
> 
> Mate wnck applets or mate-panel don't crash any more.
> But the whole desktop can easily freeze if i try to open another window if
> test3.py is running.

And what do you expect if application in one second change IconPixmap 
hundreds of times?

> This didn't happen without 'add error trap in try_pixmap_and_mask'.

That is probably because of gdk_error_trap_pop. From doc - May block until an error has been definitively received or not received from the X server.

Comment 43 Uli Schlachter 2016-06-18 14:38:18 UTC

Created attachment 124589 [details] [review]
Proposed fix

Before this goes even more out of control: Anybody against this patch?

It should make Chris happy by saying that the caller still did something wrong here, having the drawable freed from beneath cairo.
It should Bryce happy, because the possible-crash he worries about is handled. Note that freeing both image and &image->base as proposed in comment 6 is wrong, since &image->base is just a hidden pointer cast. Image is the cairo_image_surface_t* and &image->base is the same thing as a cairo_surface_t*.
It should make most reporters happy since it fixes the double-cairo_surface_destroy that this book actually is about (sorry Jaroslav, cairo is currently not meant to handle this case and your patch is incomplete due to X11's asynchronous nature).

Comment 44 Wolfgang Ulbrich 2016-06-18 20:06:34 UTC

(In reply to Uli Schlachter from comment #43)
> Created attachment 124589 [details] [review] [review]
> Proposed fix
> 
> Before this goes even more out of control: Anybody against this patch?
> 
I'm fine with it, as it fixes our applet crashes on bare metal.
So please merge it for that downstream cairo maintainer can use the commit.

Comment 45 Uli Schlachter 2016-06-19 09:38:15 UTC

commit d69dd6b341594c338fa6c7b327fd7f201eb37bc1
Author: Uli Schlachter <psychon@znc.in>
Date:   Sat Jun 18 15:08:52 2016 +0200

    xlib: Fix double free in _get_image_surface()
    
    If XShmGetImage() fails, the code tries to continue with its normal,
    non-shared-memory path. However, the image variable, which was previously set to
    NULL, now points to an already-destroyed surface, causing a double-free when the
    function cleans up after itself (actually, its an assertion failure because the
    reference count of the surface is zero, but technically this is still a double
    free).
    
    Fix this by setting image=NULL after destroying the surface that this refers to,
    to make sure this surface will not be destroyed again.
    
    While we are here (multiple changes in a single commit are bad...), also fix the
    cleanup done in bail. In practice, &image->base should be safe when image==NULL,
    because this just adds some offset to the pointer (the offset here is actually
    zero, so this doesn't do anything at all). However, the C standard does not
    require this to be safe, so let's handle this case specially.
    
    Note that anything that is fixed by this change is still buggy, because the only
    reason why XShmGetImage() could fail would be BadDrawable, meaning that the
    target we draw to does not exist or was already destroyed. This patch will
    likely just cause X11 errors elsewhere and drawing to (possible) invalid
    drawables is not supported by cairo anyway. This means that if SHM fails, the
    following fallback code has a high chance of failing, too.
    
    Fixes: https://bugs.freedesktop.org/show_bug.cgi?id=91967
    Signed-off-by: Uli Schlachter <psychon@znc.in>

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.