Created attachment 134564 [details] gdb stacktrace Hello, We use Cairo & Pango for rendering in multiple threads. No memory is shared between the threads. The rendering is done on independent image surfaces. I have seen crashes with similar looking stacktraces for a while and think finally got to the bottom of it. I think it happens due to a race condition in libcairo. The program segfaults while trying to call NULL function pointer here https://cgit.freedesktop.org/cairo/tree/src/cairo-traps-compositor.c?h=1.14#n2314 I found that `compositor` points to a static structure and its contents are half initialized. At the same time, there is another thread in https://cgit.freedesktop.org/cairo/tree/src/cairo-image-compositor.c?h=1.14#n1259 in a process of initializing `static cairo_traps_compositor_t compositor;` which is used in the first thread. I'm attaching gdb stacktrace as an example of the crash. Looking at the code there are a couple of other places where libcairo lazily initializes it's statically allocated memory. Is there a way to "prepare" the library and initialize it's structures and static mutexes to "workaround" the race conditions like the one I'm facing right now? The libraries used libcairo 1.14.0, libpango1.0-0 1.36.8
Created attachment 134702 [details] [review] Fix the race around accessing cairo_image_traps_compositor
The proposed patch fixes the race. It needs more work as I found 12 more places where similar racy lazy initialization is used. If this approach is good to go, I will prepare a full patch.
If there are 12 more places that need the same treatment maybe we should create _cairo_spin_lock()/_cairo_spin_unlock() functions. The indentation is messed up. See CODING_STYLE.
Created attachment 134709 [details] [review] Fix the race around accessing cairo_image_traps_compositor
@Adrian thanks for looking at it. Updated patch introduces _cairo_atomic_once_init_enter/_cairo_atomic_once_init_leave functions and fixes indentation according to the style guide. How does it look now?
> How does it look now? Would be fine by me, but does this really need a struct with two ints? Just have 0=UNINITIALIZED, 1=INITIALIZING, and 2=INITIALIZED and one int is enough. (Yes, this makes the implementation a bit more complicated). Also, doesn't the if(likely(once->completed)) at the beginning of the function need an _cairo_atomic_int_get() around it?
(In reply to Uli Schlachter from comment #6) > Also, doesn't the if(likely(once->completed)) at the beginning of the > function need an _cairo_atomic_int_get() around it? My colleague and me discussed it for quite a while, as it was not obvious at the first glance. The conclusion was that it should be ok. If for whatever reason it is not True after we read it, all the following reads/writes to it are atomic, so we should be fine. Another benefit of this implementation compared to the proposed earlier spinlock is that after initialization is finished we don't use slower atomic loads. So performance wise it is exactly the same as the code that used to be there before. > Would be fine by me, but does this really need a struct with two ints? Just > have 0=UNINITIALIZED, 1=INITIALIZING, and 2=INITIALIZED and one int is > enough. (Yes, this makes the implementation a bit more complicated). I agree I will change the patch.
_cairo_atomic_init_once_enter / _cairo_atomic_init_once_leave would be better names.
> My colleague and me discussed it for quite a while, as it was not obvious at the first glance. The conclusion was that it should be ok. If for whatever reason it is not True after we read it, all the following reads/writes to it are atomic, so we should be fine. (Disclaimer: I have never actually used C11 atomics and I certainly am no expert on them, I just 'read some stuff') The only formal semantics for this kind of thing are C11 atomics (right?). There, you may never read an atomic variable non-atomicly (you are not even guaranteed to read a value that was written, it might be some intermediate garbage). Thus, this would at least need to be a relaxed atomic. However, a relaxed read is too weak: It is allowed to read the new value for the atomic variable, but continue to read other older memory for other locations. I.e. it is allowed that the current thread sees that initialisation was done, but does not see the actual memory stores of the initialisation (and AFAIK there are architectures were this can actually happen). Thus, in C11 semantics, this needs at least an acquire read (I'll ignore consume here, everyone else does) (and the memory store that sets the atomic variable needs to be at least a release store). Now, cairo does not use C11 atomic operations (yet). However, the C11 semantics are the way they are since there are architectures that allow such behaviour. Thus, something needs to be atomic here. So, short version: I disagree with the above and I think that this needs a _cairo_atomic_int_get. (Also: +1 for Adrian's proposed name change)
Created attachment 134722 [details] [review] Surround initialisations with atomic critical section
(In reply to Uli Schlachter from comment #9) > So, short version: I disagree with the above and I think that this needs a > _cairo_atomic_int_get. > > (Also: +1 for Adrian's proposed name change) After sleeping on it, I agree with you, it should be _cairo_atomic_int_get. The updated patch includes proposed changes and wraps in _cairo_atomic_init_once_enter/_cairo_atomic_init_once_leave all lazy initializations inside *compositor_get functions I could find.
Looks good. Just a minor nitpick - use TRUE/FALSE instead of 1/0 when returning the boolean in _cairo_atomic_init_once_enter.
(In reply to Adrian Johnson from comment #12) > Looks good. > > Just a minor nitpick - use TRUE/FALSE instead of 1/0 when returning the > boolean in _cairo_atomic_init_once_enter. I get "use of undeclared identifier 'FALSE'" erorr after compiling. And I can't find *.h file with macros, can you point me where they are defined? BTW there is while(0) loop in this file.
TRUE/FALSE is defined in cairoint.h. But we don't want to include that in cairo-atomic-private.h. So leave it as is. The while(0) is intentional: https://stackoverflow.com/questions/154136/why-use-apparently-meaningless-do-while-and-if-else-statements-in-macros
+ assert (0 && "incorrect use of _cairo_atomic_init_once API (once != CAIRO_ATOMIC_ONCE_INITIALIZING)"); I think it would be better for the error message to read "use of _cairo_atomic_init_once_leave without _cairo_atomic_init_once_enter". Unless you really see another way for that assert to be triggered. Also it is an annoying fact of life that on the x86 the fastest way to read an atomic variable "works" in a very useful way: Thread 1: // x is a large structure, y is a lock-free atomic x = foo; atomic_set(y, 1); // never set y again Thread 2: // do not read x here while (y != 1) { // non-atomic read! // code that does other atomic operations } assert( x == foo ); // this works Vast amounts of software assumes this works. The trick is that there is no synchronization, but an assumption that the only way for y==1 to be true is for a sync to have happened since the atomic_set. It does not seem like any version of the atomic operations produce the same instructions as the non-atomic read in GCC. All of them produce a "sync" (except "relaxed" but that clearly states the assert could fail). I am certainly not a mt expert, but I have seen people complain about this with the atomic operations plenty of times to see if anybody can answer this with one of these answers: 1. Everybody is wrong and the above code can fail on Intel-style processors. 2. The way you get the above in a form that can be ported to other platforms is to do "<code>". 3. The overhead of the sync is zero (not impossible, the compiler could optimize it out by seeing that the test is in a loop). 4. Answer 2 or 3 except it is not true because GCC has a bug. Probably not a big deal for this bit of Cairo, but this can add up, and there seem to be some experts here as well as people asking this question, so maybe somebody here can answer it. , and also the docs claim that you should use "release" when setting the variable and mixing modes will screw things up, and this code does not want to do that as it is reusing code that makes more strict assumptions about the setting of y. , the missing assumption is that if the sync has not completed then you will not see a value of y==1. . I am guessing that "aquire" is supposed to do this but it is not doing it. And it does not seem to correspond exactly to any atomic model. Maybe "aquire" except the aquire semantics only apply if a particular value is read. This code does rely on the fact that x is only referred to in the if statement to prevent any out-of-order read of x before y is read.
Please ignore garbage my browser editor left on the end of the comment.
(In reply to Bill Spitzak from comment #15) > I think it would be better for the error message to read "use of _cairo_atomic_init_once_leave without _cairo_atomic_init_once_enter". I like it. Sound better. > 1. Everybody is wrong and the above code can fail on Intel-style processors. It can fail if compiler decided to optimize access to "y" and reads it only once before entering the loop. Let's assume compiler didn't do it and that 'atomic_set' in the example code is the same as 'atomic_store' from C11. Then it is not portable but will work fine on Intel processors. Because regular 'mov' already gives us 'release-acquire' semantics. > 2. The way you get the above in a form that can be ported to other platforms > is to do "<code>". Thread 1: /* We want to be 100% sure that all writes to 'x' are visible after we set * 'y' to 1 in all other threads. The only semantics that gives us * this guarantee is memory_order_seq_cst, thats why we use atomic_store and * not anything weaker */ x = foo; atomic_store(&y, 1); // never set y again Thread 2: /* We only care about the value of 'y' here, because the code above * guarantees us that all writes to x will be visible to all other threads as * soon as y = 1. Therefore reading with memory_order_acquire is enough */ while (atomic_load_explicit (&y, memory_order_acquire) != 1) { // code that does other atomic operations } assert( x == foo ); Having said that we can rewrite _cairo_atomic_init_once_enter/_cairo_atomic_init_once_leave in C11: static cairo_always_inline cairo_bool_t _cairo_atomic_init_once_enter(cairo_atomic_once_t *once) { /* The thread writing to 'once' (_cairo_atomic_init_once_leave) should * guarantee visibility of all the writes happened before it, that's * why memory_order_acquire */ if (likely(atomic_load_explicit (once, memory_order_acquire) == CAIRO_ATOMIC_ONCE_INITIALIZED)) return 0; if (atomic_compare_exchange_strong_explicit(once, CAIRO_ATOMIC_ONCE_UNINITIALIZED, CAIRO_ATOMIC_ONCE_INITIALIZING, memory_order_acq_rel, memory_order_acquire)) return 1; while (atomic_load_explicit (once, memory_order_acquire) != CAIRO_ATOMIC_ONCE_INITIALIZED) {} return 0; } static cairo_always_inline void _cairo_atomic_init_once_leave(cairo_atomic_once_t *once) { /* All writes before we enter here must be visible to all other threads, * that's why memory_order_seq_cst and nothing weaker */ if (unlikely(atomic_compare_exchange_strong_explicit(once, CAIRO_ATOMIC_ONCE_INITIALIZING, CAIRO_ATOMIC_ONCE_INITIALIZED, memory_order_seq_cst, memory_order_acquire))) assert (0 && "incorrect use of _cairo_atomic_init_once API (once != CAIRO_ATOMIC_ONCE_INITIALIZING)"); }
http://en.cppreference.com/w/c/atomic/memory_order makes me think I'm mistaken. everywhere I used memory_order_seq_cst, can be replaced with memory_order_release. Because: memory_order_release - All writes in the current thread are visible in other threads that acquire the same atomic variable memory_order_acquire - All writes in other threads that release the same atomic variable are visible in the current thread It is very hard to wrap my head around the topic. I think we are still fine with the patch as all the operations (load/store/cmp_exchg) used in the patch have the strongest semantic (memory_order_seq_cst).
>> 1. Everybody is wrong and the above code can fail on Intel-style processors. >It can fail if compiler decided to optimize access to "y" and reads it only once before entering the loop. The problem with that argument is that there are atomic operations inside the loop, including normal sequential fences. The compiler must re-read 'y' after the loop contents execute or it is wrong. The argument *does* hold if you use 'x' instead of 'y', it could read an old x, then get a new y==1 and think it's copy of x is ok (the loop is not run so it does not need to re-read x). In practice x is either written to by code inside the loop, or is pointed to by y, so that it is unlikely for an optimizer to do this. But this is a good reason to write something to force it not to happen, like your suggestion: >while (atomic_load_explicit (&y, memory_order_acquire) != 1) { That is what I thought would work but quick test reveals that it produces a sync instruction before the access. Unless I am really confused the sync should not be necessary, since the worst that could happen is it would get the old value of y and think x is no good yet even though it is, and correctly-written code must already handle this. Possibly I need to turn on optimization? BTW the current code is fine, or the release/aquire version (which seems to compile to the same thing for me but might be the actual "correct" code). Just trying to get an answer to this question as I see this all the time and there is a very strong desire by programmers to make the overhead of run-once code minimal after it is run the first time, so the temptation to remove the atomic operation which seems to not be required needs an extremely convincing argument against it.
(In reply to Bill Spitzak from comment #15) > Thread 2: > // do not read x here > while (y != 1) { // non-atomic read! > // code that does other atomic operations > } > assert( x == foo ); // this works That is relying on undefined behavior. Future versions of the compiler may decided to optimize away second and subsequent reads. > Vast amounts of software assumes this works. As we saw with the gcc integer overflow optimizations many years ago, the compiler developers will be unsympathetic to this argument and point you to the spec. I had a look around for performance figures on atomic reads and came across this: https://htor.inf.ethz.ch/publications/img/atomic-bench.pdf I don't think the cost of an atomic read is worth trying to optimize. In any case, it is the job of the compiler to implement the read in the fastest way possible.
> Thread 2: > // do not read x here > while (y != 1) { // non-atomic read! > // code that does other atomic operations > } > assert( x == foo ); // this works >That is relying on undefined behavior. Future versions of the compiler may >decided to optimize away second and subsequent reads. That is not exactly the problem. If there is a release fence inside the loop it MUST re-read y on the next iteration. The actual problem is that it could pre-load x. It must re-load x if the loop is executed, but if the loop is not run (because y==1 initially) then it does not have to reload it, and therefore it can get an incorrect version of x. However I have run some more tests and it now looks like gcc (4.8.2) produces identical code for "if (y)" and "if (__atomic_load_n(&y,__ATOMIC_X))" for all valid values of X, even SEQ_CST (!). I was basing my comments on attempts to use the __sync operations. Absolutely I would recommend using __atomic in all code from now on (if in fact gcc is correct, I am surprised there is no sync or other added instruction). It seems correct to use the acquire/release types for this init flag, but since it produces identical code on x86 there could be a worry that this will be insufficiently tested. This is my test code (compiled with -S -O7 -lthread) which seems to show no difference (change the if statement to get other versions): int y; int x1; int x2; int foo() { if (__atomic_load_n (&y, __ATOMIC_SEQ_CST)) return x1; else return x2; }
I wanted to prepare libcairo package with the fix. We are still catching the crash in CI. Before doing that wanted to be sure that there are no undesirable side effects. So I run `make check` and the tests failed for me even without the patch. Do I need any special packages to run full test suite? When do you guys think the patch will be pushed to master branch?
(In reply to Mikhail Fludkov from comment #22) > I wanted to prepare libcairo package with the fix. We are still catching the > crash in CI. Before doing that wanted to be sure that there are no > undesirable side effects. So I run `make check` and the tests failed for me > even without the patch. The image tests should mostly work but the other backends have a lot of failures as it is very sensitive to the test environment (fonts, version of freetype, gs, poppler, and rsvg). The way around it is to capture the output of a run of the test suite before your changes, then use this output as the reference images for testing your changes. This is documented in the README in the test dir. > > Do I need any special packages to run full test suite? When do you guys > think the patch will be pushed to master branch? Unless any one has any more comments, I will run the test suite over it this weekend then push it.
(In reply to Adrian Johnson from comment #23) > Unless any one has any more comments, I will run the test suite over it this > weekend then push it. I will wait until then. Thank you!
Comment on attachment 134722 [details] [review] Surround initialisations with atomic critical section Review of attachment 134722 [details] [review]: ----------------------------------------------------------------- +1 for the patch, thanks! > However I have run some more tests and it now looks like gcc (4.8.2) produces identical code for "if (y)" and "if (__atomic_load_n(&y,__ATOMIC_X))" for all valid values of X, even SEQ_CST (!). I was basing my comments on attempts to use the __sync operations. Absolutely I would recommend using __atomic in all code from now on (if in fact gcc is correct, I am surprised there is no sync or other added instruction). > > It seems correct to use the acquire/release types for this init flag, but since it produces identical code on x86 there could be a worry that this will be insufficiently tested. According to the second result Google gave me, this is correct: https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
I ran make test on image,xlib,pdf then used the output as the reference to test the patch. I ran make -j8 test with the patch. image and pdf are fine. xlib tests showed a few differences. In some cases the output was an improvement. There are two xlib-render-rgb24 tests that looked like a regression: - device-offset-postive - mask When I ran just the device-offset-positive by itself it passed. When I ran just the mask by itself I get different output every time it is run. Uli, could you have a look. I know nothing about the xlib surfaces and don't normally run xlib tests when working on pdf/ps.
> I ran make -j8 test with the patch. Does that make a difference? The unit tests are run by cairo-test-suite and AFAIK it is single-threaded. > Uli, could you have a look. Without any patches, just current master: > 311 Passed, 240 Failed [0 crashed, 9 expected], 24 Skipped (And both device-offset and mask are among the failed ones) So, since it already fails before this patch, I'd say to just ignore it. Also, since the test suite is single-threaded (except for the pthread-* tests), this patch shouldn't make a difference anyway.
(In reply to Uli Schlachter from comment #27) > > I ran make -j8 test with the patch. > > Does that make a difference? The unit tests are run by cairo-test-suite and > AFAIK it is single-threaded. No. I realize now it won't make any difference. > So, since it already fails before this patch, I'd say to just ignore it. > Also, since the test suite is single-threaded (except for the pthread-* > tests), this patch shouldn't make a difference anyway. Pushed.
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.