Please have a look at http://bugs.gnome.org/show_bug.cgi?id=319509
Created attachment 3609 [details] profile with render disabled I disabled Render inside cairo-xlib-surface.c and my application returned to decent speed (i.e. doesn't feel like swimming through mud) even on Apple X11. I guess the X server doesn't matter much now, its top activity now is memcpy. This configuration has the advantage that one can see clearly what's going on and it's great that pixman is a mini-fb and has even the same function names. I produced a new profile with the main culprits. This is again operating the UI of my application, which is nothing fancy (some comboboxes, an expander with some sliders, a textview where short messages are displayed, a menu, a file chooser, an aboutbox). I hope it is useful for you. You can see that most of the requests come from GDK (and not from Pango, sorry), in particular from gdk_window_clear_backing_rect, I suppose this is in the GDK double-buffering.
Further thoughts on this topic: By using Cairo without Render, I basically shifted the CPU burden from the X server to my application. Profiling I see roughly the same CPU consumption for combined CVS X.org + my app, with and without Render. But in actual use, the application feels more responsive without Render, less latency maybe? I want to rely on Cairo (for my own code as well) & GTK 2.8 and future versions, especially for my Windows users where it works pretty well. I want the distribute the compiled binary to my Mac OS X users. Since they don't normally have those libraries, I link to them statically on Mac OS, so I have control over them. While I could bundle an X server with my application (on Mac OS X the X server is just another application), this is fragile. I hope pixman is kept fairly in sync with x.org server fb, is this true? Any optimization in the higher levels of Cairo and GDK/GTK would be welcome of course :-). This would benefit all platforms. I could help in testing and profiling.
I think that the problem I see is in part (cairo_fill is not too light either) due to: https://bugs.freedesktop.org/show_bug.cgi?id=4191 It's unfortunate that GDK 2.8 added such an expensive code path within a very sensitive function (gdk_window_clear_backing_rect) which is invoked before expose events. I opened http://bugzilla.gnome.org/show_bug.cgi?id=319604 against GDK.
I know of no reason why IN would be used by clear_backing_rect(). In all cases I'm aware of, clear_backing_rect() hits *exactly the same Xlib code paths* in 2.8 as it did in 2.6.
The code of 'gdk_window_clear_backing_rect' (in GTK+ 2.8.6 lines 1768-1793, in gdk/gdkwindow.c) calls Cairo, e.g. cairo_clip, cairo_fill. It's task is simply to clear the area to the window background. This function is called from gdk_window_begin_paint_region which in turn gets called before 'expose_event' signals for double-buffered windows. As you can see in the attached profile document, cairo_clip ultimately calls fbCombineInU if Render is disabled in Cairo or invokes the similar code on the X server through the Render extension. The former implementation was different, it used gdk_gc_set_clip_region and gdk_draw_rectangle.
The code to handle shapes that are lists of pixel-aligned rectangles should keep you from hitting that code path; wonder why that isn't working for you.
Which exact version of cairo are you using?
(In reply to comment #7) > Which exact version of cairo are you using? Both CVS and 1.0.2 act the same. Render 0.8.
(In reply to comment #6) > The code to handle shapes that are lists of pixel-aligned rectangles > should keep you from hitting that code path; wonder why that isn't working > for you. Endianess problems?
Buglet in cairo-meta-surface.c: _init_pattern_with_snapshot(), status is used uninitialized.
There's some fragile floating-point code in Cairo that gets mis-compiled with -ffast-math, at least on this compiler. Never saw this on PowerPC. I had CFLAGS='-Os -ffast-math -mcpu=7450 -mdynamic-no-pic' set for my application. Removing -ffast-math solves the problem. This compiler is gcc version 4.0.0 (Apple Computer, Inc. build 5026) equivalent ~ GNU GCC 4.0.2 Thanks for your patience with me. I'll dig up the code sequence that fools the compiler. This is very annoying :-(
The problem occurs in cairo_fixed_from_double(), but I believe there is no mis-compilation. I think the function receives a denormalized/weird number (possibly from an uninitialized memory address) and floor() silently puts it to 0. In -ffast-math mode the compiler generates the unguarded code directly and the output is a huge number in one of the calls. Still investigating...
Doing this fixes it for me (I don't understand why yet). The interface gets drawn fast and correctly. Notice it consistently rounds away from 0 for both positive and negative half-way cases. The C99 "round" function should work as well. Is this the correct behavior? cairo_fixed_t _cairo_fixed_from_double (double d) { if (d < 0.) return (cairo_fixed_t) (d * 65536 - .5); else return (cairo_fixed_t) (d * 65536 + .5); } No clue yet why the original breaks. It may have to do with the fact that "d * 65536 + .5" is generated as a FMA (fused multiply-add) instruction which doesn't do an intermediate rounding operation.
In fact round() should be used because it takes care of the predecessor of 0.5 and others. See: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18099
I'm glad to see some more close examination of _cairo_fixed_from_double. For such an elementary function, it sure needs (and is getting) an awful lot of attention. With your current eye on correctness (and your environment in which the current code fails) could you experiment with the correctness of the following proposed performance optimization for _cairo_fixed_from_double: http://thread.gmane.org/gmane.comp.lib.cairo/5228 As for round, we could certainly use that when available. But so far, we haven't made C99 a hard requirement for compiling cairo. So, like the performance patch I linked to above, using round would also require appropriate configure-time checks for suitability. (Though the round check would be a lot easier than the checks for IEEE double-precision floating-point necessary for the performance optimization I think).
For the record, it was a bug in the compiler, although I still cannot explain how I got those huge numbers... I made a test case and reported it to Apple. Now for the topic of rounding. I would say first you should define exactly your definition of rounding which is appropriate for your problem domain. It should be specified unambiguously. For example, according to the definition of the round() function, half-way numbers are rounded away from zero (this is the intention of the code in comment #13, but doesn't work for all input, see comment #14). An alternate specification which is statistically unbiased: half-way numbers are rounded to the closest even integer. This is what CPUs usually implement, what rint() does when the rounding mode is FE_TONEAREST (this function is inappropriate for a library, because the application can change the rounding mode) and what Mathematica does (see http://mathworld.wolfram.com/ NearestIntegerFunction.html). I read before the conversation you linked to :-). I had for long time a variant of that code, see below, which I think is more resilient to breaking when the compiler becomes more and more aggressive in its optimizations. Also, for Pentium4, I think it's not that faster anymore, you may need to run some tests with -march=pentium4 flag for gcc. Unfortunately those kind of definitions don't work for certain values. And it's worrying that they fail around +- 0.5 which I think are interesting values for Cairo. You can test with the program below (you may need to compile with -std=c99 for gcc). predecessor of 0.5 = + (.5 - 0x1p-54) successor of -0.5 = - (.5 - 0x1p-54) The code in comment #13 fails for both, current Cairo for the predecessor of 0.5 and the other trick for the successor of -0.5. The round() function, while technically required by C99, it was implemented for long time in Linux, BSDs, other Unices and even Windows. I think it is much better tested and has a crystal clear definition. It may happen that gcc in the future will replace it with inline code (hopefully not buggy like it happened to me). Cairo is already critical infrastructure, please make it fail-proof and idiot-proof (that's me :-)) With my unfortunate experience that started this thread, I would be cautious. And, btw, thank you Carl et al for this brilliant piece of software. #include <math.h> #include <stdio.h> /* big endian #define iman 1 */ /* little endian */ #define iman 0 int round1(const double val) { static union { double d; int i[2]; } u; u.d = val + 68719476736. * 1.5; return (u.i[iman]) >> 16; } int round2(const double val) { if (val < 0.) return (int) (val - .5); else return (int) (val + .5); } int round3(const double val) { return (int) floor(val + .5); } int main(void) { double p = + (.5 - 0x1p-54); double m = - (.5 - 0x1p-54); printf("%d %d\n", round1(p), (int) round(p)); printf("%d %d\n", round1(m), (int) round(m)); return 0; }
Actually, apparently it's not that long ago that round() started to show up :-(
(In reply to comment #16) > For the record, it was a bug in the compiler, Ah, at least one mystery solved then. > Now for the topic of rounding. I would say first you should define exactly your definition of rounding > which is appropriate for your problem domain. It should be specified unambiguously. Yes. That is good, sound advice. We've got a large pass of "robustification" to be done with the upcoming rewrite of the tessellator. This rounding will also be an important part of that. [Snipped many good insights and comments which will be useful---thanks!] > Cairo is already critical infrastructure, please make it fail-proof and idiot-proof (that's me :-)) Hey, I'm one of the biggest idiots involved with this project. (And I think I can back that claim up with bugzilla comments.) > With my unfortunate experience that started this thread, I would be cautious. Caution, at least, is something I can be good at. You may notice that long before the optimization patch I linked to above, we've had a patch proposed for making this function a macro. I've been waiting for better proof that we know we have implemented what we want before making a macro out of it. > And, btw, thank you Carl et al for this brilliant piece of software. For what it's worth, you're quite welcome. I'll pass the "brilliant" portion of your thanks on to Keith. It's all his ideas originally. Thank you for your contributions, (and I'm glad you've got cairo and GTK+ running at more reasonable speeds now). -Carl
I took a closer look at what Bill Spitzak actually proposed, i.e. it does rounding of 65536 * val, and I think the code is correct. The one I put in comment #16 does actually chopping like on http:// www.stereopsis.com/FPU.html. Now I _really_ feel like an idiot. The rounding is done in the sense of second definition in comment #16, i.e. half-way numbers are rounded to the closest even. It's easy to overflow this operation, so some kind of input checking is useful. I reported I was getting huge numbers as input to _cairo_fixed_from_double(), it was because of the code mentioned in https:// bugs.freedesktop.org/show_bug.cgi?id=2675, comment #9: y_intersect = _cairo_fixed_from_double ((b2 - b1) / (m1 - m2)); I think is better to do the operation through a union, but check with a C language lawyer. So the code would be like that: #if __BIG_ENDIAN_ #define iman_ 1 #else #define iman_ 0 #endif cairo_fixed_t _cairo_fixed_from_double (double d) { union { double d; int32_t i[2]; } u; u.d = d + 68719476736. * 1.5; return u.i[iman_]; } Sorry for any confusion I caused.
Since it looks there is some interest around cairo_fixed_from_double() let me add some random bits: - there is an update (http://www.stereopsis.com/sree/fpu2006.html) of that oft-cited stereopsis page - IMHO the "Banker's Rounding" (half-way to even) method is not appropriate for graphics, I suggest the round() method (half-way away from zero) - in my benchmarks, on a desktop CPU (PowerPC G4), of drawing SVGs consiting only of text at different scales (via librsvg) more time is spent in the 64 bit division in compute_x() of cairo_traps.c (called by cairo_traps_tessellate_polygon()). Of course, the x86 assembler optimization of David Turner will not help the other CPUs. I also see the qsort() of cairo_traps_tessellate_polygon() taking more cycles overall than cairo_fixed_from_double(). A better algorithm for cairo_traps_tessellate_polygon() would contribute much more than micro-optimizations, I guess. Again I'm speaking from a desktop user point of view (CPU with hardware FPU).
An optimized _cairo_fixed_from_double() was implemented based on the stereopsis example (with the caveat that banker's rounding is still used), Carl completely rewrote the tessellator based on John Hobby's thesis (with the caveat that there is still room for performance tweaks, but the algorithm itself is state-of-the-art) and its difficult to beat the library qsort (and we can't currently avoid it) - and the compute_x() in cairo-traps.c is no more. Please keep profiling and reporting hotspots - though sadly, it seems all the low-hanging fruit in cairo has been picked. Though there is interesting work afoot in areas like polygon rasterisation and utilising different hardware architectures (a high-level OpenGL interface, as opposed to the low-level implementation of XRender over OpenGL). But in order to demonstrate that any of this is worthwhile we need real-world profiles/benchmarks...
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.