Bug 4846

Summary: Performance problems for GTK+/Cairo on MacOS X
Product: cairo Reporter: Bogdan Nicula <bogdanni>
Component: xlib backendAssignee: Carl Worth <cworth>
Status: RESOLVED FIXED QA Contact: cairo-bugs mailing list <cairo-bugs>
Severity: normal    
Priority: high CC: billy.biggs, jwatt
Version: 1.1.1   
Hardware: PowerPC   
OS: Mac OS X (All)   
Whiteboard:
i915 platform: i915 features:
Attachments: profile with render disabled

Description Bogdan Nicula 2005-10-23 03:14:57 UTC
Please have a look at http://bugs.gnome.org/show_bug.cgi?id=319509
Comment 1 Bogdan Nicula 2005-10-23 23:22:08 UTC
Created attachment 3609 [details]
profile with render disabled

I disabled Render inside cairo-xlib-surface.c and my application returned to
decent speed (i.e. doesn't feel like swimming through mud) even on Apple X11. I
guess the X server doesn't matter much now, its top activity now is memcpy.
This configuration has the advantage that one can see clearly what's going on
and it's great that pixman is a mini-fb and has even the same function names.
I produced  a new profile with the main culprits. This is again operating the
UI of my application, which is nothing fancy (some comboboxes, an expander with
some sliders, a textview where short messages are displayed, a menu, a file
chooser, an aboutbox). I hope it is useful for you.
You can see that most of the requests come from GDK (and not from Pango,
sorry), in particular from gdk_window_clear_backing_rect, I suppose this is in
the GDK double-buffering.
Comment 2 Bogdan Nicula 2005-10-24 04:26:04 UTC
Further thoughts on this topic:

By using Cairo without Render, I basically shifted the CPU burden from the X server to my application. 
Profiling I see roughly the same CPU consumption for combined CVS X.org + my app, with and without 
Render. But in actual use, the application feels more responsive without Render, less latency maybe?

I want to rely on Cairo (for my own code as well) & GTK 2.8 and future versions, especially for my 
Windows users where it works pretty well. I want the distribute the compiled binary to my Mac OS X 
users. Since they don't normally have those libraries, I link to them statically on Mac OS, so I have 
control over them. While I could bundle an X server with my application (on Mac OS X the X server is 
just another application), this is fragile. I hope pixman is kept fairly in sync with x.org server fb, is this 
true?

Any optimization in the higher levels of Cairo and GDK/GTK would be welcome of course :-). This 
would benefit all platforms. I could help in testing and profiling.

Comment 3 Bogdan Nicula 2005-10-25 01:32:50 UTC
I think that the problem I see is in part (cairo_fill is not too light either) due to:
https://bugs.freedesktop.org/show_bug.cgi?id=4191

It's unfortunate that GDK 2.8 added such an expensive code path within a very sensitive function 
(gdk_window_clear_backing_rect) which is invoked before expose events.

I opened http://bugzilla.gnome.org/show_bug.cgi?id=319604 against GDK.
Comment 4 Owen Taylor 2005-10-25 04:40:37 UTC
I know of no reason why IN would be used by clear_backing_rect().

In all cases I'm aware of, clear_backing_rect() hits *exactly the same Xlib
code paths* in 2.8 as it did in 2.6.
Comment 5 Bogdan Nicula 2005-10-25 05:37:23 UTC
The code of 'gdk_window_clear_backing_rect' (in GTK+ 2.8.6 lines 1768-1793, in gdk/gdkwindow.c) 
calls Cairo, e.g. cairo_clip, cairo_fill. It's task is simply to clear the area to the window background.

This function is called from gdk_window_begin_paint_region which in turn gets called before 
'expose_event' signals for double-buffered windows.

As you can see in the attached profile document, cairo_clip ultimately calls fbCombineInU if Render is 
disabled in Cairo or invokes the similar code on the X server through the Render extension.

The former implementation was different, it used gdk_gc_set_clip_region and gdk_draw_rectangle.
Comment 6 Owen Taylor 2005-10-25 05:45:38 UTC
The code to handle shapes that are lists of pixel-aligned rectangles
should keep you from hitting that code path; wonder why that isn't working
for you.
Comment 7 Billy Biggs 2005-10-25 05:58:02 UTC
Which exact version of cairo are you using?
Comment 8 Bogdan Nicula 2005-10-25 06:01:28 UTC
(In reply to comment #7)
> Which exact version of cairo are you using?
Both CVS and 1.0.2 act the same. Render 0.8.

Comment 9 Bogdan Nicula 2005-10-25 06:03:19 UTC
(In reply to comment #6)
> The code to handle shapes that are lists of pixel-aligned rectangles
> should keep you from hitting that code path; wonder why that isn't working
> for you.

Endianess problems?

Comment 10 Bogdan Nicula 2005-10-25 07:07:32 UTC
Buglet in cairo-meta-surface.c:
_init_pattern_with_snapshot(), status is used uninitialized.
Comment 11 Bogdan Nicula 2005-10-25 08:25:19 UTC
There's some fragile floating-point code in Cairo that gets mis-compiled with -ffast-math, at least on this 
compiler. Never saw this on PowerPC.

I had CFLAGS='-Os -ffast-math -mcpu=7450 -mdynamic-no-pic' set for my application.
Removing -ffast-math solves the problem.

This compiler is gcc version 4.0.0 (Apple Computer, Inc. build 5026) equivalent ~ GNU GCC 4.0.2

Thanks for your patience with me. I'll dig up the code sequence that fools the compiler.
This is very annoying :-(
Comment 12 Bogdan Nicula 2005-10-26 00:44:25 UTC
The problem occurs in cairo_fixed_from_double(), but I believe there is no mis-compilation. 

I think the function receives a denormalized/weird number (possibly from an uninitialized memory 
address) and floor() silently puts it to 0. In -ffast-math mode the compiler generates the unguarded code 
directly and the output is a huge number in one of the calls.

Still investigating...
Comment 13 Bogdan Nicula 2005-10-26 06:01:30 UTC
Doing this fixes it for me (I don't understand why yet). The interface gets drawn fast and correctly.
Notice it consistently rounds away from 0 for both positive and negative half-way cases. The C99 
"round" function should work as well.
Is this the correct behavior? 

cairo_fixed_t
_cairo_fixed_from_double (double d)
{
    if (d < 0.)
        return (cairo_fixed_t) (d * 65536 - .5);
    else
        return (cairo_fixed_t) (d * 65536 + .5);
}

No clue yet why the original breaks. It may have to do with the fact that "d * 65536 + .5" is generated 
as a FMA (fused multiply-add) instruction which doesn't do an intermediate rounding operation.
Comment 14 Bogdan Nicula 2005-10-26 06:17:49 UTC
In fact round() should be used because it takes care of the predecessor of 0.5 and others. See:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18099
Comment 15 Carl Worth 2005-10-26 11:44:51 UTC
I'm glad to see some more close examination of _cairo_fixed_from_double.
For such an elementary function, it sure needs (and is getting) an awful lot of
attention.

With your current eye on correctness (and your environment in which the current
code fails) could you experiment with the correctness of the following proposed
performance optimization for _cairo_fixed_from_double:

http://thread.gmane.org/gmane.comp.lib.cairo/5228

As for round, we could certainly use that when available. But so far, we haven't
made C99 a hard requirement for compiling cairo. So, like the performance patch
I linked to above, using round would also require appropriate configure-time
checks for suitability. (Though the round check would be a lot easier than the
checks for IEEE double-precision floating-point necessary for the performance
optimization I think).
Comment 16 Bogdan Nicula 2005-10-26 16:03:06 UTC
For the record, it was a bug in the compiler, although I still cannot explain how I got those huge 
numbers... I made a test case and reported it to Apple.

Now for the topic of rounding. I would say first you should define exactly your definition of rounding 
which is appropriate for your problem domain. It should be specified unambiguously.

For example, according to the definition of the round() function, half-way numbers are rounded away 
from zero (this is the intention of the code in comment #13, but doesn't work for all input, see 
comment #14).

An alternate specification which is statistically unbiased: half-way numbers are rounded to the closest 
even integer. This is what CPUs usually implement, what rint() does when the rounding mode is 
FE_TONEAREST (this function is inappropriate for a library, because the application can change the 
rounding mode) and what Mathematica does (see http://mathworld.wolfram.com/
NearestIntegerFunction.html).

I read before the conversation you linked to :-). I had for long time a variant of that code, see below, 
which I think is more resilient to breaking when the compiler becomes more and more aggressive in its 
optimizations. Also, for Pentium4, I think it's not that faster anymore, you may need to run some tests 
with -march=pentium4 flag for gcc.

Unfortunately those kind of definitions don't work for certain values. And it's worrying that they fail 
around +- 0.5 which I think are interesting values for Cairo. You can test with the program below (you 
may need to compile with -std=c99 for gcc).

predecessor of 0.5 = + (.5 - 0x1p-54)
successor of -0.5   = - (.5 - 0x1p-54)

The code in comment #13 fails for both, current Cairo for the predecessor of 0.5 and the other trick for 
the successor of -0.5.

The round() function, while technically required by C99, it was implemented for long time in Linux, 
BSDs, other Unices and even Windows. I think it is much better tested and has a crystal clear definition.
It may happen that gcc in the future will replace it with inline code (hopefully not buggy like it 
happened to me).

Cairo is already critical infrastructure, please make it fail-proof and idiot-proof (that's me :-))
With my unfortunate experience that started this thread, I would be cautious.

And, btw, thank you Carl et al for this brilliant piece of software.

#include <math.h>
#include <stdio.h>

/* big endian
#define iman 1
*/

/* little endian */
#define iman 0

int round1(const double val)
{
static union { double d; int i[2]; } u;                                                                                                       
    u.d = val + 68719476736. * 1.5;                                                                                                           
    return (u.i[iman]) >> 16;
}

int round2(const double val)
{
	if (val < 0.)
		return (int) (val - .5);
	else
		return (int) (val + .5);
}

int round3(const double val)
{
	return (int) floor(val + .5);
}

int main(void)
{
     double p = + (.5 - 0x1p-54);
     double m = - (.5 - 0x1p-54);

     printf("%d %d\n", round1(p), (int) round(p));
     printf("%d %d\n", round1(m), (int) round(m));

     return 0;
}
Comment 17 Bogdan Nicula 2005-10-26 16:47:49 UTC
Actually, apparently it's not that long ago that round() started to show up :-(
Comment 18 Carl Worth 2005-10-26 17:11:10 UTC
(In reply to comment #16)
> For the record, it was a bug in the compiler,

Ah, at least one mystery solved then.

> Now for the topic of rounding. I would say first you should define exactly
your definition of rounding 
> which is appropriate for your problem domain. It should be specified
unambiguously.

Yes. That is good, sound advice. We've got a large pass of "robustification" to
be done with the upcoming rewrite of the tessellator. This rounding will also
be an important part of that.

[Snipped many good insights and comments which will be useful---thanks!]

> Cairo is already critical infrastructure, please make it fail-proof and
idiot-proof (that's me :-))

Hey, I'm one of the biggest idiots involved with this project. (And I think
I can back that claim up with bugzilla comments.)

> With my unfortunate experience that started this thread, I would be cautious.

Caution, at least, is something I can be good at. You may notice that long
before the optimization patch I linked to above, we've had a patch proposed
for making this function a macro. I've been waiting for better proof that
we know we have implemented what we want before making a macro out of it.

> And, btw, thank you Carl et al for this brilliant piece of software.

For what it's worth, you're quite welcome. I'll pass the "brilliant" portion
of your thanks on to Keith. It's all his ideas originally.

Thank you for your contributions, (and I'm glad you've got cairo and GTK+
running at more reasonable speeds now).

-Carl
Comment 19 Bogdan Nicula 2005-10-27 16:43:28 UTC
I took a closer look at what Bill Spitzak actually proposed, i.e. it does rounding of 65536 * val, and I 
think the code is correct. The one I put in comment #16 does actually chopping like on http://
www.stereopsis.com/FPU.html. Now I _really_ feel like an idiot.

The rounding is done in the sense of second definition in comment #16, i.e. half-way numbers are 
rounded to the closest even.

It's easy to overflow this operation, so some kind of input checking is useful. I reported I was getting 
huge numbers as input to _cairo_fixed_from_double(), it was because of the code mentioned in https://
bugs.freedesktop.org/show_bug.cgi?id=2675, comment #9:
     y_intersect = _cairo_fixed_from_double ((b2 - b1) / (m1 - m2));

I think is better to do the operation through a union, but check with a C language lawyer.
So the code would be like that:

#if __BIG_ENDIAN_
#define iman_ 1
#else
#define iman_ 0
#endif

cairo_fixed_t
_cairo_fixed_from_double (double d)
{
    union { double d; int32_t i[2]; } u;

    u.d = d + 68719476736. * 1.5;
    return u.i[iman_];
}

Sorry for any confusion I caused.
Comment 20 Bogdan Nicula 2006-08-14 11:53:46 UTC
Since it looks there is some interest around cairo_fixed_from_double() let me add some random bits:

- there is an update (http://www.stereopsis.com/sree/fpu2006.html) of that oft-cited stereopsis page

- IMHO the "Banker's Rounding" (half-way to even) method is not appropriate for graphics, I suggest the 
round() method (half-way away from zero)

- in my benchmarks, on a desktop CPU (PowerPC G4), of drawing SVGs consiting only of text at different 
scales (via librsvg) more time is spent in the 64 bit division in compute_x() of cairo_traps.c (called by 
cairo_traps_tessellate_polygon()). Of course, the x86 assembler optimization of David Turner will not help 
the other CPUs. I also see the qsort() of cairo_traps_tessellate_polygon() taking more cycles overall than 
cairo_fixed_from_double(). A better algorithm for cairo_traps_tessellate_polygon() would contribute much 
more than micro-optimizations, I guess. Again I'm speaking from a desktop user point of view (CPU with 
hardware FPU).

Comment 21 Chris Wilson 2008-10-10 14:15:14 UTC
An optimized _cairo_fixed_from_double() was implemented based on the stereopsis example (with the caveat that banker's rounding is still used), Carl completely rewrote the tessellator based on John Hobby's thesis (with the caveat that there is still room for performance tweaks, but the algorithm itself is state-of-the-art) and its difficult to beat the library qsort (and we can't currently avoid it) - and the compute_x() in cairo-traps.c is no more.

Please keep profiling and reporting hotspots - though sadly, it seems all the low-hanging fruit in cairo has been picked. Though there is interesting work afoot in areas like polygon rasterisation and utilising different hardware architectures (a high-level OpenGL interface, as opposed to the low-level implementation of XRender over OpenGL). But in order to demonstrate that any of this is worthwhile we need real-world profiles/benchmarks...

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.