Bug 4263

Summary: Gradient on 'rounded rectangle' MUCH slower than normal rectangle.
Product: cairo Reporter: Richard Stellingwerff <remenic>
Component: xlib backendAssignee: Carl Worth <cworth>
Status: RESOLVED FIXED QA Contact: cairo-bugs mailing list <cairo-bugs>
Severity: normal    
Priority: high    
Version: 0.9.3   
Hardware: x86 (IA32)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments: GTK+ 2.8 based benchmark
Non-interactive test
Cairo-only test case
Pixman patch adding MMX special case
Use 8-pixel strips for vertical gradients
Real cairo-only test case
pixman patch checked to compile

Description Richard Stellingwerff 2005-08-26 17:36:15 UTC
I have noticed a HUGE (~500%) performance gap between drawing a simple rectangle
vs. a rounded rectangle.

I created a convinience function to create a rounded rectangle by using four
arcs. However, drawing a linear gradient on such a rounded rectangle is roughly
7 times slower than drawing a normal rectangle, even when the radius of the
corners is only 3.

Since I use this in Clearlooks a lot (for buttons and such) I'm hoping that the
performance of this can be improved. I wrote a simple GTK+ application that lets
you benchmark both rectangle fills.
Comment 1 Richard Stellingwerff 2005-08-26 17:38:06 UTC
Created attachment 3059 [details]
GTK+ 2.8 based benchmark

Compile with gcc -o rounded-rectangle rounded-rectangle.c `pkg-config --cflags
--libs gtk+-2.0`

My results:
Normal rectangle:
Running benchmark...
1203.64 msec
Running benchmark...
1189.68 msec
Running benchmark...
1202.82 msec
Running benchmark...
1200.93 msec
Running benchmark...
1196.54 msec

Rounded rectangle:
Running benchmark...
9481 msec
Running benchmark...
9697.97 msec
Running benchmark...
9625.76 msec
Running benchmark...
9621.66 msec
Comment 2 Owen Taylor 2005-08-27 11:30:25 UTC
Created attachment 3063 [details]
Non-interactive test

Here's a version of your benchmark with simplified invocation.
(Also uses does a gdk_flush()==XSync() after drawing to get a 
better idea of when the X server is actually done.)
Comment 3 Richard Stellingwerff 2005-08-27 12:06:09 UTC
(In reply to comment #2)
> Here's a version of your benchmark with simplified invocation.
> (Also uses does a gdk_flush()==XSync() after drawing to get a 
> better idea of when the X server is actually done.)

Ah, very nice I'll probably use this to base other benchmarks on as well. The
results seem pretty acurate.

Running the 'rounded' version gives me ~20500 ms, and the squared version ~2000
ms (running cpu at 930MHz).
Comment 4 Owen Taylor 2005-08-27 13:27:56 UTC
Profiling indicates that the big bottleneck here is fbCompositeGeneral -
there is no special for the case of 

 (argb32 IN a8) OVER rgb24

As plain-vanilla as that case may sound. Even with the scan-line based
optimization in recent Xserver, fbCompositeGeneral is taking about 86%
of CPU time in my testing.

It may be even slower with older X servers, because I think this test
case will trigger a software fallback to work around a X server bug
- we have a repeating source pattern (Cairo optimizes a horizontal or
vertical gradient into a 1xN or Nx1 repeating source pattern.)
That could conceivably be fixed by some extra code to expand the 
repeating pattern server side into a temporary source pixmap the
size of the destination.

In some sense the fact that rounded rectangles are slower is not suprising -
it's a considerably more complicated operation. But I'm a little suprised
by the ratio.
Comment 5 Owen Taylor 2005-08-27 15:53:52 UTC
OK, I understand the ratio now - what libpixman and the (development)
X servers have is a scanline based fallback to fbCompositeGeneral - 
they fill a scan line, then composite it. But the compositing we do
for vertical gradients completely falls over for this we are compositing
a repeating 1xN pattern, and the compositing code breaks this up into
M 1xN non-repeating composites. So the "scanlines" we deal with are
1 pixel wide. So we get an enormous amount of function call overhead.

I investigated two changes. The first one was to make cairo-pattern.c
use a wider strip for compositing ...  with this, the timings for the
rounded rectangle were:

  width = 1: 2425ms
  width = 4: 795ms
  width = 16: 460ms
  width = 32: 465ms

The square rectangle also got faster, though not as dramatically -
I didn't take a full set of measurements but:

 width = 1: 280ms
 width = 16: 130ms

The change I did was writing an MMX-optimized spaciel case for 
 
  (argb32 IN a8) OVER rgb24
 
Since the entire 1xN strip is composited in a single function call,
there is much less overhead from using narrow strips:

 width = 1: 450ms
 width = 4: 340ms
 width = 8: 340ms
 width = 16: 370ms
 width = 32: 440ms

(I also tested width=6, and as you can guess, it's just slightly less
than 4 or 8 - 338ms)

So, by the combination of the two techniques, we were able to go from 
2425ms to 340ms for the rounded rectangle. But various caveats here 
must be noted:

 1. The optimal strip width depends very much on the width of the 
    area being rendered. We can get away in this test case with
    computing a big strip of gradient, because we are rendering over
    400 pixels of width. If we were rendering only 16 pixels or 
    8 pixels of total width then computing a bigger strip would be
    prohibitive.

 2. The optimal strip width also depends on the speed of gradient
    computation. Currently, it's pretty bad, but if we improve it
    then we can get away with bigger strips.

 3. When rendering to a XFree86-6.8.2 or older server, different
    considerations come into play: as mentioned above, a client-side
    fallback will be triggered - we actually fetch the destination
    image from the server, composite onto it and put it back.
    Now, as also mentioned it would be possible to avoid this fallback
    by expanding the source image out into a temporary pixmap, but
    that may not be a worthwhile optimization. Why not? Because after
    we expand out the actually compositing will be handled by 
    fbCompositeGeneral. Not the fbCompositeGeneral tested above which
    is slow on short scanlines, but reasonably good on longer scanlines,
    but an older fbCompositeGeneral which is bad on all scanline lengths.

My recommendations are:

 1. Add the MMX special case. It is a significant win for skinny regions,
    and it's "core" special case, not something unusual.

 2. Use a strip width of MIN(original_width, 8).

I'll attach a cairo-only benchmark, and patches for both 1. and 2.
The patch for 1. is a bit cluttered up, since I also fixed a different
MMX special case which had been #ifdef'ed out.

P.S. - There is one more optimization that could be made - for the
 fbCompositeGeneral case, when compositing an untransformed repeating
 pattern, use a scanline buffer of full width. There will still be
 excess function call overhead for fetching the source, but much less 
 function call overhead for fetching the mask, fetching the destination,
 compositing, and storing. It's probably worthwhile, but I'm not going
 to try and come up with a patch here.
Comment 6 Owen Taylor 2005-08-27 16:08:33 UTC
Created attachment 3064 [details]
Cairo-only test case
Comment 7 Owen Taylor 2005-08-27 16:09:09 UTC
Created attachment 3065 [details] [review]
Pixman patch adding MMX special case
Comment 8 Owen Taylor 2005-08-27 16:09:44 UTC
Created attachment 3066 [details] [review]
Use 8-pixel strips for vertical gradients
Comment 9 Owen Taylor 2005-08-27 16:35:22 UTC
Two more thoughts occurred to me:

 1. Is 340ms "fast enough"? (the timing was on a 1.7Ghz Pentium M)

    340ms for a 300x400 image, 100 iterations == 35million pixels/sec

    For a 1280x1024 screen at 30fps, that's an overdraw depth of 1.
    Not bad - how many layers of gradient-filled non-rectangular
    paths covering the entire screen do you want? But not wonderful either -
    you can imagine wanting one layer of such plus extras (text, say)
    pretty easily. I think we have some remaining room for improvement from
    the gradient code.

 2. Another approach for the X server case is to take advantage of the
    gradient objects that are in the devel X server, instead of generating
    an image, pushing that to the X server and compositing that. It's hard
    to predict the exact performance but it would be interesting to
    experiment with.
Comment 10 Richard Stellingwerff 2005-08-27 16:41:04 UTC
A few remarks.

The cairo-only benchmark is the one you attached earlier (depends on GTK+).

After applying the two other patches, I get:
fbpict.c: In function '_cairo_pixman_composite':
fbpict.c:1616: error: '_cairo_pixman_composite_src_x888x8x8888mmx' undeclared
(first use in this function)
fbpict.c:1616: error: (Each undeclared identifier is reported only once
fbpict.c:1616: error: for each function it appears in.)
Comment 11 Owen Taylor 2005-08-27 17:26:09 UTC
Created attachment 3067 [details]
Real cairo-only test case
Comment 12 Owen Taylor 2005-08-27 17:31:59 UTC
Created attachment 3068 [details] [review]
pixman patch checked to compile

Forgot to check that some cleanups I did at the end compiled.
Comment 13 Richard Stellingwerff 2005-08-28 06:31:08 UTC
After applying the patches, I noticed that the cair-only test case got a LOT
faster. Before applying that patches, it took ~6700ms to finish the rounded
test. After applying the patches, it took ~1200ms. A welcome improvement.

However, the GTK+2 based benchmark shows no improvement at all, which seems
wierd to me. Maybe I did something wrong?

I've compiled cairo with both GCC 3.3.5 and GCC 4.0.1-pre20050616. Only when
compiled with GCC 4 did it show a huge improvement in the cairo-only test, which
probably makes sense since ./configure tells me that GCC 3.3.5 doesn't support
MMX/SSE intrinsics.

My xorg version is X Protocol Version 11, Revision 0, Release 6.8.99.15.
Comment 14 Owen Taylor 2005-08-28 07:50:59 UTC
I'm surpised that the GTK+ test shows no improvement at all ... I don't
know when the "6.8.99.15" snapshot is from, but if it is from the last
few months, it should have gotten some benefit from the cairo-pattern.c
change. The libpixman change obviously will make no differences, since
that only affects client-side rendering. (Equivalent changes can be
made to the server code - the code in the X server is very closely related.)

Testing two X servers:

 Server 1: Xfake from the xserver tree, from a few days ago.
 
  Without cairo-pattern.c change: 2500ms
  With cairo-pattern.c change: 705ms

 Server 2: "6.8.99.7" - CVS XFree86 from 6-9 months ago

  Without cairo-pattern.c change: 3415ms
  With cairo-pattern.c change: 1440ms

I'd expect this snapshot to behave somewhat similarly to 6.8.2 on the
test.
Comment 15 Richard Stellingwerff 2005-08-28 12:49:56 UTC
Okay, I just ran the gtk+ test again. Apparently I did something wrong the first
try, because I am seeing a difference now.

Two runs without the cairo-pattern patch:
17425.2 msec
17534.2 msec

After applying the cairo-pattern patch:
7916.47 msec
7904.03 msec

My humble apologies for reporting false information earlier :(
Comment 16 Carl Worth 2005-08-29 11:36:37 UTC
I don't have my own numbers from playing with the patches yet, but I'd like
to throw out one question in the meantime.

The recent comments show some good performance improvements for the rounded
rectangle case (thanks!) but have we really done that much to improve the
original bug that the rounded case is several times slower than the squared-off
corners?

Looking at the benchmark here, 99.97% of the pixels are the same in both cases
so we really should be able to do much better here. This looks like a good
test case for some of the ideas I have on improving what trapezoids get handed
to the compositor.
Comment 17 Owen Taylor 2005-08-30 10:42:34 UTC
Comitted the cairo-pattern.c change.

2005-08-27  Owen Taylor  <otaylor@redhat.com>

        * src/cairo-pattern.c (_cairo_pattern_acquire_surface_for_gradient):
        Use a 8xN rather than a 1xN strip for a vertical gradient. This
        is much more tolerant of slow compositing code, and is worth some
        extra expense computing the gradient. (#4263, found in test case
        from Richard Stellingwerff)
Comment 18 Chris Wilson 2008-10-08 05:36:46 UTC
Moved to todo.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.