I have noticed a HUGE (~500%) performance gap between drawing a simple rectangle vs. a rounded rectangle. I created a convinience function to create a rounded rectangle by using four arcs. However, drawing a linear gradient on such a rounded rectangle is roughly 7 times slower than drawing a normal rectangle, even when the radius of the corners is only 3. Since I use this in Clearlooks a lot (for buttons and such) I'm hoping that the performance of this can be improved. I wrote a simple GTK+ application that lets you benchmark both rectangle fills.
Created attachment 3059 [details] GTK+ 2.8 based benchmark Compile with gcc -o rounded-rectangle rounded-rectangle.c `pkg-config --cflags --libs gtk+-2.0` My results: Normal rectangle: Running benchmark... 1203.64 msec Running benchmark... 1189.68 msec Running benchmark... 1202.82 msec Running benchmark... 1200.93 msec Running benchmark... 1196.54 msec Rounded rectangle: Running benchmark... 9481 msec Running benchmark... 9697.97 msec Running benchmark... 9625.76 msec Running benchmark... 9621.66 msec
Created attachment 3063 [details] Non-interactive test Here's a version of your benchmark with simplified invocation. (Also uses does a gdk_flush()==XSync() after drawing to get a better idea of when the X server is actually done.)
(In reply to comment #2) > Here's a version of your benchmark with simplified invocation. > (Also uses does a gdk_flush()==XSync() after drawing to get a > better idea of when the X server is actually done.) Ah, very nice I'll probably use this to base other benchmarks on as well. The results seem pretty acurate. Running the 'rounded' version gives me ~20500 ms, and the squared version ~2000 ms (running cpu at 930MHz).
Profiling indicates that the big bottleneck here is fbCompositeGeneral - there is no special for the case of (argb32 IN a8) OVER rgb24 As plain-vanilla as that case may sound. Even with the scan-line based optimization in recent Xserver, fbCompositeGeneral is taking about 86% of CPU time in my testing. It may be even slower with older X servers, because I think this test case will trigger a software fallback to work around a X server bug - we have a repeating source pattern (Cairo optimizes a horizontal or vertical gradient into a 1xN or Nx1 repeating source pattern.) That could conceivably be fixed by some extra code to expand the repeating pattern server side into a temporary source pixmap the size of the destination. In some sense the fact that rounded rectangles are slower is not suprising - it's a considerably more complicated operation. But I'm a little suprised by the ratio.
OK, I understand the ratio now - what libpixman and the (development) X servers have is a scanline based fallback to fbCompositeGeneral - they fill a scan line, then composite it. But the compositing we do for vertical gradients completely falls over for this we are compositing a repeating 1xN pattern, and the compositing code breaks this up into M 1xN non-repeating composites. So the "scanlines" we deal with are 1 pixel wide. So we get an enormous amount of function call overhead. I investigated two changes. The first one was to make cairo-pattern.c use a wider strip for compositing ... with this, the timings for the rounded rectangle were: width = 1: 2425ms width = 4: 795ms width = 16: 460ms width = 32: 465ms The square rectangle also got faster, though not as dramatically - I didn't take a full set of measurements but: width = 1: 280ms width = 16: 130ms The change I did was writing an MMX-optimized spaciel case for (argb32 IN a8) OVER rgb24 Since the entire 1xN strip is composited in a single function call, there is much less overhead from using narrow strips: width = 1: 450ms width = 4: 340ms width = 8: 340ms width = 16: 370ms width = 32: 440ms (I also tested width=6, and as you can guess, it's just slightly less than 4 or 8 - 338ms) So, by the combination of the two techniques, we were able to go from 2425ms to 340ms for the rounded rectangle. But various caveats here must be noted: 1. The optimal strip width depends very much on the width of the area being rendered. We can get away in this test case with computing a big strip of gradient, because we are rendering over 400 pixels of width. If we were rendering only 16 pixels or 8 pixels of total width then computing a bigger strip would be prohibitive. 2. The optimal strip width also depends on the speed of gradient computation. Currently, it's pretty bad, but if we improve it then we can get away with bigger strips. 3. When rendering to a XFree86-6.8.2 or older server, different considerations come into play: as mentioned above, a client-side fallback will be triggered - we actually fetch the destination image from the server, composite onto it and put it back. Now, as also mentioned it would be possible to avoid this fallback by expanding the source image out into a temporary pixmap, but that may not be a worthwhile optimization. Why not? Because after we expand out the actually compositing will be handled by fbCompositeGeneral. Not the fbCompositeGeneral tested above which is slow on short scanlines, but reasonably good on longer scanlines, but an older fbCompositeGeneral which is bad on all scanline lengths. My recommendations are: 1. Add the MMX special case. It is a significant win for skinny regions, and it's "core" special case, not something unusual. 2. Use a strip width of MIN(original_width, 8). I'll attach a cairo-only benchmark, and patches for both 1. and 2. The patch for 1. is a bit cluttered up, since I also fixed a different MMX special case which had been #ifdef'ed out. P.S. - There is one more optimization that could be made - for the fbCompositeGeneral case, when compositing an untransformed repeating pattern, use a scanline buffer of full width. There will still be excess function call overhead for fetching the source, but much less function call overhead for fetching the mask, fetching the destination, compositing, and storing. It's probably worthwhile, but I'm not going to try and come up with a patch here.
Created attachment 3064 [details] Cairo-only test case
Created attachment 3065 [details] [review] Pixman patch adding MMX special case
Created attachment 3066 [details] [review] Use 8-pixel strips for vertical gradients
Two more thoughts occurred to me: 1. Is 340ms "fast enough"? (the timing was on a 1.7Ghz Pentium M) 340ms for a 300x400 image, 100 iterations == 35million pixels/sec For a 1280x1024 screen at 30fps, that's an overdraw depth of 1. Not bad - how many layers of gradient-filled non-rectangular paths covering the entire screen do you want? But not wonderful either - you can imagine wanting one layer of such plus extras (text, say) pretty easily. I think we have some remaining room for improvement from the gradient code. 2. Another approach for the X server case is to take advantage of the gradient objects that are in the devel X server, instead of generating an image, pushing that to the X server and compositing that. It's hard to predict the exact performance but it would be interesting to experiment with.
A few remarks. The cairo-only benchmark is the one you attached earlier (depends on GTK+). After applying the two other patches, I get: fbpict.c: In function '_cairo_pixman_composite': fbpict.c:1616: error: '_cairo_pixman_composite_src_x888x8x8888mmx' undeclared (first use in this function) fbpict.c:1616: error: (Each undeclared identifier is reported only once fbpict.c:1616: error: for each function it appears in.)
Created attachment 3067 [details] Real cairo-only test case
Created attachment 3068 [details] [review] pixman patch checked to compile Forgot to check that some cleanups I did at the end compiled.
After applying the patches, I noticed that the cair-only test case got a LOT faster. Before applying that patches, it took ~6700ms to finish the rounded test. After applying the patches, it took ~1200ms. A welcome improvement. However, the GTK+2 based benchmark shows no improvement at all, which seems wierd to me. Maybe I did something wrong? I've compiled cairo with both GCC 3.3.5 and GCC 4.0.1-pre20050616. Only when compiled with GCC 4 did it show a huge improvement in the cairo-only test, which probably makes sense since ./configure tells me that GCC 3.3.5 doesn't support MMX/SSE intrinsics. My xorg version is X Protocol Version 11, Revision 0, Release 6.8.99.15.
I'm surpised that the GTK+ test shows no improvement at all ... I don't know when the "6.8.99.15" snapshot is from, but if it is from the last few months, it should have gotten some benefit from the cairo-pattern.c change. The libpixman change obviously will make no differences, since that only affects client-side rendering. (Equivalent changes can be made to the server code - the code in the X server is very closely related.) Testing two X servers: Server 1: Xfake from the xserver tree, from a few days ago. Without cairo-pattern.c change: 2500ms With cairo-pattern.c change: 705ms Server 2: "6.8.99.7" - CVS XFree86 from 6-9 months ago Without cairo-pattern.c change: 3415ms With cairo-pattern.c change: 1440ms I'd expect this snapshot to behave somewhat similarly to 6.8.2 on the test.
Okay, I just ran the gtk+ test again. Apparently I did something wrong the first try, because I am seeing a difference now. Two runs without the cairo-pattern patch: 17425.2 msec 17534.2 msec After applying the cairo-pattern patch: 7916.47 msec 7904.03 msec My humble apologies for reporting false information earlier :(
I don't have my own numbers from playing with the patches yet, but I'd like to throw out one question in the meantime. The recent comments show some good performance improvements for the rounded rectangle case (thanks!) but have we really done that much to improve the original bug that the rounded case is several times slower than the squared-off corners? Looking at the benchmark here, 99.97% of the pixels are the same in both cases so we really should be able to do much better here. This looks like a good test case for some of the ideas I have on improving what trapezoids get handed to the compositor.
Comitted the cairo-pattern.c change. 2005-08-27 Owen Taylor <otaylor@redhat.com> * src/cairo-pattern.c (_cairo_pattern_acquire_surface_for_gradient): Use a 8xN rather than a 1xN strip for a vertical gradient. This is much more tolerant of slow compositing code, and is worth some extra expense computing the gradient. (#4263, found in test case from Richard Stellingwerff)
Moved to todo.
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.