Cairo 1.10 started to go through a very slow path when it is asked to paint a CAIRO_FORMAT_RGB24 surface onto the xlib surface provided by GDK/GTK. When the device is having a target depth of 16bit per pixel cairo will go through the slow _draw_image_surface routine.
Created attachment 40253 [details]
Test case for the issue.
This shows the issue when GDK picks a visual with 16bpp.
Created attachment 40254 [details] [review]
Proposal to avoid going through the slowest possible path
Introduce a specialized method and avoid going through the slow path by creating a Xlib surface with a different depth... This means that the Xserver wil still need to convert the image but using pixman this is a lot faster than the _draw_image_surface routine in the xlib backend.
Is it possible to update a testcase to provide some performance numbers? Or probably create some relevant cairo-perf trace? So that alternative patches solving the problems could be easily compared to each other.
Hmm, it is also trivially possible to extend the PutImage routine to do the conversion using pixman where possible which would be the preferred method.
(In reply to comment #4)
> Hmm, it is also trivially possible to extend the PutImage routine to do the
> conversion using pixman where possible which would be the preferred method.
Right, I also tried to suggest the same here (but did not have time to implement, test and benchmark it myself): http://lists.cairographics.org/archives/cairo/2010-November/021057.html
That slow code in cairo also wastes a lot of time on doing dithering, but considering that 16bpp color depth is typically used on very resource constrained devices, not doing that should be reasonable. Also dithering is very important for the formats where very few bits are used for color components (like 1, 2 or 4), but for RGB565 format, banding is less noticeable even though it of course exists.
I'm worried that the proposed patch is just a band-aid and is only going to improve performance from abysmal to just poor on that hardware. But having detailed profiling reports may reveal some of the things which can be improved (PXA270 supports IWMMXT SIMD instructions, so the most critical hotspots in pixman can be improved a lot). For example, as I mentioned before, a properly optimized x8r8g8b8->r5g6b5 conversion is faster than a simple x8r8g8b8->x8r8g8b8 copy on all the hardware that I have seen (just because it needs less memory bandwidth). But this may be not the case with just generic C code, so somebody may be tempted to prefer whatever works faster with the current implementation and ignore the possibilities to get even more performance with a little bit more work. Anyway, this all can be benchmarked to get a better understanding about what's going on there.
(In reply to comment #3)
> Is it possible to update a testcase to provide some performance numbers? Or
> probably create some relevant cairo-perf trace? So that alternative patches
> solving the problems could be easily compared to each other.
Well, the easiest is to paint in a loop and measure the time to draw the image thousands of time. But as we will move work from Cairo to the Xserver this kind of benchmark will be a bit single sided. With cairo 1.10 _draw_image_surface shows up at the top of the profiler (opreport -l), with cairo 1.8 cairo and pixman are pretty low in the profile. I can create that one sided benchmark, I can also include oparchives of cairo 1.10 with and without the patch if that is of any interest.
I agree that this is a band aid, but it is also a huge performance regression compared to cairo 1.8 for 16bit depth devices.
(In reply to comment #6)
> Well, the easiest is to paint in a loop and measure the time to draw the image
> thousands of time.
Yes, whatever provides some objective performance metric (total time, frames per second, ...) should be good enough for this purpose.
> But as we will move work from Cairo to the Xserver this kind of benchmark will
> be a bit single sided. With cairo 1.10 _draw_image_surface shows up at the top
> of the profiler (opreport -l), with cairo 1.8 cairo and pixman are pretty low
> in the profile.
There are actually three components interacting here: the application (cairo), linux kernel (copy_to_user/copy_from_user functions resulting from the use of XPutImage), and X server (pixman), so all of them need to be watched when profiling. And oprofile is best used when configured as 'opcontrol --separate=kernel'. This way it shows performance statistics split per process, and also shows which processes cause extra cpu usage in the kernel. Large logs are not needed, I think just a few top functions (including kernel ones) from both your test application and X server processes should be enough.
My guess is that your patch is going to be heavier on copy_to_user/copy_from_user work done by the kernel (due to transferring 32bpp data instead of 16bpp), that's why I don't quite like this solution.
Created attachment 40268 [details]
Modified test application to print elapsed time
Updated test to call cairo_paint in a for loop and count the time with GTimer.
Here some results from the device in idle state.
cairo 1.10 + patch
138.086 secs elapsed for 10000 frames
138.191 secs elapsed for 10000 frames
138.39 secs elapsed for 10000 frames
cairo 1.10 stock
298.493 secs elapsed for 10000 frames
294.55 secs elapsed for 10000 frames
296.442 secs elapsed for 10000 frames
Created attachment 40286 [details] [review]
a test patch which uses pixman for conversion
A quick and dirty patch to use pixman for conversion. This is still not a very good solution, because a temporary image has to be created. It would be better if xlib/xcb could do optimized x8r8g8b8->r5g6b5 conversion on-the-fly when sending pixel data over the wire without using (large) intermediate buffers. But the best solution is of course to use CAIRO_FORMAT_RGB16_565 by the applications themselves.
Benchmark from ARM Cortex-A9
cairo-test-16bpp - a variant which uses CAIRO_FORMAT_RGB16_565
== original ==
54.2176 secs elapsed for 10000 frames
2.48166 secs elapsed for 10000 frames
== Holger's patch ==
9.53522 secs elapsed for 10000 frames
2.38052 secs elapsed for 10000 frames
== pixman conversion patch ==
5.37653 secs elapsed for 10000 frames
2.39496 secs elapsed for 10000 frames
Could you please run some more tests with these patch variants on your device? To me it looks like doing x8r8g8b8->r5g6b5 conversion using pixman inside of cairo works better even though it has to create a temporary image.
Just to make it clear. I simply would like to make sure that whatever solution we select, is going to be fast on all types of ARM devices. And if pixman conversion does not work fast enough on PXA270 (proven by benchmarks), then something could be tried for it too.
This is quite a bad performance issue for 16bpp desktop color depth, so it would be nice to get it fixed in cairo git master really soon. Thanks a lot for finding it. Please keep reporting bugs if you find anything else.
Siarhei, _cairo_xlib_surface_draw_image() now behaves in a very similar fashion to what you outlined. I think this is as good as cairo itself can perform and I doubt that xlib/xcb will be extended to perform the transformation inline, so I consider this bug to be closed. The residual issue of continuing to improve pixman I leave in your capable hands ;-)
Holger, thank you for the bug report.