This patch implements a framework for platform-specific optimized versions of routines and ARM assembler optimized version of the pixman_fill() routine.
Created attachment 12825 [details] [review]
ARM optimized fill routines
This code itself looks fine (although I haven't written ARM code for a long time), but the build system integration should probably be done similarly to the MMX special-casing. Ie.:
- Add new pixman-arm.[ch] files with the ARM specific code named as pixman_fill_arm()
- in the existing pixman_fill() in pixman-utils.c, add another case like this:
Alternatively, the MMX code could be changed to be done the way you have done the ARM code.
(In reply to comment #2)
> This code itself looks fine (although I haven't written ARM code for a long
> time), but the build system integration should probably be done similarly to
> the MMX special-casing. Ie.:
> - Add new pixman-arm.[ch] files with the ARM specific code named as
> - in the existing pixman_fill() in pixman-utils.c, add another case like this:
> #ifdef USE_ARM
> pixman_fill_arm (...);
> #ifdef USE_MMX
> if (...)
> switch (bpp)
> Alternatively, the MMX code could be changed to be done the way you have done
> the ARM code.
The reason I didn't use the MMX-style integration is twofold; first of all, I don't like having #ifdef-blocks for every architechture that might eventually have optimized code in every function, and secondly to avoid function call overhead.
Changing the MMX code is unfortunately out of the scope of the project I am working on, and is somewhat troublesome because it wants to be able to fall back to the normal codepath if it doesn't like the arguments it's being passed.
Created attachment 18279 [details] [review]
patch against pixman 0.11.8
Fix trivial patch conflict for 0.11.8
The existing C routines should likely be optimized before resorting to arm assembler. That way we can measure any real improvement that comes from this patch.
When I tried the patch only pixman_fill8 and pixman_fill16 were noticeably faster. I assume the reason for this is avoiding byte accesses which could just as easily be implemented in C.
Optimized fill for older ARM cores needs to use 16-byte aligned writes with STM instruction. This way it can utilize burst writes to SDRAM and roughly double the performance. This trick works fine on ARM9E (OMAP1710) and ARM11 (OMAP2420), not sure about other ARM microarchitectures line XScale or anything else. It is used in Xomap.
Naturally this is not possible to implement in C, so a bit of assembly is needed.
Tagging patch so it won't get lost in the future.
The fill operation has been optimized for ARMv6 in http://cgit.freedesktop.org/pixman/commit/?id=3cff56c5b091d2e584503e7887414e224876de37
Nowadays almost everyone is using ARMv7 and the move to 64-bit ARMv8 is underway. ARMv6 is already rather rare and primarily represented by just the Raspberry Pi (which is old and slow). So closing this ticket makes sense.