Bug 20709

Summary: Use SSE2/MMX for fbFetch_x8r8g8b8()
Product: pixman Reporter: Steve Snyder <fdt0317.20.swsnyder>
Component: pixmanAssignee: Søren Sandmann Pedersen <soren.sandmann>
Status: RESOLVED FIXED QA Contact: Søren Sandmann Pedersen <soren.sandmann>
Severity: normal    
Priority: medium    
Version: other   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: i915 features:
Attachments: Use SSE2/MMX to copy ARGB pixels, setting alpha to 0xFF

Description Steve Snyder 2009-03-17 07:04:07 UTC
Created attachment 23964 [details] [review]
Use SSE2/MMX to copy ARGB pixels, setting alpha to 0xFF

While profiling SeaMonkey v2.0a3 I saw that an inordinate amount of time seemed to be spent in fbFetch_a8r8g8b8().  This routine simply copies ARGB pixels, one at a time, setting alpha to 0xFF.  

This is the sort of grunt work that SIMD instructions are supposed to address!

The attached patch uses SSE2 instructions to copy 8 pixels per loop, falling back to MMX (also 8 pixels/loop) if SSE2 is not supported or if 16-byte alignment of the buffers cannot be achieved.

If neither SSE2 nor MMX are supported, or if the minimal 8-byte buffer alignment can't be achieved, the original code (single-pixel) is used.

In my test runs (the Peacekeeper browser benchmark) I found that all even rows were capable of being 16-byte aligned.  Due to stride length, the odd rows could be 8-byte aligned, but not 16-byte aligned.  Thus the test copied half the pixels using SSE2 instructions and half using MMX instructions.  This SSE2/MMX mix reduced execution time by ~20% relative to the original single-pixel-per-loop code.

Notes to follow.
Comment 1 Steve Snyder 2009-03-17 07:20:28 UTC
A few random notes:

1. I am aware that the SSE2/SSE/MMX code should be in file pixman-[simd].c, but that requires architectural decisions that I didn't want to make.  The patch is a drop-in enhancement that doesn't require runtime analysis of CPU capabilities.

2. MSVC8 generates really bad code for MMX (a lot of superfluous copies between registers).  Using x86 ASM instead of compiler intrinsics would yield an improvement.  But then you'd have the headache of AT&T/Microsoft syntax to deal with.

3. Overall performance would greatly benefit from *guaranteeing* 16-byte alignment of both source and destination buffers.

4. The other fbFetch_xxxxxx{} routines in pixman-access.c would likely benefit from SIMD handling as well, but I didn't see much use of them in my environment.
Comment 2 Søren Sandmann Pedersen 2009-03-24 12:45:07 UTC
A couple of comments

- You may want to find out *why* you are seeing so much time spent in fbFetch_a8r8g8b8(). If this activity only comes from a few operations, maybe a better approach is to write a special-case fast path for those? 

- If this activity is actually coming from transformed compositing, then maybe the patch that Andre Tupinamba recently posted to cairo-list could help.

- Any use of MMX and SSE should go into the appropriate files. It is important that people are able to build the library such that any special CPU features are runtime detected. 
Comment 3 Steve Snyder 2009-04-10 12:44:54 UTC
Edited to correct function name.  

It is actually fbFetch_x8r8g8b8() that the patch addresses, not fbFetch_a8r8g8b8().
Comment 4 Søren Sandmann Pedersen 2009-10-27 04:36:38 UTC
To move forward here, we need a way for pixman_implementations to plug in fetchers to the general code. Ie., implementations will have to be involved when images are created and be given the opportunity to set the fetch_scanline variations to something CPU specific.

Not only will that allow this and the a8 SSE2 fetchers, it will also make it possible to have CPU specific gradients and bilinear filters.
Comment 5 Søren Sandmann Pedersen 2011-02-03 00:25:35 UTC
Fixed in master.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.