Bug 1067

Summary: improve fbmmx
Product: xorg Reporter: Nicholas Miell <nmiell>
Component: Server/GeneralAssignee: Xorg Project Team <xorg-team>
Status: RESOLVED FIXED QA Contact:
Severity: normal    
Priority: high CC: ajax, mharris, soren.sandmann
Version: git   
Hardware: x86 (IA32)   
OS: All   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description Flags
patch
none
new patch none

Description Nicholas Miell 2004-08-13 00:07:13 UTC
Attached is a patch that does the following to the fbmmx implementation:

1. It replaces gcc-specific __builtin_ia32_* functions with the standard MMX and
SSE intrinsic types and functions from mmintrin.h and xmmintrin.h. This makes
the code easier to understand and allows for the possible use of MMX other
x86/x86-64 compilers that don't support gcc's builtin functions.

2. It replaces all inline assembly (except the cpuid use in fbHaveMMX) with
intrinsics. This is easier to understand, compatible with compilers other than
gcc, and (in some cases) results in better code generation. (In other cases,
specifically the use of _mm_cvtsi64_si32 and _mm_cvtsi32_si64, extra
instructions are generated. I expect this to improve with future gcc releases.)

3. Support for fbmmx on AMD64 systems is added. fbHaveMMX() is defined to TRUE
on AMD64 systems. SSE support (in the form of the MMX pshufw instruction) is
also unconditionally enabled on AMD64. SSE remains disabled on i386.

4. The USE_GCC34_MMX macro is renamed to the more general USE_MMX, under the
assumption compilers other than gcc may want to use MMX. A USE_SSE macro is
introduced, which unconditionally enables (i.e. no cpuid test) the above
mentiond pshufw instruction.

5. Access to constants in the MMX data structure are now done using the MC()
macro, which hides away those messy casts. All references to the zero constant
are replaced with _mm_setzero_si64(), which generates "pxor %mm, %mm" and is
recommended instead of constant loads in the AMD optimization manual.

6. The shift function is replaced by open-coded instances of _mm_srli_pi16 and
_mm_slli_pi16.

7. Some comments were added, updated, or reformatted for 80 columns.

This patch has been tested on AMD64 against mharris's xorg-x11-6.7.99.1
packages. rendercheck passes without error and the benchmark from bug #839
reports a 1.2 second overall improvement compared to 6.7.0.
Comment 1 Nicholas Miell 2004-08-13 00:08:12 UTC
Created attachment 624 [details] [review]
patch
Comment 2 Søren Sandmann Pedersen 2004-08-13 04:57:23 UTC
Just skimming over the patch it looks very good to me. Two comments though:

- GCC 3.3 is unusably buggy wrt. to intrinsics. GCC 3.4 is buggy, but you
  can workaround. We need to make sure the code continues to work 
  with GCC 3.4.

- This is not material for the upcoming release, but we should looks at it
  for the next one.
Comment 3 Nicholas Miell 2004-08-13 11:46:33 UTC
(In reply to comment #2)
> - GCC 3.3 is unusably buggy wrt. to intrinsics. GCC 3.4 is buggy, but you
>   can workaround. We need to make sure the code continues to work 
>   with GCC 3.4.

It's inclusion is still keyed off of the use of GCC 3.4, and will remain that
way until somebody adds support for other compilers. I only built and tested it
with gcc 3.4 (and am using it right now, without any problem).
Comment 4 Søren Sandmann Pedersen 2004-12-23 08:06:58 UTC
> 6. The shift function is replaced by open-coded instances of _mm_srli_pi16 and
> _mm_slli_pi16.

I'd really like to keep the shift function. To me,

   shift (blah, -48)

is much clearer than _mm_slli_si64 (blah, 48), and since the function is inlined
and optimized, gcc doesn't generate any worse code for it when the argument is a
compile time constant.

Other than that, the patch looks fine to me.
Comment 5 Nicholas Miell 2004-12-23 17:38:45 UTC
(In reply to comment #4)
> I'd really like to keep the shift function. To me,
> 
>    shift (blah, -48)
> 
> is much clearer than _mm_slli_si64 (blah, 48), and since the function is
> inlined and optimized, gcc doesn't generate any worse code for it when the
> argument is a compile time constant.

Eh, I found "shift" with no indication of direction other than a sign confusing,
but I'm not particularly attached to it.

I'll change it and re-diff soon.
Comment 6 Nicholas Miell 2004-12-23 18:54:06 UTC
Created attachment 1591 [details] [review]
new patch

This also fixes the non-SSE expand_alpha, which I messed up the first time
around.
Comment 7 Søren Sandmann Pedersen 2005-01-03 13:34:49 UTC
Committed, except that I put #include <xmmintrin.h> inside an #ifdef USE_SSE,
because xmmintrin.h contains an #error that triggers if -msse is not used

Mon Jan  3 12:45:10 2005  Søren Sandmann  <sandmann@redhat.com>

        Clean-ups and support for AMD64. Bug 1067.
        Patch by Nicholas Miell (nmiell@comcast.net)

        * programs/Xserver/fb/Imakefile: Add support for AMD64

        * programs/Xserver/fb/{fbmmx.[ch],fbpict.c}: Many cleanups
        using <mmintrin.h> instead of __builin_ia32_*, and intrinsics
        instead of inline assembly. Also unconditionally use pshufw on
        AMD64.

        * programs/Xserver/fb/*: s/USE_GCC34_MMX/USE_MMX/g

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.