multsum_f64 has only a reference implementation. I will attach a patch with two
optimized versions.
Producing other levels of loop unrolling is easily doable. Is it a good idea to
add these other versions?
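For readers unfamiliar with the function, here is a rough sketch of what a strided reference version and a 4x unrolled version could look like. The function names, the `get_f64` helper, and the byte-stride convention are assumptions made for illustration; this is not the code from the attached patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: read the i-th double from an array whose
 * elements are stride_bytes apart (byte strides are an assumption). */
static double get_f64(const double *base, int stride_bytes, int i)
{
    return *(const double *)((const char *)base + (ptrdiff_t)stride_bytes * i);
}

/* Plain reference loop: *dest = sum over i of src1[i] * src2[i]. */
void multsum_f64_sketch_ref(double *dest, const double *src1, int sstr1,
                            const double *src2, int sstr2, int n)
{
    double sum = 0.0;
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        sum += get_f64(src1, sstr1, i) * get_f64(src2, sstr2, i);
    *dest = sum;
}

/* Unrolled by 4 with independent accumulators, so the additions do not
 * form a single dependency chain. Other unroll factors follow the same
 * pattern. Summation order differs from the reference, so results can
 * differ in the last bits for general floating-point input. */
void multsum_f64_sketch_unroll4(double *dest, const double *src1, int sstr1,
                                const double *src2, int sstr2, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i = 0;
    assert(n >= 0);
    for (; i + 3 < n; i += 4) {
        s0 += get_f64(src1, sstr1, i)     * get_f64(src2, sstr2, i);
        s1 += get_f64(src1, sstr1, i + 1) * get_f64(src2, sstr2, i + 1);
        s2 += get_f64(src1, sstr1, i + 2) * get_f64(src2, sstr2, i + 2);
        s3 += get_f64(src1, sstr1, i + 3) * get_f64(src2, sstr2, i + 3);
    }
    /* Handle the 0 to 3 leftover elements. */
    for (; i < n; i++)
        s0 += get_f64(src1, sstr1, i) * get_f64(src2, sstr2, i);
    *dest = (s0 + s1) + (s2 + s3);
}
```

For contiguous arrays both functions would be called with `sizeof(double)` for each stride.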
Created attachment 5660
multsum_f64_unroll8 and multsum_f64_sse2_unroll4
Created attachment 5770
This is a new version of the patch against latest anoncvs. It does two things.
First, it fixes the previously broken SSE2 implementation. This version
actually speeds things up notably. Witness:
Second, it introduces an unstrided version of multsum for f32 and f64 with SSE2
optimized versions. Results:
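To illustrate the unstrided variant described above (this is not the benchmark output), here is a minimal sketch of an f64 dot product over contiguous arrays using SSE2 intrinsics. The function name is hypothetical and the code is not taken from the patch. It uses unaligned loads so it makes no assumption about buffer alignment, and it falls back to a scalar loop when SSE2 is not available at compile time.

```c
#include <assert.h>
#ifdef __SSE2__
#include <emmintrin.h>
#endif

/* Hypothetical unstrided multsum: *dest = sum of src1[i] * src2[i]
 * for contiguous arrays of n doubles. */
void multsum_f64_unstrided_sketch(double *dest, const double *src1,
                                  const double *src2, int n)
{
    double sum = 0.0;
    int i = 0;
    assert(n >= 0);
#ifdef __SSE2__
    {
        /* Two doubles per 128-bit register; acc holds two partial sums. */
        __m128d acc = _mm_setzero_pd();
        double lanes[2];
        for (; i + 1 < n; i += 2) {
            __m128d a = _mm_loadu_pd(src1 + i);  /* unaligned load */
            __m128d b = _mm_loadu_pd(src2 + i);
            acc = _mm_add_pd(acc, _mm_mul_pd(a, b));
        }
        /* Combine the two partial sums. */
        _mm_storeu_pd(lanes, acc);
        sum = lanes[0] + lanes[1];
    }
#endif
    /* Scalar tail (or the whole loop when SSE2 is unavailable). */
    for (; i < n; i++)
        sum += src1[i] * src2[i];
    *dest = sum;
}
```

Dropping the stride arguments lets the loop load adjacent elements with a single instruction, which is what makes the unstrided case a natural fit for SIMD.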
Reopening because of new patch.
There appears to be a bug in the unrolled SSE2 versions. It doesn't manifest
on my main dev machine (a Pentium M laptop) but shows up on another machine (a
Pentium 4). I'm trying to track it down now. Will report back when it's resolved.
Patch doesn't apply.