multsum_f64 has only a reference implementation. I will attach a patch with two optimized versions. Other levels of loop unrolling are easy to produce. Is it a good idea to add those versions as well?
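For readers following along, here is a minimal sketch of what a reference multiply-and-sum and a 4-way unrolled variant might look like. It is not the code from the attached patch. The function names, the helper `load_f64`, the argument order, and the convention that strides are given in bytes are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: read the double located `offset` bytes past `base`. */
static double load_f64(const double *base, ptrdiff_t offset)
{
    return *(const double *)((const char *)base + offset);
}

/* Reference sketch: *dest = sum over i of src1[i] * src2[i].
 * Assumption: sstr1 and sstr2 are strides in bytes. */
static void multsum_f64_ref_sketch(double *dest,
                                   const double *src1, int sstr1,
                                   const double *src2, int sstr2, int n)
{
    double sum = 0.0;
    int i;

    assert(n >= 0);
    for (i = 0; i < n; i++) {
        sum += load_f64(src1, (ptrdiff_t)i * sstr1) *
               load_f64(src2, (ptrdiff_t)i * sstr2);
    }
    *dest = sum;
}

/* 4-way unrolled sketch. Four independent accumulators break the
 * dependency chain on a single running sum. */
static void multsum_f64_unroll4_sketch(double *dest,
                                       const double *src1, int sstr1,
                                       const double *src2, int sstr2, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i = 0;

    assert(n >= 0);
    /* Main loop: four elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        s0 += load_f64(src1, (ptrdiff_t)(i + 0) * sstr1) *
              load_f64(src2, (ptrdiff_t)(i + 0) * sstr2);
        s1 += load_f64(src1, (ptrdiff_t)(i + 1) * sstr1) *
              load_f64(src2, (ptrdiff_t)(i + 1) * sstr2);
        s2 += load_f64(src1, (ptrdiff_t)(i + 2) * sstr1) *
              load_f64(src2, (ptrdiff_t)(i + 2) * sstr2);
        s3 += load_f64(src1, (ptrdiff_t)(i + 3) * sstr1) *
              load_f64(src2, (ptrdiff_t)(i + 3) * sstr2);
    }
    /* Tail loop: the remaining 0 to 3 elements. */
    for (; i < n; i++) {
        s0 += load_f64(src1, (ptrdiff_t)i * sstr1) *
              load_f64(src2, (ptrdiff_t)i * sstr2);
    }
    *dest = (s0 + s1) + (s2 + s3);
}
```

Because the unrolled variant sums in a different order, its result can differ from the reference in the last bits for general floating-point input. Any test that compares the two should allow a small tolerance.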
Created attachment 5660: multsum_f64_unroll8 and multsum_f64_sse2_unroll4
Applied.
Created attachment 5770: multsum.patch

This is a new version of the patch against latest anoncvs. It does two things.

First, it fixes the previously broken SSE2 implementation. This version actually speeds things up notably. Witness:

multsum_f64
  multsum_f64_unroll8 ave=576 std=1.14286
  multsum_f64_sse2_unrollb ave=573 std=1.14286
  multsum_f64_sse2_unrolla ave=568 std=1.14286
  multsum_f64_sse2 ave=576 std=1.14286
  multsum_f64_ref ave=850.444 std=1.16741

Second, it introduces an unstrided version of multsum for f32 and f64 with SSE2 optimized versions. Results:

multsum_f32_ns
  multsum_f32_ns_sse ave=330 std=11.3928
  multsum_f32_ns_ref ave=737 std=1.125

multsum_f64_ns
  multsum_f64_ns_sse2_unroll2 ave=372 std=1.14286
  multsum_f64_ns_sse2_unroll ave=382 std=1.14286
  multsum_f64_ns_sse2 ave=463.444 std=3.59953
  multsum_f64_ns_ref ave=734 std=1.125
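To illustrate the unstrided idea, here is a rough sketch of an f64 multiply-and-sum over contiguous arrays that uses SSE2 intrinsics when the compiler defines `__SSE2__` and falls back to a scalar loop otherwise. It is not taken from multsum.patch. The name `multsum_f64_ns_sketch` and its signature are hypothetical, and unaligned loads are used so that no alignment of the inputs is assumed.

```c
#include <assert.h>
#include <stddef.h>
#ifdef __SSE2__
#include <emmintrin.h>
#endif

/* Hypothetical unstrided sketch: *dest = sum over i of src1[i] * src2[i],
 * with both inputs stored contiguously. */
static void multsum_f64_ns_sketch(double *dest, const double *src1,
                                  const double *src2, int n)
{
    double sum = 0.0;
    int i = 0;

    assert(n >= 0);
#ifdef __SSE2__
    {
        /* Two doubles per 128-bit register, accumulated lane by lane. */
        __m128d acc = _mm_setzero_pd();
        double lanes[2];

        for (; i + 2 <= n; i += 2) {
            __m128d a = _mm_loadu_pd(src1 + i); /* unaligned load */
            __m128d b = _mm_loadu_pd(src2 + i);
            acc = _mm_add_pd(acc, _mm_mul_pd(a, b));
        }
        /* Horizontal add of the two lanes. */
        _mm_storeu_pd(lanes, acc);
        sum = lanes[0] + lanes[1];
    }
#endif
    /* Scalar tail, or the whole loop when SSE2 is unavailable. */
    for (; i < n; i++) {
        sum += src1[i] * src2[i];
    }
    *dest = sum;
}
```

With contiguous inputs, each pair of elements can be fetched with a single vector load, which is not possible in general when the stride is arbitrary.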
Reopening because of new patch.
There appears to be a bug in the unrolled SSE2 versions. It doesn't manifest on my main dev machine (a Pentium M laptop) but shows up on another machine (a Pentium 4). I'm trying to track it down now. Will report back when it's resolved.
Patch doesn't apply.