Bug 59701 - lp_test_arit fails on non-sse41 capable machines, breaking make check
lp_test_arit fails on non-sse41 capable machines, breaking make check
Status: RESOLVED FIXED
Product: Mesa
Classification: Unclassified
Component: Other
git
x86-64 (AMD64) Linux (All)
: medium normal
Assigned To: Roland Scheidegger
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2013-01-22 08:58 UTC by Michel Dänzer
Modified: 2013-02-18 13:55 UTC (History)
4 users (show)

See Also:


Attachments
proposed patch (5.67 KB, patch)
2013-01-29 22:27 UTC, Roland Scheidegger
Details | Splinter Review

Note You need to log in before you can comment on or make changes to this bug.
Description Michel Dänzer 2013-01-22 08:58:12 UTC
fract(-1.40129846e-45): ref = 0.99999994, out = -1.40129846e-45, precision = -0.000000 bits, FAIL
fract(-1.17999995e-38): ref = 0.99999994, out = -1.17999995e-38, precision = -0.000000 bits, FAIL
fract(-1.62981451e-08): ref = 0.99999994, out = -1.62981451e-08, precision = -0.000000 bits, FAIL
fract(-1.1920929e-07): ref = 0.999999881, out = -1.1920929e-07, precision = -0.000000 bits, FAIL
fract(3.40282347e+38): ref = 0, out = 0.99999994, precision = -inf bits, FAIL
fract(-3.40282347e+38): ref = 0, out = -3.40282347e+38, precision = -inf bits, FAIL
FAIL: lp_test_arit


This is using Tom Stellard's LLVM 3.2 tree on

processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 18
model		: 1
model name	: AMD A8-3850 APU with Radeon(tm) HD Graphics
stepping	: 0
microcode	: 0x3000027
cpu MHz		: 800.000
cache size	: 1024 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 6
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt arat hw_pstate npt lbrv svm_lock nrip_save pausefilter
bogomips	: 5799.66
TLB size	: 1536 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate
Comment 1 José Fonseca 2013-01-22 10:51:02 UTC
Hi Michel,

It looks like fract is seriously busted.

The processor doesn't have SSE41, so no ROUNDPS instruction.

This can be reproduced on other processors doing:

diff --git a/src/gallium/auxiliary/gallivm/lp_bld_init.c b/src/gallium/auxiliary/gallivm/lp_bld_init.c
index 050eba7..d677dbb 100644
--- a/src/gallium/auxiliary/gallivm/lp_bld_init.c
+++ b/src/gallium/auxiliary/gallivm/lp_bld_init.c
@@ -489,7 +489,7 @@ lp_build_init(void)
 
    gallivm_initialized = TRUE;
 
-#if 0
+#if 1
    /* For simulating less capable machines */
    util_cpu_caps.has_sse3 = 0;
    util_cpu_caps.has_ssse3 = 0;
Comment 2 Roland Scheidegger 2013-01-22 15:05:38 UTC
Hmm fract() is such a mess. The usual definition of fract(x) is sub(x) - floor(x), and it is expected this returns a number in [0,1[ except it doesn't actually work due to the funky float math (floor(verysmallnegnumber) is -1.0f (exact) and the subtraction will be 1.0f (exact) too).
But here it must be floor which is busted, returning 0.0f. In particular ifloor without rounding available which already does a lot of work to fix results, but maybe the math doesn't quite work out around 0.0f. In particular I'm not really sure about that offset constant. Looks to me for denorms it would always be broken but apparently it fails for values further away from zero too.
(The last two failures with very large numbers are quite expected too I guess as we're using fp-to-int and back and these values exceed the representable int range. I can't think of an easy workaround, probably doable with a couple comparisons, masks etc.)
Comment 3 Roland Scheidegger 2013-01-22 19:28:07 UTC
I'm wondering if something like this (in horrible pseudo-code) would be a solution (for ifloor):
xtruncint = fptoint(x)
xtrunc = inttofp(xtruncint)
mask = xtrunc > x
maskone = mask & 1
xintfloor = xtruncint - maskone;

I think this might work for all values as long as they don't exceed integer range, including negative zero as long as cmp(negzero, poszero) equals true which I think it should. It has also similar complexity as what we're doing now.
For floor() if you'd also want it to work with huge exponents I think you'd need to additionally compare the original number (with masked out sign bit) with 2^24 and if it's larger just use the original value (which will already represent an exact integer in this case) using some select (given that you didn't have round in the first place meaning another and/andnot/or).
For NaNs etc. the results will probably remain pretty unpredictable unless you check for those as well but that's probably getting silly.
Honestly I don't know if it's worth fixing this.
Comment 4 José Fonseca 2013-01-22 20:14:40 UTC
(In reply to comment #3)
> Honestly I don't know if it's worth fixing this.

Yep. I think in the short term, the best thing would be to whitelist these failures when SSE41 is not available.
Comment 5 José Fonseca 2013-01-29 07:52:21 UTC
This is failing not only denormals, but also large values.

We could also do IEEE bit manipulation, as glibc does:

  http://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/ieee754/flt-32/s_floorf.c

It's a lot of logic, but probably the only way to get this right for all cases...
Comment 6 José Fonseca 2013-01-29 07:57:15 UTC
Even easier, we could just use the floor instrinsic that was added to LLVM 3.2:

  http://llvm.org/releases/3.2/docs/LangRef.html#int_floor
Comment 7 Roland Scheidegger 2013-01-29 15:19:15 UTC
Yes I mentioned the large values too. My guess though is like some other intrinsics llvm will try to call into external library code if hw support is not available for this intrinsic (and fail miserably with jit). I haven't tried it, though.
Comment 8 Roland Scheidegger 2013-01-29 22:27:58 UTC
Created attachment 73882 [details] [review]
proposed patch

Here's something I think could work, though no doubt floor will be slower. Seems to pass lp_test_arit.
It also adjusts ifloor to follow similar logic though I guess now that floor/ifloor are separated maybe should leave it alone (the new version is actually 4 instead of 5 instructions but probably worse in execution time in typical case since the "latency chain" is longer). It may work better for very small negative values though now (not sure) but this might be unnecessary.
ifloor_fract will still return garbage (independent of new or old ifloor) in particular for the fract part (ifloor is undefined anyway) but I didn't inspect the code enough to see if that's a problem.
Comment 9 Michel Dänzer 2013-01-31 09:11:58 UTC
The patch fixes the problem for me.
Comment 10 José Fonseca 2013-01-31 15:26:35 UTC
Looks good to me too.

Don't forget to take out src/gallium/auxiliary/gallivm/lp_bld_init.c from the patch.
Comment 11 Roland Scheidegger 2013-01-31 15:42:02 UTC
(In reply to comment #10)
> Looks good to me too.
> 
> Don't forget to take out src/gallium/auxiliary/gallivm/lp_bld_init.c from
> the patch.

Ok. I'll mention though that this is not a complete fix, round/trunc/ceil are broken exactly the same way (I think ceil is affected by both problems, trunc/round should only be affected by the very large exponents one). "Luckily" lp_test_arit doesn't test any such values for floor/ceil/trunc/round, only for fract...
Comment 12 José Fonseca 2013-01-31 16:01:56 UTC
(In reply to comment #11)
> Ok. I'll mention though that this is not a complete fix, round/trunc/ceil
> are broken exactly the same way (I think ceil is affected by both problems,
> trunc/round should only be affected by the very large exponents one).
> "Luckily" lp_test_arit doesn't test any such values for
> floor/ceil/trunc/round, only for fract...

Please add FIXME comments to those functions then.
Comment 13 Roland Scheidegger 2013-02-18 12:42:12 UTC
Should be fixed by c25ae5d27b114e23d5734f846002df1a05759658.
Comment 14 Darxus 2013-02-18 13:55:47 UTC
Woo, verified make check passes with commit dd599188d2868838541859a76800a8420958d358 (current latest), thanks.