Age | Commit message (Collapse) | Author | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
these did not truncate excess precision in the return value. fixing
them looks like considerable work, and the current C code seems to
outperform them significantly anyway.
long double functions are left in place because they are not subject
to excess precision issues and probably better than the C code.
|
|
this commit is for the sake of reviewable history.
|
|
|
|
analogous to commit 1c9afd69051a64cf085c6fb3674a444ff9a43857 for
atan[2][f].
|
|
for functions implemented in C, this is a requirement of C11 (F.6);
strictly speaking that text does not apply to standard library
functions, but it seems to be intended to apply to them, and C2x is
expected to make it a requirement.
failure to drop excess precision is particularly bad for inverse trig
functions, where a value with excess precision can be outside the
range of the function (entire range, or range for a particular
subdomain), breaking reasonable invariants a caller may expect.
|
|
commit f3ed8bfe8a82af1870ddc8696ed4cc1d5aa6b441 inadvertently removed
labels that were still needed.
|
|
commit 31c5fb80b9eae86f801be4f46025bc6532a554c5 introduced underflow
code paths for the i386 math asm, along with checks on the fpu status
word to skip the underflow-generation instructions if the underflow
flag was already raised. unfortunately, at least one such path, in
log1p, returned with 2 items on the x87 stack rather than just 1 item
for the return value. this is a violation of the ABI's calling
convention, and could cause subsequent floating point code to produce
NANs due to x87 stack overflow. if floating point results are used in
flow control, this can lead to runaway wrong code execution.
rather than reviewing each "underflow already raised" code path for
correctness, remove them all. they're likely slower than just
performing the underflow code unconditionally, and significantly more
complex.
all of this code should be ripped out and replaced by C source files
with inline asm. doing so would preclude this kind of error by having
the compiler perform all x87 stack register allocation and stack
manipulation, and would produce comparable or better code. however
such a change is a much larger project.
|
|
none of these are actual textrels because of ld-time binding performed
by -Bsymbolic-functions, but I'm changing them with the goal of making
ld-time binding purely an optimization rather than relying on it for
semantic purposes.
in the case of memmove's call to memcpy, making it explicit that the
memmove asm is assuming the forward-copying behavior of the memcpy asm
is desirable anyway; in case memcpy is ever changed, the semantic
mismatch would be apparent while editing memmcpy.s.
|
|
fnstsw does not wait for pending unmasked x87 floating-point exceptions
and it is the same as fstsw when all exceptions are masked which is the
only environment libc supports.
|
|
weak_alias was only in the c code, so drem was missing on platforms
where remainder is implemented in asm.
|
|
there were two problems:
* omitted underflow on subnormal results: exp2l(-16383.5) was calculated
as sqrt(2)*2^-16384, the last bits of sqrt(2) are zero so the down scaling
does not underflow eventhough the result is in subnormal range
* spurious underflow for subnormal inputs: exp2l(0x1p-16400) was evaluated
as f2xm1(x)+1 and f2xm1 raised underflow (because inexact subnormal result)
the first issue is fixed by raising underflow manually if x is in
(-32768,-16382] and not integer (x-0x1p63+0x1p63 != x)
the second issue is fixed by treating x in (-0x1p64,0x1p64) specially
for these fixes the special case handling was completely rewritten
|
|
|
|
underflow is raised by an inexact subnormal float store,
since subnormal operations are slow, check the underflow
flag and skip the store if it's already raised
|
|
|
|
|
|
with naive exp2l(x*log2e) the last 12bits of the result was incorrect
for x with large absolute value
with hi + lo = x*log2e is caluclated to 128 bits precision and then
expl(x) = exp2l(hi) + exp2l(hi) * f2xm1(lo)
this gives <1.5ulp measured error everywhere in nearest rounding mode
|
|
__invtrigl is not needed when acosl, asinl, atanl have asm
implementations
|
|
exp(inf), exp(-inf), exp(nan) used to raise wrong flags
|
|
|
|
old: 2*atan2(sqrt(1-x),sqrt(1+x))
new: atan2(fabs(sqrt((1-x)*(1+x))),x)
improvements:
* all edge cases are fixed (sign of zero in downward rounding)
* a bit faster (here a single call is about 131ns vs 162ns)
* a bit more precise (at most 1ulp error on 1M uniform random
samples in [0,1), the old formula gave some 2ulp errors as well)
|
|
use (1-x)*(1+x) instead of (1-x*x) in asin.s
the later can be inaccurate with upward rounding when x is close to 1
|
|
the int part was wrong when -1 < x <= -0 (+0.0 instead of -0.0)
and the size and performace gain of the asm version was negligible
|
|
(fldl instruction was used instead of flds and fldt)
|
|
special care is made to avoid any inexact computations when either arg
is zero (in which case the exact absolute value of the other arg
should be returned) and to support the special condition that
hypot(±inf,nan) yields inf.
hypotl is not yet implemented since avoiding overflow is nontrivial.
|
|
the old formula atan2(1,sqrt((1+x)/(1-x))) was faster but
could give nan result at x=1 when the rounding mode is
FE_DOWNWARD (so 1-1 == -0 and 2/-0 == -inf), the new formula
gives -0 at x=+-1 with downward rounding.
|
|
the fscale instruction is slow everywhere, probably because it
involves a costly and unnecessary integer truncation operation that
ends up being a no-op in common usages. instead, construct a floating
point scale value with integer arithmetic and simply multiply by it,
when possible.
for float and double, this is always possible by going to the
next-larger type. we use some cheap but effective saturating
arithmetic tricks to make sure even very large-magnitude exponents
fit. for long double, if the scaling exponent is too large to fit in
the exponent of a long double value, we simply fallback to the
expensive fscale method.
on atom cpu, these changes speed up scalbn by over 30%. (min rdtsc
timing dropped from 110 cycles to 70 cycles.)
|
|
this is a lot more efficient and also what is generally wanted.
perhaps the bit shuffling could be more efficient...
|
|
|
|
exponents (base 2) near 16383 were broken due to (1) wrong cutoff, and
(2) inability to fit the necessary range of scalings into a long
double value.
as a solution, we fall back to using frndint/fscale for insanely large
exponents, and also have to special-case infinities here to avoid
inf-inf generating nan.
thankfully the costly code never runs in normal usage cases.
|
|
|
|
|
|
|
|
this could perhaps use some additional testing for corner cases, but
it seems to be correct.
|
|
up to 30% faster exp2 by avoiding slow frndint and fscale functions.
expm1 also takes a much more direct path for small arguments (the
expected usage case).
|
|
|
|
|
|
|
|
unlike some implementations, these functions perform the equivalent of
gcc's -ffloat-store on the result before returning. this is necessary
to raise underflow/overflow/inexact exceptions, perform the correct
rounding with denormals, etc.
|
|
unlike trig functions, these are easy to do in asm because they do not
involve (arbitrary-precision) argument reduction. fpatan automatically
takes care of domain issues, and in asin and acos, fsqrt takes care of
them for us.
|
|
|
|
infinities were getting converted into nans. the new code simply tests
for infinity and replaces it with a large magnitude value of the same
sign.
also, the fcomi instruction is apparently not part of the i387
instruction set, so avoid using it.
|
|
|