Need faster Exponentiate
Re: Need faster Exponentiate
Thanks allot Richard!
Re: Need faster Exponentiate
This one rocks. I will be using it for my program: viewtopic.php?f=8&t=23188Richard wrote: and here is the faster single routine using the same approximations.
Do you want to be credited?
Re: Need faster Exponentiate
I'm happy that you find them useful. You can credit me as “Richard Freebasic”.
I am interested in the particular n-gram techniques you are using for the solution of substitution ciphers. Do you have any references to papers or books that cover n-gram based solution.
These routines were written as “proof of concept”. Regarding speed and accuracy trade-off, I wanted to see what could be done in the CPU arithmetic pipeline faster than the intrinsic functions. I expect speeds will vary depending on the particular CPU version, and the relationship between it's floating point and integer pipelines.
Working in the primitive braided space between integer and floating point makes this an interesting mental challenge. The “octaves” of the floating point format, with the “harmonic” periodicity of the polynomial approximation error functions are fascinating. The variable multiplier b convolutes the error functions in interesting ways to produce beats between the polynomial efs that appear in the final error function. Here is test code that shows how the resultant variable error functions seem to remain within about ±0.5% bounds.
I am interested in the particular n-gram techniques you are using for the solution of substitution ciphers. Do you have any references to papers or books that cover n-gram based solution.
These routines were written as “proof of concept”. Regarding speed and accuracy trade-off, I wanted to see what could be done in the CPU arithmetic pipeline faster than the intrinsic functions. I expect speeds will vary depending on the particular CPU version, and the relationship between it's floating point and integer pipelines.
Working in the primitive braided space between integer and floating point makes this an interesting mental challenge. The “octaves” of the floating point format, with the “harmonic” periodicity of the polynomial approximation error functions are fascinating. The variable multiplier b convolutes the error functions in interesting ways to produce beats between the polynomial efs that appear in the final error function. Here is test code that shows how the resultant variable error functions seem to remain within about ±0.5% bounds.
Code: Select all
'=======================================================================
' Approximate the power function
' Show that the error function {for b < 3.5} is less than 0.5% of the output
' Use only primitive +, -, *, logic or shift
' Avoid division, intrinsics and decisions that change the instruction order
' https://www.freebasic.net/forum/viewtopic.php?f=3&t=27600
'
'=======================================================================
Function raise( Byval x As Single, Byval b As Single ) As Single
' IEEE 754 Single Seeeeeeeefffffffffffffffffffffff ' single format
#define expo_bias &b00111111000000000000000000000000uL ' bias = 126 * 2^23
#define mant_mask &b00000000011111111111111111111111uL ' positive mantissa
#define mant_bits 23 ' number of bits used to store mantissa
#define bias 126 ' the bias that puts mantissa in range 0.5 to 0.999
Dim As Long Ptr fp = Cptr( Long Ptr, @x ) ' make ptr to bit pattern x
Dim As Long expo = ( *fp Shr mant_bits ) - bias ' extract exponent
*fp And= mant_mask ' kill the exponent bits
*fp Or= expo_bias ' bias exponent so value is between 0.5000 and 0.9999
x = expo - 3.153635 + x * ( 6.095832 + x * ( -4.207588 + x * 1.266028 ) ) ' Log2(x)
x *= b ' raise Log to power b, x = b * Log2(x)
expo = Int( x ) ' integer part will become exponent of antilog2(x)
x -= expo ' fractional part, makes x range 0.0000 to 0.9999
x = 1.00 + x * ( 0.6557133 + x * 0.3442867 ) ' quick 2^x over 0 to 1
*fp += ( expo Shl mant_bits ) ' add the integer to biased log exponent
Return x ' approximate x ^ b
End Function ' maximum error is about 0.5 percent of output
' Warning; zero, infinite and denormalised values are not be handled correctly
'=======================================================================
' set up graphics screen
Dim As Integer sx, sy, sz
Screeninfo sx, sy, sz
sx -= 32 ' shrink
sy -= 24
sz = 4 ' 16 colours
Screenres sx, sy, sz
'-----------------------------------------------------------------------
' graphics scale
Dim As Double lhs = 0.90001, rhs = 4.1, lo = -0.5, hi = 0.5
Window ( lhs, lo ) - ( rhs, hi )
Line( lhs, 0 )-( rhs, 0 ), 7
Line( lhs, lo )-( lhs, hi ), 7
Line( rhs, lo )-( rhs, hi ), 7
'=======================================================================
Dim As Double x, y, z, t, q
Dim As Integer i, k
Print
Print " Power =",
For i = 0 To 7
k = i + 8 ' 16 colours
q = i * 0.5
If q < .3 Then q = 0.1
Color k
Print q,
For x = lhs To rhs Step ( rhs - lhs ) / ( sx * 2 )
y = x^q
z = raise( x, q )
t = z - y
Pset( x, 100 * t / y ), k
Next x
Next i
For x = 0 To 4
Draw String ( x+.01, -0.005), Str(x), 15
Next x
'=======================================================================
Sleep
'=======================================================================
Re: Need faster Exponentiate
You can use my multiplier... It does a million digits in under a minute..
Code: Select all
function mul_loop_7( num1 as string , num2 as string ) as string
dim as string number1 = num1
dim as string number2 = num2
'make numbers equal multiple of 7 bytes
dim as string str1
dim as longint dec1
do
str1 = str(len(number1)/7)
dec1 = instr(1,str1,".")
if dec1 <> 0 then number1 = "0" + number1
loop until dec1 = 0
do
str1 = str(len(number2)/7)
dec1 = instr(1,str1,".")
if dec1 <> 0 then number2 = "0" + number2
loop until dec1 = 0
'convert the numeric strings to use pointers
'convert number1
dim as string n1 = string(len(number1)*8,chr(0))
dim as ulongint ptr ulp1 = cptr(ulongint ptr,strptr(n1))
dim as longint valu1
dim as longint len_1 = 0
for a as longint = 0 to len(number1)-1 step 7
valu1 = (number1[a ]-48)*1e6
valu1+= (number1[a+1]-48)*1e5
valu1+= (number1[a+2]-48)*1e4
valu1+= (number1[a+3]-48)*1e3
valu1+= (number1[a+4]-48)*1e2
valu1+= (number1[a+5]-48)*1e1
valu1+= (number1[a+6]-48)'*1
*ulp1 = valu1
ulp1+=1
len_1+=8
next
number1 = left(n1,len_1)
n1=""
'convert the numeric strings to use pointers
'convert number2
dim as string n2 = string(len(number2)*8,chr(0))
dim as ulongint ptr ulp2 = cptr(ulongint ptr,strptr(n2))
dim as longint valu2
dim as longint len_2 = 0
for a as longint = 0 to len(number2)-1 step 7
valu2 = (number2[a ]-48)*1e6
valu2+= (number2[a+1]-48)*1e5
valu2+= (number2[a+2]-48)*1e4
valu2+= (number2[a+3]-48)*1e3
valu2+= (number2[a+4]-48)*1e2
valu2+= (number2[a+5]-48)*1e1
valu2+= (number2[a+6]-48)'*1
*ulp2 = valu2
ulp2+=1
len_2+=8
next
number2 = left(n2,len_2)
n2=""
'create accumulator
dim as string answer = string( len(number1) + len(number2) , chr(0) )
dim as ulongint outblocks = ( len(answer) \ 8 )
dim as ulongint ptr outplace = cptr(ulongint ptr , strptr(answer)) + (outblocks - 1 )
dim as ulongint stops = ( (len(number1)\8) + (len(number2)\8) )
dim as ulongint value = 0
dim as longint hold = -1
dim as longint locat = 0
dim as longint vals = 0
dim as ulongint high1 = ( len(number1) \ 8 ) - 1
dim as ulongint high2 = ( len(number2) \ 8 ) - 1
dim as longint ptr num1_ptr = cptr( ulongint ptr , strptr(number1) ) + high1
dim as longint ptr num2_ptr = cptr( ulongint ptr , strptr(number2) ) + high2
do
hold+=1
vals = hold
locat = 0
if vals > high2 then vals = high2 : locat = (hold - high2)
num1_ptr = cptr( ulongint ptr , strptr(number1) ) + high1 - locat
num2_ptr = cptr( ulongint ptr , strptr(number2) ) + high2 - vals
do
value+= *( num1_ptr ) * *( num2_ptr )
num1_ptr-=1
num2_ptr+=1
if num1_ptr = cptr( ulongint ptr , strptr(number1) ) - 1 then goto done
if num2_ptr > cptr( ulongint ptr , strptr(number2) ) + high2 then goto done
loop
Done:
*outplace = value mod 1e7
outplace-= 1
value = value \ 1e7
loop until hold = stops-2
*outplace = value mod 1e7
'convert answer back to ascii
dim as string outtext=""
outplace = cptr( ulongint ptr , strptr(answer) )
for a as ulongint = 1 to outblocks step 1
value = *outplace
outplace+=1
outtext+= right("0000000" + str(value),7)
next
outtext = ltrim(outtext,"0")
return outtext
end function
Re: Need faster Exponentiate
Okay, will do. My algorithm is mostly original and I have no books to share, have been at it for several years though. I got the concept of using letter n-grams from zkdecrypto. For example a 5-gram such as "SHARE" has a score equal to the logarithm of its frequency in a corpus.Richard wrote: I'm happy that you find them useful. You can credit me as “Richard Freebasic”.
I am interested in the particular n-gram techniques you are using for the solution of substitution ciphers. Do you have any references to papers or books that cover n-gram based solution.
A simple explanation of how the score formula works:
All the 5-grams scores of the solution plain text are summed and divided by the total amount of 5-grams in the solution plain text to normalize the score. This divided sum is then multiplicated by the 1-gram entropy of the solution plain text to keep the solver from converging on "ETAION" heavy stuff. I use a weight on the entropy depending on the n-gram size: entropy^weight. Here is where your speedy function comes into play. On a fast computer my solver can do many tens of millions of iterations of this n-gram summing per second so it is important that the exponentiate function is fast so that it does not cause much of a slow down.
It is beyond my level of education. I am thankful that you came up with it. :)Richard wrote: These routines were written as “proof of concept”. Regarding speed and accuracy trade-off, I wanted to see what could be done in the CPU arithmetic pipeline faster than the intrinsic functions. I expect speeds will vary depending on the particular CPU version, and the relationship between it's floating point and integer pipelines.
Re: Need faster Exponentiate
This more mature version is faster now that several instructions have been removed, merged or modified.
The error function has unfortunately increased slightly, but is much better behaved since I have polished the polynomial coefficients. The step changes and reflections in the error function when crossing octave boundaries have been removed. That required even order (quadratic) polynomials, to get odd (cubic) error functions.
The maximum percentage error is now = 0.54 * ( Abs( power ) + 0.49 ).
Typically; 0.5% for square roots; 1.24% for squares and 1.65% for cubes.
See table in code block.
This is the latest version, wrapped in an error function plotter.
The error function has unfortunately increased slightly, but is much better behaved since I have polished the polynomial coefficients. The step changes and reflections in the error function when crossing octave boundaries have been removed. That required even order (quadratic) polynomials, to get odd (cubic) error functions.
The maximum percentage error is now = 0.54 * ( Abs( power ) + 0.49 ).
Typically; 0.5% for square roots; 1.24% for squares and 1.65% for cubes.
See table in code block.
This is the latest version, wrapped in an error function plotter.
Code: Select all
'=======================================================================
' Approximate the power function; Written by Richard, Freebasic.
'=======================================================================
' Approximate x ^ b, for x > 0. Balanced error without octave steps
Function raise( Byval x As Single, Byval b As Single ) As Single
' IEEE 754 Single Seeeeeeeefffffffffffffffffffffff ' Sign_Log_Linear format
#define expo_bias &b00111111000000000000000000000000uL ' bias = 126 * 2^23
#define frac_mask &b00000000011111111111111111111111uL ' positive fraction
#define frac_bits 23 ' bits used to store fraction, ignoring implicit msb
#define bias 126 ' exponent needed to make single to range from 0.5 to 0.999
Dim As Long Ptr fp = Cptr( Long Ptr, @x ) ' pointer to bit pattern of x
'-------- approximate Log2(x) --------------------------------------
Dim As Long expo = ( *fp Shr frac_bits ) - bias ' extract exponent
*fp = ( *fp And frac_mask ) Or expo_bias ' bias x into range 0.5 to 0.999
x = b * ( expo - 2.693111 + x * ( 4.079332 + x * -1.386221 ) ) ' log2 over 0.5 to 1
' to raise x to power b, y = x^b, y = Exp2( b * Log2( x ) )
'-------- approximate 2^x, is antiLog base 2 ( x ) -----------------
expo = Int( x ) ' integer part will later adjust exponent of Alog2(x)
x -= expo ' reduced to fraction in range 0.0 thru 0.999
x = 1.00 + x * ( 0.6557133 + x * 0.3442867 ) ' approx 2^x over range 0 to 1
*fp += ( expo Shl frac_bits ) ' restore early integer to biased log exponent
Return x
End Function ' Warning; avoid x <= 0, infinite or denormalised values
'-----------------------------------------------------------------------
' maximum error rises as magnitude of power rises
' Maximum percentage error = ( pwr + 0.49 ) * 0.54
' abs maximum
' pwr % error
' 0.5, 0.560634
' 1.0, 0.852009
' 1.5, 1.057730
' 2.0, 1.245606
' 2.5, 1.612362
' 3.0, 1.638780
' 3.5, 2.136535
' 4.0, 2.317659
' 4.5, 2.646873
'=======================================================================
' set up graphics screen
Dim As Integer sx, sy, sz
Screeninfo sx, sy, sz
sx -= 32 ' shrink
sy -= 24
sz = 4 ' 16 colours
Screenres sx, sy, sz
'--------------------------------------
' graphics scale
Dim As Double lhs = 0.90, rhs = 4.1, lo = -0.5, hi = 0.5
Window ( lhs, lo ) - ( rhs, hi )
Line( lhs, 0 )-( rhs, 0 ), 7
Line( lhs, lo )-( lhs, hi ), 7
Line( rhs, lo )-( rhs, hi ), 7
'=======================================================================
Dim As Double x, y, z, t, q, max = -100, min = 100, scale = 15
Dim As Integer i, k = 7
Print
Print " Power =",
'--------------------------------------
For q = -5.5 to 5.51 step .25
k += 1 ' 16 colours
if k > 15 then k = 8
Color k
Print using "####.##"; q;
'----------------------------------
For x = lhs To rhs Step ( rhs - lhs ) / ( sx * 2 )
y = x^q
z = raise( x, q )
t = ( z - y ) / y ' this is the error
If min > t Then min = t
If max < t Then max = t
Pset( x, scale * t ), k
Next x
Next q
'=======================================================================
For x = 0 To 4
Draw String ( x+.01, -0.005), Str(x), 15
Next x
color 15
print
print
print using " Error bounds ####.###### % ####.###### %"; max * 100; min * 100
'=======================================================================
Sleep
'=======================================================================
Re: Need faster Exponentiate
Thanks for your updated function, error table and plotter Richard. Much appreciated!
Re: Need faster Exponentiate
Latest version is slower by two multiplies and two adds.
I have reduced error dependency on the power b, by increasing the order and accuracy of the Log2 approximation.
It is also now less kinky and better behaved.
Use this version when higher power errors are more of a problem than speed.
Maximum error is < ± 0.30% for b < ± 2.5
Maximum error is < ± 0.35% for b < ± 7.0
Maximum error is < ± 0.50% for b < ± 22.
Maximum error is < ± 0.92% for b < ± 60.
I have reduced error dependency on the power b, by increasing the order and accuracy of the Log2 approximation.
It is also now less kinky and better behaved.
Use this version when higher power errors are more of a problem than speed.
Maximum error is < ± 0.30% for b < ± 2.5
Maximum error is < ± 0.35% for b < ± 7.0
Maximum error is < ± 0.50% for b < ± 22.
Maximum error is < ± 0.92% for b < ± 60.
Code: Select all
'=======================================================================
' Approximate the power function; written by Richard.
'=======================================================================
' Approximate x ^ b, for x > 0; |b| < 60. Avoid infinite or denormalised x values
Function raise( Byval x As Single, Byval b As Single ) As Single
' to raise x to power b, y = x^b, y = Exp2( b * Log2( x ) )
' IEEE 754 Single Seeeeeeeefffffffffffffffffffffff ' Sign_Log_Linear format
#define expo_bias &b00111111000000000000000000000000uL ' bias = 126 * 2^23
#define frac_mask &b00000000011111111111111111111111uL ' positive fraction
#define frac_bits 23 ' bits used to store fraction, ignoring implicit msb
#define bias 126 ' exponent that puts single into range from 0.500 to 0.999
Dim As Long Ptr fp = Cptr( Long Ptr, @x ) ' pointer to bit pattern of x
'-------- approximate Log2(x) --------------------------------------
Dim As Long expo = ( *fp Shr frac_bits ) - bias ' extract exponent
*fp = ( *fp And frac_mask ) Or expo_bias ' bias x into range 0.5 to 0.999
#Define a0 -3.5353058e0 ' log2 polynomial approximation
#Define a1 8.2616852e0 ' over the range from 0.5 to 1.0
#Define a2 -8.7235965e0 ' maximum error is 1.54e-4 = 0.000154
#Define a3 5.3685806e0 ' Zeros are located at
#Define a4 -1.3713634e0 ' 0.500, 0.575, 0.720, 0.885, 1.000
x = b * ( Csng( expo ) + a0 + ( a1 + ( a2 + ( a3 + a4 * x ) * x ) * x ) * x )
'-------- approximate 2^x, is antiLog2( x ) ------------------------
expo = Int( x ) ' this integer part adjusts the exponent of Alog2(x)
x -= expo ' x reduced to fraction in range 0.000 thru 0.999
x = 1e0 + ( 0.6607687e0 + 0.3392313e0 * x ) * x ' approx 2^x over range 0 to 1
*fp += ( expo Shl frac_bits ) ' restore early integer to biased log exponent
Return x
End Function
' Maximum error is < ± 0.30% for b < ± 2.5
' Maximum error is < ± 0.35% for b < ± 7.0
' Maximum error is < ± 0.50% for b < ± 22.
' Maximum error is < ± 0.92% for b < ± 60.
' Beyond which the exponent overflows, so it all falls apart
'
'=======================================================================
' set up graphics screen
Dim As Integer sx, sy, sz
Screeninfo sx, sy, sz
sx -= 32 ' shrink
sy -= 24
sz = 4 ' 16 colours
Screenres sx, sy, sz
'--------------------------------------
' graphics scale
Dim As Double x, y, z, t, q
Dim As Double lhs = 0.8, rhs = 4.2, lo = -0.5, hi = 0.5
Window ( lhs, lo ) - ( rhs, hi )
Line( lhs, 0 )-( rhs, 0 ), 1
For x = 1 To 4
Line( x, lo )-( x, hi ), 1
Next x
'=======================================================================
' plot the error curves
Dim As Double max = -1e6, min = 1e6, scale = 100
Dim As Integer i, k = 7
Print
Print " Powers = ";
'--------------------------------------
For q = -2.5 To 2.51 Step 0.1
k += 1 ' 16 colours
If k > 15 Then k = 8
Color k
Print Using "####.##"; q;
'----------------------------------
For x = lhs To rhs Step ( rhs - lhs ) / ( sx * 2 )
y = x^q ' correct
z = raise( x, q ) ' approx
t = ( z - y ) / y ' t is the error
If min > t Then min = t
If max < t Then max = t
Pset( x, scale * t ), k
Next x
Next q
Line( lhs, max * scale )-( rhs, max * scale ), 1
Line( lhs, min * scale )-( rhs, min * scale ), 1
'=======================================================================
For x = 0 To 4
Draw String ( x+.01, -0.005), Str(x), 15
Next x
Color 15
Print
Print
Print Using " Error bounds ####.### % ####.### %"; max * 100; min * 100
'=======================================================================
Sleep
'=======================================================================
Re: Need faster Exponentiate
Thanks Richard, I have just released a new version this morning and included a mention of your code in the Readme.txt in the base directory of the program.
viewtopic.php?f=8&t=23188
I have tested both your functions with the programs functionality in mind and prefer the faster version though the difference is very small. Is it possible to further optimize the function if for example the b variable (the power raised to) will never exceed 5?
viewtopic.php?f=8&t=23188
I have tested both your functions with the programs functionality in mind and prefer the faster version though the difference is very small. Is it possible to further optimize the function if for example the b variable (the power raised to) will never exceed 5?
Re: Need faster Exponentiate
The problem with approximating diadic functions like a^b is that they cover an area rather than a line, so usually require two, or possibly three, approximations to get one answer.
There are a few possible ways to speed it up further.
1. Implement it as macro “inline code” to remove the function call overhead. Use static variables.
2a. Code it in ASM to use only the extended registers inside the CPU. It would probably run at twice the speed, but would then be processor dependent.
2b. Implement four of the approximations at one time, so four packed singles can be processed in parallel in mmx registers and the fpu pipeline.
3. Define constraints to the a and b input ranges, or quantise those values, so lookup table(s) or minimised polynomials could be used.
4. Look at replacing the Int() function call with inline code.
5. Do a larger part of the surrounding math in a logarithmic number system, not in the log_linear exponent_significand floating point system. That will depend on the surrounding code.
6. Use single rather than double throughout the code wherever possible. That only needs to move half the data bits through the processor-memory bottlenecks.
7. Identify if and when a^b is computed more than once with the same value for a. The earlier value of Log2(a) could then be saved as static rather than recomputed.
8. Buy a faster computer or use a GPU for parallel processing.
There are a few possible ways to speed it up further.
1. Implement it as macro “inline code” to remove the function call overhead. Use static variables.
2a. Code it in ASM to use only the extended registers inside the CPU. It would probably run at twice the speed, but would then be processor dependent.
2b. Implement four of the approximations at one time, so four packed singles can be processed in parallel in mmx registers and the fpu pipeline.
3. Define constraints to the a and b input ranges, or quantise those values, so lookup table(s) or minimised polynomials could be used.
4. Look at replacing the Int() function call with inline code.
5. Do a larger part of the surrounding math in a logarithmic number system, not in the log_linear exponent_significand floating point system. That will depend on the surrounding code.
6. Use single rather than double throughout the code wherever possible. That only needs to move half the data bits through the processor-memory bottlenecks.
7. Identify if and when a^b is computed more than once with the same value for a. The earlier value of Log2(a) could then be saved as static rather than recomputed.
8. Buy a faster computer or use a GPU for parallel processing.
Re: Need faster Exponentiate
FWIW,
below is a routine which allows to adjust precision (the amount of iterations)...
This can be done via the optional i_0_11 param,
which defaults to 6 - and then ensures a max-error of +-1% (over the whole range of exponents).
With that default-value of i_0_11 = 6 (+-1% error), the routine is about 25% faster than "raise()" -
with i_0_11 = 8 (+- 0.25%) it has about the same speed on my machine.
As for performance...
- I've tested on a (4 year old) intel-mobile-cpu
- fbc 32bit, version 1.06 (with -fpu SSE as the only compiler-switch)
There might be further optimizations possible in the iteration-loop,
(where x is calculated as x = Sqr(x) "kinda recursively") -
but I'll leave that for the true math-experts... ;)
HTH
Olaf
below is a routine which allows to adjust precision (the amount of iterations)...
This can be done via the optional i_0_11 param,
which defaults to 6 - and then ensures a max-error of +-1% (over the whole range of exponents).
With that default-value of i_0_11 = 6 (+-1% error), the routine is about 25% faster than "raise()" -
with i_0_11 = 8 (+- 0.25%) it has about the same speed on my machine.
As for performance...
- I've tested on a (4 year old) intel-mobile-cpu
- fbc 32bit, version 1.06 (with -fpu SSE as the only compiler-switch)
There might be further optimizations possible in the iteration-loop,
(where x is calculated as x = Sqr(x) "kinda recursively") -
but I'll leave that for the true math-experts... ;)
Code: Select all
'an approach which applies the (usually quite fast) runtime-Sqr-function (multiple times)
'the optional i_0_11 allows to adjust the amount of iterations - and thus the precision...
'with its default: i_0_11=6 the error is +-1% ... with 7 it is half that, a.s.o. up to 11
Function FastPow(ByVal x As Single, ByVal n As Single, ByVal i_0_11 As Long=6) As Single
Dim i as Long, r As Single = 1
Static L(0 To 11) As Single
If L(0) = 0 Then 'fill the Lookup-Table (once, because it's static)
For i = 0 To 11: L(i) = 1 / 2 ^ (i + 1): Next
End If
If x = 0 Then return 0 'early exit for 0-values of x
If n > 0 Then 'handle positive exponents
Do Until n < 1: r *= x: n -= 1: Loop
For i = 0 To i_0_11
x = Sqr(x)
If n >= L(i) Then r *= x: n -= L(i)
If n <= L(i_0_11) Then Exit For 'early exit (n can't get any smaller)
Next
ElseIf n < 0 Then 'handle the case of neg. exponents
r = 1 / FastPow(x, -n, i_0_11)
End If
return r
End Function
Olaf
Re: Need faster Exponentiate
Thanks for your performance tips Richard, at least I will implement it as a macro. I have not used macro's before and it will be a good start. Point 7 should be applicable. Some of the other points are above my skill level.Richard wrote:The problem with approximating diadic functions like a^b is that they cover an area rather than a line, so usually require two, or possibly three, approximations to get one answer.
There are a few possible ways to speed it up further.
1. Implement it as macro “inline code” to remove the function call overhead. Use static variables.
2a. Code it in ASM to use only the extended registers inside the CPU. It would probably run at twice the speed, but would then be processor dependent.
2b. Implement four of the approximations at one time, so four packed singles can be processed in parallel in mmx registers and the fpu pipeline.
3. Define constraints to the a and b input ranges, or quantise those values, so lookup table(s) or minimised polynomials could be used.
4. Look at replacing the Int() function call with inline code.
5. Do a larger part of the surrounding math in a logarithmic number system, not in the log_linear exponent_significand floating point system. That will depend on the surrounding code.
6. Use single rather than double throughout the code wherever possible. That only needs to move half the data bits through the processor-memory bottlenecks.
7. Identify if and when a^b is computed more than once with the same value for a. The earlier value of Log2(a) could then be saved as static rather than recomputed.
8. Buy a faster computer or use a GPU for parallel processing.
Re: Need faster Exponentiate
Thanks OSchmidt for your function,
I have tested it and it is a bit slower than Richard's for my program. As you say it could probably be optimized a little bit further and have not tried altering the i_0_11 parameter yet.
I have tested it and it is a bit slower than Richard's for my program. As you say it could probably be optimized a little bit further and have not tried altering the i_0_11 parameter yet.
Re: Need faster Exponentiate
As said, it gets faster with decreasing values in i_0_11.
With:
- i_0_11=5 instead of the default 6, you'll save one iteration - and the error will be doubled to about +-2%
- i_0_11=4 instead of the default 6, you'll save two iterations - and the error will be about +-4%
Would like to hear, how it performs when you try it with i_0_11=4 (since this roughly matches the err-interval of the "prior raise()".
Did you ensure the compiler-swich: -fpu SSE already?
Regards,
Olaf
With:
- i_0_11=5 instead of the default 6, you'll save one iteration - and the error will be doubled to about +-2%
- i_0_11=4 instead of the default 6, you'll save two iterations - and the error will be about +-4%
Would like to hear, how it performs when you try it with i_0_11=4 (since this roughly matches the err-interval of the "prior raise()".
Did you ensure the compiler-swich: -fpu SSE already?
Regards,
Olaf
Re: Need faster Exponentiate
Code: Select all
' Compare the single precision x^b with algorithm under test
' reference ref time test test time ratio Error bounds
' single_pwr 69.8ns FastPow11 154.7ns 221.67% 0.03125%
' single_pwr 69.8ns FastPow10 141.5ns 202.68% 0.0625%
' single_pwr 69.7ns FastPow9 127.0ns 182.23% 0.125%
' single_pwr 69.7ns FastPow8 112.3ns 161.07% 0.25%
' single_pwr 69.7ns FastPow7 99.8ns 143.11% 0.5%
' single_pwr 69.7ns FastPow6 87.6ns 125.68% 1.%
' single_pwr 69.7ns FastPow5 75.5ns 108.31% 2.%
' single_pwr 69.7ns FastPow4 63.3ns 90.81% 4.%
' single_pwr 69.8ns FastPow3 52.0ns 74.42% 8.%
' single_pwr 69.8ns FastPow2 41.4ns 59.36% 16.%
' single_pwr 69.9ns FastPow1 32.5ns 46.52% 32.%
' single_pwr 69.8ns Raise 41.7ns 59.72% 0.5%
Later model CPUs map the old external x87 register stack onto the internal mmx register block. Access time is therefore the same for each system. Compiler option -fpu (sse/x87) no longer makes a consistent difference.