ulongInt division using the FPU

General FreeBASIC programming questions.
srvaldez
Posts: 3543
Joined: Sep 25, 2005 21:54

ulongInt division using the FPU

Post by srvaldez »

I came across an interesting article about dividing two UlongInts using double precision arithmetic https://arxiv.org/pdf/2207.08420
unfortunately it's almost 4 times slower than FB native division
then I thought, why not use the FPU ?
the problem is that you can't directly load a 64-bit unsigned integer into the FPU registers but there's a simple solution
using the FPU for ulongInt division is actually a bit faster than native FB,I present to you both the intriguing but slow double precision and the FPU versions

Code: Select all

#cmdline "-asm intel -arch native -gen gcc -Wc -O3"

Extern "C"
    Declare Function round(Byval As Double) As Double
    Declare Function fma( Byval x As Double, Byval y As Double, Byval z As Double ) As Double
End Extern

Function div64dbl(Byval a As Ulongint, Byval b As Ulongint) As Ulongint
    Dim bd As Double = Cdbl(b)
    Dim bs As Single = Csng(bd)
    Dim invbs0 As Single = 1.0f / bs
    Dim invbd0 As Double = Cdbl(invbs0)
    Dim Alpha As Double = fma(-bd, invbd0, 1.0)
    Dim invbd As Double = fma(Alpha, invbd0, invbd0)
    Dim ad As Double = Cdbl(a)
    Dim q1d As Double = ad * invbd0
    Dim q1 As Ulongint = round(q1d)
    Dim r1 As Longint = a - (b * q1)
    Dim r1d As Double = Cdbl(r1)
    Dim q3d As Double = r1d * invbd
    Dim q3 As Longint = round(q3d)
    Dim r3 As Longint = r1 - (b * q3)
    Dim q2 As Longint = Iif(r3 < 0, q3 - 1, q3)
    Dim q0 As Ulongint = q1 + q2
    Dim is_big As boolean = -(Clngint(b) < 0)
    Dim if_big As Ulongint = -(a >= b)
    Dim is_one As boolean = -(b <= 1)
    Dim special As Ulongint = Iif(is_big, if_big, a)
    Return Iif(is_one Orelse is_big, special, q0)
End Function

Private Function div64dbl2(Byval a As Ulongint, Byval b As Ulongint) As Ulongint
    If b <= 1 Then
        Return a
    End If
    If Clngint(b) < 0 Then
        Return -(a >= b)
    End If
    Dim bd As Double = Cdbl(b)
    Dim bs As Single = Csng(bd)
    Dim invbs0 As Single = 1.0f / bs
    Dim invbd0 As Double = Cdbl(invbs0)
    Dim Alpha As Double = fma(-bd, invbd0, 1.0)
    Dim invbd As Double = fma(Alpha, invbd0, invbd0)
    Dim ad As Double = Cdbl(a)
    Dim q1d As Double = ad * invbd0
    Dim q1 As Ulongint = round(q1d)
    Dim r1 As Longint = a - (b * q1)
    Dim r1d As Double = Cdbl(r1)
    Dim q3d As Double = r1d * invbd
    Dim q3 As Longint = round(q3d)
    Dim r3 As Longint = r1 - (b * q3)
    Dim q2 As Longint = Iif(r3 < 0, q3 - 1, q3)
    Return q1 + q2
End Function

Function div64fpu( Byref n As Ulongint, Byref m As Ulongint ) As Ulongint
    Static As Ushort oldcw, cw
    Asm
        fstcw word Ptr [oldcw]
        mov ax, word Ptr [oldcw]
        Or ax, &hC99
        mov word Ptr [cw], ax
        fldcw word Ptr [cw]
        mov rax, [n]
        fild qword Ptr [rax]
        test Byte Ptr [rax+7], 128
        jz 1f
        fadd dword Ptr 3[rip]
        1:
        mov rax, [m]
        fild qword Ptr [rax]
        test Byte Ptr [rax+7], 128
        jz 2f
        fadd dword Ptr 3[rip]
        2:
        fdivp st(1), st(0)
        fistp qword Ptr [Function]
        fldcw word Ptr [oldcw]
        jmp 4f
        3: .long &h5F800000
        4:
    End Asm
End Function

Dim As Ulongint r, n, m, i, x
Dim As Double t1, t2, t3, t4

n=4294967295555555ull
t1=Timer
x=0
For i=1 To 500000000
    r=div64dbl(n, i)
    x+=r
Next
t1=timer-t1
Print x
Print "div64dbl time = ";t1
Print

t2=Timer
x=0
For i=1 To 500000000
    r=div64dbl2(n, i)
    x+=r
Next
t2=timer-t2
Print x
Print "div64dbl2 time = ";t2
Print

t3=Timer
x=0
For i=1 To 500000000
    r=div64fpu(n, i)
    x+=r
Next
t3=timer-t3
Print x
Print "div64fpu time = ";t3
Print

t4=Timer
x=0
For i=1 To 500000000
    r=n\i
    x+=r
Next
t4=timer-t4
Print x
Print "FB div time = ";t4
Print "div64dbl time / FB time = ";t1/t4
Print "div64dbl2 time / FB time = ";t2/t4
Print "div64fpu time / FB time = ";t3/t4
Print
Print "Press RETURN to end"
Sleep
Last edited by srvaldez on Dec 04, 2024 0:18, edited 1 time in total.
hhr
Posts: 251
Joined: Nov 29, 2019 10:41

Re: ulongInt division using the FPU

Post by hhr »

With my computer I got this:

gcc64, double version:
88507826712810960
div64 time = 87.15632926183753

88507826712810960
FB div time = 3.782033636234701
div64 time / FB time = 23.04483186685992
------------------------
gcc64, FPU version:
88507826712810960
div64 time = 9.716709229163826

88507826712810960
FB div time = 3.781105788890272
div64 time / FB time = 2.569806234386161
------------------------
gas64, double version:
88507826712810960
div64 time = 144.0174012850039

88507826712810960
FB div time = 4.267172679072246
div64 time / FB time = 33.75007577999295
------------------------
gas64, FPU version:
88507826712810960
div64 time = 9.753187846858054

88507826712810960
FB div time = 4.470558808883652
div64 time / FB time = 2.181648483736988
srvaldez
Posts: 3543
Joined: Sep 25, 2005 21:54

Re: ulongInt division using the FPU

Post by srvaldez »

thanks hhr
I would be interested to know your PC specs, AMD or Intel and CPU frequency

Processor Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz, 3600 Mhz, 8 Core(s), 16 Logical Processor(s)

my times
double version
88507826712810960
div64 time = 9.59522890000517

88507826712810960
FB div time = 2.66232090000267
div64 time / FB time = 3.60408427849383

FPU version
88507826712810960
div64 time = 1.67033120000269

88507826712810960
FB div time = 2.67807209999592
div64 time / FB time = 0.623706583555103
hhr
Posts: 251
Joined: Nov 29, 2019 10:41

Re: ulongInt division using the FPU

Post by hhr »

Pentium Dual-Core CPU E5300 @ 2.60GHz 2.60GHz
RAM 4.00 GB
Windows7 64 Bit, 15 years old.

For control purposes:

double version
88507826712810960
div64 time = 87.42362828622572

88507826712810960
FB div time = 3.782473475323059
div64 time / FB time = 23.11281991971111

FPU version
88507826712810960
div64 time = 7.128544771927409

88507826712810960
FB div time = 3.830285650677979
div64 time / FB time = 1.861100038495986
Provoni
Posts: 520
Joined: Jan 05, 2014 12:33
Location: Belgium

Re: ulongInt division using the FPU

Post by Provoni »

amd 7840hs: -gen gcc -Wc -march=native,-Ofast,-funroll-loops

div64dbl: 7.69
div64dbl2: 3.49
div64fpu: 1.35
fbdiv: 0.69
Provoni
Posts: 520
Joined: Jan 05, 2014 12:33
Location: Belgium

Re: ulongInt division using the FPU

Post by Provoni »

srvaldez wrote: Dec 03, 2024 19:22 thanks hhr
I would be interested to know your PC specs, AMD or Intel and CPU frequency

Processor Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz, 3600 Mhz, 8 Core(s), 16 Logical Processor(s)

my times
double version
88507826712810960
div64 time = 9.59522890000517

88507826712810960
FB div time = 2.66232090000267
div64 time / FB time = 3.60408427849383

FPU version
88507826712810960
div64 time = 1.67033120000269

88507826712810960
FB div time = 2.67807209999592
div64 time / FB time = 0.623706583555103
Your fbdiv time is strangely slow for such a relatively modern CPU.
SARG
Posts: 1847
Joined: May 27, 2005 7:15
Location: FRANCE

Re: ulongInt division using the FPU

Post by SARG »

To be able to compile with gas64 change div64fpu like this

Code: Select all

Function div64fpu( Byref n As Ulongint, Byref m As Ulongint ) As Ulongint
    Static As Ushort oldcw, cw
	asm
    #if (__FB_BACKEND__ = "gas64")
		fstcw word Ptr oldcw[rip]
		mov ax, word Ptr oldcw[rip]
    #else
		fstcw word Ptr [oldcw]
		mov ax, word Ptr [oldcw]
    #endif
        Or ax, &hC99
    #if (__FB_BACKEND__ = "gas64")
        mov word Ptr cw[rip], ax
        fldcw word Ptr cw[rip]
    #else
        mov word Ptr [cw], ax
        fldcw word Ptr [cw]    
    #endif
        mov rax, [n]
        fild qword Ptr [rax]
        test Byte Ptr [rax+7], 128
        jz 1f
        fadd dword Ptr 3[rip]
        1:
        mov rax, [m]
        fild qword Ptr [rax]
        test Byte Ptr [rax+7], 128
        jz 2f
        fadd dword Ptr 3[rip]
        2:
        fdivp st(1), st(0)
        fistp qword Ptr [Function]
	#if (__FB_BACKEND__ = "gas64")
		fldcw word Ptr oldcw[rip]
	#else
		fldcw word Ptr [oldcw]
	#endif
        jmp 4f
        3: .long &h5F800000
        4:
    End Asm
End Function
srvaldez
Posts: 3543
Joined: Sep 25, 2005 21:54

Re: ulongInt division using the FPU

Post by srvaldez »

thank you SARG 👍😁
srvaldez
Posts: 3543
Joined: Sep 25, 2005 21:54

Re: ulongInt division using the FPU

Post by srvaldez »

SARG, I got off my laziness and made it 32-bit compatible following your example

Code: Select all

#ifdef __FB_WIN32__
    #ifdef __FB_64BIT__
        '64-bit
        #define REGAX rax
    #else
        '32-bit
        #define REGAX eax
    #endif
#endif

Function div64fpu( Byref n As Ulongint, Byref m As Ulongint ) As Ulongint
    Static As Ushort oldcw, cw
    asm
    #if (__FB_BACKEND__ = "gas64")
        fstcw word Ptr oldcw[rip]
        mov ax, word Ptr oldcw[rip]
    #else
        fstcw word Ptr [oldcw]
        mov ax, word Ptr [oldcw]
    #endif
        Or ax, &hC99
    #if (__FB_BACKEND__ = "gas64")
        mov word Ptr cw[rip], ax
        fldcw word Ptr cw[rip]
    #else
        mov word Ptr [cw], ax
        fldcw word Ptr [cw]    
    #endif
        mov REGAX, [n]
        fild qword Ptr [REGAX]
        test Byte Ptr [REGAX+7], 128
        jz 1f
    #ifdef __FB_64BIT__    
        fadd dword Ptr 3[rip]
    #else
        fadd dword Ptr [3]
    #endif
        1:
        mov REGAX, [m]
        fild qword Ptr [REGAX]
        test Byte Ptr [REGAX+7], 128
        jz 2f
    #ifdef __FB_64BIT__     
        fadd dword Ptr 3[rip]
    #else
        fadd dword Ptr [3]
    #endif
        2:
        fdivp st(1), st(0)
        fistp qword Ptr [Function]
    #if (__FB_BACKEND__ = "gas64")
        fldcw word Ptr oldcw[rip]
    #else
        fldcw word Ptr [oldcw]
    #endif
        jmp 4f
        3: .long &h5F800000
        4:
    End Asm
End Function
but it's disappointing that it's faster only in certain conditions
adeyblue
Posts: 338
Joined: Nov 07, 2019 20:08

Re: ulongInt division using the FPU

Post by adeyblue »

Not that I know things about this but it seems weird to me that ditching the statics for the stack and not taking the last jump is as much slower as it is on my machine.

Processor Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz, 2801 Mhz
div64fpu_orig time = 2593ms
FB div time = 3800ms
div64fpu_stack time = 2822ms
I suppose all the extra reg + offset add up, but it still seems like a large amount of time difference

Code: Select all

Function div64fpu_stack naked( Byref n As Ulongint, Byref m As Ulongint ) As Ulongint
    Asm
        push rbx '' just creating space on the stack here
        push rbp
        fstcw word Ptr [rsp]
        mov ax, word Ptr [rsp]
        or ax, &hC99
        mov word Ptr [rsp + 4], ax
        fldcw word Ptr [rsp + 4]
        fild qword Ptr [rcx]
        test Byte Ptr [rcx+7], 128
        jz 1f
        fadd dword Ptr 3[rip]
        1:
        fild qword Ptr [rdx]
        test Byte Ptr [rdx+7], 128
        jz 2f
        fadd dword Ptr 3[rip]
        2:
        fdivp st(1), st(0)
        fistp qword Ptr [rsp + 8]
        fldcw word Ptr [rsp]
        pop rcx
        pop rax
        ret
        3: .long &h5F800000
    End Asm
End Function
ETA:
OK, replacing the push and pop with add/sub does make it faster than the original. Still seems a big penalty. FB div is using AVX which ISTR had a big warmup penalty in earlier Intel processors
div64fpu_orig time = 2585ms
FB div time = 3819ms
div64fpu_stack time = 2559ms (with add/sub)
div64fpu_stackpp time = 2847ms (with push/pop)

Code: Select all

Function div64fpu_stack naked( Byref n As Ulongint, Byref m As Ulongint ) As Ulongint
    Asm
        sub rsp, 16
        fstcw word ptr[rsp]
        mov ax, word ptr[rsp]
        or ax, 0xC99
        mov word ptr[rsp+4], ax
        fldcw word ptr[rsp+4]
        fild qword ptr[rcx]
        test Byte ptr[rcx+7], 128
        jz 1f
        fadd dword ptr 3[rip]
    1:
        fild qword ptr[rdx]
        test byte ptr[rdx+7], 128
        jz 2f
        fadd dword ptr 3[rip]
    2:
        fdivp st(1), st(0)
        fistp qword ptr[rsp+8]
        fldcw word ptr[rsp]
        mov rax, [rsp + 8]
        add rsp, 16
        ret
    3:  .long &h5F800000
    End Asm
End Function
srvaldez
Posts: 3543
Joined: Sep 25, 2005 21:54

Re: ulongInt division using the FPU

Post by srvaldez »

thanks adeyblue for the code and tests, interesting timing 👍
Processor Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz, 3600 Mhz, 8 Core(s), 16 Logical Processor(s)

88507826712810960
div64fpu-orig time = 2026ms (1647ms with -O3)

88507826712810960
FB div time = 2664ms (2581ms with -O3)

88507826712810960
div64fpu_stack time = 1669ms (with add/sub) ( 1615ms with -O3)

88507826712810960
div64fpu_stack time = 2001ms (with push/pop) (1956ms with -O3)
without printing the sum, the FB div time = 0 due to optimization

just to make it easier for others to try here's the amalgamated test code

Code: Select all

#cmdline "-asm intel -arch native -gen gcc -Wc -O2"

Function div64fpu( Byref n As Ulongint, Byref m As Ulongint ) As Ulongint
    Static As Ushort oldcw, cw
    Asm
        fstcw word Ptr [oldcw]
        mov ax, word Ptr [oldcw]
        Or ax, &hC99
        mov word Ptr [cw], ax
        fldcw word Ptr [cw]
        mov rax, [n]
        fild qword Ptr [rax]
        test Byte Ptr [rax+7], 128
        jz 1f
        fadd dword Ptr 3[rip]
        1:
        mov rax, [m]
        fild qword Ptr [rax]
        test Byte Ptr [rax+7], 128
        jz 2f
        fadd dword Ptr 3[rip]
        2:
        fdivp st(1), st(0)
        fistp qword Ptr [Function]
        fldcw word Ptr [oldcw]
        jmp 4f
        3: .long &h5F800000
        4:
    End Asm
End Function

Function div64fpu_stack1 Naked( Byref n As Ulongint, Byref m As Ulongint ) As Ulongint
    Asm
        Sub rsp, 16
        fstcw word Ptr[rsp]
        mov ax, word Ptr[rsp]
        Or ax, 0xC99
        mov word Ptr[rsp+4], ax
        fldcw word Ptr[rsp+4]
        fild qword Ptr[rcx]
        test Byte Ptr[rcx+7], 128
        jz 1f
        fadd dword Ptr 3[rip]
    1:
        fild qword Ptr[rdx]
        test Byte Ptr[rdx+7], 128
        jz 2f
        fadd dword Ptr 3[rip]
    2:
        fdivp st(1), st(0)
        fistp qword Ptr[rsp+8]
        fldcw word Ptr[rsp]
        mov rax, [rsp + 8]
        Add rsp, 16
        ret
    3:  .long &h5F800000
    End Asm
End Function

Function div64fpu_stack2 Naked( Byref n As Ulongint, Byref m As Ulongint ) As Ulongint
    Asm
        Push rbx '' just creating space on the stack here
        Push rbp
        fstcw word Ptr [rsp]
        mov ax, word Ptr [rsp]
        Or ax, &hC99
        mov word Ptr [rsp + 4], ax
        fldcw word Ptr [rsp + 4]
        fild qword Ptr [rcx]
        test Byte Ptr [rcx+7], 128
        jz 1f
        fadd dword Ptr 3[rip]
        1:
        fild qword Ptr [rdx]
        test Byte Ptr [rdx+7], 128
        jz 2f
        fadd dword Ptr 3[rip]
        2:
        fdivp st(1), st(0)
        fistp qword Ptr [rsp + 8]
        fldcw word Ptr [rsp]
        Pop rcx
        Pop rax
        ret
        3: .long &h5F800000
    End Asm
End Function


Dim As Ulongint r, n, m, i, x
Dim As Double t
Dim As Long t1, t2, t3, t4

n=4294967295555555ull
t=Timer
x=0
For i=1 To 500000000
    r=div64fpu(n, i)
    x+=r
Next
t1=(timer-t)*1000
Print x
Print "div64fpu-orig time = ";t1;"ms"
Print

t=Timer
x=0
For i=1 To 500000000
    r=n\i
    x+=r
Next
t2=(timer-t)*1000
Print x
Print "FB div time = ";t2;"ms"
Print

t=Timer
x=0
For i=1 To 500000000
    r=div64fpu_stack1(n, i)
    x+=r
Next
t3=(timer-t)*1000
Print x
Print "div64fpu_stack time = ";t3;"ms (with add/sub)"
Print

t=Timer
x=0
For i=1 To 500000000
    r=div64fpu_stack2(n, i)
    x+=r
Next
t4=(timer-t)*1000
Print x
Print "div64fpu_stack time = ";t4;"ms (with push/pop) "
Print

Print "Press RETURN to end"
Sleep
dafhi
Posts: 1714
Joined: Jun 04, 2005 9:51

Re: ulongInt division using the FPU

Post by dafhi »

speedy gonzales as usual!

here's mine

Code: Select all

Intel i7 256V (Lunar Lake) 2.2GHz
*************

-Wc -Ofast
----------
88507826712810960
div64fpu-orig time =  2774ms

88507826712810960
FB div time =  1193ms

88507826712810960
div64fpu_stack time =  2778ms (with add/sub)

88507826712810960
div64fpu_stack time =  2867ms (with push/pop)

=======================================

-arch native -Wc -Ofast,-mfpmath=sse,-funroll-loops
-------------
88507826712810960
div64fpu-orig time =  5712ms

88507826712810960
FB div time =  1178ms

88507826712810960
div64fpu_stack time =  5803ms (with add/sub)

88507826712810960
div64fpu_stack time =  5734ms (with push/pop)
srvaldez
Posts: 3543
Joined: Sep 25, 2005 21:54

Re: ulongInt division using the FPU

Post by srvaldez »

ok
loading an unsigned 64-bit integer into a FPU register is tricky so here's a signed version
I would like to see what your timing is

Code: Select all

#cmdline "-asm intel -arch native -gen gcc -Wc -O3"

Function div64fpu( Byref n As Longint, Byref m As Longint ) As Longint
    Static As Ushort oldcw, cw=&hFFF
    Asm
        fstcw word Ptr [oldcw]
        fldcw word Ptr [cw]
        mov rax, [n]
        fild qword Ptr [rax]
        mov rax, [m]
        fild qword Ptr [rax]
        fdivp st(1), st(0)
        fistp qword Ptr [Function]
        fldcw word Ptr [oldcw]
    End Asm
End Function

Dim As Longint r, n, m, i, x
Dim As Double t
Dim As Long t1, t2

n=4294967295555555ull
t=Timer
x=0
For i=1 To 500000000
    r=div64fpu(n, i)
    x+=r
Next
t1=(timer-t)*1000
Print x
Print "div64fpu-orig time = ";t1;"ms"
Print

t=Timer
x=0
For i=1 To 500000000
    r=n\i
    x+=r
Next
t2=(timer-t)*1000
Print x
Print "FB div time = ";t2;"ms"
Print

If t2>t1 Then
    Print "the FPU version is ";t2/t1;" times faster than FB native"
Elseif t2<t1 Then
    Print "the FPU version is ";t1/t2;" times slower than FB native"
Else
    Print "the times of the FPU and FB versions are the same"
End If
Print "Press RETURN to end"
Sleep
Processor Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz, 3600 Mhz, 8 Core(s), 16 Logical Processor(s)
88507826712810960
div64fpu-orig time = 1393ms

88507826712810960
FB div time = 2880ms

the FPU version is 2.06748025843503 times faster than FB native
hhr
Posts: 251
Joined: Nov 29, 2019 10:41

Re: ulongInt division using the FPU

Post by hhr »

88507826712810960
div64fpu-orig time = 5921ms

88507826712810960
FB div time = 6181ms

the FPU version is 1.043911501435568 times faster than FB native
srvaldez
Posts: 3543
Joined: Sep 25, 2005 21:54

Re: ulongInt division using the FPU

Post by srvaldez »

looks like -O3 is not the best option for you hhr
Post Reply