Sure, they could, but they would have to write the whole program in assembly, not just a routine. Adding only few custom ASM routines in a compiled program might make things harder for automated optimizations, and as result it could make things worse, not betterjj2007 wrote:@angros47: the portability argument is ok, the rest is wishful thinking. A good assembly programmer can always beat a compiler.
Range only PRNG
Re: Range only PRNG
-
- Posts: 4310
- Joined: Jan 02, 2017 0:34
- Location: UK
- Contact:
Re: Range only PRNG
In the Returns change RangeN.dw_c to dw_tmp.
Looks like Ulongint is problematic in 32-bit mode. I have seen that before.
However, I don't know what is going on with continuous floats in 64-bit, my Returns are in the correct order.
I need a cup of tea.
Code: Select all
Timings are now:
-O3
32 64
K 82 497
F 79 258
-O2
32 64
K 71 501
F 54 260
However, I don't know what is going on with continuous floats in 64-bit, my Returns are in the correct order.
I need a cup of tea.
Last edited by deltarho[1859] on May 24, 2021 23:09, edited 1 time in total.
Re: Range only PRNG
It depends. If the compiler can optimise away what a simple, small assembly routine does, the compiler may win. But a more sophisticated assembly routine cannot be optimised away so easily. Take Instr(text, match), for example. That is a complex routine, and I bet the compiler has no chance.angros47 wrote:Sure, they could, but they would have to write the whole program in assembly, not just a routine. Adding only few custom ASM routines in a compiled program might make things harder for automated optimizations, and as result it could make things worse, not betterjj2007 wrote:@angros47: the portability argument is ok, the rest is wishful thinking. A good assembly programmer can always beat a compiler.
-
- Posts: 4310
- Joined: Jan 02, 2017 0:34
- Location: UK
- Contact:
Re: Range only PRNG
The Type needs changing from SFC32 to SFC64.
However, that won't affect the timings. That has me foxed.
It could be that 32-bit is better at task A than task B whereas 64-bit is better at task B than task A. I have seen that before as well.
Why am I using K instead of N? Ah, because I used 'k +='.
I need some sleep, my grey matter is starting to seize up.
However, that won't affect the timings. That has me foxed.
It could be that 32-bit is better at task A than task B whereas 64-bit is better at task B than task A. I have seen that before as well.
Why am I using K instead of N? Ah, because I used 'k +='.
I need some sleep, my grey matter is starting to seize up.
Re: Range only PRNG
@jj2007
The fact is, deltarho tested it, and optimized code built by fbc+gcc was faster than assembly viewtopic.php?p=283050#p283050
I tried to explain why this happened. It doesn't matter which one is faster in theory, as a matter of fact, compiled code is often faster than assembly.
The fact is, deltarho tested it, and optimized code built by fbc+gcc was faster than assembly viewtopic.php?p=283050#p283050
I tried to explain why this happened. It doesn't matter which one is faster in theory, as a matter of fact, compiled code is often faster than assembly.
-
- Posts: 4310
- Joined: Jan 02, 2017 0:34
- Location: UK
- Contact:
Re: Range only PRNG
Conclusion using gcc 8.4 at -O2
N bounded integers: F continuous float
SFC64 is poor in 32-bit mode. In 64-bit mode bounded integers are fast and faster than continuous float. However, the bounded integers result is not significantly faster than SFC32.
SFC32 continuous floats are faster than bounded integers in both 32-bit mode and 64-bit mode. In 64-bit mode both bounded integers and continuous floats are faster than 32-bit mode.
So on my machine FreeBASIC is better served by SFC32. The slowest with SFC32 is bounded integers in 32-bit mode and close to PCG32II. For applications using a range requiring a fast throughput SFC32 is a good choice remembering that it passes PractRand to 2TB at least.
Code: Select all
SFC32 SFC64
32 64 32 64
N 284 477 71 501
F 576 645 55 258
SFC64 is poor in 32-bit mode. In 64-bit mode bounded integers are fast and faster than continuous float. However, the bounded integers result is not significantly faster than SFC32.
SFC32 continuous floats are faster than bounded integers in both 32-bit mode and 64-bit mode. In 64-bit mode both bounded integers and continuous floats are faster than 32-bit mode.
So on my machine FreeBASIC is better served by SFC32. The slowest with SFC32 is bounded integers in 32-bit mode and close to PCG32II. For applications using a range requiring a fast throughput SFC32 is a good choice remembering that it passes PractRand to 2TB at least.
Last edited by deltarho[1859] on May 25, 2021 18:55, edited 1 time in total.
-
- Posts: 4310
- Joined: Jan 02, 2017 0:34
- Location: UK
- Contact:
Re: Range only PRNG
Regarding the discussion between jj2007 and angros47 putting portability to one side if performance is an issue then both PowerBASIC and fbc+gas may benefit from some code being written in asm. With regard fbc+gcc the likelihood of benefitting from using asm is greatly reduced and in some situations may have an impact on performance.
I would never say this at the PowerBASIC forum but from a performance perspective fbc+gcc beats PowerBASIC. For those of us who are SDK challenged, PowerBASIC is a great environment for writing GUI applications.
I would never say this at the PowerBASIC forum but from a performance perspective fbc+gcc beats PowerBASIC. For those of us who are SDK challenged, PowerBASIC is a great environment for writing GUI applications.
Re: Range only PRNG
Compiled code can be faster than assembly. My assembly code is typically a factor 2-3 faster than the C runtime functions (which, I suppose, are produced with the best compiler available).angros47 wrote:compiled code is often faster than assembly
-
- Posts: 4310
- Joined: Jan 02, 2017 0:34
- Location: UK
- Contact:
Re: Range only PRNG
I have hinted at looking at SFC as a general purpose generator as opposed to range only. Some range performance figures are very fast.
For [0,1) computations I am getting 0.500023 as a typical average, often with four zeros, for 10^8 requests. This is indicative of top quality distribution uniformity. We already have top quality randomness.
Speed wise I am getting exceeding 1100MHz in 32-bit mode for a '+=' timing. Doing a '=' timing sees gcc ripping out the loop. So the real generator performance is actually faster. This gives us a 1GHz plus generator.
In a race with PCG32II we'd need to use binoculars with our rearview mirror.
For those of you who have being testing SFC32.bi try this:
In 64-bit mode I am getting over 970MHz.
For [0,1) computations I am getting 0.500023 as a typical average, often with four zeros, for 10^8 requests. This is indicative of top quality distribution uniformity. We already have top quality randomness.
Speed wise I am getting exceeding 1100MHz in 32-bit mode for a '+=' timing. Doing a '=' timing sees gcc ripping out the loop. So the real generator performance is actually faster. This gives us a 1GHz plus generator.
In a race with PCG32II we'd need to use binoculars with our rearview mirror.
For those of you who have being testing SFC32.bi try this:
Code: Select all
Function SFCRnd ( Byref RangeN As SFC32 ) As Double
' Returns a double in [0,1) with 32-bit granularity
Dim As Ulong dw_tmp
RangeN.dw_counter += 1
dw_tmp = RangeN.dw_a + RangeN.dw_b + RangeN.dw_counter
RangeN.dw_a = RangeN.dw_b Xor ( RangeN.dw_b shr 9 )
RangeN.dw_b = RangeN.dw_c + ( RangeN.dw_c shl 3 )
RangeN.dw_c = rotl(RangeN.dw_c, 21) + dw_tmp
Return RangeN.dw_c/2^32
End Function
-
- Posts: 4310
- Joined: Jan 02, 2017 0:34
- Location: UK
- Contact:
Re: Range only PRNG
It seems that using '+=' as opposed to '=' is also prolematical in timings
Consider
Rnd is not FB's Rnd.
Rnd uses
and 'Continuous floats' uses
How on earth can 'Continuous floats' be faster than Rnd?
We need a way to stop gcc messing about with this:
For some reason it also messes with
I have tried 'Asm nop' in 32-bit mode before the '+=' and the throughputs drop dramatically.
@coderJeff
Would it be much trouble to extend #pragma so that we can do this?
The loop will not get optimized.
It would also be handy for bracketing Asm blocks. Who knows, maybe some of our Asm blocks do not run as fast as we thought they would because gcc is sticking it's nose in.
Consider
Code: Select all
Throughputs:
Rand 925MHz
Rnd 886MHz
Unbounded integers 886MHz
Continuous floats 1139MHz
Rnd uses
Code: Select all
Return This.s(2)/2^32
Code: Select all
Return This.s(2)/2^32*( Last - First ) + First
We need a way to stop gcc messing about with this:
Code: Select all
For i = 1 to 10^8
l = f(x)
Next
Code: Select all
For i = 1 to 10^8
l += f(x)
Next
@coderJeff
Would it be much trouble to extend #pragma so that we can do this?
Code: Select all
#pragma GCC push_options
#pragma GCC optimize("O0")
For i = 1 to 10^8
l = f(x)
Next
#pragma GCC pop_options
It would also be handy for bracketing Asm blocks. Who knows, maybe some of our Asm blocks do not run as fast as we thought they would because gcc is sticking it's nose in.
Re: Range only PRNG
hi deltarho[1859]
I suggest that you print the sum after the timing loop, gcc may figure that since the value was not used that it's ok to eliminate the loop
I suggest that you print the sum after the timing loop, gcc may figure that since the value was not used that it's ok to eliminate the loop
-
- Posts: 4310
- Joined: Jan 02, 2017 0:34
- Location: UK
- Contact:
Re: Range only PRNG
That is really sneaky. Messing about with a loop such that it altered the total when we wanted the total printed would be a major bug.srvaldez wrote:I suggest that you print the sum after the timing loop
It may well be after a few years playing with PRNGs that I am seeing the real timings.
Thanks srvaldez.
OK, assuming that the '+=' method does not impact much on timings and we now have a solution to the timing problem this is what I get comparing SFC32 with PCG32II; both of them using '+=' and printing the totals.
Code: Select all
SFC32
32 64
Rnd 684 731
RndD 141 109
N 555 909
F 574 647
PCG32II
32 64
Rnd 245 550
RndD 125 125
N 255 555
F 233 482
The relationship between all the metrics seems reasonable now. RndD is a double with 53-bit granularity using two Ulong outputs.
The 'breaking news' is that nearly every single metric of SFC32 is faster than PCG32II.
Over the last couple of days I have been looking at the algorithms PCG32II, Squares, MsWs, xoshiro256 and xoroshiro128 and they all use a mix of +, xor, shl, shr and rotl. SFC32 does less work than any of them so it is no surprise that it is fast. However, when looking at the SFC32 algorithm I am left thinking that it does not deserve to pass PractRand to 2TB with only one small anomaly but it does.
I have written code for SFC32 in the same vein as PCG32II so that it is easy to use. However, I want to kick it around for a while before publishing.
It still hasn't sunk in yet that PCG32II has been pushed off its perch. Having said that, as mentioned above, this additional throughput of SFC32 may not be noticeable in most of our applications; although I should imagine some applications need the fastest generator than they can get their hands on.
Added:
The PCG32II timings are the median of nine tests. The SFC32 timings are single runs and are all over the place, so I need to write some code and time as with the PCG32II figures.
Added:
The SFC32 timings are now the median of nine tests. Comments above have been edited, but the overall conclusion has not changed.
Last edited by deltarho[1859] on May 26, 2021 16:32, edited 2 times in total.
Re: Range only PRNG
The competition is strong.deltarho[1859] wrote:It still hasn't sunk in yet that PCG32II has been pushed off its perch.
-
- Posts: 4310
- Joined: Jan 02, 2017 0:34
- Location: UK
- Contact:
Re: Range only PRNG
Melissa O'Neill's PCG family and Bernard Widynski's Middle Square Weyl Sequence and Squares don't get a look in.
-
- Posts: 4310
- Joined: Jan 02, 2017 0:34
- Location: UK
- Contact:
Re: Range only PRNG
I have decided to publish 'warts and all' because I have another problem.
In the Usage example code we have:
The result is usually close to zero but now and then I get something like this '184467440737.0945'. This only happens when First < 0. I am certain that I have seen this before. If we adjust the '-10, 10' to '0, 20', do the calculation we should get close to 10. Correcting back we should get close to zero, but I still get that barmy figure.
How can the same code with the same task get it right most of the time and, occasionally, throw a 'wobbly''?
Added: The average should be the same but the 10^8 values used will be different.
Here is the new SFC32.bi which is now a full-blown generator similar to PCG32II. The function SFC32.RangeN includes an attempt to avoid issues when First < 0.
Following is a new Usage example program covering the new features of the new SFC32.bi.
If it was not for the barmy figure this would be a finished generator for our toolbox. Grrrr!
SFC32.bi
Usage example (At the end in the Throughputs section the totals are printed to stop gcc messing about with our timing loop as per srvaldez's tip.)
In the Usage example code we have:
Code: Select all
Print "Accuracy over 10^8 bounded integers ( -10, 10 ) Theoretical result = 0" : Print
How can the same code with the same task get it right most of the time and, occasionally, throw a 'wobbly''?
Added: The average should be the same but the 10^8 values used will be different.
Here is the new SFC32.bi which is now a full-blown generator similar to PCG32II. The function SFC32.RangeN includes an attempt to avoid issues when First < 0.
Following is a new Usage example program covering the new features of the new SFC32.bi.
If it was not for the barmy figure this would be a finished generator for our toolbox. Grrrr!
SFC32.bi
Code: Select all
' Generator code adapted from code by Daniel Penney at the PowerBASIC forums - thanks Daniel
#Include Once "windows.bi"
#Inclib "bcrypt"
#Include Once "win/wincrypt.bi"
#define rotl(x,k) ( (x Shl k) Or (x Shr(32-k)) ) ' Note the extra parentheses
Type SFC32
Declare Constructor
Declare Sub Initialize()
Declare Function Rand() As Ulong
Declare Function Rnd() As Double ' 32-bit granularity
Declare Function RndD() As Double ' 53-bit granularity
Declare Function RangeN( Byval One As Long, Byval Two As Long ) As Long
Declare Function RangeF( Byval One As Long, Byval Two As Long ) as Double
Declare Sub GetSnapshot()
Declare Sub SetSnapShot()
As Ulong s(0 To 3)
As Ulong snaps(0 to 3)
End Type
Function SFC32.Rand( ) As Ulong
Dim As Ulong dw_tmp
This.s(3) += 1
dw_tmp = This.s(0) + This.s(1) + This.s(3)
This.s(0) = This.s(1) Xor ( This.s(1) shr 9 )
This.s(1) = This.s(2) + ( This.s(2) shl 3 )
This.s(2) = rotl(This.s(2), 21) + dw_tmp
Return This.s(2)
End Function
Function SFC32.Rnd( ) As Double
' 32-bit granularity
Dim As Ulong dw_tmp
This.s(3) += 1
dw_tmp = This.s(0) + This.s(1) + This.s(3)
This.s(0) = This.s(1) Xor ( This.s(1) shr 9 )
This.s(1) = This.s(2) + ( This.s(2) shl 3 )
This.s(2) = rotl(This.s(2), 21) + dw_tmp
Return This.s(2)/2^32
End Function
Function SFC32.RndD( ) As Double
' 53-bit granularity
Dim As Ulong dw_tmp, TempVar1, TempVar2
This.s(3) += 1
dw_tmp = This.s(0) + This.s(1) + This.s(3)
This.s(0) = This.s(1) Xor ( This.s(1) shr 9 )
This.s(1) = This.s(2) + ( This.s(2) shl 3 )
This.s(2) = rotl(This.s(2), 21) + dw_tmp
TempVar1 = This.s(2)
This.s(3) += 1
dw_tmp = This.s(0) + This.s(1) + This.s(3)
This.s(0) = This.s(1) Xor ( This.s(1) shr 9 )
This.s(1) = This.s(2) + ( This.s(2) shl 3 )
This.s(2) = rotl(This.s(2), 21) + dw_tmp
TempVar2 = This.s(2)
Asm
mov eax, dword ptr [TempVar1]
movd xmm0, eax
mov eax, dword ptr [TempVar2]
movd xmm1, eax
punpckldq xmm0, xmm1
psrlq xmm0, 12
mov eax, 1
cvtsi2sd xmm1, eax
por xmm0, xmm1
subsd xmm0, xmm1
movq [Function], xmm0
End Asm
End Function
Function SFC32.RangeN ( ByVal First As Long, ByVal Last As long ) As Long
Dim As Ulong dw_tmp, Adj
Dim As Long Tempvar
This.s(3) += 1
dw_tmp = This.s(0) + This.s(1) + This.s(3)
This.s(0) = This.s(1) Xor ( This.s(1) shr 9 )
This.s(1) = This.s(2) + ( This.s(2) shl 3 )
This.s(2) = rotl(This.s(2), 21) + dw_tmp
If First < 0 then
Adj = -First
First = 0 : Last = Last + Adj
Return Clng( This.s(2) Mod ( Last + 1) ) - Adj
Else
Return Clng( This.s(2) Mod ( Last - First + 1) ) + First ' By dodicat
End If
End Function
Function SFC32.RangeF ( ByVal First As Long, ByVal Last As long ) As Double
Dim As Ulong dw_tmp
This.s(3) += 1
dw_tmp = This.s(0) + This.s(1) + This.s(3)
This.s(0) = This.s(1) Xor ( This.s(1) shr 9 )
This.s(1) = This.s(2) + ( This.s(2) shl 3 )
This.s(2) = rotl(This.s(2), 21) + dw_tmp
Return This.s(2)/2^32*( Last - First ) + First
End Function
Sub SFC32.GetSnapshot ( )
For i As Ulong = 0 to 3
This.snaps(i) = This.s(i)
Next
End Sub
Sub SFC32.SetSnapshot ( )
For i As Ulong = 0 To 3
This.s(i)= This.snaps(i)
Next
End Sub
Sub SFC32.Initialize( )
Dim hRand As BCRYPT_ALG_HANDLE Ptr
BCryptOpenALGOrithmProvider( @hRand, BCRYPT_RNG_ALGORITHM, 0, 0 )
BCryptGenRandom( hRand, cast(any ptr, @This.s(0) ), 12, 0) ' dw_a, dw_b and dw_c
BCryptCloseALGOrithmProvider( hRand, 0 )
' Warm up
This.s(3) = 1
For i As Ulong = 1 To 100
This.RangeN( 1, 10 )
Next
This.GetSnapshot( )
End Sub
' Each time we define a generator it gets initialized automatically and a snapshot is taken.
Constructor SFC32
This.Initialize
This.GetSnapshot
End Constructor
Code: Select all
#Include Once "SFC32.bi"
Dim SFC0 As SFC32
SFC0.Initialize( )
Dim As Ulong i
Dim As Ulongint k
Dim As Double tot, t
Print "6 Random Ulongs" : Print
For i = 1 to 6
Print SFC0.Rand
next
Print
Print "6 Random floats ( 32-bit granularity )" : Print
For i = 1 To 6
Print SFC0.Rnd
next
Print
Print "6 Random floats ( 53-bit granularity )" : Print
For i = 1 To 6
Print SFC0.RndD
next
Print
Print "Random initial stage" : Print
For i = 1 to 4
Print SFC0.Rnd
Next
Print
SFC0.SetSnapshot
Print "Random second stage" : Print
For i = 1 to 6
Print SFC0.Rnd
Next
Print
SFC0.SetSnapshot
Print "Repeat second stage" : Print
For i = 1 to 6
Print SFC0.Rnd
Next
Print
Print "Accuracy over 10^8 bounded integers ( 0, 255 ) Theoretical result = 127.5" : Print
For i = 1 to 10^8
k += SFC0.RangeN( 0, 255 )
Next
Print k/10^8
Print
Print "Accuracy over 10^8 continuous float ( 0, 255 ) Theoretical result = 127.5" : Print
For i = 1 to 10^8
tot += SFC0.RangeF( 0, 255 )
Next
Print tot/10^8
k = 0 : tot = 0
Print
Print "Accuracy over 10^8 bounded integers ( -10, 10 ) Theoretical result = 0" : Print
For i = 1 to 10^8
k += SFC0.RangeN( -10, 10 )
Next
Print k/10^8
Print
Print "Accuracy over 10^8 continuous floats ( -10, 10 ) Theoretical result = 0" : Print
For i = 1 to 10^8
tot += SFC0.RangeF( -10, 10 )
Next
Print tot/10^8
k = 0 : tot =0
Print
Print "Throughputs:" : Print
t = Timer
For i = 1 to 10^8
k += SFC0.Rand
Next
t = Timer - t
Print k;" ";
Print "Rand ";Int(100/t);"MHz"
t = Timer
For i = 1 to 10^8
tot += SFC0.Rnd
Next
t = Timer - t
Print tot;" ";
Print "Rnd ";Int(100/t);"MHz"
tot = 0
t = Timer
For i = 1 to 10^8
tot += SFC0.RndD
Next
t = Timer - t
Print tot;" ";
Print "RndD ";Int(100/t);"MHz"
k = 0 : tot = 0
t = Timer
For i = 1 to 10^8
k += SFC0.RangeN( 0, 255)
Next
t = Timer - t
Print k;" ";
Print "Unbounded integers ";Int(100/t);"MHz"
t = Timer
For i = 1 to 10^8
tot += SFC0.RangeF( 0, 255 )
Next
t = Timer - t
Print tot;" ";
Print "Continuous floats ";Int(100/t);"MHz"
Sleep