Range only PRNG

angros47 · Post by **angros47** » May 24, 2021 22:42

jj2007 wrote:@angros47: the portability argument is ok, the rest is wishful thinking. A good assembly programmer can always beat a compiler.

Sure, they could, but they would have to write the whole program in assembly, not just a routine. Adding only few custom ASM routines in a compiled program might make things harder for automated optimizations, and as result it could make things worse, not better

deltarho[1859] · Post by **deltarho[1859]** » May 24, 2021 22:54

In the Returns change RangeN.dw_c to dw_tmp.

Code: Select all

Timings are now:
-O3
   32    64
K  82  497
F  79  258
 
-O2
   32   64
K  71  501
F  54  260

Looks like Ulongint is problematic in 32-bit mode. I have seen that before.

However, I don't know what is going on with continuous floats in 64-bit, my Returns are in the correct order.

I need a cup of tea.

jj2007 · Post by **jj2007** » May 24, 2021 23:00

angros47 wrote:
jj2007 wrote:@angros47: the portability argument is ok, the rest is wishful thinking. A good assembly programmer can always beat a compiler.
Sure, they could, but they would have to write the whole program in assembly, not just a routine. Adding only few custom ASM routines in a compiled program might make things harder for automated optimizations, and as result it could make things worse, not better

It depends. If the compiler can optimise away what a simple, small assembly routine does, the compiler may win. But a more sophisticated assembly routine cannot be optimised away so easily. Take Instr(text, match), for example. That is a complex routine, and I bet the compiler has no chance.

deltarho[1859] · Post by **deltarho[1859]** » May 24, 2021 23:18

The Type needs changing from SFC32 to SFC64.

However, that won't affect the timings. That has me foxed.

It could be that 32-bit is better at task A than task B whereas 64-bit is better at task B than task A. I have seen that before as well.

Why am I using K instead of N? Ah, because I used 'k +='.

I need some sleep, my grey matter is starting to seize up.

angros47 · Post by **angros47** » May 25, 2021 15:45

@jj2007

The fact is, deltarho tested it, and optimized code built by fbc+gcc was faster than assembly viewtopic.php?p=283050#p283050

I tried to explain why this happened. It doesn't matter which one is faster in theory, as a matter of fact, compiled code is often faster than assembly.

deltarho[1859] · Post by **deltarho[1859]** » May 25, 2021 18:00

Conclusion using gcc 8.4 at -O2

Code: Select all

   SFC32     SFC64
  32   64   32  64
N 284  477  71  501
F 576  645  55  258

N bounded integers: F continuous float

SFC64 is poor in 32-bit mode. In 64-bit mode bounded integers are fast and faster than continuous float. However, the bounded integers result is not significantly faster than SFC32.

SFC32 continuous floats are faster than bounded integers in both 32-bit mode and 64-bit mode. In 64-bit mode both bounded integers and continuous floats are faster than 32-bit mode.

So on my machine FreeBASIC is better served by SFC32. The slowest with SFC32 is bounded integers in 32-bit mode and close to PCG32II. For applications using a range requiring a fast throughput SFC32 is a good choice remembering that it passes PractRand to 2TB at least.

deltarho[1859] · Post by **deltarho[1859]** » May 25, 2021 18:28

Regarding the discussion between jj2007 and angros47 putting portability to one side if performance is an issue then both PowerBASIC and fbc+gas may benefit from some code being written in asm. With regard fbc+gcc the likelihood of benefitting from using asm is greatly reduced and in some situations may have an impact on performance.

I would never say this at the PowerBASIC forum but from a performance perspective fbc+gcc beats PowerBASIC. For those of us who are SDK challenged, PowerBASIC is a great environment for writing GUI applications.

jj2007 · Post by **jj2007** » May 25, 2021 21:22

angros47 wrote:compiled code is often faster than assembly

Compiled code can be faster than assembly. My assembly code is typically a factor 2-3 faster than the C runtime functions (which, I suppose, are produced with the best compiler available).

deltarho[1859] · Post by **deltarho[1859]** » May 25, 2021 22:41

I have hinted at looking at SFC as a general purpose generator as opposed to range only. Some range performance figures are very fast.

For [0,1) computations I am getting 0.500023 as a typical average, often with four zeros, for 10^8 requests. This is indicative of top quality distribution uniformity. We already have top quality randomness.

Speed wise I am getting exceeding 1100MHz in 32-bit mode for a '+=' timing. Doing a '=' timing sees gcc ripping out the loop. So the real generator performance is actually faster. This gives us a 1GHz plus generator.

In a race with PCG32II we'd need to use binoculars with our rearview mirror.

For those of you who have being testing SFC32.bi try this:

Code: Select all

Function SFCRnd ( Byref RangeN As SFC32 ) As Double
' Returns a double in [0,1) with 32-bit granularity
Dim As Ulong dw_tmp
RangeN.dw_counter += 1
dw_tmp = RangeN.dw_a + RangeN.dw_b + RangeN.dw_counter
RangeN.dw_a = RangeN.dw_b Xor ( RangeN.dw_b shr 9 )
RangeN.dw_b = RangeN.dw_c + ( RangeN.dw_c shl 3 )
RangeN.dw_c = rotl(RangeN.dw_c, 21) + dw_tmp
Return RangeN.dw_c/2^32
End Function

In 64-bit mode I am getting over 970MHz.

deltarho[1859] · Post by **deltarho[1859]** » May 26, 2021 11:34

It seems that using '+=' as opposed to '=' is also prolematical in timings

Consider

Code: Select all

Throughputs:
 
Rand  925MHz
Rnd  886MHz
Unbounded integers  886MHz
Continuous floats  1139MHz

Rnd is not FB's Rnd.

Rnd uses

Code: Select all

Return This.s(2)/2^32

and 'Continuous floats' uses

Code: Select all

Return This.s(2)/2^32*( Last - First ) + First

How on earth can 'Continuous floats' be faster than Rnd?

We need a way to stop gcc messing about with this:

Code: Select all

For i = 1 to 10^8
  l = f(x)
Next

For some reason it also messes with

Code: Select all

For i = 1 to 10^8
  l += f(x)
Next

I have tried 'Asm nop' in 32-bit mode before the '+=' and the throughputs drop dramatically.

@coderJeff

Would it be much trouble to extend #pragma so that we can do this?

Code: Select all

#pragma GCC push_options
#pragma GCC optimize("O0")
For i = 1 to 10^8
  l = f(x)
Next
#pragma GCC pop_options

The loop will not get optimized.

It would also be handy for bracketing Asm blocks. Who knows, maybe some of our Asm blocks do not run as fast as we thought they would because gcc is sticking it's nose in.

srvaldez · Post by **srvaldez** » May 26, 2021 12:20

hi deltarho[1859]
I suggest that you print the sum after the timing loop, gcc may figure that since the value was not used that it's ok to eliminate the loop

deltarho[1859] · Post by **deltarho[1859]** » May 26, 2021 15:05

srvaldez wrote:I suggest that you print the sum after the timing loop

That is really sneaky. Messing about with a loop such that it altered the total when we wanted the total printed would be a major bug.

It may well be after a few years playing with PRNGs that I am seeing the real timings.

Thanks srvaldez.

OK, assuming that the '+=' method does not impact much on timings and we now have a solution to the timing problem this is what I get comparing SFC32 with PCG32II; both of them using '+=' and printing the totals.

Code: Select all

SFC32
     32   64
Rnd  684  731
RndD 141  109
N    555  909
F    574  647
 
PCG32II
     32   64
Rnd  245  550
RndD 125  125
N    255  555
F    233  482

We no longer get 'Continuous floats' being faster than Rnd ( Not FB's Rnd remember )

The relationship between all the metrics seems reasonable now. RndD is a double with 53-bit granularity using two Ulong outputs.

The 'breaking news' is that nearly every single metric of SFC32 is faster than PCG32II.

Over the last couple of days I have been looking at the algorithms PCG32II, Squares, MsWs, xoshiro256 and xoroshiro128 and they all use a mix of +, xor, shl, shr and rotl. SFC32 does less work than any of them so it is no surprise that it is fast. However, when looking at the SFC32 algorithm I am left thinking that it does not deserve to pass PractRand to 2TB with only one small anomaly but it does.

I have written code for SFC32 in the same vein as PCG32II so that it is easy to use. However, I want to kick it around for a while before publishing.

It still hasn't sunk in yet that PCG32II has been pushed off its perch. Having said that, as mentioned above, this additional throughput of SFC32 may not be noticeable in most of our applications; although I should imagine some applications need the fastest generator than they can get their hands on.

Added:

The PCG32II timings are the median of nine tests. The SFC32 timings are single runs and are all over the place, so I need to write some code and time as with the PCG32II figures.

Added:

The SFC32 timings are now the median of nine tests. Comments above have been edited, but the overall conclusion has not changed.

jj2007 · Post by **jj2007** » May 26, 2021 15:29

deltarho[1859] wrote:It still hasn't sunk in yet that PCG32II has been pushed off its perch.

The competition is strong.

deltarho[1859] · Post by **deltarho[1859]** » May 26, 2021 15:49

Melissa O'Neill's PCG family and Bernard Widynski's Middle Square Weyl Sequence and Squares don't get a look in.

deltarho[1859] · Post by **deltarho[1859]** » May 26, 2021 18:33

I have decided to publish 'warts and all' because I have another problem.

In the Usage example code we have:

Code: Select all

Print "Accuracy over 10^8 bounded integers ( -10, 10 ) Theoretical result = 0" : Print

The result is usually close to zero but now and then I get something like this '184467440737.0945'. This only happens when First < 0. I am certain that I have seen this before. If we adjust the '-10, 10' to '0, 20', do the calculation we should get close to 10. Correcting back we should get close to zero, but I still get that barmy figure.

How can the same code with the same task get it right most of the time and, occasionally, throw a 'wobbly''?
Added: The average should be the same but the 10^8 values used will be different.

Here is the new SFC32.bi which is now a full-blown generator similar to PCG32II. The function SFC32.RangeN includes an attempt to avoid issues when First < 0.

Following is a new Usage example program covering the new features of the new SFC32.bi.

If it was not for the barmy figure this would be a finished generator for our toolbox. Grrrr!

SFC32.bi

Code: Select all

' Generator code adapted from code by Daniel Penney at the PowerBASIC forums - thanks Daniel
 
#Include Once "windows.bi"
#Inclib "bcrypt"
#Include Once "win/wincrypt.bi"
 
#define rotl(x,k) ( (x Shl k) Or (x Shr(32-k)) ) ' Note the extra parentheses
 
Type SFC32
  Declare Constructor
  Declare Sub Initialize()
  Declare Function Rand() As Ulong
  Declare Function Rnd() As Double ' 32-bit granularity
  Declare Function RndD() As Double ' 53-bit granularity
  Declare Function RangeN( Byval One As Long, Byval Two As Long ) As Long
  Declare Function RangeF( Byval One As Long, Byval Two As Long ) as Double
  Declare Sub GetSnapshot()
  Declare Sub SetSnapShot()
  As Ulong s(0 To 3)
  As Ulong snaps(0 to 3)
End Type
 
Function SFC32.Rand( ) As Ulong
Dim As Ulong dw_tmp
This.s(3) += 1
dw_tmp = This.s(0) + This.s(1) + This.s(3)
This.s(0) = This.s(1) Xor ( This.s(1) shr 9 )
This.s(1) = This.s(2) + ( This.s(2) shl 3 )
This.s(2) = rotl(This.s(2), 21) + dw_tmp
Return This.s(2)
End Function
 
Function SFC32.Rnd( ) As Double
' 32-bit granularity
Dim As Ulong dw_tmp
This.s(3) += 1
dw_tmp = This.s(0) + This.s(1) + This.s(3)
This.s(0) = This.s(1) Xor ( This.s(1) shr 9 )
This.s(1) = This.s(2) + ( This.s(2) shl 3 )
This.s(2) = rotl(This.s(2), 21) + dw_tmp
Return This.s(2)/2^32
End Function
 
Function SFC32.RndD( ) As Double
' 53-bit granularity
Dim As Ulong dw_tmp, TempVar1, TempVar2
This.s(3) += 1
dw_tmp = This.s(0) + This.s(1) + This.s(3)
This.s(0) = This.s(1) Xor ( This.s(1) shr 9 )
This.s(1) = This.s(2) + ( This.s(2) shl 3 )
This.s(2) = rotl(This.s(2), 21) + dw_tmp
TempVar1 = This.s(2)
This.s(3) += 1
dw_tmp = This.s(0) + This.s(1) + This.s(3)
This.s(0) = This.s(1) Xor ( This.s(1) shr 9 )
This.s(1) = This.s(2) + ( This.s(2) shl 3 )
This.s(2) = rotl(This.s(2), 21) + dw_tmp
TempVar2 = This.s(2)
Asm
  mov eax, dword ptr [TempVar1]
  movd xmm0, eax
  mov eax, dword ptr [TempVar2]
  movd xmm1, eax
  punpckldq xmm0, xmm1
  psrlq xmm0, 12
  mov eax, 1
  cvtsi2sd xmm1, eax
  por xmm0, xmm1
  subsd xmm0, xmm1
  movq [Function], xmm0
End Asm
End Function
 
Function SFC32.RangeN ( ByVal First As Long, ByVal Last As long ) As Long
Dim As Ulong dw_tmp, Adj
Dim As Long Tempvar
This.s(3) += 1
dw_tmp = This.s(0) + This.s(1) + This.s(3)
This.s(0) = This.s(1) Xor ( This.s(1) shr 9 )
This.s(1) = This.s(2) + ( This.s(2) shl 3 )
This.s(2) = rotl(This.s(2), 21) + dw_tmp
If First < 0 then
  Adj = -First
  First = 0 : Last = Last + Adj
  Return Clng( This.s(2) Mod ( Last + 1) ) - Adj
Else
  Return Clng( This.s(2) Mod ( Last - First + 1) ) + First ' By dodicat
End If
End Function
 
Function SFC32.RangeF ( ByVal First As Long, ByVal Last As long ) As Double
Dim As Ulong dw_tmp
This.s(3) += 1
dw_tmp = This.s(0) + This.s(1) + This.s(3)
This.s(0) = This.s(1) Xor ( This.s(1) shr 9 )
This.s(1) = This.s(2)  + ( This.s(2) shl 3 )
This.s(2) = rotl(This.s(2), 21) + dw_tmp
Return This.s(2)/2^32*( Last - First ) + First
End Function
 
Sub SFC32.GetSnapshot ( )
  For i As Ulong = 0 to 3
    This.snaps(i) = This.s(i)
  Next
End Sub
 
Sub SFC32.SetSnapshot ( )
  For i As Ulong = 0 To 3
    This.s(i)= This.snaps(i)
  Next
End Sub
 
Sub SFC32.Initialize( )
Dim hRand As BCRYPT_ALG_HANDLE Ptr
  BCryptOpenALGOrithmProvider( @hRand, BCRYPT_RNG_ALGORITHM, 0, 0 )
  BCryptGenRandom( hRand, cast(any ptr, @This.s(0) ), 12, 0) ' dw_a, dw_b and dw_c
  BCryptCloseALGOrithmProvider( hRand, 0  )
  ' Warm up
  This.s(3) = 1
  For i As Ulong = 1 To 100
    This.RangeN( 1, 10 )
  Next
  This.GetSnapshot( )
End Sub
 
' Each time we define a generator it gets initialized automatically and a snapshot is taken.
Constructor SFC32
  This.Initialize
  This.GetSnapshot
End Constructor

Usage example (At the end in the Throughputs section the totals are printed to stop gcc messing about with our timing loop as per srvaldez's tip.)

Code: Select all

#Include Once "SFC32.bi"
 
Dim SFC0 As SFC32
 
SFC0.Initialize( )
 
Dim As Ulong i
Dim As Ulongint k
Dim As Double tot, t
 
Print "6 Random Ulongs" : Print
For i = 1 to 6
  Print SFC0.Rand
next
Print
Print "6 Random floats ( 32-bit granularity )" : Print
For i = 1 To 6
  Print SFC0.Rnd
next
Print
Print "6 Random floats ( 53-bit granularity )" : Print
For i = 1 To 6
  Print SFC0.RndD
next
Print
Print "Random initial stage" : Print
For i = 1 to 4
  Print SFC0.Rnd
Next
Print
SFC0.SetSnapshot
Print "Random second stage" : Print
For i = 1 to 6
  Print SFC0.Rnd
Next
Print
SFC0.SetSnapshot
Print "Repeat second stage" : Print
For i = 1 to 6
  Print SFC0.Rnd
Next
Print
Print "Accuracy over 10^8 bounded integers ( 0, 255 ) Theoretical result = 127.5" : Print
For i = 1 to 10^8
  k += SFC0.RangeN( 0, 255 )
Next
Print k/10^8
Print
Print "Accuracy over 10^8 continuous float ( 0, 255 ) Theoretical result = 127.5" : Print
For i = 1 to 10^8
  tot += SFC0.RangeF( 0, 255 )
Next
Print tot/10^8
k = 0 : tot = 0
Print
Print "Accuracy over 10^8 bounded integers ( -10, 10 ) Theoretical result = 0" : Print
For i = 1 to 10^8
  k += SFC0.RangeN( -10, 10 )
Next
Print k/10^8
Print
Print "Accuracy over 10^8 continuous floats ( -10, 10 ) Theoretical result = 0" : Print
For i = 1 to 10^8
  tot += SFC0.RangeF( -10, 10 )
Next
Print tot/10^8
k = 0 : tot =0
Print
Print "Throughputs:" : Print
t = Timer
For i = 1 to 10^8
  k += SFC0.Rand
Next
t = Timer - t
Print k;" ";
Print "Rand ";Int(100/t);"MHz"
t = Timer
For i = 1 to 10^8
  tot += SFC0.Rnd
Next
t = Timer - t
Print tot;" ";
Print "Rnd ";Int(100/t);"MHz"
tot = 0
t = Timer
For i = 1 to 10^8
  tot += SFC0.RndD
Next
t = Timer - t
Print tot;" ";
Print "RndD ";Int(100/t);"MHz"
k = 0 : tot = 0
t = Timer
For i = 1 to 10^8
  k += SFC0.RangeN( 0, 255)
Next
t = Timer - t
Print k;" ";
Print "Unbounded integers ";Int(100/t);"MHz"
t = Timer
For i = 1 to 10^8
  tot += SFC0.RangeF( 0, 255 )
Next
t = Timer - t
Print tot;" ";
Print "Continuous floats ";Int(100/t);"MHz"
 
Sleep

Range only PRNG

Re: Range only PRNG

Re: Range only PRNG

Re: Range only PRNG

Re: Range only PRNG

Re: Range only PRNG

Re: Range only PRNG

Re: Range only PRNG

Re: Range only PRNG

Re: Range only PRNG

Re: Range only PRNG

Re: Range only PRNG

Re: Range only PRNG

Re: Range only PRNG

Re: Range only PRNG

Re: Range only PRNG