A fast CPRNG

Windows specific questions.
deltarho[1859]
Posts: 4310
Joined: Jan 02, 2017 0:34
Location: UK
Contact:

Re: A fast CPRNG

Post by deltarho[1859] »

With regard different versions of Windows:

Windows 95 and above: Can use CryptGenRandom
Windows Vista and above: Can use BCryptGenRandom ( as posted )

As we progress from Windows 95, Microsoft stepped up and used the latest recommendation from the NIST with regard what the random number generator is based upon. We used the same API but got a better quality generator.

For XP and above we can avoid invoking a Cryptographic Service Provider and use RtlGenRandom. Microsoft says "It may be altered or unavailable in subsequent versions." I doubt that, Microsoft uses it in two of their flagship products and a Microsoft insider said that it would not happen, a guy at Stack Overflow wrote to Microsoft.

The above was ported from PowerBASIC where I have three versions: RtlGenRandom, BCryptGenRandom and Intel RdRand.

I regard the Intel RdRand version as the ultimate random number generator after quantum random number generators as used at https://qrng.physik.hu-berlin.de, for example. Some have suggested that the NSA have put a back door on Intel RdRand. I don't believe that. There was a very strong rumour that the dual elliptic curve random-number generator algorithm did have a NSA back door. Microsoft pulled it in Windows 10 - it was in Windows 8, and, perhaps, before.
deltarho[1859]
Posts: 4310
Joined: Jan 02, 2017 0:34
Location: UK
Contact:

Re: A fast CPRNG

Post by deltarho[1859] »

With CryptoS, and others, being a flat distribution any single value within [0,1) has the same likelihood of being returned. With Gaussian random numbers values about the peak are more likely to be returned than values in the tails; of the 'bell curve'.

In fact, 95% of values returned should fall within -1.96 and 1.96; got from the 'normal distribution' tables for 95%.

The following returns estimated Gaussian random numbers.

Added to CryptoRndBufferQPC.inc

Code: Select all

Private Function Gauss As Single
Static As Long u2_cached
Static As Single u1, u2, x1, x2, w
 
  If u2_cached = -1 Then
    u2_cached = 0
    Function = u2
  Else
    Do
      x1 = CryptoSX ' Note SX and not S
      x2 = CryptoSX ' Note SX and not S
      w = x1 * x1 + x2 * x2
    Loop While w >= 1
    w = Sqr( -2 * Log(w)/w )
    u1 = x1 * w
    u2 = x2 * w
    u2_cached = -1
    Function = u1
  End If
 
End Function
Although an estimate the results are not bad.

Here is a comparison with the Normal Distribution tables and the following graph (from http://www.usmle-forums.com/usmle-step- ... ution.html)

Image

' Compile as Console

Code: Select all

' Compile as Console
#Include Once "CryptoRndBufferCNG.inc"
 
InitializeCryptoBuffers( 128*1024 )
 
Dim As ULongInt i, lCount, lTest
Dim as Single x, Mean, StandardDeviation
 
lTest = 10000000
 
' Compare with Normal Distribution tables
 
Print "Compare with Normal Distribution tables"
For i = 1 To lTest
  If Abs(Gauss) < 2.57583 Then lCount += 1 ' 99% level
Next
Print Using "##.###%"; lCount*100/lTest
 
lCount = 0
For i = 1 To lTest
  If Abs(Gauss) < 1.96 Then lCount += 1 ' 95% level
Next
Print Using "##.###%"; lCount*100/lTest
 
lCount = 0
For i = 1 To lTest
  If Abs(Gauss) < 1.64485 Then lCount += 1 ' 90% level
Next
Print Using "##.###%"; lCount*100/lTest
 
Mean = 0
StandardDeviation = 1
Print
 
' Compare with graph
 
Print "Compare with graph"
lCount = 0
For i = 1 to lTest
If Abs(Mean + Gauss * StandardDeviation) < 1 then lcount += 1
Next
Print Using "##.###%"; 100*lCount/lTest
 
lCount = 0
For i = 1 to lTest
If Abs(Mean + Gauss * StandardDeviation) < 2 then lcount += 1
Next
Print Using "##.###%"; 100*lCount/lTest
 
lCount = 0
For i = 1 to lTest
If Abs(Mean + Gauss * StandardDeviation) < 3 then lcount += 1
Next
Print Using "##.###%"; 100*lCount/lTest
 
CleanUpCryptoRndBufferCNG
 
Sleep
Typical output of test:

Code: Select all

99.004%
95.012%
89.995%
 
68.288%
95.437%
99.727%
NB: If you add it to CryptoRnd.bas for your own compilation use: 'Public Function Gauss As Single'

Added: I doubt that Gauss will be called in rapid succession but we don't want a noticeable delay in its calculation. Although written in BASIC my machine can churn out 36 million per second - that is 28 nanoseconds on average. I say on average because we 'fly out' with a cached u2.
deltarho[1859]
Posts: 4310
Joined: Jan 02, 2017 0:34
Location: UK
Contact:

Re: A fast CPRNG

Post by deltarho[1859] »

Shock horror - I have found a bug!

In Sub InitializeCryptoBuffers we have:

Code: Select all

....
ptrBaseBuffer0 = VarPtr( Buffer0(0) )
...
...
ptrBaseBuffer1 = VarPtr( Buffer1(0) )

That should read:

Code: Select all

....
ptrBaseBuffer0 = VarPtr( Buffer0(1) )
...
...
ptrBaseBuffer1 = VarPtr( Buffer1(1) )

Numerically, it may be difficult, if not impossible, to establish what effect this has. From a Windows memory management perspective it could well be that we would get away with it all the time.

Nonetheless, it is a mistake and needs correcting.

The library in the download above has been corrected and the bi file now includes Function Gauss.

Thanks to fxm for pointing out what the compiler option -exx does. I need to sit down with a cup of tea and read that section thoroughly.
deltarho[1859]
Posts: 4310
Joined: Jan 02, 2017 0:34
Location: UK
Contact:

Re: A fast CPRNG

Post by deltarho[1859] »

In the thread 'Randomize issue' dodicat posted some code which compares the FreeBASIC random number generators so I thought I'd add CryptoRnd - so far I have only tested it for speed and not in a 'real world' environment; other than a Monte Carlo.

The original timings were too small for my liking so I increased dodicat's loop from 100,000 to 1,000,000. I didn't want to push it too far as I may have nodded off waiting for 'Win crypt 5'.

Here is a typical run:

Code: Select all

Default   0      0.05014492066696619       51.01971948053688
CRT       1      0.0609093474057687        46.17083740234375
Fast      2      0.05016880637400334       9.833148672711104
Mersenne  3      0.05001564443855955       51.01971948053688
Qbasic    4      0.04956656503577506       30.42314422130585
Win crypt 5      9.641039294098121         38.51952963066287
CryptoRnd Single 0.01815258643367201       8.553769588470459
CryptoRnd Double 0.03408547734254341       0.186936880354514
Done
I have included both Single and Double because Single uses a ULong and Double uses two ULongs. Double then is a 'genuine' Double and not a FreeBASIC double of (32 bit) * (1.0# / 4294967296.0#).

The results are quite good - 2.76 times faster than 'Fast 2' for Single and 1.47 times faster than 'Fast 2' for Double for this run.

I rarely need a random number within a range but have noticed it is used a lot by game programmers. CryptoRnd has a range function built in - CryptoR( 0, 255 ), for example, which will return a value in [0,255]. For the FreeBASIC generators I used 'Function rnd_range (first As Double, last As Double) As Double' used in the Help file under Rnd.

The task is very much easier for this exercise so I increased dodicat's loop from 100,000 to 100,000,000

Here is a typical run.

Code: Select all

Default   0      0.3152552908169923       177
CRT       1      0.2662600274587703       138
Fast      2      0.2917546084782998       208
Mersenne  3      0.3144139827898584       177
Qbasic    4      0.2907277321695574       139
Win crypt 5      24.43443566152556        191
CryptoRnd Range  0.0584630169600473       78
Done
Here, the results are even better - just less than 5 times faster than 'Fast 2'. To be fair, CryptoR() is written in assembler.

However, I have some very bad news. The above does not use the CryptoRnd library; sneaky of me to call it CryptoRnd <smile> - it uses the inc file. The library is much slower and everybody beats it except 'Win crypt 5'. I had concluded that the library was as fast as the inc file but, clearly, not in the case above.

I have read a discussion on the forum about inc vs bi. Coming from PowerBASIC inc makes sense to me but, by all means, change the inc extension to bi if that makes sense to you.

Makes me wonder if the other algorithms would benifit from buffering. It is not expensive - a 128KB buffer is initialised in no more than 0.65ms - the worst case recorded so far.

Of course, CryptoRnd isn't the slightest bit interested in the thread 'Randomize issue' as it does not require a seed. However, no seed - no repetition. We cannot, at the moment, test an idea with a set of random numbers and then test another idea using the same random numbers. I say, 'at the moment' because I am kicking an idea around where we can.

Added: The TestCrpypto code mentioned earlier has been compiled using procedures without Public/Private, all with Private and the library. This is what I get.

84,480 bytes
83,456 bytes using InitializeCryptoBuffers(), CryptoS and CleanUpCryptoRndBufferCNG
88,064 bytes

No big deal but we can see the dead code removal in the second exe.
deltarho[1859]
Posts: 4310
Joined: Jan 02, 2017 0:34
Location: UK
Contact:

Re: A fast CPRNG

Post by deltarho[1859] »

Couple of quick ones:

1) If we drop 'InitializeCryptoBuffers(128*1024)' at the bottom of the inc file it will get executed. We don't have to remember to put it in the main source now. 128*024 is the optimal on my machine - good speed and negligible stutter. We still have to use 'CleanUpCryptoRndBufferCNG' before exiting our application.

2) Compilation will fail in 64 bit unless we change the Long to LongInt in the two instances of 'SwitchBufferCriteria = Cast(Long, ptrBuffer) + BufferSize'.
deltarho[1859]
Posts: 4310
Joined: Jan 02, 2017 0:34
Location: UK
Contact:

Re: A fast CPRNG

Post by deltarho[1859] »

Sometimes we need to repeat a sequence of random numbers - although not very often in my case. Generally, we cannot do that with cryptographic random numbers.

However, we have buffers of fixed sequences until they are overwritten. All we have to do is to avoid that and reset the buffer pointer to the beginning of the buffer to reuse a sequence. The following little snippet will do just that.

Code: Select all

Private Sub ResetBufferPointer
  ptrBuffer = ptrBaseBuffer0
End Sub
That will reset the pointer to the beginning of buffer A; provided we are in Buffer A at the time. Buffer A is shown in the opening post.

The problem we are faced with is to know, a priori, the maximum number of random numbers we are likely to need during an application session. We may have to grossly over estimate to ensure that we do not enter Buffer B. If we do enter Buffer B then Buffer A will be subject to a refill and we defeat the object.

It is normally not recommended to use a large buffer as this increases the potential of stutter during a buffer switch. This is, of course, academic if we do not switch - there cannot be any stutter.

Here is an example usage:

Code: Select all

#Include Once "CryptoRndBufferCNG.inc"
 
Dim as Long i
 
For i = 1 to 10
  Print CryptoS
Next
 
For i = 1 to 10000
  CryptoS
Next
 
ResetBufferPointer
 
Print
For i = 1 to 10
  Print CryptoS
Next
 
CleanUpCryptoRndBufferCNG
 
Sleep
and a typical output.

Code: Select all

 0.284922
 0.1382313
 0.2343062
 0.7762593
 0.5216942
 0.7003971
 0.003035188
 0.2872491
 0.5853286
 0.05180609
 
 0.284922
 0.1382313
 0.2343062
 0.7762593
 0.5216942
 0.7003971
 0.003035188
 0.2872491
 0.5853286
 0.05180609
InitializeCryptoBuffers( 128*1024 ) is now in the inc file and will need to be edited if more than 128*1024 is needed. Don't forget that we are talking 4 x bytes per random number for the Singles and Range and 8 x bytes for the Doubles.

This method is wasteful of memory because Buffer B will be redundant. That should not be an issue these days.
deltarho[1859]
Posts: 4310
Joined: Jan 02, 2017 0:34
Location: UK
Contact:

Re: A fast CPRNG

Post by deltarho[1859] »

Here is a new version of CryptoRndBufferCNG.inc called CryptRndBuffer.bas.

The original inc file ceased to an inc file as soon as we added the directly executable line using InitializeCryptoBuffers; although some would argue that I should not have used an inc file in the first place, but that is another story. It seems that dead code removal still works when we include a bas file.

CryptRndBuffer.bas is the latest version of the inc file including all the bits and pieces above plus a preprocessor directive.

#define Algo 1: Generates code using BCryptGenRandom
#define Algo 2: Genrates code using Intel's RdRand

BCryptGenRandom was designed to fill a buffer and why we are getting faster speeds than the inbuilt FreeBASIC generators. Intel's RdRand has a 16 bit, 32 bit or 64 bit output. The following is based upon the 32 bit output so fills the buffers one Ulong at a time. It does that quickly but does not compete with BCryptGenRandom.

Not all of you will be able to use 'Algo = 2' as it was introduced with the Intel Ivy Bridge processor in 2012 and AMD introduced it in 2015 but, it seems, it is not a 'word for word' copy of Intel's RdRand.

Not sure if you CPU supports RdRand?

Here is a little GUI which will tell you.

Code: Select all

#include once "windows.bi"
 
Function IsRdRand() As Long
 
Dim CPUManID As String * 12
Dim CPUManIDPtr As Ulong Ptr
 
' Get Manufactures Id String
CPUManIDPtr = Cptr(Ulong Ptr, StrPtr( CPUManID ))
Asm
  mov eax, 0 ' Get Vendor Id
  cpuid
  mov eax, [CPUManIDPtr]
  mov [eax], ebx
  mov [eax + 4], edx
  mov [eax + 8], ecx
End Asm
If CPUManID = "GenuineIntel" or CPUManID = "AuthenticAMD" Then
  ' Is RdRand supported?
  Asm
    mov eax, 1 ' Get features
    cpuid
    test ecx, &h40000000 ' Bit 30
    jz NoRdRand
    mov [Function], -1
NoRdRand:
  End Asm
End If
End Function
 
If IsRdRand = -1 then
  Messagebox( Null, "Yes, it is.", "Is RdRand supported?", MB_OK )
Else
  Messagebox( Null, "No, it is not.", "Is RdRand supported?", MB_OK )
End If 
Becuase it takes longer to fill the buffers then there is the potential for a greater exhaustion stutter; as described in the opening post. For this reason my default 'Algo = 2' buffer is 32x1024 bytes as opposed to the 'Algo = 1' default buffer of 128x1024 bytes.

Here is a test to compare the FreeBASIC's algorithm's 1 to 4 and CryptoRndBuffer.bas with 'Algo = 2'. FreeBASIC's algorithm 5 is not included - it far to slow for large requests of random numbers.

Code: Select all

' Compile as Console
#Include Once "CryptoRndBuffer.bas"
 
Dim as Long lRequestNumber, i, j
Dim As Single z, x, y
Dim AS Double t
 
lRequestNumber = 10000000
 
For i = 1 to 4
   Randomize, i
   t = timer
   For j = 1 to lRequestNumber
      x = Rnd : y = Rnd
      z = sqr( x*x + y*y )
   Next
   t = timer -t
   Print i;": "; Str(Int(lRequestNumber/(t*10^6))) + " Million per second"
Next
 
t =  timer
For i = 1 to lRequestNumber
  x = CryptoS : y = CryptoS
  z = ( x*x + y*y )
Next
t = timer - t
Print " CryptoS: ";Str(Int(lRequestNumber/(t*10^6))) + " Million per second"
 
'CleanUpCryptoRndBufferCNG
 
Sleep
... and here is a typical output.

Code: Select all

 1: 31 Million per second
 2: 35 Million per second
 3: 30 Million per second
 4: 33 Million per second
 CryptoS: 20 Million per second

We could get 27 Million per second with a 128x1024 byte buffer but the potential stutter jumps to 370ns compared with 180ns with
a 32x1024 byte buffer. However, the stutter calculation assumes that we are exhausting the buffers at a 'break neck' speed - something that will not happen, probably, in practice. If you are desperate for speed then increase the buffer size and see what difference it has to your application. With a 512x1024 byte buffer that 20 Million jumps to 30 Million and inline with the Merseene Twister speed. However, the potential stutter jumps to about 870ns, close to 1ms. Your application may accomodate that - I don't know.

At the other extreme with a buffer size of 40MB (> 4x10 Million) then that 20 Million jumps to 131 Million. However, the time to fill the buffer jumps to 17.7ms for 'Algo = 1' and 166.4ms for 'Algo 2' compared with 0.46ms for the 'Algo = 1' default buffer size and 0.88ms for the 'Algo= 2' default buffer size

FreeBASIC's assembler syntax is a little different to what I am used to so I put CryptoRndBuffer.bas using 'Algo = 2' through the 'wringer' with PractRand. I got to 1TB of crunching with only a couple of barely unusal anomalies - the same results got using BCryptRandom. With previous tests comparing BCryptGenRandom with Intel's RdRand, Intel's RdRand had an edge but I would not say that is was significant. Intel's RdRand analyses thermal noise on the CPU so I regard it is a poor man's quantum random number generator although, I suspect, Intel may prefer another description. <smile>

To put the above into context, with 'Algo = 1' the above test gave 101 Million per second with a 128x1024 byte buffer. In that case we needed to uncomment the cleanup to avoid a memory leak.

So, what to do? If speed is not an issue and I wanted the best random numbers that my PC can offer then I would use 'Algo = 2'. If speed was an issue than I would go for 'Algo = 1'. Of course, I am biased but 'Algo = 1' is fast, it has been PractRand tested to 1TB and I don't need no 'stinkin' seed. <smile>

Technical note: Intel do not guarantee a random number will be returned on invoking RdRand. The likelihood of a failure is very small and, in the event of a failure, they suggest simply trying again. I allow up to 10 tries. I suspect that the probability of failing 10 times is infinitesimally small but it is not zero. If this did happen the code goes into a recovery mode and uses Microsoft's RtlGenRandom for the failing Ulong. In all probability that is redundant code but it has to be there.

CryptoRndBuffer.bas ( Latest version: 2 Mar 2017 13:27 and takes into account the following post. )
Note: Back to 32 bit and 64 bit!

Code: Select all

#ifndef ALGO
  #define ALGO 1
#endif

#include once "windows.bi"
#include once "win/bcrypt.bi"
#inclib "bcrypt"

#if (ALGO = 2)
  Declare Function RtlGenRandom Lib "Advapi32.dll" Alias "SystemFunction036" _
  ( RandomBuffer As Any Ptr, RandomBufferLength As Ulong ) As Byte
#endif

Dim Shared As Byte Ptr hRand
Dim Shared As Byte Buffer0(), Buffer1()
Dim Shared As Long BufferSize
Dim Shared As Any Ptr hThread()
Dim Shared As Any Ptr ptrBuffer, ptrBaseBuffer0, ptrBaseBuffer1 
Dim Shared As Long SwitchBufferCriteria

Declare Sub SwitchBuffer
Declare Sub FillBuffer( As Any Ptr )
Declare Sub ResetBufferPointer
Declare Sub InitializeCryptoBuffers( As Long )

#If (ALGO = 1)
  BufferSize = 128*1024
#Else
  BufferSize = 32*1024
#Endif
 
InitializeCryptoBuffers( BufferSize )

Private Function CryptoDW As ULong

  If ptrBuffer >= SwitchBufferCriteria Then
    SwitchBuffer
  End If
  
  Asm
    mov eax, dword ptr [ptrBuffer]
    mov eax, [eax]
    mov [Function], eax
  End Asm
  
  ptrBuffer += 4
  
End Function

Private Function CryptoS As Single ' [0,1)
  
  If ptrBuffer >= SwitchBufferCriteria Then
    SwitchBuffer
  End If
 
  ' ASM by Wilbert @ PureBasic forums
  Asm
    mov eax, dword ptr [ptrBuffer]
    mov eax, [eax]
    movd xmm0, eax
    psrlq xmm0, 9
    mov eax, 1
    cvtsi2ss xmm1, eax
    por xmm0, xmm1
    subss xmm0, xmm1
    movd [Function], xmm0
  End Asm
  
  ptrBuffer += 4
 
End Function
 
Private Function CryptoSX As Single ' [-1,1]
 
  If ptrBuffer >= SwitchBufferCriteria Then
    SwitchBuffer
  End If
 
  ' ASM adapted from CryptoS by author
  Asm
    mov eax, dword ptr [ptrBuffer]
    mov eax, [eax]
    movd xmm0, eax
    psrlq xmm0, 9
    mov edx, 2
    cvtsi2ss xmm1, edx
    por xmm0, xmm1
    subss xmm0, xmm1
    mov edx, 1
    cvtsi2ss xmm1, edx
    subss xmm0, xmm1
    movd [Function], xmm0
  End Asm
  
  ptrBuffer += 4
  
End Function
 
Private Function CryptoD As Double  ' [0,1)
Dim As ULongInt uld
  
  If ptrBuffer >= SwitchBufferCriteria Then
    SwitchBuffer
  End If
 
  ' ASM by Wilbert at PureBasic forums
  Asm
    mov eax, dword ptr [ptrBuffer]
    movd xmm0, [eax]
    movd xmm1, [eax + 4]
    punpckldq xmm0, xmm1
    psrlq xmm0, 12
    mov eax, 1
    cvtsi2sd xmm1, eax
    por xmm0, xmm1
    subsd xmm0, xmm1
    movq [Function], xmm0
  End Asm
  
  ptrBuffer += 8
 
End Function
 
Private Function CryptoDX As Double  ' [-1,1]
 
  If ptrBuffer >= SwitchBufferCriteria Then
    SwitchBuffer
  End If
 
  ' ASM adapted from CryptoD by author
  Asm
    mov eax, dword ptr [ptrBuffer]
    movd xmm0, [eax]
    movd xmm1, [eax + 4]
    punpckldq xmm0, xmm1
    psrlq xmm0, 12
    mov eax, 2
    cvtsi2sd xmm1, eax
    por xmm0, xmm1
    subsd xmm0, xmm1
    mov eax, 1
    cvtsi2sd xmm1, eax
    subsd xmm0, xmm1
    movq [Function], xmm0
  End Asm
  
  ptrBuffer += 8
 
End Function
 
Private Function CryptoR( ByVal One As Long, ByVal Two As Long ) As Long
 
  If ptrBuffer >= SwitchBufferCriteria Then
    SwitchBuffer
  End If
  
  ' ASM by John Gleason @ PowerBASIC forums 
  Asm
    mov edx, dword ptr [ptrBuffer]
    mov edx, [edx]
    mov ecx, [One]
    mov eax, [Two]
    cmp ecx, eax
    jl Now1LT2
    xchg eax, ecx
Now1LT2:
    Sub eax, ecx
    inc eax
    jz doTheRnd
    mul edx
    Add edx, ecx
doTheRnd:
    mov [Function], edx
  End Asm
  
  ptrBuffer += 4
  
End Function
 
Private Function Gauss As Single
Static As Long u2_cached
Static As Single u1, u2, x1, x2, w
 
  If u2_cached = -1 Then
    u2_cached = 0
    Function = u2
  Else
    Do
      x1 = CryptoS
      x2 = CryptoS
      w = x1 * x1 + x2 * x2
    Loop While w >= 1
    w = Sqr( -2 * Log(w)/w )
    u1 = x1 * w
    u2 = x2 * w
    u2_cached = -1
    Function = u1
  End If
 
End Function

Private Sub InitializeCryptoBuffers( ByVal Buffer As Long )
  #if (ALGO = 1)
    BCryptOpenALGOrithmProvider( VarPtr(hRand), BCRYPT_RNG_ALGORITHM, 0, 0)
  #endif
  ReDim As Any Ptr hThread(0 To 1)
  If Buffer < 1024 Then
    BufferSize = 1024
  Else
    BufferSize = Buffer - Buffer Mod 8
  End If
  ReDim Buffer0( 1 To BufferSize) As Byte
  ptrBaseBuffer0 = VarPtr( Buffer0(1) )
  ptrBuffer = ptrBaseBuffer0
  #ifdef __FB_64BIT__
    SwitchBufferCriteria = Cast(LongInt, ptrBuffer) + BufferSize
  #else
    SwitchBufferCriteria = Cast(Long, ptrBuffer) + BufferSize
  #endif
  hThread(0) = ThreadCreate( @FillBuffer, ptrBaseBuffer0 )
  hThread(1) = ThreadCreate( @FillBuffer, ptrBaseBuffer0 + BufferSize\2 )
  ThreadWait( hThread(0) )
  ThreadWait( hThread(1) )
  ReDim Buffer1( 1 To BufferSize) As Byte
  ptrBaseBuffer1 = VarPtr( Buffer1(1) )
  hThread(0) = ThreadCreate( @FillBuffer, ptrBaseBuffer1 )
  hThread(1) = ThreadCreate( @FillBuffer, ptrBaseBuffer1 + BufferSize\2 )
  
End Sub

#if (ALGO = 1)
  Private Sub CleanUpCryptoRndBufferCNG
    BCryptCloseALGOrithmProvider( hRand, 0  )
  End Sub
#endif

#If (ALGO = 1)
Private Sub FillBuffer( ByVal BaseBuffer As Any Ptr )
  BCryptGenRandom( hRand, BaseBuffer, BufferSize\2, 0)
End Sub
#Else
Private Sub FillBuffer( ByVal BaseBuffer As Any Ptr )
Dim As Long HalfBuffer
Dim As ULong RecoverBuffer
Dim As Any Ptr ptrRecoverBuffer
 
  ptrRecoverBuffer = VarPtr(RecoverBuffer)
 
  HalfBuffer = BufferSize\2
  Asm
    mov edi, dword ptr [HalfBuffer]
    mov esi, 0
    mov ebx, dword ptr [BaseBuffer]
  rptRdRand:
    mov ecx, 10 ' Max number Of tries before going into a recovery
  queryAgain:
  #ifdef __FB_64BIT__
    RdRand rax
  #else
    RdRand eax
  #endif
    jc OK ' A Random value was available
    dec ecx
    jnz queryAgain
    Call Recover ' Use RtlGenRandom For This ULong
  OK:
    #ifdef __FB_64BIT__
      mov qword ptr [ebx + esi], rax ' Store RdRand
      Add esi, 8
    #else
      mov dword ptr [ebx + esi], eax ' Store RdRand
      Add esi, 4
    #endif
      cmp edi, esi
      jne rptRdRand
      jmp Done
  Recover:
  #ifndef __FB_64BIT__
    pushad ' I am playing it safe here
  #endif 
  End Asm
  #ifdef __FB_64BIT__    
    RtlGenRandom(ptrRecoverBuffer, 8)  ' Populate buffer
  #else
    RtlGenRandom(ptrRecoverBuffer, 4)
  #endif
  Asm
  #ifndef __FB_64BIT__
    popad
  #endif
  #ifdef __FB_64BIT__  
    mov rax, qword ptr [ptrRecoverBuffer]
  #else
    mov eax, dword ptr [ptrRecoverBuffer]
  #endif
    ret
  Done:
  End Asm
 
End Sub
#Endif

Private Sub SwitchBuffer
	ThreadWait( hThread(0) )
  ThreadWait( hThread(1) )
  Swap ptrBaseBuffer0, ptrBaseBuffer1
  ptrBuffer = ptrBaseBuffer0
  #ifdef __FB_64BIT__
    SwitchBufferCriteria = Cast(LongInt, ptrBuffer) + BufferSize
  #else
    SwitchBufferCriteria = Cast(Long, ptrBuffer) + BufferSize
  #endif
  hThread(0) = ThreadCreate( @FillBuffer, ptrBaseBuffer1 )
  hThread(1) = ThreadCreate( @FillBuffer, ptrBaseBuffer1 + BufferSize\2 )
End Sub

Private Sub ResetBufferPointer
  ptrBuffer = ptrBaseBuffer0
End Sub
Last edited by deltarho[1859] on Mar 02, 2017 13:29, edited 5 times in total.
deltarho[1859]
Posts: 4310
Joined: Jan 02, 2017 0:34
Location: UK
Contact:

Re: A fast CPRNG

Post by deltarho[1859] »

Come to think of it...

Instead of defining Algo inside CryptoRndBuffer.bas a better approach would be to define it just before the inclusion.

Eg

Code: Select all

#define Algo 1
#Include Once "CryptoRndBuffer.bas"
or

Code: Select all

#define Algo 2
#Include Once "CryptoRndBuffer.bas"
Inside CryptoRndBuffer.bas we could have

Code: Select all

#ifndef Algo
  #define Algo 1
#endif
so if not defined before the inclusion the default will be 'Algo = 1'.
deltarho[1859]
Posts: 4310
Joined: Jan 02, 2017 0:34
Location: UK
Contact:

Re: A fast CPRNG

Post by deltarho[1859] »

... and in the test program I should have written

Code: Select all

#if (Algo = 1)
  CleanUpCryptoRndBufferCNG
#endif
Dear, oh dear!
deltarho[1859]
Posts: 4310
Joined: Jan 02, 2017 0:34
Location: UK
Contact:

Re: A fast CPRNG

Post by deltarho[1859] »

From above
2) Compilation will fail in 64 bit unless we change the Long to LongInt in the two instances of 'SwitchBufferCriteria = Cast(Long, ptrBuffer) + BufferSize'.
With regard RdRand we can change 'RdRand eax' to 'RdRand rax'. This will now fill the buffers at 64 bits at a time as opposed to 32 bits at a time..

Early tests indicate that 'Algo = 2' is now running as fast as 'Algo = 1' Added: *** NOT TRUE *** Further tests required.

Perhaps improvements can be made in the double precision sections. I will give that some thought.
deltarho[1859]
Posts: 4310
Joined: Jan 02, 2017 0:34
Location: UK
Contact:

Re: A fast CPRNG

Post by deltarho[1859] »

It is amazing how, sometimes, the obvious evades us.

'RdRand rax' versus 'RdRand eax'.

Even though we need call the former only half as many times as the latter it seems that it takes twice as long to return 64 bits as it does 32 bits so, in our context, they will equate.

I would have thought that the overhead of calling RdRand twice in the latter's case would show up in the timings but I cannot see any evidence of this.
deltarho[1859]
Posts: 4310
Joined: Jan 02, 2017 0:34
Location: UK
Contact:

Re: A fast CPRNG

Post by deltarho[1859] »

STOP PRESS

Do not compile with 64 bit.

I have a problem somewhere but I cannot find it. I am knew to 64 bit assembly and have just learned that there is no such thing as pushad and popad in x64. The 64 bit command line did not spot that either. I have conditionally worked around that but the issue seems to be earlier than that. I will find it but don't hold your breath.
deltarho[1859]
Posts: 4310
Joined: Jan 02, 2017 0:34
Location: UK
Contact:

Re: A fast CPRNG

Post by deltarho[1859] »

CryptoRndBuffer.bas is now for 32 bit compilation only - I bit off more than I can chew with 64 Bit! We live and learn.<smile>
deltarho[1859]
Posts: 4310
Joined: Jan 02, 2017 0:34
Location: UK
Contact:

Re: A fast CPRNG

Post by deltarho[1859] »

64 bit is now back on the table.

It was a simple correction but not easily found and it had been found before. 'Been there a few times. <smile>

Stuff like 'mov [ebx + esi], eax' needed editing to 'mov dword ptr [ebx + esi], eax'. That was needed 11 times in just over 300 lines of code.

New version of CryptoRndBuffer.bas posted above. I'll be back with timimg tests. The 64 bit version took a hit compared to the 32 bit version but a kick exceeded the 32 bit version with some tips from srvvaldez.
deltarho[1859]
Posts: 4310
Joined: Jan 02, 2017 0:34
Location: UK
Contact:

Re: A fast CPRNG

Post by deltarho[1859] »

Timings.

Code: Select all

              Flat out                Real world
       32 bit  64 bit  64 bit*  32 bit  64 bit  64 bit*
Algo 1  219     160     307      100     73      119     BCryptGenRandom
Algo 2   36      57      57       20     32       32     Intel RdRand
 
Flat out                                  Real World [1]
32 bit        64 bit        64 bit [2]    32 bit       64 bit       64 bit [2]
1:    71 MHz  1:    66 MHz  1:    68 MHz  1:   31 MHz  1:   33 MHz  1:    35 MHz
2:   105 MHz  2:   259 MHz  2:   299 MHz  2:   35 MHz  2:  138 MHz  2:   162 MHz
3:    85 MHz  3:   166 MHz  3:   182 MHz  3:   30 MHz  3:   71 MHz  3:    95 MHz
4:    97 MHz  4:   259 MHz  4:   299 MHz  4:   33 MHz  4:  115 MHz  4:   162 MHz
CRB: 219 MHz  CRB: 160 MHz  CRB: 307 MHz  CRB: 99 MHz  CRB: 73 MHz  CRB: 118 MHz
 
where CRB = CryptoRndBuffer
 
[1] { x = Random number : y = Random number : z = Sqr( x*x + y*y ) }
[2] Compiler options: -gen gcc -Wc -O1
Flat out requests random numbers and discards them. It is, effectively, a throughput measure. Real World requests two random numbers and performs a complex calculation. The apparent throughput speed will be inversely proportional to the complexity. The greater the complexity then will tend to compress the difference between generator speeds.

In 32 bit mode flat out CRB is a clear winner and about twice as fast as 2:, a Multiply-with-carry implementation. In 32 bit mode Real World options 1: to 4: are bunched together with CRB being three times faster.

In 64 bit mode flat out options 2: and 4: are blisteringly fast. Option 3:, Mersenne Twister, has benefited but not to the extent that option 2: and 4: have. CRB has taken a performance hit. On the other hand when gcc is used CRB's performance is increased more then the other options putting it slightly ahead of option 2: and 4:. We have a similar story with 64 bit Real World but with options 2: and 4: pulling away.

Conclusion:

In 32 bit mode CRB would be my choice, it is faster than any other option and, being cryptographic, should have the better degree of randomness and, of course, does not require any seeding.

In 64 bit mode if speed is an important issue then option 2: seems to be the best choice. If speed is not an important issue and the degree of randomness is then I would chose CRB.

Interestingly, CRB is as fast, if not better, than Mersenne Twister in all of the above tests.

With regard Algo 2, using Intel's RdRand, it has a fast throughput on paper but it accepts one random number request at a time as do options 1: to 4:. In this case the buffer size used is a quarter of that used by BCryptGenRandom to keep the stutter down to a similar level. In 64 bit mode it is about 50% faster than 32 bit mode. Here we are using the 64 bit register rax as opposed to the 32 bit register eax and request half the number of random numbers as with 32 bits. So, even though we are given 64 bits as opposed to 32 bits our total request overhead is halved. On filling the buffers we are filling with 64 bits as opposed to 32 bits but the 'posting' overhead is halved. If RdRand was developed to deliver a buffer size on a single request, like BCryptGenRandom, then it could quite possibly wipe the floor with everybody. The use of gcc has had no effect. The degree of randomness may be better than anything else tested here and it could be a choice for someone requiring just that but I don't think that it is significantly better than BCryptGenRandom which is significantly faster.
Last edited by deltarho[1859] on Mar 03, 2017 15:02, edited 4 times in total.
Post Reply