Threadsafe RANDOMIZE and RND

Post by **coderJeff** » Oct 31, 2020 12:32

I merged changes to fbc 1.08 to make RANDOMIZE and RND thread-safe.
pull request: https://github.com/freebasic/fbc/pull/264
bug report: https://sourceforge.net/p/fbc/bugs/914/

In the end, I added a new mutex to the rtlib to serialize access to the single-instance global state for the random number generators.
- No change in performance on single-threaded (i.e. non-multi-threaded) programs
- You might find RND to perform a little slower on multi-threaded programs due to the mutex
- I had considered using thread local storage (TLS), but it adds complexity and extra pointer look-ups and would still have the overhead of the generic RANDOMIZE and RND interface.
- I think using the mutex is the best we can do with the generic & simple RANDOMIZE + RND design & interface
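
As a rough illustration of the idea above (not the actual rtlib code, which is written in C), a single global generator state guarded by one mutex might look like the sketch below. The names `g_state`, `g_lock`, `SharedNext32` and `Consumer`, and the simple LCG step, are assumptions made for the example only.

```freebasic
'' Sketch only: one global generator state, serialized by one mutex.
'' g_state, g_lock, SharedNext32 and Consumer are hypothetical names, not rtlib symbols.
dim shared g_state as ulong = 1
dim shared g_lock as any ptr

function SharedNext32( ) as ulong
	mutexlock( g_lock )
	'' 32-bit LCG step (Numerical Recipes constants); the result is
	'' truncated to 32 bits when stored back into the ulong state
	g_state = g_state * 1664525ul + 1013904223ul
	function = g_state
	mutexunlock( g_lock )
end function

sub Consumer( byval p as any ptr )
	dim sum as ulongint
	for i as integer = 1 to 100000
		sum += SharedNext32( )
	next
end sub

g_lock = mutexcreate( )
dim t1 as any ptr = threadcreate( @Consumer )
dim t2 as any ptr = threadcreate( @Consumer )
threadwait( t1 )
threadwait( t2 )
print "final state: " & g_state
mutexdestroy( g_lock )
```

Because every step happens under the lock, the final state should be the same on every run regardless of how the two threads interleave; only which thread receives which number varies.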

To expose some internals I also added a new header: ./inc/fbc-int/math.bi
Current version of full ./fbc-int/math.bi can be seen at https://github.com/freebasic/fbc/blob/m ... nt/math.bi
I'll show some examples following this post soon that might help explain what's going on in the rtlib.

I think FB's RANDOMIZE & RND is OK for general use. But if maximum random number bitrate is the goal, then FB's built-in RND won't be the best solution. To make fbc's RND as fast as possible, I think it would need a different design and API than what's being done with the current rtlib.

Post by **paul doe** » Oct 31, 2020 13:47

coderJeff wrote:...To make fbc's RND as fast as possible, I think it would need a different design and API than what's being done with the current rtlib.

PCG32? Middle-Square Weyl Sequence? Squares RND? These are some implementations that can be used without needing to alter the API too much (if at all), and in the two latter cases their implementation is trivial...

Post by **coderJeff** » Oct 31, 2020 14:03

Hi paul doe,
Indeed, it would be very straightforward to add additional RNGs with the current API.

What I'm talking about in design and API is:
- Current RANDOMIZE & RND operate on a single global state
- runtime selectable RNG (i.e. RANDOMIZE selects an RNG to access through RND) - a virtue and a curse
- function RND( arg as single = 1.0 ) takes parameter, typically unused
- rtlib RND functions have an 'if' statement that checks 'arg' to return last number
- RND converts a 32-bit ulong to a double
- Overall, the overhead is multiple function pointer look-ups, plus a typically unused parameter, an if statement, and a conversion.

For "Fastest":
- local automatic storage (no need for mutex locks)
- directly instance a specific RNG and state (i.e. a template or class)
- if RNG math is 'small', inline the function as an expression (fbc doesn't do inline - can only fake it with a macro)
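
The three points above can be sketched roughly as follows: the state lives in a caller-owned variable, one specific generator is instanced directly, and the step is "inlined" with a macro. `XorShift32State` and `XS32_NEXT` are hypothetical names, and Marsaglia's xorshift32 is used here only because it is small; it is not one of the rtlib generators.

```freebasic
'' Sketch only: caller-owned state, no global, no mutex.
'' XorShift32State and XS32_NEXT are hypothetical names.
type XorShift32State
	s as ulong   '' must be non-zero
end type

'' "Fake inline": the macro expands into three statements at the call site.
#macro XS32_NEXT( st )
	st.s xor= st.s shl 13
	st.s xor= st.s shr 17
	st.s xor= st.s shl 5
#endmacro

dim rng as XorShift32State = ( 2463534242ul )
dim sum as ulongint
for i as integer = 1 to 1000000
	XS32_NEXT( rng )
	sum += rng.s   '' use the result so the loop does some real work
next
print sum
```

Since `rng` is an ordinary local variable, each thread or procedure can simply declare its own, and nothing needs to be locked.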

Example programs for testing the "performance" of RNGs tend to call the function and do pretty much nothing with the result.

Post by **coderJeff** » Oct 31, 2020 14:44

Current version of full ./fbc-int/math.bi can be seen at https://github.com/freebasic/fbc/blob/m ... nt/math.bi

The examples following are based on the header as of the first addition.

Example #1: RANDOMIZE & RND in the FBC namespace, plus enum FB_RND_ALGORITHMS

Code: Select all

'' move built-ins out of the global namespace
#undef rnd
#undef randomize

#if defined( __FB_CYGWIN__) or defined(__FB_WIN32__)
#inclib "advapi32"
#endif

namespace FBC

enum FB_RND_ALGORITHMS
	FB_RND_AUTO
	FB_RND_CRT
	FB_RND_FAST
	FB_RND_MTWIST
	FB_RND_QB
	FB_RND_REAL
end enum

extern "rtlib"
	declare sub randomize alias "fb_Randomize" ( byval seed as double = -1.0, byval algorithm as long = FB_RND_AUTO )
	declare function rnd alias "fb_Rnd" ( byval n as single = 1.0 ) as double
end extern

end namespace

1) The 'fbc' namespace is used for names relating to fbc internals. If the interface is ever formally published, the names should go in the 'fb' namespace. The '#undef' statements remove RANDOMIZE & RND from the global namespace.
2) On Windows, fbc automatically adds the "advapi32" import library if the built-in RANDOMIZE & RND are used. Because we are removing the built-ins from the global namespace, we need to add it manually for the fbc namespace.
3) Enumeration of 'FB_RND_ALGORITHMS': I tend to prefer named things rather than magic constants. This formalizes the selection of the random number generator.
4) The extern "rtlib" block puts the RANDOMIZE & RND functions into the fbc namespace, where they can then be called using 'fbc.RANDOMIZE' or 'fbc.RND' respectively.

Code: Select all

#include once "fbc-int/math.bi"
'' fbc-int/math.bi automatically includes fbmath.bi

fbc.randomize , fb.FB_RND_FAST

for i as integer = 1 to 10
	print fbc.RND()
next

EDIT: FB.FB_RND_ALGORITHMS is defined in fbmath.bi

Post by **coderJeff** » Oct 31, 2020 14:53

Example #2: FBC.RND32() returns 32-bit random number

Code: Select all

#include once "fbc-int/math.bi"
for i as integer = 1 to 10
	print fbc.rnd32()
next

With FBC.RND32():
- still thread safe and the mutex is used if multithreaded
- does not expect any extra parameter
- does not perform any extra if statement internally
- does not do any conversion to double
- returns a 32-bit ulong only

With this rtlib entry point, you can avoid some of the overhead of RND() while still remaining threadsafe.
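
Since a raw 32-bit value is returned, any scaling is left to the caller. A minimal sketch of two possible conversions is shown below; `U32ToUnit` and `U32ToRange` are hypothetical helpers, not part of fbc-int/math.bi, and they take the raw value as a parameter so the sketch does not depend on the header.

```freebasic
'' Sketch only: map a raw 32-bit value to [0, 1) and to an integer range.
'' U32ToUnit and U32ToRange are hypothetical helpers, not part of fbc-int/math.bi.
function U32ToUnit( byval x as ulong ) as double
	return x / 4294967296.0   '' divide by 2^32, so the result is always < 1
end function

function U32ToRange( byval x as ulong, byval lo as long, byval hi as long ) as long
	'' multiply-shift: scales x into lo..hi without a floating-point conversion
	dim span as ulongint = culngint( clngint( hi ) - lo + 1 )
	return lo + clng( (culngint( x ) * span) shr 32 )
end function

print U32ToUnit( &h80000000ul )
print U32ToRange( &hFFFFFFFFul, 1, 6 )
```

With the header included, the helpers could be fed directly, e.g. `U32ToRange( fbc.rnd32(), 1, 6 )`. The multiply-shift mapping carries a very small bias when the span does not divide 2^32 evenly, which is usually acceptable for non-cryptographic use.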

Post by **coderJeff** » Oct 31, 2020 15:51

Example #3: Examining Internals with ~~fbc.RndGetInternals( @info )~~ info = fbc.RndGetState()

The ~~fbc.RndGetInternals( @info )~~ fbc.RndGetState() function will retrieve some internal information about the random number generator state.
This example uses the function and displays the internal information in what is hopefully a human readable format.

Code: Select all

#include once "fbmath.bi"
#include once "fbc-int/math.bi"

function RndAlgoToStr( byval algo as fb.FB_RND_ALGORITHMS ) as string
	static algos(0 to 5) as zstring ptr = _
		{ _
			@"FB_RND_AUTO", _
			@"FB_RND_CRT", _
			@"FB_RND_FAST", _
			@"FB_RND_MTWIST", _
			@"FB_RND_QB", _
			@"FB_RND_REAL" _
		}
	if( (algo >=0) and (algo <= 5) ) then
		function = *algos( algo) & " (" & algo & ")"
	else
		function = "Unknown" & " (" & algo & ")"
	end if
end function

'' MAIN
dim info as fbc.FB_RNDSTATE ptr

'' get the internal state
info = fbc.RndGetState( )

print "RNDSTATE Address          : " & hex( cuint(info), sizeof(any ptr)*2 ) & " (ulong ptr)"

'' Select a random number generator
fbc.RANDOMIZE , fb.FB_RND_MTWIST

print "Algorithm                 : " & RndAlgoToStr( info->algorithm )
print "FB_RND_MTWIST, FB_RND_REAL:"
print "    State Length          : " & info->length & " (# of bytes)"
print "Interface:"
print "    RND(single) as double : " & hex( cuint(info->rndproc), sizeof(any ptr)*2 ) & " (procptr)"
print "    RND32() as ulong      : " & hex( cuint(info->rndproc32), sizeof(any ptr)*2 ) & " (procptr)"
print "FB_RND_MTWIST, FB_RND_REAL:"
print "    State address         : " & hex( cuint(@(info->state32(0))), sizeof(any ptr)*2 ) & " (ulong ptr)"
print "    State index           : " & hex( cuint(info->index32), sizeof(any ptr)*2 ) & " (ulong ptr)"
print "FB_RND_FAST, FB_RND_QB    :"
print "    State value (iseed64) : " & hex( culngint(info->iseed64), sizeof(ulongint)*2 ) & " (ulongint)"
print "    State value (iseed32) : " & hex( culng(info->iseed32), sizeof(ulong)*2 ) & " (ulong)"

Sample output on win64:

Code: Select all

RNDSTATE Address          : 00000000004090C0 (ulong ptr)
Algorithm                 : FB_RND_MTWIST (3)
FB_RND_MTWIST, FB_RND_REAL:
    State Length          : 624 (# of bytes)
Interface:
    RND(single) as double : 0000000000402B95 (procptr)
    RND32() as ulong      : 0000000000402955 (procptr)
FB_RND_MTWIST, FB_RND_REAL:
    State address         : 00000000004090E8 (ulong ptr)
    State index           : 0000000000409AA8 (ulong ptr)
FB_RND_FAST, FB_RND_QB    :
    State value (iseed64) : 0000000000000000 (ulongint)
    State value (iseed32) : 00000000 (ulong)

Out of all the internal RNG functions exposed, this structure is the one most likely to change if there are additional updates to the RANDOMIZE & RND internals.

Post by **deltarho[1859]** » Oct 31, 2020 22:35

Unless I am wrong it seems to me that a mutex is used to ensure that the random number generator state, that is the state vector, is not being accessed by more than one thread at a time whilst using the one state vector location.

coderJeff wrote:- You might find RND to perform a little slower on multi-threaded programs due to the mutex

Suppose that RND drops to 80% of its single-threaded throughput, say; two threads would then see 160% in total. That is impossible using just one state vector location. With two threads, the best would be roughly 50% each.

I looked at Mersenne Twister a little while ago using two threads and found 16%/16% because of collisions. Getting 50%/50% is a vast improvement but getting a total exceeding 100% is not possible, I reckon, using the one state vector location.

With thread safety and one state vector location we get what I would call sequence sharing, that is some generated numbers would go to one thread and the others would go to the other thread. The quality of randomness would not be compromised. I would suggest that the quality of randomness may be improved as the serial correlation coefficient may be smaller for the child generations compared with the parent.

With PCG32II, for example, we get thread safety not from using a mutex, or whatever, but by using separate locations for the state vector. With no collisions, the throughput for each generator is the same as with a single instance. Not only that, we can have each generator using its own sequence.
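
A minimal sketch of the "separate state per thread" idea follows. It does not reproduce PCG32II; `WorkerArg`, `Worker` and the simple LCG step are assumptions chosen only to keep the example short.

```freebasic
'' Sketch only: each thread owns its state, so no lock is needed.
'' WorkerArg, Worker and the LCG step are illustrative assumptions (not PCG32II).
type WorkerArg
	state as ulong      '' private generator state
	count as integer    '' how many numbers to draw
	sum as ulongint     '' result, written only by the owning thread
end type

sub Worker( byval p as any ptr )
	dim w as WorkerArg ptr = p
	for i as integer = 1 to w->count
		w->state = w->state * 1664525ul + 1013904223ul
		w->sum += w->state
	next
end sub

dim args(0 to 1) as WorkerArg
dim handles(0 to 1) as any ptr
for i as integer = 0 to 1
	args(i).state = 12345 + i   '' different seed per thread
	args(i).count = 1000000
	handles(i) = threadcreate( @Worker, @args(i) )
next
for i as integer = 0 to 1
	threadwait( handles(i) )
	print "thread " & i & " sum = " & args(i).sum
next
```

Note that seeding a plain LCG differently only picks a different starting point on the same cycle; generators that offer genuinely distinct sequences per instance need more than a different seed.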

Now whilst making thread safety available to the FreeBASIC generators is laudable, especially with fbc.rnd32() and fbc.RndGetInternals( @info ), it seems to me to be a little late in the day given that they are outshone by modern generators from Melissa O'Neill, David Blackman & Sebastiano Vigna, and Bernard Widynski. These generators are much faster, have a much better quality of randomness, and offer 32-bit ulong output and access to the state vector. With my latest Vigna generators we get 64-bit ulongint output.

Having said that, a graphics program which uses a few hundred RNDs, at most, as part of its initialization could probably get away with using any of the FreeBASIC generators, except for #5. There are no tools available that I am aware of which can give a reasonable assessment, from a randomness perspective, of a few hundred random numbers. However, if speed and/or quality of randomness are issues then the FreeBASIC generators simply no longer pass muster. At PowerBASIC there are quite a few RND diehards who reckon that PowerBASIC's RND is 'good enough' and provide no evidence to support their claim. Once shown where PowerBASIC's RND is not good enough, they cease posting. Of course, that is true for anyone wearing blinkers.

Post by **coderJeff** » Nov 01, 2020 2:18

Hi deltarho. Correct, the mutex is to ensure only one thread accesses the single state vector.

fbc.rnd() & fbc.rnd32() automatically lock/unlock the mutex when getting the next value.

However, fbc-int/math.bi also exposes fbc.MathLock() and fbc.MathUnlock(). And ~~fbc.RndGetInternals( @info )~~ fbc.RndGetState() will return function pointers to the random number generator procedures, which just do the math part with no locking or unlocking. This makes it possible to lock, generate a batch of numbers at once, and then unlock.

Example #4: Explicit Lock, RndProc32, Unlock

Code: Select all

#include "fbgfx.bi"
#include "fbmath.bi"
#include "fbc-int/math.bi"

dim shared info as fbc.FB_RNDSTATE ptr

sub FillImageRnd32( byval image as fb.image ptr )

	'' example is for 4 bytes per pixel only
	assert( image->bpp = 4)

	fbc.MathLock()

	dim dst as ulong ptr = cast( ulong ptr, image + 1 )
	for y as integer = 0 to image->height - 1
		for x as integer = 0 to image->width-1
			dst[ y * (image->pitch \ 4) + x] = info->rndproc32()
		next
	next

	fbc.MathUnlock()

end sub

'' for demonstration only; FB_RND_REAL is much slower than all other PRNGs
fbc.randomize , fb.FB_RND_REAL
info = fbc.RndGetState( )
screenres 640, 480, 32

dim as fb.image ptr image = ImageCreate( 128, 128 )

do
	FillImageRnd32( image )
	put( 0, 0 ), image, pset
loop until inkey <> ""

ImageDestroy( image )

Also, at the suggestion of adeyblue, when FB_RND_REAL is used (which uses the crypto API), the Mersenne Twister buffer is used to buffer 624 ulongs at a time. As he put it, reading a single value is criminal. On my PC, FB_RND_REAL is about 20 times slower than the other PRNGs.

Post by **coderJeff** » Nov 01, 2020 2:36

deltarho[1859] wrote:With PCG32II, for example, we get thread safety not from using a mutex, or whatever, but by using separate locations for the state vector. With no collisions, the throughput for each generator is the same as with a single instance. Not only that, we can have each generator using its own sequence.

Now whilst making thread safety available to the FreeBASIC generators is laudable, especially with fbc.rnd32() and fbc.RndGetInternals( @info ), it seems to me to be a little late in the day given that they are outshone by modern generators from Melissa O'Neill, David Blackman & Sebastiano Vigna, and Bernard Widynski. These generators are much faster, have a much better quality of randomness, and offer 32-bit ulong output and access to the state vector. With my latest Vigna generators we get 64-bit ulongint output.

I debated with myself for about a week if I should even bother with this update. I expected that it would be received as lacking.

For a new PRNG to be added:
- adding it to RANDOMIZE & RND is straight-forward.
- For PCG32II or any other PRNG, I need an implementation that I can reference and use to code in C.
- Whatever source code I create or use needs to be releasable under LGPL 2.1 with our linking exception.
- This part is quite easy if someone can point me to the right algorithm to use.

To create individual state vectors:
- need something different from RANDOMIZE & RND only
- internally, https://github.com/freebasic/fbc/blob/m ... math_rnd.c needs a rewrite
- math_rnd.c needs to be rewritten so that a user allocated state vector can be specified
- but still need to have a global thread safe state vector by default for backwards source compatibility
- would be nice to split up the PRNGs into separate modules, making it possible to link to a specific PRNG and not bloat the executable with unused code.
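
One possible shape for such an interface, sketched in FreeBASIC rather than C and purely as an assumption about how it could look: the caller may pass a pointer to its own state, and when none is given the call falls back to a global state protected by a mutex. `RngState`, `RngNext32`, `g_default` and `g_lock` are hypothetical names; this is not the rtlib API.

```freebasic
'' Sketch only: optional caller-supplied state, global locked state by default.
'' RngState, RngNext32, g_default and g_lock are hypothetical; this is not the rtlib API.
type RngState
	s as ulong
end type

dim shared g_default as RngState = ( 1 )
dim shared g_lock as any ptr

function RngNext32( byval st as RngState ptr = 0 ) as ulong
	if( st = 0 ) then
		'' backwards-compatible path: shared state, serialized by the mutex
		mutexlock( g_lock )
		g_default.s = g_default.s * 1664525ul + 1013904223ul
		function = g_default.s
		mutexunlock( g_lock )
	else
		'' caller owns the state: no locking required
		st->s = st->s * 1664525ul + 1013904223ul
		function = st->s
	end if
end function

g_lock = mutexcreate( )
dim mine as RngState = ( 42 )
print RngNext32( )          '' uses the global state
print RngNext32( @mine )    '' uses the caller's state
mutexdestroy( g_lock )
```

Existing code that never passes a state keeps its current behaviour, while threaded code can opt out of the lock by supplying one state per thread.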

Post by **coderJeff** » Nov 01, 2020 2:54

Here's the test code I've been using to get an idea of the timings.

Code: Select all

#include once "fbc-int/math.bi"

const outputfile = "rnd-results.txt"
const MAX_COUNT = 10000000
const MAX_TRIALS = 3

sub PrintOut( byref text as const string, byval dontlog as boolean = false )
	print text;

	if( dontlog = false ) then
		open outputfile for append as #1
		print #1, text;
		close #1
	end if
end sub

#if __FB_MT__
	#ifdef NTHREADS
		const MAX_THREADS = NTHREADS
	#else
		const MAX_THREADS = 4
	#endif
#else
	const MAX_THREADS = 1
#endif

type TESTINFO
	n as ulongint
	rndinfo as FBC.FB_RNDINTERNALS
end type

sub do_rnd(byval p as any ptr)
	dim arg as TESTINFO ptr = p
	dim d as double
	for i as integer = 1 to arg->n
		d += fbc.rnd()
	next
end sub

sub do_rnd32(byval p as any ptr)
	dim arg as TESTINFO ptr = p
	dim d as ulongint
	for i as integer = 1 to arg->n
		d += fbc.rnd32()
	next
end sub

sub do_rnd_nolock(byval p as any ptr)
	dim arg as TESTINFO ptr = p
	dim d as double
	for i as integer = 1 to arg->n
		d += arg->rndinfo.rndproc()
	next
end sub

sub do_rnd32_nolock(byval p as any ptr)
	dim arg as TESTINFO ptr = p
	dim d as ulongint
	for i as integer = 1 to arg->n
		d += arg->rndinfo.rndproc32()
	next
end sub

function PerformTest _
	( _
		title as string, _
		thread as sub(byval arg as any ptr), _
		threads as integer, _
		count as ulongint _
	) as double

	printout( title, true )
	dim as double t = timer

	dim arg as TESTINFO
	arg.n = count
	fbc.rndGetInternals( @(arg.rndinfo) )

#if __FB_MT__
	dim as any ptr thread_ptr(threads-1)
	for i as integer = 0 to threads-1
	   thread_ptr(i)=threadcreate(thread, @arg)
	   sleep 10
	next i
	for i as integer = 0 to threads-1
	   threadwait(thread_ptr(i))
	next i
#else
	for i as integer = 0 to threads-1
		thread( @arg )
	next
#endif
	t = timer - t
	printout( ": " & cuint(t*1000) & " msec" & !"\n", true )
	function = t

end function

type TESTPROC
	title as zstring ptr 
	proc as sub (byval arg as any ptr)
	nolocks as boolean
end type

dim testprocs(0 to ...) as TESTPROC = _
	{ _
		( @"rnd"               , @do_rnd          , false ), _
		( @"rnd32"             , @do_rnd32        , false ), _
		( @"rndproc (nolock)"  , @do_rnd_nolock   , true ), _
		( @"rndproc32 (nolock)", @do_rnd32_nolock , true ) _
	}

type GENERATOR
	title as zstring ptr
	index as integer
	iterations as integer
end type

dim generators(0 to ... ) as GENERATOR = _
	{ _
		( @"CRT"    , 1, MAX_COUNT    ), _
		( @"FAST"   , 2, MAX_COUNT    ), _
		( @"MTWIST" , 3, MAX_COUNT    ), _
		( @"QB"     , 4, MAX_COUNT    ), _
		( @"REAL"   , 5, MAX_COUNT\20 ) _
	}

type RESULT
	avg_t as double
end type

dim results(0 to ubound( testprocs ), 0 to ubound(generators) ) as RESULT

for trial as integer = 1 to MAX_TRIALS
	for test as integer = 0 to ubound( testprocs )
		for gen as integer = 0 to ubound(generators)

			fbc.randomize , generators( gen ).index

			dim title as string = ""
			dim as integer threads = MAX_THREADS
			dim as integer iterations = generators( gen ).iterations
			#if __FB_MT__
				title &= "MT "
			#else
				title &= "ST "
			#endif
			title &= "trial #" & trial
			if( testprocs( test ).nolocks ) then
				threads = 1
				title &= ", threads=" & 1 & " (main only)"
			else
				title &= ", threads=" & threads
			end if
			title &= ", N=" & iterations 
			title &= " gen=" & *generators( gen ).title
			title &= " test=" & *testprocs( test ).title

			var t = PerformTest( title, testprocs( test ).proc, threads, iterations\threads )

			'' accumulate average
			with results( test, gen )
				.avg_t = (.avg_t * cdbl(trial-1) + t*1000000000/iterations ) / cdbl(trial)
			end with
		next
	next
next

printout( !"\n" )
#ifdef __FB_64BIT__
	printout( "64-bit - " )
#else
	printout( "32-bit - " )
#endif
printout( __FB_BACKEND__ & !"\n" )
#if __FB_MT__
printout( "Threads          : " & MAX_THREADS & !"\n" )
printout( "Threads (nolock) : " & 1 & !" (main only)\n" )
#else
printout( "Single threaded" & !"\n" )
#endif
printout( "Average of " & MAX_TRIALS & " trials (in nanosec)" & !"\n" )
printout( space(20) )
for col as integer = 0 to ubound(generators)
	printout( right( space(10) & *generators(col).title, 10 ) )
next
printout( !"\n" )
for test as integer = 0 to ubound(testprocs)
	printout( left( *testprocs(test).title & space(20), 20 ) )

	for gen as integer = 0 to ubound(generators)
		printout( right( space(10) & cuint( results( test, gen ).avg_t ), 10 ) )
	next
	printout( !"\n" )
next

OUTPUT EXAMPLE:
By compiling for 1 thread, I can compare what I think is the overhead of locking/unlocking against non-locking.
$ fbc-win32 dotest.bas -gen gas -exx -d NTHREADS=1 -mt

Code: Select all

32-bit - gas
Threads          : 1
Threads (nolock) : 1 (main only)
Average of 3 trials (in nanosec)
                           CRT      FAST    MTWIST        QB      REAL
rnd                        252       232       301       229      7435
rnd32                      187       129       209       129      7365
rndproc (nolock)           166       140       213       135      7242
rndproc32 (nolock)          99        44       118        49      7247

With 4 threads, a lot of extra time is spent waiting on the mutex, but the no-lock timings are about the same.
$ fbc-win32 dotest.bas -gen gas -exx -d NTHREADS=4 -mt

Code: Select all

32-bit - gas
Threads          : 4
Threads (nolock) : 1 (main only)
Average of 3 trials (in nanosec)
                           CRT      FAST    MTWIST        QB      REAL
rnd                        574       508       625       577      8855
rnd32                      415       545       547       548      8652
rndproc (nolock)           167       140       212       135      7145
rndproc32 (nolock)          97        44       118        50      7272

Post by **fxm** » Nov 01, 2020 8:51

coderJeff wrote:Example #4: Explicit Lock, RndProc32, Unlock

Code: Select all

#include "fbgfx.bi"
#include "fbc-int/math.bi"

dim shared info as fbc.FB_RNDINTERNALS

sub FillImageRnd32( byval image as fb.image ptr )

	'' example is for 4 bytes per pixel only
	assert( image->bpp = 4)

	fbc.MathLock()

	dim dst as ulong ptr = cast( ulong ptr, image + 1 )
	for y as integer = 0 to image->height - 1
		for x as integer = 0 to image->width-1
			dst[ y * (image->pitch \ 4) + x] = info.rndproc32()
		next
	next

	fbc.MathUnlock()

end sub

fbc.randomize , fbc.FB_RND_REAL
fbc.RndGetInternals( @info )
screenres 640, 480, 32

dim as fb.image ptr image = ImageCreate( 128, 128 )

do
	FillImageRnd32( image )
	put( 0, 0 ), image, pset
loop until inkey <> ""

ImageDestroy( image )

Code to compile with the '-mt' option.

Post by **deltarho[1859]** » Nov 01, 2020 8:56

coderJeff wrote:As he [adeyblue] put it, reading a single value is criminal. On my PC, FB_RND_REAL is about 20 times slower than the other PRNGs.

That is what I do with CryptoRndII; I use 128KB buffers when using BCryptGenRandom and 32KB buffers when using Intel RdRand. There is a bit more to it than that: Two buffers are used and both filled initially. When the first buffer is exhausted we switch to the second buffer and then start filling the first buffer again, and so on. There is a bit more: Each buffer is split into two and each half is populated with a separate thread of execution. In practice the likelihood of waiting for a buffer to be filled before it can be used is almost nil.

It is worth noting that CryptGenRandom, used with generator #5 in Windows, was designed to fill a buffer.

With regard to the timings, you forgot to remove '-exx', which normally knocks the stuffing out of the performance.

I have found that gas is diabolically slow with my generators and use gcc with -O2 optimization.

coderJeff wrote:a lot of extra time is spent waiting on the mutex

If that is true, then I dislike the sound of that. The FreeBASIC generators are already slow compared with modern generators. Of course, with some applications speed is not an issue.

I am a dissenting voice here. I would be interested in what others have to say.

Post by **fxm** » Nov 01, 2020 9:57

I get results tighter than yours:

Code: Select all

32-bit - gas
Single threaded
Average of 3 trials (in nanosec)
                           CRT      FAST    MTWIST        QB      REAL
rnd                         81        38        50        15      3687
rnd32                       52        10        19         9      3604
rndproc (nolock)            83        44        53        17      3708
rndproc32 (nolock)          60        14        24        15      3589

Code: Select all

32-bit - gas
Threads          : 4
Threads (nolock) : 1 (main only)
Average of 3 trials (in nanosec)
                           CRT      FAST    MTWIST        QB      REAL
rnd                        374       306       345       213      3839
rnd32                      339       204       248       198      3883
rndproc (nolock)            83        44        63        20      3760
rndproc32 (nolock)          61        15        24        14      3647

Post by **fxm** » Nov 01, 2020 10:29

Keep in mind that using TLS itself increases execution time a bit because it induces successive indirections for all accesses to these thread-local static variables.

Post by **deltarho[1859]** » Nov 01, 2020 10:30

fxm wrote:I get results tighter than yours:

Using the same code and command line?
