Here is a set of FB macros for benchmarking code. Instead of returning the elapsed time, these macros return the elapsed processor clock cycles. When benchmarking code for the purpose of optimizing/tuning it, I find the elapsed processor clock cycles to be generally more useful than the elapsed time for two reasons. The first is that the much (for recent processors ~1000 times) higher resolution makes it possible to detect very small changes in execution time, without having to wait for the code to loop a zillion times. The second is that for short sequences of assembly code the processor clock cycles, combined with some knowledge of the optimal clock cycle counts for the instructions, provide a simple method of gauging the efficiency of the code. Recent processors can execute most simple instructions in less than one clock cycle, and much less than one clock cycle for the most recent processors. So, for example, if you were running on a recent processor and you had ten simple instructions in a loop that ran 100 times, the entire loop should be able to execute in well under 1000 clock cycles. If the loop executed in, say 400 clock cycles, then the code would be somewhere near optimal, and if it executed in >1000 clock cycles, then the code would be far from optimal.
EDIT:
I finally got around to updating these macros. The new version:
Allows you to specify a loop count and a priority class.
Aligns the loop label for the timing loops to make the cycle counts insensitive to the alignment of the macro call.
Starts a new time slice at the beginning of the loop, to minimize the possibility of the loop overlapping the time slice (see the comments).
I will update the examples in this thread and elsewhere as time permits.
Code: Select all
'====================================================================
'' The COUNTER_BEGIN and COUNTER_END macros provide a convenient
'' method of measuring the processor clock cycle count for a block
'' of code. These macros must be called in pairs, and the block of
'' code must be placed in between the COUNTER_BEGIN and COUNTER_END
'' macro calls. The clock cycle count for a single loop through the
'' block of code, corrected for the test loop overhead, is returned
'' in the shared variable COUNTER_CYCLES.
''
'' These macros capture the lowest cycle count that occurs in a
'' single loop through the block of code, on the assumption that
'' the lowest count is the correct count. The higher counts that
'' occur are the result of one or more context switches within the
'' loop. Context switches can occur at the end of a time slice, so
'' to minimize the possibility of the loop overlapping the time
'' slice the COUNTER_BEGIN macro starts a new time slice at the
'' beginning of the loop. If the execution time for a single loop
'' is greater than the duration of a time slice (approximately 20ms
'' under Windows), then the loop will overlap the time slice, and
'' if another thread of equal priority is ready to run, then a
'' context switch will occur.
''
'' A higher-priority thread can “preempt” a lower priority thread,
'' causing a context switch to occur before the end the time slice.
'' Raising the priority of the process (thread) can reduce the
'' frequency of these context switches. To do so, specify a higher
'' than normal priority class in the priority_class parameter of
'' the COUNTER_BEGIN macro. REALTIME_PRIORITY_CLASS specifies the
'' highest possible priority, but using it involves some risk,
'' because your process will preempt *all* other processes,
'' including critical Windows processes, and if the timed code
'' takes too long to execute then Windows may hang, even the NT
'' versions. HIGH_PRIORITY_CLASS is a safer alternative that in
'' most cases will produce the same cycle count.
''
'' You can avoid the bulk of the context switches by simply
'' inserting a few second delay in front of the first macro
'' call, to allow the system disk cache activity to subside.
''
'' For the loop_count parameter of the COUNTER_BEGIN macro, larger
'' values increase the number of samples and so tend to improve
'' the consistency of the returned cycle counts, but in most cases
'' you are unlikely to see any improvement beyond a value of about
'' 1000.
''
'' Note that the block of code must be able to withstand repeated
'' looping, and that this method will not work for timing a loop
'' where the number of loops is variable, because the lowest cycle
'' count that occurs will be when the loop falls through.
''
'' These macros require a Pentium-class processor that supports
'' the CPUID (function 0 only) and RDTSC instructions.
'====================================================================
#include once "windows.bi"
dim shared _counter_tsc1_ as ulongint, _counter_tsc2_ as ulongint
dim shared _counter_overhead_ as ulongint, counter_cycles as ulongint
dim shared _counter_loop_counter_ as uinteger
sub _counter_code1_
asm
'
' Use same CPUID input value for each call.
'
xor eax, eax
'
' Flush pipe and wait for pending ops to finish.
'
cpuid
'
' Read Time Stamp Counter.
'
rdtsc
'
' Save count.
'
mov [_counter_tsc1_], eax
mov [_counter_tsc1_+4], edx
end asm
end sub
sub _counter_code2_
asm
xor eax, eax
cpuid
rdtsc
mov [_counter_tsc2_], eax
mov [_counter_tsc2_+4], edx
end asm
end sub
'' Unlike the #define directive, the #macro directive
'' allows inline asm.
''
#macro COUNTER_BEGIN( loop_count, priority_class )
_counter_overhead_ = 2000000000
counter_cycles = 2000000000
SetPriorityClass( GetCurrentProcess(), priority_class )
Sleep_(0) '' Start a new time slice
''
'' The nops compensate for the 10-byte instruction (that
'' initializes _counter_loop_counter_) between the alignment
'' directive and the loop label, which ideally needs to be
'' aligned on a 16-byte boundary.
''
asm
.balign 16
nop
nop
nop
nop
nop
nop
end asm
for _counter_loop_counter_ = 1 to loop_count
_counter_code1_
_counter_code2_
if (_counter_tsc2_ - _counter_tsc1_) < _counter_overhead_ then
_counter_overhead_ = _counter_tsc2_ - _counter_tsc1_
endif
next
Sleep_(0) '' Start a new time slice
asm
.balign 16
nop
nop
nop
nop
nop
nop
end asm
for _counter_loop_counter_ = 1 to loop_count
_counter_code1_
#endmacro
''
'' *** Note the open FOR loop ***
''
#define COUNTER_END _
_counter_code2_ :_
if (_counter_tsc2_ - _counter_tsc1_) < counter_cycles then :_
counter_cycles = _counter_tsc2_ - _counter_tsc1_ :_
endif :_
next :_
SetPriorityClass( GetCurrentProcess(), NORMAL_PRIORITY_CLASS ) :_
counter_cycles -= _counter_overhead_
This is a test app that measures and displays the clock cycle counts for various operations in pure FB code.
Code: Select all
'====================================================================
#include once "windows.bi"
#include once "\crt\string.bi"
#include "counter.bas"
'====================================================================
Dim As uint x,y,z = 1
Dim As Double d1,d2
'---------------------------------------------------------------------
'' Delay for a few seconds to allow the system activities involved in
'' launching the program to finish. This will reduce the number of
'' interruptions in the counter loops, and substantially improve the
'' consistency of the results.
'---------------------------------------------------------------------
sleep 3000
counter_begin( 1000, HIGH_PRIORITY_CLASS )
x = 1
counter_end
Print "x = 1 : ";counter_cycles;" cycles"
counter_begin( 1000, HIGH_PRIORITY_CLASS )
x = x + 2
counter_end
Print "x = x + 2 : ";counter_cycles;" cycles"
counter_begin( 1000, HIGH_PRIORITY_CLASS )
x += 2
counter_end
Print "x += 2 : ";counter_cycles;" cycles"
counter_begin( 1000, HIGH_PRIORITY_CLASS )
x = x * 3
counter_end
Print "x = x * 3 : ";counter_cycles;" cycles"
counter_begin( 1000, HIGH_PRIORITY_CLASS )
x *= 3
counter_end
Print "x *= 3 : ";counter_cycles;" cycles"
counter_begin( 1000, HIGH_PRIORITY_CLASS )
x = x * 4
counter_end
Print "x = x * 4 : ";counter_cycles;" cycles"
counter_begin( 1000, HIGH_PRIORITY_CLASS )
x *= 4
counter_end
Print "x *= 4 : ";counter_cycles;" cycles"
counter_begin( 1000, HIGH_PRIORITY_CLASS )
x = x / 3
counter_end
Print "x = x / 3 : ";counter_cycles;" cycles"
counter_begin( 1000, HIGH_PRIORITY_CLASS )
x /= 3
counter_end
Print "x /= 3 : ";counter_cycles;" cycles"
counter_begin( 1000, HIGH_PRIORITY_CLASS )
x = x / 4
counter_end
Print "x = x / 4 : ";counter_cycles;" cycles"
counter_begin( 1000, HIGH_PRIORITY_CLASS )
x /= 4
counter_end
Print "x /= 4 : ";counter_cycles;" cycles"
counter_begin( 1000, HIGH_PRIORITY_CLASS )
x = x \ 3
counter_end
Print "x = x \ 3 : ";counter_cycles;" cycles"
counter_begin( 1000, HIGH_PRIORITY_CLASS )
x \= 3
counter_end
Print "x \= 3 : ";counter_cycles;" cycles"
counter_begin( 1000, HIGH_PRIORITY_CLASS )
x = x \ 4
counter_end
Print "x = x \ 4 : ";counter_cycles;" cycles"
counter_begin( 1000, HIGH_PRIORITY_CLASS )
x \= 4
counter_end
Print "x \= 4 : ";counter_cycles;" cycles"
counter_begin( 1000, HIGH_PRIORITY_CLASS )
x = y / z
counter_end
Print "x = y / z : ";counter_cycles;" cycles"
counter_begin( 1000, HIGH_PRIORITY_CLASS )
x = y \ z
counter_end
Print "x = y \ z : ";counter_cycles;" cycles"
counter_begin( 1000, HIGH_PRIORITY_CLASS )
x = y Mod z
counter_end
Print "x = y mod z : ";counter_cycles;" cycles"
counter_begin( 1000, HIGH_PRIORITY_CLASS )
x = y ^ z
counter_end
Print "x = y ^ z : ";counter_cycles;" cycles"
counter_begin( 1000, HIGH_PRIORITY_CLASS )
x = y * y
counter_end
Print "x = y * y : ";counter_cycles;" cycles"
counter_begin( 1000, HIGH_PRIORITY_CLASS )
x = y ^ 2
counter_end
Print "x = y ^ 2 : ";counter_cycles;" cycles"
counter_begin( 1000, HIGH_PRIORITY_CLASS )
x = Sqr(y)
counter_end
Print "x = sqr(y) : ";counter_cycles;" cycles"
counter_begin( 1000, HIGH_PRIORITY_CLASS )
d1 = Sin(d2)
counter_end
Print "d1 = sin(d2): ";counter_cycles;" cycles"
Sleep
Code: Select all
'==============================================================================
#include once "windows.bi"
#include once "\crt\string.bi"
#include "counter.bas"
'==============================================================================
Sub memset_for( dest As Byte Ptr, Byval c As uint, Byval count As uint )
Dim As uint i
For i = 0 To count - 1
*(dest+i) = c
Next
End Sub
'==============================================================================
Sub memset_for_unroll( dest As Byte Ptr, Byval c As uint, Byval count As uint )
Dim As uint i, offset, dwordc
dwordc = c + c Shl 8 + c Shl 16 + c Shl 24
If count Shr 5 Then
For i = 0 To count - 32 Step 32
offset = cast(uint,dest) + i
*(cast(dword Ptr,offset)) = dwordc
*(cast(dword Ptr,offset+4)) = dwordc
*(cast(dword Ptr,offset+8)) = dwordc
*(cast(dword Ptr,offset+12)) = dwordc
*(cast(dword Ptr,offset+16)) = dwordc
*(cast(dword Ptr,offset+20)) = dwordc
*(cast(dword Ptr,offset+24)) = dwordc
*(cast(dword Ptr,offset+28)) = dwordc
Next
Endif
If count And 31 Then
For i = i To count - 1
*(dest+i) = c
Next
Endif
End Sub
'==============================================================================
Sub memset_asm( Byval dest As Byte Ptr, Byval c As uint, Byval count As uint )
asm
mov eax, [c]
Or eax, eax
jz dozero
mov ebx, [c]
mov esi, [c]
Shl eax, 8
Shl ebx, 16
Or eax, [c]
Shl esi, 24
Or eax, ebx
Or eax, esi
dozero:
mov ecx, [count]
mov edi, [dest]
Shr ecx, 2
jz dobytes
rep stosd
dobytes:
mov ecx, [count]
And ecx, 3
jz fini
byteloop:
mov Byte Ptr[edi], al
add edi, 1
Sub ecx, 1
jnz byteloop
fini:
End asm
End Sub
'==============================================================================
#define MEMSIZE 1000
Dim pMem1 As Byte Ptr, pMem2 As Byte Ptr
pMem1 = allocate( MEMSIZE )
pMem2 = allocate( MEMSIZE )
If pMem1 = 0 Or pMem2 = 0 Then
Print "allocation failed"
Sleep
Endif
memset( pMem1, 0, MEMSIZE )
memset( pMem2, 1, MEMSIZE )
memset_for pMem2, 0, MEMSIZE
Print "test memset_for: ";memcmp( pMem1, pMem2, MEMSIZE )
memset( pMem2, 1, MEMSIZE )
memset_for_unroll pMem2, 0, MEMSIZE
Print "test memset_for_unroll: ";memcmp( pMem1, pMem2, MEMSIZE )
memset( pMem2, 1, MEMSIZE )
memset_asm pMem2, 0, MEMSIZE
Print "test memset_asm: ";memcmp( pMem1, pMem2, MEMSIZE )
'==============================================================================
'---------------------------------------------------------------------
'' Delay for a few seconds to allow the system activities involved in
'' launching the program to finish. This will reduce the number of
'' interruptions in the counter loops, and substantially improve the
'' consistency of the results.
'---------------------------------------------------------------------
sleep 3000
counter_begin( 1000, HIGH_PRIORITY_CLASS )
memset( pMem2, 0, MEMSIZE )
counter_end
Print "memset: ";counter_cycles;" cycles"
counter_begin( 1000, HIGH_PRIORITY_CLASS )
memset_for pMem1, 0, MEMSIZE
counter_end
Print "memset_for: ";counter_cycles;" cycles"
counter_begin( 1000, HIGH_PRIORITY_CLASS )
memset_for_unroll pMem1, 0, MEMSIZE
counter_end
Print "memset_for_unroll: ";counter_cycles;" cycles"
counter_begin( 1000, HIGH_PRIORITY_CLASS )
memset_asm pMem1, 0, MEMSIZE
counter_end
Print "memset_asm: ";counter_cycles;" cycles"
deallocate pMem1
deallocate pMem2
Sleep