Clock Cycle Count Macros

MichaelW · Post by **MichaelW** » May 20, 2006 23:00

Hello all,

Here is a set of FB macros for benchmarking code. Instead of returning the elapsed time, these macros return the elapsed processor clock cycles. When benchmarking code for the purpose of optimizing/tuning it, I find the elapsed processor clock cycles to be generally more useful than the elapsed time, for two reasons. The first is that the much higher resolution (~1000 times higher on recent processors) makes it possible to detect very small changes in execution time, without having to wait for the code to loop a zillion times. The second is that for short sequences of assembly code the processor clock cycles, combined with some knowledge of the optimal clock cycle counts for the instructions, provide a simple method of gauging the efficiency of the code. Recent processors can execute most simple instructions in less than one clock cycle, and the most recent processors in much less than one clock cycle. So, for example, if you were running on a recent processor and you had ten simple instructions in a loop that ran 100 times, the entire loop should be able to execute in well under 1000 clock cycles. If the loop executed in, say, 400 clock cycles, then the code would be somewhere near optimal, and if it executed in more than 1000 clock cycles, then the code would be far from optimal.
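The arithmetic in that example can be written out as a small helper. This is only a sketch for illustration: the name cycles_per_instruction is hypothetical and is not part of the macros, and the figures are the ones from the paragraph above, not measurements.

```freebasic
'' Hypothetical helper (not part of the macros): average clock cycles
'' spent per instruction for a loop that was measured as a whole.
Function cycles_per_instruction( ByVal total_cycles As ULongInt, _
                                 ByVal instructions_per_pass As ULongInt, _
                                 ByVal passes As ULongInt ) As Double
    Return total_cycles / ( instructions_per_pass * passes )
End Function

'' Ten simple instructions in a loop that runs 100 times:
Print cycles_per_instruction(  400, 10, 100 )   '' 0.4 cycles each, near optimal
Print cycles_per_instruction( 1000, 10, 100 )   '' 1 cycle each, the upper bound above
```

A result well below one cycle per instruction suggests the processor is overlapping the instructions; a result above one suggests stalls worth looking into.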

EDIT:

I finally got around to updating these macros. The new version:

Allows you to specify a loop count and a priority class.

Aligns the loop label for the timing loops to make the cycle counts insensitive to the alignment of the macro call.

Starts a new time slice at the beginning of the loop, to minimize the possibility of the loop overlapping the time slice (see the comments).

I will update the examples in this thread and elsewhere as time permits.

Code: Select all

'====================================================================
'' The COUNTER_BEGIN and COUNTER_END macros provide a convenient
'' method of measuring the processor clock cycle count for a block
'' of code. These macros must be called in pairs, and the block of
'' code must be placed in between the COUNTER_BEGIN and COUNTER_END
'' macro calls. The clock cycle count for a single loop through the
'' block of code, corrected for the test loop overhead, is returned
'' in the shared variable COUNTER_CYCLES.
''
'' These macros capture the lowest cycle count that occurs in a
'' single loop through the block of code, on the assumption that
'' the lowest count is the correct count. The higher counts that
'' occur are the result of one or more context switches within the
'' loop. Context switches can occur at the end of a time slice, so
'' to minimize the possibility of the loop overlapping the time
'' slice the COUNTER_BEGIN macro starts a new time slice at the
'' beginning of the loop. If the execution time for a single loop
'' is greater than the duration of a time slice (approximately 20ms
'' under Windows), then the loop will overlap the time slice, and
'' if another thread of equal priority is ready to run, then a
'' context switch will occur.
''
'' A higher-priority thread can “preempt” a lower priority thread,
'' causing a context switch to occur before the end of the time slice.
'' Raising the priority of the process (thread) can reduce the
'' frequency of these context switches. To do so, specify a higher
'' than normal priority class in the priority_class parameter of
'' the COUNTER_BEGIN macro. REALTIME_PRIORITY_CLASS specifies the
'' highest possible priority, but using it involves some risk,
'' because your process will preempt *all* other processes,
'' including critical Windows processes, and if the timed code
'' takes too long to execute then Windows may hang, even the NT
'' versions. HIGH_PRIORITY_CLASS is a safer alternative that in
'' most cases will produce the same cycle count.
''
'' You can avoid the bulk of the context switches by simply
'' inserting a delay of a few seconds in front of the first
'' macro call, to allow the system disk cache activity to subside.
''
'' For the loop_count parameter of the COUNTER_BEGIN macro, larger
'' values increase the number of samples and so tend to improve
'' the consistency of the returned cycle counts, but in most cases
'' you are unlikely to see any improvement beyond a value of about
'' 1000.
''
'' Note that the block of code must be able to withstand repeated
'' looping, and that this method will not work for timing a loop
'' where the number of loops is variable, because the lowest cycle
'' count that occurs will be when the loop falls through.
''
'' These macros require a Pentium-class processor that supports
'' the CPUID (function 0 only) and RDTSC instructions.
'====================================================================
#include once "windows.bi"

dim shared _counter_tsc1_ as ulongint, _counter_tsc2_ as ulongint
dim shared _counter_overhead_ as ulongint, counter_cycles as ulongint
dim shared _counter_loop_counter_ as uinteger

sub _counter_code1_
    asm
        '
        ' Use same CPUID input value for each call.
        '
        xor eax, eax
        '
        ' Flush pipe and wait for pending ops to finish.
        '
        cpuid
        '
        ' Read Time Stamp Counter.
        '
        rdtsc
        '
        ' Save count.
        '
        mov [_counter_tsc1_], eax
        mov [_counter_tsc1_+4], edx
    end asm
end sub

sub _counter_code2_
    asm
        xor eax, eax
        cpuid
        rdtsc
        mov [_counter_tsc2_], eax
        mov [_counter_tsc2_+4], edx
    end asm
end sub

'' Unlike the #define directive, the #macro directive
'' allows inline asm.
''
#macro COUNTER_BEGIN( loop_count, priority_class )
    _counter_overhead_ = 2000000000
    counter_cycles = 2000000000
    SetPriorityClass( GetCurrentProcess(), priority_class )
    Sleep_(0)                 '' Start a new time slice
    ''
    '' The nops compensate for the 10-byte instruction (that
    '' initializes _counter_loop_counter_) between the alignment
    '' directive and the loop label, which ideally needs to be
    '' aligned on a 16-byte boundary.
    ''
    asm
      .balign 16
      nop
      nop
      nop
      nop
      nop
      nop
    end asm
    for _counter_loop_counter_ = 1 to loop_count
        _counter_code1_
        _counter_code2_
        if (_counter_tsc2_ - _counter_tsc1_) < _counter_overhead_ then
            _counter_overhead_ = _counter_tsc2_ - _counter_tsc1_
        endif
    next
    Sleep_(0)                 '' Start a new time slice
    asm
      .balign 16
      nop
      nop
      nop
      nop
      nop
      nop
    end asm
    for _counter_loop_counter_ = 1 to loop_count
        _counter_code1_
#endmacro
''
'' *** Note the open FOR loop ***
''
#define COUNTER_END _
        _counter_code2_ :_
        if (_counter_tsc2_ - _counter_tsc1_) < counter_cycles then :_
            counter_cycles = _counter_tsc2_ - _counter_tsc1_ :_
        endif :_
    next :_
    SetPriorityClass( GetCurrentProcess(), NORMAL_PRIORITY_CLASS ) :_
    counter_cycles -= _counter_overhead_

EDIT: I finally got around to updating the examples here, after ~2 years :(

This is a test app that measures and displays the clock cycle counts for various operations in pure FB code.

Code: Select all

'====================================================================
#include once "windows.bi"
#include once "\crt\string.bi"
#include "counter.bas"
'====================================================================

Dim As uint x,y,z = 1
Dim As Double d1,d2

'---------------------------------------------------------------------
'' Delay for a few seconds to allow the system activities involved in
'' launching the program to finish. This will reduce the number of
'' interruptions in the counter loops, and substantially improve the
'' consistency of the results.
'---------------------------------------------------------------------

sleep 3000

counter_begin( 1000, HIGH_PRIORITY_CLASS )
  x = 1
counter_end
Print "x = 1       : ";counter_cycles;" cycles"

counter_begin( 1000, HIGH_PRIORITY_CLASS )
  x = x + 2
counter_end
Print "x = x + 2   : ";counter_cycles;" cycles"

counter_begin( 1000, HIGH_PRIORITY_CLASS )
  x += 2
counter_end
Print "x += 2      : ";counter_cycles;" cycles"

counter_begin( 1000, HIGH_PRIORITY_CLASS )
  x = x * 3
counter_end
Print "x = x * 3   : ";counter_cycles;" cycles"

counter_begin( 1000, HIGH_PRIORITY_CLASS )
  x *= 3
counter_end
Print "x *= 3      : ";counter_cycles;" cycles"

counter_begin( 1000, HIGH_PRIORITY_CLASS )
  x = x * 4
counter_end
Print "x = x * 4   : ";counter_cycles;" cycles"

counter_begin( 1000, HIGH_PRIORITY_CLASS )
  x *= 4
counter_end
Print "x *= 4      : ";counter_cycles;" cycles"

counter_begin( 1000, HIGH_PRIORITY_CLASS )
  x = x / 3
counter_end
Print "x = x / 3   : ";counter_cycles;" cycles"

counter_begin( 1000, HIGH_PRIORITY_CLASS )
  x /= 3
counter_end
Print "x /= 3      : ";counter_cycles;" cycles"

counter_begin( 1000, HIGH_PRIORITY_CLASS )
  x = x / 4
counter_end
Print "x = x / 4   : ";counter_cycles;" cycles"

counter_begin( 1000, HIGH_PRIORITY_CLASS )
  x /= 4
counter_end
Print "x /= 4      : ";counter_cycles;" cycles"

counter_begin( 1000, HIGH_PRIORITY_CLASS )
  x = x \ 3
counter_end
Print "x = x \ 3   : ";counter_cycles;" cycles"

counter_begin( 1000, HIGH_PRIORITY_CLASS )
  x \= 3
counter_end
Print "x \= 3      : ";counter_cycles;" cycles"

counter_begin( 1000, HIGH_PRIORITY_CLASS )
  x = x \ 4
counter_end
Print "x = x \ 4   : ";counter_cycles;" cycles"

counter_begin( 1000, HIGH_PRIORITY_CLASS )
  x \= 4
counter_end
Print "x \= 4      : ";counter_cycles;" cycles"

counter_begin( 1000, HIGH_PRIORITY_CLASS )
  x = y / z
counter_end
Print "x = y / z   : ";counter_cycles;" cycles"

counter_begin( 1000, HIGH_PRIORITY_CLASS )
  x = y \ z
counter_end
Print "x = y \ z   : ";counter_cycles;" cycles"

counter_begin( 1000, HIGH_PRIORITY_CLASS )
  x = y Mod z
counter_end
Print "x = y mod z : ";counter_cycles;" cycles"

counter_begin( 1000, HIGH_PRIORITY_CLASS )
  x = y ^ z
counter_end
Print "x = y ^ z   : ";counter_cycles;" cycles"

counter_begin( 1000, HIGH_PRIORITY_CLASS )
  x = y * y
counter_end
Print "x = y * y   : ";counter_cycles;" cycles"

counter_begin( 1000, HIGH_PRIORITY_CLASS )
  x = y ^ 2
counter_end
Print "x = y ^ 2   : ";counter_cycles;" cycles"

counter_begin( 1000, HIGH_PRIORITY_CLASS )
  x = Sqr(y)
counter_end
Print "x = sqr(y)  : ";counter_cycles;" cycles"

counter_begin( 1000, HIGH_PRIORITY_CLASS )
  d1 = Sin(d2)
counter_end
Print "d1 = sin(d2): ";counter_cycles;" cycles"

Sleep

And this is a test app that tests several versions of a memset procedure, and compares the execution clock cycles to that of the CRT memset function. The memset_for_unroll and memset_asm procedures are somewhat optimized for execution speed. I tried several versions of the memset_asm code, including a complex one that was essentially an assembly version of the memset_for_unroll procedure, before I settled on the current version. Running on my P3, memset_asm is faster than the CRT memset function only for buffer lengths that are not a multiple of 4. For large buffers I think an MMX or SSE version could be significantly faster than the CRT memset function.

Code: Select all

'==============================================================================
#include once "windows.bi"
#include once "\crt\string.bi"
#include "counter.bas"
'==============================================================================

Sub memset_for( dest As Byte Ptr, Byval c As uint, Byval count As uint )
    Dim As uint i
    For i = 0 To count - 1
        *(dest+i) = c
    Next
End Sub

'==============================================================================

Sub memset_for_unroll( dest As Byte Ptr, Byval c As uint, Byval count As uint )
    Dim As uint i, offset, dwordc
    dwordc = c + c Shl 8 + c Shl 16 + c Shl 24
    If count Shr 5 Then
        For i = 0 To count - 32 Step 32
            offset = cast(uint,dest) + i
            *(cast(dword Ptr,offset)) = dwordc
            *(cast(dword Ptr,offset+4)) = dwordc
            *(cast(dword Ptr,offset+8)) = dwordc
            *(cast(dword Ptr,offset+12)) = dwordc
            *(cast(dword Ptr,offset+16)) = dwordc
            *(cast(dword Ptr,offset+20)) = dwordc
            *(cast(dword Ptr,offset+24)) = dwordc
            *(cast(dword Ptr,offset+28)) = dwordc
        Next
    Endif
    If count And 31 Then
        For i = i To count - 1
            *(dest+i) = c
        Next
    Endif
End Sub

'==============================================================================

Sub memset_asm( Byval dest As Byte Ptr, Byval c As uint, Byval count As uint )
    asm
        mov eax, [c]
        Or  eax, eax
        jz  dozero
        mov ebx, [c]
        mov esi, [c]
        Shl eax, 8
        Shl ebx, 16
        Or  eax, [c]
        Shl esi, 24
        Or  eax, ebx
        Or  eax, esi
    dozero:
        mov ecx, [count]
        mov edi, [dest]
        Shr ecx, 2
        jz  dobytes
        rep stosd
    dobytes:
        mov ecx, [count]
        And ecx, 3
        jz  fini
    byteloop:
        mov Byte Ptr[edi], al
        add edi, 1
        Sub ecx, 1
        jnz byteloop
    fini:
    End asm
End Sub

'==============================================================================

#define MEMSIZE 1000

Dim pMem1 As Byte Ptr, pMem2 As Byte Ptr

pMem1 = allocate( MEMSIZE )
pMem2 = allocate( MEMSIZE )
If pMem1 = 0 Or pMem2 = 0 Then
    Print "allocation failed"
    Sleep
    End 1                     '' Do not continue with a null buffer
Endif
memset( pMem1, 0, MEMSIZE )
memset( pMem2, 1, MEMSIZE )
memset_for pMem2, 0, MEMSIZE
Print "test memset_for: ";memcmp( pMem1, pMem2, MEMSIZE )
memset( pMem2, 1, MEMSIZE )
memset_for_unroll pMem2, 0, MEMSIZE
Print "test memset_for_unroll: ";memcmp( pMem1, pMem2, MEMSIZE )
memset( pMem2, 1, MEMSIZE )
memset_asm pMem2, 0, MEMSIZE
Print "test memset_asm: ";memcmp( pMem1, pMem2, MEMSIZE )

'==============================================================================

'---------------------------------------------------------------------
'' Delay for a few seconds to allow the system activities involved in
'' launching the program to finish. This will reduce the number of
'' interruptions in the counter loops, and substantially improve the
'' consistency of the results.
'---------------------------------------------------------------------

sleep 3000

counter_begin( 1000, HIGH_PRIORITY_CLASS )
  memset( pMem2, 0, MEMSIZE )
counter_end
Print "memset: ";counter_cycles;" cycles"

counter_begin( 1000, HIGH_PRIORITY_CLASS )
  memset_for pMem1, 0, MEMSIZE
counter_end
Print "memset_for: ";counter_cycles;" cycles"

counter_begin( 1000, HIGH_PRIORITY_CLASS )
  memset_for_unroll pMem1, 0, MEMSIZE
counter_end
Print "memset_for_unroll: ";counter_cycles;" cycles"

counter_begin( 1000, HIGH_PRIORITY_CLASS )
  memset_asm pMem1, 0, MEMSIZE
counter_end
Print "memset_asm: ";counter_cycles;" cycles"

deallocate pMem1
deallocate pMem2

Sleep

yetifoot · Post by **yetifoot** » May 20, 2006 23:17

Hi, very interesting. I was recently looking at this myself. I understand that CPUID is a serialising instruction, and that is why it is useful for clock counters, but I found that it doesn't work as expected on my P4, even in DOS. I get differing times for the counter overhead almost every time, which throws off my results. Also, even with serialising, the same piece of code will take a different amount of time to execute. Do you experience these problems? I asked about this elsewhere and was told it's only worth trying to count cycles on P1/PPro; after that it's pointless, because of all the crazy branch prediction, pipelines, caching, out-of-order execution, etc., although I understand little of this.

EDIT

I just tried the memset test. I ran it about 25 times.

memset_for 10040 cycles (this count every time)
memset_for_unroll 1532 cycles (this count every time)
memset_asm 344-360 (varies)
memset 344-360 (varies)

I found that whichever of memset_asm and memset came last in the code was always the fastest. I guess this could be down to caching, but I'm quite new to this area.

I use a P4, which seems an interesting CPU. Every simple trick from the old CPUs seems to fail.

i.e.:
it's faster to inc/dec than add
mov eax, 0 is faster than xor eax, eax
rep commands are faster than manual loop

There's a lot of interesting stuff on http://flatassembler.net/ on this subject

MichaelW · Post by **MichaelW** » May 21, 2006 0:24

If you did not control the eax value on input to cpuid then this could explain the variation under DOS. I have seen some Intel example code where the author neglected to do this. As an example (this code does not bother to determine which cpuid functions are supported):

Code: Select all

'=========================================================================
#include once "windows.bi"
#include once "\crt\string.bi"
#include "counter.bas"
'=========================================================================

'---------------------------------------------------------------------
'' Delay for a few seconds to allow the system activities involved in
'' launching the program to finish. This will reduce the number of
'' interruptions in the counter loops, and substantially improve the
'' consistency of the results.
'---------------------------------------------------------------------

sleep 3000

counter_begin( 1000, HIGH_PRIORITY_CLASS )
  asm mov eax, 0
  asm cpuid
counter_end
Print "cpuid function 0: ";counter_cycles;" cycles"

counter_begin( 1000, HIGH_PRIORITY_CLASS )
  asm mov eax, 1
  asm cpuid
counter_end
Print "cpuid function 1: ";counter_cycles;" cycles"

counter_begin( 1000, HIGH_PRIORITY_CLASS )
  asm mov eax, 2
  asm cpuid
counter_end
Print "cpuid function 2: ";counter_cycles;" cycles"

Sleep

The results are very consistent on my P3:

Code: Select all

cpuid function 0: 79 cycles
cpuid function 1: 81 cycles
cpuid function 2: 104 cycles

As were the results for the first test app:

Code: Select all

x = 1       : 0 cycles
x = x + 2   : 1 cycles
x += 2      : 1 cycles
x = x * 3   : 6 cycles
x *= 3      : 6 cycles
x = x * 4   : 1 cycles
x *= 4      : 1 cycles
x = x / 3   : 39 cycles
x /= 3      : 39 cycles
x = x / 4   : 39 cycles
x /= 4      : 39 cycles
x = x \ 3   : 39 cycles
x \= 3      : 39 cycles
x = x \ 4   : 1 cycles
x \= 4      : 1 cycles
x = y / z   : 56 cycles
x = y \ z   : 41 cycles
x = y mod z : 42 cycles
x = y ^ z   : 204 cycles
x = y * y   : 7 cycles
x = y ^ 2   : 52 cycles
x = sqr(y)  : 41 cycles
d1 = sin(d2): 30 cycles

And only slightly less consistent for the memset app:

Code: Select all

memset: 339 cycles
memset_for: 11601 cycles
memset_for_unroll: 2176 cycles
memset_asm: 344 cycles

At least among assembly language programmers, counting clock cycles is a common method of timing code, and it will work on all x86 processors that support the necessary instructions. Variation is a normal consequence of operating in a preemptive multi-tasking environment. The current version of the MASM macros that I have been using (and may now change as a result of my experience with these macros) just relies on long loops to average out the variation, which is effective in most, but not all, cases. The method I used here, explained in the header comments, cannot eliminate all variation, but assuming a system that does not have an abnormal number of tasks running, or very intensive tasks running, the variations should be small. For the purpose of optimization absolute repeatability would be a good thing, but it is not a necessity. If multiple trials can be run quickly then it’s easy to approximate a mean value.

The variations with the position of the code may have something to do with alignment. I neglected to explain that I did not attempt to align the code from within the macros because it was not convenient to do so. I assumed that most of the timed code would be in procedures, for which the FB compiler generates alignment code. For assembly code you can align as necessary.

Also, I neglected to state that my optimizations were tested on a P3 only, so the code could be far from optimal for other processors. And for the asm version it’s entirely possible that someone with better optimization skills could make the code run significantly faster even on a P3. I’ve been put to shame more than once with this kind of stuff :)

yetifoot · Post by **yetifoot** » May 21, 2006 1:07

If you did not control the eax value on input to cpuid then this could explain the variation under DOS.

Perhaps I did neglect to clear eax. Your code seems much more consistent than the code I made, and I'm testing on the same CPU. Using your counter on a few examples, it seems much more precise than the one I made.

I’ve been put to shame more than once with this kind of stuff :)

I get put to shame on just about everything I code in asm :)

1000101 · Post by **1000101** » May 21, 2006 8:38

yetifoot wrote: I understand that CPUID is a serialising instruction, and that is why it is useful for clock counters, but I found that it doesn't work as expected on my P4, even in DOS. I get differing times for the counter overhead almost every time, which throws off my results. Also, even with serialising, the same piece of code will take a different amount of time to execute. Do you experience these problems?

In DOS: Did you turn interrupts off?
In [Multithreading OS]: Did you use a large enough sample run to eliminate the occasional spike due to task switches?

That it sometimes takes longer than at other times is typically a result of another piece of software hijacking the CPU between the timing code. This is why you want to turn interrupts off in DOS or use large samples in [Multithreading OS]. Further, this timing information is only useful in the context of comparing code modifications as you attempt to optimize the code, not for general-purpose timings.

MichaelW · Post by **MichaelW** » May 21, 2006 11:35

1000101 wrote: In [Multithreading OS]: Did you use a large enough sample run to eliminate the occasional spike due to task switches?

Using large samples is an effective method, but large samples mean long run times, and long run times mean that the results will include many interruptions. Another method is to collect a moderate number of samples, reject any that are too far out of line, and use the mean value of the samples that remain. And another method, the one I selected, is to collect a moderate number of samples and use the smallest.
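The three approaches can be compared side by side on one set of samples. This is a minimal sketch, not the code the macros use: the function names, the fixed rejection cutoff and the sample values are all made up for illustration, with the two large values standing in for passes that were interrupted by a context switch.

```freebasic
'' Three ways of reducing a set of cycle-count samples to one figure.
'' All names and sample values here are hypothetical.

Function sample_mean( samples() As ULongInt ) As Double
    Dim As Double total = 0
    For i As Integer = LBound( samples ) To UBound( samples )
        total += samples( i )
    Next
    Return total / ( UBound( samples ) - LBound( samples ) + 1 )
End Function

'' Mean of the samples that do not exceed limit (a cutoff chosen by
'' the caller); returns 0 if every sample is rejected.
Function trimmed_mean( samples() As ULongInt, ByVal limit As ULongInt ) As Double
    Dim As Double total = 0
    Dim As Integer kept = 0
    For i As Integer = LBound( samples ) To UBound( samples )
        If samples( i ) <= limit Then
            total += samples( i )
            kept += 1
        End If
    Next
    If kept = 0 Then Return 0
    Return total / kept
End Function

Function sample_min( samples() As ULongInt ) As ULongInt
    Dim As ULongInt lowest = samples( LBound( samples ) )
    For i As Integer = LBound( samples ) To UBound( samples )
        If samples( i ) < lowest Then lowest = samples( i )
    Next
    Return lowest
End Function

Dim As ULongInt samples( 0 To 5 ) = { 41, 40, 42, 5040, 40, 912 }

Print "mean of all samples : "; sample_mean( samples() )
Print "mean, outliers cut  : "; trimmed_mean( samples(), 100 )
Print "smallest sample     : "; sample_min( samples() )
```

With these made-up values the plain mean is pulled above 1000 by the two interrupted passes, the trimmed mean is 40.75, and the minimum is 40. The minimum needs no cutoff to be chosen, which is one reason it is convenient for short code sequences.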

Further, using this timing information is only useful in the context of comparing code modifications as you are attempting to optimize the code, not general purpose timings.

Knowledge of instruction clock cycle counts is also useful in the context of selecting the best instructions, or HLL statements, to use when you are designing code.

I’m not sure what you mean by general-purpose timings. When it comes to code execution speed, clock cycle counts are less variable across systems than timings, because where clock cycle counts are mostly a function of the processor family, timings are a function of the processor family and, more significantly, the processor clock speed.
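The relationship between the two measures is simply time = cycles / clock frequency. The sketch below uses the 339-cycle memset count posted earlier in the thread; the helper name and the two clock speeds are assumptions chosen for illustration, and it assumes the cycle count stays the same on both machines because they belong to the same processor family.

```freebasic
'' Elapsed time in nanoseconds for a given cycle count at a given
'' processor clock speed. The helper name and the clock speeds are
'' illustrative assumptions, not measurements.
Function cycles_to_ns( ByVal cycles As ULongInt, ByVal clock_mhz As Double ) As Double
    Return cycles * 1000.0 / clock_mhz
End Function

'' The same 339-cycle count at two hypothetical clock speeds:
Print "at  500 MHz: "; cycles_to_ns( 339,  500 ); " ns"
Print "at 1000 MHz: "; cycles_to_ns( 339, 1000 ); " ns"
```

The cycle count is the same in both lines, while the elapsed time halves when the clock speed doubles.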

yetifoot · Post by **yetifoot** » May 21, 2006 13:31

In DOS: Did you turn interrupts off?
In [Multithreading OS]: Did you use a large enough sample run to eliminate the occasional spike due to task switches?

Under DOS I used CLI, STI. I am too new to understand about NMIs though.
Under Windows I used SetPriorityClass (although I found that if I run too long a test, the machine hangs)

I think my problem must have been neglecting to clear EAX before CPUID, because my code was almost the same as Michael's. His code doesn't cause me any problems though; it seems to work very well.

Further, using this timing information is only useful in the context of comparing code modifications as you are attempting to optimize the code, not general purpose timings.

Yes, I was just using it on little fragments, to see which order of instructions, or which loop constructs, were faster on my CPU. I think though in reality, if I do optimize I'll use the optimizations that work well on older machines like the P1, which are the machines that need the optimizations the most.

1000101 · Post by **1000101** » May 21, 2006 16:37

Also when optimizing, the biggest thing to optimize is inner loops. Optimizing an obscure function only called twice in the program is meaningless. It's the core loops which are run thousands of times per frame which mean something.
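The point can be put in numbers with the usual formula for overall speedup (Amdahl's law): if a part of the program accounts for a fraction p of the run time and is made s times faster, the whole program speeds up by 1 / ((1 - p) + p / s). The sketch below is illustrative only; the function name and the two fractions are hypothetical.

```freebasic
'' Overall speedup when a part of the program that accounts for
'' fraction (0..1) of the run time is made local_speedup times faster.
Function overall_speedup( ByVal fraction As Double, ByVal local_speedup As Double ) As Double
    Return 1.0 / ( ( 1.0 - fraction ) + fraction / local_speedup )
End Function

'' Hypothetical figures: doubling the speed of code that takes 0.1% of
'' the run time, versus code that takes 80% of it.
Print "obscure function: "; overall_speedup( 0.001, 2 )
Print "inner loop      : "; overall_speedup( 0.8,   2 )
```

Doubling the speed of the rarely called function changes the total by about 0.05%, while the same improvement to the inner loop makes the whole program roughly 1.67 times faster.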

yetifoot · Post by **yetifoot** » May 21, 2006 17:41

1000101 wrote:Also when optimizing, the biggest thing to optimize is inner loops. Optimizing an obscure function only called twice in the program is meaningless. It's the core loops which are run thousands of times per frame which mean something.

That's why I love HLLs like FB that allow inline ASM. The code generated by FB is good enough most of the time, but those really intensive inner loops can often get a significant boost even with my limited knowledge of ASM.

MichaelW · Post by **MichaelW** » May 21, 2006 20:36

yetifoot wrote: Under Windows I used SetPriorityClass (although I found that if I run too long a test, the machine hangs)

The only hangs that I have encountered were with REALTIME_PRIORITY_CLASS, and this is why I use HIGH_PRIORITY_CLASS instead. REALTIME_PRIORITY_CLASS causes your process to preempt too many important system processes, and if you make a mistake that causes your test code to hang, or just run too long, then Windows hangs.

yetifoot · Post by **yetifoot** » May 21, 2006 21:03

REALTIME

Yes, that would be me making another newbie mistake. Thanks for posting this, I've definitely learnt a few things. I've grown tired of asking questions on ASM forums; usually I get flamed for being a newb. ASM forums seem worse than C forums in that respect.

I3I2UI/I0 · Post by **I3I2UI/I0** » Aug 28, 2008 10:11

SetThreadAffinityMask( GetCurrentProcess(), 1) 'only Core 0 on MultiCore-CPU

Code: Select all

''
#macro COUNTER_BEGIN( loop_count, priority_class )
    _counter_overhead_ = 2000000000
    counter_cycles = 2000000000
    SetThreadAffinityMask( GetCurrentProcess(), 1) 'only Core 0 on MultiCore-CPU
    SetPriorityClass( GetCurrentProcess(), priority_class )
    Sleep_(0)                 '' Start a new time slice
    ''

I'm not sure it helps, but it can't hurt either.
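One thing to watch in the line above: as documented, the first parameter of SetThreadAffinityMask is a thread handle, while GetCurrentProcess() returns a process pseudo-handle, so the call as posted is likely to fail and leave the affinity unchanged. A minimal sketch of the thread-handle form follows; it is an assumption about what was intended, not a tested change to the macros, and the variable name is hypothetical.

```freebasic
#include once "windows.bi"

'' Pin the calling thread to logical processor 0 (bit 0 of the mask).
'' SetThreadAffinityMask returns the previous mask, or 0 on failure.
Dim As UInteger previous_mask = SetThreadAffinityMask( GetCurrentThread(), 1 )

If previous_mask = 0 Then
    Print "SetThreadAffinityMask failed, error "; GetLastError()
Else
    Print "previous affinity mask: "; Hex( previous_mask )
End If
```

Checking the return value shows whether the call took effect. To restrict every thread of the process instead, SetProcessAffinityMask( GetCurrentProcess(), 1 ) is the call that takes a process handle.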

shadfurman · Post by **shadfurman** » Mar 18, 2010 21:21

I know this is an old post, but I've been using a For/Next loop of 10,000,000+ iterations and the time difference. I didn't even know this was possible! But it doesn't work. I'm sure some of this is due to compiler changes, as it says that "option explicit" is deprecated; however, using -lang fblite, qb or deprecated doesn't yield results either. I just keep getting "Variable not declared, counter_begin", but it's declared in the counter.bas file and the file is included.

I'm not a great programmer, I only learn enough to do what I want to do right now so take that into account.

MichaelW · Post by **MichaelW** » Mar 19, 2010 0:30

Sorry, I forgot to update the example code in this thread as I promised when I updated the macros. The examples were all for the first version of the macros, not the current version. I have now updated the examples in this thread. There are probably a few examples elsewhere that were coded for the first version of the macros, but I don’t really have the time to find and update them, and once you know what the problem is it’s easy to fix.

elizas · Post by **elizas** » May 07, 2010 6:59

ASPX:

<%@ Page Language="C#" AutoEventWireup="true" CodeFile="SpeedTest.aspx.cs" Inherits="SpeedTest" %>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head runat="server">
<title>Test Speed of the code</title>
</head>
<body>
<form id="form1" runat="server">
<div>
<%-- To Display the elapsed time--%>
<asp:Label ID="lblShowTime" runat="server" Text="Label"></asp:Label>
</div>
</form>
</body>
</html>

http://www.mindfiresolutions.com/Measur ... ET-880.php

SetThreadAffinityMask( GetCurrentProcess(), 1) 'only Core 0

Completely didn't work for me...

Measure execution time of a block of code in .NET