New cycle-count macros (Windows only)

Post your FreeBASIC tips and tricks here. Please don’t post your code without including an explanation.
MichaelW
Posts: 3500
Joined: May 16, 2006 22:34
Location: USA

New cycle-count macros (Windows only)

Postby MichaelW » Jun 09, 2012 23:27

These two macros provide a convenient method of measuring the processor clock-cycle count for a block of code. The macros must be called in pairs, and the block of code, or a call to a procedure containing the block of code, must be placed between the COUNTER_BEGIN and COUNTER_END macro calls. The average per-loop cycle count, corrected for the loop overhead, is returned in the global variable counter_cycles.

These macros are substantially different than my previous macros. Where the previous macros captured the lowest cycle count that occurred in any single loop through the block of code, these macros average the cycle count over all of the loops. The main reason for the change was that while my previous macros worked well for the AMD processors and for the Intel processors that preceded and followed the Pentium IV, they were almost useless for the Pentium IV (or any of the NetBurst architecture processors AFAIK), returning cycle counts that were typically a multiple of 4 and with a wide run-to-run variation. In my test of these macros running on a Pentium IV, under Windows XP, and assuming a sufficiently high loop count and a high enough priority, the run-to-run variation was typically no more than a few cycles.

I provided access to the process priority class and the thread priority to make it possible to operate at the highest possible priority (31) by using the combination of REALTIME_PRIORITY_CLASS and THREAD_PRIORITY_TIME_CRITICAL. On a multi-core system (or even a P4 with HT) running Windows XP, doing so appears to be reasonably safe, even if the code being timed triggers an exception. But note the Microsoft warnings about operating at such a high priority for more than a very brief interval. And note also that while I can routinely use this priority level on my P3 Windows 2000 system without problems, if the timed code triggers an exception, or just takes too long to execute, the system will lock up and a reset or power cycle will be required to recover.

This code is a FreeBASIC port of a set of C macros that I did for the Microsoft compilers, where the loops and cycle-count calculations were done entirely in assembly to avoid problems with compiler optimizations breaking the code.

COUNTER.BAS:

Code: Select all

''=============================================================================
#include "windows.bi"
''=============================================================================

dim shared as longint counter_cycles
dim shared as integer _counter_loopcount_, _counter_loopcounter_
dim shared as integer _process_priority_class_, _thread_priority_

#macro COUNTER_BEGIN( loop_count, process_priority, thread_priority )
    _counter_loopcount_ = loop_count
    _process_priority_class_ = GetPriorityClass(GetCurrentProcess())
    _thread_priority_ = GetThreadPriority(GetCurrentThread())
    SetPriorityClass(GetCurrentProcess(), process_priority)
    SetThreadPriority(GetCurrentThread(), thread_priority)
    _counter_loopcounter_ = _counter_loopcount_
    asm
        xor eax, eax
        cpuid               '' serialize
        rdtsc               '' get reference loop start count
        push edx            '' preserve msd (most significant dword)
        push eax            '' preserve lsd
        xor eax, eax
        cpuid               '' serialize
        .balign 16
      0:                    '' start of reference loop
        sub DWORD PTR _counter_loopcounter_, 1
        jnz 0b              '' end of reference loop
        xor eax, eax
        cpuid               '' serialize
        rdtsc               '' get reference loop end count
        pop ecx             '' recover lsd of start count
        sub eax, ecx        '' calc lsd of reference loop count
        pop ecx             '' recover msd of start count
        sbb edx, ecx        '' calc msd of reference loop count
        push edx            '' preserve msd of reference loop count
        push eax            '' preserve lsd of reference loop count
        xor eax, eax
        cpuid               '' serialize
        rdtsc               '' get test loop start count
        push edx            '' preserve msd
        push eax            '' preserve lsd
    end asm
    _counter_loopcounter_ = _counter_loopcount_
    asm
        xor eax, eax
        cpuid               '' serialize
        .balign 16
      1:                    '' start of test loop
    end asm
#endmacro

''=============================================================================

#macro COUNTER_END()
    asm
        sub DWORD PTR _counter_loopcounter_, 1
        jnz 1b              '' end of test loop
        xor eax, eax
        cpuid               '' serialize
        rdtsc
        pop ecx             '' recover lsd of start count
        sub eax, ecx        '' calc lsd of test loop count
        pop ecx             '' recover msd of start count
        sbb edx, ecx        '' calc msd of test loop count
        pop ecx             '' recover lsd of reference loop count
        sub eax, ecx        '' calc lsd of corrected loop count
        pop ecx             '' recover msd of reference loop count
        sbb edx, ecx        '' calc msd of corrected loop count
        mov DWORD PTR [counter_cycles], eax
        mov DWORD PTR [counter_cycles+4], edx
    end asm
    SetPriorityClass(GetCurrentProcess(),_process_priority_class_)
    SetThreadPriority(GetCurrentThread(),_thread_priority_)
    counter_cycles /= _counter_loopcount_
#endmacro

''=============================================================================

Code: Select all

''=============================================================================
#include "counter.bas"
''=============================================================================

''----------------------------------------------------------------------------
'' Classic binary divide-and-conquer popcount.
'' This is popcount_2() from
'' http://en.wikipedia.org/wiki/Hamming_weight
''----------------------------------------------------------------------------

#define m1 &h55555555
#define m2 &h33333333
#define m4 &h0f0f0f0f
#define h01 &h01010101

function popcount_2( byval x as uinteger ) as integer
    x -= (x shr 1) and m1
    x = (x and m2) + ((x shr 2) and m2)
    x = (x + (x shr 4)) and m4
    x += x shr 8
    return (x + (x shr 16)) and &h3f
end function

''----------------------------------------------------------------------------
'' Popcount using multiply.
'' This is popcount_3() from
'' http://en.wikipedia.org/wiki/Hamming_weight
''----------------------------------------------------------------------------

function popcount_mult( byval x as uinteger ) as integer
    x -= (x shr 1) and m1
    x = (x and m2) + ((x shr 2) and m2)
    x = (x + (x shr 4)) and m4
    return (x * h01) shr 24
end function

''=============================================================================

dim as uinteger x
dim as integer r

print popcount_2(1), popcount_mult(1)
print popcount_2(3), popcount_mult(3)
print popcount_2(7), popcount_mult(7)
print popcount_2(8), popcount_mult(8)
print

sleep 5000

''-------------------------------------------------------------------------
'' This is necessary to get consistent results on multi-core processors
'' (or on the nominally single-core P4 with HT). It ensures that Windows
'' will restrict the threads of the process to running on only a single
'' core, and so avoid the problem with the TSCs not being synchronized
'' between cores.
''-------------------------------------------------------------------------

SetProcessAffinityMask( GetCurrentProcess(), 1)

for i as integer = 1 to 4
    counter_begin( 10000000, REALTIME_PRIORITY_CLASS, THREAD_PRIORITY_TIME_CRITICAL )
    counter_end()
    print counter_cycles;" cycles"
    counter_begin( 10000000, REALTIME_PRIORITY_CLASS, THREAD_PRIORITY_TIME_CRITICAL )
        r = popcount_2(3)
    counter_end()
    print counter_cycles;" cycles"
    counter_begin( 10000000, REALTIME_PRIORITY_CLASS, THREAD_PRIORITY_TIME_CRITICAL )
        r = popcount_mult(3)
    counter_end()
    print counter_cycles;" cycles"
    x = 3
    counter_begin( 10000000, REALTIME_PRIORITY_CLASS, THREAD_PRIORITY_TIME_CRITICAL )
        x -= (x shr 1) and m1
        x = (x and m2) + ((x shr 2) and m2)
        x = (x + (x shr 4)) and m4
        x += x shr 8
        r = (x + (x shr 16)) and &h3f
    counter_end()
    print counter_cycles;" cycles"
    x = 3
    counter_begin( 10000000, REALTIME_PRIORITY_CLASS, THREAD_PRIORITY_TIME_CRITICAL )
        x -= (x shr 1) and m1
        x = (x and m2) + ((x shr 2) and m2)
        x = (x + (x shr 4)) and m4
        r = (x * h01) shr 24
    counter_end()
    print counter_cycles;" cycles"
    print
next

sleep


Running on a P3 under Windows 2000:

Code: Select all

 1             1
 2             2
 3             3
 1             1

 0 cycles
 44 cycles
 32 cycles
 31 cycles
 22 cycles

 0 cycles
 44 cycles
 32 cycles
 31 cycles
 22 cycles

 0 cycles
 44 cycles
 32 cycles
 31 cycles
 21 cycles

 0 cycles
 44 cycles
 32 cycles
 31 cycles
 22 cycles

Running on a P4 (Northwood) under Windows XP:

Code: Select all

 1             1
 2             2
 3             3
 1             1

 0 cycles
 48 cycles
 79 cycles
 56 cycles
 73 cycles

 -1 cycles
 48 cycles
 79 cycles
 56 cycles
 73 cycles

 0 cycles
 49 cycles
 78 cycles
 56 cycles
 73 cycles

 0 cycles
 48 cycles
 79 cycles
 57 cycles
 73 cycles

And FWIW, the cycle counts for a C version compiled with the Microsoft VC Toolkit 2003 compiler with /O2 /G6 optimizations, running on a P3 under Windows 2000:

Code: Select all

 0 cycles
16 cycles
12 cycles

0 cycles
16 cycles
12 cycles

0 cycles
16 cycles
12 cycles

0 cycles
16 cycles
12 cycles

And on a P4 (Northwood) under Windows XP:

Code: Select all

0 cycles
9 cycles
4 cycles

0 cycles
9 cycles
4 cycles

0 cycles
9 cycles
4 cycles

0 cycles
9 cycles
4 cycles

Edit2:

Regarding the differences in the cycle counts returned by the new macros, compared to the old, I have more confidence in the new macros because:

The loops and cycle-count calculations are done entirely in assembly, so I have complete control over the instruction sequence and direct control over the loop label alignment.

The serialization and RTSC code are inline, instead of in separate procedures, so the code is (probably) more cache friendly.

I remembered to serialize both before and after RTSC.

Edit: Corrected problems in the GetThreadPriority and SetThreadPriority calls where I specified GetCurrentProcess instead of GetCurrentThread.
Last edited by MichaelW on Jul 12, 2012 2:58, edited 3 times in total.
pestery
Posts: 493
Joined: Jun 16, 2007 2:00
Location: Australia

Re: New cycle-count macros (Windows only)

Postby pestery » Jun 10, 2012 2:28

I've been using the old version for a long time and it helped a lot when trying to speed up code (before I started using -gen gcc -O max), although the new version seems to runs about twice as fast as the old version on my computer. The results are still consistent so, I guess it doesn't matter. By the way, my processor is an AMD 64 X2. Thanks for the update :-)
ytwinky
Posts: 217
Joined: Dec 03, 2005 12:44
Location: MD, Germany

Re: New cycle-count macros (Windows only)

Postby ytwinky » Jun 11, 2012 16:44

Hi MichaelW,
I wanted to update my (german) tutorial: http://www.freebasic-portal.de/tutorial ... it-26.html
Using your counter.bas and your example-program on my P4[(Northwood, 2.7GHz, 32 bit), 2GB Ram, XP~Home(SP3++), fb 0.23] I get results similar to yours(+-1). If I run my example-prog I get a lot of negative results:

Image
..and I'm not quite sure how to interpret them correctly..
Should I buy a new computer or OS?^^ Should I install an other fb-version? Or am I doing simply something wrong?
I hope you find some time to answer me ;-))
Kind regards
ytwinky
counting_pine
Site Admin
Posts: 6180
Joined: Jul 05, 2005 17:32
Location: Manchester, Lancs

Re: New cycle-count macros (Windows only)

Postby counting_pine » Jun 11, 2012 18:41

Looks like you should definitely keep that computer. If ever you have code that's running too slowly, adding in a few thousand i+=2 instructions to the inner loops should give it a nice speed boost.

Perhaps in a good multitasking system you can have a process constantly executing i+=2, and it might generate extra cycles for the other processes to share.
MichaelW
Posts: 3500
Joined: May 16, 2006 22:34
Location: USA

Re: New cycle-count macros (Windows only)

Postby MichaelW » Jun 12, 2012 3:42

When the cycle count for a section of code drops below zero it exerts a sucking effect, making the code above it run faster :)

Negative counts result when the empty reference loop takes more cycles than the test loop. I had been assuming that when this happened it was because of one or more context switches in the reference loop. And while that may be the reason for the occasional negative count, in this case it does not explain counts that are almost always negative.

Code: Select all

''=============================================================================
#include "counter.bas"
''=============================================================================

dim as integer i,j,r(16)
SetProcessAffinityMask( GetCurrentProcess(), 1)
sleep 5000

for j as integer = 0 to 3

  counter_begin(10000000,REALTIME_PRIORITY_CLASS,THREAD_PRIORITY_TIME_CRITICAL)
  counter_end()
  r(j*4) = counter_cycles

  counter_begin(10000000,REALTIME_PRIORITY_CLASS,THREAD_PRIORITY_TIME_CRITICAL)
    i = 1
  counter_end()
  r(j*4+1) = counter_cycles

  counter_begin(10000000,REALTIME_PRIORITY_CLASS,THREAD_PRIORITY_TIME_CRITICAL)
    i = i + 2
  counter_end()
  r(j*4+2) = counter_cycles

  counter_begin(10000000,REALTIME_PRIORITY_CLASS,THREAD_PRIORITY_TIME_CRITICAL)
    i += 2
  counter_end()
  r(j*4+3) = counter_cycles

next

for j as integer = 0 to 15
  print  r(j)
next

sleep


Typical results on my P3 system under Windows 2000:

Code: Select all

 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0

Typical results on my P4 (Northwood) system under Windows XP:

Code: Select all

 0
 -10
 -11
 -11
 0
 -8
 -11
 -10
 0
 -9
 -11
 -10
 2
 -10
 -10
 -11

Perhaps there is an explanation to be found in the double-pumped ALU, or in the references I recall seeing to the TSC running at the core frequency but updating at a lower frequency. In any case, you cannot expect meaningful counts for sequences of instructions that execute in less than one clock cycle, and probably not for sequences that execute in a few clock cycles.
nobozoz
Posts: 238
Joined: Nov 17, 2005 6:24
Location: Chino Hills, CA, USA

Re: New cycle-count macros (Windows only)

Postby nobozoz » Jun 12, 2012 3:45

MichaelW-

I compiled your new version on my WIN2KSP3, P4 system using FreeBASIC Compiler - Version 0.24.0 (05-21-2012) for win32.

I occasionally get -1 and rarely -2 cycles on the first pass through the code sequence

counter_begin( 10000000, REALTIME_PRIORITY_CLASS, THREAD_PRIORITY_TIME_CRITICAL )
counter_end()
Maybe 3.05 GHz, 1 GB of RAMBUS causes problems. There are just (3) HIGH PRIORITY tasks in the TaskManager (including TaskManager).

-----
1 1
2 2
3 3
1 1

0 cycles
58 cycles
79 cycles
65 cycles
73 cycles

0 cycles
48 cycles
79 cycles
56 cycles
74 cycles

-1 cycles
52 cycles
79 cycles
56 cycles
73 cycles

-2 cycles
48 cycles
88 cycles
56 cycles
74 cycles
-----
MichaelW
Posts: 3500
Joined: May 16, 2006 22:34
Location: USA

Re: New cycle-count macros (Windows only)

Postby MichaelW » Jun 12, 2012 7:16

AFAIK REALTIME_PRIORITY_CLASS + THREAD_PRIORITY_TIME_CRITICAL preempts everything, so while the counter code is running it is the only thing running on the core.
ytwinky
Posts: 217
Joined: Dec 03, 2005 12:44
Location: MD, Germany

Re: New cycle-count macros (Windows only)

Postby ytwinky » Jun 12, 2012 19:07

counting_pine wrote:Looks like you should definitely keep that computer.

Ooooh, I'm glad you said this ;-))
About your tip concerning i+=2: Maybe I should install Win7, perhaps I could use these advantages then..

@MichaelW: Thank you for your answer, I'll try to translate and understand it.
And I will try your example and look at the results..
..maybe tomorrow, when I have got time for it..
Regards
ytwinky

Return to “Tips and Tricks”

Who is online

Users browsing this forum: No registered users and 1 guest