Execution speed of nested loops

New to FreeBASIC? Post your questions here.
fzabkar
Posts: 154
Joined: Sep 29, 2018 2:52
Location: Australia

Execution speed of nested loops

Post by fzabkar »

I don't understand why there is a difference in execution speed between these two versions of my code. Essentially it counts the frequency of 1 bits in each bit position of a block of 32-bit words.

The first loop is faster than the second version (72 secs versus 86 secs for 1.2GB file).

Code: Select all

    For i = 0 To dwNumLongs - 1
    
        For j = 0 To 31
            
            uliSumOne32( j ) -= Bit( dwInptr[ i ], j )
        
        Next j
        
    Next i

Code: Select all

    For j = 0 To 31
    
        For i = 0 To dwNumLongs - 1
        
            uliSumOne32( j ) -= Bit( dwInptr[ i ], j )
        
        Next i
        
    Next j
adeyblue
Posts: 299
Joined: Nov 07, 2019 20:08

Re: Execution speed of nested loops

Post by adeyblue »

Imagine your'e doing a survey of everybody who lives in a street.

You could go to each house. and while you're there, you talk to everybody in it before moving on to the next house. (the first code)

Or you could go a house, talk to one person, go to the next house. talk to one person... and then when you're done visiting each house, you go back to the first one to talk to the second person, go to the next house, talk to the second person... and then when you're done talking to the second people, you go back to the first house and talk to the third person etc (the second code)

Just because it's computers and numbers and not clipboards and people doesn't make the second any less laborious of an algorithm
fzabkar
Posts: 154
Joined: Sep 29, 2018 2:52
Location: Australia

Re: Execution speed of nested loops

Post by fzabkar »

So, IIUC, the outermost loop should always have the longest range, and the innermost loop the shortest?
MrSwiss
Posts: 3910
Joined: Jun 02, 2013 9:27
Location: Switzerland

Re: Execution speed of nested loops

Post by MrSwiss »

fzabkar wrote:So, IIUC, the outermost loop should always have the longest range, and the innermost loop the shortest?
Nope, not necessarily.

The point is this (relating to your example):
- outer loop = next fixed-size variable (e.g. going through a array)
- inner loop = go through the whole fixed-size variable (one at the time)
This is of course always depending on the issue to be solved at the time.
marcov
Posts: 3455
Joined: Jun 16, 2005 9:45
Location: Netherlands
Contact:

Re: Execution speed of nested loops

Post by marcov »

What MrSwiss says. probably the inner loop 0..31 can hold the dword to be mutated in a register value, only storing it after the loop.

That's a lot of load/store operations eliminated, and if dwnumlongs is high, less cache polution
jj2007
Posts: 2326
Joined: Oct 23, 2016 15:28
Location: Roma, Italia
Contact:

Re: Execution speed of nested loops

Post by jj2007 »

fzabkar wrote:(72 secs versus 86 secs for 1.2GB file)
It's all about cache, and adeyblue's example is brilliant. Your cpu has probably some MB of data cache; for example, a Core i7 has 8MB of level 3 cache. When you make "big" steps in the innermost loop, the cpu has to load huge volumes of data with each iteration, and that costs lots of cycles.
fzabkar
Posts: 154
Joined: Sep 29, 2018 2:52
Location: Australia

Re: Execution speed of nested loops

Post by fzabkar »

Sorry, I should have mentioned that I'm running a Core 2 Duo on a 32-bit Windows 10 platform.

I've just realised that LongInt computations occur a significant penalty in 32-bit mode. I've now reduced the execution time from 72s to 59s by amending the code as follows.

Code: Select all

    For i = 0 To dwNumLongs - 1

        For j = 0 To 31
            
            dwSumOne32( j ) -= Bit( dwInptr[ i ], j )
        
        Next j

    Next i
    
    
    For j = 0 To 31
        
        uliSumOne32( j ) += dwSumOne32( j )
        
    Next j
This is my bit counting tool:

http://www.users.on.net/~fzabkar/FreeBa ... ount32.bas
http://www.users.on.net/~fzabkar/FreeBa ... ount32.exe

... and this is the discussion that gave rise to it:

https://groups.google.com/g/datarecover ... shfKZ--t5A
deltarho[1859]
Posts: 4292
Joined: Jan 02, 2017 0:34
Location: UK
Contact:

Re: Execution speed of nested loops

Post by deltarho[1859] »

Not always possible but it may be worthwhile to use a reverse loop rather than a forward loop.
fzabkar
Posts: 154
Joined: Sep 29, 2018 2:52
Location: Australia

Re: Execution speed of nested loops

Post by fzabkar »

I reversed each loop in turn, but there was no difference in execution time.

However, there was one weird result. When loop counter, i, is defined as a ULongInt, the execution time is 59s. Otherwise, if I redefine it as a ULong, the execution time is then 72s.
caseih
Posts: 2157
Joined: Feb 26, 2007 5:32

Re: Execution speed of nested loops

Post by caseih »

This is a situation where someone like @jj2007 and assembly can win (at least until you hit the I/O limit). On linux I looked at the C code emitted by a loop that involves the bit() keyword, and I certainly see room for improvement there. Here's the FB code I tested, followed by the C code emitted for the bit() line:

Code: Select all

dim as integer b
dim as integer a
dim as integer x

b = 231
for x=0 to 31
    a = bit(b,x)
    print a
next

Code: Select all

A$0 = (int64)-((B$0 & (1ll << X$0)) != 0ll);
At least bit() is not implemented as a function call! I do not know if the assembly emitted is the same but I suspect it is similar. Right away we can see that for every bit it's doing a shift left using the variable for the number of bits. In other words it has to be computed from that variable each time. So an obvious optimization is to unroll that loop---yes it means 32 lines of code---so that the shift value is a constant. Also the way it's written, the CPU has to fetch the same value from memory 32 times (although it should be cached) in addition to recomputing the shift value 32 times. Also bit() has a negation that's really not needed for your purposes (if you add the sums instead of subtract), so that's a redundant operation too.

But you can do much better than that with assembly. In a loop over 32 bits, for each bit shift the value right and test the carry flag and only add if the flag is set.
MrSwiss
Posts: 3910
Joined: Jun 02, 2013 9:27
Location: Switzerland

Re: Execution speed of nested loops

Post by MrSwiss »

caseih wrote:On linux I looked at the C code emitted by a loop that involves the bit() keyword, and I certainly see room for improvement there.

Code: Select all

A$0 = (int64)-((B$0 & (1ll << X$0)) != 0ll);
At least bit() is not implemented as a function call!
Everyone that can read the language-reference knows, that Bit() is implemented as single-line macro:

Code: Select all

#define Bit( value, bit_number ) (((value) And (Cast(TypeOf(value), 1) Shl (bit_number))) <> 0)
fzabkar
Posts: 154
Joined: Sep 29, 2018 2:52
Location: Australia

Re: Execution speed of nested loops

Post by fzabkar »

I've written an ASM routine, but there is one op code which I can't get right.

Code: Select all

inc [dwSumOne32(31)]
dwSumOne32(n) is an array of Longs which I'm trying to increment whenever the carry bit is set.

I do this 32 times in a long loop.
MrSwiss
Posts: 3910
Joined: Jun 02, 2013 9:27
Location: Switzerland

Re: Execution speed of nested loops

Post by MrSwiss »

The assemby instruction inc typically works on registers:

Code: Select all

ASM
	' ...
	inc 	cl
	shl	eax, cl
	' ...
End ASM
fzabkar
Posts: 154
Joined: Sep 29, 2018 2:52
Location: Australia

Re: Execution speed of nested loops

Post by fzabkar »

This is what I'm trying to do:

Code: Select all

    ASM
        mov ecx, [dwNumLongs]
        mov esi, [dwInptr]
L31:    lodsd                    ' load EAX with contents of address pointed to by ESI - ESI increments by 4
        shl eax, 1               ' shift EAX left 1 bit
        jnc L30
        inc [dwSumOne32(31)]
L30:    shl eax, 1
        jnc L29
        inc [dwSumOne32(30)]
L29:    shl eax, 1
        jnc L28
        inc [dwSumOne32(29)]
L28:    shl eax, 1
        jnc L27
        inc [dwSumOne32(28)]
L27:    shl eax, 1
        jnc L26
        inc [dwSumOne32(27)]
L26:    shl eax, 1
        jnc L25
        inc [dwSumOne32(26)]
L25:    shl eax, 1
        jnc L24
        inc [dwSumOne32(25)]
L24:    shl eax, 1
        jnc L23
        inc [dwSumOne32(24)]
L23:    shl eax, 1
        jnc L22
        inc [dwSumOne32(23)]
L22:    shl eax, 1
        jnc L21
        inc [dwSumOne32(22)]
L21:    shl eax, 1
        jnc L20
        inc [dwSumOne32(21)]
L20:    shl eax, 1
        jnc L19
        inc [dwSumOne32(20)]
L19:    shl eax, 1
        jnc L18
        inc [dwSumOne32(19)]
L18:    shl eax, 1
        jnc L17
        inc [dwSumOne32(18)]
L17:    shl eax, 1
        jnc L16
        inc [dwSumOne32(17)]
L16:    shl eax, 1
        jnc L15
        inc [dwSumOne32(16)]
L15:    shl eax, 1
        jnc L14
        inc [dwSumOne32(15)]
L14:    shl eax, 1
        jnc L13
        inc [dwSumOne32(14)]
L13:    shl eax, 1
        jnc L12
        inc [dwSumOne32(13)]
L12:    shl eax, 1
        jnc L11
        inc [dwSumOne32(12)]
L11:    shl eax, 1
        jnc L10
        inc [dwSumOne32(11)]
L10:    shl eax, 1
        jnc L09
        inc [dwSumOne32(10)]
L09:    shl eax, 1
        jnc L08
        inc [dwSumOne32(9)]
L08:    shl eax, 1
        jnc L07
        inc [dwSumOne32(8)]
L07:    shl eax, 1
        jnc L06
        inc [dwSumOne32(7)]
L06:    shl eax, 1
        jnc L05
        inc [dwSumOne32(6)]
L05:    shl eax, 1
        jnc L04
        inc [dwSumOne32(5)]
L04:    shl eax, 1
        jnc L03
        inc [dwSumOne32(4)]
L03:    shl eax, 1
        jnc L02
        inc [dwSumOne32(3)]
L02:    shl eax, 1
        jnc L01
        inc [dwSumOne32(2)]
L01:    shl eax, 1
        jnc L00
        inc [dwSumOne32(1)]
L00:    shl eax, 1
        jnc L31
        inc [dwSumOne32(0)]
        dec ecx                 ' decrement ECX
        jg L31                  ' jump to L31 if greater than 0
   
    End ASM
The compiler doesn't appear to resolve "dwSumOne32(n)" for any op code.
fzabkar
Posts: 154
Joined: Sep 29, 2018 2:52
Location: Australia

Re: Execution speed of nested loops

Post by fzabkar »

MrSwiss wrote:The assemby instruction inc typically works on registers:

Code: Select all

ASM
	' ...
	inc 	cl
	shl	eax, cl
	' ...
End ASM
The ASM docs use this example:

Code: Select all

' Assuming "n" is a FB global or local ULONG variable
mov  eax, [n]        ' OK: size is apparent from eax
inc  [n]             ' Not OK: size is not given
inc  dword [n]       ' Not OK: size given, but still not accepted by GAS
inc  dword Ptr [n]   ' OK: "ptr" is needed by GAS here
But even if I use "dword Ptr" I still run into trouble. It seems that ASM doesn't like arrays.
Post Reply