Filled Box, ASM

General FreeBASIC programming questions.
noop
Posts: 130
Joined: Sep 18, 2006 10:29

Filled Box, ASM

Postby noop » Aug 30, 2015 18:14

Hey guys,

it's the first time for me using assembler. I implemented a horizontal line which is much faster than FreeBasic's implementation. Perhaps this is due to the overhead, FreeBasic still has to check whether the line is horizontal or something else (among other things).

I then expanded my code to fill a box,

Code: Select all

line buffer,(x1,y1)-(x2,y2),colour,bf

and I get significant speed differences.

For very small boxes my assembler code is quicker (e.g. box size 5x5), for medium sized boxes it's slower (e.g. box size 50x50) and for large boxes it's quicker again (e.g. box size 500x500).
I don't understand where those speed differences could come from.

Do you have an idea? Also, could you check my assembler code for newbie mistakes? I just had a look at assembler today. So I know next to nothing about it ;-)

My code:

Code: Select all

#include "fbgfx.bi"
#define getPixelAddress(img,row,col) cast(any ptr,img) + _
    sizeof(FB.IMAGE) + img->pitch * row + img->bpp * col

screen 18,32
dim as FB.IMAGE ptr img = imagecreate(10,10,0)
line img,(0,0)-(img->width-1,img->height-1),rgb(222,222,222),bf
dim as uinteger colour = rgb(255,0,0)

dim as uinteger x1,x2,y1,y2
x1 = 1
x2 = 6
y1 = 1
y2 = 6

dim as uinteger numberOfRuns = 10000000

dim as double t1,t2
t1 = timer
for k as integer = 0 to numberOfRuns-1
    line img,(x1,y1)-(x2,y2),colour,bf
next k
t2 = timer
print t2-t1

t1 = timer
for k as integer = 0 to numberOfRuns-1
    dim as integer w,h
    w = x2-x1+1
    h = y2-y1+1
    dim as uinteger ptr p = getPixelAddress(img,y1,x1)
    dim as integer bytes_lineBreak = img->pitch - w*img->bpp
   
    asm
    ' Row counter.
    mov ebx, [h]
   
    ' Store the pixel address.
    mov edi, dword ptr [p]
   
    ' Store the colour.
    mov eax, dword ptr [colour]
   
    REPEAT:
        ' Column counter.
        mov ecx, [w]
       
        ' Store the 4-byte value in the register eax in the variable
        ' whose address is stored in edi and repeat this process
        ' n-times where n is the value stored in the register ecx.
        rep stosd
       
        ' Add bytes_lineBreak to the pixel address for a line break.
        add edi,[bytes_lineBreak]
       
        ' One less row to process.
        dec ebx
       
        ' Are we done?
        test ebx, ebx
        jne REPEAT
    end asm
next k
t2 = timer
print t2-t1

put (100,100),img,pset

sleep


Some timings:
large box
imgWidth=imgHeight=1000
x1=y1=100, x2=y2=600
numberOfRuns = 30,000
FB: 1.8s, MINE: 1.1s
medium box
imgWidth=imgHeight=100
x1=y1=10, x2=y2=60
numberOfRuns = 1,000,000
FB: 0.8s, MINE: 1.0s
small box
imgWidth=imgHeight=10
x1=y1=1, x2=y2=6
numberOfRuns = 10,000,000 (Edit: fixed typo)
FB: 1.5s, MINE: 0.5s
Last edited by noop on Sep 10, 2015 12:36, edited 1 time in total.
MichaelW
Posts: 3500
Joined: May 16, 2006 22:34
Location: USA

Re: Filled Box, ASM

Postby MichaelW » Sep 01, 2015 5:04

The "test ebx, ebx" is not necessary, because the "dec ebx" will set the zero flag when ebx reaches zero. And while "jne" does exactly the same thing as "jnz", using "jnz" would make it easier to understand the code.

Also, while the "rep stos_" instructions, like the "rep movs_" instructions, are very fast for large counts, they may not be the fastest for small counts.
noop
Posts: 130
Joined: Sep 18, 2006 10:29

Re: Filled Box, ASM

Postby noop » Sep 03, 2015 13:22

MichaelW wrote:The "test ebx, ebx" is not necessary, because the "dec ebx" will set the zero flag when ebx reaches zero. And while "jne" does exactly the same thing as "jnz", using "jnz" would make it easier to understand the code.

Thanks!

MichaelW wrote:Also, while the "rep stos_" instructions, like the "rep movs_" instructions, are very fast for large counts, they may not be the fastest for small counts.

Hmm, perhaps the following then explains the timings:
1) Small size: The overhead of checks in the line-call make the asm version faster.
2) Medium size: This overhead is now less of an issue (fewer calls to line ... more to do within the line call). Now the overhead of the "rep stosd" instruction shows its effect and the FB line-call is a bit faster.
3) Large size: The "rep stosd" can write the data quicker than the FB line-function. The overheads are now negligible and the asm version is faster.

One could formulate this as a function:
f(x) = (10+x) - (sqrt(x)*5+0.8*x)
x: number of pixels
10: overhead cost of calling line.
1: cost of setting one pixel with the line-call.
sqrt(x)*5: (let's assume we draw a square box) overhead cost of calling "rep stosd" (5 per line).
0.8: cost of setting one pixel in the asm-version.

I did try to replace the "rep stosd" command with a loop setting the pixels individually. However this was slower in every scenario.
D.J.Peters
Posts: 8019
Joined: May 28, 2005 3:28
Contact:

Re: Filled Box, ASM

Postby D.J.Peters » Sep 03, 2015 20:39

This old Screen/Image code works on 32/64-bit OS with clipping and all gfx modes.

Joshy

Code: Select all

' FillBox for all gfx modes by Joshy 2008

#macro fill(_T_)
  dim as _T_ c=colour
  while h
    var row=cptr(_T_ ptr,pTarget)
    for i as uinteger = 1 to w: *row=c:row+=1: next
    pTarget+=TargetPitch : h-=1
  wend
#endmacro

sub fillbox(byval pTarget as any ptr=0, _
            byval x as uinteger, _
            byval y as uinteger, _
            byval w as uinteger, _
            byval h as uinteger, _
            byval colour as ulong) ' note: max 32 bit

  Dim As Integer TargetWidth=any,TargetHeight=any,TargetBytes=any,TargetPitch=any
  Dim As boolean MustLock=any

  ' !!! no active gfx mode !!!
  If ScreenPtr() = 0 Then Exit Sub

  ' !!! wrong dimension !!!
  If w < 1 Then Exit Sub
  If h < 1 Then Exit Sub

  ' !!! screen or image !!!
  MustLock = iif(pTarget = 0,True,False)

  If MustLock Then
    ScreenInfo TargetWidth , _
               TargetHeight, _
                           , _
               TargetBytes , _
               TargetPitch
    ScreenLock
    pTarget=ScreenPtr()
  Else
    var pImage   = cptr(integer ptr,pTarget)
    TargetBytes  = pImage[1]
    TargetWidth  = pImage[2]
    TargetHeight = pImage[3]
    TargetPitch  = pImage[4]
    pTarget     += 32 ' size of image header
  End If

  ' !!! clipping !!!
  if x+w > TargetWidth  then w=TargetWidth -x
  if y+h > TargetHeight then h=TargetHeight-y
 
  ' !!! position of first pixel !!!
  pTarget+=x*TargetBytes + y*TargetPitch

  select case as const TargetBytes
  case 1
    fill(UBYTE)
  case 2
    fill(USHORT)
  case else
    fill(ULONG)
  end select

  If MustLock Then ScreenUnlock()
end sub

screenres 640,480,24 ' try 8/15/16/24/32
var image=ImageCreate(640,480)
while inkey()=""
  FillBox image,0,0,640,480,rgb(rnd*255,rnd*255,rnd*255)
  put (0,0),image,PSET
  FillBox ,100,100,440,280,rgb(rnd*255,rnd*255,rnd*255)
wend
noop
Posts: 130
Joined: Sep 18, 2006 10:29

Re: Filled Box, ASM

Postby noop » Sep 06, 2015 1:02

Thanks Joshy. This does however add a significant overhead once again. I guess one has to have procedures for each specific usage case to avoid unnecessary checks, if speed is an issue.
DamageX
Posts: 121
Joined: Nov 21, 2009 8:42

Re: Filled Box, ASM

Postby DamageX » Sep 10, 2015 6:31

Strange results. What if you run more loops?
I ran 10,000,000 iterations of the small and medium box. Presumably these hit only the L1 cache, so overhead makes all the difference.
small box
LINE routine: 2.79s
ASM routine: 0.55s
medium box
LINE routine: 13.2s
ASM routine: 8.6s
I ran 100,000 iterations of the large box. It shows practically no difference (memory bandwidth limited?)
LINE routine: 52s
ASM routine: 50s
noop
Posts: 130
Joined: Sep 18, 2006 10:29

Re: Filled Box, ASM

Postby noop » Sep 10, 2015 12:34

DamageX wrote:Strange results. What if you run more loops?

Same qualitative results.

DamageX wrote:I ran 10,000,000 iterations of the small and medium box. Presumably these hit only the L1 cache, so overhead makes all the difference.
small box
LINE routine: 2.79s
ASM routine: 0.55s
medium box
LINE routine: 13.2s
ASM routine: 8.6s
I ran 100,000 iterations of the large box. It shows practically no difference (memory bandwidth limited?)
LINE routine: 52s
ASM routine: 50s

The qualitative difference in our results, especially for the medium box, is strange. Perhaps this is based on the CPU model?

Did you really run 100,000 iterations for the large box? For the medium sized box and 10,000,000 iterations I get LINE(7.7) vs. ASM(10.0), which doesn't differ too much from your result, but with the large box and 100,000 iterations I get LINE(6.0) vs. ASM(3.8).

Note: I did make a minor mistake above. I had run the small-box-test 10 million times (instead of 1 mio. as I had written).
DamageX
Posts: 121
Joined: Nov 21, 2009 8:42

Re: Filled Box, ASM

Postby DamageX » Sep 12, 2015 5:26

My test was run on an Athlon X2 (512KB of L2 per core) at 2.7GHz. Here is another test on a Core 2 Duo at 1.6GHz. The Core 2 has 3MB of shared L2 which is enough to contain the large box (unlike the Athlon)
small box 10,000,000
4.28s
1.14s
medium box 10,000,000
34s
21s
large box 100,000
23s
14s
noop
Posts: 130
Joined: Sep 18, 2006 10:29

Re: Filled Box, ASM

Postby noop » Sep 15, 2015 1:05

Interesting. I didn't consider caching problems. It's nice to see that the assembler code is quicker in all cases for you. I will try to run this on different machines to get more results.
noop
Posts: 130
Joined: Sep 18, 2006 10:29

Re: Filled Box, ASM

Postby noop » Sep 18, 2015 15:05

The tests so far, I have run on my Laptop with a Core i7-4710HQ.

Now I have tested it on my Desktop system with a Core i5-750: (LINE vs. ASM)
small: 1.8 vs. 0.60
medium: 8.6 vs. 11.7
large: 7.2 vs. 5.3

And also on another Laptop with a Pentium B950:
small: 2.7 vs. 1.0
medium: 13.1 vs. 18.0
large: 16.8 vs. 7.1

I don't know the speed differences between L3 and L2-Cache but they all have enough L3-Cache.
DamageX
Posts: 121
Joined: Nov 21, 2009 8:42

Re: Filled Box, ASM

Postby DamageX » Sep 25, 2015 2:29

Does having speedstep/turboboost or whatever CPU speed throttling enabled affect your results? (I have it disabled)
Pim Scheffers
Posts: 46
Joined: Jun 29, 2014 17:15

Re: Filled Box, ASM

Postby Pim Scheffers » Sep 25, 2015 16:41

He,

You could try my version that I implemented a while ago in a little game that I made.
It is now in the form of a sub:

The main difference is that I don't fetch the W variable from "I guess memory" but store it in a register.

Greets, Pim

Code: Select all

declare sub gfx_box(buffer as integer ptr, x1 as integer, y1 as integer, x2 as integer, y2 as integer, col as integer, scr_size_x as integer)

screenres 1280,800,32,1

gfx_box(screenptr, 9, 9, 1269, 789, &hff0000fc, 1280)

sub gfx_box(buffer as integer ptr, x1 as integer, y1 as integer, x2 as integer, y2 as integer, col as integer, scr_size_x as integer)
    dim as integer x, y, blit_corr

    x = (x2 - x1) + 1
    y = (y2 - y1) + 1

    buffer += (y1 * scr_size_x) + x1
    blit_corr = (scr_size_x - x) shl 2

    asm
        mov edi,[buffer]
        mov edx,[y]
        mov ebx,[x]'<---|
        mov eax,[col]'  |
        nl_box:'        |
        mov ecx,ebx'<-----difference = not fetching var W but keep it in a register
        rep stosd
        add edi,[blit_corr]
        dec edx
        jnz nl_box
    end asm
end sub
noop
Posts: 130
Joined: Sep 18, 2006 10:29

Re: Filled Box, ASM

Postby noop » Oct 03, 2015 15:50

DamageX wrote:Does having speedstep/turboboost or whatever CPU speed throttling enabled affect your results? (I have it disabled)

Disabling turboboost has no effect on the quality of the results. Good thinking, though. I keep forgetting that I have this feature and it may affect my timing results.

Pim Scheffers wrote:You could try my version that I implemented a while ago in a little game that I made.
It is now in the form of a sub:

The main difference is that I don't fetch the W variable from "I guess memory" but store it in a register.

Thanks! Putting the variable into the register gives me a speedup of about 5-10% for the small-box-case.
The overall result stays the same, as expected.

Return to “General”

Who is online

Users browsing this forum: MSN [Bot] and 3 guests