Need help with ASM

New to FreeBASIC? Post your questions here.
fzabkar
Posts: 154
Joined: Sep 29, 2018 2:52
Location: Australia

Need help with ASM

Post by fzabkar »

I need to convert this code ...

Code: Select all

    For i = 0 To dwNumLongs - 1
        
        dwinptrB[ i ] Or= dwinptrS[ i ]
        
    Next i
... to ASM to improve execution speed.

Code: Select all

    ASM
        mov ecx, [dwNumLongs]
        mov esi, [dwinptrS]
        mov edi, [dwinptrB]
L1:     lodsd                   ' load EAX with contents of address pointed to by ESI - ESI increments by 4
        ?????                   ' load EBX with contents of address pointed to by EDI - no increment in EDI
        or eax, ebx             ' bitwise OR infile1 and infile2
        stosd                   ' copy contents of EAX to address pointed to by EDI - EDI increments by 4
        dec ecx                 ' decrement ECX
        jg L1                   ' jump to L1 if greater than 0
   
    End ASM
The above code performs a bitwise OR of two memory blocks, dwinptrB and dwinptrS. Ideally I would like to replace ????? with a suitable op code, if such a thing exists.

Thanks.
marcov
Posts: 3462
Joined: Jun 16, 2005 9:45
Location: Netherlands
Contact:

Re: Need help with ASM

Post by marcov »

mov ebx, [edi]
fzabkar
Posts: 154
Joined: Sep 29, 2018 2:52
Location: Australia

Re: Need help with ASM

Post by fzabkar »

Thanks.
caseih
Posts: 2157
Joined: Feb 26, 2007 5:32

Re: Need help with ASM

Post by caseih »

I am genuinely curious to know what the speed difference is. That's such a simple operation. I'd be seriously surprised if the compiler didn't match the hand-coded assembly. According to stackoverflow, if you perform your operation back to front (decrement in the loop instead of increment), it will be faster because of how the CPU cache operates.
fzabkar
Posts: 154
Joined: Sep 29, 2018 2:52
Location: Australia

Re: Need help with ASM

Post by fzabkar »

The application requires that every file on drive #1 be OR-ed with the corresponding file on drive #2. In fact it's a data recovery scenario where the two copies of a file may have bad sectors in different locations. Bad sectors are zero filled, so a composite file can be created by OR-ing the two damaged files. Drives can have capacities of 8TB or more, so any saving in execution time would add up.

My program is here:

http://www.users.on.net/~fzabkar/FreeBa ... lmerg6.bas

I compile it for Win32. The program is executed in batch mode, once for each file. I tried to get it to recurse directories, but I failed.
marcov
Posts: 3462
Joined: Jun 16, 2005 9:45
Location: Netherlands
Contact:

Re: Need help with ASM

Post by marcov »

Then you want AVX2 .
badidea
Posts: 2591
Joined: May 24, 2007 22:10
Location: The Netherlands

Re: Need help with ASM

Post by badidea »

You have 8TB discs that are so fast that optimizing assembly code makes a difference?
fzabkar
Posts: 154
Joined: Sep 29, 2018 2:52
Location: Australia

Re: Need help with ASM

Post by fzabkar »

badidea wrote:You have 8TB discs that are so fast that optimizing assembly code makes a difference?
If I can save 1 nanosecond per OR, then that's 8000 seconds saved, ie about 2 hours. Or am I having a brain fart? (Maybe that should be 2000 seconds for 32-bit ORs)

I have a Core 2 Duo.
jj2007
Posts: 2326
Joined: Oct 23, 2016 15:28
Location: Roma, Italia
Contact:

Re: Need help with ASM

Post by jj2007 »

Code: Select all

Dim as long i, dwNumLongs=99999999
Dim as long dwinptrB(dwNumLongs)
Dim as long dwinptrS(dwNumLongs)
Dim t as double
For i = 0 To dwNumLongs - 1
	dwinptrB(i)=Rnd()*12345
	dwinptrS(i)=Rnd()*12345
Next i
  print i; " elements"
for outerloop As integer=1 to 5
t=Timer()
#if 0
  Print "Gcc:  ";
	' asm int 3
  For i = 0 To dwNumLongs - 1
	dwinptrB(i) Or = dwinptrS(i)
	' asm nop
  Next i
#else
  Print "asm: ";
	Dim as integer ps, pd
	ps=@dwinptrS(0)
	pd=@dwinptrB(0)
  asm
  	' int 3
	mov ecx, [dwNumLongs]
	mov esi, [ps]
	mov edi, [pd]
L0:	dec ecx
	js L1
	mov eax, [esi+4*ecx]
	or [edi+4*ecx], eax
	jmp L0
L1:
 end asm
#endif
t=Timer()-t
print int(t*1000); " ms for or'ing"; dwNumLongs; " elements"
Next
For i = 0 To 9
	print dwinptrB(i)
Next

  sleep
Results:

Code: Select all

asm:  133 ms for or'ing 99999999 elements
asm:  148 ms for or'ing 99999999 elements
asm:  136 ms for or'ing 99999999 elements
asm:  125 ms for or'ing 99999999 elements
asm:  124 ms for or'ing 99999999 elements

Gcc:   157 ms for or'ing 99999999 elements
Gcc:   142 ms for or'ing 99999999 elements
Gcc:   152 ms for or'ing 99999999 elements
Gcc:   128 ms for or'ing 99999999 elements
Gcc:   131 ms for or'ing 99999999 elements

Gas:   353 ms for or'ing 99999999 elements
Gas:   349 ms for or'ing 99999999 elements
Gas:   342 ms for or'ing 99999999 elements
Gas:   351 ms for or'ing 99999999 elements
Gas:   333 ms for or'ing 99999999 elements
Gcc32, under the hood:

Code: Select all

  lea esi, [esi]                         ; align 16
L0:
  mov edx, [local.23]
  add edx, eax                           ; addr dest
  mov ecx, [local.15]
  mov ecx, [eax+ecx]                     ; val src
  or [edx], ecx
  add eax, 4
  cmp eax, 17D783FC
  jne short L0
20% faster:

Code: Select all

  asm
	mov ecx, [dwNumLongs]
	mov esi, [ps]
	mov edi, [pd]
	dec ecx
L0:	mov eax, [esi+4*ecx]
	or [edi+4*ecx], eax
	dec ecx
	jns L0
  end asm
In any case, parallel processing with SIMD or AVX would be a lot faster. Try this one (a factor 5 faster than Gcc with -O3):

Code: Select all

	Dim as integer ps, pd
	ps=@dwinptrS(0)
	pd=@dwinptrB(0)
  asm
	mov ecx, [dwNumLongs]
	lea ecx, [ecx-16]
	mov esi, [ps]
	mov edi, [pd]
L0:	movups xmm0, [esi+ecx]
	movups xmm1, [edi+ecx]
	por xmm0, xmm1
	movups [edi+ecx], xmm0
	sub ecx, 16
	jns L0
  end asm
badidea
Posts: 2591
Joined: May 24, 2007 22:10
Location: The Netherlands

Re: Need help with ASM

Post by badidea »

fzabkar wrote:
badidea wrote:You have 8TB discs that are so fast that optimizing assembly code makes a difference?
If I can save 1 nanosecond per OR, then that's 8000 seconds saved, ie about 2 hours. Or am I having a brain fart? (Maybe that should be 2000 seconds for 32-bit ORs)

I have a Core 2 Duo.
That calculation sounds plausible. But the bottleneck is still the disc reading I think.
At 100 MB/s: 22 hours
At 500 MB/s: 4.4 hours
If the PC can read data (via DMA) and calculate the OR's in parallel then the process only waits for the discs.
I like to hear the results of a real benchmark (optimized vs non-optimized code).
jj2007
Posts: 2326
Joined: Oct 23, 2016 15:28
Location: Roma, Italia
Contact:

Re: Need help with ASM

Post by jj2007 »

badidea wrote:I like to hear the results of a real benchmark (optimized vs non-optimized code).
See above.
badidea
Posts: 2591
Joined: May 24, 2007 22:10
Location: The Netherlands

Re: Need help with ASM

Post by badidea »

jj2007 wrote:
badidea wrote:I like to hear the results of a real benchmark (optimized vs non-optimized code).
See above.
Does that include reading from disc?
caseih
Posts: 2157
Joined: Feb 26, 2007 5:32

Re: Need help with ASM

Post by caseih »

No it certainly doesn't include I/O in his little benchmark.

The OP's problem is intrinsically I/O bound. The CPU can easily outpace the disk for this sort of work. Classic case of premature optimization. I'll be seriously surprised if the assembly has any speedup at all. I keep harping on this, but you have to actually measure where in the code your CPU is spending its time before you optimize. Assembly language could be the answer, but often is not. Although you can do it for fun of course. But if this were a real-world application, this would be an improper optimization at this point. The first thing I would look at is how you're reading the files. Maybe something as simple as reading larger chunks would give the most speedup (depends on the disk and cache characteristics).
Last edited by caseih on Mar 16, 2021 1:11, edited 1 time in total.
jj2007
Posts: 2326
Joined: Oct 23, 2016 15:28
Location: Roma, Italia
Contact:

Re: Need help with ASM

Post by jj2007 »

badidea wrote:Does that include reading from disc?
Of course not. Read caseih's answer, he is 100% right.
deltarho[1859]
Posts: 4308
Joined: Jan 02, 2017 0:34
Location: UK
Contact:

Re: Need help with ASM

Post by deltarho[1859] »

Sage advice from caseih.
Maybe something as simple as reading larger chunks would give the most speedup (depends on the disk and cache characteristics).
There is another metric: What is the application doing with the chunks?

I have found that the performance is normally distributed. That is there exists a 'sweet spot' for the chunk size. On my HDD the sweet spot tends to be 256KB. The performance drops the further we are from the sweet spot. It is not a sharp peak, so we do not have to be 'bang on the button'. It should be worth the effort to find the sweet spot especially when dealing with 8TB drives. It is worth remembering that modern PCs use read ahead so whilst we are working on a buffer the system reads the next buffer and caches it. When we need it, we get the data from RAM and not disk.
Post Reply