Need help with ASM

fzabkar · Post by **fzabkar** » Mar 15, 2021 14:11

I need to convert this code ...

    For i = 0 To dwNumLongs - 1
        
        dwinptrB[ i ] Or= dwinptrS[ i ]
        
    Next i

... to ASM to improve execution speed.

Code: Select all

    ASM
        mov ecx, [dwNumLongs]
        mov esi, [dwinptrS]
        mov edi, [dwinptrB]
L1:     lodsd                   ' load EAX with contents of address pointed to by ESI - ESI increments by 4
        ?????                   ' load EBX with contents of address pointed to by EDI - no increment in EDI
        or eax, ebx             ' bitwise OR infile1 and infile2
        stosd                   ' copy contents of EAX to address pointed to by EDI - EDI increments by 4
        dec ecx                 ' decrement ECX
        jg L1                   ' jump to L1 if greater than 0
   
    End ASM

The above code performs a bitwise OR of two memory blocks, dwinptrB and dwinptrS. Ideally I would like to replace ????? with a suitable op code, if such a thing exists.

Thanks.

marcov · Post by **marcov** » Mar 15, 2021 14:51

mov ebx, [edi]

fzabkar · Post by **fzabkar** » Mar 15, 2021 18:50

Thanks.

caseih · Post by **caseih** » Mar 15, 2021 18:57

I am genuinely curious to know what the speed difference is. That's such a simple operation. I'd be seriously surprised if the compiler didn't match the hand-coded assembly. According to stackoverflow, if you perform your operation back to front (decrement in the loop instead of increment), it will be faster because of how the CPU cache operates.

fzabkar · Post by **fzabkar** » Mar 15, 2021 19:15

The application requires that every file on drive #1 be OR-ed with the corresponding file on drive #2. In fact it's a data recovery scenario where the two copies of a file may have bad sectors in different locations. Bad sectors are zero filled, so a composite file can be created by OR-ing the two damaged files. Drives can have capacities of 8TB or more, so any saving in execution time would add up.

My program is here:

http://www.users.on.net/~fzabkar/FreeBa ... lmerg6.bas

I compile it for Win32. The program is executed in batch mode, once for each file. I tried to get it to recurse directories, but I failed.

marcov · Post by **marcov** » Mar 15, 2021 21:10

Then you want AVX2 .

badidea · Post by **badidea** » Mar 15, 2021 21:28

You have 8TB discs that are so fast that optimizing assembly code makes a difference?

fzabkar · Post by **fzabkar** » Mar 15, 2021 23:00

badidea wrote:You have 8TB discs that are so fast that optimizing assembly code makes a difference?

If I can save 1 nanosecond per OR, then that's 8000 seconds saved, ie about 2 hours. Or am I having a brain fart? (Maybe that should be 2000 seconds for 32-bit ORs)

I have a Core 2 Duo.

jj2007 · Post by **jj2007** » Mar 15, 2021 23:13

Code: Select all

Dim as long i, dwNumLongs=99999999
Dim as long dwinptrB(dwNumLongs)
Dim as long dwinptrS(dwNumLongs)
Dim t as double
For i = 0 To dwNumLongs - 1
	dwinptrB(i)=Rnd()*12345
	dwinptrS(i)=Rnd()*12345
Next i
  print i; " elements"
for outerloop As integer=1 to 5
t=Timer()
#if 0
  Print "Gcc:  ";
	' asm int 3
  For i = 0 To dwNumLongs - 1
	dwinptrB(i) Or = dwinptrS(i)
	' asm nop
  Next i
#else
  Print "asm: ";
	Dim as integer ps, pd
	ps=@dwinptrS(0)
	pd=@dwinptrB(0)
  asm
  	' int 3
	mov ecx, [dwNumLongs]
	mov esi, [ps]
	mov edi, [pd]
L0:	dec ecx
	js L1
	mov eax, [esi+4*ecx]
	or [edi+4*ecx], eax
	jmp L0
L1:
 end asm
#endif
t=Timer()-t
print int(t*1000); " ms for or'ing"; dwNumLongs; " elements"
Next
For i = 0 To 9
	print dwinptrB(i)
Next

  sleep

Results:

Code: Select all

asm:  133 ms for or'ing 99999999 elements
asm:  148 ms for or'ing 99999999 elements
asm:  136 ms for or'ing 99999999 elements
asm:  125 ms for or'ing 99999999 elements
asm:  124 ms for or'ing 99999999 elements

Gcc:   157 ms for or'ing 99999999 elements
Gcc:   142 ms for or'ing 99999999 elements
Gcc:   152 ms for or'ing 99999999 elements
Gcc:   128 ms for or'ing 99999999 elements
Gcc:   131 ms for or'ing 99999999 elements

Gas:   353 ms for or'ing 99999999 elements
Gas:   349 ms for or'ing 99999999 elements
Gas:   342 ms for or'ing 99999999 elements
Gas:   351 ms for or'ing 99999999 elements
Gas:   333 ms for or'ing 99999999 elements

Gcc32, under the hood:

Code: Select all

  lea esi, [esi]                         ; align 16
L0:
  mov edx, [local.23]
  add edx, eax                           ; addr dest
  mov ecx, [local.15]
  mov ecx, [eax+ecx]                     ; val src
  or [edx], ecx
  add eax, 4
  cmp eax, 17D783FC
  jne short L0

20% faster:

Code: Select all

  asm
	mov ecx, [dwNumLongs]
	mov esi, [ps]
	mov edi, [pd]
	dec ecx
L0:	mov eax, [esi+4*ecx]
	or [edi+4*ecx], eax
	dec ecx
	jns L0
  end asm

In any case, parallel processing with SIMD or AVX would be a lot faster. Try this one (a factor 5 faster than Gcc with -O3):

Code: Select all

	Dim as integer ps, pd
	ps=@dwinptrS(0)
	pd=@dwinptrB(0)
  asm
	mov ecx, [dwNumLongs]
	lea ecx, [ecx-16]
	mov esi, [ps]
	mov edi, [pd]
L0:	movups xmm0, [esi+ecx]
	movups xmm1, [edi+ecx]
	por xmm0, xmm1
	movups [edi+ecx], xmm0
	sub ecx, 16
	jns L0
  end asm

badidea · Post by **badidea** » Mar 16, 2021 0:22

fzabkar wrote:
badidea wrote:You have 8TB discs that are so fast that optimizing assembly code makes a difference?
If I can save 1 nanosecond per OR, then that's 8000 seconds saved, ie about 2 hours. Or am I having a brain fart? (Maybe that should be 2000 seconds for 32-bit ORs)

I have a Core 2 Duo.

That calculation sounds plausible. But the bottleneck is still the disc reading I think.
At 100 MB/s: 22 hours
At 500 MB/s: 4.4 hours
If the PC can read data (via DMA) and calculate the OR's in parallel then the process only waits for the discs.
I like to hear the results of a real benchmark (optimized vs non-optimized code).

jj2007 · Post by **jj2007** » Mar 16, 2021 0:27

badidea wrote:I like to hear the results of a real benchmark (optimized vs non-optimized code).

See above.

badidea · Post by **badidea** » Mar 16, 2021 0:45

jj2007 wrote:
badidea wrote:I like to hear the results of a real benchmark (optimized vs non-optimized code).
See above.

Does that include reading from disc?

caseih · Post by **caseih** » Mar 16, 2021 0:57

No it certainly doesn't include I/O in his little benchmark.

The OP's problem is intrinsically I/O bound. The CPU can easily outpace the disk for this sort of work. Classic case of premature optimization. I'll be seriously surprised if the assembly has any speedup at all. I keep harping on this, but you have to actually measure where in the code your CPU is spending its time before you optimize. Assembly language could be the answer, but often is not. Although you can do it for fun of course. But if this were a real-world application, this would be an improper optimization at this point. The first thing I would look at is how you're reading the files. Maybe something as simple as reading larger chunks would give the most speedup (depends on the disk and cache characteristics).

jj2007 · Post by **jj2007** » Mar 16, 2021 1:01

badidea wrote:Does that include reading from disc?

Of course not. Read caseih's answer, he is 100% right.

deltarho[1859] · Post by **deltarho[1859]** » Mar 16, 2021 15:37

Sage advice from caseih.

Maybe something as simple as reading larger chunks would give the most speedup (depends on the disk and cache characteristics).

There is another metric: What is the application doing with the chunks?

I have found that the performance is normally distributed. That is there exists a 'sweet spot' for the chunk size. On my HDD the sweet spot tends to be 256KB. The performance drops the further we are from the sweet spot. It is not a sharp peak, so we do not have to be 'bang on the button'. It should be worth the effort to find the sweet spot especially when dealing with 8TB drives. It is worth remembering that modern PCs use read ahead so whilst we are working on a buffer the system reads the next buffer and caches it. When we need it, we get the data from RAM and not disk.

Need help with ASM

Need help with ASM

Re: Need help with ASM

Re: Need help with ASM

Re: Need help with ASM

Re: Need help with ASM

Re: Need help with ASM

Re: Need help with ASM

Re: Need help with ASM

Re: Need help with ASM

Re: Need help with ASM

Re: Need help with ASM

Re: Need help with ASM

Re: Need help with ASM

Re: Need help with ASM

Re: Need help with ASM