64-bit op codes in 32-bit OS
Re: 64-bit op codes in 32-bit OS
In fact, this is not really a problem of a reserved/illegal word for FreeBASIC but rather for the assembler.
'qword = &H8877665544332211' is compiled and executed correctly, but
'movq mm0, [qword] ' is compiled as 'movq mm0,QWORD PTR ds:0x8'
So it obviously fails when executed, because that address is not accessible.
Re: 64-bit op codes in 32-bit OS
@fzabkar
I would add emms after storing the result to Function. emms is necessary when using the MMX registers because they are mapped onto the FPU registers; if you don't use emms after using the MMX registers, it leaves the FPU registers in a mess.
see https://www.felixcloutier.com/x86/emms
Re: 64-bit op codes in 32-bit OS
I'm no SIMD expert but AFAIK there's no good reason to use MMX when SSE2+ is available (or some higher version, SSE3 maybe?).
Use the movsd instruction (move scalar double-precision with zero-extension, meaning move a single 64-bit value into the lower half of a register). It works with unaligned memory.
Code: Select all
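Function BytSwap8( Byval int64 As uLongInt ) As uLongInt
    Dim Swap8Mask As Const uLongInt = &H0607040502030001
    ASM
        movsd xmm1, [Swap8Mask]   ' load the 8-byte shuffle mask into the low half of xmm1
        movsd xmm0, [int64]       ' load the 64-bit argument into the low half of xmm0
        pshufb xmm0, xmm1         ' shuffle the bytes according to the mask (pshufb needs SSSE3)
        movsd [Function], xmm0    ' store the low 64 bits as the return value
    End ASM
End Function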
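To see what a given pshufb mask does without touching inline ASM, here is a minimal sketch in plain FreeBASIC. Shuffle8 is a hypothetical helper, not something from this thread; it only emulates the shuffle for the low 8 bytes, assuming every mask index is in the range 0-7 (true for both masks that appear in this thread).

```freebasic
' Hypothetical helper: emulates pshufb on the low 8 bytes only.
' Result byte i is taken from the source byte selected by mask byte i;
' if bit 7 of the mask byte is set, the result byte is zero.
Function Shuffle8( ByVal value As ULongInt, ByVal mask As ULongInt ) As ULongInt
    Dim As ULongInt result = 0
    For i As Integer = 0 To 7
        Dim As ULongInt sel = (mask Shr (i * 8)) And &HFF
        If (sel And &H80) = 0 Then
            ' Use the low 3 bits as the source byte index (0-7)
            Dim As ULongInt b = (value Shr ((sel And 7) * 8)) And &HFF
            result = result Or (b Shl (i * 8))
        End If
    Next
    Return result
End Function

' Mask used in the later posts: full 8-byte reversal
Print Hex( Shuffle8( &H8877665544332211, &H0001020304050607 ), 16 )   ' 1122334455667788
' Mask used in the code above: swaps the two bytes inside each 16-bit word
Print Hex( Shuffle8( &H8877665544332211, &H0607040502030001 ), 16 )   ' 7788556633441122
Sleep
```

The two masks give different results, so it is worth checking which one matches the byte order you actually need.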
Re: 64-bit op codes in 32-bit OS
Thanks, but the SSE2+ code takes twice as long to execute as my original 32-bit ASM code (on my Core 2 Duo).
Re: 64-bit op codes in 32-bit OS
Yes it's very slow (about 5x slower for me), apparently because of the unaligned memory accesses. Unfortunately FB doesn't have a way to set the alignment of variables.
I thought you were just interested, and didn't really care about speed since you said all the data is streaming from disk. I don't think you'll beat the version with bswap for speed. In fact, if I comment out the bswaps it only speeds up negligibly; I can't even reliably measure the difference. All the time is spent on the overhead of calling a function, including moving arguments and results to/from the stack.
Re: 64-bit op codes in 32-bit OS
While testing the speed of the 2 methods, I can confirm that srvarldez is totally right: not using emms can cause weird issues.
Test with the first code block below and look at the value of tt printed the second time.
I also tested the speed of mm (with emms enabled) vs xmm, using the second code block below:
gas 32 bit --> mm = 10.56s / xmm = 8.29s
gas 64 bit --> mm = 3.53s / xmm = 1.98s
Code: Select all
Function BytSwap8_mm( Byval int64 As uLongInt ) As uLongInt
    Dim Swap8Mask As Const uLongInt = &H0001020304050607
    ASM
        movq mm1, [Swap8Mask]
        movq mm0, [int64]
        pshufb mm0, mm1 ' Swap8Mask
        movq [Function], mm0
        'emms
    End ASM
End Function

dim as double tt=timer
print tt
BytSwap8_mm(&H8877665544332211)
print tt
Sleep
Code: Select all
Function BytSwap8_xmm( Byval int64 As uLongInt ) As uLongInt
    Dim Swap8Mask As Const uLongInt = &H0001020304050607
    ASM
        movsd xmm1, [Swap8Mask]
        movsd xmm0, [int64]
        pshufb xmm0, xmm1
        movsd [Function], xmm0
    End ASM
End Function

Function BytSwap8_mm( Byval int64 As uLongInt ) As uLongInt
    Dim Swap8Mask As Const uLongInt = &H0001020304050607
    ASM
        movq mm1, [Swap8Mask]
        movq mm0, [int64]
        pshufb mm0, mm1
        movq [Function], mm0
        emms
    End ASM
End Function

dim as double tt=timer
for i as integer =1 to 500000000
    BytSwap8_mm(&H8877665544332211)
Next
print timer-tt

tt=timer
for i as integer =1 to 500000000
    BytSwap8_xmm(&H8877665544332211)
Next
print timer-tt
Sleep
Re: 64-bit op codes in 32-bit OS
Afaik MM and xmm (SSE2) registers don't overlay the coprocessor.
Only MMX and 3DNOW did that. I'm not 100% sure about SSE-1 (mm* registers without x), it was before I got interested in SIMD, but iirc they are also separate (I find webpages about increasing speed using copro and SSE1 interleaved).
The unaligned penalties should mostly go away for CPUs released after 2010. Haswell (4th generation) and later, however, suffer from a shuffle bottleneck. Ivy Bridge can emit more 128-bit shuffles than Haswell (but Haswell can do AVX256, with ymm).
Note also that XMM can do two swaps per pshufb. Coding a whole loop in assembler (giving it a pointer and a count) is advisable for bulk endian swapping.
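A rough sketch of the "pointer and a count" shape suggested above, so the call overhead is paid once per buffer instead of once per value. To keep it simple this uses plain FreeBASIC shifts and masks rather than assembler (a swapped-in technique, not the pshufb loop itself); BulkSwap8 and its parameter names are made up for illustration, and the loop body is where an ASM version would go.

```freebasic
' Hypothetical bulk routine: reverses the byte order of n 64-bit values in place.
Sub BulkSwap8( ByVal p As ULongInt Ptr, ByVal n As Integer )
    For i As Integer = 0 To n - 1
        Dim As ULongInt v = p[i]
        ' swap adjacent bytes
        v = ((v And &H00FF00FF00FF00FF) Shl 8) Or ((v Shr 8) And &H00FF00FF00FF00FF)
        ' swap adjacent 16-bit words
        v = ((v And &H0000FFFF0000FFFF) Shl 16) Or ((v Shr 16) And &H0000FFFF0000FFFF)
        ' swap the two 32-bit halves
        p[i] = (v Shl 32) Or (v Shr 32)
    Next
End Sub

Dim As ULongInt buf(0 To 2) = { &H8877665544332211, &H0102030405060708, 0 }
BulkSwap8( @buf(0), 3 )
Print Hex( buf(0), 16 )   ' 1122334455667788
Print Hex( buf(1), 16 )   ' 0807060504030201
Sleep
```

Whether an ASM loop with pshufb beats this depends on the CPU, so it should be timed on real data before committing to either.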
Re: 64-bit op codes in 32-bit OS
Yes, SSE and SSE2 registers are the same (xmm*), they don't and never did overlay the x87 registers. In hindsight overlaying those registers is regarded as a really dumb idea that resulted in the obsolescence of MMX and its replacement with SSE. I believe the reason they did that was not to save transistors but so that MMX could be used without requiring an OS update (to save/restore the MMX registers when switching processes).