SSE2: pand xmm3, 31?

srvaldez
Posts: 2650
Joined: Sep 25, 2005 21:54

Re: SSE2: pand xmm3, 31?

Postby srvaldez » Sep 26, 2018 18:58

deltarho[1859] wrote:I can post this comparison if you want.

no need, I take your word for it.
deltarho[1859]
Posts: 2852
Joined: Jan 02, 2017 0:34
Location: UK

Re: SSE2: pand xmm3, 31?

Postby deltarho[1859] » Sep 27, 2018 4:38

I managed to get a performance boost with SSE2 by using one of Agner Fog's optimizing techniques. Instead of reading a constant from memory, we can load it into a register with a little asm. However, the BASIC compiled code is still running faster. It would seem that, although SSE2 is very powerful, asm is faster if all we are doing is singleton 64-bit arithmetic. It is rather obvious putting it like that.
marcov
Posts: 3081
Joined: Jun 16, 2005 9:45
Location: Eindhoven, NL

Re: SSE2: pand xmm3, 31?

Postby marcov » Sep 27, 2018 6:31

deltarho[1859] wrote:I managed to get a performance boost with SSE2 by using one of Agner Fog's optimizing techniques. Instead of reading a constant from memory, we can load it into a register with a little asm. However, the BASIC compiled code is still running faster. It would seem that, although SSE2 is very powerful, asm is faster if all we are doing is singleton 64-bit arithmetic. It is rather obvious putting it like that.


Does your assembler code include the loop, or do you have a Basic loop with an assembler body?

Keeping intermediates in registers from one loop iteration to the next is often key.

I myself have the feeling that for floating point, SSE is faster than the x87 FPU for calculations that mostly do simple operations (+, -, *, /), by about 10-20%, and that the FPU wins out as soon as you start to use e.g. gonio, logarithms etc. (roughly 30-40% on the code itself, not overall).
Last edited by marcov on Sep 27, 2018 14:18, edited 2 times in total.
deltarho[1859]
Posts: 2852
Joined: Jan 02, 2017 0:34
Location: UK

Re: SSE2: pand xmm3, 31?

Postby deltarho[1859] » Sep 27, 2018 6:47

marcov wrote:Does your assembler code include the loop, or do you have a Basic loop with an assembler body?

No - I am simply replacing some BASIC with assembler in a function.

I must disassemble the BASIC to see what the compiler is doing. I may be able to handcraft something better then.
sean_vn
Posts: 283
Joined: Aug 06, 2012 8:26

Re: SSE2: pand xmm3, 31?

Postby sean_vn » Sep 27, 2018 13:23

You can do a movd with an ordinary register

mov eax,31
movd xmm0,eax

It's quite slow, about 20 clock cycles, I think.
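
For comparison, a small constant such as 31 can also be built inside the SSE register itself, with no memory read and no transfer from a general-purpose register: set every bit with pcmpeqd, then shift right. Agner Fog's optimization manuals describe this family of constant-generating tricks. The following is only a sketch: the wrapper name Mask31 is hypothetical, xmm0 and eax are assumed to be free to clobber, and it has not been timed against the movd version.

```freebasic
' Hypothetical helper (not from the thread): builds 31 in every
' 32-bit lane of xmm0 without reading a constant from memory.
Function Mask31() As ULong
    Dim As ULong result
    Asm
        pcmpeqd xmm0, xmm0      ' compare xmm0 with itself: every bit becomes 1
        psrld   xmm0, 27        ' logical right shift of each dword: &hFFFFFFFF Shr 27 = 31
        movd    eax, xmm0       ' copy the low dword out so it can be inspected
        mov     [result], eax
    End Asm
    Return result
End Function

Print Mask31()
```

With the mask in a register, the instruction in the thread title becomes a register-to-register pand, for example pand xmm3, xmm0.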
jj2007
Posts: 1962
Joined: Oct 23, 2016 15:28
Location: Roma, Italia

Re: SSE2: pand xmm3, 31?

Postby jj2007 » Sep 27, 2018 14:47

marcov wrote:I myself have the feeling that for floating point, SSE is faster than the x87 FPU
We have evidence in the Masm32 Laboratory that most of the time FPU and SIMD instructions are equally fast. But SIMD can do 4 DWORDs at a time - if that is possible, the speed gain is considerable, of course.
deltarho[1859]
Posts: 2852
Joined: Jan 02, 2017 0:34
Location: UK

Re: SSE2: pand xmm3, 31?

Postby deltarho[1859] » Sep 27, 2018 16:00

@sean_vn

That is what I ended up doing after reading Fog's work.
MrSwiss
Posts: 3726
Joined: Jun 02, 2013 9:27
Location: Switzerland

Re: SSE2: pand xmm3, 31?

Postby MrSwiss » Sep 27, 2018 16:36

Well, well, who wants DWORD nowadays?

We want QWORD, with x64 CPUs & OSs. (FBC 64)
A QWORD in the *ALU saves a lot of pain when e.g. shifting, rotating etc.
(No FPU / SSE needed, for the INT related stuff.)

I consider the 32 bit stuff, to be "legacy". (FBC 32)

*ALU = arithmetic logic unit, which works on the general-purpose registers, e.g. rax (64-bit ASM)
deltarho[1859]
Posts: 2852
Joined: Jan 02, 2017 0:34
Location: UK

Re: SSE2: pand xmm3, 31?

Postby deltarho[1859] » Sep 27, 2018 18:25

@MrSwiss

Have a look at gcc 5.2 vs gcc 8.1 - I have just added 64-bit.

I am seriously thinking of following your lead. <smile>
marcov
Posts: 3081
Joined: Jun 16, 2005 9:45
Location: Eindhoven, NL

Re: SSE2: pand xmm3, 31?

Postby marcov » Sep 27, 2018 18:52

jj2007 wrote:
marcov wrote:I myself have the feeling that for floating point, SSE is faster than the x87 FPU
We have evidence in the Masm32 Laboratory that most of the time FPU and SIMD instructions are equally fast.


1. that is not a benchmark, but some random link to some forum.
2. you don't specify anything about the bias of your benchmark. Gonio heavy? Double? Single? FMA intensive?
jj2007
Posts: 1962
Joined: Oct 23, 2016 15:28
Location: Roma, Italia

Re: SSE2: pand xmm3, 31?

Postby jj2007 » Sep 27, 2018 19:06

Show me ONE piece of evidence that supports your "feeling".
MrSwiss
Posts: 3726
Joined: Jun 02, 2013 9:27
Location: Switzerland

Re: SSE2: pand xmm3, 31?

Postby MrSwiss » Sep 27, 2018 19:10

@deltarho[1859],

It is indeed far simpler code in 64-bit ASM, compared to the same in 32-bit ASM.
Main reason: we can operate straight away with the CPU registers (instead of
having to use FPU registers, just because of the required bit-size).

Example rotl/rotr (rotate left/right, just 4 lines ASM):

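' 64-bit only: uses rax, so build with FBC 64
Function rotl( ByVal value As ULongInt, ByVal r_val As UByte ) As ULongInt
    Asm
        mov rax, qword ptr [value]      ' load the 64-bit value to rotate
        mov cl, byte ptr [r_val]        ' a variable rotate count must be in cl
        rol rax, cl                     ' rotate left; the CPU masks the count to 0..63
        mov qword ptr [Function], rax   ' store the result as the function's return value
    End Asm
End Function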
The very same for: rotr (just use ror, instead of rol).
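
For comparison, the same rotate can be written in plain BASIC with two shifts and an Or. This is a minimal sketch, and the name rotl_basic is made up for illustration. Optimizing back ends such as gcc often recognize this shift/or pattern and emit a single rol, which may be part of why compiled BASIC can keep up with hand-written asm here. That is an assumption to check in the disassembly, not a guarantee.

```freebasic
' Hypothetical pure-BASIC counterpart of the asm rotl above.
Function rotl_basic( ByVal value As ULongInt, ByVal r_val As UByte ) As ULongInt
    Dim As Long n = r_val And 63        ' mimic rol: only the low 6 bits of the count matter
    If n = 0 Then Return value          ' avoid a shift by 64 in the expression below
    Return (value Shl n) Or (value Shr (64 - n))
End Function

' The top bit wraps around to bit 0: &h8000000000000001 rotated left by 1 is 3.
Print Hex( rotl_basic( &h8000000000000001ull, 1 ) )
```

For rotr, swap Shl and Shr.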
marcov
Posts: 3081
Joined: Jun 16, 2005 9:45
Location: Eindhoven, NL

Re: SSE2: pand xmm3, 31?

Postby marcov » Sep 28, 2018 15:02

jj2007 wrote:Show me ONE piece of evidence that supports your "feeling".


I'm not sure why I should? It is a rough feeling, and not from benchmarks that I pretend are universal. It is just a hint to test.

Posting a reference that doesn't show anything concrete like you did is however bound to provoke a request for clarification.

Anyway, most of my calculations do a lot of polar to Cartesian and back conversions. Those are somewhat (20-40%) slower using SSE2. On the other hand, I have some 2D vector data (which I don't parallelize using SSE, and the compiler is not vectorizing it) for signed distance fonts (vertices to upload for OpenGL), and those are a bit faster (again roughly 20%). YMMV.
jj2007
Posts: 1962
Joined: Oct 23, 2016 15:28
Location: Roma, Italia

Re: SSE2: pand xmm3, 31?

Postby jj2007 » Sep 28, 2018 15:25

marcov wrote:Posting a reference that doesn't show anything concrete like you did is however bound to provoke a request for clarification.
Agreed. I don't want to turn this thread into a contest. Still, googling for site:masm32.com "fpu" "SSE" "faster" would demonstrate that we invested quite a bit of time into measuring FPU vs SIMD. In general, they are exactly equal, speed-wise, probably because SIMD and FPU internally use the same circuits. The FPU fares better for precision, SSE wins if you can use the packed instructions. Btw what is "Gonio heavy"?
marcov
Posts: 3081
Joined: Jun 16, 2005 9:45
Location: Eindhoven, NL

Re: SSE2: pand xmm3, 31?

Postby marcov » Sep 28, 2018 20:22

jj2007 wrote:
marcov wrote:Posting a reference that doesn't show anything concrete like you did is however bound to provoke a request for clarification.
Agreed. I don't want to turn this thread into a contest. Still, googling for site:masm32.com "fpu" "SSE" "faster" would demonstrate that we invested quite a bit of time into measuring FPU vs SIMD. In general, they are exactly equal, speed-wise, probably because SIMD and FPU internally use the same circuits. The FPU fares better for precision, SSE wins if you can use the packed instructions. Btw what is "Gonio heavy"?


My bad. Goniometrics is triGONOmetrics in English. Dutch throws in an "i".

I did the search, but found nothing much interesting. But the precision vs speed remark is quite important. I do consider fairly full IEEE semantics, my calculations involve a revolving cylinder (a beer bottle to be more precise, an(*) Heineken bottle to be exact) observed from a distance, extrapolating the distance to infinity to reduce distortion.

Since the start angle is random, certain worst case situations happen in practice. Arctan is particularly sensitive.

The distances don't really need to be "infinity", but can be large, so the domain often needs to be reduced to a set of periods around 0, or treated differently around the extremes. Doing otherwise reduces your precision to 2 or 3 digits. Quick and dirty solutions often skip this. I currently don't implement any packing, since while there are some array operations, calculations are often relative to a certain index which might not be aligned, complicating SSE code.

(*) English corner case. "An" seems logical as the article, but as a loanword, the H in Heineken is NOT silent. So maybe it should be "a"?
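
As a rough illustration of the domain reduction mentioned above, an angle can be folded back into one period around 0 before it is handed to a trig routine. This is a simplified sketch and not the code discussed in the post: the names ReduceAngle and TWO_PI are made up, and the naive subtraction itself loses accuracy for very large inputs, where a more careful multi-step reduction would be needed.

```freebasic
Const TWO_PI As Double = 6.283185307179586   ' 2 * pi

' Hypothetical helper: maps an angle in radians into [-pi, pi).
Function ReduceAngle( ByVal a As Double ) As Double
    ' Int() rounds towards minus infinity, so adding 0.5 first
    ' selects the nearest whole number of periods to subtract.
    Return a - TWO_PI * Int( a / TWO_PI + 0.5 )
End Function

Dim As Double big = 1000.25
Print Sin( big ), Sin( ReduceAngle( big ) )   ' the two values should agree closely
```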
