AlignAlloc(), AlignCAlloc(), AlignFree(), AlignSize(), AlignPayload()


Postby D.J.Peters » Mar 14, 2020 23:25

I ran into trouble while porting some of my 3D math stuff (operators) to fast SSE inline assembler.
Everything worked on 64-bit Windows and Linux but crashed on 32-bit.
I found out that FreeBASIC allocates UDTs (and of course classes) on an 8-byte boundary on 32-bit and on a 16-byte boundary on 64-bit.
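
You can check this yourself, by the way. A minimal sketch (vec4f here is just a plain UDT of four singles, like the one from my 3D math code):

Code:

' minimal alignment check (sketch, not part of the library below)
type vec4f
  as single x, y, z, w
end type

dim as vec4f ptr v = new vec4f
print "new vec4f   : address mod 16 = "; cast(uinteger, v) mod 16
delete v

dim as any ptr p = allocate(64)
print "allocate(64): address mod 16 = "; cast(uinteger, p) mod 16
deallocate p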

So I needed SSE-aligned memory on 32-bit as well.

By the way, I used it in fbImage before, which means you can free an aligned pointer with ImageDestroy() if you like :-)
Of course AlignFree() does the same job and sets the deallocated pointer to NULL!

Joshy

Code:

' alloc aligned memory (default not cleared)
function AlignAlloc(byval nItems as uinteger, byval sizeofItem as uinteger, byval alignment as uinteger,byval bClearMem as boolean=false) as any ptr
  if nItems=0 then return 0
  if sizeofItem=0 then sizeofItem=nItems
  if alignment=0 then alignment=sizeofItem
  'print "AlignAlloc(" & nItems & "," & sizeofItem & "," & alignment & ")"
  alignment-=1
  dim as uinteger pointer_size = (sizeof(any ptr)*3)
  dim as uinteger payload_size = nItems*sizeofItem
  dim as integer  allocated_size = payload_size + pointer_size + alignment
  dim as any ptr tmp = iif(bClearMem,callocate(allocated_size),allocate(allocated_size))
  dim as any ptr p = cptr(any ptr,((cast(uinteger,tmp) + pointer_size + alignment) and not alignment))
  cptr(any ptr ptr,p)[-3] = cptr(any ptr,payload_size)   ' payload size in bytes
  cptr(any ptr ptr,p)[-2] = cptr(any ptr,allocated_size) ' total allocated size
  cptr(any ptr ptr,p)[-1] = tmp                          ' original (unaligned) pointer
  return p
end function

' alloc cleared aligned memory
function AlignCAlloc(byval nItems as uinteger, byval sizeofItem as uinteger, byval alignment as uinteger) as any ptr
  return AlignAlloc(nItems,sizeOfItem,alignment,true)
end function

' get total size of aligned memory
function AlignSize(byval p as any ptr) as uinteger
  if p=0 then return 0
  return cast(uinteger,cptr(any ptr ptr,p)[-2])
end function

' get size of payload memory
function AlignPayload(byval p as any ptr) as uinteger
  if p=0 then return 0
  return cast(uinteger,cptr(any ptr ptr,p)[-3])
end function

' free an aligned pointer (byref) and set it to NULL
' !!! be sure the pointer was allocated aligned before !!!
' It does the same job as ImageDestroy() :-)
sub AlignFree(byref p as any ptr)
  if p then
    var original_p = cptr(any ptr ptr,p)[-1]
    if original_p then deallocate original_p
    p=0
  end if 
end sub
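
Here a small usage sketch (it assumes the vec4f UDT of four singles again, any SSE-sized type works the same way):

Code:

' usage sketch: 100 vec4f's on a 16-byte boundary, ready for movaps/mulps
dim as any ptr   p = AlignAlloc(100, sizeof(vec4f), 16)
dim as vec4f ptr v = p
v[0].x = 1 : v[0].y = 2 : v[0].z = 3 : v[0].w = 4
print "address mod 16 : "; cast(uinteger, p) mod 16  ' always 0
print "payload bytes  : "; AlignPayload(p)           ' 100 * sizeof(vec4f)
print "total bytes    : "; AlignSize(p)              ' payload + header + padding
AlignFree(p)                                         ' p is NULL afterwards
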
Last edited by D.J.Peters on Mar 14, 2020 23:32, edited 1 time in total.

Re: AlignAlloc(), AlignCAlloc(), AlignFree(), AlignSize(), AlignPayload()

Postby SARG » Mar 15, 2020 8:46

@D.J.Peters
I was not at home yesterday; however, I could read your (deleted ????) post about the alignment problem.
That's strange, because according to an asm 'bible' that I often use (https://www.felixcloutier.com/x86/), MOVUPS avoids crashes.

MOVAPS 128-bit versions:
Moves 128 bits of packed single-precision floating-point values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load an XMM register from a 128-bit memory location, to store the contents of an XMM register into a 128-bit memory location, or to move data between two XMM registers. When the source or destination operand is a memory operand, the operand must be aligned on a 16-byte boundary or a general-protection exception (#GP) will be generated. To move single-precision floating-point values to and from unaligned memory locations, use the VMOVUPS instruction.

MOVUPS 128-bit versions:
Moves 128 bits of packed single-precision floating-point values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load an XMM register from a 128-bit memory location, to store the contents of an XMM register into a 128-bit memory location, or to move data between two XMM registers.

When the source or destination operand is a memory operand, the operand may be unaligned without causing a general-protection exception (#GP) to be generated.

MOVAPS is said to be (obviously) faster.
Did you try the VMOVUPS/VMOVAPS instructions with 256 bits (if your processor supports them)? The speedup should be even more impressive.

Re: AlignAlloc(), AlignCAlloc(), AlignFree(), AlignSize(), AlignPayload()

Postby D.J.Peters » Mar 15, 2020 12:51

SARG, you can move memory that is not aligned on a 16-byte boundary with "movups", that is not the problem!
The problem is that in 32-bit mode you can't do an SSE math operation on non-aligned memory.

By the way, I use SSE only (not MMX, 3DNow!, SSE2, SSE3, SSE4, AVX); the minimum spec is an old Pentium 3 or AMD Athlon.
PCs before those, with only MMX and 3DNow!, have 64-bit registers, not 4 packed single-precision floats in one go.

See the dot/inner product of vec4f, 32-bit vs. 64-bit:
64-bit: the first instruction loads the left argument into xmm0, the second multiplies xmm0 with the right argument directly from memory.
32-bit: the first instruction loads the left argument into xmm0, the second loads the right argument into xmm1, the third multiplies xmm0 with xmm1.

If FreeBASIC aligned on a 16-byte boundary in 32-bit mode, the second instruction would not be necessary!

Joshy

Maybe not the fastest version of the dot product, but I'm learning SSE and the shuffle stuff is new to me.

Code:

' dot product = a*b
operator * NAKED_C (byref l as const vec4f, byref r as const vec4f) as single
#if defined(ASM_OFF)
  return (l.x*r.x + l.y*r.y + l.z*r.z + l.w*r.w)
#else
ASM_PROLOG
 #if defined(ASM_X86_ABI)
  mov REG0_DW,[ARG_I0]   ' @l
  mov REG1_DW,[ARG_I1]   ' @r
 
  movups  xmm0,[REG0_DW] ' l.w    , l.z    , l.y    , l.x
  movups  xmm1,[REG1_DW] ' r.w    , r.z    , r.y    , r.x
  mulps   xmm0, xmm1     ' ww     , zz     , yy     , xx
  movaps  xmm1, xmm0     ' ww     , zz     , yy     , xx
  shufps  xmm1, xmm0, 177' zz     , ww     , xx     , yy  (10:11:00:01)
  addps   xmm0, xmm1     ' ww + zz, zz + ww, yy + xx, [xx + yy]
  movhlps xmm1, xmm0     '                           +[zz + ww]
  addss   xmm0, xmm1     ' xx + yy + zz + ww
  movss  [ASM_MEM(0)], xmm0 ' *tmp = xmm0[0]
  fld dword ptr [ASM_MEM(0)]' st(0)= l.x*r.x + l.y*r.y + l.z*r.z + l.w*r.w
     
 #elseif defined(ASM_WIN64_ABI) or defined(ASM_LIN64_ABI)
 
  movups  xmm0,[ARG_I0] ' l.w    , l.z    , l.y    , l.x
  mulps   xmm0,[ARG_I1] ' l.w*r.w, l.z*r.z, l.y*r.y, l.x*r.x
  movaps  xmm1,xmm0     ' ww     , zz     , yy     , xx
  shufps  xmm1,xmm0,177 ' zz     , ww     , xx     , yy  (10:11:00:01)
  addps   xmm0,xmm1     ' ww + zz, zz + ww, yy + xx, [xx + yy]
  movhlps xmm1,xmm0     '                           +[zz + ww]
  addss   xmm0,xmm1     ' xmm0 = l.x*r.x + l.y*r.y + l.z*r.z + l.w*r.w
 #endif 
ASM_EPILOG
#endif   
end operator
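
A quick sanity check I use (just a sketch, assuming vec4f is the plain UDT of four singles x, y, z, w):

Code:

' compare the SSE operator against the plain scalar expression
dim as vec4f a : a.x = 1 : a.y = 2 : a.z = 3 : a.w = 4
dim as vec4f b : b.x = 5 : b.y = 6 : b.z = 7 : b.w = 8
print "SSE    : "; a * b                                   ' the operator above
print "scalar : "; a.x*b.x + a.y*b.y + a.z*b.z + a.w*b.w   ' expected: 70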

Re: AlignAlloc(), AlignCAlloc(), AlignFree(), AlignSize(), AlignPayload()

Postby SARG » Mar 15, 2020 16:52

D.J.Peters wrote:SARG, you can move memory that is not aligned on a 16-byte boundary with "movups", that is not the problem!
The problem is that in 32-bit mode you can't do an SSE math operation on non-aligned memory.
Sorry, I had a quick look at your post just before leaving home, and when I returned to the forum the post was deleted, so I only remembered the moves.

D.J.Peters wrote:By the way, I use SSE only (not MMX, 3DNow!, SSE2, SSE3, SSE4, AVX); the minimum spec is an old Pentium 3 or AMD Athlon.
PCs before those, with only MMX and 3DNow!, have 64-bit registers, not 4 packed single-precision floats in one go.
I was just curious about the gain, in case your PC has such a processor.

Maybe you already know this document; it has some interesting tricks.
https://www.agner.org/optimize/optimizing_assembly.pdf

An example of shuffling, found when I was searching for information about subps and alignment.

Code:

;; cross product to find a vector perpendicular to two given vectors (32-bit version for now).

;; float *cross_product (float V1[4], float V2[4], float W[4])

;; Find the cross product of two constant vectors and return it.

;; W.x = V1.y * V2.z - V1.z * V2.y
;; W.y = V1.z * V2.x - V1.x * V2.z
;; W.z = V1.x * V2.y - V1.y * V2.x

        global cross_product
        section .text

cross_product:
        push ebp
        mov ebp,esp
        push ebx                 ;ebx is callee-saved in cdecl, preserve it
        mov eax,[ebp+8]          ;Put argument addresses to registers
        mov ebx,[ebp+12]

        movups xmm0,[eax]        ;If aligned then use movaps
        movups xmm1,[ebx]   

        movaps xmm2,xmm0         ;Copies
        movaps xmm3,xmm1

        shufps xmm0,xmm0,0xd8    ;Exchange 2 and 3 element (V1)
        shufps xmm1,xmm1,0xe1    ;Exchange 1 and 2 element (V2)
        mulps  xmm0,xmm1
               
        shufps xmm2,xmm2,0xe1    ;Exchange 1 and 2 element (V1)
        shufps xmm3,xmm3,0xd8    ;Exchange 2 and 3 element (V2)
        mulps  xmm2,xmm3
             
        subps  xmm0,xmm2

        mov eax,[ebp+16]
        movups [eax],xmm0        ;Result
        pop ebx
        pop ebp
        ret
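
For reference, the same formula in plain FreeBASIC (just a sketch, to verify the asm result against):

Code:

' scalar cross product, same formula as the asm above
sub cross_product_scalar(byval v1 as single ptr, byval v2 as single ptr, byval w as single ptr)
  w[0] = v1[1]*v2[2] - v1[2]*v2[1]   ' W.x = V1.y*V2.z - V1.z*V2.y
  w[1] = v1[2]*v2[0] - v1[0]*v2[2]   ' W.y = V1.z*V2.x - V1.x*V2.z
  w[2] = v1[0]*v2[1] - v1[1]*v2[0]   ' W.z = V1.x*V2.y - V1.y*V2.x
end sub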

Re: AlignAlloc(), AlignCAlloc(), AlignFree(), AlignSize(), AlignPayload()

Postby marcov » Mar 15, 2020 17:11

D.J.Peters wrote:SARG, you can move memory that is not aligned on a 16-byte boundary with "movups", that is not the problem!
The problem is that in 32-bit mode you can't do an SSE math operation on non-aligned memory.

By the way, I use SSE only (not MMX, 3DNow!, SSE2, SSE3, SSE4, AVX); the minimum spec is an old Pentium 3 or AMD Athlon.


Very weird, since the code you show uses SSE2 128-bit XMM registers, and not SSE1 MM registers.

Note that what is allowed unaligned depends on the exact CPU.
