64-bit inline assembler

srvaldez · Post by **srvaldez** » Sep 27, 2016 16:16

just some observations, feel free to correct my mistakes.
64-bit intel asm works ok if you use -O2 or less; sometimes -O0 may be needed.
here's a quick and dirty example

Code: Select all

Sub iPower(Byref result As double, Byref x As double, Byval e As Integer)
    Asm
        mov rax,[e]
        mov rbx,rax
        ipower_absrax:
        neg rax
        js  ipower_absrax
        fld1          '  z=1.0
        fld1
        mov rdx,[x]
        fld qword ptr [rdx] 'load st0 with x
        cmp rax,0     'while e>0
        ipower_while1:
        jle ipower_wend1
        ipower_while2:
        bt rax,0      'test for odd/even
        jc ipower_wend2      'jump if odd
  '                while e is even
        sar rax,1     'rax=rax/2
        fmul st(0),st(0)  'x=x*x
        jmp ipower_while2
        ipower_wend2:
        sub rax,1
        fmul st(1),st(0)  'z=z*x 'st1=st1*st0
        jmp ipower_while1 
        ipower_wend1:
        fstp st(0)      'cleanup fpu stack
        fstp st(1)      '"       "   "
        cmp rbx,0     'test to see if e<0
        jge ipower_noinv     'skip reciprocal if not less than 0
  '                if e<0 take reciprocal
        fld1
        fdivrp st(1),st(0)
        ipower_noinv:
        mov rax,[result]
        fstp qword ptr [rax] 'store z (st0)
        'no further pops needed: the fstp above already emptied the fpu stack
    End Asm
End Sub

dim as double x, y
x=2
iPower(y,x,3)
print y
iPower(y,x,-3)
print y

geany compile command: fbc -w all "%f" -asm intel -gen gcc -Wc -O2
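on Linux if you use -O3 for example, then you may get assembler errors like 'symbol already defined', probably because gcc inlines or duplicates the function at that level, so the labels in the Asm block get emitted twice.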
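
For cross-checking the asm, here is a minimal sketch of the same loop in plain C: square-and-multiply on |e|, then a reciprocal when e is negative. The name ipower_ref is made up for this sketch and is not part of FreeBASIC or of the code above.

```c
/* Plain-C reference for the loop in the asm above.
   ipower_ref is a hypothetical name used only in this sketch. */
double ipower_ref(double x, long long e)
{
    /* take |e| as unsigned so the most negative value is handled too */
    unsigned long long n = (e < 0) ? 0ULL - (unsigned long long)e
                                   : (unsigned long long)e;
    double z = 1.0;

    while (n > 0) {
        while ((n & 1) == 0) {  /* while e is even */
            n >>= 1;            /* e = e / 2 */
            x *= x;             /* x = x * x */
        }
        n -= 1;
        z *= x;                 /* z = z * x */
    }
    return (e < 0) ? 1.0 / z : z;  /* reciprocal if e < 0 */
}
```

Calling ipower_ref(2.0, 3) and ipower_ref(2.0, -3) gives 8 and 0.125, the values the FreeBASIC version should print.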

D.J.Peters · Post by **D.J.Peters** » Sep 27, 2016 17:42

Sub iPower(Byref result As double, Byref x As double, Byval e As Integer)

Why do you read the params from the slow stack frame in memory?

Aren't the parameters in registers on 64-bit?

the pointer to result is in RCX (byref)
the pointer to x is in RDX (byref)
the value of e is in R8 (byval)

Let me know if I'm wrong.

Can you post an example how to access local and global vars with 64-bit inline assembler please ?

Joshy

srvaldez · Post by **srvaldez** » Sep 27, 2016 18:49

you are probably right about the parameters and registers, I am no expert on this. But I don't think you can trust the registers to hold the parameters as you would expect unless you compile with -O0, because gcc will more than likely make optimizations. I found that out trying to use inline asm on my Mac.
maybe MichaelW will see this thread and post some good examples, he's the expert, but here's a very simple example

Code: Select all

dim shared as short ten=10
function TenPow(Byval x As double) as double
	dim as double y
	'dim as short ten=10
    Asm
        fld qword ptr [x]
        fild word ptr [ten]
        fyl2x
        fld st(0) 
        frndint 
        fsub st(1), st(0) 
        fld1 
        fscale 
        fxch 
        fxch st(2) 
        f2xm1 
        fld1 
        faddp st(1), st(0) 
        fmulp st(1), st(0)
        fstp st(1) 
        fstp qword ptr [Function]
        'or you could return in y
        'fstp qword ptr [y]
    End Asm
    'return y
End Function

print TenPow(.5)

marcov · Post by **marcov** » Sep 27, 2016 18:54

D.J.Peters wrote: Can you post an example how to access local and global vars with 64-bit inline assembler please ?

(you might have to add rip to globals)

srvaldez · Post by **srvaldez** » Sep 28, 2016 2:29

here's the first example adapted for the Mac

Code: Select all

Sub iPower(Byref result As double, Byref x As double, Byval e As Integer)
    Asm
        ".intel_syntax noprefix"
        "push rax"
        "push rbx"
        "mov rax,rdx"
        "mov rbx,rax"
        "ipower_absrax:"
        "neg rax"
        "js  ipower_absrax"
        "fld1"          '  z=1.0
        "fld1"
        "fld qword ptr [rsi]" 'load st0 with x
        "cmp rax,0"     'while e>0
        "ipower_while1:"
        "jle ipower_wend1"
        "ipower_while2:"
        "bt rax,0"      'test for odd/even
        "jc ipower_wend2"      'jump if odd
  '                while e is even
        "sar rax,1"     'rax=rax/2
        "fmul st(0),st(0)"  'x=x*x
        "jmp ipower_while2"
        "ipower_wend2:"
        "sub rax,1"
        "fmul st(1),st(0)"  'z=z*x 'st1=st1*st0
        "jmp ipower_while1" 
        "ipower_wend1:"
        "fstp st(0)"      'cleanup fpu stack
        "fstp st(1)"      '"       "   "
        "cmp rbx,0"     'test to see if e<0
        "jge ipower_noinv"     'skip reciprocal if not less than 0
  '                if e<0 take reciprocal
        "fld1"
        "fdivrp st(1),st(0)"
        "ipower_noinv:"
        "fstp qword ptr [rdi]" 'store z (st0)
        "fstp st(0)" 'clear fpu stack
        "fstp st(0)" 'clear fpu stack
        "pop rbx"
        "pop rax"
        ".att_syntax prefix"
    End Asm
End Sub

dim as double x, y
x=2
iPower(y,x,3)
print y
iPower(y,x,-3)
print y

geany compile command fbc -w all -asm att "%f" -gen gcc -Wc -O2
note that on the Mac there's no easy way to access FB variables, which makes inline asm impractical.

srvaldez · Post by **srvaldez** » Sep 28, 2016 17:19

I know that the Mac is not a supported platform but naked functions fail, for example

Code: Select all

function dbl naked (Byval x As double) as double
    Asm
        "addsd  %xmm0, %xmm0"
        "ret"
    End Asm
End Function

print dbl(5)

relevant asm code

Code: Select all

	.text
	.globl DBL
	DBL:
	addsd  %xmm0, %xmm0
	ret
	
	.text
	.globl _main
_main:
	...
	call	_DBL

the main program calls a decorated function whereas the naked function was not decorated.
btw, it works ok on windows and linux

MichaelW · Post by **MichaelW** » Oct 02, 2016 11:09

I didn't have time to do everything that I wanted, but I did verify that for 64-bit code, and using compiler Version 1.05.0 (01-31-2016), built for win64 (64bit), naked functions conform to the 64-bit calling convention.

Edit: Added code to do RIP-relative access to shared variables.

Edit2: Sorry, the above was a last minute change, and the shared variables are accessed as direct memory operands where the address of the variable is encoded into the accessing instruction. There are examples of RIP-relative addressing elsewhere in the assembly code output of the compiler. IIRC RIP-relative addressing is preferred because the encoding is smaller.

Code: Select all

''----------------------------------------------------------------------
'' The first four integer/floating-point arguments, taken in left to
'' right order*, should be passed in RCX/XMM0L**, RDX/XMM1L**,
'' R8/XMM2L**, and R9/XMM3L**, with any further arguments, taken in
'' right to left order*, passed on the stack.
''
'' * As they are listed in the function definition or prototype.
''
'' ** The choice of register is determined by the operand type, with
'' the register that does not match the type ignored.
''
'' Scalar values that fit in 64 bits are returned in RAX.
''
'' Floating-point values are returned in XMM0.
''
''----------------------------------------------------------------------

''----------------------------------------------------------------------
'' On entry to our functions, the stack layout is:
''    rsp+48  arg6
''    rsp+40  arg5
''    rsp+32  arg4  spill
''    rsp+24  arg3  spill
''    rsp+16  arg2  spill
''    rsp+8   arg1  spill
''    rsp     return address 
''----------------------------------------------------------------------

function Test1 naked ( arg1 as integer, _
                       arg2 as integer, _
                       arg3 as integer, _
                       arg4 as integer, _
                       arg5 as integer, _
                       arg6 as integer ) as integer
    asm
        xor     rax, rax
        add     rax, rcx
        add     rax, rdx
        add     rax, r8
        add     rax, r9
        add     rax, [rsp+40]
        add     rax, [rsp+48]
        ret
    end asm
end function

''----------------------------------------------------------------------

function Test2 naked ( arg1 as double, _
                       arg2 as double, _
                       arg3 as double, _
                       arg4 as double, _
                       arg5 as double, _
                       arg6 as double ) as double
    asm
        addsd     xmm0, xmm1
        addsd     xmm0, xmm2
        addsd     xmm0, xmm3
        addsd     xmm0, [rsp+40]
        addsd     xmm0, [rsp+48]
        ret
    end asm
end function

''----------------------------------------------------------------------

dim shared as integer a = 1, b = 2, c = 3

function Test3 naked ( ) as integer
    asm
        xor       rax, rax
        add       rax, a
        add       rax, b
        add       rax, c
        ret
    end asm
end function

''----------------------------------------------------------------------

print Test1(1,2,3,4,5,6)

print Test2(1,2,3,4,5,6)

print Test3()

sleep

Code: Select all

	.file	"Test.c"
	.intel_syntax noprefix
	.data
	.align 8
A$:
	.quad	1
	.align 8
B$:
	.quad	2
	.align 8
C$:
	.quad	3
/APP
	.text
	.globl TEST1
	TEST1:
	xor     rax, rax
	add     rax, rcx
	add     rax, rdx
	add     rax, r8
	add     rax, r9
	add     rax, [rsp+40]
	add     rax, [rsp+48]
	ret
	.text
	.globl TEST2
	TEST2:
	addsd     xmm0, xmm1
	addsd     xmm0, xmm2
	addsd     xmm0, xmm3
	addsd     xmm0, [rsp+40]
	addsd     xmm0, [rsp+48]
	ret
	.text
	.globl TEST3
	TEST3:
	xor       rax, rax
	add       rax, A$
	add       rax, B$
	add       rax, C$
	ret
	.def	__main;	.scl	2;	.type	32;	.endef
/NO_APP
	.text
	.globl	main
	.def	main;	.scl	2;	.type	32;	.endef
main:
	push	rbp
	mov	rbp, rsp
	sub	rsp, 80
	mov	DWORD PTR 16[rbp], ecx
	mov	QWORD PTR 24[rbp], rdx
	call	__main
	mov	DWORD PTR -28[rbp], 0
	mov	rax, QWORD PTR 24[rbp]
	mov	r8d, 0
	mov	rdx, rax
	mov	ecx, DWORD PTR 16[rbp]
	call	fb_Init
.L2:
	mov	QWORD PTR 40[rsp], 6
	mov	QWORD PTR 32[rsp], 5
	mov	r9d, 4
	mov	r8d, 3
	mov	edx, 2
	mov	ecx, 1
	call	TEST1
	mov	QWORD PTR -8[rbp], rax
	mov	rax, QWORD PTR -8[rbp]
	mov	r8d, 1
	mov	rdx, rax
	mov	ecx, 0
	call	fb_PrintLongint
	movsd	xmm3, QWORD PTR .LC0[rip]
	movsd	xmm2, QWORD PTR .LC1[rip]
	movsd	xmm1, QWORD PTR .LC2[rip]
	movsd	xmm0, QWORD PTR .LC3[rip]
	movsd	QWORD PTR 40[rsp], xmm0
	movsd	xmm0, QWORD PTR .LC4[rip]
	movsd	QWORD PTR 32[rsp], xmm0
	movsd	xmm0, QWORD PTR .LC5[rip]
	call	TEST2
	movq	rax, xmm0
	mov	QWORD PTR -16[rbp], rax
	movsd	xmm0, QWORD PTR -16[rbp]
	mov	r8d, 1
	movapd	xmm1, xmm0
	mov	ecx, 0
	call	fb_PrintDouble
	call	TEST3
	mov	QWORD PTR -24[rbp], rax
	mov	rax, QWORD PTR -24[rbp]
	mov	r8d, 1
	mov	rdx, rax
	mov	ecx, 0
	call	fb_PrintLongint
	mov	ecx, -1
	call	fb_Sleep
.L3:
	mov	ecx, 0
	call	fb_End
	mov	eax, DWORD PTR -28[rbp]
	leave
	ret
	.section .rdata,"dr"
	.align 8
.LC0:
	.long	0
	.long	1074790400
	.align 8
.LC1:
	.long	0
	.long	1074266112
	.align 8
.LC2:
	.long	0
	.long	1073741824
	.align 8
.LC3:
	.long	0
	.long	1075314688
	.align 8
.LC4:
	.long	0
	.long	1075052544
	.align 8
.LC5:
	.long	0
	.long	1072693248
	.ident	"GCC: (x86_64-win32-sjlj-rev0, Built by MinGW-W64 project) 5.2.0"
	.def	fb_Init;	.scl	2;	.type	32;	.endef
	.def	TEST1;	.scl	2;	.type	32;	.endef
	.def	fb_PrintLongint;	.scl	2;	.type	32;	.endef
	.def	TEST2;	.scl	2;	.type	32;	.endef
	.def	fb_PrintDouble;	.scl	2;	.type	32;	.endef
	.def	TEST3;	.scl	2;	.type	32;	.endef
	.def	fb_Sleep;	.scl	2;	.type	32;	.endef
	.def	fb_End;	.scl	2;	.type	32;	.endef

Regarding the problem with code that runs OK with no compiler optimization, but fails with optimization, in my experience the problem is usually a failure to follow the calling convention. For example, I recently created a set of 64-bit clock-cycle count macros for GCC that use inline assembly. As is the norm for cycle-count code, the macros use CPUID as a "serializing" instruction. One unfortunate side effect of CPUID is that it modifies the EBX component of the callee-save register RBX. Since preserving RBX around the CPUID instruction would place a POP RBX instruction after the CPUID instruction, "polluting" the cycle count somewhat, I avoided preserving RBX. The code worked fine with no compiler optimizations, but with any level of optimization, it would trigger exceptions, apparently because the optimized code depended on RBX being preserved, as per the calling convention. While compiling with no optimization would correct the immediate problem, it is not very practical, because code compiled with no optimization is effectively optimized for debugging, and generally executes much, much slower than optimized code.
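
As a rough illustration of the RBX point, here is a minimal sketch in GCC's own extended asm syntax (C, not FreeBASIC). It uses RBX as scratch and lists it as a clobber, so the compiler saves and restores it and optimized callers keep working. The function name add_via_rbx is made up, and the block assumes an x86-64 target with a GCC-compatible compiler using the default AT&T syntax; on anything else it falls back to plain C.

```c
/* Adds two integers, deliberately using RBX as a scratch register.
   The "rbx" entry in the clobber list is what tells the compiler that
   the callee-save register is modified inside the asm block.
   add_via_rbx is a hypothetical name used only in this sketch. */
long long add_via_rbx(long long a, long long b)
{
#if defined(__x86_64__) && defined(__GNUC__)
    long long r;
    __asm__ ("mov %1, %%rbx\n\t"   /* rbx = a */
             "add %2, %%rbx\n\t"   /* rbx = a + b */
             "mov %%rbx, %0"       /* r = rbx */
             : "=r"(r)
             : "r"(a), "r"(b)
             : "rbx", "cc");
    return r;
#else
    return a + b;   /* fallback so the sketch builds on other targets */
#endif
}
```

Without the "rbx" clobber the same block may still appear to work at -O0 and then break at higher levels, which matches the behaviour described above.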

There is a Microsoft calling-convention reference here, and a more compact one here.
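
The register and stack rules from the comments in the Test1/Test2 listing can be written down as a small lookup. This is only a sketch of those rules as stated there (first four arguments in RCX/XMM0, RDX/XMM1, R8/XMM2, R9/XMM3 by position, the rest on the stack above the return address and the four spill slots); win64_arg_location is a made-up helper, and it assumes pos is 1 or greater.

```c
#include <stdio.h>

/* Writes where argument number `pos` (1-based, left to right) is found
   on entry to a naked function under the Microsoft x64 convention.
   is_float selects the XMM register instead of the integer register.
   win64_arg_location is a hypothetical helper for illustration only. */
void win64_arg_location(int pos, int is_float, char *out, size_t out_size)
{
    static const char *int_regs[] = { "rcx", "rdx", "r8", "r9" };

    if (pos >= 1 && pos <= 4) {
        if (is_float)
            snprintf(out, out_size, "xmm%d", pos - 1);
        else
            snprintf(out, out_size, "%s", int_regs[pos - 1]);
    } else {
        /* [rsp] holds the return address, [rsp+8]..[rsp+32] are the
           spill slots, so argument 5 starts at [rsp+40] */
        snprintf(out, out_size, "[rsp+%d]", 8 * pos);
    }
}
```

For a signature like (integer, double, integer) this gives rcx, xmm1 and r8, because the slot is chosen by position and the register of the other type is simply skipped.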

MichaelW · Post by **MichaelW** » Oct 19, 2016 13:47

On my Windows 10 notebook I can compile either of the apps with -gen gcc and -O 3 and they run with no problems.

srvaldez · Post by **srvaldez** » Oct 19, 2016 15:25

hi MichaelW
you are probably right that the problem with gcc optimization comes from not properly following the calling convention. However, I did a small test on my Mac where the parameters of a function were not used except in the inline asm portion, and it failed when optimized.
the test simply copied the value of the first byref parameter to the second byref parameter.

MichaelW · Post by **MichaelW** » Oct 19, 2016 19:11

This is the first example, with minimal corrections to handle the parameters and return value as per the calling convention, but more changes will be needed to fully conform, because per the calling convention "All floating point operations are done using the 16 XMM registers."

Code: Select all

function iPower naked ( Byval x As double, Byval e As Integer) as double
    Asm
        push    rbx         '' preserve non-volatile rbx
        '''mov rax,[e]
        mov rax, rdx
        mov rbx, rax
    ipower_absrax:
        neg rax
        js ipower_absrax
        fld1 '  z=1.0
        fld1
        '''mov rdx,[x]
        movq rdx, xmm0
        push rdx
        fld qword ptr [rsp] 'load st0 with x
        pop rdx
        cmp rax,0           'while e>0
    ipower_while1:
        jle ipower_wend1
    ipower_while2:
        bt rax,0            'test for odd/even
        jc ipower_wend2     'jump if odd
                            'while e is even
        sar rax,1           'rax=rax/2
        fmul st(0),st(0)    'x=x*x
        jmp ipower_while2
    ipower_wend2:
        sub rax,1
        fmul st(1),st(0)    'z=z*x 'st1=st1*st0
        jmp ipower_while1 
    ipower_wend1:
        fstp st(0)          'cleanup fpu stack
        fstp st(1)          '"       "   "
        cmp rbx,0           'test to see if e<0
        jge ipower_noinv    'skip reciprocal if not less than 0
                            'if e<0 take reciprocal
        fld1
        fdivrp st(1),st(0)
    ipower_noinv:
        '''mov rax,[result]
        ''sub     rsp, 16      '' allocate buffer from stack
        ''                     '' maintaining 16-byte alignment   
        sub     rsp, 8      '' allocate buffer from stack
        '''fstp qword ptr [rax] 'store z (st0)
        fstp qword ptr [rsp] '' store z to buffer
        movq    xmm0, [rsp]  '' store buffer in return register
        add     rsp, 8      '' free buffer
        '' no further pops needed: the fstp above already emptied the fpu stack
        pop     rbx         '' recover non-volatile rbx 
        ret
    End Asm
End function

dim as double x, y
x=2
print iPower(x,3)
''print y
print iPower(x,-3)
''print y
sleep

Code: Select all

 8
 0.125

Edit: I'm not sure the above code is handling the stack correctly, even though the app runs OK even with -O 3. I need to determine if pushing/popping a 64-bit register changes the stack pointer by 8 bytes or 16 bytes.

Per Agner Fog's calling_conventions.pdf, available here, the stack word size is 8 bytes, but the stack must be aligned by 16 before any call instruction. So for a function that does not contain any call instructions, maintaining an 8-byte alignment is apparently sufficient, so I modified the above code to do just that.
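
The alignment rule can be checked with a little arithmetic. This is a minimal sketch, assuming rsp was 16-byte aligned just before the call that entered the function, so the 8-byte return address leaves it at 16n+8 on entry; pad_before_call is a made-up helper name.

```c
/* Padding (in bytes) needed to make rsp a multiple of 16 before a call
   instruction, given how many bytes the function has pushed or reserved
   since entry. The 8 accounts for the return address already on the
   stack. pad_before_call is a hypothetical helper for illustration. */
unsigned pad_before_call(unsigned bytes_used_since_entry)
{
    unsigned misalign = (8u + bytes_used_since_entry) % 16u;
    return (16u - misalign) % 16u;
}
```

With nothing pushed the function would need 8 more bytes before a call; after a single push rbx (8 bytes) it needs none. The iPower code above never executes a call, so its extra sub rsp, 8 does no harm.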

TeeEmCee · Post by **TeeEmCee** » Oct 22, 2016 16:38

srvaldez wrote:I know that the Mac is not a supported platform but naked functions fail...
the main program calls a decorated function whereas the naked function was not decorated.
btw, it works ok on windows and linux

I noticed that bug and a pile of other OSX ones and fixed them, but I haven't submitted a pull request yet. You can try it though.
I spent several days trying to get -gen gas to work on OSX. Well, I got it working fine... unfortunately you can't actually use it, because Apple gas is broken. It has a major bug where if you ever refer to the same label twice in intel-syntax code or do a backwards jmp/call, it gives the error:

Code: Select all

fb_naked_asm.asm:33:suffix or operands invalid for `call'

This bug has been known for two decades, but Apple don't care about such things, and their rate of development for these core utilities is <1% of GNU's binutils anyway. I tried to fix it myself, but the gas source is the stuff of nightmares. I gave up on even getting FSF gas to compile after a few hours and can't tell if it even properly supports Mach-O, but it didn't a few years ago. I also tried LLVM's assembler, but it turns out its support for intel syntax is utterly broken... they did fix the most serious bug a few days ago though; haven't tried it since. There are no other assemblers supporting intel syntax for mach-o, unless you want to produce an ELF object file and convert to mach-o with objconv.

srvaldez wrote:here's the first example adapted for the Mac

Hey wait... are you saying that that code works for you? It doesn't assemble for me:

Code: Select all

fb_asm.c:26:suffix or operands invalid for `js'
fb_asm.c:38:suffix or operands invalid for `jmp'
fb_asm.c:42:suffix or operands invalid for `jmp'

That is, it hits the Apple gas bug I just mentioned. So you seem to have a working assembler, which I tried so hard and failed to find!
Can you please tell me where you got your build system (XCode, macports, homebrew?) and its version, and the OSX and gas versions (as -version)?

MichaelW wrote: Edit: I'm not sure the above code is handling the stack correctly, even though the app runs OK even with -O 3. I need to determine if pushing/popping a 64-bit register changes the stack pointer by 8 bytes or 16 bytes.

Maybe you are referring to the existence of x86 instructions that push/pop a 16-bit value on the stack to a 32-bit register and vice versa. There are no other mismatched-size push/pop instructions for other bit widths, so pushing or popping a 64-bit register moves the stack pointer by 8 bytes.

srvaldez · Post by **srvaldez** » Oct 22, 2016 17:09

TeeEmCee wrote:
srvaldez wrote:here's the first example adapted for the Mac
Hey wait... are you saying that that code works for you? It doesn't assemble for me:
Code: Select all
fb_asm.c:26:suffix or operands invalid for `js'
fb_asm.c:38:suffix or operands invalid for `jmp'
fb_asm.c:42:suffix or operands invalid for `jmp'
That is, it hits the Apple gas bug I just mentioned. So you seem to have a working assembler, which I tried so hard and failed to find!
Can you please tell me where you got your build system (XCode, macports, homebrew?) and its version, and the OSX and gas versions (as -version)?

hello TeeEmCee :-)
I have Xcode 8.0.0 with the accompanying command line tools, but it also worked with version 7.3.0.

as --version
Apple LLVM version 8.0.0 (clang-800.0.38)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

what version of OS X are you using? I am using El Capitan, btw it's good to see you are interested in FB on the Mac.
<edit> I started by using a Mac version of FB built by venom http://www.freebasic.net/forum/viewtopi ... 17&t=24027 but have since compiled and use the latest FB git repo
venom's link is not working any more but I uploaded it here in case someone wants it FreeBASIC-1.04.0-darwin-x86_64
<edit 2> my compile command for geany on my Mac is: fbc -w all -asm att -gen gcc -Wc -O2 "%f"

TeeEmCee · Post by **TeeEmCee** » Oct 23, 2016 7:00

Wow! I didn't realise that Apple switched to LLVM's assembler. That's surprising, because that assembler is a very recent and buggy project (unlike the GCC toolchain, an assembler is NOT used by clang, except partially to parse inline asm blocks), and it didn't even seem to be attempting to be compatible with gas. In fact, it was originally called 'mc' instead of 'as'. I had quite a lot of trouble just finding the right commandline args to invoke it.

I forgot that Apple bizarrely patched the LLVM tools to report the XCode version number (8.0.0 in your case) instead of the real version number. I looked it up and found that XCode 7.3 ships LLVM 3.8.0.

I have llvm-as 3.8.1 on my gnu/linux machine, and llvm-as 3.7.1 on my mac, and neither can be used to replace gas. But commandline args and assembler directives are quite different between OSX and other Unix anyway (because OSX uses a 30 year old fork of GNU binutils), so maybe llvm-as 3.8 only works as a replacement for gas on OSX.
I'm using OSX 10.8.

BTW, here is the llvm-as bug I mentioned which was fixed 2 weeks ago. 'push' on x86 in intel syntax is miscompiled. It works fine on x86_64.

srvaldez · Post by **srvaldez** » Oct 23, 2016 15:10

hello TeeEmCee
I program occasionally as a hobby and my skills are beginner to maybe intermediate, does your fb fork include all your fixes?
also how do you compile fb?
I have been compiling fb like this
make FBFLAGS="-asm att" ENABLE_XQUARTZ=1 all

TeeEmCee · Post by **TeeEmCee** » Oct 24, 2016 9:06

Yes, that branch has all my darwin-related work. It defaults to -asm att and -gen gcc, but of course if you compile it with fbc 1.05 then you will need to specify those manually. I have never tried ENABLE_XQUARTZ=1.

The things I need to do before getting it merged into the trunk are ensuring that the use of -macosx_version_min=10.4 is correct in general, finish translating the crt headers, and update changelog.txt.
