Concerning the lack of inline functions

Post by shadow008 »

I've been picking away at my project for the longest time, and keeping it as effectively "one source" file has pushed compile times up to an intolerable 10 seconds. So I finally got around to properly structuring the project into header (.bi) and source (.bas) files for each of my modules. After a good bit of fiddling, I got everything working and compiled it into a working executable. To my surprise, execution time more than doubled: I went from an average of 1.1ms/tick to 2.3ms/tick!

Digging into this, I figured out that the various UDTs I use in the critical path (custom vectors, stacks, etc.) can no longer be inlined, because they're now compiled as separate modules. Looking at the assembly, I see a whole function call setup and stack cleanup just to call a type member function that adds three integers together. Previously, when everything was effectively "one source" file, this same section was fully vectorized code. This has led me to the following four solutions:

1) FreeBASIC gets inline functions. This would be the ideal solution, and as an added bonus it would make conversion of various C headers more straightforward (no need to use macros instead). Plus there's been appetite to add inline functions to FB for a while now. However, this is asking a lot of effort from people who already do so much, so I don't think it's my place to ask.

2) Use macros as a substitute for inline functions. A few comments beforehand: (a) I find macros acceptable when they either abstract out ugly but necessary syntax or reduce boilerplate code. (b) Macros as an interface to a UDT that's supposed to work seamlessly with your existing code are clunky, ugly, and nearly impossible to get right once expressions get complex. In my case, I would be writing macros for things like vector addition or bit array retrieval. This code usually sits inside more complex statements, so macros would not only increase boilerplate but also make the syntax look horrible and, worse, make it harder to reason about what's going on. So while macros may be nice for standalone functions, type members as macros are a hard pass.

3) Keep a parallel header file that contains all header and source files, and compile that when testing for performance. This is what I'm currently doing, and it's not great. Basically I keep every one of my project modules in a header, with both the .bi and .bas files lined up in dependency order. Then I compile just the main file. This effectively acts as "whole program optimization", since everything is loaded in at once and the compiler can do its thing. This, of course, takes a considerable amount of time to compile, so it's not ideal for ongoing development. But it works.

4) Use the -flto compiler optimization. Since I'm using the gcc backend, it should be possible to set up the linker to do link-time optimization (LTO). I haven't had much time to research it, but there appears to be a plugin you can get for the linker to do LTO, as the toolchain that the FreeBASIC compiler comes with does not have this by default. Has anyone used the link-time optimization flag (-flto) to compile FreeBASIC code? If you have, please share what you did to get it working.

(bonus) 5) Just deal with the performance degradation.... 0/10 terrible idea

Sorry for the novel... but does anyone else have this problem? How are you dealing with it?

Post by SARG »

If you're building for 64-bit, did you try compiling the 'one source' version with the gas64 option instead of gcc?
I'm curious to know if the compilation time is also long.

If it's fast enough, you can compile with gcc (with optimization if needed) only for your final version.

Post by marcov »

AFAIK, without LTO, gcc can only inline a function if it is lifted from the .c into the .h?

I don't have much experience with LTO (other than the principle).

Sometimes another workaround is to move some of the big loops around the functions into the same compilation unit as the class. E.g. never call such a function in a loop from outside; instead, move the loop into a function in the same compilation unit as the class that contains the method. But that depends on your algorithms, of course.
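
The "move the loop next to the method" workaround can be sketched in C, which is what -gen gcc ultimately hands to the compiler. This is only an illustration under assumptions: Vec3, vec3_add and vec3_add_array are hypothetical names, not code from anyone's project in this thread.

```c
#include <stddef.h>

/* Hypothetical vector type standing in for a UDT on the critical path. */
typedef struct { int x, y, z; } Vec3;

/* Internal helper: 'static' keeps it local to this compilation unit,
   so the compiler is free to inline it into the loop below. */
static Vec3 vec3_add(Vec3 a, Vec3 b)
{
    Vec3 r = { a.x + b.x, a.y + b.y, a.z + b.z };
    return r;
}

/* Exported batch entry point: callers in other modules pay for one
   call per array instead of one call per element. */
void vec3_add_array(Vec3 *dst, const Vec3 *a, const Vec3 *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = vec3_add(a[i], b[i]);
}
```

Other modules would call vec3_add_array once per batch, so the per-element work stays inside a single compilation unit where it can be inlined without LTO.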

Post by shadow008 »

SARG wrote: Oct 09, 2023 8:05 If you're building for 64-bit, did you try compiling the 'one source' version with the gas64 option instead of gcc?
I'm curious to know if the compilation time is also long.

If it's fast enough, you can compile with gcc (with optimization if needed) only for your final version.
I just tried to compile with -gen gas64, and while many of the modules compiled, I got some errors. I'll post on the Gas64 thread.
marcov wrote: Oct 09, 2023 11:39 AFAIK, without LTO, gcc can only inline a function if it is lifted from the .c into the .h?
That's pretty much all you have to do. Inlines are basically suggestions to the compiler to do what amounts to fancy text substitution. Inline functions would be found in the .h, not the .c, as they're a sort of template for the code to be generated "in line" where the function call happens.

As for your other suggestion, I wouldn't be able to lift the various class methods out of their home and plaster them all over my other modules. It would be too ugly and would add a lot of ongoing maintenance.

Post by marcov »

shadow008 wrote: Oct 09, 2023 19:32 That's pretty much all you have to do. Inlines are basically suggestions to the compiler to do what amounts to fancy text substitution.
Inlines and e.g. generics/templates can be done in several ways/levels: using token replay (which is closer to the text), or using a stored node form (which is closer to the IR).
shadow008 wrote: Oct 09, 2023 19:32 Inline functions would be found in the .h, not the .c, as they're a sort of template for the code to be generated "in line" where the function call happens.
This is a known C limitation due to the weak coupling between .h and .c. I hope C++ modules fix this and allow the code to stay in the .c(pp), just like in e.g. Delphi/Free Pascal.

If only because .h's are parsed so redundantly in C.

Post by shadow008 »

@SARG
With the fix for gen gas64 I was able to get the whole project compiling from scratch in just under 3 seconds (one source) and just over 4 seconds (modules+linking)! This is much faster than gcc. The performance is also within margin of error of gcc without whole program optimization. Whereas gcc with single source compilation (with -O3 flag) achieves 1.1ms/tick, and gcc module compilation (again -O3) achieves 2.3ms/tick, gas64 achieves 2.4ms/tick with either compilation method. Given that performance is well within budget, I'll be sticking with gas64 for development for now (assuming no bugs :D ), and gcc for performance measurements. That compile time is well worth the tradeoff.

Though I should be asking in the other thread, are there any optimization flags to add to improve performance?

Post by SARG »

@shadow008
Thanks for testing and reporting.
shadow008 wrote: Though I should be asking in the other thread, are there any optimization flags to add to improve performance?
Unfortunately no, at least for the moment.
I did optimizations at a low level by regrouping 2 asm lines ('peephole' optimization).
eg
mov r11, rax
mov -[72], r11
becomes
mov -[72], rax

It would be possible to extend the number of analysed lines to improve the optimizations, but obviously at the cost of increased compilation time.
Maybe also at a higher level, but good work has already been done in the FreeBASIC compiler.
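
The two-line regrouping above can be sketched as a tiny peephole pass in C. This is not gas64's actual code, which isn't shown in this thread: the Instr record and peephole_mov_pairs are hypothetical, and the sketch assumes the scratch register is dead after the pair and that the first source is not a memory operand (x86 has no memory-to-memory mov).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified instruction record (operands as text). */
typedef struct {
    char op[8];
    char dst[16];
    char src[16];
} Instr;

/* Collapse "mov scratch, X" followed by "mov Y, scratch" into "mov Y, X".
   Rewrites the array in place and returns the new instruction count.
   Assumption: 'scratch' is not read again after the pair. */
size_t peephole_mov_pairs(Instr *code, size_t n, const char *scratch)
{
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 1 < n
            && strcmp(code[i].op, "mov") == 0
            && strcmp(code[i + 1].op, "mov") == 0
            && strcmp(code[i].dst, scratch) == 0
            && strcmp(code[i + 1].src, scratch) == 0
            && strchr(code[i].src, '[') == NULL) {   /* X must not be memory */
            Instr merged = code[i + 1];
            strcpy(merged.src, code[i].src);
            code[out++] = merged;
            i++;                                     /* skip the second mov */
        } else {
            code[out++] = code[i];
        }
    }
    return out;
}
```

Widening the window beyond two instructions means more comparisons per line of assembly, which is the compile-time cost mentioned above.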

If it's possible send me your code and I'll see if I can improve something in gas64 emitter.

One piece of advice to speed up execution: use INTEGER/UINTEGER most of the time to avoid conversions, as all calculations (except for floats) are done with these datatypes.

Post by TeeEmCee »

FB does have inline functions (when using -gen gcc). The keyword is PRIVATE (which is equivalent to 'static' in C). Marking a function PRIVATE indicates that the function is internal to that module and can't be called from outside it. But it also tells the compiler not to export that symbol (so for example on GNU/Linux backtrace_symbols() won't know it, and it may also be missing from debug symbols). It also encourages GCC/Clang to inline the function. For that reason I actually "#define LOCAL" (expanding to nothing) and use it instead of PRIVATE, for code documentation purposes, to mark internal functions which I *don't* want inlined, in order to get better debug info and stacktraces.

You can mark subs, functions, methods and operators as private. You need to place them into a .bi file if you want them to be inlined into subs in multiple files, e.g. for methods of a vector type, just like in C. If you compile without optimisations, or if the compiler chooses not to inline them, this will cause your executable to grow in size because the functions will be duplicated in different modules.
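
Since PRIVATE is described above as the equivalent of C's 'static', the C side of the analogy looks roughly like this. It is a sketch in C rather than FreeBASIC, and the header name, Vec3i, vec3i_sum and vec3i_add are all hypothetical; the extra 'inline' keyword is only a hint on top of 'static'.

```c
/* vec3i.h (hypothetical header), included by every module that needs it */
#ifndef VEC3I_H
#define VEC3I_H

typedef struct { int x, y, z; } Vec3i;

/* 'static' gives each including module its own private copy, so the
   optimiser can inline it at every call site. If it is not inlined,
   the body is duplicated once per module that uses it. */
static inline int vec3i_sum(const Vec3i *v)
{
    return v->x + v->y + v->z;
}

static inline Vec3i vec3i_add(Vec3i a, Vec3i b)
{
    Vec3i r = { a.x + b.x, a.y + b.y, a.z + b.z };
    return r;
}

#endif /* VEC3I_H */
```

The trade-off is the one already mentioned: with optimisations off, or when the compiler declines to inline, each module carries its own copy and the executable grows.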

LTO works too, with a few workarounds, on Windows, Linux, although it didn't work on Mac for some reason. I think Emscripten always uses LTO. I use cross-compilation from Linux to Windows using a different mingw-w64 toolchain (mxe) but LTO is a standard feature of GCC which probably works with the GCC packaged with FB for Windows.

In order to avoid the "plugin needed to handle lto object" error from ld, link your .o files together using gcc instead of letting fbc do it (using ld):
- compile modules with fbc -gen gcc -asm att -Wc -flto,-fno-strict-aliasing -m <mainmodule> -c <module.bas>
- link with gcc -O2 -fno-strict-aliasing -Wno-lto-type-mismatch <modules.o> -L<path to FB's libraries> -lfb -o <executable> (or -lfbmt)
-fno-strict-aliasing -Wno-lto-type-mismatch get rid of some warnings. They might not be strictly needed, but I think without -fno-strict-aliasing there is a risk of miscompiling.

---Regarding gas64---

SARG: you've done a heck of a lot of work on gas64. I've been waiting for it to become stable enough to use. Been a while since I tried it... well finally I can compile my whole 124kLOC codebase and run all tests without failures! Awesome! Whoops, actually it doesn't compile without -g or without -exx. Oh man. It's doing a replacement of "rsi" with "r11" including in identifiers! I'll report those problems. So all optimisations are performed regardless of the -O flag? It made no difference to speed.

Here are total timings for a suite of benchmarks of a slow AST-walking script interpreter written in FB including various functions called from it. Probably none of the code is vectorizable, and very little maths. Comparing to GCC 12.2.0 with -O2.

Code: Select all

x86 -gen gcc:      11491
x86 -gen gcc lto:  11253
x86 -gen gas:      23764
x64 -gen gcc:      10916
x64 -gen gcc lto:  10609
x64 -gen gcc lto:  10126  (compiled with -O3)
x64 -gen gas64:    24154
So very similar ratios to shadow008. I'm surprised it's not faster than the gas backend though, which I don't think does register move optimisations like that at all?
SARG wrote: Oct 10, 2023 7:55 One piece of advice to speed up execution: use INTEGER/UINTEGER most of the time to avoid conversions, as all calculations (except for floats) are done with these datatypes.
Huh? Are 32-bit integers really slower than 64-bit ones in the gas64 backend? What about multiplications and divisions? I use 32-bit ints exclusively. I even do "#undef integer" "type integer as long" in 64-bit builds.

Post by adeyblue »

TeeEmCee wrote: Oct 11, 2023 2:03 Huh? Are 32-bit integers really slower than 64-bit ones in the gas64 backend? What about multiplications and divisions? I use 32-bit ints exclusively. I even do "#undef integer" "type integer as long" in 64-bit builds.
He's saying that in general FB does non-float calculations as integer regardless of the size of the integral types involved. To wit:

Code: Select all

Type NumType As Integer

Function RandInt() as NumType
   Return Rnd * &h7fffffff
End Function

dim as NumType a,b,c
a = RandInt() '' just so it doesn't constant fold to the result 
b = RandInt()
c = a * b

? c
This generates the following (gas64 O 2), pretty normal

Code: Select all

   call RANDINT
   mov -72[rbp], rax #Optim 2
   
   call RANDINT
   mov -80[rbp], rax #Optim 2
   
   mov r11, QWORD PTR -72[rbp]
   imul r11, QWORD PTR -80[rbp]
   mov -88[rbp], r11
   xor ecx, ecx
   mov rdx, -88[rbp]
   mov r8d, 1
   call fb_PrintLongint
But if you change NumType to Long, this happens instead

Code: Select all

   call RANDINT
   mov -64[rbp], eax #Optim 16
   
   call RANDINT
   mov -68[rbp], eax #Optim 16
   
   movsxd r11, DWORD PTR -64[rbp]
   movsxd r10, DWORD PTR -68[rbp]
   imul r11, r10
   
   mov -72[rbp], r11d #Optim 2
   xor ecx, ecx
   mov edx, -72[rbp]
   mov r8d, 1
   call fb_PrintInt
movsxds are inserted (sometimes it's cdqe) so that the operation is done in the native integer size instead of longs. GCC probably elides most of these at higher optimization levels, but if you look at the generated C code, the casts to int64 are there.

Code: Select all

	int32 vr$4 = RANDINT(  );
	A$0 = vr$4;
	int32 vr$5 = RANDINT(  );
	B$0 = vr$5;
	C$0 = (int32)((int64)A$0 * (int64)B$0);
So his advice is that if you're using an abundance of shorts, ubytes or longs in 64-bit compiles, you'll probably be paying for at least some of these conversions even when using GCC, whereas if you used more integers/longints, you wouldn't.

Post by TeeEmCee »

(EDIT: argh, I forgot/overlooked the C code you posted.)

Thanks for the testcase. But I think SARG was talking about what gas64 does (uses 64-bit registers most of the time), not about what FB does in general. Because I don't think it's true that FB uses 64-bit ints for intermediate calculation results even for operations on 32-bit ints: if you look at the output of -gen gcc there are no 64-bit ints anywhere in that case. (EDIT: see followup post.) Maybe you're thinking of the rule in C that operands to arithmetic operators are promoted to "int" (which is 32-bit on any modern machine) if they are smaller.

If you replace the multiplication with \ or MOD, gas64 does use a 32-bit idiv rather than 64-bit --- but SARG just implemented that back in August. I had a look at Agner Fog's x86 instruction tables and found that on my old AMD Bulldozer CPU 32-bit int multiplies are 2x faster than 64-bit ones, while on an Intel CPU the 64-bit ones are faster, yikes! (On a modern AMD 32- and 64-bit multiplication are the same speed.)
Last edited by TeeEmCee on Oct 11, 2023 6:56, edited 2 times in total.

Post by TeeEmCee »

So this is the output of -gen gcc I get with latest git, when I change the multiplication to \:

Code: Select all

	int32 vr$4 = RANDINT(  );
	A$0 = vr$4;
	int32 vr$5 = RANDINT(  );
	B$0 = vr$5;
	C$0 = (int32)(int64)(A$0 / B$0);
I failed to notice the casts are still there when doing a multiplication, as you showed. (Also, they're still there when you use a ulong instead.)

Actually, GCC ignores the int64 casts and uses a 32-bit imul instruction even at -O0.

It's quite funny. What's going on here is that Jeff tricked me by recently (July) optimising away the casts to 64-bit ints. Actually he did that before SARG updated gas64 to be able to make use of that optimisation by using a different imul instruction. Yes, FB uses 64-bit intermediates, a fact I immediately forgot when seeing assembly/C that suggested it doesn't.
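
Why a compiler can drop the widening casts for a multiply can be sketched in C: truncating a 64-bit product to 32 bits keeps the same low 32 bits as a wrapping 32-bit multiply. The function names are hypothetical, and the sketch assumes a compiler such as GCC where converting an out-of-range value to int32_t wraps modulo 2^32.

```c
#include <stdint.h>

/* The shape of the generated C quoted earlier: widen, multiply, truncate.
   The 64-bit product of two 32-bit values cannot overflow. */
int32_t mul_widened(int32_t a, int32_t b)
{
    return (int32_t)((int64_t)a * (int64_t)b);
}

/* A 32-bit-only multiply with wraparound, done on unsigned values
   to avoid signed-overflow undefined behaviour. */
int32_t mul_wrapped32(int32_t a, int32_t b)
{
    return (int32_t)((uint32_t)a * (uint32_t)b);
}
```

Both functions return the same value for every pair of inputs under that assumption, which is why the cheaper 32-bit instruction is a valid substitute. The same shortcut does not hold for division, where the high bits of the operands affect the result.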

Post by SARG »

Yes, gas64 is stable, even if a bug is discovered from time to time ;-)

Compiling successfully only with -g is obviously not normal. I'll look at that (I saw your post on the gas64 thread).

On my PC, even though it's not an old CPU, div64 is slower than div32...

Thanks, adeyblue, for clarifying.

About the change in July: Jeff and I both worked on this point (based on my work), but as always Jeff implemented it all in a better way.

Post by shadow008 »

TeeEmCee wrote: Oct 11, 2023 2:03 FB does have inline functions (when using -gen gcc). The keyword is PRIVATE (which is equivalent to 'static' in C). Marking a function PRIVATE indicates that the function is internal to that module and can't be called from outside it.
I suppose you're right! Given my experience with how aggressively gcc will inline things with optimizations turned on, this is a direct workaround.
TeeEmCee wrote: Oct 11, 2023 2:03 I use 32-bit ints exclusively. I even do "#undef integer" "type integer as long" in 64-bit builds.
Out of curiosity, why? I can come up with many cases where using 32bit ints over native int size (64 probably) is preferable. But what's your reason? And why not use the integer<bit> notation instead of redefining the keyword altogether? My codebase exclusively uses integer<32> and integer<64> when specificity is needed, and just the native integer when it doesn't matter (bytes and shorts haven't been ambiguous in long enough for me to ever care).

Post by marcov »

TeeEmCee wrote: Oct 11, 2023 2:03 FB does have inline functions (when using -gen gcc). The keyword is PRIVATE (which is equivalent to 'static' in C). Marking a function PRIVATE indicates that the function is internal to that module and can't be called from outside it.
(In my messages, I assumed gen gcc would already do inlining within the compilation unit, so those posts are specifically about cross compilation unit inlining)

Post by TeeEmCee »

shadow008 wrote: Oct 13, 2023 4:58 Out of curiosity, why? I can come up with many cases where using 32bit ints over native int size (64 probably) is preferable. But what's your reason?
This is the OHRRPGCE game engine, BTW. Answering both why 32-bit and why redefine:
The most important reason is so that in-memory and on-disk data formats don't change between 32/64-bit builds. Also, the engine has a script interpreter and we want perfect compatibility of scripts between 32/64-bit builds, so the basic script datatype is always a 32-bit int. Scripts can access a huge amount of data, so if they could read/write to 64-bit data structures there may be problems all over. I didn't want to have to do a replacement of "integer" throughout the whole codebase, which at the time was about 80k lines of FB (now much more), and which would have been a huge pain for my many unmerged branches. Also important, I didn't want to increase memory usage and decrease speed.
(EDIT: Also, to be honest, a big reason: when you're so used to "integer", changing to something else feels ugly.)
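
The data-format point can be illustrated with a small C sketch: fields with a fixed 32-bit width, written byte by byte, produce the same bytes on disk whether the build is 32-bit or 64-bit. HeroRecord and the helper names are hypothetical and not taken from the OHRRPGCE source; little-endian byte order is also an assumption made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: widths are fixed, unlike a native "integer". */
typedef struct {
    int32_t hp;
    int32_t gold;
} HeroRecord;

/* Store a 32-bit value as 4 little-endian bytes. */
static void put_i32le(uint8_t *out, int32_t v)
{
    uint32_t u = (uint32_t)v;
    out[0] = (uint8_t)(u & 0xFF);
    out[1] = (uint8_t)((u >> 8) & 0xFF);
    out[2] = (uint8_t)((u >> 16) & 0xFF);
    out[3] = (uint8_t)((u >> 24) & 0xFF);
}

/* Read 4 little-endian bytes back into a 32-bit value. */
static int32_t get_i32le(const uint8_t *in)
{
    uint32_t u = (uint32_t)in[0]
               | ((uint32_t)in[1] << 8)
               | ((uint32_t)in[2] << 16)
               | ((uint32_t)in[3] << 24);
    return (int32_t)u;
}

/* Encode a record into exactly 8 bytes; returns the byte count. */
size_t hero_record_encode(const HeroRecord *r, uint8_t out[8])
{
    put_i32le(out, r->hp);
    put_i32le(out + 4, r->gold);
    return 8;
}

/* Decode a record from the same 8-byte layout. */
void hero_record_decode(HeroRecord *r, const uint8_t in[8])
{
    r->hp = get_i32le(in);
    r->gold = get_i32le(in + 4);
}
```

Had the fields been a native-width integer type, the record size and layout would differ between 32-bit and 64-bit builds, which is the incompatibility being avoided.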

On the other hand when the codebase was ported from QB to FB 'integer' changed from 16 to 32-bits so all the in-memory data structures changed, with lots of work done to not change the on-disk formats -- many remain based on 16-bit ints to this day! For a year we could compile with both QB and FB and put out separate builds. I don't think changing the script interpreter from int16 to int32 ever broke anyone's scripts.