Realtime CPU Raytracing.
For a while now I've been playing with a voxel-based, hybrid, raytracing/raymarching renderer.
It's CPU based, and the point was to explore the performance bounds of CPU rendering with clever coding and algorithms.
At some point I should stop and just port the damn thing to GPU, but every time I think I've squeezed all the speed I can out of it, I think of another improvement.
I'm curious, have any other users here played with such a thing, or have an intuition of what a "good" frame-rate for such a project might be?
I've yet to break out the ASM (I would have if my CPU supported AVX2; it's an i7-3770), but I am doing a fair amount of bit-fiddling for texture look-ups etc.
Re: Realtime CPU Raytracing.
I am not a graphics programmer, but I would suggest that before you try inline ASM you consider using GCC's built-in functions: https://gcc.gnu.org/onlinedocs/gcc-4.8. ... tions.html and https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html. Of course, that means compiling with the GCC back-end. All that's required to use those functions in FB is to declare them in your program.
Re: Realtime CPU Raytracing.
That seems like way more work TBH. But maybe the compiler can optimize better like that?
What do you mean by "just declare them"? I am using the GCC back-end, but I get a "fake:...:undefined reference" error if I just do that.
Testing code:
Code:
declare sub __builtin_cpu_init cdecl()
__builtin_cpu_init()
'Compiler output:
'C:\FreeBASIC-1.08.1-winlibs-gcc-9.3.0\bin\win32\ld.exe: C:\FreeBASIC-1.08.1-winlibs-gcc-9.3.0\FBIDETEMP.o:fake:(.text+0x3d): undefined reference
'to `__BUILTIN_CPU_INIT'
Re: Realtime CPU Raytracing.
You need to enclose the declarations in an extern "c" block:
Code:
extern "c"
declare function __builtin_cpu_init () as long
declare function __builtin_cpu_is (byval as const zstring ptr) as long
end extern
__builtin_cpu_init ()
if __builtin_cpu_is ("intel") then
? "Intel CPU."
end if
if __builtin_cpu_is ("atom") then
? "Intel Atom CPU."
end if
if __builtin_cpu_is ("core2") then
? "Intel Core 2 CPU."
end if
if __builtin_cpu_is ("corei7") then
? "Intel Core i7 CPU."
end if
if __builtin_cpu_is ("nehalem") then
? "Intel Core i7 Nehalem CPU."
end if
if __builtin_cpu_is ("westmere") then
? "Intel Core i7 Westmere CPU."
end if
if __builtin_cpu_is ("sandybridge") then
? "Intel Core i7 Sandy Bridge CPU."
end if
if __builtin_cpu_is ("amd") then
? "AMD CPU."
end if
if __builtin_cpu_is ("amdfam10h") then
? "AMD Family 10h CPU."
end if
if __builtin_cpu_is ("barcelona") then
? "AMD Family 10h Barcelona CPU."
end if
if __builtin_cpu_is ("shanghai") then
? "AMD Family 10h Shanghai CPU."
end if
if __builtin_cpu_is ("istanbul") then
? "AMD Family 10h Istanbul CPU."
end if
if __builtin_cpu_is ("btver1") then
? "AMD Family 14h CPU."
end if
if __builtin_cpu_is ("amdfam15h") then
? "AMD Family 15h CPU."
end if
if __builtin_cpu_is ("bdver1") then
? "AMD Family 15h Bulldozer version 1."
end if
if __builtin_cpu_is ("bdver2") then
? "AMD Family 15h Bulldozer version 2."
end if
if __builtin_cpu_is ("bdver3") then
? "AMD Family 15h Bulldozer version 3."
end if
if __builtin_cpu_is ("btver2") then
? "AMD Family 16h CPU."
end if
Last edited by srvaldez on Sep 01, 2021 11:27, edited 1 time in total.
Re: Realtime CPU Raytracing.
Got it.
Re: Realtime CPU Raytracing.
@Manpcnin
I edited my post as it contained errors.
Re: Realtime CPU Raytracing.
I have some suggestions, based on what I have learned over the years:
1) Don't waste your time with ASM: currently, GCC (which FreeBASIC uses as a back-end) offers good optimizations, and compiled code can be as fast as (or even faster than) hand-written ASM. So, unless you are really good at coding in assembly, don't bother with it; it's not worth it (it would also make your code harder to port to different architectures, like ARM processors).
2) Either code your renderer for the CPU, or for the GPU. Don't expect to write it for the CPU and then port it to the GPU, because it's a completely different way of coding, and most of the tricks that make code faster on the CPU would actually make it slower and less efficient on the GPU (for example, on the CPU branching can save a lot of time, while on the GPU it often creates a severe bottleneck). So porting code you wrote for the CPU to the GPU might require more effort than rewriting the same code from scratch.
Re: Realtime CPU Raytracing.
angros47 wrote: 1) don't waste your time with ASM...
Do you really think the compiler is going to 8 x vectorize a core loop into a pseudo "warp" with per-pixel early exit detection using branchless SIMD? Even if this requires algorithmic changes elsewhere?
The kind of ASM I'm looking at writing is, I think, a lot more than what an optimizing compiler would do anyway. And I'm not there yet: I still have two big algorithm improvements to implement first (which I've been putting off because they will involve a ton of refactoring).
angros47 wrote: ...on the CPU branching can save a lot of time...
I'm actually removing as many branches as possible because they are causing pipeline stalls. And I'm aware of the architectural differences between CPU & GPU. This project has still taught me techniques that I'm sure will work on both, though with different levels of benefit, of course. It's still very parallel. (Currently using all 8 of my virtual cores, but there's no reason it can't scale to thousands of cores.)
--
At any rate, both of you seem to be missing the point of the project, AND the point of the post.
The point of the project is to see what is possible: can sufficient determination & cleverness overturn the conventional wisdom that CPU raytracing is just too slow?
And the point of the post wasn't to gather advice. (Although if anyone has actually written something similar, I'd love to hear your approach and any tricks you discovered.)
The point was to see what people's intuition of "too slow" actually is. Have I already beaten it? Am I getting close? Or am I a long way off still?
This is why I haven't posted any actual performance numbers ("X fps @ 1080p / Y million voxels"): I don't want to influence your idea of what "reasonable performance" is.
Re: Realtime CPU Raytracing.
Manpcnin wrote: Do you really think the compiler is going to 8 x vectorize a core loop, into a pseudo "warp" with per-pixel early exit detection using branchless SIMD? Even if this requires algorithmic changes elsewhere?
Does your algorithm do that, currently? On the CPU?
Manpcnin wrote: The point of the project is to see what is possible. Can sufficient determination & cleverness overturn the conventional wisdom that CPU raytracing is just too slow?
Too slow for what? It always depends on the complexity of what you are trying to render.
This example used the CPU to render a dynamic voxel based world, and it did it 20 years ago: https://www.youtube.com/watch?v=irvFiTsY05Q. So even a slow CPU can perform good rendering of voxels, as long as you don't expect excessive resolution, I guess.
Re: Realtime CPU Raytracing.
angros47 wrote: Does your algorithm do that, currently? On the CPU?
Not yet. That's what I would break out the hand coded assembly for.
angros47 wrote: Too slow for what? It always depends on the complexity of what you are trying to render...
Sure. Any performance numbers will always be dependent on the scene complexity: too slow for a given complexity. For example, I personally think that 1 million voxels @ 20FPS @ 1080p is too slow. I'm asking for people's intuitions here, not some hard cutoff. It's always going to be slower than GPU anyway, so if that's your heuristic, it's ALWAYS going to be "too slow".
As for that example, I don't think it's even raytraced. I don't see any lighting effects that indicate it might be. But just as a CPU rasterizer, it's not bad. I've written similar.
Re: Realtime CPU Raytracing.
I believe a real-time ray tracing renderer can be developed to run on a laptop cpu. Euclideon's Unlimited Detail comes to mind. I guarantee there are at least a dozen undiscovered techniques, each one capable of increasing performance by 10%
Re: Realtime CPU Raytracing.
Manpcnin wrote: Sure. Any performance numbers will always be dependent on the scene complexity. Too slow, for a given complexity. For example I personally think that 1 million voxels @ 20FPS @ 1080p is too slow.
Then, in many cases, the best solution is to reduce the complexity, considering that you will have to render only a small amount of the given voxels. So, use octrees to simplify the scene rendering and discard everything that is not in view. Also, try to remove all the polygons that won't be visible anyway: Minecraft, for example, doesn't render ALL the cubes: it combines them into much larger meshes, using some CSG algorithms, and then renders those meshes, reducing the number of surfaces to render to a more manageable amount (and this elaboration can be done on the CPU, of course; actually, it has to happen on the CPU, to allow using collisions and physics, too).
Re: Realtime CPU Raytracing.
In the scope of ray tracing or ray marching, I as a human can beat any GCC optimization with inlined (#define) ASM code :-)
Have you tested shadertoy.com's OpenGL shading language with FreeBASIC?
Today, OpenCL with FreeBASIC is my first choice for GPU number crunching!
Joshy
Re: Realtime CPU Raytracing.
I've tested it on a puny 2 core 1.1Ghz Celeron at interactive (realtime at very low resolution) framerates. So it does run, for certain definitions of "run". And you get it! It's the thrill of finding those compounding "10%" improvements, and the increasing familiarity with the problem domain, that keeps me coming back to the challenge.
angros47 wrote: ↑Sep 02, 2021 9:32 ...try to remove all the polygons that won't be visible anyway: Minecraft, for example, doesn't render ALL the cubes: it combines them into much larger meshes, using some CSG algorithms, and then renders those meshes, reducing the number of surfaces to render to a more manageable amount (and this elaboration can be done on the CPU, of course: actually, it has to happen on the CPU, to allow using collisions and physics, too)
This is a voxel based engine. There are no polygons or meshes. Also, a lot of occlusion and frustum culling falls out of my algorithm automatically without needing to be explicitly coded: nothing ever gets rendered that doesn't affect a pixel on the screen. Ray-marching is very different to rasterization. Oct-trees have been a carefully considered feature. They need to be implemented carefully for the scaling to outweigh the overheads. My current code is extremely lean (it has to be), and any extra branching & instructions have to pull their weight. If I end up implementing it, it will probably need to be a 64-tree, not an oct-tree.
D.J.Peters wrote: ↑Sep 03, 2021 0:47 ...a human can beat any GCC optimization with inlined (#define) ASM code... Have you tested shadertoy.com's OpenGL shading language with FreeBASIC? Today, OpenCL with FreeBASIC is my first choice for GPU number crunching!
Thanks for the encouragement! I am woefully inexperienced with shader languages and OpenCL, and it's something I really should remedy.
How portable is OpenCL code?