If you enjoy blog posts like this one, you would definitely like the Julia REPL as well. I used to play with it a lot to discover compiler things.
For example:
$ julia
julia> function f(n)
           total = 0
           for x in 1:n
               total += x
           end
           return total
       end
julia> @code_native f(10)
...
sub x9, x0, #2
mul x10, x8, x9
umulh x8, x8, x9
extr x8, x8, x10, #1
add x8, x8, x0, lsl #1
sub x0, x8, #1
ret
...
It shows this with nice colors right in the REPL.
In the example above, you can see that LLVM figured out the arithmetic series and replaced the loop with a simple multiplication: the closed form n(n+1)/2, computed with a widening multiply (mul/umulh) and a shift of the 128-bit product (extr) instead of iterating.
This, and add_v3 in the OP, falls into the general class of scalar evolution (SCEV) optimizations. LLVM, for example, is able to handle almost all Brainfuck loops in practice (add_v3 indeed corresponds to the Brainfuck loop `[->+<]`), and its SCEV implementation is truly massive: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Anal...
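To make the Brainfuck connection concrete, here is a minimal C sketch of that `[->+<]` loop (a hypothetical example; `p` is assumed to point at the current cell):

/* C equivalent of the Brainfuck loop [->+<]: while the current cell
   is non-zero, decrement it and increment the neighbouring cell. */
void bf_add(unsigned char *p) {
    while (p[0] != 0) {
        p[0] -= 1;
        p[1] += 1;
    }
}

At -O3, SCEV works out that the body runs exactly p[0] times, so clang rewrites this to the equivalent of `p[1] += p[0]; p[0] = 0;` with no loop at all.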
LLVM can do more complex sums, too. See https://kristerw.blogspot.com/2019/04/how-llvm-optimizes-geo...
I always code with the mindset “the compiler is smarter than me.” No need to twist my logic around attempting to squeeze performance out of the processor - write something understandable to humans, let the computer do what computers do.
This is decent advice in general, but it pays off to try to express your logic in a way that is machine friendly. That mostly means thinking carefully about how you organize the data you work with. Optimizers generally don't change data structures or memory layout, but layout can make an orders-of-magnitude difference in the performance of your program, and it is often difficult to refactor later.
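A sketch of what "organize the data" tends to mean in practice (hypothetical example): no optimizer will turn the first layout into the second, yet the second loop touches a fraction of the memory.

#include <stddef.h>

/* Array-of-structs: each mass sits 64 bytes from the next, so the hot
   loop below drags in a whole cache line per element. */
struct particle { float x, y, z, mass; float history[12]; };

float total_mass_aos(const struct particle *p, size_t n) {
    float m = 0.0f;
    for (size_t i = 0; i < n; i++)
        m += p[i].mass;
    return m;
}

/* Struct-of-arrays: the same loop now streams a dense float array,
   which is cache- and SIMD-friendly. */
struct particles { float *x, *y, *z, *mass; };

float total_mass_soa(const struct particles *ps, size_t n) {
    float m = 0.0f;
    for (size_t i = 0; i < n; i++)
        m += ps->mass[i];
    return m;
}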
I find the same. GCC and Clang can inline functions, but they can't decide to break apart a struct used only among those inlined functions, make every struct member a local variable, and then decide that one or more of those locals should be kept in a register for the full lifetime of the function rather than spilled onto the stack.
So if you use a messy solution, where something that should be a struct operated on by functions is instead just a pile of local variables inside a single function, and you use macros operating on those locals instead of inlineable functions operating on structs, you get massively better performance.
e.g. a hypothetical sketch of the pattern:
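#include <stdint.h>

/* The "clean" version: state lives in a struct, stepped by a function
   (names and constants here are made up for illustration). */
typedef struct { uint32_t state; uint32_t carry; } rng_t;

static uint32_t rng_step(rng_t *r) {
    r->state = r->state * 1664525u + r->carry;
    r->carry += 1013904223u;
    return r->state;
}

uint32_t sum_clean(uint32_t n) {
    rng_t r = {1u, 0u};
    uint32_t total = 0u;
    while (n--) total += rng_step(&r);
    return total;
}

/* The "messy" version described above: the struct members become plain
   locals, which the compiler happily keeps in registers for the whole
   function, and a macro stands in for the inlineable function. */
#define RNG_STEP(state, carry) \
    ((state) = (state) * 1664525u + (carry), (carry) += 1013904223u, (state))

uint32_t sum_messy(uint32_t n) {
    uint32_t state = 1u, carry = 0u, total = 0u;
    while (n--) total += RNG_STEP(state, carry);
    return total;
}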
I guess the chances of the compiler doing something smart increase with link-time optimization and with keeping as much as possible inside the same "compilation unit" (in practice, the same source file).
The nice thing about godbolt is that it can show you that clang not only can do it in theory but actually does it in practice:
https://aoco.compiler-explorer.com/#g:!((g:!((g:!((h:codeEdi...
Turning stack-allocated variables into locals (which can then be put in registers) is one of the most important passes in modern compilers.
Since compilers use SSA, where locals are immutable, while lots of languages like C have mutable variables, some compiler frontends put all locals onto the stack and let the compiler figure out what can be promoted into registers and how.
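A tiny sketch of what that promotion deals with (hypothetical example, following the behavior of LLVM's mem2reg pass):

/* `x` is mutated, so a straightforward frontend gives it a stack slot.
   A mem2reg/SROA-style pass rewrites it into immutable SSA values, one
   per assignment, merged with a phi node at the join, and the loads
   and stores disappear. */
int clamp_sum(int a, int b) {
    int x = a + b;  /* first SSA value */
    if (x < 0)
        x = 0;      /* second SSA value */
    return x;       /* phi of the two values above */
}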
To give a more specific example: if you malloc()/free() within a loop, it's unlikely that the compiler will fix that for you. However, moving those calls outside of the loop (plus maybe adding some realloc()s within, only if needed) is probably going to perform better.
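For instance (a minimal sketch with a made-up processing loop):

#include <stdlib.h>
#include <string.h>

/* Naive: one malloc/free per row. */
void process_naive(const char **rows, size_t nrows) {
    for (size_t i = 0; i < nrows; i++) {
        size_t len = strlen(rows[i]);
        char *buf = malloc(len + 1);
        if (!buf) return;
        memcpy(buf, rows[i], len + 1);
        /* ... work on buf ... */
        free(buf);
    }
}

/* Hoisted: allocate once, grow with realloc only when needed. */
void process_hoisted(const char **rows, size_t nrows) {
    size_t cap = 64;
    char *buf = malloc(cap);
    if (!buf) return;
    for (size_t i = 0; i < nrows; i++) {
        size_t len = strlen(rows[i]);
        if (len + 1 > cap) {
            cap = len + 1;
            char *grown = realloc(buf, cap);
            if (!grown) { free(buf); return; }
            buf = grown;
        }
        memcpy(buf, rows[i], len + 1);
        /* ... work on buf ... */
    }
    free(buf);
}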
I would take it one step further: often, trying to eke out performance gains with clever tricks can hurt performance by causing you to "miss the forest for the trees".
I work with CUDA kernels a lot for computer vision. I am able to consistently and significantly improve on the performance of research code without any fancy tricks, just with good software engineering practices.
By organising variables into structs, improving naming, using helper functions, etc... the previously impenetrable code becomes so much clearer and the obvious optimisations reveal themselves.
Not to say there aren't certain tricks / patterns / gotchas / low level hardware realities to keep in mind, of course.
> I always code with the mindset “the compiler is smarter than me.”
Like with people in general, it depends on which compiler/interpreter we're talking about. I'll freely grant that clang is smarter than me, but CPython for sure isn't. :)
More generally, canonicalization goes very far, but no farther than language semantics allows. Not even the notorious "sufficiently smart compiler" with infinite time can figure out what you don't tell it.
To add to this, low-level constraints also make this assumption noisy, no matter how smart the compiler is. In CPython's case, if you do `dis.dis('DAY = 24 * 60 * 60')` you will see that constant folding nicely converts it to `LOAD_CONST 86400`. However, if you try `dis.dis('ATOMS_IN_THE_WORLD = 10**50')` you will get `LOAD_CONST 10`, `LOAD_CONST 50`, `BINARY_OP **`, because CPython's constant folder refuses to produce results beyond a size limit.
> I always code with the mindset “the compiler is smarter than me.”
...I don't know... for instance the MSVC compiler creates this output for the last two 'non-trivial' functions with '/Ox':
add w8,w1,w0
cmp w0,#0
cseleq w0,w1,w8
Even beginner assembly coders on their first day wouldn't write such bullshit :)
A better mindset is "don't trust the compiler for code that's actually performance sensitive".
You shouldn't validate each line of compiler output, but at least for the 'hot areas' in the code base that definitely pays off, because sometimes compilers do really weird shit for no good reason (often because of 'interference' between unrelated optimizer passes) - and often you don't need to dig deep to stumble over weird output like in the example above.
I see the MSVC ARM compiler has not improved much in 20 years. It was pretty odd when we used it in ~2003; we did not trust it at all. I think we had to get 4 or so compiler fixes out of MS for that project, plus 3 or 4 library fixes. The x86 one was pretty solid. We were targeting 4 different CPU platforms at the same time, so we could find things like that decently quickly. Most of the time it was something we did that was weird, but even then we would find them. That one looks like maybe the optimizer back-filled a nop slot?
The fact that compilers are smart isn't an excuse to not think about performance at all. They can't change your program architecture, algorithms, memory access patterns, etc.
You can mostly not think about super low level integer manipulation stuff though.
You say that, but I was able to reduce the code size of some AVR8 stuff I was working on by removing a whole bunch of instructions that zero out registers and then shift a value around. I don't need it to literally shift the top byte 24 bits to the right and zero out the upper 24 bits; I just need it to pass the value in the top 8 bits directly to the next operation.
I agree that most people are not writing hand-tuned avr8 assembly. Most people aren't attempting to do DSP on 8-bit AVRs either.
Also, not all software needs optimization to the bone. Pareto principle, as always: you don't need the best, just good enough. Not every company is Google-scale anyway.
You can fool the optimizer, but you have to work harder to do so:
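For instance (a sketch of the general idea; the thread's original snippet was compiled for armv8-a with clang 21.1.0 at -O3, but this example is a stand-in): accumulate the same series in a double. IEEE floating-point semantics forbid the reassociation the closed form needs, so without -ffast-math the loop survives at -O3.

/* Same arithmetic series as add_v3, but accumulated in a double:
   the compiler cannot legally replace this with n*(n+1)/2, so a
   genuine loop remains in the output. */
double add_fp(int n) {
    double total = 0.0;
    for (int i = 1; i <= n; i++)
        total += i;
    return total;
}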
This post assumes C/C++ style business logic code.
Anything HPC will benefit from thinking about how things map onto hardware (or, in case of SQL, onto data structures).
I think way too few people use profilers. If your code is slow, profiling is the first tool you should reach for. Unfortunately, the state of profiling tools outside of Nsight and Visual Studio (non-Code) is pretty disappointing.
Sometimes you can fool the compiler :-)
See "Example 2: Tricking the compiler" in my blog post about O3 sometimes being slower than O2: https://barish.me/blog/cpp-o3-slower/
Awesome blog post! Thanks to this I found out that you can view what the LLVM optimizer pipeline does, and which pass is actually responsible for which instruction.
It's super cool to see this in practice, and for me it helps build trust that the compiler does the right thing, rather than me trying to micro-optimize my code and peppering inline qualifiers everywhere.
With this one I instead wondered: If there are 4 functions doing exactly the same thing, couldn't the compiler also only generate the code for one of them?
E.g. if in `main` you called two different add functions, couldn't it optimize one of them away completely?
It probably shouldn't do that if you create a dynamic library that needs a symbol table but for an ELF binary it could, no? Why doesn't it do that?
This is not quite what you asked, I think, but GCC is able to remove duplicate functions and variables after code generation via the -fipa-icf options:
> Perform Identical Code Folding for functions (-fipa-icf-functions), read-only variables (-fipa-icf-variables), or both (-fipa-icf). The optimization reduces code size and may disturb unwind stacks by replacing a function by an equivalent one with a different name. The optimization works more effectively with link-time optimization enabled.
In addition, the Gold linker supports a similar feature via `--icf={safe,all}`:
> Identical Code Folding. '--icf=safe' Folds ctors, dtors and functions whose pointers are definitely not taken
It would, but it's harder to trigger. Here it's not safe, because they're public functions and the standard would require `add_v1 != add_v2` (I think).
If you declare them as static, it eliminates the functions and the calls completely: https://aoco.compiler-explorer.com/z/soPqe7eYx
I'm sure it could also perform definition merging like you suggest but I can't think of a way of triggering it at the moment without also triggering their complete elision.
Nope. Functions with external linkage are required to have different addresses. MSVC actually breaks this, which means that you can't reliably compare function pointers on MSVC, because different functions may happen to have the same object code by chance:
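A sketch of the failure mode (hypothetical example; /OPT:ICF is the MSVC linker option that performs the folding):

/* Two distinct functions that happen to compile to identical bytes. */
int go_forward(int pos) { return pos + 1; }
int go_left(int pos)    { return pos + 1; }

int main(void) {
    /* The standard requires this to be 0 for distinct functions, but
       with MSVC's /OPT:ICF the linker may fold the two bodies, and the
       comparison can come out 1. */
    return go_forward == go_left;
}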
Since the pointers to go_forward and go_left will be the same, the gc_info table is less useful than it could be otherwise.
> It probably shouldn't do that if you create a dynamic library that needs a symbol table, but for an ELF binary it could, no?
It can't do that because the program might load a dynamic library that depends on the function (it's perfectly OK for a `.so` to depend on a function from the main executable, for example).
That's one of the reasons why a very cheap optimization is to always use `static` for functions when you can. You're telling the compiler that the function doesn't need to be visible outside the current compilation unit, so the compiler is free to even inline it completely and never produce an actual callable function, if appropriate.
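A minimal sketch of that (hypothetical helper function):

/* With internal linkage the compiler knows every caller, so it can
   inline helper() everywhere and emit no standalone definition. */
static int helper(int x) { return x * 2 + 1; }

int api(int x) { return helper(x) + helper(x + 1); }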
> It can't do that because the program might load a dynamic library that depends on the function
That makes perfect sense, thank you!
And I just realized why I was mistaken. I am using fasm with `format ELF64 executable` to create an ELF file. Looking at it with a hex editor, it has no sections or symbol table, because fasm creates a completely stripped binary.
Learned something :)
Sadly most C++ projects are organized in a way that hampers static functions. To achieve incremental builds, stuff is split into separate source files that are compiled and optimized separately, and only at the final step linked, which requires symbols of course.
I get it though, because carefully structuring your #includes to get a single translation unit is messy, and compile times get too long.
That’s where link-time optimization enters the picture. It’s expensive but tolerable for production builds of small projects and feasible for mid-sized ones.
The MSVC linker has a feature where it will merge byte-for-byte identical functions. It's most noticeable for default constructors: you might get hundreds of functions which all boil down to "zero the first 32 bytes of this type".
A quick google suggests it's called "identical COMDAT folding": https://devblogs.microsoft.com/oldnewthing/20161024-00/?p=94...
Wait, why does GAS use Intel syntax for ARM instead of AT&T? Or something that looks very much like it: the destination is the first operand, not the last, and there is no "%" prefix for the register names?
"The compiler" and "The optimizer" are doing a lot of the heavy lifting here in the argument. I definitely know compilers and optimizers which are not that great. Then again, they are not turning C++ code into ARM instructions.
You absolutely can fool a lot of compilers out there! And I am not only looking at you, NVCC.
But the point should be to follow the optimization cycle: develop, benchmark, evaluate, profile, analyze, optimize. Writing performant code is no joke, and it very often destroys readability and introduces subtle bugs, so before trying to outsmart the compiler, evaluate whether what it produces is already good enough.
One undesirable property of optimizers is that, in principle, they can produce good code one day and not the next.
Today I learned that Matt Godbolt is British!
I want an AI optimization helper that recognizes patterns that could almost be optimized if I gave it a little help, e.g. hints about usage, type, etc.
Better tell me how to make the compiler not fool me!
I'm curious what the theorem-proving magic behind add_v4 is, and whether it happens before the LLVM IR stage.
Is this an argument for compiled code?
It's not really an argument for anything, it's just showing off how cool compilers are!