If you enjoy blog posts like this one, you would definitely like the Julia REPL as well. I used to play with it a lot to discover compiler things.
For example:
$ julia
julia> function f(n)
           total = 0
           for x in 1:n
               total += x
           end
           return total
       end
julia> @code_native f(10)
...
sub x9, x0, #2
mul x10, x8, x9
umulh x8, x8, x9
extr x8, x8, x10, #1
add x8, x8, x0, lsl #1
sub x0, x8, #1
ret
...
It shows this with nice colors right in the REPL.
In the example above, you can see that LLVM figured out the arithmetic series and replaced the loop with a simple multiplication: the closed form n(n+1)/2, computed with a widening multiply (mul/umulh) and a shift of the 128-bit product (extr) instead of iterating.
This, and add_v3 in the OP, falls into the general class of scalar evolution (SCEV) optimizations. LLVM, for example, is able to handle almost all Brainfuck loops in practice (add_v3 indeed corresponds to the Brainfuck loop `[->+<]`), and its SCEV implementation is truly massive: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Anal...
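To make the Brainfuck connection concrete, here is a minimal C sketch of that `[->+<]` loop (a hypothetical example; `p` is assumed to point at the current cell):

/* C equivalent of the Brainfuck loop [->+<]: while the current cell
   is non-zero, decrement it and increment the neighbouring cell. */
void bf_add(unsigned char *p) {
    while (p[0] != 0) {
        p[0] -= 1;
        p[1] += 1;
    }
}

At -O3, SCEV works out that the body runs exactly p[0] times, so clang rewrites this to the equivalent of `p[1] += p[0]; p[0] = 0;` with no loop at all.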
LLVM can do more complex sums, too. See https://kristerw.blogspot.com/2019/04/how-llvm-optimizes-geo...
I always code with the mindset “the compiler is smarter than me.” No need to twist my logic around attempting to squeeze performance out of the processor - write something understandable to humans, let the computer do what computers do.
This is decent advice in general, but it pays off to try to express your logic in a way that is machine friendly. That mostly means thinking carefully about how you organize the data you work with. Optimizers generally don't change data structures or memory layout, but layout can make an orders-of-magnitude difference in the performance of your program, and it is often difficult to refactor later.
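A sketch of what "organize the data" tends to mean in practice (hypothetical example): no optimizer will turn the first layout into the second, yet the second loop touches a fraction of the memory.

#include <stddef.h>

/* Array-of-structs: each mass sits 64 bytes from the next, so the hot
   loop below drags in a whole cache line per element. */
struct particle { float x, y, z, mass; float history[12]; };

float total_mass_aos(const struct particle *p, size_t n) {
    float m = 0.0f;
    for (size_t i = 0; i < n; i++)
        m += p[i].mass;
    return m;
}

/* Struct-of-arrays: the same loop now streams a dense float array,
   which is cache- and SIMD-friendly. */
struct particles { float *x, *y, *z, *mass; };

float total_mass_soa(const struct particles *ps, size_t n) {
    float m = 0.0f;
    for (size_t i = 0; i < n; i++)
        m += ps->mass[i];
    return m;
}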
I find the same. GCC and Clang can inline functions, but they can't decide to break apart a struct used only among those inlined functions, make every struct member a local variable, and then decide that one or more of those locals should be kept in a register for the full lifetime of the function rather than spilled onto the stack.
So if you use a messy solution, where something that should be a struct operated on by functions is instead just a pile of local variables inside a single function, and you use macros operating on those locals instead of inlineable functions operating on structs, you get massively better performance.
e.g. a hypothetical sketch of the pattern:
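#include <stdint.h>

/* The "clean" version: state lives in a struct, stepped by a function
   (names and constants here are made up for illustration). */
typedef struct { uint32_t state; uint32_t carry; } rng_t;

static uint32_t rng_step(rng_t *r) {
    r->state = r->state * 1664525u + r->carry;
    r->carry += 1013904223u;
    return r->state;
}

uint32_t sum_clean(uint32_t n) {
    rng_t r = {1u, 0u};
    uint32_t total = 0u;
    while (n--) total += rng_step(&r);
    return total;
}

/* The "messy" version described above: the struct members become plain
   locals, which the compiler happily keeps in registers for the whole
   function, and a macro stands in for the inlineable function. */
#define RNG_STEP(state, carry) \
    ((state) = (state) * 1664525u + (carry), (carry) += 1013904223u, (state))

uint32_t sum_messy(uint32_t n) {
    uint32_t state = 1u, carry = 0u, total = 0u;
    while (n--) total += RNG_STEP(state, carry);
    return total;
}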
I guess the chances of the compiler doing something smart increase with link-time optimization and with keeping as much as possible inside the same "compilation unit" (in practice, the same source file).
The nice thing about godbolt is that it can show you that clang not only can do it in theory but actually does it in practice:
https://aoco.compiler-explorer.com/#g:!((g:!((g:!((h:codeEdi...
Turning stack-allocated variables into locals (which can then be put in registers) is one of the most important passes in modern compilers.
Since compilers use SSA, where locals are immutable, while lots of languages like C have mutable variables, some compiler frontends put all locals onto the stack and let the compiler figure out what can be promoted into registers and how.
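A tiny sketch of what that promotion deals with (hypothetical example, following the behavior of LLVM's mem2reg pass):

/* `x` is mutated, so a straightforward frontend gives it a stack slot.
   A mem2reg/SROA-style pass rewrites it into immutable SSA values, one
   per assignment, merged with a phi node at the join, and the loads
   and stores disappear. */
int clamp_sum(int a, int b) {
    int x = a + b;  /* first SSA value */
    if (x < 0)
        x = 0;      /* second SSA value */
    return x;       /* phi of the two values above */
}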
To give a more specific example: if you malloc()/free() within a loop, it's unlikely that the compiler will fix that for you. However, moving those calls outside of the loop (plus maybe adding some realloc()s within, only if needed) is probably going to perform better.
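For instance (a minimal sketch with a made-up processing loop):

#include <stdlib.h>
#include <string.h>

/* Naive: one malloc/free per row. */
void process_naive(const char **rows, size_t nrows) {
    for (size_t i = 0; i < nrows; i++) {
        size_t len = strlen(rows[i]);
        char *buf = malloc(len + 1);
        if (!buf) return;
        memcpy(buf, rows[i], len + 1);
        /* ... work on buf ... */
        free(buf);
    }
}

/* Hoisted: allocate once, grow with realloc only when needed. */
void process_hoisted(const char **rows, size_t nrows) {
    size_t cap = 64;
    char *buf = malloc(cap);
    if (!buf) return;
    for (size_t i = 0; i < nrows; i++) {
        size_t len = strlen(rows[i]);
        if (len + 1 > cap) {
            cap = len + 1;
            char *grown = realloc(buf, cap);
            if (!grown) { free(buf); return; }
            buf = grown;
        }
        memcpy(buf, rows[i], len + 1);
        /* ... work on buf ... */
    }
    free(buf);
}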
I would take it one step further: often, trying to eke out performance gains with clever tricks can hurt performance by causing you to "miss the forest for the trees".
I work with CUDA kernels a lot for computer vision. I am able to consistently and significantly improve on the performance of research code without any fancy tricks, just with good software engineering practices.
By organising variables into structs, improving naming, using helper functions, etc... the previously impenetrable code becomes so much clearer and the obvious optimisations reveal themselves.
Not to say there aren't certain tricks / patterns / gotchas / low level hardware realities to keep in mind, of course.
> I always code with the mindset “the compiler is smarter than me.”
Like with people in general, it depends on which compiler/interpreter we're talking about. I'll freely grant that clang is smarter than me, but CPython for sure isn't. :)
More generally, canonicalization goes very far, but no farther than language semantics allows. Not even the notorious "sufficiently smart compiler" with infinite time can figure out what you don't tell it.
To add to this, low-level constraints also make this assumption noisy, no matter how smart the compiler is. In CPython's case, if you do `dis.dis('DAY = 24 * 60 * 60')` you will see that constant folding nicely converts it to `LOAD_CONST 86400`. However, if you try `dis.dis('ATOMS_IN_THE_WORLD = 10**50')` you will get `LOAD_CONST 10`, `LOAD_CONST 50`, `BINARY_OP **`, because CPython's constant folder refuses to produce results beyond a size limit.
> I always code with the mindset “the compiler is smarter than me.”
...I don't know... for instance the MSVC compiler creates this output for the last two 'non-trivial' functions with '/Ox':
add w8,w1,w0
cmp w0,#0
cseleq w0,w1,w8
Even beginner assembly coders on their first day wouldn't write such bullshit :)
A better mindset is "don't trust the compiler for code that's actually performance sensitive".
You shouldn't validate each line of compiler output, but at least for the 'hot areas' in the code base that definitely pays off, because sometimes compilers do really weird shit for no good reason (often because of 'interference' between unrelated optimizer passes) - and often you don't need to dig deep to stumble over weird output like in the example above.
I see the MSVC ARM compiler has not improved much in 20 years. It was pretty odd when we used it in ~2003; we did not trust it at all. I think we had to get 4 or so compiler fixes out of MS for that project, plus 3 or 4 library fixes. The x86 one was pretty solid. We were targeting 4 different CPU platforms at the same time, so we could find things like that decently quickly. Most of the time it was something we did that was weird, but even then we would find them. That one looks like maybe the optimizer back-filled a nop slot?
The fact that compilers are smart isn't an excuse to not think about performance at all. They can't change your program architecture, algorithms, memory access patterns, etc.
You can mostly not think about super low level integer manipulation stuff though.
You say that, but I was able to reduce the code size of some AVR8 stuff I was working on by removing a whole bunch of instructions that zero out registers and then shift a value around. I don't need it to literally shift the top byte 24 bits to the right and zero out the upper 24 bits; I just need it to pass the value in the top 8 bits directly to the next operation.
I agree that most people are not writing hand-tuned avr8 assembly. Most people aren't attempting to do DSP on 8-bit AVRs either.
Also, not all software needs optimization to the bone. Pareto principle, as always: you don't need the best, just good enough. Not every company is Google-scale anyway.
You can fool the optimizer, but you have to work harder to do so:
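For instance (a sketch of the general idea; the thread's original snippet was compiled for armv8-a with clang 21.1.0 at -O3, but this example is a stand-in): accumulate the same series in a double. IEEE floating-point semantics forbid the reassociation the closed form needs, so without -ffast-math the loop survives at -O3.

/* Same arithmetic series as add_v3, but accumulated in a double:
   the compiler cannot legally replace this with n*(n+1)/2, so a
   genuine loop remains in the output. */
double add_fp(int n) {
    double total = 0.0;
    for (int i = 1; i <= n; i++)
        total += i;
    return total;
}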
This post assumes C/C++ style business logic code.
Anything HPC will benefit from thinking about how things map onto hardware (or, in case of SQL, onto data structures).
I think way too few people use profilers. If your code is slow, profiling is the first tool you should reach for. Unfortunately, the state of profiling tools outside of Nsight and Visual Studio (non-Code) is pretty disappointing.
Sometimes you can fool the compiler :-)
See "Example 2: Tricking the compiler" in my blog post about O3 sometimes being slower than O2: https://barish.me/blog/cpp-o3-slower/
Awesome blog post! Thanks to this I found out that you can view what the LLVM optimizer pipeline does, and which pass is actually responsible for which instruction.
It's super cool to see this in practice, and for me it helps build trust that the compiler does the right thing, rather than me trying to micro-optimize my code and peppering inline qualifiers everywhere.
With this one I instead wondered: If there are 4 functions doing exactly the same thing, couldn't the compiler also only generate the code for one of them?
E.g. if in `main` you called two different add functions, couldn't it optimize one of them away completely?
It probably shouldn't do that if you create a dynamic library that needs a symbol table but for an ELF binary it could, no? Why doesn't it do that?
This is not quite what you asked, I think, but GCC is able to remove duplicate functions and variables after code generation via the -fipa-icf options:
> Perform Identical Code Folding for functions (-fipa-icf-functions), read-only variables (-fipa-icf-variables), or both (-fipa-icf). The optimization reduces code size and may disturb unwind stacks by replacing a function by an equivalent one with a different name. The optimization works more effectively with link-time optimization enabled.
In addition, the Gold linker supports a similar feature via `--icf={safe,all}`:
> Identical Code Folding. '--icf=safe' Folds ctors, dtors and functions whose pointers are definitely not taken
It would, but it's harder to trigger. Here it's not safe, because they're public functions and the standard would require `add_v1 != add_v2` (I think).
If you declare them as static, it eliminates the functions and the calls completely: https://aoco.compiler-explorer.com/z/soPqe7eYx
I'm sure it could also perform definition merging like you suggest but I can't think of a way of triggering it at the moment without also triggering their complete elision.
Nope. Functions with external linkage are required to have different addresses. MSVC actually breaks this, which means that you can't reliably compare function pointers on MSVC, because different functions may happen to have the same object code by chance:
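A sketch of the failure mode (hypothetical example; /OPT:ICF is the MSVC linker option that performs the folding):

/* Two distinct functions that happen to compile to identical bytes. */
int go_forward(int pos) { return pos + 1; }
int go_left(int pos)    { return pos + 1; }

int main(void) {
    /* The standard requires this to be 0 for distinct functions, but
       with MSVC's /OPT:ICF the linker may fold the two bodies, and the
       comparison can come out 1. */
    return go_forward == go_left;
}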
Since the pointers to go_forward and go_left will be the same, the gc_info table is less useful than it could be otherwise.
> It probably shouldn't do that if you create a dynamic library that needs a symbol table, but for an ELF binary it could, no?
It can't do that because the program might load a dynamic library that depends on the function (it's perfectly OK for a `.so` to depend on a function from the main executable, for example).
That's one of the reasons why a very cheap optimization is to always use `static` for functions when you can. You're telling the compiler that the function doesn't need to be visible outside the current compilation unit, so the compiler is free to even inline it completely and never produce an actual callable function, if appropriate.
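A minimal sketch of that (hypothetical helper function):

/* With internal linkage the compiler knows every caller, so it can
   inline helper() everywhere and emit no standalone definition. */
static int helper(int x) { return x * 2 + 1; }

int api(int x) { return helper(x) + helper(x + 1); }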
> It can't do that because the program might load a dynamic library that depends on the function
That makes perfect sense, thank you!
And I just realized why I was mistaken. I am using fasm with `format ELF64 executable` to create an ELF file. Looking at it with a hex editor, it has no sections or symbol table, because fasm creates a completely stripped binary.
Learned something :)
Sadly most C++ projects are organized in a way that hampers static functions. To achieve incremental builds, stuff is split into separate source files that are compiled and optimized separately, and only at the final step linked, which requires symbols of course.
I get it though, because carefully structuring your #includes to get a single translation unit is messy, and compile times get too long.
That’s where link-time optimization enters the picture. It’s expensive but tolerable for production builds of small projects and feasible for mid-sized ones.
The MSVC linker has a feature where it will merge byte-for-byte identical functions. It's most noticeable for default constructors: you might get hundreds of functions which all boil down to "zero the first 32 bytes of this type".
A quick google suggests it's called "identical COMDAT folding": https://devblogs.microsoft.com/oldnewthing/20161024-00/?p=94...
Wait, why does GAS use Intel syntax for ARM instead of AT&T? Or something that looks very much like it: the destination is the first operand, not the last, and there is no "%" prefix for the register names?
"The compiler" and "The optimizer" are doing a lot of the heavy lifting here in the argument. I definitely know compilers and optimizers which are not that great. Then again, they are not turning C++ code into ARM instructions.
You absolutely can fool a lot of compilers out there! And I am not only looking at you, NVCC.
But the point should be to follow the optimization cycle: develop, benchmark, evaluate, profile, analyze, optimize. Writing performant code is no joke, and it very often destroys readability and introduces subtle bugs, so before trying to outsmart the compiler, evaluate whether what it produces is already good enough.
One undesirable property of optimizers is that, in principle, they can produce good code one day and not the next.
Today I learned that Matt Godbolt is British!
I want an AI optimization helper that recognizes patterns that could almost be optimized if I gave it a little help, e.g. hints about usage, type, etc.
Better tell me how to make the compiler not fool me!
I'm curious what the theorem-proving magic behind add_v4 is, and whether it happens before the LLVM IR stage.
Is this an argument for compiled code?
It's not really an argument for anything, it's just showing off how cool compilers are!