The acyclic e-graph: Cranelift's mid-end optimizer

62 points by tekknolagi 4 days ago

mbid 4 hours ago

I believe these ideas are much more mature and better explored for code gen, but similar techniques are useful also in the frontend of compilers, in the type checker. There's a blog post [1] by Niko Matsakis where he writes about adding equalities to Datalog so that Rust's trait solver can be encoded in Datalog. Instead of desugaring equality into a special binary predicate to normal Datalog as Niko suggests, it can be also be implemented by keeping track of equality with union-find and then propagating equality through relations, eliminating now-duplicate rows recursively. The resulting system generalizes both Datalog and e-graphs, since the functionality axiom ("if f(x) = y and f(x) = z, then y = z") is a Datalog rule with equality if you phrase it in terms of the graph of functions.

Systems implementing this are egglog [2] (related to egg mentioned in the article) and (self-plug, I'm the author) eqlog [3]. I've written about implementing Hindley-Milner type systems here: [4]. But I suspect that Datalog-based static analysis tools like CodeQL would also benefit from equalities/e-graphs.

[1] https://smallcultfollowing.com/babysteps/blog/2017/01/26/low...

[2] https://github.com/egraphs-good/egglog

[3] https://github.com/eqlog/eqlog

[4] https://www.mbid.me/posts/type-checking-with-eqlog-polymorph...

IainIreland 2 hours ago

This is really cool. Thanks for the write-up, Chris!

I kept waiting for "sea of nodes with CFG" to be shortened to SeaFG, and it never happened. I guess maybe it's ambiguous out loud.

tekknolagi 35 minutes ago

Damn. I wish I thought of that.
cfallin 29 minutes ago

Oh goodness, that name is so good!
(And thanks!)

pjmlp 7 hours ago

> While that kind of flexibility is tempting, it comes with a significant complexity tax as well: it means that reasoning through and implementing classical compiler analyses and transforms is more difficult, at least for existing compiler engineers with their experience, because the IR is so different from the classical data structure (CFG of basic blocks). The V8 team wrote about this difficulty recently as support for their decision to migrate away from a pure Sea-of-Nodes representation.

Note that the Sea of Nodes author, Cliff Click, is the opinion they weren't really using the way they should, and naturally doesn't see a point on their migration decision.

There is a Coffee Compiler Club discussion on the subject.

titzer 6 hours ago

Well it's hard to summarize what I said in the Coffee Compiler club chat in a HN comment, but there were a number of things that went wrong there. I half agree with Cliff and half agree with the V8 blogpost. TurboFan evolved into a very complicated compiler that made a number of things harder on itself that it should have been.
The sea of nodes is just extending SSA renaming on values to both control and effects. Effect dependencies are equivalent SSA renaming of the state of the world, allowing relaxed ordering of effectful operations and more general transforms. That means that GVN and load elimination are the same thing when effect dependencies are explicitly part of the graph.
Making control and effect dependencies explicit is great!
What makes the sea of nodes complicated is relaxing linear control and effects to allow more reorderings. Many optimizations require a more general algorithm (which is sometimes inefficient, but mostly not) and other optimizations can be almost impossible. E.g. reasoning about what happens between two instructions is impossible--there is no such thing, except after scheduling. For most optimizations, the chain of dependencies is enough. Not all. Loop transforms become more complicated, making regions of code that are uninterruptible (e.g. fully initializing an object before it can be see by the GC) is tough, and a few other things.
Overall I would say that TurboFan's main problem was that did not relax effect edges and it tried to introduce speculation too late and tried to that in the sea of nodes representation. It would have been a better design to do some optimizations on a CFG representation prior to the heavy lifting in optimizations that work on the sea of nodes.
One of TurboFan's good architectural decisions was to separate operators from the node representation, so that reasoning could be somewhat independent of how nodes represent dataflow and effects, but it looks like that got junked in favor of the class-based organization (https://github.com/v8/v8/blob/main/src/maglev/maglev-ir.h) which is pure 90s tech lifted straight from C1 and Crankshaft. When I see an IR that's 11K lines in a header, I find it astonishing. Pity, that 11K knot isn't just self-contained, it will replicate itself over and over and over in the compiler and make a big mess in the end.
I think the main part of the V8 blogpost I agree with is that the sea of nodes is difficult to debug, especially for big graphs. I don't see any way around that except a whole crapton of testing, better tools, graph verifiers, etc. There's a learning curve to any compiler, and complex compilers have complex failure modes. Still, I think some people on the V8 team just always hated the sea of nodes and blamed all of their problems on it. It didn't help that all of the senior people who developed expertise with the IR moved on.
- Leszek 2 hours ago
  
  (disclaimer: current V8 team member replying to senior ex-V8 team member)
  I'd say there were a lot more things problematic in practice with SoN than not relaxing effect edges enough - I'd argue that the bigger problem was that a single effect chain was not enough to represent the flexibility that SoN promised, while keeping the costs, and getting that flexibility would mean effectively extending the effect chain to one effect colour per object per field. Maybe this was a JS specific idiosyncrasy, but my experience was that the effect chain became almost homomorphic to the control chain (and when it wasn't, we had bugs), and then you may as well merge the two into a CFG - if you have to skip over links in your effect chain to skip over certain kinds of effect then you can equally well skip over zero effect nodes too. With SoN, we got all the costs and (almost none) of the benefits, hoisting really isn't so difficult that you have to design your whole IR around it.
  As for IR design and TFs good architectural decision, idk, I don't think it's all that different from what we ended up with in maglev. All those classes are just convenience views onto a consistent node layout (with e.g. the same trick as TF of putting inputs behind the node), and so far we haven't had issues with it - time will tell I suppose.
  Overall, this narrative that TF, with it's SoN and other serial decisions, was super clever and built by very smart senior engineers that just all moved on and left behind just us dummies that don't get it -- I've honestly never argued against it. Hell, I can even agree with it, same as I totally believe Cliff when he says that he could easily solve every problem we struggled with (likely by doing it in the scheduler). Tony Stark built one in a cave with a bunch of scraps, but unfortunately I'm not Tony Stark, and we've ended up choosing human comprehension (instead of superhuman) as a design constraint so that us dummies can still work on it after all the senior engineers got promoted away or bored. I think this is a good decision and I stand by it.
  
  titzer 1 hour ago
  
  Well obviously I don't think that you're a bunch of dummies so please don't throw out strawmen like that.
  I don't know if Cliff is on the same page w.r.t. how much speculation is necessary (and when) to make JS go fast. In particular, inserting speculative guards has the nice property of improving downstream information for dominated control flow paths. Dominated control flow paths are few and far between when the control chain is very relaxed (in fact, that's the point of a relaxed dependency representation--the graph doesn't have induced dominance information, just dependencies). So relaxing ordering actually works against making use of speculation decisions, because speculation decisions have all these downstream (forward) benefits. It actually makes a lot of sense to work out what speculations are going to be done and propagate their effects on a fully-scheduled CFG[1].
  The case is not the same in Java. In C2, afaict, all speculation happens at graph build time, which is driven by abstract interpretation of the bytecode, which follows the control flow order. That means it can make use of dominance information.
  AFAIK Graal does scheduling multiple times, whenever it needs to know explicit control information. This allows its speculations to propagate forward to dominated paths.
  > Maybe this was a JS specific idiosyncrasy, but my experience was that the effect chain became almost homomorphic to the control chain (and when it wasn't, we had bugs), and then you may as well merge the two into a CFG - if you have to skip over links in your effect chain to skip over certain kinds of effect then you can equally well skip over zero effect nodes too. With SoN, we got all the costs and (almost none) of the benefits, hoisting really isn't so difficult that you have to design your whole IR around it.
  I think this is the core of the problem. Control and effects are different things and it wasn't until long after TurboFan that I understood this well enough to even articulate it to myself, let alone to others. The only nodes that really need control inputs are those that have write effects or could diverge (not terminate). Reads of mutable state don't need control, they just need to have an effect input to order them w.r.t. writes. Writes don't need to depend on reads, like they did in TF IR; they are anti-dependencies that can be treated differently. From conversations with Cliff, I think C2 does treat antidependencies differently. Truth be told, writes don't even need to have control dependencies if the scheduler replicates code to make sure that different versions of the world are not simultaneously live (i.e. they are affine resources). The control dependencies in TF graphs were basically a CFG embedded in the graph and making writes depend on control more or less achieved that affine-ness.
  While JS has far too many things that can change the world, it's clearly not the case that all the effect chains were linear, because load elimination would never happen, and no code motion would ever be possible. While it's hard to know how you think of "multiple effect chains"--it doesn't have to be represented by a multiple of edges in the graph. You can still have a single effect edge per node in most cases.
  In the end it seemed like some people were really intent on only thinking about optimization from a CFG perspective and chaining a CFG through all effectful things, completely defeating the purpose of a dependency graph. People think about IRs in different ways, sure, but the dependency graph view and mindset is inherent in the sea of nodes representation. "Walking the code forward" is a common mental mode for CFG optimizations but is just alien to sea of nodes.
  [1] To cut to the chase, and hindsight is 20/20, I think the best design for JS optimization is to use a CFG for a lot of frontend and middle optimizations, before lowering, to use it to insert speculations, and then to thread only a minimal effect / control chain to make a sea of nodes graph, run all the optimizations again, and reschedule it to a CFG. TBH I am not sure whether lowering should happen on a CFG or SoN--it does matter, but I don't think there's a definitive answer. But definitely run GVN on SoN after lowering--there's a bazillion common subexpressions to find then. It'd be great if optimizations can be written to be independent of whether a node is hooked up in a CFG or a SoN, to allow reuse between the two different representations. Or if the representation was sufficiently parametric that nodes could be hooked up in either configuration.
  
  Leszek 1 hour ago
  
  The dummies phrasing is my own, and I stand by it - I simply find the CFG way of thinking much easier to reason about than the SoN way, and I find myself falling back into it no matter how hard I try to follow what I totally (abstractly) appreciate is a more mathematically beautiful dependency/anti-dependency graph - I think that's a limitation of my ability to maintain that concept in my mind and that other people can do it better than me. I find it far easier to reason on CFG terms most of the time, and to raise the CFG temporarily into some dependency representation for doing eliminations/hoisting/LICM, than I do maintaining that dependency representation across all phases and accurately reasoning in each of those about what concrete dependency I forgot about when writing the reduction (concrete recent example, I tried to make reading from holes segfault by unmapping the hole, to force potential holey field access to always first compare against the sentinel, but I was foiled by TF since the load had no dependency on the hole compare branch and could be hosted above it). Implying that this was simply "some people" insisting on only thinking one way without trying to think the SoN way, or that it was some sort of CFG prejudice, is the actual strawman here.

j2kun 6 hours ago

I work in an esoteric compiler domain (compilers for fancy cryptography) and we've been eyeing e-graphs for a bit. This article is super helpful seeing how it materialized in a real-world scenario.

An interesting move in this direction is the Tamagoyaki project: https://github.com/jumerckx/Tamagoyaki that supports equality saturation directly in MLIR.

pizlonator 7 hours ago

Compiler writer here.

This post makes it seem like the pass ordering problem is bigger than it really is and then overestimates the extent to which egraphs solve it.

The pass ordering problem isn’t a big deal except maybe in the interaction of GVN and load elimination, but practically, that ends up being a non issue because its natural to make those be the same pass.

Aside from that, pass ordering isn’t a source of sweat for me or my colleagues. Picking the right pass order is fun and easy compared to the real work (designing the IR and writing the passes).

When pass ordering does come to bite you, it’s in a way that egraphs won’t address:

- You need to run some pass over a higher level IR, some other pass over a lower level IR (ie later), and then you discover that the higher level pass loses information needed by the lower level one. That sucks, but egraphs won’t help.

- You might have some super fancy escape analysis and some super fancy type inference that ought to be able to help each other but can only do so to a limited extent because they’re expensive to run repeatedly to fixpoint and even then they can’t achieve optimality. Both of them are abstract interpreters from hell. Too bad so sad - your only full solution is creating an even more hellish abstract interpreter. Egraphs won’t help you.

- Your tail duplication reveals loops so you want to run it before loop optimization. But your tail duplication also destroys loops so you want to run it after loop optimization. No way around this if you want to do fully aggressive taildup. I end up just running it late. I don’t think egraphs will help you here either.

Worth noting that the first problem is sort of theoretical to me, in the sense that I’ve always found some ordering that just works. The second problem happens, but when it does happen, it’s reasonable to have high level and low level versions of the optimizations and just do both (like how the FTL does CSE in three IRs). The last problem is something I still think about.

cfallin 6 hours ago

Hi Fil -- thanks for the comment!
I think we may be playing in slightly different spaces: unlike a JS JIT, Cranelift doesn't have "super fancy escape/type analysis". We're really targeting the core optimizations (GVN, LICM, cprop, RLE, STLF, various algebraic rules) in a fast compile pass. The RLE+GVN interaction was pretty real in our case, as are interactions between the algebraic rewrites and GVN.
You'll note that my main point is that the single fixpoint loop for all of the core rewrites is what we wanted, and what the sea-of-nodes-with-CFG gets us; the egraph (multiple versions of one value) is kind of an aside. One could say: well sure but I could just do a single pass with that fixpoint loop without all the egraph stuff; and, yes, that's what our single rewrite pass is.
- pizlonator 6 hours ago
  
  > I think we may be playing in slightly different spaces: unlike a JS JIT
  My B3 compiler is a direct equivalent of Cranelift.
  Sure, I've also written JS JITs. B3 was initially the backend of a JS JIT, but now does other things too. And I've done a lot of LLVM work.
  Generally, it's unwise to assume you know what another person has or hasn't done. I'm not making that assumption about you. (I don't even know you, and that doesn't matter for the purposes of this discussion.)
  > single fixpoint loop for all of the core rewrites is what we wanted, and what the sea-of-nodes-with-CFG gets us
  It just feels so constraining. Eventually you'll want optimizations that can't be expressed with egraphs, and I'm not convinced that your approach conserves code or complexity even if you are sticking to egraphable optimizations.
  
  cfallin 6 hours ago
  
  OK, cool. I was assuming "escape analysis and type inference" implied a JS JIT -- straight from your comment, no other assumptions intended. But you've got a lot of interesting experience here and thanks for your thoughts.
  All the best!
- titzer 6 hours ago
  
  Thanks for writing the article, btw. I didn't have a chance to go through the whole thing yet.
  Did you have a chance to study Graal's IR? It is a hybrid between sea of nodes and CFG; it can contain some "fixed" nodes that can be wired into basic blocks. It can also be relaxed and have nearly everything floating.
  TurboFan's IR was very close to C2, but it had even more things that could float. E.g. a pure operation could be lowered to a subgraph with internal control. TurboFan's schedule could move entire floating control islands and attach them to the main CFG skeleton, or remove them entirely.
  I'm working on a new IR and I'll be able to share more soon.
  
  cfallin 6 hours ago
  
  Thanks! I haven't studied Graal's IR in detail, no. I'll add it to my reading list...
diamondlovesyou 1 hour ago

> This post makes it seem like the pass ordering problem is bigger than it really is and then overestimates the extent to which egraphs solve it.
It isn't so much for SoTA implementations like LLVM, but it is for HL IRs like those present in MLIR. For LLVM, you're basically always in the same representation and every pass operates in that shared representation. But even then, this is not quite true. For example, SLP in LLVM is one of the last passes because running SLP before most "latency sensitive cleanups" would break most of them.
In particular, HL to LL lowering pipelines suffer very heavily from the ordering concerns.

infogulch 4 hours ago

I remember coming across egg ( https://egraphs-good.github.io/ ) via some virtual conference a few years ago. I'm happy to see this idea land in a real compiler!

0xnadr 4 hours ago

The e-graph approach to optimization is really elegant. Curious how much compile-time overhead it adds vs the optimization wins it gets.

PoignardAzur 3 days ago

> Finally, the most interesting question in my view: [...] does skipping equality saturation take the egraph goodness out of an egraph(-alike)? The most surprising conclusion in all of the data was, for me, that aegraphs (per se) -- multi-value representations -- don't seem to matter.

I'm not super surprised.

As the article points out, a lot of e-graph projects include rules for culling e-nodes or stopping generation after a certain cutoff. That this is considered a perfectly normal thing to do hints that equality saturation isn't really the magic sauce of e-graphs.