The implementation of the max(A, X) operator described in the article is very clever. Obviously not super practical, since it needs one specific zero-page location allocated for it to work ($8A), but that type of requirement is pretty typical for lovely tricks like this one! And the beauty here is in the trick, not in whether there may be some practical application of it.
But this also raises the question: is there some clever way to use the undocumented SAX instruction, which stores the bitwise AND of the A and X registers to memory, to achieve something similar? There is an old trick to compute min and max using no branches, which is not immediately applicable (https://graphics.stanford.edu/~seander/bithacks.html#Integer...) - but maybe there is some other trick hiding in there?
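For reference, the branchless trick from that link can be sketched in Python with 8-bit masking (this just illustrates the bit hack itself, not a 6502 implementation):

```python
def min_u8(x, y):
    """Branchless min of two unsigned bytes, per the Stanford bithacks page.

    -(x < y) is all-ones (0xFF) when x < y, else 0, so the AND/XOR
    combination selects x or y without any branch.
    """
    mask = (-(x < y)) & 0xFF
    return y ^ ((x ^ y) & mask)

def max_u8(x, y):
    """Branchless max of two unsigned bytes, same masking idea."""
    mask = (-(x < y)) & 0xFF
    return x ^ ((x ^ y) & mask)
```

The awkward part for the 6502 is materializing that mask without a branch, which is presumably why the trick is "not immediately applicable".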
Very cool!
I did something related in the past: https://github.com/RussellSprouts/6502-enumerator. It uses C++ templates to share an emulator implementation between z3-powered symbolic execution and actual execution. It was meant to find equivalence between random instruction sequences for peephole optimization, rather than optimizing a specific input sequence.
Shadow instructions are very interesting and cursed. I've seen them used in very limited circumstances for NOP-slides for timing code: https://www.pagetable.com/?p=669. It would be fun to see it applied in otherwise normal code. My enumerator wouldn't support this -- it didn't execute actual 6502 instructions from bytes -- it had its own internal representation for `the first arbitrary absolute pointer` or `the second arbitrary immediate constant`. These would either be initialized with random concrete values or z3 variables.
Your project was very much something I looked into when designing this! Fun to see you commenting.
But yes, different goals. I did look into using z3, but quickly found that it's pretty slow compared to just checking whether a test case passes when run through the candidate program.
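A rough sketch of that cheap-check-first trade-off (hypothetical interface; the real project's internals differ):

```python
import random

def passes_tests(run_candidate, reference, n_cases=256):
    """Cheap rejection filter: run the candidate on random inputs and bail
    out on the first mismatch. Only survivors would then be handed to an
    expensive full-equivalence check (e.g. z3, or an exhaustive sweep).
    """
    for _ in range(n_cases):
        a, x = random.randrange(256), random.randrange(256)
        if run_candidate(a, x) != reference(a, x):
            return False
    return True
```

Since almost all random candidate programs fail on the very first input, the average cost per rejected candidate stays tiny, which is hard for a solver-based check to beat.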
Interesting project and well written. That only made me miss some links to prior art more though.
IIRC, there was a superoptimizer for the M68k (I believe the term was coined and motivated in that article) in the late eighties. https://dl.acm.org/doi/pdf/10.1145/36206.36194 might have been that.
If you assume that A * 10 isn't going to overflow, so that ASL A moves 0 into the carry flag (so no need for CLC), then instead of using the undocumented RRA opcode, you can just do:
sta $00    ; stash A
asl a      ; A*2 (no overflow, so carry stays clear)
asl a      ; A*4
adc $00    ; A*4 + A = A*5 (carry still clear)
asl a      ; A*10
This is also 7 bytes, but is faster since adc $00 is 3 cycles, vs rra $00 being 5 cycles.
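The sequence is easy to sanity-check by simulating the byte arithmetic (a quick Python check, not from the article):

```python
def times_ten(a):
    """Simulate sta/asl/asl/adc/asl on an 8-bit accumulator.

    Assumes a*10 <= 255, so every ASL shifts out a 0 and the carry
    stays clear going into the ADC (matching the no-CLC assumption).
    """
    mem = a                    # sta $00
    acc = (a << 1) & 0xFF      # asl a  -> 2a
    acc = (acc << 1) & 0xFF    # asl a  -> 4a
    acc = (acc + mem) & 0xFF   # adc $00 (carry clear) -> 5a
    acc = (acc << 1) & 0xFF    # asl a  -> 10a
    return acc

# Holds for every input where a*10 fits in a byte (a <= 25):
assert all(times_ten(a) == a * 10 for a in range(26))
```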
The A = max(A, X) example is certainly interesting, but not very useful since it loops through the code twice (very slow) and assumes that $8a is available. The much faster obvious version only adds one byte:
stx $00    ; stash X
cmp $00    ; compare A with X; C set if A >= X
bcs done   ; A >= X: keep A
txa        ; A < X: take X instead
done:
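The branchy version's semantics can be verified exhaustively in a few lines (a quick Python model of the snippet, assuming unsigned comparison):

```python
def max_snippet(a, x):
    """Model of stx/cmp/bcs/txa: CMP sets carry when A >= operand
    (unsigned), so BCS skips the TXA and keeps A; otherwise TXA
    copies X into A."""
    mem = x              # stx $00
    carry = a >= mem     # cmp $00
    if not carry:        # bcs done (not taken)
        a = x            # txa
    return a             # done:

# All 65536 (A, X) pairs produce max(A, X):
assert all(max_snippet(a, x) == max(a, x)
           for a in range(256) for x in range(256))
```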
Sure. Note that I picked those examples to demonstrate the two fairly quirky classes of things the optimizer tends to find. If the programmer has different requirements they can specify that, and it'll spit out the examples you gave (or something equivalent).
I like the idea of exhaustive search, which the simplicity of the 6502 seems ideally suited for, but the search speed seems a bit limiting. I wonder if there isn't potential for more restricted generation (e.g. code may only use a specific N bytes of zero page) plus heavy search pruning to speed it up? If it could generate optimal 20-30 op sequences in semi-reasonable time, that'd make it very useful.
Having it only use operations that use a specific set of zero-page addresses is already supported, yep!
20-30 ops is probably impossible, unfortunately. The combinatorial explosion is just too enormous.
e-graphs have been repeatedly reinvented across decades for many purposes. One of them is superoptimization.
https://en.wikipedia.org/wiki/E-graph
https://www.cs.cornell.edu/courses/cs6120/2025sp/blog/supero...
https://github.com/philzook58/awesome-egraphs
I did some manual golfing with nand2tetris assembly and developed similar hacks to the max() implementation, where one appropriates an arbitrary, conveniently placed, memory address.
After reading the article, though, I feel like I definitely need a superoptimiser, to see what could be improved :)
Reminds me a bit of https://pubby.games/codegen.html, just that its approach seems way more refined and useful.
"Although NESFab performance is good, writing assembly by hand can obviously surpass it"
These projects obviously have different goals. DeiMOS is for people already writing assembler who want optimal assembler.
NESFab is a compiler - apparently a very good one (hard to do for 6502), but nonetheless not directly competing with people hand writing assembler and looking to save every byte and/or cycle.
That’s incredibly clever and a fun read. Well done!
I imagine lots of demo coders glancing back and forth between that writeup and their own carefully hand-tuned assembly.
Thank you!
Demo coding is indeed the primary use case for this, and the reason why I started tinkering on it in the first place. That, and people who make homebrew NES / C64 video games should find it fairly useful for optimizing tight loops and such.
Interesting and fun read - we are well into the terrain of what was completely impossible to do back then. Now I can't wait to see a faster AppleSoft ROM ;-)
Just wanted to mention genetic algorithms (GAs), popularized by John Koza and others.
The post uses a 4 instruction program as an example having about 256^4 or 4 billion combinations. Most interesting programs are 10, 100, 1000+ instructions long, which is too large of a search space to explore by brute force.
So GAs use a number of tricks to investigate the search space via hill climbing without getting stuck at local optima. They do that by treating the search space as a bit string, then randomly flipping bits (mutation) or swapping bits (sexual reproduction) to hop to related hills in the search space. Then the bit string is converted back to instructions and tested to see if it performs the desired algorithm.
The bit string usually encodes the tree form of a Lisp program to minimize syntax. We can think of it as if every token is encoded in bits (like Huffman encoding, inspired by Morse code). For example, the tokens in a (+ 1 2) expression might have the encodings 00, 01 and 10, so the bit string would be 000110, and we can quickly explore all 2^3 = 8 permutations (2^6 = 64 if we naively manipulate an uncompressed bit string whose encoded token sizes vary).
Note that many of the bit strings like (+ + 1) or (2 1 +) don't run. So guard rails can be added to reduce the search space, for example by breaking out early when bit strings throw a compiler exception, or using SAT solvers or caching to weed out nonviable bit strings.
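As a toy illustration of the mutate-and-select loop described above (a generic bit-string hill climber, not tied to 6502 code generation; the fitness function here is a stand-in):

```python
import random

def evolve(fitness, n_bits=16, pop_size=32, generations=200, seed=1):
    """Minimal genetic algorithm over bit strings: truncation selection,
    one-point crossover, and per-bit mutation."""
    rng = random.Random(seed)
    pop = [[rng.randrange(2) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]         # keep the fitter half
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_bits)       # one-point crossover
            child = a[:cut] + b[cut:]
            for i in range(n_bits):              # occasional bit flips
                if rng.random() < 1 / n_bits:
                    child[i] ^= 1
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

# Stand-in fitness: count of 1 bits ("one-max"). A real superoptimizer
# would instead decode the bit string into instructions and score how
# many test cases the resulting program passes.
best = evolve(fitness=sum)
```

The guard rails the comment mentions would slot into the fitness function: bit strings that fail to decode or crash simply score zero and get selected away.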
We could build a superoptimizer with GAs, then transpile between MOS 6502 assembly and Lisp (or even run the MOS 6502 assembly directly in a sandbox) and not have to know anything about how the processor works. To me, this is the real beauty of GAs, because they allow us to solve problems without training, at the cost of efficiency.
I don't think that LLMs transpile to Lisp when they're designing algorithms. So it's interesting that they can achieve high complexity and high efficiency via training, without even having verification built-in. Although LLMs trained on trillions of parameters running on teraflops GPUs with GBs of memory may or may not be viewed as "efficient".
I suspect that someday GAs may be incorporated into backpropagation to drastically reduce learning time by finding close approximations to the matrix math of gradient descent. GAs were just starting to be used to pseudorandomly produce the initial weights of neural nets around 2000 when I first learned about them.
Also quantum computing (QC) could perform certain matrix math in a fraction of the time, or even preemptively filter out bit strings which aren't runnable. I suspect that AI will get an efficiency boost around 2030 when QC goes mainstream. Which will probably lead us to a final candidate learning algorithm that explains how quantum uncertainty and emergent behavior allow a physical mind to tune into consciousness and feel self-aware, but I digress.
Because modern compilers don't do any of this, and we aren't accustomed to multicore computing, from a sheer number-of-transistors perspective we're only getting a tiny fraction of the computing power that we might otherwise have if we designed chips from scratch using modern techniques. This is why I often say that computers today run thousands of times slower than they should for their transistor budgets.
Great