CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs

rahen 47 minutes ago

Strictly speaking, this is very domain-specific, and it doesn't unlock any performance that Triton couldn't already reach: eliminating global memory round-trips via epilogue fusion is nothing new. The real takeaway is the design shift toward LLM-driven codegen rather than handcrafted kernels.

LLMs are still bad at low-level hardware optimization but really good at high-level composition. Designing compiler abstractions around a restricted, composable API, so that an LLM can easily glue expert-written blocks together, is a smart move. I suspect this will eventually become the norm for code generators as we move to agentic development.
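
To make the "restricted, composable API" idea concrete, here is a rough pure-Python sketch of what I mean. It is not CODA's or Triton's actual API: every name in it (`compose`, `bias_add`, `relu`, `scale`, `gemm_with_epilogue`) is made up for illustration. The expert-written part is the GEMM loop; the composable part is a chain of tiny per-element epilogue stages applied to each accumulator before it is stored, so the intermediate results never need a separate pass over the output.

```python
from typing import Callable, List, Sequence

Matrix = List[List[float]]
# Hypothetical epilogue signature: (accumulator, row, col) -> value to store.
Epilogue = Callable[[float, int, int], float]


def compose(*stages: Epilogue) -> Epilogue:
    """Chain epilogue stages left to right into one fused epilogue."""
    def fused(acc: float, i: int, j: int) -> float:
        for stage in stages:
            acc = stage(acc, i, j)
        return acc
    return fused


def bias_add(bias: Sequence[float]) -> Epilogue:
    """Add a per-column bias."""
    return lambda acc, i, j: acc + bias[j]


def relu() -> Epilogue:
    """Clamp negatives to zero."""
    return lambda acc, i, j: max(acc, 0.0)


def scale(alpha: float) -> Epilogue:
    """Multiply by a constant."""
    return lambda acc, i, j: acc * alpha


def gemm_with_epilogue(a: Matrix, b: Matrix, epilogue: Epilogue) -> Matrix:
    """Stand-in for the expert-written block: a naive GEMM that applies
    the epilogue to each accumulator right before the single store."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            # Fused epilogue: no intermediate matrix is written and re-read.
            out[i][j] = epilogue(acc, i, j)
    return out


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[1.0, 0.0], [0.0, -1.0]]
result = gemm_with_epilogue(a, b, compose(bias_add([0.5, 0.5]), relu()))
```

The point is that the "glue" is just the `compose(...)` call: that is the only part an LLM would have to write, while the loop nest (or, on a real GPU, the tiled kernel) stays expert-authored.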

sroussey 18 minutes ago

I imagine this is already how AI-driven hardware layout is done.

maxignol 8 minutes ago

"LLMs can successfully author CODA kernels." That might speed up progress in this area, then.