It's funny, because that task is very diverse. Any LLM will use the codebase given as a template (at least in free-tier models).
My software as a contract of behaviors works like a program bench (I even cross-tested buildouts). I made an entire corpus layout for multi-agent, multi-platform builds to be compared, and even went ahead and ran 50 contracts as an example. It honestly showed improvable areas, and distinct differences between model code.
{contract_name}/
└── submissions/
    └── {date}_{os}_{agent}_{model}_{stack}/
        ├── {contract}.osc.md
        ├── osc.osc.md
        └── results/
            └── {contract}.snapshot.json

That's it: compare against the same contract, or find a new contract to compare with. Lots of signed/hash-pinned files are all you need to reproduce software from nothing with an LLM.
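The `.snapshot.json` schema isn't spelled out above, but the hash-pinned reproducibility idea can be sketched in a few lines of Python. The `"files"` key and the shape of the snapshot dict are my assumptions for illustration, not the actual `{contract}.snapshot.json` format:

```python
import hashlib
from pathlib import Path

def sha256_file(path: Path) -> str:
    """Hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def verify_snapshot(submission_dir: Path, snapshot: dict) -> list:
    """Return the files whose current hashes no longer match the pinned ones.

    `snapshot` is assumed to look like {"files": {"rel/path": "<sha256 hex>"}};
    that schema is illustrative, not the real snapshot format.
    """
    mismatches = []
    for rel_path, pinned in snapshot["files"].items():
        if sha256_file(submission_dir / rel_path) != pinned:
            mismatches.append(rel_path)
    return mismatches
```

An empty result means the submission still matches its pins; anything else means the build can't be trusted as a reproduction.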
Programbench is close to that (they have a nice paper/article here). But I don't like the approach used: having software to start with is not a benchmark of writing code but of reverse engineering.
github/s1ugh34d/osc
I am not surprised but this one sticks out...
> Models favor monolithic, single-file implementations that diverge sharply from human-written code.
Well, all of our code is monolithic, with some files close to 20K lines of code, and we do use coding agents (not for the original code, but as of late). I've always had the hunch that splitting everything into tiny files does not improve AI coding-agent performance, although that feels counterintuitive given model context constraints.
To me the important parts of a program should be clustered together so the implementation is obvious. Scattering the implementation in various files all over the source tree does not help much building the mental model.
That also closely matches how software used to be written in the past.
Kinda surprising to me, since I had some trouble with Cursor & Co. once a file went over ~800 lines. It repeatedly failed to edit it until I split it up into multiple logical components. As it should have been from the beginning...
Though, it was some time ago, so things might have improved?
In VSCode, basically any model can edit the 20K-line file without any issues. The coding harness does not read the entire file at once, though; it reads chunks of it, so the size does not really matter. What matters is how close together the things the agent needs to make the edit are.
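For illustration, a chunked read like the one described might look like this. The function name and the 1-indexed line convention are mine; real harnesses each have their own tool shape:

```python
def read_chunk(path: str, start: int, end: int) -> str:
    """Stream lines start..end (1-indexed, inclusive) of a file without
    loading the whole thing into memory -- roughly how an agent harness
    serves a window of a large source file."""
    out = []
    with open(path) as f:
        for i, line in enumerate(f, start=1):
            if i > end:
                break  # stop early; the rest of a 20K-line file is never read
            if i >= start:
                out.append(line.rstrip("\n"))
    return "\n".join(out)
```

From the agent's point of view, a 20K-line file is just many such windows; only the window containing the edit target has to land in context.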
Yeah, that was my experience with Grok, whenever I gave it a file with over 400 lines it would just fail to comprehend it or be too lazy to write too much at a time. Splitting stuff up into separate files helped.
> Scattering the implementation in various files all over the source tree
If you treat the source tree seriously, you can communicate a lot through how it is structured.
Well, you can communicate organisational structure, but not logic or intent. The directory is a tree and the code is a graph.
You can glean some information from a company's org chart, but it does not really tell you much about how the company works.
Arguably, a coding agent is less concerned with where the files are than with the code itself.
> Scattering the implementation in various files all over the source tree does not help much building the mental model.
Yeah, that happens where I work and I hate it. A combination of lint rules and AI reviewer prompts complain about long files and long functions. This means that something which could be a 300-line, self-contained function readable linearly gets split up into 6 functions across 6 files.
It's the illusion of "clean code". If you're casually skimming the code, you feel good. But as soon as you go beyond the surface level it becomes annoying.
This is a big frustration for web code, what with HTML, CSS, JS, and PHP all spread about.
https://htmx.org/essays/locality-of-behaviour/ is a good fight back as exemplified in many stacks, eg https://harcstack.org
> Models favor monolithic, single-file implementations that diverge sharply from human-written code.
This isn't the case if models are prompted to actually plan the file architecture beforehand, it's only the case if they're given a dumb monolithic "code this thing" prompt.
It's interesting that Figure 4 shows that Sonnet and Opus have a very clear distinct curve from all other models, even from GPT 5.4. Anthropic superiority I guess.
It’s unfortunate that they didn’t eval using subagents/orchestration for such a complex set of tasks (from what I can tell), e.g. analyze program to produce initial spec -> code -> review, and rinse & repeat, with each of those steps allocated to a separate subagent.
I would be interested to see if there’s a significant quantifiable difference.
This might actually be the whole value prop of this benchmark. Forget their initial scores, take open models (so we can be sure the base doesn't change), and test different combinations of harness + prompts + strategies + whatever memthing is popular today. See if the scores improve. Repeat.
In before "but they did not use my agent swarm"
It’s the annoying thing about AI. If it works, the AI is magic. If it doesn’t work, you’re using it wrong.
So, would you change your view if someone else runs this bench w/ a different harness and gets better results?
It was the same thing with OOP, TDD, agile development, C, C++, Rust, ORMs...
Whenever something impacts a ton of people you will get some who gain a lot from it and some who don't, and they're generally unable to relate to the other side.
Maybe the thing works in some domain and not the other. Maybe the two groups are doing different things. Maybe the context around it is different. Maybe they have a different definition of "better".
I think it helps to keep an open mind and not grow attached to either position, but rather inquire, "well we did X with outcome Y, what did you do instead?"
In science N=1 is statistically insignificant. In business it might mean that you have a product.
RE: monolithic, single-file implementations
We have a lint that caps source code files at 650 LOC and it works really well.
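As a rough sketch of such a check (the 650 cap comes from the comment above; the `*.py` glob and the function name are illustrative, not the actual lint):

```python
from pathlib import Path

MAX_LOC = 650  # the cap mentioned above

def oversized_files(root: str, pattern: str = "*.py", cap: int = MAX_LOC):
    """Yield (path, line_count) for source files exceeding the cap,
    the way a simple LOC lint would flag them."""
    for path in sorted(Path(root).rglob(pattern)):
        loc = sum(1 for _ in path.open())
        if loc > cap:
            yield str(path), loc
```

Wired into CI, a nonzero number of offenders fails the build, which is the nudge that keeps files under the cap.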
How long until AI is not even writing code but producing machine code?
Think about it, all these compilers, tooling, what a waste!
I imagine a future where chipset makers will provide a model you can just prompt to "act upon that chipset" and voila, "You're absolutely right! Here is your binary."
We won't be developers, we won't be devops, we'll be rollmops! /s
Coding agents can write ASM. But if you mean writing the actual byte-code, that will require a very different approach at a very different level of abstraction, one LLMs are not designed for. Keep in mind that all LLMs are trained first on text and then fine-tuned on code.
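To make the abstraction gap concrete, here is a minimal Linux/x86-64 sketch (my own example, not from the thread) of what "producing machine code directly" means: raw bytes dropped into an executable page, with no assembler or compiler involved. Hardened kernels that forbid writable-and-executable pages will refuse the mmap.

```python
import ctypes
import mmap

# Hand-written x86-64 machine code for `int add(int a, int b)`:
#   8d 04 37   lea eax, [rdi + rsi]   ; System V ABI: a in edi, b in esi
#   c3         ret
CODE = bytes([0x8D, 0x04, 0x37, 0xC3])

# Map one read/write/execute page and copy the bytes in.
buf = mmap.mmap(-1, mmap.PAGESIZE,
                prot=mmap.PROT_READ | mmap.PROT_WRITE | mmap.PROT_EXEC)
buf.write(CODE)

# Treat the page as a C function pointer.
ADD = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.c_int, ctypes.c_int)
add = ADD(ctypes.addressof(ctypes.c_char.from_buffer(buf)))
```

A model emitting those four bytes gets no compiler diagnostics, no symbol names, and almost no training text to pattern-match against, which is the gap being pointed at above.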
Good point! Long live ASM! Wasm everything!!1 /jk
My hunch is that it would take years of hundreds of thousands of developers working with machine code, posting Stack Overflow questions with machine code, and publishing GitHub repos written in it with documentation. That's all the free labor LLMs leveraged to use high-level langs.
>We won't be developers, we won't be devops, we'll be modelops! /s
I can still see this happening with higher-level langs. The thing is, the compiler is not replaced in the training data; more likely, LLMs will give rise to semideterministic layers on top of compilers.
I could see Nvidia achieving this first, given how nice the devex is with CUDA.
I heard they are already proficient at assembly languages.
They are - probably more proficient than with some high-level languages. I've used it for embedded stuff, including TI sitara PRU assembly, with great results. Frontier models can also easily "learn" directly from the manuals; asm is quite easy for them to pick up due to its "flat" (non-structured) nature.