The last six months in LLMs in five minutes

149 points by yakkomajuri 4 hours ago

LZ_Khan 26 minutes ago

I'm curious how the 6 months have looked from a non-programmer's perspective. What kind of co-working tools and similar optimizations have people from other fields experienced?

conception 13 minutes ago

Claude in Office was a tipping point for nontechnical folks around me. Everyone’s slide decks are immaculate now. Finance isn’t needing nearly as much BI help. It’s pretty impressive.
angled 13 minutes ago

In business: using coworking tools to review and propose filing of emails; manage my files and folders; on a daily basis scour the intranet for interesting and relevant content.
Personal: my wife tutors in her native language to non-native primary and high school kids. They are all using these tools now to generate fresh content for practice based on school lesson plans. The kids are improving much more quickly now than they were just a few months ago.
Antibabelic 12 minutes ago

My day job is not in the tech industry. I am an editor. Literally nothing has changed for me in the last four years.

tptacek 47 minutes ago

If you're a vulnerability researcher or a security person generally, there's a big inflection point from Spring of this year.

thierrydamiba 19 minutes ago

Can you be more specific?
- simonw 9 minutes ago
  
  The Claude Mythos / Project Glasswing thing is real: https://www.anthropic.com/glasswing
  I met a few people at PyCon this week who have been part of Glasswing (they're just starting to be allowed to talk about it) and it really does drive down the cost of finding vulnerabilities.
  I've been collecting notes on that here: https://simonwillison.net/tags/ai-security-research/
- baq 8 minutes ago
  
  Not OP, but just look at HN posts in the last couple of weeks: supply chain worms, zero-day LPEs for all OSes seemingly every other day, researchers on X and here openly saying they’ve got more valid findings than they know what to do with
jxmesth 10 minutes ago

I'm a security person and would love to hear other people's input here as I don't have that much experience with this
nickvec 9 minutes ago

Are you referring to Claude Mythos?

Insanity 1 hour ago

I wonder how much the 'inflection point' is a thing vs marketing. I'm sure the models got somewhat better, but even now when I'm trying to 'vibe code' a game with the latest models (combination of Codex w/ gpt5.5 and gpt5.3-codex), they really do struggle.

They definitely get something barebones up and running, but it's far from a fully fledged application.

adgjlsfhk1 1 hour ago

It's very real. Just in the past 2 months or so IMO there's been a pretty big improvement in Claude for local dev (although I think a lot of that is less model strength and more harness capability). 1m context is a huge difference (~30 min vs 2.5hr between compact significantly increases the scope of what I get the AI to do before it goes stupid). The other biggest difference I've noticed is a better balance of actually doing the work vs pushing back on bad ideas. I want the AI to tell me if it thinks the thing I am telling it is wrong or a bad idea, but if I confirm, I want it to do that anyway. A couple of months ago, Claude was a lot more likely to either say "This is too much work I'm not going to do all of it", tell me the idea was genius (and then pretend to do it), or something equally useless.
- DeathArrow 58 minutes ago
  
  >1m context is a huge difference (~30 min vs 2.5hr between compact significantly increases the scope of what I get the AI to do before it goes stupid)
  I think the smart zone stays within the first 100k tokens, no matter if the context window is 240k or 1 million.
  I divide the work to fit within that 100k and use subagents for the tasks.
  
  danielbln 13 minutes ago
  
  In my experience it's more like 400-500k tokens.
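A minimal sketch of the "divide the work to fit within a token budget" idea from this subthread, assuming each task comes with a rough token estimate. The `Task` type, the `pack_tasks` helper, and the estimates are hypothetical illustrations, not part of any real harness; the budget is a parameter because commenters here disagree on whether the useful range is about 100k or 400-500k tokens.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    est_tokens: int  # hypothetical estimate supplied by the caller


def pack_tasks(tasks, budget=100_000):
    """Greedily group tasks, in order, into batches that stay within the budget.

    Each batch is meant to be handed to one subagent with a fresh context.
    """
    batches, current, used = [], [], 0
    for task in tasks:
        if task.est_tokens > budget:
            raise ValueError(f"{task.name} alone exceeds the budget; split it further")
        if used + task.est_tokens > budget:
            # The next task would overflow this batch, so close it and start a new one.
            batches.append(current)
            current, used = [], 0
        current.append(task)
        used += task.est_tokens
    if current:
        batches.append(current)
    return batches


plan = [Task("a", 60_000), Task("b", 50_000), Task("c", 40_000), Task("d", 100_000)]
batches = pack_tasks(plan)
batch_names = [[t.name for t in batch] for batch in batches]
```

Raising the budget to 500_000 would put all four example tasks in a single batch, which is the practical difference between the two estimates given above.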
minimaxir 1 hour ago

Opus 4.5 in November 2025 was legitimately, unironically an inflection point and is the sole reason for the current hysteria.
GPT 5.5 is a significant improvement over GPT 5.4 but I wouldn't call it an inflection.
- baq 4 minutes ago
  
  5.2 and the first codex model were step function changes in capability
xbmcuser 1 hour ago

It's real for me as a non-coder. Previously, uploading a Python script and asking it to add this function or that function used to break it; now it usually just works, at least with Claude and ChatGPT models. Google Gemini still breaks stuff, but rumors are that their new Flash model, which will be announced soon, is very good. I am usually working with data in CSV files and generating spreadsheets, PDFs, etc., and the results for that have improved dramatically.
- Scoundreller 13 minutes ago
  
  That’s me. Built a scraper to dump stuff to a csv of a list of images for further OCR and OpenCV processing. Now I have a convenient list of hits once I run the batch that used to be a loooot of manual sifting.
  Once I work out the kinks, I’ll be able to further automate it.
  Would have taken 10-100x as long for me to build it without AI and the AI version is probably better.
  But yeah, I have enough knowledge to know what prompts are needed and to figure out those “oh, I think it’s running slow or failing because of xyz” cases, and then prompt further to improve it based on what I think it should do instead.
  And I know where to make slight changes without burning my allotments.
bluegatty 1 hour ago

Paradox - you can get multiple inflection points even as systems start to have diminishing marginal returns in core capability. I think this is due to 'threshold crossing', where something 'becomes good enough for a specific purpose' - it just unlocks capabilities.
'Nail Guns' used to be heavy, required heavy power cords, and were extremely expensive. When they got lighter, cheaper, battery-powered ... at some point, they blended seamlessly into the roofer's process and dramatically multiplied the work that can be done. Marginal improvements beyond that may not yield the same 'unlocks' because the threshold has been crossed.
halflife 1 hour ago

I feel the change. It went from an autocomplete tool, to an agent running 5 tasks in parallel while I just supervise. The improvement is enormous.
kvakkefly 1 hour ago

I remember this very clearly myself. Before opus 4.5, I was doing a lot of hand holding and was coding a lot myself, but I have not written code since that day more or less.
I did write some stuff myself just to learn how the Enigma encryption machine worked. But professionally, I stopped coding in November.
- viccis 45 minutes ago
  
  How do you justify your salary given that you're just using a tool that any of us could use for $20 an hour in your role?
  
  aspenmartin 43 minutes ago
  
  Someone competent using them is a requirement today, and for a while that will make the marginal utility of skilled workers greater than that of unskilled ones. The justification is that they are much more productive than they were before.
  
  bsder 32 minutes ago
  
  Because the tool will happily give you a "solution" that kinda works for a few inputs. It will happily correct itself when you give it more incorrect tests.
  It will almost never converge on the general solution that will pass tests you haven't given it yet.
  This is why AI is sooo good at Javascript and related slop. A solution that "kinda works" is good enough 9 times out of 10 and if some tests fail well ... YOLO and the web page will probably render anyway.
  Contrast that to using Scheme or Lisp where AI will have trouble simply keeping the parentheses balanced.
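A small sketch of the failure mode described above, where a solution passes the tests it was shown but not the ones it has not seen. `overfit_sort` is a deliberately contrived, hypothetical stand-in for a "kinda works" solution; the held-out check compares it against Python's built-in `sorted` on seeded random inputs.

```python
import random


def overfit_sort(xs):
    # Hypothetical "kinda works" solution: it only handles the cases it was shown
    # and returns everything else unchanged.
    known = {(3, 1, 2): [1, 2, 3], (2, 1): [1, 2], (): []}
    return known.get(tuple(xs), list(xs))


def passes_given_tests(fn):
    # The handful of examples the solution was iterated against.
    return fn([3, 1, 2]) == [1, 2, 3] and fn([2, 1]) == [1, 2] and fn([]) == []


def passes_held_out(fn, trials=200, seed=0):
    # Inputs the solution has never seen, checked against a trusted reference.
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-50, 50) for _ in range(rng.randint(0, 8))]
        if fn(xs) != sorted(xs):
            return False
    return True


given_ok = passes_given_tests(overfit_sort)
held_out_ok = passes_held_out(overfit_sort)
```

Here `given_ok` is true while `held_out_ok` is false, which is the gap between "passes the tests I gave it" and "is the general solution".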
  
  FeepingCreature 5 minutes ago
  
  To be fair, take away a human's paren highlighting and see how well they do.
  
  musebox35 31 minutes ago
  
  Please see Ben Evans’ podcast for a good take on this. Coding is just one of the tasks you do in your job; it is not the job, or at least it probably is not. You do not get paid to code, you get paid to make a set of decisions that create value for the company. If this is automated then yes, sadly, your salary is not justified.
  
  rafaelmn 28 minutes ago
  
  How do you justify your salary given that you're just using a free OSS compiler/editor any of us could use for free in your role?
  AI just changed how I edit code - I still see coworkers (senior developers) failing with Claude/Codex and getting stuck when there are trivial solutions if you understand the full problem space. Right now AI is just a productivity tool.
  
  yieldcrv 20 minutes ago
  
  no engineers on staff and stakeholders think the company is incompetent
  Coinbase is paying the price for that for every UX glitch, after the CEO was gleeful about HR personnel shipping production code
  
  altmanaltman 5 minutes ago
  
  They're using a tool that anyone can use for $20 an hour, sure. But that's not what they're "just" doing. This is what is so insane about non-technical people talking about code - writing the actual syntax is not really the hard part.
  What you're saying is like "how do you justify your salary as a NASA engineer when anyone can use Simulink and generate the code?"
  It is extremely ignorant.
DeathArrow 1 hour ago

Pure vibe coding won't work. You need to define an excellent architecture, have great specs and a solid plan, divide the plan into small phases that fit well in a context window, use TDD and automated code reviews for implementing each phase, and do QA and some code review.
At every point you need to have agents review, verify and test the other agents' output and iterate until the output is perfect.
And also, have good e2e tests.
IMO, if you don't spend at least a few tens of millions of tokens per day, you aren't doing it properly.

pr337h4m 9 minutes ago

Something that’s largely been ignored: DeepSeek has made context caching virtually free with V4-Flash.

shepherdjerred 2 hours ago

> and there’s zero chance any AI lab would train a model for such a ridiculous task.

I'm not sure that's true anymore considering how popular Simon's blog is

nickvec 1 hour ago

Simon mentions further along in his article that, given Jeff Dean’s post referencing the pelican-riding-a-bike task (and how good current models are at doing it), it’s no longer a great benchmark to use. Enter the opossum riding an e-scooter!
_puk 1 hour ago

> So maybe the AI labs have been paying attention after all!
> I think this mainly demonstrates that the pelican on the bicycle has firmly exceeded its limits as a useful benchmark.
As acknowledged in the article.
- kzrdude 46 minutes ago
  
  Gemini 3.1 basically takes it home on that benchmark, anyway, it's done.
simonw 1 hour ago

That bit probably works better in the talk, it was a setup for a joke later on.

zarzavat 2 hours ago

Somewhere right now some human artist is being tasked with drawing illustrations of pelicans riding bicycles to be used as training data at a big AI lab.

minimaxir 2 hours ago

Every modern image-generation model can generate a pelican on a bicycle trivially. The point of the test is to generate SVG text that represents an image, which is more complicated.
Yes, there are ways to convert raster images to SVG for use in training data but it's not a good use of anyone's time.
- jofzar 1 hour ago
  
  I wouldn't wish creating a svg pelican on a bicycle on my worst enemy
- Mashimo 16 minutes ago
  
  > Every modern image-generation model can generate a pelican on a bicycle trivially.
  Mistral seems to be the exception. Their new model from a few weeks ago is worse than self-hosted Gemma.
energy123 43 minutes ago

The quality of the Gemini pelican was such a step change in one iteration, while the other benchmarks remained quite flat, that I think you are right. Although whether they targeted Pelicans in particular or just svg, I can't say.

throwaway2027 2 hours ago

December 2025 was the breakthrough for me. January Claude was euphoric, ChatGPT was up there. February Gemini cooked for a second there. March amazing. April the big bad nerf. May GPT 5.5 is just pure bliss, although with 2x limits temporarily. Not sure about Claude; it's sort of okay, still not as good as it felt before, slowly increasing limits with more compute and rebuilding goodwill.

_puk 1 hour ago

I think Opus 4.6 at its peak was the "how can anyone not get that this is good" for me.
Then the nerf, and the massive uplift in tokens for 4.7, a model which I find lazy and prone to hallucinate.
It's probably time to try GPT5.5. Like many I'm pretty heavily invested in the anthropic ecosystem at this point, which I suppose gives another strong reason to make the switch.
dmpk2k 40 minutes ago

I find your emotional language truly quite fascinating. I've heard people talk like that about drugs.

rTX5CMRXIfFG 1 hour ago

Am I crazy, or are these differences between the best models so marginal that you’d get roughly the same performance if you use the same high-quality harness (ie preloaded instructions from md files, including custom skills)?

Sparkyte 1 hour ago

No you're not wrong. Many people will see what you see. Enthusiasts will see it as monumental squeezing out that last drop of performance. In my opinion I think it is okay for enthusiasts to feel that way. I'm just satisfied with getting a tool as an aid.
Personal opinion: we need to focus more on efficiency instead of how large or complex a model can get, as that model creeps into more resource requirements. If the goal is to cost a billion dollars to operate, then we've really lost the idea of what models are supposed to be achieving.
minimaxir 1 hour ago

To an extent. I've had GPT 5.5 solve problems that Opus 4.7 struggled with, using an identical AGENTS.md/CLAUDE.md and no skills.
bluegatty 1 hour ago

You will immediately notice the difference if you use it at the threshold.
It's like most people watching a 'starting NBA player' (not a superstar, just a starting player) vs one that sits on the bench.
If you were to just watch them play, work out, shoot - you'd never notice the difference.
Put them head to head and it's 98-54 and you start to see the patterns.
It's pretty interesting actually, someone tell me what the 'science' for this is, I'm sure there is some kind of information theory at work here.
Software has innumerable kinds of problems at varying level of complexity and so it provides the perfect testbed for seeing how far models can go in practice.
Should add: you're very right to hint that harness, tooling, and models tuned to both the harness and the kinds of things people do on the harness, as well as some other things, do make an enormous difference.
By and large, SOTA Codex/Claude Code are substantially better - at least for now. That may change.
- dnnddidiej 25 minutes ago
  
  Head to head is interesting. I had not tried 2 agents on the same task simultaneously with 2 models.
nl 1 hour ago

The difference is very noticeable as your codebase gets bigger and you give higher and higher level tasks.
I've certainly had things that Opus fixed using some kind of work around that GPT-5.5 actually solved.
And the difference between the Sonnet/Gemini/DeepSeek tier to the Opus/GPT-5.5 tier is immediately obvious.
raincole 1 hour ago

By definition the differences between "best models" are small. It's a tautology. If a model is significantly dumber than the others then it's not one of the best models.

vishal_new 38 minutes ago

What are your thoughts on software engineer replacement? My team has already seen big reductions. The QA team is gone. Software engineers reduced by a third. Scared for the future.

ShinyLeftPad 31 minutes ago

If you're famous, you'll be fine. If you're in retiring age, you don't care. Otherwise, good luck! We put ourselves on the street by not protesting what is happening.
simonw 4 minutes ago

Ditching the QA team when the single highest challenge is verifying that vibe-coded systems do what they're meant to is extraordinarily short-sighted.
Personally, the more time I spend working with coding agents the less worried I am for my career. Getting the best results out of them is really hard. They amplify existing skills and experience, so the more experience you have the better.

ex-aws-dude 1 hour ago

Is the RLVR the key breakthrough for the uplift or is there more to it?

Does that suggest the uplift was only for things that are easily verifiable like code?

4b11b4 1 hour ago

RL we're gonna find out will get abandoned cuz we don't even know what is getting "aligned", just my naive gut feeling don't take it seriously
rdedev 42 minutes ago

I would say that most improvements are in easily verifiable things like code or math. At least that's where all the amazing results seem to be coming from.
For other domains I am not sure, but I've heard from people like Cal Newport that the rate of improvement outside of code and math is not as impressive.

grey-area 45 minutes ago

Haven’t noticed much significant progress in LLMs myself in 6 months (significant as in new or vastly improved capabilities or understanding, not new releases, there are plenty of those).

I feel like if anything people started to realise the significant limitations of LLMs when you try to use them as ‘agents’ which was the big direction LLM companies tried to push recently.

Best use of LLMs so far IMO is finding vulnerabilities (with human help) and pattern matching in other domains. For generating code and prose they are still mediocre and somewhat unreliable and for use as personal assistant agents I wouldn’t trust them.

So what’s happening with openclaw, the biggest experiment in agentic, vibe coded by the agents themselves? The thing that was so hot a few months ago.

https://github.com/openclaw/openclaw/pulse?period=daily

279 commits to main from 77 authors in the last 24 hours.

Why is there so much churn and how could you trust it with your data? This is changes in ONE day!

If these are useful changes, surely it’d be superhuman by now given months of this pace.

What are people using this for?

dnnddidiej 28 minutes ago

Also LinkedIn wars of people trying to claim throne as most AI-pilled, throwing down strawmen stories of luddites yelling at data centres who'll lose their job to a single person doing 100x work.

bb88 2 hours ago

I met Simon for the first time this year at pycon. Wow, what a great guy.

DeathArrow 52 minutes ago

I think that there's a lot to be improved in harnesses and the way the models are interacting with harnesses. For example, the harness should be able to steer the model when thinking.

bluegatty 1 hour ago

'Producing Images' or even 'Some Code that is Valid and Compiles' is in some ways one of the most misleading ways we assess quality of the AI.

It is getting very good at producing code that compiles - at the algorithmic level.

This is definitely noteworthy - and the AI is crossing a critical 'productivity threshold'.

But 'Drawing of a Proper Duck' is almost arbitrary because it may have nothing to do with the 'Specific Duck You Wanted'.

Everyone has tried to get AI to 'Draw The Thing They Want' and you notice immediately how it's almost impossible to 'adjust the image' along the vector you want - because ... and this is key:

-> the AI doesn't really understand what a Duck is, its components, or fully how it made the duck <-

It just knows how to 'incant' the duck.

This becomes very clear when you try to get the AI to write proper documentation - it fails so miserably, even with direct guidance.

This is really strong evidence of how poorly the AI is generalizing, and that it is not 'understanding' rather it's 'synthesizing' from patterns.

We already kind of knew that - but we have not yet built an intuition for that until now.

Only now can we see 'how amazing the pattern synthesis' is - it's almost magic, and yet how it falls off a cliff otherwise

This has deep implications for the 'road ahead' and the kinds of things we're going to be able to do with AI.

In short: the AI is 'Wizard Level Code Helper, Researcher, and Worker' - but it very clearly lacks capabilities even one level of abstraction above the code itself.

LLMs were first trained by 'text' and now ... they are 'trained by our compilers'. Basically g++, javac, tsc are the 'Verifiable Human Rewards' in the post-training and reinforcement learning - and the AI is getting extremely good at producing 'code that compiles', but that's definitely an indirection from 'code that does what we want'.

It's astonishing that it took us all this time to internalize and start to discover what I think will be in hindsight a very obvious 'threshold' of its capabilities.

We are constantly 'amazed' at the work that it can do, and therefore over-project its capabilities.

I have no doubt that even with these limitations - the AI will unlock a lot more as it gets better - and - that it will 'creep up' the layers of abstraction of its understanding.

But I strongly believe that the AI is going to get much 'wider' (pattern matching dominance) before it gets 'higher' (intrinsic understanding) - and - that this may be a fundamental limitation.

This may be 'the LeCun' insight - when he talks about the limitations of LLMs in detail - I believe this is that insight writ large.

Even the term AI - or certainly 'AGI' may be a misleading metaphor - were we to have always called it 'Stochastic Algorithms' or something along those lines, it's possible that our intuition would be framed a bit better.

The most interesting thing is how it is definitely amazing, world changing, novel and powerful in some ways - and obviously useless in others at the same time. That's the 'threshold' we need to better understand.
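A rough sketch of the distinction drawn above between 'code that compiles' and 'code that does what we want', using Python's built-in `compile()` as a stand-in for g++/javac/tsc. The `compiles` and `behaves` helpers, the candidate snippet, and the test cases are all hypothetical; this is an illustration of two different verifiable checks, not a description of how any lab actually computes rewards.

```python
def compiles(source: str) -> bool:
    # Weak signal: the text is syntactically valid Python.
    try:
        compile(source, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False


def behaves(source: str, cases) -> bool:
    # Stronger signal: the defined `add` function returns the expected values.
    # exec() on untrusted code is unsafe; it is used here for illustration only.
    namespace = {}
    try:
        exec(source, namespace)
        return all(namespace["add"](*args) == expected for args, expected in cases)
    except Exception:
        return False


# Syntactically valid, but it subtracts instead of adding.
CANDIDATE = "def add(a, b):\n    return a - b\n"
CASES = [((2, 3), 5), ((0, 0), 0)]

compile_signal = compiles(CANDIDATE)
behavior_signal = behaves(CANDIDATE, CASES)
```

The candidate earns the compile signal but not the behavior signal, so a reward based only on the first check would score it as a success.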

nl 1 hour ago

> But 'Drawing of a Proper Duck' is almost arbitrary because it may have nothing to do with the 'Specific Duck You Wanted'.
That might be the case, but Simon's case "Generate an SVG of a pelican riding a bicycle" is very different.
The model actually has to understand what parts of a pelican and bicycle come together in something like an anatomically plausible way. That's a higher level of abstraction than something like passing the same prompt to Stable Diffusion etc
(The new Nano Banana/GPT Image 2.0 models are different though - they have significant world knowledge baked in)
- bluegatty 1 hour ago
  
  "That's a higher level of abstraction"
  No, it's not because it's seen 'anatomy' for Pelicans, Animals - even how it's represented in Animals.
  If you try to get the AI to actually decompose it and start to 'draw pelicans' in very obscure ways, it will immediately fail.
  Try to get the AI to draw the pelican from a very odd angle - like underneath, to the right, one wing extended, one wing not ... 0% chance.
  Precisely because it does not understand those things.
  FYI it's a slightly unfair case because it does not have 'world model' yet, which will actually solve that problem, but even then not through very much abstracting.
  We're a long way away - but in the meantime, there's lots to unpack.
  
  IanCal 18 minutes ago
  
  Are we a long way away?
  https://chatgpt.com/share/e/6a0bf28b-e198-8012-9a88-c777d965...
  
  nl 13 minutes ago
  
  Link doesn't work - maybe not public?
  
  nl 13 minutes ago
  
  > Try to get the AI to draw the pelican from a very odd angle - like underneath, to the right, one wing extended, one wing not ... 0% chance.
  Proof by existence?
  https://gist.github.com/nlothian/50241d34a654fcf0caa280d4475...
  Looks pretty good to me. ChatGPT in "Thinking" mode.
  Edit: I've added the Opus version on the same link.

DeathArrow 1 hour ago

Apart from GLM 5.1 and Qwen 3.6, there are other Chinese models that are noteworthy: Kimi K2.6, Xiaomi MiMo V2.5 Pro, Deepseek v4 and MiniMax M2.7.

simonw 1 hour ago

100% true - I only had five minutes so I had to edit it down to just a couple, but all of those models are excellent and keep leap-frogging each other.
- rahimnathwani 26 minutes ago
  
  Looking forward to next time, hoping you mention speculative decoding and MTP :)
  It would support your point about the performance of 20GB local models.

aizk 2 hours ago

I'm so glad Simon is documenting this. The field is evolving so fast, so rapidly, so hungry for data and money, that few are willing to zoom out and document everything big picture so we can see the changes over time. I mean do you guys remember "Do anything now"? Just a distant memory, a funny party trick.

tayo42 1 hour ago

The claw thing really came and went fast lol

yieldcrv 30 minutes ago

I just started a new job and the person I report to was just excited to tell me about it, here in Mid May
"and then you have to get a mac mini, and then, and then"
smile and nod, it pays weekly

iekekke 2 hours ago

It’s good to see dates being hard-coded re: improvements in the models that should deliver material gains.

As time progresses one now has a yard stick to measure against progress. No more excuses - show me the money baby.