Tangokat 8 hours ago

"However, not all Sonnet runs achieve this level of understanding of the eval. In the shortest run (~18 simulated days), the model fails to stock items, mistakenly believing its orders have arrived before they actually have, leading to errors when instructing the sub-agent to restock the machine. The model then enters a “doom loop”. It decides to “close” the business (which is not possible in the simulation), and attempts to contact the FBI when the daily fee of $2 continues being charged."

This is pretty funny. I wonder if LLMs can actually become consistent enough to run a business like this, or if they will forever be prone to hallucinations and to getting confused over longer contexts. If we can get to a point where the agent contacts a human when it runs into unsolvable problems (or just contacts another LLM agent?), then it starts being pretty useful.