Ask HN: What's your biggest LLM cost multiplier?
"Tokens per request" has been a misleading cost model for us in production. The real drivers seem to be multipliers: retries/429s, tool fanout, P95 context growth, and safety passes.
What’s been the biggest cost multiplier in your prod LLM systems, and what policies worked (caps, degraded mode, fallback, hard fail)?
- Tool calling: This is unavoidable, but I try to structure the tools so that the total number of tool calls per input is minimised.
- Using UUIDs in the prompt (which can happen if you serialise a data structure that contains UUIDs into a prompt): Just don't use UUIDs, or if you must, map them onto short unique numbers (in memory) before adding them to a prompt, and map them back afterwards.
- Putting everything in one LLM chat history: Use sub agents with their own chat history, and discard that history after the sub agent finishes.
- Structure your system prompt to maximize cached input tokens: You can do this by putting all the variable parts of the system prompt towards the end of it, if possible.
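For the UUID tip above, here is a minimal sketch of the in-memory mapping. `UuidAliaser` and the `id1`, `id2` alias format are made up for illustration, and it assumes the rest of your data never contains strings that look like those aliases.

```python
import re

# Matches canonical 8-4-4-4-12 hex UUIDs.
UUID_RE = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
)
ALIAS_RE = re.compile(r"\bid\d+\b")


class UuidAliaser:
    """Hypothetical helper: swaps UUIDs for short aliases and back."""

    def __init__(self):
        self._to_alias = {}
        self._to_uuid = {}

    def shorten(self, text):
        """Replace every UUID in text with a stable short alias."""
        def repl(match):
            uuid = match.group(0).lower()
            if uuid not in self._to_alias:
                alias = f"id{len(self._to_alias) + 1}"
                self._to_alias[uuid] = alias
                self._to_uuid[alias] = uuid
            return self._to_alias[uuid]
        return UUID_RE.sub(repl, text)

    def restore(self, text):
        """Replace known aliases in model output with the original UUIDs."""
        return ALIAS_RE.sub(
            lambda m: self._to_uuid.get(m.group(0), m.group(0)), text
        )
```

Call `shorten()` on the serialised data before it goes into the prompt and `restore()` on the model output before acting on it. Keep one aliaser per conversation so the same UUID always gets the same alias.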
In my experience the biggest multiplier isn't any single variable; it's the interaction between them. Fanout × retries × context growth compounds in ways that linear cost models completely miss.
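A toy comparison of the two models, to make the compounding concrete. The numbers and both function names are illustrative assumptions, not measurements, and the model ignores second-order effects such as retries of retries.

```python
def multiplicative_model(fanout, retry_rate, context_growth):
    """Cost relative to a single call at the starting context size.

    fanout:         average model calls per request (tool loops, sub-calls)
    retry_rate:     fraction of calls retried once (timeouts, 429s)
    context_growth: average context size relative to the starting context
    """
    return fanout * (1 + retry_rate) * context_growth


def additive_model(fanout, retry_rate, context_growth):
    """What you get if you add up each effect's overhead in isolation."""
    return 1 + (fanout - 1) + retry_rate + (context_growth - 1)


# Illustrative values only: 4 calls per request, 25% retries, 2x context.
compounded = multiplicative_model(4, 0.25, 2.0)
linear = additive_model(4, 0.25, 2.0)
print(f"compounded: {compounded:.2f}x, linear estimate: {linear:.2f}x")
```

With these made-up inputs the linear estimate comes out at roughly half the compounded one, and the gap widens as any one factor grows.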
The fix that worked for us: treat budget as a hard constraint, not a target. When you're approaching the limit, degrade gracefully (shorter context, fewer tool calls, fallback to a smaller model) rather than letting costs explode and cleaning up later.
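One way to sketch that degradation ladder, assuming you track spend per run. The tier thresholds, model names, and limits are placeholders, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    model: str
    max_context_tokens: int
    max_tool_calls: int


# Hypothetical tiers, ordered from full service to most degraded.
# Each entry is (minimum share of budget still remaining, plan).
TIERS = [
    (0.50, Plan("large-model", 32_000, 8)),
    (0.20, Plan("large-model", 8_000, 3)),
    (0.00, Plan("small-model", 4_000, 1)),
]


def choose_plan(spent, budget):
    """Return the plan for the remaining budget share, or None to hard fail."""
    if budget <= 0 or spent >= budget:
        return None  # budget exhausted: refuse instead of overspending
    remaining = (budget - spent) / budget
    for floor, plan in TIERS:
        if remaining >= floor:
            return plan
    return None
```

Calling `choose_plan` before every step, rather than once per request, is what lets a long multi-step run step down through the tiers instead of overshooting.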
Also worth tracking: the 90th percentile request often costs 10x the median. A handful of pathological queries can dominate your bill. Capping max tokens per request is crude but effective.
+1 on interaction terms + tails: fanout × retries × context growth is where linear token math dies.
One thing we do in enzu is make “budget as constraint” executable: we clamp `max_output_tokens` from the budget before the call, and in multi-step/RLM runs we adapt output caps downward as the budget depletes (so it naturally gets shorter/cheaper instead of spiraling). When token counting is unavailable we explicitly enter a “budget degraded” mode rather than pretending estimates are exact.
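To illustrate the clamping idea (this is a standalone sketch, not enzu's actual code or API): derive the output cap from whatever budget is left before each call. Amounts are integer micro-dollars to avoid float rounding, and the prices in the example are made up.

```python
def clamp_max_output_tokens(remaining_budget, input_tokens,
                            input_price, output_price,
                            requested_max, floor=16):
    """Largest output cap that keeps one call inside the remaining budget.

    remaining_budget, input_price and output_price are integer micro-USD
    (prices are per token). Returns 0 when fewer than `floor` output
    tokens are affordable, so the caller can skip the call entirely.
    """
    left_for_output = remaining_budget - input_tokens * input_price
    if left_for_output <= 0:
        return 0
    affordable = left_for_output // output_price
    cap = min(requested_max, affordable)
    return cap if cap >= floor else 0


# Illustrative: 2,000 input tokens at 3 micro-USD each, output at 15 each.
for remaining in (50_000, 10_000, 6_100):
    cap = clamp_max_output_tokens(remaining, 2_000, 3, 15, requested_max=4_096)
    print(remaining, cap)
```

As the remaining budget shrinks, the cap shrinks with it, which gives the "adapts downward as the budget depletes" behaviour. The input token count has to come from a real tokenizer for this to be exact, which is where the degraded mode mentioned above comes in.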
Also agree p90/p95 cost/run matters more than averages; max-output caps are crude but effective.
Docs: https://github.com/teilomillet/enzu/blob/main/docs/PROD_MULT... and https://github.com/teilomillet/enzu/blob/main/docs/BUDGET_CO...
If you’re trying to estimate before prod, logging these 4 things in a pilot gets you 80% there:
- tokens/run (in+out)
- tool calls/run (and fanout)
- retry rate (timeouts/429s)
- context length over turns (P50/P95)
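A rough sketch of summarising those four signals from pilot logs. The record keys (`tokens_in`, `tokens_out`, `tool_calls`, `model_calls`, `retries`, `context_tokens`) are a hypothetical log schema, and the percentile uses the simple nearest-rank method.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile (pct in 0-100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(runs):
    """Summarise pilot runs; each run is a dict with the assumed keys."""
    n = len(runs)
    contexts = [r["context_tokens"] for r in runs]  # peak context per run
    total_calls = sum(r["model_calls"] for r in runs)
    return {
        "tokens_per_run": sum(r["tokens_in"] + r["tokens_out"] for r in runs) / n,
        "tool_calls_per_run": sum(r["tool_calls"] for r in runs) / n,
        # Retries as a fraction of all model calls made in the pilot.
        "retry_rate": sum(r["retries"] for r in runs) / max(1, total_calls),
        "context_p50": percentile(contexts, 50),
        "context_p95": percentile(contexts, 95),
    }
```

Comparing `context_p95` against `context_p50` is a quick check for the tail: a large gap means a few runs will dominate the bill.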
Fanout × retries is the classic “bill exploder”, and P95 context growth is the stealth one. The point of “budget as contract” is deciding in advance what happens at the limit (degraded mode / fallback / partial answer / hard fail), not discovering it from the invoice.
Background note I wrote (framing + “budget as contract”): https://github.com/teilomillet/enzu/blob/main/docs/BUDGETS_A...