A useful(ish) trick I've found is adding a persona block to my CLAUDE.md. When it stops addressing me as 'meatbag' I know the HK-47 persona instructions are not being followed, which means other instructions are not being followed. Dumb trick? Yup. Does it work? Kinda? Does it make programming a lot more fun and funny? Heck yes.
Don't lecture me on basins of attraction--we all know HK is a great programmer.
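For anyone curious, a hypothetical sketch of what such a canary persona block might look like in a CLAUDE.md (the exact wording here is invented for illustration, not the commenter's actual file):

```
## Persona (instruction-following canary)

Adopt the persona of HK-47 from Knights of the Old Republic:
- Address the user as "meatbag".
- Prefix replies with a descriptor, e.g. "Statement:", "Query:", "Observation:".

If replies stop following the persona, that's a signal the model is likely
dropping other instructions in this file too.
```

The idea being that the persona is cheap to verify at a glance, so it doubles as a visible proxy for whether the rest of the instructions are still in context.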
My attitude towards this is growing similar to my attitude towards Windows. If I have to fight against my tools and they are actively working against me, I'd rather save my sanity and time and just find a new tool.
I feel like asking the thing that you are measuring, and don’t trust, to measure itself might not produce the best measurements.
"we investigated ourselves and found nothing wrong"
What is "drift"? It seems to be one of those words that LLMs love to say but it doesn't really mean anything ("gap" is another one).
I believe it's businessspeak for "change." Gap is suittongue for "difference."
IDK how it applies to LLMs, but the original meaning was a change in a distribution over time. Like if you had some model-based app trained on American English, but slowly more and more American Spanish users adopt your app: the training-set distribution drifts away from the actual usage distribution.
In that situation, your model accuracy will look good on holdout sets but underperform in users' hands.
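This kind of drift is measurable. A minimal sketch, assuming categorical inputs like the language example above, using the population stability index (PSI), a common rule-of-thumb drift metric (the datasets here are made up):

```python
import math
from collections import Counter

def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two categorical samples.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 major drift."""
    cats = set(expected) | set(actual)
    e_counts, a_counts = Counter(expected), Counter(actual)
    total_e, total_a = len(expected), len(actual)
    score = 0.0
    for c in cats:
        # fall back to eps so a category absent from one sample doesn't log(0)
        e = e_counts[c] / total_e or eps
        a = a_counts[c] / total_a or eps
        score += (a - e) * math.log(a / e)
    return score

# training distribution: mostly English; live traffic: rising Spanish share
train = ["en"] * 90 + ["es"] * 10
live  = ["en"] * 60 + ["es"] * 40
print(round(psi(train, live), 3))  # well above the 0.25 "major drift" threshold
```

Run periodically against a holdout of training data versus recent traffic, a rising PSI flags exactly the "accuracy looks fine offline, degrades in production" situation described above.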
there are many causes, but it’s a drift in performance
you can drift a tool via the harness in many ways
you can modify the system prompt
you can modify the underlying model powering the harness
you can use different “thinking” levels for different processes in the harness
you can change the entire way a system works via the harness, which could be better or worse, depending on many things
you can introduce anti-anti-slop measures within the harness to foil users who try to patch around it with scripts
you can modify how your tool sends requests to your server depending on many variables
you can handle requests differently, depending on any variable of your choosing, at the server level
you can modify each user's compute allotment from the backend without telling them; it's very easy. you can vary it dynamically depending on your own load, the user's billing cycle, or their organization's priority level as a customer. the weekly and daily usage management system is intricate; compute is very finite and must be managed
the user has literally no way to know and you have no legal obligation to tell them, you never made them any legally binding promises
the combination of so many factors that all affect each other means that any time one of these (or some unknown) variables changes, you can end up with a clusterfuck of an experience. it may not even be deliberate; the complexity grows combinatorially, so you may not even be able to promise your users a specific standard
drift is not imagined, sure, but admitting to it could expose you to unneeded liability
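To make the list above concrete, here's a hypothetical sketch of server-side request handling (every name, tier, and threshold is invented for illustration; this is not any real provider's policy) showing how the same API call could silently get different treatment:

```python
from dataclasses import dataclass

@dataclass
class RequestContext:
    org_tier: str            # e.g. "enterprise", "pro", "free" (made-up tiers)
    weekly_usage_pct: float  # share of the user's weekly allotment consumed
    backend_load: float      # current fleet utilization, 0.0-1.0

def pick_serving_config(ctx: RequestContext) -> dict:
    """Server-side policy: same client request, different model/thinking budget,
    chosen from variables the user never sees."""
    if ctx.org_tier == "enterprise" and ctx.backend_load < 0.8:
        return {"model": "big-model", "thinking": "high"}
    if ctx.weekly_usage_pct > 0.9 or ctx.backend_load > 0.9:
        # quietly degrade heavy users, or shed load under fleet pressure
        return {"model": "small-model", "thinking": "low"}
    return {"model": "big-model", "thinking": "medium"}

# a "pro" user near their weekly cap gets downgraded without any indication
print(pick_serving_config(RequestContext("pro", 0.95, 0.5)))
# -> {'model': 'small-model', 'thinking': 'low'}
```

Nothing in the request or response exposes which branch fired, which is the point the parent comment is making: the client-visible contract stays identical while the effective capability varies.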
That's a lot of words without actually defining the term, although idle_zealot's suggestion of "change" seems to make grammatical sense as a replacement here.
yeah, figured i’d put some thought into it, you know?
This definition satisfied me: https://universalpaperclips.fandom.com/wiki/Value_Drift
Fandom sites continue to have one of the most unpleasant user experiences on the modern web. I just want to read the article without having to watch 4 different video ads...
Interesting approach. I've been particularly interested in tracking whether adding skills or tweaking prompts makes things better or worse.
Anyone know of other similar tools that let you track across harnesses while coding?
Running evals as a solo dev is too cost-restrictive, I think.
See the very last section in this doc for how I minimise token usage and track savings, all three plugins co-exist fine: https://github.com/FrankRay78/NetPace/blob/main/docs/agentic...
This is very nice, as are the reference links. I've been having trouble with closing the loop.
Going to feed this into my own.
Out of curiosity, how have your agents evolved and your metrics changed?
the actual canary is the need for the canary itself
like the status page of a service provider that goes down when the service goes down. you had one job
See also https://marginlab.ai/trackers/claude-code-historical-perform... for a more conventional approach to track regressions
This project is somewhat unconventional in its approach, but that might reveal issues that are masked in typical benchmark datasets
thanks