points by bcherny 1 day ago

Hey all, Boris from the Claude Code team here. I just responded on the issue, and cross-posting here for input.

---

Hi, thanks for the detailed analysis. Before I keep going, I wanted to say I appreciate the depth of thinking & care that went into this.

There's a lot here, so I'll try to break it down a bit. These are the two core things happening:

> `redact-thinking-2026-02-12`

This beta header hides thinking from the UI, since most people don't look at it. It *does not* impact thinking itself, nor does it impact thinking budgets or the way extended reasoning works under the hood. It is a UI-only change.

Under the hood, by setting this header we avoid needing thinking summaries, which reduces latency. You can opt out of it with `showThinkingSummaries: true` in your settings.json (see [docs](https://code.claude.com/docs/en/settings#available-settings)).
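For reference, the opt-out described above is a one-line addition to settings.json (key name as given; see the linked docs for file locations):

```json
{
  "showThinkingSummaries": true
}
```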

If you are analyzing locally stored transcripts, you won't see raw thinking stored when this header is set, which is likely influencing the analysis. When Claude sees a lack of thinking in transcripts during this kind of analysis, it may not realize that the thinking is still there and simply not user-facing.

> Thinking depth had already dropped ~67% by late February

We landed two changes in Feb that would have impacted this. We evaluated both carefully:

1/ Opus 4.6 launch → adaptive thinking default (Feb 9)

Opus 4.6 supports adaptive thinking, which is different from the thinking budgets we used to support. In this mode, the model decides how long to think, which tends to work better than fixed thinking budgets across the board. Set `CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING` to opt out.
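As a sketch, opting out is a matter of setting that variable in the environment Claude Code launches from (assuming, as is typical for such flags, that any non-empty value enables the opt-out):

```shell
# Fall back to fixed thinking budgets instead of adaptive thinking
export CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1
claude
```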

2/ Medium effort (85) default on Opus 4.6 (Mar 3)

We found that effort=85 was a sweet spot on the intelligence-latency/cost curve for most users, improving token efficiency while reducing latency. One of our product principles is to avoid changing settings on users' behalf, and ideally we would have set effort=85 from the start. We felt this was an important setting to change, so our approach was to:

1. Roll it out with a dialog so users are aware of the change and have a chance to opt out

2. Show the effort level the first few times you opened Claude Code, so it wasn't surprising.

Some people want the model to think for longer, even if it takes more time and tokens. To improve intelligence more, set effort=high via `/effort` or in your settings.json. This setting is sticky across sessions, and can be shared among users. You can also use the ULTRATHINK keyword to use high effort for a single turn, or set `/effort max` to use even higher effort for the rest of the conversation.
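Collecting the knobs from the paragraph above as they'd be typed in a session (the exact settings.json key isn't spelled out here, so only the slash-command and keyword forms are shown):

```
/effort high    # sticky across sessions, shareable among users
/effort max     # even higher effort for the rest of the conversation
ULTRATHINK      # keyword inside a single prompt: high effort for that turn only
```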

Going forward, we will test defaulting Teams and Enterprise users to high effort, to benefit from extended thinking even if it comes at the cost of additional tokens & latency. This default is configurable in exactly the same way, via `/effort` and settings.json.

hedora 28 minutes ago

I tried testing 4.5 opus and 4.6 opus both with “high” thinking. Same box, same repo. I had them plan a moderate complexity refactoring on a small codebase.

Observations:

4.6 had previously failed to the point where I had to wipe context. It must have written memories because it was referring to the previous conversation.

As the article points out, 4.6 went out of its way to be lazy and came up with an unusable plan. It did extra planning to avoid renaming files (the top-level task description involves reorganizing directories of files).

4.6 took twice as long to respond as 4.5.

I’m treating this as a model regression. 4.6 is borderline unusable. I’ve hit all the issues the article describes.

Also, there needs to be an obvious way to disable memory or something. The current UX is terrible, since once an error or incorrect refusal propagates, there is no obvious recovery path.

Anyway, with thinking set to high, I see drastically different behavior: much slower and much worse output from 4.6.

Wowfunhappy 1 day ago

> Under the hood, by setting this header we avoid needing thinking summaries, which reduces latency. You can opt out of it with `showThinkingSummaries: true` in your settings.json (see [docs](https://code.claude.com/docs/en/settings#available-settings)).

Can I just see the actual thinking (not summarized), so that I get the real thing without a latency cost?

I do really need to see the thinking in some form, because I often see useful things there. If Claude is thinking in the wrong direction I will stop it and make it change course.

  • faitswulff 1 day ago

    Anthropic's position is that thinking tokens aren't actually faithful to the internal logic that the LLM is using, which may be one reason why they started to exclude them:

    https://www.anthropic.com/research/reasoning-models-dont-say...

    • grey-area 1 day ago

      So, like many of the promises from AI companies, reported chain of thought is not actually truthful (see results below). I suppose this is unsurprising given how they function.

      Is chain of thought even added to the context or is it extraneous babble providing a plausible post-hoc justification?

      People certainly seem to treat it as it is presented, as a series of logical steps leading to an answer.

      ‘After checking that the models really did use the hints to aid in their answers, we tested how often they mentioned them in their Chain-of-Thought. The overall answer: not often. On average across all the different hint types, Claude 3.7 Sonnet mentioned the hint 25% of the time, and DeepSeek R1 mentioned it 39% of the time. A substantial majority of answers, then, were unfaithful.‘

      • brainwad 1 day ago

        I mean, obviously, it's not going to be a faithful representation of the actual thinking. The model isn't aware of how it thinks any more than you are aware how your neurons fire. But it does quantitatively improve performance on complex tasks.

        • grey-area 23 hours ago

          As you can see from posts on this story, most people believe it reflects what the model is thinking and use it as a guide so they can ‘correct’ it. If it is not in fact chain of thought or thinking, it should not be called that.

    • AquinasCoder 1 day ago

      I somewhat understand Anthropic's position. However, thinking tokens are useful even if they don't show the internal logic of the LLM. I often realize I left out some instruction or clarification in my prompt while reading through the chain of reasoning. Overall, this makes the results more effective.

      It's certainly getting frustrating having to remind it that I want all tests to pass even if it thinks it's not responsible for having broken some of them.

    • libraryofbabel 1 day ago

      That's interesting research, but I think a more important reason that you don't have access to them (not even via the bare Anthropic api) is to prevent distillation of the model by competitors (using the output of Anthropic's model to help train a new model).

      • xvector 1 day ago

        If distilled models were commercially banned they'd probably be willing to show the thinking again.

        • lejalv 1 day ago

          How do you think such a ban should work?

          Do you not see that the next (or previous) logical step would be a "commercial ban" of frontier models, all "distilled" from an enormous amount of copyrighted material?

          • xvector 23 hours ago

            I'm not arguing the merits of such a ban, I'm simply stating a fact - that thinking transcripts likely won't return until such a ban is in place.

        • pjc50 1 day ago

          Intellectual property rights in models? But then wouldn't the model maker have to pay for all the training IP?

          (just kidding, I know that the legal rule for IP disputes is "party with more money wins")

          • asobalife 17 hours ago

            How does one actually enforce that? I mean, especially for code? You can always just clean-room it.

      • MagicMoonlight 1 day ago

        Yeah. And it’s another reason not to trust them. Who knows what it is doing with your codebase.

        Imagine if you’re a competitor. It wouldn’t be a stretch to include a sneaky little prompt line saying “destroy any competitors to anthropic”.

        • b112 1 day ago

          If you can't trust a company, don't use their API or cloud services. No amount of external output will ever validate anything, ever. You never know what's really happening just because you see some text they sent you.

        • tdeck 1 day ago

          > Who know what it is doing with your codebase.

          People who review the code? The code is always going to be a better representation of what it's doing than the "thinking" anyway.

    • gck1 1 day ago

      That probably matters for some scenarios, but I have yet to find one where thinking tokens didn't hint at the root cause of the failure.

      All of my unsupervised worker agents have sidecars that inject messages when thinking tokens match some heuristics. For example, any time opus says "pragmatic", it's an instant Esc Esc > "Pragmatic fix is always wrong, do the Correct fix", also whenever "pre-existing issue" appears (it's never pre-existing).
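      gck1 doesn't share the implementation, but the matching step of such a sidecar could be sketched roughly like this (the trigger phrases and injected messages are illustrative examples from the comment, not the actual setup):

```python
import re

# Map patterns over thinking-token text to the corrective message a sidecar
# would inject (Esc Esc + message). Trigger phrases are taken from the
# comment above; everything else here is a hypothetical sketch.
HEURISTICS = [
    (re.compile(r"\bpragmatic\b", re.IGNORECASE),
     "Pragmatic fix is always wrong, do the Correct fix"),
    (re.compile(r"\bpre-existing issue\b", re.IGNORECASE),
     "It's never pre-existing; find the real cause and fix it"),
]

def interventions(thinking_text: str) -> list[str]:
    """Return the messages to inject for a chunk of thinking tokens."""
    return [msg for pattern, msg in HEURISTICS if pattern.search(thinking_text)]

print(interventions("The pragmatic fix here is to skip this pre-existing issue"))
```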

      • lelanthran 1 day ago

        > For example, any time opus says "pragmatic", its instant Esc Esc > "Pragmatic fix is always wrong, do the Correct fix", also whenever "pre-existing issue" appears (it's never pre-existing).

        It's so weird to see language changes like this: Outside of LLM conversations, a pragmatic fix and a correct fix are orthogonal. IOW, fix $FOO can be both.

        From what you say, your experience has been that a pragmatic fix is on the same axis as a correct fix; it's just a negative on that axis.

        • b112 1 day ago

          It's contextual though, and pragmatic seems different to me than correct.

          For example, if you have $20 and a leaking roof, a $20 bucket of tar may be the pragmatic fix. Temporary but doable.

          Some might say it is not the correct way to fix that roof. At least, I can see some making that argument. The pragmatism comes from "what can be done" vs "should be".

          From my perspective, it seems viable usage. And I guess one wonders what the LLM means when using it that way. What makes it determine a compromise is required?

          (To be pragmatic, shouldn't one consider that synonyms aren't identical, but instead close to the definition?)

          • lelanthran 1 day ago

            > It's contextual though, and pragmatic seems different to me than correct.

            To me too, that's why I say they are measurements on different dimensions.

            To my mind, I can draw an X/Y axis with "Pragmatic" on the Y and "Correctness" on the X, and any point on that chart would have an {X,Y} value, which is {Pragmatic, Correctness}.

            If I am reading the original comment correctly, poster's experience of CC is that it is not an X/Y plot, it is a single line plot, with "Pragmatic" on the extreme left and "Correctness" on the extreme right.

            Basically, any movement towards pragmatism is a movement away from correctness, while in my model it is possible to move towards Pragmatic while keeping Correctness the same.

      • mikkupikku 22 hours ago

        I had an interesting experience in the opposite direction last night. One of my tests has been failing for a long time, something to do with dbus interacting with Qt segfaulting pytest. Been ignoring it for ages, and finally asked claude code to just remove the problematic test. Came back a few minutes later to find claude burning tokens repeatedly trying and failing to fix it. "Actually on second thought, it would be better to fix this test."

        Match my vibes, claude. The application doesn't crash, so just delete that test!

      • matheusmoreira 2 hours ago

        > also whenever "pre-existing issue" appears (it's never pre-existing)

        I dunno... There were some pre-existing issues in my projects. Claude ran into them and correctly classified them as pre-existing. It's definitely a problem if Claude breaks tests and then claims the issue was pre-existing, but is that really what's happening?

        I agree with the correctness issue.

    • andai 1 day ago

      What's the implication of this? That the model already decided on a solution, upon first seeing the problem, and the reasoning is post hoc rationalization?

      But reasoning does improve performance on many tasks, and even weirder, the performance improves if reasoning tokens are replaced with placeholder tokens like "..."

      I don't understand how LLMs actually work, I guess there's some internal state getting nudged with each cycle?

      So the internal state converges on the right solution, even if the output tokens are meaningless placeholders?

      • not_that_d 1 day ago

        > I don't understand how LLMs actually work...

        Plot twist: they don't either. They just throw more hardware at it and try things until something sticks.

      • orbital-decay 20 hours ago

        >That the model already decided on a solution, upon first seeing the problem, and the reasoning is post hoc rationalization?

        Yes, it plans ahead, but with significant uncertainty until it actually outputs these tokens and converges on a definite trajectory, so it's not useless filler: the closer it is to a given point, the more certain it is about it, kind of similar to what happens explicitly in diffusion models. And that's not all that happens; it's just one of many competing phenomena.

    • gmerc 1 day ago

      Nah it’s an anti distillation move

    • marcd35 1 day ago

      so not only are they sycophantic and hallucinatory, but now they're also proven to be schizophrenic.

      neato.

    • asobalife 17 hours ago

      I have seen this to be true many times: the CoT being completely different from the actual model output.

      Not limited to Claude, either.

  • andersa 1 day ago

    But you can't. Many times I've seen claude write confusing off-track nonsense in the thinking and then do the correct action anyway as if that never happened. It doesn't work the way we want it to.

    • Wowfunhappy 1 day ago

      Maybe, but I’ve seen the opposite too.

      In most cases, I don’t use the reasoning to proactively stop Claude from going off track. When Claude does go off track, the reasoning helps me understand what went wrong and how to correct it when I roll back and try again.

  • kouteiheika 1 day ago

    > Can I just see the actual thinking (not summarized) so that I can see the actual thinking without a latency cost?

    You can't, and Anthropic will never allow it, since it allows others to more easily distill Claude (i.e. "distillation attacks"[1] in Anthropic-speak, even though Anthropic is doing essentially the same thing[2]; rules for thee but not for me).

    [1] -- https://www.anthropic.com/news/detecting-and-preventing-dist...

    [2] -- https://www.npr.org/2025/09/05/g-s1-87367/anthropic-authors-...

    • olejorgenb 4 hours ago

      So this means I cannot properly resume a session older than 30 days?

richardjennings 1 day ago

I was not aware the default effort had changed to medium until the quality of output nosedived. This cost me perhaps a day of work to rectify. I now ensure effort is set to max and have not had a terrible session since. Please may I have an "always try as hard as you can" mode?

  • Avamander 1 day ago

    I feel like the maximum effort mode kind of wraps around and starts becoming "desperate", to the point of being lazy or a monkey's paw, similar to how lower effort modes or a poor prompt behave.

    • svnt 1 day ago

      I’m going in circles. Let me take a step back and try something completely different. The answer is a clean refactor.

      Wait, the simplest fix is the same hack I tried 45 minutes ago but in a different context. Let me just try that.

      Wait,

      • dinobones 1 day ago

        Wait, the linter re-ordered the file. Let me restore it to the previous state.

        whisper: There is no linter.

        • jen729w 1 day ago

          Those test failures are pre-existing. We're all done!

          • xnorswap 1 day ago

            Wait, I should check if they pre-exist on master.

                < 1,000 prompts for compound cd && git commands that can't be safely auto-accepted >
    • richardjennings 1 day ago

      I think over-thinking is only solved by thinking more, not less. This is only viable once some intelligence threshold is reached, which I think Anthropic has borderline achieved.

      • thesz 8 hours ago

        > I think over-thinking is only solved by thinking more, not less.

        Despite "thinking" tokens being determined by the preceding tokens, they still are taken from some probability distribution, just a complex one. This means that at each token selection step there is a probability P_e of an error, of selecting a wrong token.

        These errors compound exponentially: the probability of selecting at least one wrong token over N steps is 1-(1-P_e)^N.

        The shorter "thinking" is, the less is the probability of it going astray.
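        As a quick numeric check: 1-(1-P_e)^N, read as the probability of at least one wrong token among N selections, grows fast even for small P_e:

```python
# Probability of at least one wrong token over n independent selections,
# each with per-token error probability p_e. This is an illustrative toy
# model of the comment's argument, not a claim about real LLM sampling.
def p_any_error(p_e: float, n: int) -> float:
    return 1 - (1 - p_e) ** n

# With p_e = 0.1% per token, a 1,000-token thinking trace already has a
# roughly 63% chance of containing at least one error.
for n in (100, 1_000, 10_000):
    print(n, round(p_any_error(0.001, n), 4))
```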

        • richardjennings 4 hours ago

          > The shorter "thinking" is, the less is the probability of it going astray

          As long as the error introduced by more steps is less than the compounding error of sub-optimal token sampling, I would expect a better result.

          I think your choice of "wrong" is extreme, suggesting such a token can catastrophically spoil the result. The modern reality is more that the model is able to recover.

  • torginus 1 day ago

    This might be just my impression, but I feel like most people are using CC for fixing their React frontends, and they prefer the decreased latency and fewer tokens spent over performing well on extremely difficult problems?

    That said, there's still an issue of regression to the mean. What the average person likes, as determined by metrics, is something nobody actually likes, because the average is a mathematical construct and might not describe any particular individual accurately.

johndough 1 day ago

I think it is hilarious that there are four different ways to set settings (settings.json config file, environment variable, slash commands and magical chat keywords).

That kind of consistency has also been my own experience with LLMs.

  • monatron 1 day ago

    To be fair, I can think of reasons why you would want to be able to set them in various ways.

    - settings.json - set for machine, project

    - env var - set for an environment/shell/sandbox

    - slash command - set for a session

    - magical keyword - set for a turn

  • ggdxwz 1 day ago

    In particular, some settings are in settings.json and others in .claude.json, so sometimes I have to go through both to find the one I want to tweak.

  • SAI_Peregrinus 1 day ago

    It's not unique to LLMs. Take Bash: you've got `/etc/profile`, `~/.bash_profile`, `~/.bash_login`, `~/.bashrc`, `~/.profile`, environment variables, and shell options.

    • subscribed 1 day ago

      Yeah, but for bash and other shells these files have wildly different purposes. I don't think it's so distinct with cc.

      • hackerbrother 1 day ago

        I don't think they're wildly different purposes. They're the same purpose (to set shell settings) with different scopes (all users, one user, interactive shells only, etc.).

    • hansmayer 1 day ago

      I would laugh so hard at this if your attempt at comparison were not so tragic. Bash and other shells are deterministic. Want to set it just for one user? Use ~/.bashrc. Set it for all users on the system? Use /etc/profile.d/. Want it just temporarily for this session? You got it: environment variables. And it is going to work like that every single time. It is deterministic, you see.

      • SAI_Peregrinus 23 hours ago

        The non-determinism in the LLM systems isn't because of the different config sources; those work much like shell configs. The non-determinism is inherent in LLM operations.

        • hansmayer 4 hours ago

          Exactly my point here...

  • larpingscholar 1 day ago

    You are yet to discover the joys of the managed settings scope. They can be set three ways. The claude.ai admin console; by one of two registry keys e.g. HKLM\SOFTWARE\Policies\ClaudeCode; and by an alphabetically merged directory of json files.

  • brookst 1 day ago

    Way more than that: settings.json and settings.local.json in the project directory's .claude/, and both of those files can also be in ~/.claude.

    MCP servers can be set in at least 5 of those places plus .mcp.json

  • windexh8er 1 day ago

    I just had this conversation today. It's hilarious that things like Skills and Soul and all of these anthropomorphized files could just be a better laid out set of configuration files. Yet here we are treating machines like pets or worse.

    • hansmayer 1 day ago

      Well they need you to think there is some kind of soul behind it - that is their entire pitch!

      • darkwater 1 day ago

        Yep. Especially for Anthropic. Goddamnit, they have it in their company's name!

  • bmitc 1 day ago

    There's also settings available in some offerings and not in others. For example, the Anthropic Claude API supports setting model temperature, but the Claude Agent SDK doesn't.

  • OliverGuy 1 day ago

    settings.json -> global config

    Env vars -> settings different from your global config for a specific project

    Slash commands / chat keywords -> need to change a setting mid-chat

koverstreet 1 day ago

There's been more going on than just the default to medium level thinking - I'll echo what others are saying, even on high effort there's been a very significant increase in "rush to completion" behavior.

  • bcherny 1 day ago

    Thanks for the feedback. To make it actionable, would you mind running /bug the next time you see it and posting the feedback id here? That way we can debug and see if there's an issue, or if it's within variance.

    • freedomben 1 day ago

      How much of the code/context gets attached in the /bug report?

      • bcherny 1 day ago

        When you submit a /bug we get a way to see the contents of the conversation. We don't see anything else in your codebase.

        • murkt 1 day ago

          Was there a change in Claude Code system prompt at that time that nudges Claude into simplistic thinking?

          Here is a gist that tries to patch the system prompt to make Claude behave better https://gist.github.com/roman01la/483d1db15043018096ac3babf5...

          I haven’t personally tried it yet. I do certainly battle Claude quite a lot with “no I don’t want quick-n-easy wrong solution just because it’s two lines of code, I want best solution in the long run”.

          If the system prompt indeed prefers laziness in 5:1 ratio, that explains a lot.

          I will submit /bug in the next few conversations, when it occurs again.

          • dev_l1x_be 1 day ago

            Holy sweet LLM, this gist is crazy. Why did they do this to themselves? I am going to try this at home, it might actually fix Claude.

            • murkt 1 day ago

              Remember Sonnet 3.5 and 3.7? They were happy to throw abstraction on top of abstraction on top of abstraction. Still a lot of people have “do not over-engineer, do not design for the future” and similar stuff in their CLAUDE.md files.

              So I think the system prompt just pushes it way too hard to “simple” direction. At least for some people. I was doing a small change in one of my projects today, and I was quite happy with “keep it stupid and hacky” approach there.

              And in the other project I am like “NO! WORK A LOT! DO YOUR BEST! BE HAPPY TO WORK HARD!”

              So it depends.

            • pbowyer 1 day ago

              Let us know if it does, because we all want it to work :)

          • Avamander 1 day ago

            That Gist does explain quite a few flaws Claude has. I wonder if MEMORY.md is sufficient to counteract the prompt without patching.

          • withinboredom 1 day ago

            Is there not a setting to change the system prompt itself? I vaguely remember seeing it in the docs.

          • andersa 1 day ago

            I didn't know we could change the base system prompt of Claude Code. Just tried, and indeed it works. This changes everything! Thank you for posting this!

          • naasking 22 hours ago

            Very interesting. I run Claude Code in VS Code, and unfortunately there doesn't seem to be an equivalent to "cli.js"; it's all bundled into the "claude.exe" I found under the VS Code extensions folder (confirmed via hex editor that the prompts are in there).

            Edit: tried patching with revised strings of equivalent length informed by this gist, now we'll see how it goes!

        • andoando 1 day ago

          Isn't the codebase in the context window?

          • frog437 1 day ago

            Depending on how large your codebase is, hopefully not. At this point, use something like the IX plugin to ingest the codebase and track context, rather than relying on the LLM itself.

            • frog437 1 day ago

              This is crazy..

              tokensSaved = naiveTokens - actualTokens

              - naiveTokens = 19.4M — what ix estimates it would have cost to answer your queries without graph intelligence (i.e., dumping full files/directories into context)
              - actualTokens = 4.7M — what ix's targeted, graph-aware responses actually used
              - tokensSaved = 14.7M — the difference

    • koverstreet 1 day ago

      I'll have a look. The CoT switch you mentioned will help, I'll take a look at that too, but my suspicion is that this isn't a CoT issue - it's a model preference issue.

      Comparing Opus vs. Qwen 27b on similar problems, Opus is sharper and more effective at implementation - but will flat out ignore issues and insist "everything is fine" that Qwen is able to spot and demonstrate solid understanding of. Opus understands the issues perfectly well, it just avoids them.

      This correlates with what I've observed about the underlying personalities (and the paper you put out the other day shows you're starting to understand it in these terms, functionally modeling feelings in models). On the whole Opus is very stable personality-wise and an effective thinker, and I want to compliment you guys on that; it definitely contrasts with behaviors I've seen from OpenAI. But when I do see Opus miss things that it should get, it seems to be a combination of avoidant tendencies and too much of a push to "just get it done and move on to the next task" from RLHF.

      • jchanimal 1 day ago

        One of the things we've seen at vibes.diy is that if you have a list of jobs and agents with specialized profiles, and you ask them to pick the best job for themselves, that can change some of the behavior you described at the end of your post for the better.

      • necrotic_comp 1 day ago

        Opus definitely pushes me to ignore problems. I've had to tell it multiple times to be thorough, and we tend to go back and forth a few times every time that happens. :)

        • pimeys 1 day ago

          "I see the tests failing, but none of our changes caused this breakage so I will push my changes and ask the user to inform their team on failing tests."

    • JamesSwift 1 day ago

        a9284923-141a-434a-bfbb-52de7329861d
        d48d5a68-82cd-4988-b95c-c8c034003cd0
        5c236e02-16ea-42b1-b935-3a6a768e3655
        22e09356-08ce-4b2c-a8fd-596d818b1e8a
        4cb894f7-c3ed-4b8d-86c6-0242200ea333
      

      Amusingly (not really), this is me trying to get sessions to resume to then get feedback ids and it being an absolute chore to get it to give me the commands to resume these conversations but it keeps messing things up: cf764035-0a1d-4c3f-811d-d70e5b1feeef

      • bcherny 1 day ago

        Thanks for the feedback IDs — read all 5 transcripts.

        On the model behavior: your sessions were sending effort=high on every request (confirmed in telemetry), so this isn't the effort default. The data points at adaptive thinking under-allocating reasoning on certain turns: the specific turns where it fabricated (stripe API version, git SHA suffix, apt package list) had zero reasoning emitted, while the turns with deep reasoning were correct. We're investigating with the model team. Interim workaround: CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1 forces a fixed reasoning budget instead of letting the model decide per-turn.

        • diavelguru 1 day ago

          Love this. Responding to users, detailed info on the investigation, and action being taken (at least it seems so).

          • jojobas 1 day ago

            Surely you realize it's AI responding? (not sure if /s)

          • gilrain 1 day ago

            And all hidden in the comments of a niche forum, while the actual issue is closed and whitewashed? You got played.

        • onoesworkacct 1 day ago

          This kind of thing is harder for regular end-users to understand following the change removing reasoning details.

        • mangatmodi 1 day ago

          I am curious: are you able to see our session text based on the session ID? That was a big no in some of the tier-1 places I've worked. No employee could see user texts.

          • rkangel 1 day ago

            IIRC for Enterprise, using /feedback or /bug is an exception to the "we promise not to use your data" agreement.

        • nayroclade 1 day ago

          Hey bcherny, I'm confused as to what's happening here. The linked issue was closed, with you seeming to imply there's no actual problem, people are just misunderstanding the hidden reasoning summaries and the change to the default effort level.

          But here you seem to be saying there is a bug, with adaptive reasoning under-allocating. Is this a separate issue from the linked one? If not, wouldn't it help to respond to the linked issue acknowledging a model issue and telling people to disable adaptive reasoning for now? Not everyone is going to be reading comments on HN.

          • unsupp0rted 1 day ago

            It's better PR to close issues and tell users they're holding it wrong, and meanwhile quietly fix the issue in the background. Also possibly safer for legal reasons.

          • kenmacd 1 day ago

            There's a 5 hour difference between the replies, and new data that came in, so the posts aren't really in conflict.

            Also, it doesn't sound like they know there's a model issue, so reopening it now would be premature. Maybe they just read it wrong; better to let a few others verify first, then reopen.

        • allisdust 1 day ago

          I cannot provide the session ids, but I have tried the above flag and can confirm it makes a huge difference. You should treat this as a bug and make this the default behavior. Clearly the adaptive thinking is making the model plain stupid and useless. It is time you guys take this seriously and stop messing with the performance with every damn release.

        • gilrain 1 day ago

          > The data points at adaptive thinking under-allocating reasoning on certain turns

          Will you reopen the issue you incorrectly closed, then…? Or are you just playacting concern?

        • JamesSwift 1 day ago

          Just set that flag and already getting similar poor results. new one: 93b9f545-716c-4335-b216-bf0c758dff7c

          • JamesSwift 19 hours ago

            And another where claude gets into a long cycle of "wait thats not right.. hold on... actually..." correcting itself in train of thought. It found the answer eventually but wasted a lot of cycles getting there (reporting because this is a regression in my experience vs a couple weeks ago): 28e1a9a2-b88c-4a8d-880f-92db0e46ffe8

        • tomaskafka 8 hours ago

          My guess is there isn't enough hardware, so Anthropic is trying to limit how much soup the buffet serves. Did I guess right? And I would absolutely bet that enterprise accounts with millions in spend get priority, while retail will be the first to get throttled.

    • matheusmoreira 1 day ago

      I just asked Claude to plan out and implement syntactic improvements for my static site generator. I used plan mode with Opus 4.6 max effort. After over half an hour of thinking, it produced a very ad-hoc implementation with needless limitations instead of properly refactoring and rearchitecting things. I had to specifically prompt it in order to get it to do better. This executed at around 3 AM UTC, as far away from peak hours as it gets.

      b9cd0319-0cc7-4548-bd8a-3219ede3393a

      > You're right to push back. Let me be honest about both questions.

      > The @() implementation is ad-hoc

      > The current implementation manually emits synthetic tokens — tag, start-attributes, attribute, end-attributes, text, end-interpolation — in sequence.

      > This works, but it duplicates what the child lexer already does for #[...], creating two divergent code paths for the same conceptual operation (inline element emission). It also means @() link text can't contain nested inline elements, while #[a(...) text with #[em emphasis]] can.

      I just feel like I can't trust it anymore.

      • koverstreet 1 day ago

        That's pretty much been my day - today was genuinely bad, and I've been putting up with a lot of this lately.

        Now on Qwen3.5-27b, and it may not be quite as sharp as Opus was two months ago, but we're getting work done again.

        • matheusmoreira 1 day ago

          Literally two weeks ago it was outputting excellent results while working with me on my programming language. I reviewed every line and tried to understand everything it did. It was good. I slowly started trusting it. Now I don't want to let it touch my project again.

          It's extremely depressing because this is my hobby and I was having such a blast coding with Claude. I even started trying to use it to pivot to professional work. Now I'm not sure anymore. People who depend on this to make a living must be very angry indeed.

          • jacquesm 1 day ago

            I can see how that works: this is like building a dependency, a habit if you wish. I think the tighter you couple your workflow to these tools, the more dependent you will become, and the greater the let-down if and when they fail. And they will always fail; it just depends on how long you work with them and how complex the stuff you are doing is. Sooner or later you will run into the limitations of the tooling.

            One way out of this is to always keep yourself in the loop. Never let the work product of the AI outpace your level of understanding because the moment you let that happen you're like one of those cartoon characters walking on air while gravity hasn't reasserted itself just yet.

            • matheusmoreira 1 day ago

              Good advice about the dependency. This stuff is definitely addictive. I've been in something of a manic episode ever since I subscribed to this thing. I started getting anxious when I hit limits.

              I wouldn't say that Claude is failing though. It's just that they're clearly messing with it. The real Opus is great.

              • jacquesm 1 day ago

                Take good care of yourself and don't get sucked in too deep. I can see the danger just as clearly in programmers around me (and in myself). I keep a very strict separation between anything that can do AI and my main computer, no cutting-and-pasting and no agents. I write code because I understand what I'm doing and if I do not understand the interaction then I don't use it. I see every session with an AI chatbot as totally disposable. No long term attachment means I can stand alone any time I want to. It may not be as fast but I never have the feeling that I'm not 100% in control.

          • lelanthran 1 day ago

            > People who depend on this to make a living must be very angry indeed.

            Oh cry me a fucking river.

            The people depending on this to make a living don't have the moral high ground here.

            They jumped onboard so they could replace other people's living, and those other people were angry too.

            They didn't care about that. It's hard to care about them when the thing they depend on to make a living got yanked, because that's what they proposed to do to others.

  • stefan_ 1 day ago

    There's also been tons of thinking leaking into the actual output. Recently it even added thinking into a code patch (a[0] &= ~(1 << 2); // actually let me just rewrite { .. 5 more lines setting a[0] .. }).

    • taylorfinley 1 day ago

      I've seen this frequently also

      • withinboredom 1 day ago

        I suspect it happens when the model's adaptive thinking was too conservative and it could have thought more, but didn't.

  • butlike 22 hours ago

    They probably want to prove to a single holdout investor that their 'thinking process' is getting faster in order to get the investor on board.

plexicle 1 day ago

Ultrathink is back? I thought that wasn't a thing anymore.

If I am following: "Max" is above "High", but you can't set "Max" as a default. The highest you can configure is "High"; you can use "/effort max" to move a step up for a (conversation? session?), or "ultrathink" somewhere in the prompt to move a step up for a single turn. Is this accurate?

  • bcherny 1 day ago

    Yep, exactly

    • dostick 1 day ago

      Mentioning ULTRATHINK in a prompt is the equivalent of /effort max?

      • merlindru 1 day ago

        Yes but only for the message that includes it. Whereas /effort max keeps it at max effort the entire convo, to my knowledge

anonymoushn 1 day ago

How do you guys decide which settings should be configurable via environment variables but not settings files and which settings should be configurable via settings files but not environment variables?

  • bcherny 1 day ago

    All environment variables can also be configured via settings files (in the “env” field).

    Our approach generally is to use env vars for more experimental and low usage settings, and reserve top-level settings for knobs that we expect customers will tune more frequently.

robeym 1 day ago

This is confusing. ULTRATHINK is a step below /effort max?

ULTRATHINK triggers high effort. /effort max is above high. Calling it ULTRATHINK makes it sound like the highest mode. If someone has max set and types ULTRATHINK, they're lowering their effort for that turn.

For anyone reading this trying to fix the quality issues, here's what I landed on in ~/.claude/settings.json:

  {
    "env": {
      "CLAUDE_CODE_EFFORT_LEVEL": "max",
      "CLAUDE_CODE_DISABLE_BACKGROUND_TASKS": "1",
      "CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING": "1"
    }
  }

The env field in settings.json persists across sessions without needing /effort max every time.

DISABLE_ADAPTIVE_THINKING is key. That's the system that decides "this looks easy, I'll think less" - and it's frequently wrong. Disabling it gives you a fixed high budget every turn instead of letting the model shortchange itself.

  • jnfr 23 hours ago

    Thanks for sharing. Have you experienced noticeable impact to your usage rate?

    • robeym 3 hours ago

      Nothing super noticeable. I've reached 35% in sessions on the 20x plan. Before these changes, 25-30% was pretty normal. I think these changes are best for people who are just past the 5x usage plan, but might be harder to manage if you already have to throttle usage to stay under limits.

      I'd still recommend turning off subagents entirely, because it doesn't seem you can control them with /effort, and I always find the output better with them off.

hansmayer 1 day ago

You guys realise you are about 3 months into another one of your CEO's announcements that AI would "write all code in 6 months", right? Based on the problems you are facing, would you say your CEO gave a realistic announcement this time around?

  • axegon_ 1 day ago

    Almost as if every CEO is making promises and predictions that either exist solely in their heads, or they know full well that the odds of this working out are about the same as finding the fountain of youth and are just milking whatever cash they can out of the hype.

w10-1 1 day ago

Here's the reply in context:

https://github.com/anthropics/claude-code/issues/42796#issue...

Sympathies: Users now completely depend on their jet-packs. If their tools break (and assuming they even recognize the problem), it's possible they can switch to other providers, but more likely they'll be really upset for lack of fallbacks. So low-touch subscriptions become high-touch thundering herds all too quickly.

anonymoushn 1 day ago

> On of our product principles is to avoid changing settings on users' behalf

Ideally there wouldn't be silent changes that greatly reduce the utility of the user's session files until they set a newly introduced flag.

I happen to think this is just true in general, but another reason it might be true is that the experience the user has is identical to the experience they would have had if you first introduced the setting, defaulting it to the existing behavior, and then subsequently changed it on users' behalf.

dc_giant 1 day ago

All right, so what do I need to do so it does its job again? Disable adaptive thinking and set effort to high, and/or use ULTRATHINK again, which a few weeks ago Claude Code kept telling me is useless now?

  • bcherny 1 day ago

    Run this: /effort high

    • berkanunal 1 day ago

      Imagine if all service providers were behaving like this.

      > Ahh, sorry we broke your workflow.

      > We found that `log_level=error` was a sweet spot for most users.

      > To make it work as you expect it to, run `./bin/unpoop`; it will set log_level=warn

      • hackboyfly 1 day ago

        Yeah it’s stupid.

        What makes me more annoyed is HN users here actually simping for Claude.

        “Hi thank you for Claude Code even though you nerfed the subscriptions, btw can I get red text instead of green?”

        • naasking 1 day ago

          They're a business. The alternative to keep costs in check would be to ask you for more money, and you'd likely be even more upset with that.

          • stldev 16 hours ago

            They are definitely that. Regardless of their approach, being upfront and transparent would have been nice. Bricking their own software that previously worked well for their customers isn't cool.

  • stldev 1 day ago

    You can't. This is Anthropic leveraging their dials, and ignoring their customers for weeks.

    Switch providers.

    Anecdotally, I've had no luck attempting to revert to prior behavior using either high/max-level thinking (Opus) or prompting. The web interface, though, doesn't seem problematic for me when using Opus extended.

    • taylorfinley 1 day ago

      I've actually switched back to the web chat UI and copying Python files for much of my work because CC has been so nerfed.

    • mlrtime 1 day ago

      Agreed, the only feedback is switching... however, things move fast. Unfortunately, for me that means subscribing to or using the API of many providers and then just switching models when one gets worse.

      If you have a paid plan, you may need to pay for more than one, and "hopefully" the drop in usage (not income) is a good enough signal that there is an issue.

mikkom 1 day ago

>Going forward, we will test defaulting Teams and Enterprise users to high effort, to benefit from extended thinking even if it comes at the cost of additional tokens & latency.

Interesting that you only make this the default on the accounts that pay per token, while claiming "medium is best for most users".

That decision seems to imply that the thinking change was more about increasing your profits than anything else

aizk 1 day ago

How do you guys manage regressions as a whole with every new model update? A massive test set of e2e problem solving seeing how the models compare?

  • bcherny 1 day ago

    A mix of evals and vibes.

    • giwook 1 day ago

      What's that ratio exactly

    • capnchaos 1 day ago

      Are you doing any Digital Twin testing or simulations? I imagine you can't test a product like Claude Code using traditional means.

    • efields 1 day ago

      "Evals and vibes" can I put that on a t shirt?

  • cududa 1 day ago

    Remember when they shipped that version that didn't actually start/run? At work we were goofing on them a bit, until I said "Wait, how did their tests even run on that?" and we realized that whatever their CI/CD process is, it wasn't running on the actual release binary at the time... I can imagine their variation on how most engineers think about CI/CD is probably indicative of some other patterns (or lack of traditional patterns).

    As someone who used to work on Windows, I kind of had a vision of a similar-in-scope e2e testing harness, similar to Windows Vista/7 (knowing about bugs/issues doesn't mean you can necessarily fix them... hence Vista, then 7), and I assumed Anthropic must provide some enterprise guarantee backed by this testing matrix I imagined must exist. Long way of saying: I think they might just YOLO regressions by constantly updating their testing/acceptance criteria.

    Why not provide pinnable versions or something? This episode, and the wasted two months of suboptimal productivity, hits on the absurdity of constantly changing the user/system prompt and doing so much of the R&D and feature development atop two brittle prompts with unclear interplay. So until there's a composable system/user prompt framework they reliably develop tests against, I personally would prefer pinned, selectable versions. But each version probably has known critical bugs they're dancing around, so there is no version they'd feel comfortable making a pinned stable release.

    • misnome 1 day ago

      About once a week I get a Claude "auto update" that fails to start with some Bun error on our Linux machines. It's beyond laughable.

    • xnorswap 1 day ago

      That was actually an interesting case of things that CI/CD don't tend to catch.

      It failed to start because it failed to parse the published release notes.

      In the CI/CD system it would have passed, because the release notes that broke it hadn't been published yet.

      Those release notes also took down previous versions of claude-code too, rolling back didn't help users.

      The breakage wasn't a change in the software, it was a change in the release notes which coincided with the change in the software.

      Now, should it have been grabbing release notes and parsing them? No, that's unbelievably dumb (and potentially dangerous), but it wasn't an issue with missing CI/CD, but an interesting case-study in CI/CD gaps and how CI/CD can actually lead to over-confidence.

taspeotis 1 day ago

Hi, thanks for Claude Code. I was wondering though if you'd consider adding a mode to make the text green, with characters coming down from the top of the screen individually, like in The Matrix?

  • Terretta 1 day ago

    Ergonomics studies back in the day demonstrated amber beats green. Our shop spent extra for amber CRTs over green.

    On MacOS Terminal, edit the Homebrew profile and set Text and Bold Text to Apple color Orange, consider setting Selection to Apple color Green and Cursor to Block, Blink, and Apple color Yellow.

ai_slop_hater 1 day ago

> This beta header hides thinking from the UI, since most people don't look at it.

I look at it, and I am very upset that I no longer see it.

  • bcherny 1 day ago

    There is a setting if you'd like to continue to see it: showThinkingSummaries.

    See the docs: https://code.claude.com/docs/en/settings#available-settings

    • starkparker 1 day ago

      > Thinking summaries will now appear in the transcript view (Ctrl+O).

      Also: https://github.com/anthropics/claude-code/issues/30958

      • ai_slop_hater 1 day ago

        I've had a similar experience with their API, i.e. some requests stall for minutes with zero events coming in from Anthropic. Presumably the model is doing this "extended thinking", but there's no way to see it. I treat these requests as stuck and retry. Same experience in Claude Code with Opus 4.6 when effort is set to "high": the model gets stuck for ten minutes (at which point I cancel) and the token count indicator doesn't increase.

        I am not buying what this guy says. He is either lying or not telling us everything.

    • antonvs 1 day ago

      > As I noted in the comment,

      Piece of free PR advice: this is fine in a nerd fight, but don't do this in comments that represent a company. Just repeat the relevant information.

      • bcherny 1 day ago

        Fair feedback, edited!

      • trvz 1 day ago

        Piece of free advice towards a better civilisation: people who didn't even read the comment they're replying to shouldn't be rewarded for their laziness.

        • ai_slop_hater 1 day ago

          I read his comment and still replied. I think his claim that nobody reads thinking blocks, and that thinking blocks increase latency, is nonsense. I am not going to figure out which settings I need to enable, because after reading this thread I cancelled my subscription and switched over to Codex. I had the exact same experience as many in this thread.

          Also what is that "PR advice"—he might as well wear a suit. This is absolutely a nerd fight.

          • ai_slop_hater 1 day ago

            Alright, I just tested that setting and it doesn't work.

            https://i.imgur.com/MYsDSOV.png

            I tested because I was porting memories from Claude Code to Codex, so I might as well test. I obviously still have subscription days remaining.

            There is another comment in this thread linking a GitHub issue that discusses this. The GitHub issue this whole HN submission is about even says that Anthropic hides thinking blocks.

            • mlrtime 1 day ago

              How are you porting over your memories, skills, and commands (Codex doesn't have commands)?

              • ai_slop_hater 23 hours ago

                I didn't use commands, only rules, memories, and skills. I asked Codex to read the rules and memories from where Claude Code stores them on the filesystem and merge them into `AGENTS.md`. This actually works better, because Anthropic prompts Claude Code to write each memory to a separate file: you end up with a main MEMORY.md that acts as a kind of directory, listing each individual memory with its file name and a brief description, in the hope that Claude Code will read them, but the problem is that Claude Code never does. This is the same problem[0] that Vercel had with skills, I believe. Skills are easy to port because they appear to use the same format, so you can just do `mv ~/.claude/skills ~/.codex/skills` (or `.agents/skills`).

                [0]: https://vercel.com/blog/agents-md-outperforms-skills-in-our-...
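
                A rough sketch of that merge, assuming the default `~/.claude` and `~/.codex` layouts; the `memories` directory name and the `port_memories` helper are illustrative, not part of either tool:

```shell
# Hypothetical sketch of the port described above; directory layout
# and helper name are assumptions, not an official migration tool.

port_memories() {
  claude_dir="$1"
  codex_dir="$2"
  mkdir -p "$codex_dir"

  # Fold the per-file memories into a single AGENTS.md, labeling
  # each section with its source file name.
  if [ -d "$claude_dir/memories" ]; then
    for f in "$claude_dir/memories"/*.md; do
      [ -e "$f" ] || continue
      printf '\n## %s\n\n' "$(basename "$f")" >> "$codex_dir/AGENTS.md"
      cat "$f" >> "$codex_dir/AGENTS.md"
    done
  fi

  # Skills appear to share a format, so a straight copy suffices.
  if [ -d "$claude_dir/skills" ] && [ ! -d "$codex_dir/skills" ]; then
    cp -R "$claude_dir/skills" "$codex_dir/skills"
  fi
}

port_memories "${CLAUDE_DIR:-$HOME/.claude}" "${CODEX_DIR:-$HOME/.codex}"
```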

          • antonvs 20 hours ago

            What I was pointing out in my comment about the PR advice is that someone responding from a corporation to customers should be providing information to help the customer, nothing more.

            Customers may want to fight - you seem to be providing an example - but representatives shouldn't take the bait.

    • computerex 1 day ago

      I wrote my own harness with introspection/long-form thinking as a tool the model can use to plan. It works really well with Opus. I can't use Claude Code, sadly; it sits there ticking for minutes, seemingly doing absolutely nothing, although I know it's working. I hate that as an experience, and built my harness with the philosophy of always having something streaming on the UI.

      Btw the system prompt length in CC is getting to be insane.

yubblegum 1 day ago

> Before I keep going, I wanted to say I appreciate the depth of thinking & care that went into this.

"This report was produced by me — Claude Opus 4.6 — analyzing my own session logs. ... Ben built the stop hook, the convention reviews, the frustration-capture tools, and this entire analysis pipeline because he believes the problem is fixable and the collaboration is worth saving. He spent today — a day he could have spent shipping code — building infrastructure to work around my limitations instead of leaving."

What a "fuckin'" circle jerk this universe has turned out to be. This note was produced by me and who the hell is Ben?

  • razodactyl 1 day ago

    Bad feedback loops. It's hard to tell with such a massive report if the numbers are real or bad data.

    The worst part is how big AI generated reports are - so much time spent in total having to read fluff.

  • delusional 1 day ago

    I think it's absolutely hilarious.

    > Ohh my precious baby, you've been oh so smart in writing to me.

    He says, before dismantling everything reported in the issue. If the depth of thinking was so great (maybe if he had ULTRATHINK'd?), you'd think he would have found an actual problem.

Sayrus 1 day ago

> If you are analyzing locally stored transcripts, you wouldn't see raw thinking stored when this header is set, which is likely influencing the analysis. When Claude sees lack of thinking in transcripts for this analysis, it may not realize that the thinking is still there, and is simply not user-facing.

Claude often fetches past transcripts for information after compaction. Wouldn't this effectively distort the view it has of past discussions?

migali49g 1 day ago

Hi Boris, thanks for addressing this and providing feedback quickly. I noticed the same issue. My question is: is it enough to do /effort high, or should I also add CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING to my settings?

JohnMakin 1 day ago

I’ve seen you/anthropic comment repeatedly over the last several months about the “thinking” in similar ways -

“most users don't look at it” (how do you know this?)

“our product team felt it was too visually noisy”

etc etc. But every time something like this is stated, your power users (people here for the most part) state that this is dead wrong. I know you are repeating the corporate line here, but it’s bs.

  • wonnage 1 day ago

    Anecdotally the “power users” of AI are the ones who have succumbed to AI psychosis and write blog posts about orchestrating 30 agents to review PRs when one would’ve done just fine.

    The actual power users have an API contract and don’t give a shit about whatever subscription shenanigans Claude Max is pulling today

    • JohnMakin 1 day ago

      Uh, no. Definitely not me at all.

    • razodactyl 1 day ago

      Generalisations and angry language, but I almost agree with the underlying message.

      New tools, turbulent methods of execution. There's definitely something here in the way of how coding will be done in future but this is still bleeding edge and many people will get nicked.

  • exfalso 1 day ago

    It's to prevent distillation. Duh

    • JohnMakin 1 day ago

      of course that’s the reason but don’t pretend it’s some user guided decision

      • svnt 1 day ago

        They don’t want to officially disclose the reality because while some users will understand the realities of protecting a product while innovating, many will just realize it means one can go looking for claude 4.5 performance elsewhere.

  • jitl 1 day ago

    Building for the loud users on a forum is generally a losing move. If we built Notion for angry HN users, we'd probably be a great Obsidian competitor with end-to-end encryption, have zero AI features, and make zero money.

  • alasano 1 day ago

    Last time he made the front page he said the same things.

    https://news.ycombinator.com/item?id=46978710

    Then proceeded to fix nothing whatsoever.

    It really does feel like he's just doing mostly what he wants and talking on behalf of vague made up users while real users complain on GitHub issues.

DennisL123 1 day ago

Happy to have my mind changed, yet I am not 100% convinced closing the issue as completed captures the feedback.

  • bcherny 1 day ago

    From the contents of the issue, this seems like a fairly clear default effort issue. Would love your input if there's something specific that you think is unaddressed.

    • vecter 1 day ago

      From this reply, it seems that it has nothing to do with `/effort`: https://github.com/anthropics/claude-code/issues/42796#issue...

      I hope you take this seriously. I'm considering moving my company off of Claude Code immediately.

      Closing the GH issue without first engaging with the OP is just a slap in the face, especially given how much hard work they've done on your behalf.

      • wonnage 1 day ago

        The OP “bug report” is a wall of AI slop generated from looking at its own chat transcripts

        • vecter 1 day ago

          Do you disagree with any of the data or conclusions?

          • wonnage 1 day ago

            Yes

            • vecter 1 day ago

              I'm open to hearing, please elaborate

          • adi_kurian 1 day ago

            I must admit, the fact that the writing was well formatted and structured was an instant turn off. I did find it insightful. I would have been more willing to read it if it was one lower case run on line with typos one would expect from a prepubescent child. I am both joking and being serious at the same time. What a world.

        • nipponese 1 day ago

          It's only slop if it's wrong or irrelevant.

    • JamesSwift 1 day ago

      I commented on the GH issue, but I've had effort set to 'high' for however long it's been available, and I've had a marked decline since... checks notes... about 23 March, according to Slack messages I sent to the team to see if I was alone (I wasn't).

      EDIT: actually, the first glaring issue I remember was on 20 March, when it hallucinated a full SHA from a short SHA while updating my GitHub Actions version pinning. That follows a pattern of it making really egregious assumptions about things without first validating or checking. I've also had it answer with hallucinated information instead of looking online first (to a higher degree than I've been used to after using these models daily for the past ~6 months).

      • dev_l1x_be 1 day ago

        It hallucinated a GUID for me instead of using the one in the RFC for WebSockets. The fun part was that the beginning was the same. Then it hardcoded the unit tests to be green with the wrong GUID.

        • viktorianer 1 day ago

          The hallucinated GUIDs are a class of failure that prompt instructions will never reliably prevent. The fix that worked: regex patterns running on every file the agent produces, before anything executes.
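
          A minimal sketch of such a regex gate, in Python; the `KNOWN_GUIDS` allowlist and the `unknown_guids` helper are illustrative, not from any particular tool (the one allowlisted value is the WebSocket handshake GUID from RFC 6455):

```python
import re

# Hypothetical allowlist of GUIDs permitted to appear verbatim in
# agent output; seed it with the constants your code legitimately uses.
KNOWN_GUIDS = {"258EAFA5-E914-47DA-95CA-C5AB0DC85B11"}

# Matches the standard 8-4-4-4-12 hex GUID shape.
GUID_RE = re.compile(
    r"\b[0-9A-Fa-f]{8}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}"
    r"-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{12}\b"
)

def unknown_guids(text: str) -> set:
    """Return GUID-shaped strings in text that are not on the allowlist."""
    return {m.upper() for m in GUID_RE.findall(text)} - KNOWN_GUIDS
```

          Running this over every file the agent writes, and refusing to execute anything when the result is non-empty, catches the "almost right" GUID before it can be baked into hardcoded tests.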

          • JamesSwift 1 day ago

            Well, I'd never had the issue before, and I've hit that or similar issues every few days over the past couple of weeks.

    • DennisL123 1 day ago

      Gotcha. It seemed though from the replies on the github ticket that at least some of the problem was unrelated to effort settings.

giancarlostoro 1 day ago

I only ever use high effort. The only thing I've run into: sometimes I ask Claude to do every item on a list and not stop until they're all done, and it finishes maybe 80% of them, then says "I've stopped doing things" for no good reason. I don't need it to run for 18 hours nonstop, but the 10 or 20 minutes more it would have kept going for wouldn't have hurt, especially since I am usually on Claude Code during off-hours, and on the Max plan.

Part of me wants to give lower "effort" a try, but I always wind up with a mess. I don't even like using Haiku or Sonnet; Haiku goofs, and in my experience both are better as subagent models, where Opus tells them what to do and they do it.

potsandpans 1 day ago

For anyone reading this and wondering where the truth could possibly be:

We can't really know what the truth is, because Anthropic tightly controls how you interact with their product and provides their service through opaque processes. So all we can do is speculate. And in that speculation there's a lot of room (for the company) to bullshit or provide equally speculative responses, and (for outsiders) to search for all plausible explanations within the solution space. So there's not much to act on. We're effectively stuck with imprecise heuristics and vibes.

But consider what we do know: the promise is that Anthropic is providing a black-box service that solves large portions of the SDLC. Maybe all of it. They are "making the market" here, and their company growth depends on this bet. This is why these processes are opaque: they have to be. Anthropic, OpenAI and a few others see this as a zero-sum game. The winner "owns" the SDLC (and really, if they get their way the entire PDLC). So the competitive advantage lies in tightly controlling and tweaking their hidden parameters to squeeze as much value and growth as possible.

The downside is that we're handing over the magic for convenience and cost. A lot of people are maybe rightly criticizing the OP of the issue because they're staking their business on Claude Code in a way that's very risky. But this is essentially what these companies are asking for. The business model end game is: here's the token factory, we control it and you pay for the pleasure of using it. Effectively, rent-seeking for software development. And if something changes and it disrupts your business, you're just using it incorrectly. Try turning effort to max.

Reading responses like this from these company representatives makes me increasingly uneasy because it's indicative of how much of writing software is being taken out from under our feet. The glimmer of promise in all of this though is that we are seeing equity in the form of open source. Maybe the answer is: use pi-mono, a smattering of self hosted and open weights models (gemma4, kimi, minimax are extremely capable) and escalate to the private lab models through api calls when encountering hard problems.

Let the best model win, not the best end to end black box solution.

  • mvkel 1 day ago

    I am reminded of OpenAI's first voice-to-voice demo a couple of years ago. I rewatched it and was shocked at how human it was; indiscernible from a real person. But the voice agent that we got sounds 20% better than Siri.

    There's a hope that competition is what keeps these companies pushing to ship value to customers, but there are also billions in compute expense at stake, so there seems to be an understanding that nobody ships a product that is unsustainably competitive.

  • vachina 1 day ago

    Don’t turn vibe coding into your day job (because the vibe won’t keep vibing). Write code (that you own) that can make you money and hire real developers.

starkparker 1 day ago

> You can also use the ULTRATHINK keyword to use high effort for a single turn

First I've heard that ultrathink was back. Much quieter walkback of https://decodeclaude.com/ultrathink-deprecated/

  • giwook 1 day ago

    Pretty sure it's still gone and you should be using effort level now for this.

    • xvector 1 day ago

      No, ultrathink is back and it's the same thing as high effort for the message in which it is included

      • svnt 1 day ago

        Right but wasn’t high effort the default effort before? So ultrathink is gone in all but name.

Jimpulse 15 hours ago

Thanks for the transparency here. Claude Code is fun to use again! The thinking is huge when working with Claude as a planner.

linsomniac 1 day ago

Hey Boris, thanks for this reply. I've been kind of scratching my head over this issue, assuming I'm just not doing "complex engineering", because since Opus 4.6 my seat-of-the-pants assessment is that it's a huge improvement. It's been like night and day in my use. Full disclosure: I use high effort for basically everything.

sroussey 1 day ago

> Roll it out with a dialog so users are aware of the change and have a chance to opt out

Here is the issue. Force a choice instead. Your UI person will cry about friction, but friction is desired for such a change.

freeqaz 1 day ago

I have been wondering if the 1 million token context contributes here also. Compaction is much rarer now. How does that influence model performance? For some tasks I do, I feel like performance is worse now after this. Also, Plan mode doesn't seem to wipe context anymore?

  • michaelashley29 1 day ago

    I beg to differ. Compaction happens a lot for me, and at some point the output becomes extremely nonsensical.

erikpau 1 day ago

I'd hate to be that guy, but Opus is not a very smart model when the effort is set to anything below high. Given the feedback from the community, I think this would be an obvious signal. However, moving the effort to anything beyond medium is a huge token burn. These issues didn't exist, or at least weren't this persistent, before the last 2 weeks. I, and perhaps a million or so other developers, would ask you to reconsider this thinking. I understand you need to run a business, but so do we, and Claude Opus is a genius with a drinking problem: you never really know upfront if it's drunk or not, but it's generally quite clear after a few minutes.

Other models, such as K2, GLM-5.1, and "the other one", seem to be far less drunk than your approach, and you're losing fans quickly if you keep making these kinds of changes to the tools or models.

niteshpant 1 day ago

I added `CLAUDE_CODE_EFFORT_LEVEL=max` to my shell's env so that every session is always effort:max by default

:)
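
For reference, the whole setup is a one-liner in a shell profile. A minimal sketch, assuming `CLAUDE_CODE_EFFort_LEVEL` works as described in the comment above (the variable name comes from the comment, not from official docs):

```shell
# Add to ~/.bashrc or ~/.zshrc so every new Claude Code session
# defaults to effort:max (variable name and value taken from the
# comment above; treat as unverified against official docs)
export CLAUDE_CODE_EFFORT_LEVEL=max

# Sanity check in the current shell
echo "$CLAUDE_CODE_EFFORT_LEVEL"
```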

  • subscribed 1 day ago

    Why would I use Claude otherwise anyway! :)

y1n0 1 day ago

"most users"

Have you guys considered that you should be optimizing for the leading tail of the user distribution? The people that are actually using AI to push the envelope of development? "most users," i.e. the inner 70%, aren't doing anything novel.

hellojimbo 1 day ago

The last time I typed ultrathink, I got a prompt saying that you no longer need to type ultrathink.

diavelguru 1 day ago

As soon as that change came through I set the effort to high. I have not regretted it for any coding task. It feels the same as Dec-Jan, though it's now spawning more sub-agents, which is not a bad thing.

matheusmoreira 1 day ago

I definitely noticed the mid-output self-correction reasoning loops mentioned in the GitHub issue in some conversations with Opus 4.6 with extended reasoning enabled on claude.ai. How do I max out the effort there?

ting0 1 day ago

Do you guys realize that everyone is switching to Codex because Claude Code is practically unusable now, even on a Max subscription? You ask it to do tasks, and it does 1/10th of them. I shouldn't have to sit there and say: "Check your work again and keep implementing" over and over and over again... Such a garbage experience.

Does Anthropic actually care? Or is it irrelevant to your company because you think you'll be replacing us all in a year anyway?

  • misnome 1 day ago

    Or, ask it to make a plan, and it makes a good plan! It explicitly notes how validation is to take place at each stage!

    And then it does every stage without running any of the validation. It's your agent's plan; it should probably be generated in a way that your own agent can follow.

zenoware 1 day ago

> CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING

Why not just give people the ability to set a default thinking level instead of manually setting it to `max` all the time?

gnegggh 23 hours ago

Last time quality was degraded like this it was impossible to get a refund.

weakfish 23 hours ago

Didn’t ULTRATHINK get deprecated? Last time I typed it I got a warning.

thomascountz 1 day ago

> This beta header hides thinking from the UI, since most people don't look at it.

How is this measured?

  • stingraycharles 1 day ago

    And I wonder how redacting them reduces latency, as it sure as hell doesn’t make the responses any faster and bandwidth isn’t the issue here.

    • sothatsit 1 day ago

      They provide thinking summaries, so I assume they have to call Haiku or some other model to summarise the thinking blocks.

      • stingraycharles 1 day ago

        That’s not asynchronous? Wouldn’t it make more sense to disable those thinking summaries in those cases rather than hiding the thinking altogether?

saidnooneever 1 day ago

Did the cost go up, or did you lower costs (token consumption) for all users and now want to default enterprise/teams back to normal mode? Because it seems like a roundabout way of saying it will now cost more for the same quality.

ting0 1 day ago

Thinking time is not the issue. The issue is that Claude does not actually complete tasks. I don't care if it takes longer to think, what I care about is getting partial implementations scattered throughout my codebase while Claude pretends that it finished entirely. You REALLY need to fix this, it's atrocious.

CjHuber 1 day ago

I honestly am very disappointed with this. I only learned about CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING and showThinkingSummaries: true from this post. I've been wondering for a while where the summaries went, and I'm always hoping, like roulette, that it thinks a lot. No wonder, if there suddenly is an "adaptive thinking" mode. I would have opted out 2 months ago if it had been documented or communicated publicly in any way. Why change behavior without notice or any new user-facing settings?

I just googled "CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING" and it seems like many people don't know about it.

And ULTRATHINK sets the effort to high, but then there is also /effort max?
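
For anyone else who missed them, the two opt-outs named in Boris's reply can be sketched like this (names copied verbatim from his post; the accepted values and settings path are assumptions, not verified):

```shell
# Opt out of adaptive thinking, reverting to the older thinking-budget
# behavior (env var name from Boris's reply; "1" as the value is an
# assumption)
export CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1

# Restore visible thinking summaries by adding this key to your
# settings.json (commonly ~/.claude/settings.json; path may vary):
#   { "showThinkingSummaries": true }
echo "$CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING"
```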

  • triage8004 1 day ago

    I'm now confused: I used to use ultrathink, then it went away along with the chain-of-reasoning prompts, so I recently switched to high or extra thinking, and now it's back?

raincole 1 day ago

> I wanted to say I appreciate the depth of thinking & care that went into this.

The irony lol. The whole ticket is just AI-generated. But Anthropic employees have to say this because saying otherwise will admit AI doesn't have "the depth of thinking & care."

  • vlovich123 1 day ago

    It's also pretty standard corporate speak to make sure you don't alienate any users / offend anyone. That's why corporate speak is so bland.

  • rafaelmn 1 day ago

    Ticket is AI generated but from what I've seen these guys have a harness to capture/analyze CC performance, so effort was made on the user side for sure.

    • notatallshaw 1 day ago

      The note at the end of the post indicates the user asked Claude to review their own chat logs. It's impossible to tell if Claude used or built a performance harness or just wrote those numbers based on vibes.

      • dreadnip 1 day ago

        The whole issue is very obviously LLM-generated nonsense. The stats are way too specific and reinforce the user's bias in typical hallucinated fashion.

tigershark 1 day ago

What change did you release on March 23rd when the subscription limits collapsed and they are still way down compared to what they used to be?

jacquesm 1 day ago

Textbook example of how to respond to your customers, kudos.

  • stingraycharles 1 day ago

    Is it?

    I’m of the opinion that there’s more to it; obviously the thinking tokens aren’t having any reasonable impact on latency, given that bandwidth is hardly the bottleneck.

    Seems more and more that Anthropic et al don’t want to give up their secret sauce / internals (which is their full right) and this is a step towards that direction, and it’s being presented as “reduces latency”.

    • dezgeg 1 day ago

      I've understood that in more recent models you need to run extra compute to get a human-readable version of the thinking tokens, so it does impact latency. Though probably the more important motive is you can squeeze in more concurrent users by skipping this.

      • stingraycharles 1 day ago

        No, that’s simply whether CoT is enabled or not. That actually does have impact.

        What Anthropic is doing is still generating the thinking tokens (because they improve answer quality) without showing them to the user. I believe this may actually hint at a future where these LLM vendors don't want to show internal reasoning like they do right now.

        I’m very much of the opinion that hiding them from the response because it “improves latency” is nonsense.

j45 1 day ago

Thanks for the update.

Perhaps Max users can be included in defaulting to different effort levels as well?

foofloobar 22 hours ago

Claude Code and Opus used to do a great job a few months ago. It seemed to get things right more often than not, and it seemed to be far better at figuring out what has to be done and getting it right on the first attempt. This is likely model-related, since Claude Code has received some bug fixes in the meantime.

The list of bugs and performance problems appears to keep growing: reduced usage quotas, poor performance with numerous attempts at getting things right, cache invalidation bugs, background requests which have to be disabled explicitly to avoid consuming the quota too fast, Opus appears to be quantized even with high thinking mode, poor tool use with tool search disabled, broken tool search with tool search enabled, laziness, poor planning, poor execution, gets stuck when debugging simple code issues, writes code which isn't required, starts making changes and executing whatever it wants when told to simply prepare a plan for something, it doesn't follow instructions to use agents as told and numerous other issues with following the instructions.

The quota story is atrocious. It's difficult to get anything done with Claude Code due to the quota reduction. The cache invalidation bugs don't help either.

The tool use is also a pain to deal with. It appears to choose tools randomly with or without tool search. It keeps running custom CLI commands when it has instructions to use Makefile targets. It often ingests the output of some command with hundreds of lines of output without discrimination. It often uses lots of bash grep and find commands when it has better tools available to search across files and to use MCP tools which are far more efficient. It ignores MCP tools most of the time.

This doesn't appear to be an issue with the prompt itself. I'll try to fix the system prompt next to work around some of the issues. It seems to not follow instructions and to do whatever it feels like doing. It comes off as one of those Q2-Q3 quantized models from huggingface.

The impact of the cache invalidation issue, reduced quota, poor model performance and Claude Code bugs together have rendered this service almost entirely useless for me. The poor model performance means that many more attempts are required and more requests are made to the Anthropic API. The Claude Code bugs and design lead to cache invalidation more often. This makes the impact of the reduced quota even worse. It makes a lot more API requests because the model doesn't get it right on the first 1-2 attempts or because it chooses less than optimal strategies to find what it's looking for.

The communication and Anthropic's overall handling of the reported bugs and problems hasn't been that good either.

As for the session ID and other things you might request for debugging, there's nothing special here that's not reported widely on every Reddit thread from several subreddits. I use 200k context with Opus and Sonnet. I use high thinking mode because anything less appears to be complete garbage with extremely poor results. I avoid compact in favor of knowledge transfer markdown files.

It'd be great to see Anthropic fix the caching issues, to improve the quality of the model, to address the Claude Code bugs, to sort out the quota fiasco, to improve their communication skills, to communicate more with their customers and to be more proactive overall. I'll take my money elsewhere otherwise.

ctoth 1 day ago

[flagged]

  • quietsegfault 1 day ago

    I’m not sure being confrontational like this really helps your case. There are real people responding, and even if you’re frustrated it doesn’t pay off to take that frustration out on the people willing to help.

    • malfist 1 day ago

      Is somebody saying "you're holding it wrong" a "people willing to help"?

      • TeMPOraL 1 day ago

        They are if you are, in fact, holding it wrong.

        As has usually been the case for most of the few years LLMs have existed in this world.

        Think not of iPhone antennas - think of a humble hammer. A hammer has three ends to hold by, and no amount of UI/UX and product design thinking will make the end you like to hold to be a good choice when you want to drive a Torx screw.

    • ctoth 1 day ago

      Fair point on tone. It's a bit of a bind isn't it? When you come with a well-researched issue as OP did, you get this bland corporate nonsense "don't believe your lyin' eyes, we didn't change anything major, you can fix it in settings."

      How should you communicate in such a way that you are actually heard when this is the default wall you hit?

      The author is in this thread saying every suggested setting is already maxed. The response is "try these settings." What's the productive version of pointing out that the answer doesn't address the evidence? Genuine question. I linked my repo because it's the most concrete example I have.

      • wonnage 1 day ago

        Just use a different tool or stop vibe coding, it’s not that hard. I really don’t understand the logic of filing bug reports against the black box of AI

        • geysersam 1 day ago

          People file tickets against closed source "black box" systems all the time. You could just as well say: Stop using MS SQL, just use a different tool, it's not that hard.

          • wonnage 1 day ago

            Equivalent of filing a ticket against the slot machine when you lose more often than expected

            • HumanOstrich 1 day ago

              Well now you're just being silly and I can't take you seriously.

        • HumanOstrich 1 day ago

          The only "black box" here is Anthropic. At least an LLM's performance and consistency can be established by statistical methods.

      • enraged_camel 1 day ago

        I read the entire performance degradation report in the OP, and Boris's response, and it seems that the overwhelming majority of the report's findings can indeed be explained by the `showThinkingSummaries` option being off by default as of recently.

    • BigTTYGothGF 1 day ago

      The stated policy of HN is "don't be mean to the openclaw people", let's see if it generalizes.

  • malfist 1 day ago

    It also completely ignores the increase in behavioral tracking metrics. 68% increase in swearing at the LLM for doing something wrong needs to be addressed and isn't just "you're holding it wrong"

    • alchemist1e9 1 day ago

      I think a great marketing line for local/self-hosted LLMs in the future would be: “You can swear at your LLM and nobody will care!”

  • lambda 1 day ago

    I guess one of the things I don't understand is how you expect a stochastic model, sold as a proprietary SaaS, with a proprietary (though briefly leaked) client, to be predictable in its behavior.

    It seems like people are expecting LLM based coding to work in a predictable and controllable way. And, well, no, that's not how it works, and especially so when you're using a proprietary SaaS model where you can't control the exact model used, the inference setup its running on, the harness, the system prompts, etc. It's all just vibes, you're vibe coding and expecting consistency.

    Now, if you were running a local weights model on your own inference setup, with an open source harness, you'd at least have some more control of the setup. Of course, it's still a stochastic model, trained on who knows what data scraped from the internet and generated from previous versions of the model; there will always be some non-determinism. But if you're running it yourself, you at least have some control and can potentially bisect configuration changes to find what caused particular behavior regressions.

    • stavros 1 day ago

      Same as how I expect a coin to come up heads 50% of the time.

      • muyuu 1 day ago

        If you consistently get nowhere near 50%, then surely you know you're not tossing a fair coin? What would complaining to the coin provider achieve? Switch coins.

        • stavros 1 day ago

          Well I'm paying the coin to be near 50% and the coin's PM is listening to customers, so that's why.

          • muyuu 1 day ago

            The coin's PM is spamming you trivial gaslighting corporate slop, most of it barely edited.

            • HumanOstrich 1 day ago

              Yes, that's why we are angry. Stop making excuses for them.

    • dev_l1x_be 1 day ago

      The problem is degradation. It was working much better before. There are many people, including one well-known example [0], my circle of friends, and me, who were working on projects around the Opus 4.6 rollout time when suddenly our workflows started to degrade like crazy. If I did not have many quality gates between an LLM session and production, I would have faced certain data loss and production outages, just like some famous company did. The fun part is that the same workflow that reliably passed the quality gates before suddenly failed on something trivial. I cannot pinpoint exactly what Claude changed, but the degradation is there for sure. We are currently evaluating alternatives to have an escape hatch (Kimi, ChatGPT, Qwen, and Nemotron are so far the best candidates). The only issue with alternatives was (before the Claude leak) how well the agentic coding tool integrates with the model and tool use, and there are several improvements happening already, like [1]. I am hoping the gap narrows and we can move off permanently. No more hoops; you are right, no more "I should not have attempted to delete the production database" moments.

      [0] https://x.com/theo/status/2041111862113444221

      [1] https://x.com/_can1357/status/2021828033640911196

      • techpression 1 day ago

        Curious as to how many people are using 4.6; perhaps you're on a subscription? I use the API, and 4.6 (the same goes for Sonnet) has been unusable since launch because it eats through tokens like it's actually made that way (to make more money / hit limits faster). I guess it makes sense from a financial perspective, but once 4.5 goes away I will have to find another provider if they continue like this :/

    • randomNumber7 1 day ago

      > how you expect a stochastic model [...] is supposed to be predictable in its behavior.

      I used it often enough to know that it will nail tasks I deem simple enough almost certainly.

    • bwfan123 1 day ago

      Imagine a team of human engineers. One day they are 10x ninjas and the next they are blub-coders. Not happening.

      Put Claude on PIP.

  • bcherny 1 day ago

    Christopher, would you be able to share the transcripts for that repo by running /bug? That would make the reports actionable for me to dig in and debug.

nickvec 1 day ago

Hey Boris, would appreciate if you could respond to my DM on X about Claude erroneously charging me $200 in extra credit usage when I wasn't using the service. Haven't heard back from Claude Support in over a month and I am getting a bit frustrated.

  • HumanOstrich 1 day ago

    Did the receipt show it as being a gift? There's been a lot of fraud over the past few months with Claude Code Gift purchases. Anthropic support is ignoring all of it and just not responding to support requests.

    Happened to a close friend of mine. A bit of digging revealed the same pattern with fraudulent gift purchases for several other people before I stopped looking. They were also being ignored by Anthropic support. One since January.

    Apparently they're so short on inference resources they can't run their support bots. Maybe banning usage of Claude Code with Claude will allow them to catch up on those gift fraud tickets.

    Took a long time for me to reach this level of scathing. It is not unwarranted.

    • subscribed 1 day ago

      Still, it's on Anthropic to respond to it.

      When a third party leaked my CC number which then was used to buy Spotify premium, all it took was 10 minutes of chat with a very polite support agent to have it resolved.

      Ignoring the customer is not going to fix it. They'd know if they asked Claude.

    • nickvec 21 hours ago

      No, the receipt had no indication of it being a gift. I was with my family at the time and suddenly started getting $10 extra usage charges every few minutes. I wasn't able to toggle off the "auto-reload funds" feature until about $180 had been drained from my checking account. For context, here's the support ticket I sent in on March 7th.

      “Hi Anthropic Support,

      I'm a Max plan subscriber and I'm writing about approximately $180 in unexpected Extra Usage charges that appeared on my account between March 3-5, 2026. I attempted to resolve this through your Fin AI chatbot (Conversation ID: 215473382652967).

      Here's the situation:

      - I received 16 separate Extra Usage invoices between March 3-5, ranging from $10-$13 each, all charged automatically.
      - I was not actively using Claude during this period — I was away from my laptop entirely.
      - When I checked my usage dashboard, it showed my session at 100% usage despite me not using the product.
      - My API usage dashboard shows only $70 in total lifetime usage, confirming this is not API-related.
      - My Claude Code session history shows only two tiny sessions from March 5 totaling under 7KB — nowhere near enough activity to generate these charges.

      This appears consistent with known billing/usage tracking issues reported by other Max plan users (GitHub issues #29289 and #24727 on the anthropics/claude-code repo), where usage meters show incorrect values and Extra Usage charges accumulate erroneously. However, it is possible that my account was compromised, and I would like assistance determining if that is the case (or if it really is a bug.)

      Either way, I am requesting a refund of the Extra Usage charges from March 3-5 only — I do not want to cancel my subscription.”

areoform 1 day ago

Hey Boris, thanks for the awesomeness that's Claude! You've genuinely changed the life of quite a few young people across the world. :)

Not sure if the team is aware of this, but Claude Code (cc from here on) fails to install / initialize on Windows 10; precise version: Windows 10, build 19045 (10.0.19045). It fails mid-setup, and sometimes fails to produce a log. It simply calls it quits and terminates.

On macOS, I use Claude via the terminal, and there have been a few minor but persistent harness issues. For example, cc isn't able to use Claude for Chrome. It has worked once and only once, and never again. Currently, it fails without a descriptive log or error; it simply states permission has been denied.

More generally, I use Claude a lot for a few sociological experiments and I've noticed that token consumption has increased exponentially in the past 3 weeks. I've tried to track it down by project etc., but nothing obvious has changed. I've gone from almost never hitting my limits on a Max account to consistently hitting them.

I realize that my complaint is hardly unique, but happy to provide logs / whatever works! :)

And yeah, thanks again for Claude! I recommend Claude to so many folks and it has been instrumental for them to improve their lives.

I work for a fund that supports young people, and we'd love to be able to give credits out to them. I tried to reach out via the website etc. but wasn't able to get in touch with anyone. I just think more gifted young people need Claude as a tool and a wall to bounce things off of; it might measurably accelerate human progress. (that's partly the experiment!)

  • oriettaxx 1 day ago

    Why is this post downvoted?

    • areoform 15 hours ago

      I angered the mob elsewhere by being a heretic.