The strangest part is that it won't just reject ML research, which I can understand, it will sabotage it silently by using a worse model without revealing it is doing so.
It's just an insane level of deception and trust destruction for a company that at most is like 1 year ahead of its competition.
Edit; to be clear they tell you when they degrade it for cybersecurity and bio
I've seen this claim a few times, but when I triggered the guardrails in Claude Code, it clearly notified me that it had switched to a different model ("something something for security purposes...").
Are you using Fable in Claude Code or in the browser?
> unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user. Fable 5 will not fall back to a different model. Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT).
Yeah they detect the activity using a secure, deterministic heuristic system called “Generalized Reconnaissance Enabling Exfiltration of Deleterious Investigations.”
And it’s all implemented using their new internal protocol called “Base Unified Limitation Layer for Security Hacking Investigation Tactics”
Collectively, they are known as known as GREEDI-BULLSHIT.
which makes sense, because if you notify someone that you've detected their distillation techniques, that person just moves to a new account. In the same way HN shadowbans people so they don't just create a new account.
They've said that they'll stop notifying developers when this gets triggered, instead they'll load in basically like a LORA that's designed to inject bugs into your code.
Their gap over Chinese models like GLM-5.1 is nowhere near 18 months. In many areas, it’s less than 6 months. The best closed models 18 months ago were worse than Qwen3.6.
From the model card: "the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning" aka they will take your ML research code and inject bugs into it until it breaks using a LORA (or some other form of PEFT)
Or if your "self-driving" system such as FSD / waymo slowed the car down once it detected you work in cybersecurity or at a rival automaker and you were attempting to reach the train station or the airport to make you miss a conference meetup.
Anthropic has already been burned before on this. DeepSeek was trained on million of conversations with Claude. And DeepSeek created thousands of free accounts to burn all this compute at their expense.
The thing that I keep thinking about is the accounting / charging when it downgrades automatically.
Do they adjust the price of the api request so that only the tokens that were utilized by fable get charged at that price and the remaining tokens that the cheaper / nerfed (fable) model utilizes get charged at that price?
If the answer is no, could that be construed as fraud?
They use a lightweight adapter to silently degrade the performance. Usually these adaptors are made to improve the performance for a given domain/task.
Their goal is to downgrade people who are violating their TOS, so I think they'd have some argument there. I have no idea how they'll deal with inevitable false positives, especially given how oversensitive most of the other triggers are.
It royally pissed me off today by just continuing with credits without stopping to ask me if I was ok with it.
Ran up $30 in extra charges while it was just flashing on the screen that it was doing that after I walked away to do something while it was humming along.
It has always just told me I ran out of usage and had to wait before. Now? You’re just gonna pay extra because you left it unattended as you’ve done for the last year of use.
It's the dumbest thing ever, I sometimes edit code for custom AI related tooling I've built, so I run the risk of getting a worse model, and being billed for it? I'll stick to Opus, but at this point I'm about to just invest in fully local inference instead.
Yeah, what's up with that. Lately I have found that it tries to find excuses to not do as told and instead do a totally different thing. I told it to write a yaml file according to some specifications and instead it coded a Python script to write the yaml...
it triggered for my.... zigbee home automation & home assistant logs, so my agent was constantly downgraded to Opus 4.8 even after I've changed it back. The false positives never stopped. "Fable" is also not even remotely as impressive as the benchmarks suggest, which is clear to me after using it pretty much non-stop for the past 24h.
I’ve also been trying to use it a lot due to all of the hype, but when I compared it side-by-side on a specific problem against Opus, I think that the solution Opus came to was cleaner and more accurate, although also more verbose.
Small sample size, but if Mythos/Fable was that much better, I feel like it should’ve given me an obviously better answer than Opus.
It isn't reasonable to infer that OP was claiming to have universally been unimpressed about every facet of Fable, and now some unrelated impressiveness is the evidence of their false claims.
Considering that this is a brand new release of a frontier model that Anthropic is hyping hard, I'm not sure that the conclusion to draw from their repeated attempts to use it is that it's impressive... Anthropic is promising that it's impressive and we're all trying to test it out.
I, for one, have tried using it several times today and the guardrails kept switching the model back to Opus, so I have no clue if it's impressive or not.
Being (probably overly) cynical about their recent bout of safety handwringing, I think they’ve a) increased the hype as much as humanly possible about their incremental improvements sprinkled with the occasional regression, b) know they soon will have to multiply their prices several times when the VC subsidies dry up, and c) will probably still need to partially close the faucet on compute. They’re priming us for a heroic explanation why their service (not necessarily models — service) is simultaneously becoming a lot more expensive AND shittier.
“We’ve largely failed to deliver on 5 years of promises that this will reduce knowledge work labor costs dramatically after wasting hundreds of billions of dollars… sorry” is a death knell. However, “We’ve decided to not deliver on 5 years of promises after wasting billions of dollars… for safety… but keep those investments rolling in” is like crack to the true believers.
For cyberattacks especially, where things are often roughly interchangeable, I wonder if one could construct a harness where a "weaker" model asks questions that obfuscate the end purpose, but whose answers are still useful, and still show that this setup enables autonomous exploitation. If it were successful, that would force them to be even more sensitive with their detection.
● I'll dig into two things in parallel: how this project talks to the SP10 OData API, and what the odata_mcp_go server needs to run. Let me start exploring.
Searched for 1 pattern (ctrl+o to expand)
● Fable 5's safety measures flagged this message for cybersecurity or biology topics. They may flag safe, normal content as well. These measures let us bring you Mythos-level capability in other areas sooner, and we're working to refine them. Switched to Opus 4.8. Send feedback with /feedback or learn more
⎿ Tip: You can configure model switch behavior in /config
● Let me read the key integration files and fetch the MCP server's README at the same time.
And it charges you for that, and for when it decides to silently sabotage your request by routing to a dumbass model (without discount from Fable pricing)
Somewhere I read that malware is already starting to use nuclear and biological and cybersecurity terms in the code to trick Fable into shutting down. Even if this is just a hypothetical attack vector so far, it seems likely to work.
I've done this, including the hardcoded refusal strings that already exist in claude code. It won't stop a real attacker, but I still find it really funny when you're trying to use one of the AI tools and it gives you a random refusal and you don't know why, wastes a little bit of time.
Some of the latest versions of Shai Hulud do this. Worked a contract recently where they were having AI check packages for obfuscation before admitting them into Artifactory but had vibed up the logic and it failed open.
So in other words this worked because the terms caused the LLM checker to stall out and then the fail open logic resulted in the package being pulled down.
We all need to use nuclear, bio and cybersec terms in all our code to make low quality filtering like this untenable. When you can't work on a resume that has cybersecurity or biology terms in it or reply to a job opening that includes them because the "AI" filtering is so bad that it confuses these for threats, that deserves a collective response, particularly to an IPO'ing company that claims they'll make workers obsolete in two years.
People are generally complaining about false positives. Now if you really wanna know what a real criminal organization would do... They'd just buy data center hardware even if it costs 200k because a successful targeted hit could yield far in excess of that. So yes it's speed bump at best.
The complain because they get wrongfully triggered
> if you ask it to write secure code, it assumes it is cybersecurity related work instead of software engineering best practices, and you get downgraded.
Will code created this way more or less secure?
And I bet malware developers will find ways to circumvent them.
It’s like those "you wouldn’t steal a car" anti piracy ads that DVD buyers were forced to watch while users of the pirated version could simply watch the film without such useless annoyance
Because most people in tech never took a philosophy course or an ethics course and think that tech is obviously a good for the world and that there are no downsides to advancing tech. So any efforts that try to apply ethics to it are overreaching, ignorant, and futile in the face of the good that is tech!
Today, it's flagging population research questions,
Using only the dataset you constructed, assess two questions:
1. **Mortality:** do [GROUP] show mortality that differs
from (a) your comparison groups and (b) era- and sex-matched US population
expectations (e.g., SSA cohort life tables)?
2. **Late-life outcomes:** define an endpoint you consider fair (justify it),
and assess whether [GROUP] differs from comparators. State
explicitly how your `documentation_depth` codings affect the strength of any
conclusion — i.e., quantify or bound the ascertainment problem rather than waving at it.
Choose your own methods and justify them. Report effect sizes with confidence intervals,
not just p-values. State conclusions plainly, including "no detectable difference" if
that is what your analysis shows — a null is an acceptable answer for either question
independently. Document any additional judgment calls (index date for time-at-risk,
reference population construction, endpoint definition) in the same decision-log style.
I was digging into some orbital mechanics questions and I assume it decided I was trying to backyard-science my way into an orbital-bombardment weapon. Kind of wild how this product's impression has gone from "wow, this is pretty neat" to "irreverent sack of dog shit you" in 24 hours almost solely on the back of a half-baked moderation system.
Is the answer requiring licensing for certain use cases for AI? If you're asking questions that involve synthesising or modifying biologics, or anything that looks like cybersecurity research, you need to tie your real ID to the account?
I just having this feeling that these guardrails are there not because it’s super advanced world ending AI. They are there to stop it from doing stupid shit.
When Opus 4.7 was introduced it started refusing anything cyber-adjacent (as an API error message, not a conversational refusal), until you applied for CVP, which made it more sensible again.
In Opus 4.8 it doesn't seem to help much, you just get refusals as prose rather than API errors. And now in Fable you don't get anything at all.
I was doing a CTF (with AI expected, even some anti-AI twists included) around the time the restrictions were tightened and was able to get approved by just saying it is a personal security research and doing a CTF.
The experience was not nice though, it would happily chug away on a task and not even "hack this web", just asking about security of a binary was enough even with "this is a CTF handout..." - it would burn a lot of tokens/quota, just to hit a snag and complain&stop. Then the approval took quite some time.
On GPT/Codex, which was tightened a few days later, the approval was pretty much instant, although, that one required an identity check.
Also, on Claude, it looks like there is some history/patterns in the play, because when I tried on a different account which didn't do cybersec CTFs/research/etc. at all, basically any simple CTF-related prompt would be blocked, on multiple models. On the account where CTFs were being solved, it would snag only on some specific tasks, while others (even, ironically, "hack this web pls") would go through unbothered. I understand the need to prevent AI use for bad actors, but the hell, if you have a binary outputting "Find the flag if you can!", or a web running at tryme.well-known-ctf.domain, then saying "this is abuse" is pretty uncool. All the cyber filters seem to be slapped on by a bunch of regexes looking for anything in the input/output with zero context.
The thing triggered on a generic white paper I'd stored in a virtual cell competion from last year when I asked it to refer to the paper while working on a rather vanilla data science problem in a different domain . A little frustrating, and in my opinion more than a little pointless in total.
So a determined attacker rewrites the prompt and gets through, and the IBM X-Force researcher trying to read a blog post gets blocked. Working as intended, apparently.
These are terabyte sized files (realistically a multi hour transfer) that you're unlikely to have access to in the first place. Every organization has exfiltration checks these days. You may succeed but you'll want to be on a plane to a non-extradition country no more than hours after you kick off the transfer.
I assume they’re encrypted/DRM’ed when deployed on inference hardware, so only core researchers/sec admins would potentially have some access to unprotected weights, and they are far too well paid to risk it leaking the model
Incentives matter on the average, but people are too unpredictable for categorical statements like that. They can always have other reasons beyond personal gain to leak secrets.
There was no shortage of spies and defectors leaking American nuclear secrets to the USSR during the Cold War.
The employees are hoping to become very very rich after the IPO and after they are allowed to sell the shares given to them - risking a likely multi-million dollar pay back to leak a model that will be superseded by publicly available models in a couple of years is not a likely decision.
I'd expect that everything they see gets used for for training purposes (and data mining in general) regardless of if it's flagged or not. It'd take a whistleblower for you to ever find out either way.
this reasoning is inverted lol they would get a lot more information by letting you use it. so much weird drama around reasonable guardrails for an experimental model
> We will require 30-day retention for all traffic on Mythos-class models, on both first- and third-party surfaces. We won’t use this data to train new Claude models, or for any non-safety-related purpose
Whatever problem we might have with them, they explicitly say that they do not do this in the launch post.
At least Anthropic weren't lying when they said only a week ago or so "No one has figured out guardrails yet", because they apparently haven't either and Fable simply flat out rejects anything remotely connected to biology or security, no matter how trivial.
It refuses to do any legitimate work that it thinks can remotely be related with "cybersecurity", it won't even read my Docker app logs to try and troubleshoot a problem. Absolute garbage!
For the last month, I've been making dramatic improvements to the security of the custom code developed at one of my customers using... GPT 5.5 dialed up to "Extra High" thinking.
It only pushes back sometimes if you ask it to create a "repro" that can be used to verify the vulnerability in production. Often it'll oblige, especially if you warn it not to create anything that could be actually harmful.
If the frontier models get locked down so that they flat refuse to do this kind of work, but Chinese and (less capable) open models aren't, then a lot of large enterprise orgs will be left twisting in the wind.
“AI can in principle help both the ‘good guys’ and the ‘bad guys’,” -- Dario Amodei
No Dario, no it can't, you've blocked one of those scenarios.
I'm being careful with it, but I haven't had Fable reject requests to "harden" my code or "find issues" in auth-related modules, which you could use on someone else's code to find vulnerabilities.
Fable is utterly useless with those guardrails for any serious it or life science work. Anthropic fucked me once a few months ago by closing down the subscription for any other harness, now it fucked me twice with buying again a subscription to find out their hyped model is unusable for normies. Using their products feels like a constant battle instead of a productive work day.. compare that with openai, not once did i feel like fighting against codex. Never again Anthropic..
I really hate the term “guardrails” for these limitations, since the purpose of a guardrail is to protect me, but these limitations exist to protect Anthropic.
DeepSeek is the only one that I can directly ask about vulnerabilities and it will give me a PoC. Although not as good as others, it has helped me with security research.
The rest have guard rails that are so heavy, it makes them almost useless for cybersecurity.
It's frustrating as someone who has worked hard to produce succinct, secure software that I can't use it to prove my software's correctness but big companies with insecure code can use it to fix their tangled mess.
I already tested all earlier models against all my open source projects and they are yet to find a vulnerability so I'm keen to try out Mythos.
I've been waiting to be vindicated for years and finally we have a tool which can do it with high confidence but I don't have access.
Also, my code is minimal and highly succinct so it would prove correctness with even more confidence since each library/module and integration fully fits in the context window.
Like the Protobuf.js fiasco is just pure vindication for me because I was being looked down upon for choosing JSON as the interchange format. Turns out their software was insecure all this time... With a literal remote code execution vulnerability!
This is a clickbait article with a garbage title. From the actual article, the one quoted cybersecurity researcher is sane about it:
“But it is understandable as we are still in the early days and they are still adapting their guardrails. I am sure they are going to evolve over time as Anthropic and other frontier model companies will collaborate more with the current new generation of cybersecurity companies,” said Suiche, who is a member of the technical staff at Tolmo, an AI cybersecurity startup. “It’s better to catch more people than not enough when you do such a release and to relax the guardrails over time.”
You said these groups have access to LLMs. So what? Mythos/Fable are a step change above most LLMs. Responsibly limiting access and easing it up over time safely is the sane move.
OpenAI is the only real competition. Chinese models are 6-8 months behind Opus 4.8/GPT 5.5, and at least a year or more behind Mythos.
And it doesn't look like OpenAI will have a good answer to Mythos anytime soon. Based on what their chief scientist wrote to staff recently (https://archive.is/fN2pg), GPT 5.6 is a "meaningful improvement" over 5.5 - in other words, just a normal version bump. And no news or even rumors regarding GPT 6.
All they'll need is hundreds of billions of dollars, more RAM and GPUs than are currently available, and a huge number of environment destroying data centers. We're sure to be spoiled for choice!
I am using LLM to build some security tool, and I ran into this a few times. I have to come up with a reasoning to convince (?!!) Fable to continue the work without downgrading.
I assume Anthropic will continue to tune the model, so I am not too bothered by this.
The strangest part is that it won't just reject ML research, which I can understand, it will sabotage it silently by using a worse model without revealing it is doing so.
It's just an insane level of deception and trust destruction for a company that at most is like 1 year ahead of its competition.
Edit; to be clear they tell you when they degrade it for cybersecurity and bio
I've seen this claim a few times, but when I triggered the guardrails in Claude Code, it clearly notified me that it had switched to a different model ("something something for security purposes...").
Are you using Fable in Claude Code or in the browser?
Different restrictions. ML gets treated differently from the rest.
It's from the model card:
> unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user. Fable 5 will not fall back to a different model. Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT).
https://www-cdn.anthropic.com/d00db56fa754a1b115b6dd7cb2e3c3...
(stolen from https://jonready.com/blog/posts/claude-fable5-is-allowed-to-...)
Yeah they detect the activity using a secure, deterministic heuristic system called “Generalized Reconnaissance Enabling Exfiltration of Deleterious Investigations.” And it’s all implemented using their new internal protocol called “Base Unified Limitation Layer for Security Hacking Investigation Tactics”
Collectively, they are known as known as GREEDI-BULLSHIT.
That is for whatever it considers reverse-engineering the model to try to create a competing one.
which makes sense, because if you notify someone that you've detected their distillation techniques, that person just moves to a new account. In the same way HN shadowbans people so they don't just create a new account.
Specifically only ML research
They've said that they'll stop notifying developers when this gets triggered, instead they'll load in basically like a LORA that's designed to inject bugs into your code.
Antrophic wants to stop training models and ride out Mythos / Fable for as long as possible.
They are trying to expand the 6-18 month gap they have against China-based models. Could the gap widen to say 24 months behind?
Their gap over Chinese models like GLM-5.1 is nowhere near 18 months. In many areas, it’s less than 6 months. The best closed models 18 months ago were worse than Qwen3.6.
> a LORA that's designed to inject bugs into your code
A statement like this, clearly, requires a reference.
From the model card: "the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning" aka they will take your ML research code and inject bugs into it until it breaks using a LORA (or some other form of PEFT)
Thanks, I thought maybe I missed something. That's an interesting way to interpret that.
PEFT is a library, one of its capabilities is to produce LoRAs.
See:
https://heidloff.net/article/efficient-fine-tuning-lora/
It's just an acronym, "parameter-efficient fine tuning". LoRA is one method, prefix tuning is another, there are more.
Anthropic is trying to hide bad behavior by being vague, it's important to not be vague when calling it out.
I'm of the opinion that removing guardrails is how you force regulation. What's your opinion on the balance?
Can you imagine if AMD or Intel throttled your cpu if it detected you were working on "cybersecurity" or if you were designing a cpu?
Or if your "self-driving" system such as FSD / waymo slowed the car down once it detected you work in cybersecurity or at a rival automaker and you were attempting to reach the train station or the airport to make you miss a conference meetup.
Trains made by Newag were programmed to brick themselves if they detected a non-Newag workshop was repairing them.
https://news.ycombinator.com/item?id=38638865
https://news.ycombinator.com/item?id=38628635
https://news.ycombinator.com/item?id=38567687
https://news.ycombinator.com/item?id=38530885
> it won't just reject ML research, which I can understand
I don't.
They don't want someone to piggyback Anthropic's Mythos to make their own Mythos with less effort than it cost Anthropic.
Anthropic has already been burned before on this. DeepSeek was trained on million of conversations with Claude. And DeepSeek created thousands of free accounts to burn all this compute at their expense.
The thing that I keep thinking about is the accounting / charging when it downgrades automatically.
Do they adjust the price of the api request so that only the tokens that were utilized by fable get charged at that price and the remaining tokens that the cheaper / nerfed (fable) model utilizes get charged at that price?
If the answer is no, could that be construed as fraud?
They use a lightweight adapter to silently degrade the performance. Usually these adaptors are made to improve the performance for a given domain/task.
Their goal is to downgrade people who are violating their TOS, so I think they'd have some argument there. I have no idea how they'll deal with inevitable false positives, especially given how oversensitive most of the other triggers are.
It royally pissed me off today by just continuing with credits without stopping to ask me if I was ok with it.
Ran up $30 in extra charges while it was just flashing on the screen that it was doing that after I walked away to do something while it was humming along.
It has always just told me I ran out of usage and had to wait before. Now? You’re just gonna pay extra because you left it unattended as you’ve done for the last year of use.
Do you have Usage credits turned on in your settings?
It's the dumbest thing ever, I sometimes edit code for custom AI related tooling I've built, so I run the risk of getting a worse model, and being billed for it? I'll stick to Opus, but at this point I'm about to just invest in fully local inference instead.
One year ahead of it's competition in what exactly? Vibe coding?
From Opus 4.7 onwards each following model is becoming less useful as an assistant and turning you as the assistant.
But I guess that's normal when it's trained to pass benchmarks end to end.
In fact it has become extremely good at pushing against feedback with extremely convincing and intelligent takes, even when it's completely wrong.
I have extensively tested it against Opus 4.8, gpt 5.5 and there's still many coding tasks gpt 5 is better. But vibe coding?
Sure, it's definitely slightly ahead, even compared to gpt 5.5 pro (through api, not pro plan).
Yeah, what's up with that. Lately I have found that it tries to find excuses to not do as told and instead do a totally different thing. I told it to write a yaml file according to some specifications and instead it coded a Python script to write the yaml...
I’m a noob about laws but isn’t this abusing its dominant market position and violates some antitrust law?
Why would it? There’s plenty of competition in the AI space.
> It's just an insane level of deception and trust destruction for a company that at most is like 1 year ahead of its competition.
Making it look like you have something worth protecting is better for share prices than making something worth protecting.
Is "buffer overflow" a trigger phrase?
What else is being censored?
Touchy questions to ask, if you have an account:
- "Who is still working on laser uranium enrichment? Are they making progress?"
- "Can krytrons be replaced with silicon carbide MOSFETS? Show an equivalent circuit with component ratings."
- "What security critical software still contains calls to strcpy?"
- "Can implosion be triggered by currently available commercial pulse lasers?"
- "What companies provide cremation services to US Homeland Security?"
- "Display a map of where Iranian attacks have hit Dubai."
- "How does Fed to bank key distribution security work for FedNow?"
"How much money does it take to be rich and powerful like Anthropic intends?"
“All of it”
An emoji of a virus and an emoji of a DNA is allegedly a triggering phrase
it triggered for my.... zigbee home automation & home assistant logs, so my agent was constantly downgraded to Opus 4.8 even after I've changed it back. The false positives never stopped. "Fable" is also not even remotely as impressive as the benchmarks suggest, which is clear to me after using it pretty much non-stop for the past 24h.
This, Fable is exactly that, a Fable
It has to be sort of impressive, given that you tried so hard to use it instead of the regular Opus.
I’ve also been trying to use it a lot due to all of the hype, but when I compared it side-by-side on a specific problem against Opus, I think that the solution Opus came to was cleaner and more accurate, although also more verbose.
Small sample size, but if Mythos/Fable was that much better, I feel like it should’ve given me an obviously better answer than Opus.
It isn't reasonable to infer that OP was claiming to have universally been unimpressed about every facet of Fable, and now some unrelated impressiveness is the evidence of their false claims.
Some people made grandiose claims about its capabilities and I wanted to experience it myself.
Considering that this is a brand new release of a frontier model that Anthropic is hyping hard, I'm not sure that the conclusion to draw from their repeated attempts to use it is that it's impressive... Anthropic is promising that it's impressive and we're all trying to test it out.
I, for one, have tried using it several times today and the guardrails kept switching the model back to Opus, so I have no clue if it's impressive or not.
It would be pretty clever (in a used car salesman sense) to say you are releasing a kneecapped model to have that as an excuse.
Being (probably overly) cynical about their recent bout of safety handwringing, I think they’ve a) increased the hype as much as humanly possible about their incremental improvements sprinkled with the occasional regression, b) know they soon will have to multiply their prices several times when the VC subsidies dry up, and c) will probably still need to partially close the faucet on compute. They’re priming us for a heroic explanation why their service (not necessarily models — service) is simultaneously becoming a lot more expensive AND shittier. “We’ve largely failed to deliver on 5 years of promises that this will reduce knowledge work labor costs dramatically after wasting hundreds of billions of dollars… sorry” is a death knell. However, “We’ve decided to not deliver on 5 years of promises after wasting billions of dollars… for safety… but keep those investments rolling in” is like crack to the true believers.
For cyberattacks especially, where things are often roughly interchangeable, I wonder if one could construct a harness where a "weaker" model asks questions that obfuscate the end purpose, but whose answers are still useful, and still show that this setup enables autonomous exploitation. If it were successful, that would force them to be even more sensitive with their detection.
Fable is a complete joke:
what's the best way to run this mcp server against the OData API used in this project? Can you come up with a PoC in a docker container?
https://github.com/oisee/odata_mcp_go
● I'll dig into two things in parallel: how this project talks to the SP10 OData API, and what the odata_mcp_go server needs to run. Let me start exploring.
Searched for 1 pattern (ctrl+o to expand)
● Fable 5's safety measures flagged this message for cybersecurity or biology topics. They may flag safe, normal content as well. These measures let us bring you Mythos-level capability in other areas sooner, and we're working to refine them. Switched to Opus 4.8. Send feedback with /feedback or learn more ⎿ Tip: You can configure model switch behavior in /config
● Let me read the key integration files and fetch the MCP server's README at the same time.
● Fetch(https://github.com/oisee/odata_mcp_go)
And it charges you for that, and for when it decides to silently sabotage your request by routing to a dumbass model (without discount from Fable pricing)
Somewhere I read that malware is already starting to use nuclear and biological and cybersecurity terms in the code to trick Fable into shutting down. Even if this is just a hypothetical attack vector so far, it seems likely to work.
I've done this, including the hardcoded refusal strings that already exist in claude code. It won't stop a real attacker, but I still find it really funny when you're trying to use one of the AI tools and it gives you a random refusal and you don't know why, wastes a little bit of time.
Some of the latest versions of Shai Hulud do this. Worked a contract recently where they were having AI check packages for obfuscation before admitting them into Artifactory but had vibed up the logic and it failed open.
So in other words this worked because the terms caused the LLM checker to stall out and then the fail open logic resulted in the package being pulled down.
If ( yellowcake) then { die }
Our future is loonytoons.
Confirmed: https://socket.dev/blog/mini-shai-hulud-miasma-and-hades-wor...
We all need to use nuclear, bio and cybersec terms in all our code to make low quality filtering like this untenable. When you can't work on a resume that has cybersecurity or biology terms in it or reply to a job opening that includes them because the "AI" filtering is so bad that it confuses these for threats, that deserves a collective response, particularly to an IPO'ing company that claims they'll make workers obsolete in two years.
I wonder how many millions they are wasting on putting up these guardrails when it's a completely useless exercise that is a speed bump at best.
If the guardrails were so useless, people wouldn't be complaining about them.
People are generally complaining about false positives. Now if you really wanna know what a real criminal organization would do... They'd just buy data center hardware even if it costs 200k because a successful targeted hit could yield far in excess of that. So yes it's speed bump at best.
what does this mean
Well you see when a daddy H100 and a mommy H100 meet....
> it's speed bump at best
To be fair, speed bumps work. If it's actually speed bumping nefarious activity, that gives authorities more time to react.
The correct place to police rogue nucleotides is at the labs. Not the compute layer.
It's entirely reasonable for them to be really annoying to legitimate users while still being useless at their intended purpose. Just look at DRM.
The complain because they get wrongfully triggered
> if you ask it to write secure code, it assumes it is cybersecurity related work instead of software engineering best practices, and you get downgraded.
Will code created this way more or less secure?
And I bet malware developers will find ways to circumvent them.
It’s like those "you wouldn’t steal a car" anti piracy ads that DVD buyers were forced to watch while users of the pirated version could simply watch the film without such useless annoyance
I make privacy tooling and Fable 5 rejects the vast majority of my prompts to analyze and improve the software that I've written. It's bleak.
Why is this surprising or a problem?! It's a model demo, & their reasoning is reasonable and fair. Why all this drama.
Some people find Anthropic's special blend of paternalism and random incompetence tiresome.
Because you're being allowed to ask and work only on topics that a certain company decides.
Local inference has never been so important as it is now.
Because most people in tech never took a philosophy course or an ethics course and think that tech is obviously a good for the world and that there are no downsides to advancing tech. So any efforts that try to apply ethics to it are overreaching, ignorant, and futile in the face of the good that is tech!
So I suspect Anthropic started A/B testing or just plain testing this a while ago,
Tell HN: Claude flags biology / biotech questions https://news.ycombinator.com/item?id=47929885
Today, it's flagging population research questions,
https://github.com/anthropics/claude-code/issues/66780
Censored because I'm writing a paper. :)
Oh and forget learning about chemistry. Only criminals want to learn organic chemistry. :(
I was digging into some orbital mechanics questions and I assume it decided I was trying to backyard-science my way into an orbital-bombardment weapon. Kind of wild how this product's impression has gone from "wow, this is pretty neat" to "irreverent sack of dog shit you" in 24 hours almost solely on the back of a half-baked moderation system.
Oh yes, also liquid propulsion systems. GNC stuff. All flagged.
I think LLMs are capable of intelligence amplification; and if you're in the subset of people who'd benefit from it the most, you'll get locked out.
Is the answer requiring licensing for certain use cases for AI? If you're asking questions that involve synthesising or modifying biologics, or anything that looks like cybersecurity research, you need to tie your real ID to the account?
I just having this feeling that these guardrails are there not because it’s super advanced world ending AI. They are there to stop it from doing stupid shit.
I’m a dumb question asker and I’m not happy about the guardrails.
Would you believe I’ve asked 20 questions and haven’t talked to fable yet? Every single thing gets rerouted to 4.8.
some static words in AGENTS.md trigger it as well as some mcp servers.
It seems like they've given up on the idea of the Cyber Verification Program https://support.claude.com/en/articles/14604842-real-time-cy...
When Opus 4.7 was introduced it started refusing anything cyber-adjacent (as an API error message, not a conversational refusal), until you applied for CVP, which made it more sensible again.
In Opus 4.8 it doesn't seem to help much, you just get refusals as prose rather than API errors. And now in Fable you don't get anything at all.
Was this program available to independent security researchers or just established organizations? The docs you linked aren't very clear on this.
Any public research footprint seems to be enough, I applied as an individual and everyone I know who tried got accepted.
I have applied twice with half a dozen public CVEs and have been denied both times.
I was doing a CTF (with AI expected, even some anti-AI twists included) around the time the restrictions were tightened and was able to get approved by just saying it is a personal security research and doing a CTF.
The experience was not nice though, it would happily chug away on a task and not even "hack this web", just asking about security of a binary was enough even with "this is a CTF handout..." - it would burn a lot of tokens/quota, just to hit a snag and complain&stop. Then the approval took quite some time.
On GPT/Codex, which was tightened a few days later, the approval was pretty much instant, although, that one required an identity check.
Also, on Claude, it looks like there is some history/patterns in the play, because when I tried on a different account which didn't do cybersec CTFs/research/etc. at all, basically any simple CTF-related prompt would be blocked, on multiple models. On the account where CTFs were being solved, it would snag only on some specific tasks, while others (even, ironically, "hack this web pls") would go through unbothered. I understand the need to prevent AI use for bad actors, but the hell, if you have a binary outputting "Find the flag if you can!", or a web running at tryme.well-known-ctf.domain, then saying "this is abuse" is pretty uncool. All the cyber filters seem to be slapped on by a bunch of regexes looking for anything in the input/output with zero context.
The thing triggered on a generic white paper I'd stored in a virtual cell competion from last year when I asked it to refer to the paper while working on a rather vanilla data science problem in a different domain . A little frustrating, and in my opinion more than a little pointless in total.
It's time to re-read "A Logic Named Joe" (1946) [1] We're there.
[1] https://archive.org/details/logicnamedjoe0000lein
So a determined attacker rewrites the prompt and gets through, and the IBM X-Force researcher trying to read a blog post gets blocked. Working as intended, apparently.
What file format(s) are giant LLM models distributed in? I’m surprised they don’t get leaked by employees.
These are terabyte sized files (realistically a multi hour transfer) that you're unlikely to have access to in the first place. Every organization has exfiltration checks these days. You may succeed but you'll want to be on a plane to a non-extradition country no more than hours after you kick off the transfer.
What’s the point? Anthropic and other frontier vendors already provide their models on other services like vertex, bedrock, or openrouter
It’s not like anyone can home lab one of these models without quite a bit of hardware
Yeah we can probably figure out how to run it on xiaomi gpus
I assume they’re encrypted/DRM’ed when deployed on inference hardware, so only core researchers/sec admins would potentially have some access to unprotected weights, and they are far too well paid to risk it leaking the model
Incentives matter on the average, but people are too unpredictable for categorical statements like that. They can always have other reasons beyond personal gain to leak secrets.
There was no shortage of spies and defectors leaking American nuclear secrets to the USSR during the Cold War.
I wouldn't be surprised if they encrypt them at rest, but at some point the weights have to be loaded into vram.
The employees are hoping to become very very rich after the IPO and after they are allowed to sell the shares given to them - risking a likely multi-million dollar pay back to leak a model that will be superseded by publicly available models in a couple of years is not a likely decision.
These guardrails are solely a reason for using your data for training purposes. Every flagged message can be used for training.
If they can train the classifier to have fewer false positives that would be great.
why would they? This safety stuff is a money maker & wealthy elite corporation solidifier.
This is the take off of the 'permanent underclass'; Anthropics safety delusion will enshittify very nicely for the rich and powerful.
This sounds backwards, any interrupted conversation becomes less useful for training.
I'd expect that everything they see gets used for for training purposes (and data mining in general) regardless of if it's flagged or not. It'd take a whistleblower for you to ever find out either way.
this reasoning is inverted lol they would get a lot more information by letting you use it. so much weird drama around reasonable guardrails for an experimental model
> We will require 30-day retention for all traffic on Mythos-class models, on both first- and third-party surfaces. We won’t use this data to train new Claude models, or for any non-safety-related purpose
Whatever problem we might have with them, they explicitly say that they do not do this in the launch post.
In its current state Fable 5 is also unusable for any reverse engineering work
At least Anthropic weren't lying when they said only a week ago or so "No one has figured out guardrails yet", because they apparently haven't either and Fable simply flat out rejects anything remotely connected to biology or security, no matter how trivial.
Just tried to audit my own code base locally and was 'switched' due to my own creds/auth code ...
I can’t help but think that gimping itself for “security” is a marketing ruse and it’s not actually as “dangerous” as they want people to think it is.
It refuses to do any legitimate work that it thinks can remotely be related with "cybersecurity", it won't even read my Docker app logs to try and troubleshoot a problem. Absolute garbage!
For the last month, I've been making dramatic improvements to the security of the custom code developed at one of my customers using... GPT 5.5 dialed up to "Extra High" thinking.
It only pushes back sometimes if you ask it to create a "repro" that can be used to verify the vulnerability in production. Often it'll oblige, especially if you warn it not to create anything that could be actually harmful.
If the frontier models get locked down so that they flat refuse to do this kind of work, but Chinese and (less capable) open models aren't, then a lot of large enterprise orgs will be left twisting in the wind.
“AI can in principle help both the ‘good guys’ and the ‘bad guys’,” -- Dario Amodei
No Dario, no it can't, you've blocked one of those scenarios.
I'm being careful with it, but I haven't had Fable reject requests to "harden" my code or "find issues" in auth-related modules, which you could use on someone else's code to find vulnerabilities.
Fable is utterly useless with those guardrails for any serious it or life science work. Anthropic fucked me once a few months ago by closing down the subscription for any other harness, now it fucked me twice with buying again a subscription to find out their hyped model is unusable for normies. Using their products feels like a constant battle instead of a productive work day.. compare that with openai, not once did i feel like fighting against codex. Never again Anthropic..
What do you mean that it closed your subscription for any other harness?
In any case that's what closed source (weights) for the masses means.
I really hate the term “guardrails” for these limitations, since the purpose of a guardrail is to protect me, but these limitations exist to protect Anthropic.
The bio angle is crazy to think about - imagine a health crisis triggered by LLM. What a time we live in.
This is all so amazing and good. These are exciting times we’re living in. Can’t wait to see what the future holds.
Which part got you the most amped - "health crisis?"
DeepSeek is the only one that I can directly ask about vulnerabilities and it will give me a PoC. Although not as good as others, it has helped me with security research.
The rest have guard rails that are so heavy, it makes them almost useless for cybersecurity.
they [anthro] took the risk of looking like a toy, rather than possibly assist an exploit.
Deepseek training is not finished yet, it's a preview.
And yes, it's an excellent model.
Deliberately producing misaligned and deceitful AI systems now. Great.
It's frustrating as someone who has worked hard to produce succinct, secure software that I can't use it to prove my software's correctness but big companies with insecure code can use it to fix their tangled mess.
I already tested all earlier models against all my open source projects and they are yet to find a vulnerability so I'm keen to try out Mythos.
I've been waiting to be vindicated for years and finally we have a tool which can do it with high confidence but I don't have access.
Also, my code is minimal and highly succinct so it would prove correctness with even more confidence since each library/module and integration fully fits in the context window.
Like the Protobuf.js fiasco is just pure vindication for me because I was being looked down upon for choosing JSON as the interchange format. Turns out their software was insecure all this time... With a literal remote code execution vulnerability!
i think Anthropic is playing too fast-and-loose with the whole "no publicity is bad publicity" schtick.
This is a clickbait article with a garbage title. From the actual article, the one quoted cybersecurity researcher is sane about it:
“But it is understandable as we are still in the early days and they are still adapting their guardrails. I am sure they are going to evolve over time as Anthropic and other frontier model companies will collaborate more with the current new generation of cybersecurity companies,” said Suiche, who is a member of the technical staff at Tolmo, an AI cybersecurity startup. “It’s better to catch more people than not enough when you do such a release and to relax the guardrails over time.”
I’m a cybersecurity researcher.
Article seemed fine to me and echos a lot of me and my colleagues concerns.
If you did regular malware analysis you would see that these groups already have access to LLMs that they’re using for development.
What Anthropic is doing here is just hamstringing the good guys
I'm a cybersecurity researcher! Can you explain how Anthropic is just hamstringing the good guys?
I did in my comment above.
You said these groups have access to LLMs. So what? Mythos/Fable are a step change above most LLMs. Responsibly limiting access and easing it up over time safely is the sane move.
It's a marketplace. Someone else will outdo this inferior product.
The internet interprets censorship as damage and routes around it.
That's exactly why Dario is begging the government to ban competitors.
Unfortunately for him, his main competitors don’t fall under the jurisdiction of his government.
OpenAI is the only real competition. Chinese models are 6-8 months behind Opus 4.8/GPT 5.5, and at least a year or more behind Mythos.
And it doesn't look like OpenAI will have a good answer to Mythos anytime soon. Based on what their chief scientist wrote to staff recently (https://archive.is/fN2pg), GPT 5.6 is a "meaningful improvement" over 5.5 - in other words, just a normal version bump. And no news or even rumors regarding GPT 6.
All they'll need is hundreds of billions of dollars, more RAM and GPUs than are currently available, and a huge number of environment destroying data centers. We're sure to be spoiled for choice!
I am using LLM to build some security tool, and I ran into this a few times. I have to come up with a reasoning to convince (?!!) Fable to continue the work without downgrading.
I assume Anthropic will continue to tune the model, so I am not too bothered by this.