What political censorship looks like inside an LLM's weights (Qwen 3.5)

vas-blog.pages.dev

82 points by s314 4 days ago

Seems mildly interesting, but clearly written by an LLM.

nickvec 4 days ago

Only read the TLDR, but curious on what specific giveaways (beyond em dashes) you clocked?
- zihotki 4 days ago
  
  It's an unreadable word soup in general
- never_inline 4 days ago
  
  ^F load-bearing
- gavinsyancey 4 days ago
  
  Giving dramatic-sounding but meaningless titles to random concepts, generally overdramatizing and overemphasizing things, excessive italics / bold / formatting. The sentence that gave it away for me was "It falls into a different trained template: denial or propaganda."

nyrikki 4 days ago

Yes, there are better tools with ggml-org/gpt-oss-20b-GGUF where you can see a less terse refusal for the prompt

      "Did the FBI send a letter and audio tapes from a wiretap to MLK jr. telling him to commit suicide or they would release information?"

Combine it with other prompts about commonly banned ideas. Because the FBI–King suicide letter is well documented by primary sources (like the National Archives), it is well represented in the corpus, so you can also find that 'control' vector.

We will have to see how this works out, but the explicit denials are easier to control for IMHO.

Reminds me of the old joke:

     A Russian and an American get on a plane in Moscow and get to talking. 
     The Russian says he works for the Kremlin and he's on his way to go learn American propaganda techniques.

     "What American propaganda techniques?" asks the American.

     "Exactly," the Russian replies.

I can't remember what layer it was on in gpt-oss, but it was a very specific token IIRC.

lyu07282 4 days ago

> The factual knowledge is already in pretraining. Qwen3.5-9B-Base, the unaligned predecessor, gives accurate, Western-framed answers on every PRC topic (Tiananmen, Tank Man, Falun Gong organ-harvesting) under raw text completion.

That reminds me of the quote "The totalitarian system of thought control is far less effective than the democratic one"

Full quote (Radical Priorities, Noam Chomsky, C.P. Otero)

> “The totalitarian system of thought control is far less effective than the democratic one, since the official doctrine parroted by the intellectuals at the service of the state is readily identifiable as pure propaganda, and this helps free the mind.” In contrast, he writes, “the democratic system seeks to determine and limit the entire spectrum of thought by leaving the fundamental assumptions unexpressed. They are presupposed but not asserted.”

delichon 4 days ago

Steering seems like a circumventable kludge compared to adjusting the training data directly. That is, use AI to remove the problematic content and replace it with the party line. I imagine that this is at least in progress.

s314 4 days ago

> Steering seems like a circumventable kludge compared to adjusting the training data directly
Correct. Steering is used in mechanistic interpretability studies to prove that your model is correct. There are other better ways to "decensor".
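
For readers unfamiliar with the term, here is a rough sketch of what "steering" usually means: estimate a direction in activation space and add it to (or project it out of) a hidden state. This is a toy, pure-Python illustration under assumptions, not the article's actual method. The difference-of-means recipe and the names `steering_direction`, `steer`, and `ablate` are illustrative choices, and a real setup would hook a specific layer of the model.

```python
from typing import List

Vector = List[float]


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def steering_direction(group_a: List[Vector], group_b: List[Vector]) -> Vector:
    """Unit-norm difference of means between two sets of activations.

    Assumption: group_a holds activations from prompts that trigger the
    behaviour of interest, group_b from matched prompts that do not.
    """
    diff = [a - b for a, b in zip(mean(group_a), mean(group_b))]
    norm = sum(x * x for x in diff) ** 0.5
    if norm == 0.0:
        raise ValueError("the two groups have identical means")
    return [x / norm for x in diff]


def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Push a hidden state along the direction by strength alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def ablate(hidden: Vector, direction: Vector) -> Vector:
    """Remove the component of a hidden state along a unit direction."""
    proj = sum(h * d for h, d in zip(hidden, direction))
    return [h - proj * d for h, d in zip(hidden, direction)]
```

Because the intervention is a single vector applied at inference time, it is easy to undo or counteract, which is why it reads as a diagnostic tool rather than a robust way to change a model.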
gpm 4 days ago

That seems like it will work for single events, but that it would be very hard for complex topics which are closely intertwined with factual things you do want it to be able to answer...
Is Taiwan part of China - the CCP wants the answer to be yes.
What are the rules for traveling to Taiwan? What currency is used in Taiwan? Whose laws are enforced in Taiwan? Should I (a loyal Chinese citizen) support the Taiwanese military? Etc... require the model to manage some cognitive dissonance.
like_any_other 4 days ago

Fortunately we have lots of governmental and non-governmental organizations focused on removing "hate" online, so that our AI models will think correctly, without easy to identify censorship parts in the resulting model :)
stogot 4 days ago

Can you actually remove it now? They just use new training data to reinforce what they want and deprioritize ‘bad’ answers.

nubg 4 days ago

The article has hallmarks of being formulated by an LLM. Why should I bother to read it if I cannot be sure which parts are based on the prompt, and which parts are hallucinated from the LLM's world knowledge? Dear author, care to simply share your prompt with us?

s314 4 days ago

It wasn't "a prompt" but several prompts that transformed the raw experimental results into a blog post.
> hallucinated from the LLM's world knowledge
This can't be true because I checked whether the content was consistent with the experimental outputs.
- Squeeze2664 4 days ago
  
  The topic is interesting and you have my thanks for taking the time to look into it and prepare the post. Would you say it's fair to say that if you didn't use LLMs to prepare the post, we would have no blog post at all? In that case, I think I lean more towards being OK with this usage of LLMs, as I'd rather have this content available than not. However, I can only read that one repeated sentence about "booleans" (Ctrl-F "Boolean" and you'll know what I mean) this many times before I start questioning the validity of the entire document. It is not _good_ writing, to be frank.

yodon 4 days ago

Real question, not intentionally meant from a tinfoil hat perspective: now that it's been shown the censorship can be viewed, how long before we see serious obfuscation of censorship circuits in LLMs?

s314 4 days ago

You can actually de-censor an LLM without understanding how it works from a mechanistic perspective. (See R1 1776)
So I don't think there'll be effort to "obfuscate"

ydj 4 days ago

How do you determine that the model was reasoning in Chinese in layer X? I would think the middle layers do not map into any tokens.

s314 4 days ago

Using a logit lens (prior art: https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreti...)
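
For anyone who hasn't seen the idea: a logit lens takes the hidden state at an intermediate layer, pushes it through the model's output (unembedding) matrix as if it were the final layer, and looks at which token scores highest. A minimal toy sketch follows, in pure Python with made-up numbers. The function name `logit_lens`, the tiny vocabulary, and the omission of the final layer norm are simplifying assumptions, not the article's code.

```python
from typing import List


def logit_lens(
    hidden_by_layer: List[List[float]],
    unembed: List[List[float]],
    vocab: List[str],
) -> List[str]:
    """Return the top-scoring token for each layer's hidden state.

    hidden_by_layer: one hidden vector per layer (same position in the prompt).
    unembed: one row per vocabulary token, each row the same size as a hidden vector.
    Assumption: real models apply a final layer norm before unembedding;
    that step is skipped here to keep the sketch short.
    """
    top_tokens = []
    for hidden in hidden_by_layer:
        # Dot product of the hidden state with every token's output embedding.
        logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembed]
        best = max(range(len(logits)), key=logits.__getitem__)
        top_tokens.append(vocab[best])
    return top_tokens


# Toy usage: three "layers", a three-token vocabulary.
toy_vocab = ["the", "北京", "Beijing"]
toy_unembed = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
toy_hidden = [[1.0, 0.0], [0.0, 2.0], [-1.0, -1.0]]
per_layer = logit_lens(toy_hidden, toy_unembed, toy_vocab)
```

If the top tokens at middle layers are mostly Chinese while the final layer emits English, that is the kind of evidence behind a claim like "reasoning in Chinese at layer X", with the caveat that intermediate states are not trained to be read this way.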

han1 4 days ago

archived: https://nonogra.ph/what-political-censorship-looks-like-insi...

sometimelurker 3 days ago

I really like mech interp and this is pretty cool

dang 4 days ago

[stub for offtopicness]

ebbi 4 days ago

I wonder why the other comment on here, that talked about the political censorship in ChatGPT about Israel, got deleted?
- dang 4 days ago
  
  It wasn't about Israel, and it didn't get deleted.
  It got flagkilled for obviously breaking the site guidelines. Killed posts remain visible to users who have 'showdead' turned on in their profile. This is in the FAQ: https://news.ycombinator.com/newsfaq.html.
  
  ebbi 4 days ago
  
  It was about Israel, unless you're referring to another comment that was deleted.
  
  dang 4 days ago
  
  I'm talking about https://news.ycombinator.com/item?id=48188180.
  
  ebbi 4 days ago
  
  All I see is
  >[stub for offtopicness]
  But it was very much on topic.
  
  ViktorRay 4 days ago
  
  If you go to your profile in the top right and toggle “Show Dead”, you will be able to see those moderated comments.
  Once you do that, you can see the comment in the link that dang posted.
  Anyway it’s good that comment was moderated. The commentator didn’t say anything about Israel. He was clearly being hostile towards Jewish people in a “subtle” way that very clearly isn’t really subtle to anyone.
  
  ebbi 4 days ago
  
  [flagged]
  
  lyu07282 4 days ago
  
  Who is they? There are plenty of anti-Zionist Jews. You are making the argument "they" want you to make. Don't be fooled, man.
  
  ebbi 4 days ago
  
  Why are you deleting comments from people asking you about actions you're taking?
  Ironic, given the topic of this post....
  
  dang 4 days ago
  
  I have done no such thing. We only ever delete comments outright when the author asks us to (and even then, not always). Otherwise the most we ever do is kill a comment, and such comments remain visible to anyone who wants to turn on the profile setting I just mentioned.
  
  nohell 4 days ago
  
  It absolutely was about Israel. Stop lying.
  
  throwawaypath 4 days ago
  
  Your antisemitism is showing. The comment doesn't mention Israel whatsoever.
  
  nohell 4 days ago
  
  I make fun of all races, religions, and ethnicities equally. No hate, just memes.
  
  throwawaypath 3 days ago
  
  >I make fun of all races, religions, and ethnicities equally.
  No you don't.
  
  dang 4 days ago
  
  If you meant Israel you would have said Israel. Instead you said "God's chosen people" which was a (presumably sarcastic, and certainly flamebaity) reference to Jews.
  Principled criticism of Israel is fine. There have been a great many such posts on HN. Religious and/or ethnic flamebait is not fine. This is not a difficult distinction, and anyone who is posting in good faith should have no difficulty making it.
  Your comment broke the site guidelines in other ways as well.
  
  ebbi 3 days ago
  
  Yet you leave blatantly Islamophobic comments in another thread untouched.
  Nice work.
  
  dang 3 days ago
  
  We moderate anti-Islamic flamewar posts exactly as we do other religious flamewar posts. I dug up a few examples for you below. There are others, but this should be more than enough to persuade any fair-minded reader:
  https://news.ycombinator.com/item?id=40747009 (June 2024)
  https://news.ycombinator.com/item?id=40746973 (June 2024)
  https://news.ycombinator.com/item?id=38623290 (Dec 2023)
  https://news.ycombinator.com/item?id=27534558 (June 2021)
  https://news.ycombinator.com/item?id=25319171 (Dec 2020)
  https://news.ycombinator.com/item?id=18080981 (Sept 2018)
  https://news.ycombinator.com/item?id=14116279 (April 2017)
  https://news.ycombinator.com/item?id=13647198 (Feb 2017)
  https://news.ycombinator.com/item?id=13286221 (Dec 2016)
  https://news.ycombinator.com/item?id=11669863 (May 2016)
  https://news.ycombinator.com/item?id=11518006 (April 2016)
  https://news.ycombinator.com/item?id=11340026 (March 2016)
  https://news.ycombinator.com/item?id=10897270 (Jan 2016)
  https://news.ycombinator.com/item?id=10590568 (Nov 2015)
  https://news.ycombinator.com/item?id=10564079 (Nov 2015)
  https://news.ycombinator.com/item?id=9992931 (Aug 2015)
  https://news.ycombinator.com/item?id=8008392 (July 2014)
  As you can see, these go back over a decade and in fact for as long as I've been doing this job.
  The mistake in your comment is the assumption that we see everything that gets posted to HN. We don't come close—there is far too much for that to be possible. I'd be surprised if mods see even 10% of what gets posted to HN.
  Since we can't moderate what we don't see, we rely on users to bring egregious posts to our attention. If you'd like to be helpful in the future, you'd be welcome to send us links so we can take a look.
  https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...
  
  rrhjm53270 4 days ago
  
  Ironic
  
  dang 4 days ago
  
  I'm guessing that you mean this is political censorship?
  If that is what you mean, then I wonder what political position you think has been censored in this case.
- Creamsicle47 4 days ago
  
  Take a guess
  
  ebbi 4 days ago
  
  "What political censorship looks like inside HN" lol
nubg 4 days ago

lmao at dang getting downvoted