points by simianwords 5 hours ago

If it bullshits so much, you wouldn't have a problem giving me an example of it bullshitting on ChatGPT (paid version)? Let's take any example of a text prompt fitting a few pages - it may be a question in science or math or any domain. Can you get it to bullshit?

katatue 43 minutes ago

I like to let new models write a few lines of Latin poetry - they rarely get the meter right.

I don't have access to paid ChatGPT right now, but here's Opus 4.6 with extra thinking enabled: https://claude.ai/share/6e0e8ef5-06e4-4514-ba7e-299357c1fc55

The initial draft fucks up the meter in lines 3 and 8; the final version gets line 2 wrong ("venit meis") and is somewhat obnoxious, with verses 2 and 8 basically repeating each other. The thinking trace is useless and gives us no clue why the model exchanged a bland but metrically correct first distich for a more interesting but metrically incorrect one.

In fact, the "careful" examination of its own output completely skips the erroneously modified half-verse in line 2 - now, tell me that's a coincidence and not a sign of bullshitting.

beders 4 hours ago

I think you highlight one of the problems with users of LLMs: you can't tell anymore whether it's BS or not.

I caught Claude the other day hallucinating code that was not only wrong but dangerously wrong, leading to tasks failing and never recovering. But it certainly wasn't obvious.

dgb23 3 hours ago

To me it's the other way around. It's difficult to trust (paid) ChatGPT's output consistently.

When I need exact, especially up-to-date, facts, I have to constantly double-check everything.

I split my sessions into projects by topic, but it regularly mixes things up in subtle and not-so-subtle ways. There seems to be no real understanding of continuity, and especially not of causality.

It's _very_ easy to lead it astray and have it confidently echo false assumptions.

In any case, I've become more precise at prompting and better at spotting when it fails. I think the trick is to not take its output too seriously.

simoncion 4 hours ago

> If it bullshits so much, you wouldn't have a problem giving me an example of it bullshitting on ChatGPT (paid version)?

There's an entire paragraph in the essay about aphyr's direct experience with ChatGPT failures and sustained bullshitting that we'd never expect from a moderately skilled human who possesses at least two functioning brain cells. That paragraph begins "I have recently argued for forty-five minutes with ChatGPT". Do notice that there are six sentences in the paragraph. I encourage you to read all of them (and make sure to check out the footnote... it's pretty good).

The exact text of the ChatGPT session is irrelevant; even if you reported that you were unable to reproduce the issue, it would only reinforce one of the underlying points, namely, that these systems are unreliable. aphyr has a pretty extensive body of published work indicating that he'd be unlikely to fabricate a story of an LLM repeatedly failing to accomplish a task that any moderately skilled human could accomplish when equipped with the proper tools. So I believe that his report is true and accurate.

  • simoncion 2 hours ago

    There's also this seven-week-old example [0] (linked in the essay) of ChatGPT very confidently recommending an asinine course of action because it was unable to understand what the hell it was being told.

    Listening to the audio is not required, as there's a reasonably accurate on-screen transcript, but it is valuable to hear just how very hard they've worked to make this tool sound both confident and capable, even in situations where it's soul-crushingly incorrect. Those of us who have worked in Blasted Corporate Hellscapes may recognize how this manner of speaking can be very, very compelling to a certain sort of person (who, as it turns out, is frequently found in a management position).

    [0] <https://www.instagram.com/reel/DUylL79kvub/>

    • simianwords 2 hours ago

      This is a classic case of not using the proper version. Use the thinking version, gpt5.4 (text), and tell me if it bullshits.

      Surely you must be able to find at least one example, no?

      • simoncion 2 hours ago

        To be clear, is your assertion that aphyr was also not using the proper version? If that is your assertion, do tell me how you came by that information.

        (You did notice that the author of the essay and the author of the video I linked to are not the same person, and that neither of them share a nym with me, yes?)

        • simianwords 2 hours ago

          Hi, my position on the issue is that LLMs are powerful but may make mistakes in long-context problems like coding (which the harness solves via feedback), yet make close to no (undergrad-level) mistakes on questions that fit in 2-3 pages. For you personally: do you believe me on this specific point about 2-3 pages?

          I don't know what aphyr did, and tbh his whole screed on LLMs makes me feel he didn't use it properly, or was at least coming from a bad-faith angle.

          That's why I'm asking you (and others). Please come up with a text prompt spanning < 4 pages and let's see if it bullshits.

          Surely the implication of such a screed is that it should be super simple to find at least one example of it clearly bullshitting within my constraints, no? Or am I interpreting the post in a bad-faith way?

          • simoncion 2 hours ago

            Neat.

            So, despite the fact that it looks like you have to pay for ChatGPT Voice mode with video, [0] it doesn't count as an

              example of it bullshitting on ChatGPT (paid version)
            

            That is, father_phi's use of what seems to be a paid version of ChatGPT to have a bullshit-filled conversation that definitely spans less than four pages doesn't count?

            [0] The page at [1] declares that the video feature is "Available in ChatGPT Plus, Pro, Business, Enterprise, and Edu on mobile"

            [1] <https://chatgpt.com/features/voice-with-video/>

            • simianwords 2 hours ago

              Let's stick to my challenge please - thinking version, find bullshit. If you can't, that's ok. Do you accept, then, under the constraints, that the thinking version doesn't produce bullshit?

              • simoncion 2 hours ago

                Given aphyr's vocation (and how very lucrative it is), and how years and years of his writing indicates that he's very devoted to getting a correct and complete answer when investigating a question, I find it hard to believe that he's not using a paid version of the LLMs. If I knew him, I'd ask and verify, but I don't, so I won't.

                > Let's stick to my challenge please...

                I did. Your challenge was literally:

                  If it bullshits so much, you wouldn't have a problem giving me an example of it bullshitting on ChatGPT (paid version)? Let's take any example of a text prompt fitting a few pages - it may be a question in science or math or any domain. Can you get it to bullshit?
                

                father_phi's two-sentence question about whether one can use a cup that's closed at the top and open at the bottom definitely counts. Given what I've mentioned about aphyr above, I expect he has already run your challenge on the fanciest available version and reported on the results in the essay under discussion.

                • simianwords 1 hour ago

                  > Use the thinking version gpt5.4 (text) and tell me if it bullshits

                  This was what I said. Text! Despite me specifically asking for text, you've shown a voice example. Not sure why?

                  I believe you and I agree that GPT 5.4 thinking on text that fits < 4 pages never bullshits? Then we are good!

                  If we agree on this, I think the post doesn't capture this in spirit.

                  • simoncion 1 hour ago

                    > This was what I said. Text!

                    No, that's what you said after I provided an example of paid ChatGPT emitting complete bullshit from a two-sentence prompt.

                    The challenge you issued is at [0].

                    [0] <https://news.ycombinator.com/item?id=47692592>

                    • simianwords 1 hour ago

                      > If it bullshits so much, you wouldn't have a problem giving me an example of it bullshitting on ChatGPT (paid version)? Let's take any example of a text prompt fitting a few pages - it may be a question in science or math or any domain. Can you get it to bullshit?

                      I have clearly written "text prompt" here, and I repeated it a few times. It's not my fault you didn't read it. You are coming across as a bit of a bad-faith arguer.

                      In any case, you agree that under these constraints bullshitting doesn’t exist?

                      • simoncion 1 hour ago

                        > I have clearly written "text prompt" here.

                        How do you think the "voice" interface works? It runs speech-to-text on the input, turning the audio into text. The LLMs don't decode voice; they work on text.

                        You can see this process in action on many of father_phi's videos.

                        Regardless, I expect that aphyr's reported results are on the very latest publicly-available ChatGPT models.

                        • simianwords 1 hour ago

                          Very bad-faith arguing. I clearly said text, you disregarded it multiple times, and you are still arguing.

                          You've still not given me a single example of 5.4 thinking bullshitting in text. It says a lot that you have ignored this multiple times. Unfortunate!

                          • simoncion 1 hour ago

                            I'm not sure why you're ignoring aphyr's reports. I'm also unsure why you're ignoring my original statement that having the text of the conversation that led ChatGPT to bullshit is entirely irrelevant, since being unable to repro the report is even worse for ChatGPT than being able to repro would be.

                            shrug

                            • simianwords 1 hour ago

                              I specified text precisely to exclude the voice one, because it uses 4o-mini underneath. And it's kinda stupid to keep ignoring that and saving face now - reconsider this approach.

                              I believe this is the 5th time I'm asking this: you are not able to produce a _single_ counterexample for my challenge? After all this, surely I can get a direct acknowledgement here.

                              • simoncion 1 hour ago

                                > you are not able to produce a _single_ counterexample for my challenge?

                                I have. For both your original challenge and your updated one.

                                Consider:

                                1) AFAICT, there's no way to tell what version of the model was used to produce the output in a ChatGPT share link.

                                2) You don't appear to believe my assertions that aphyr is almost certainly paying for and using the latest version of the LLMs available, and that he's faithfully reporting his interactions with the LLMs.

                                3) Because of #2, I expect that you won't believe me if I report that I've more-or-less reproduced father_phi's results about the cup that's sealed on the top and open on the bottom on the very latest only-available-for-pay ChatGPT model.

                                3a) You might attempt to check my report, but I'd be shocked if you'd consider a failure to reproduce my results to be a significant strike against ChatGPT. I'd think it's more likely that you'd either call me a liar, or tell me that I must have had some setting wrong somewhere.

                                3b) Even if you told me to share the ChatGPT chat that proved my assertion, #1 -combined with your demeanor throughout this conversation- tells me that you'd almost certainly claim that I was using an inferior version of the model and was lying to you.

                                • simianwords 49 minutes ago

                                  Haha ok. So still no example?

                                  The GPT shared link shows a "thought for" label, which indicates the latest thinking model was used. You may try that.

                                  What you can do is this: submit a prompt that clearly makes GPT hallucinate.

                                  You may secretly use a worse model. You may use a system prompt that deliberately gives wrong answers. But I'm going to assume you won't go that far.

                                  We can leave it to the public to decide whether this is a legitimate counterexample and whether it can really be reproduced. Shall we try that? I'm guessing you won't, but it's worth a shot!