PunchyHamster 22 hours ago

Let's start with the most outright alarming error: the Claude statistics are drawn from a whole 2 data points.

logicprog 22 hours ago

That's sort of the point. There isn't enough data to extrapolate, and yet that's exactly what those outraged about AI were doing. When you do the very minimal types of analyses (permutation tests and looking at distributions, mostly) that are actually valid, safe, standard, and useful on such small amounts of data, again, no evidence for the outrage shows up. The two releases look so normal that it suggests no one would've cared if they hadn't known or found out that Claude was involved.
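
For readers unfamiliar with the permutation test mentioned above, here is a minimal sketch of a one-sided exact version. The bug counts are invented for illustration and are not the article's data, and `exact_permutation_p` is a hypothetical helper, not the author's code. It relabels every possible pair of releases as "Claude" and asks how often the relabeled pair looks at least as bad as the observed one.

```python
from itertools import combinations
from math import comb


def exact_permutation_p(treated, control):
    """One-sided exact permutation p-value for H1: mean(treated) > mean(control)."""
    pooled = list(treated) + list(control)
    k = len(treated)
    observed = sum(treated)
    # With a fixed group size, ranking relabelings by the treated sum is
    # equivalent to ranking them by the treated mean.
    sums = [sum(subset) for subset in combinations(pooled, k)]
    return sum(1 for s in sums if s >= observed) / len(sums)


# Hypothetical per-release bug counts, invented for this example only.
historical = [2, 5, 7, 8, 11, 14, 3, 9]
claude = [6, 9]

p_value = exact_permutation_p(claude, historical)
# The smallest p-value this test can ever produce with these group sizes.
smallest_possible_p = 1 / comb(len(historical) + len(claude), len(claude))

print(round(p_value, 3))              # 0.511
print(round(smallest_possible_p, 3))  # 0.022
```

Note the second number: with only two releases in the treated group, the test has very few possible outcomes, so its resolution is coarse no matter what the data look like.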

I really think this is a much better standard of evidence, limited though it is, than outrage-fueled cherry-picked anecdotes, which is what has been driving this whole thing. If you disagree, and think the outrage should go on when I've shown there's an absence of evidence entirely for it (although of course, that's not evidence of absence; maybe I'll have to eat my words 5 releases down the line, but appealing to that now feels like a Russell's Teapot), would you care to explain why?

  • ofjcihen 22 hours ago

    I know you’re defending your work here but this behavior does absolutely nothing to help your point.

    • logicprog 22 hours ago

      Fair point. Let me edit (if I still can) to tone it down.

its-summertime 12 hours ago

If one asks "Is the house on 123 Road Street, NJ, taller than the statistical average", then there is only 1 data point for the house on 123 Road Street, NJ, which is also 100% of the houses on 123 Road Street, NJ.

kelnos 10 hours ago

You can apply that to the outrage too: the people pissed off about this are going off 2 measly data points.

runarberg 22 hours ago

The interpretation of the p-values is also alarming. One of the first things they teach you in statistics class is: “an absence of evidence is not evidence of absence”.

This analysis showed that there is indeed an absence of evidence, but it concludes there is evidence of absence.

Traditional p-hacking is done by oversampling and overtesting. If you do 20 analyses, on average one will show p < 0.05 by random chance. This analysis is doing the inverse of that: under-sampling, and concluding with p > 0.05.
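
The "20 analyses" arithmetic can be checked in a few lines. This is only a sketch, and it assumes the 20 tests are independent and that the null hypothesis is true in every one of them; neither assumption comes from the article.

```python
alpha = 0.05   # significance threshold
n_tests = 20   # number of independent analyses, all with a true null

# Expected number of tests that cross the threshold by chance alone.
expected_false_positives = n_tests * alpha

# Probability that at least one of the tests crosses the threshold.
p_at_least_one = 1 - (1 - alpha) ** n_tests

print(expected_false_positives)   # about 1.0
print(round(p_at_least_one, 2))   # 0.64
```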

  • logicprog 22 hours ago

    > This analysis showed that there is indeed an absence of evidence, but it concludes there is evidence of absence.

    I tried pretty hard to avoid saying that, can you point me at how to rephrase? The point I'm trying to make is just that there is absolutely no evidence at all for what people are saying with such absolutism and claimed objectivity (that Claude made rsync worse), and thus it doesn't justify the outrage.

    > Under-sampling, and concluding with p > 0.05

    How would I avoid under-sampling here? And if you're going to say it's because I only have 2 data points, well, the side making the positive claim — that Claude made rsync worse — only had two as well, and unremarkable ones at that, as I've tried very hard to show.

    • runarberg 22 hours ago

      You are interpreting the p-values on their own merit rather than using them to test a null hypothesis. Quotes like:

      > With a p-value of 74%, the answer is a decisive no. The odds ratio is 1.06 — essentially 1:1. Claude releases are no more likely to be above the median than any other releases.

      are problematic in this context, as the correct conclusion here is that you just don’t have enough data to conclude whether or not you are more likely to encounter a bug after a Claude commit.

      > How would I avoid under-sampling here?

      You don’t. You admit that you don’t have enough data and move on. What you are trying to do here is prove a negative, which is extremely hard to do. In your discussion you claim that the users complaining had no right to, however nothing in your analysis showed they were wrong. We simply don’t have enough data (yet) to say either way. When we have enough data they may be proven right or wrong, but until then, we cannot conclude either way.

      If you still insist, I recommend looking into Bayesian analysis. Theoretically at least, the posterior distribution from a Bayesian analysis can be interpreted directly and analyzed on its own merits. However, I suspect your posterior will have way too much uncertainty to reach any conclusions.
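
      To illustrate why the posterior is likely to stay wide, here is a minimal sketch using a Poisson likelihood with a conjugate Gamma prior. That model choice, the prior parameters, and the bug counts are all assumptions made up for this example, not anything taken from the article.

```python
from math import sqrt


def gamma_poisson_posterior(prior_shape, prior_rate, counts):
    """Conjugate update: Gamma(shape, rate) prior on a Poisson rate."""
    shape = prior_shape + sum(counts)   # add total observed events
    rate = prior_rate + len(counts)     # add number of observations
    mean = shape / rate
    sd = sqrt(shape) / rate
    return mean, sd


# Hypothetical weak prior with mean 1 bug per release.
prior_shape, prior_rate = 2.0, 2.0

# Hypothetical bug counts: two releases versus forty releases.
few_mean, few_sd = gamma_poisson_posterior(prior_shape, prior_rate, [1, 2])
many_mean, many_sd = gamma_poisson_posterior(prior_shape, prior_rate, [1, 2] * 20)

print(round(few_mean, 2), round(few_sd, 2))    # 1.25 0.56
print(round(many_mean, 2), round(many_sd, 2))  # 1.48 0.19
```

      With two observations the posterior standard deviation is close to half the posterior mean, so most plausible rates remain on the table; the prior is doing much of the work.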

      • logicprog 21 hours ago

        Edited that claim, and made several clarifications elsewhere. The whole point of this analysis is that outrage is unjustified on the basis of two totally statistically unremarkable releases that no one would have remarked on pre-AI (my further proof of this is that there was a pre-AI remarkably broken release, and no one did comment!) and zero positive evidence outside cherry-picked anecdotes for any negative impact. We should wait for outrage and version pinning and cancelation until there is evidence, no? I'm just trying to say that these specific releases are unremarkable, and there's no evidence at all of harm currently; I'm not trying to build any kind of predictive model for future Claude releases to say anything grander than "these specific releases are fine, what are we freaking out about?", not some claim about what Claude-exposed releases will look like or trend like in the future or in general.

        • runarberg 16 hours ago

          There is a lot more context to the outrage which is missing from your analysis. People have multiple reasons to be mad at AI usage, you mention some of them in your introduction, and you put a (statistically insignificant) measure on only one of them. In your analysis you have shown that exactly one of these reasons is anecdotal. That does not mean they are wrong, and it especially does not mean they are unjustified.

          That you found a single pre-AI release which did not cause outrage is proof of nothing. This single release is equally anecdotal, and statistically insignificant.

          So, the biggest context that is missing here is that people hate AI for various reasons, and they don’t want their favorite tools to fall victim to AI for equally many reasons. It is only natural that people who hate AI react this way when they find out their favorite tool uses AI, and doubly so when they sniff a correlation between their favorite tool’s use of AI and bugs.

          > I'm just trying to say that these specific releases are unremarkable, and there's no evidence at all of harm currently.

          Well, there is no evidence against harm either. But what you did here is a bit of a sleight of hand. In your analysis your null hypothesis is: “There is no difference in bug count between releases which include code commits from Claude Code and releases which don’t”. (You then go about doing what every psychology major is taught not to do: looking for evidence for the null hypothesis, not against it.) However, what hypothesis testing is for is to use a representative sample to generalize over a wider population. You do hypothesis testing because you want to demonstrate that your sample is representative of a wider population, and that you did not just so happen to pick, by random chance, the two samples which show the effect regardless of the experiment.

          By calculating the p-values you were telling me that you were in fact ready to make generalizing statements over a wider population of commits, but your results were statistically insignificant, so really you should not draw any conclusions from them. You have not, in fact, shown that they aren’t different from the rest of the population.

          • wzdd 6 hours ago

            > In your analysis you have shown that exactly one of these reasons is anecdotal.

            This was actually the convincing one for me though. “Did AI increase the rsync bug rate? Dunno, can’t tell yet” seems like a fine conclusion to me. Plenty of people in this thread and previous ones on the topic seem convinced one way or another, so it’s nice to see actual numbers.

            • runarberg 2 hours ago

              The numbers are statistically insignificant though. So you cannot use them to generalize over a wider population.

              I think in this era of scientific literacy people tend to overcorrect in the absence of evidence. Anecdotal evidence is still evidence though, and people are right to react to it.

              If we remove our frequentist hats and put on our Bayesian hats (which is a wise thing to do when n is very low) we can take into consideration evidence from multiple directions at the same time as we update our beliefs. A Bayesian might start with the prior that Claude-assisted commits have the same distribution as non-assisted commits. I would start with a Poisson distribution as my prior; a Bayesian would then factor in all the evidence of AI slop they have seen in their life and update their posterior accordingly. Claude Code has been wrong about so many things in the past, which should contribute to a smaller lambda than the control group.

  • xmddmx 21 hours ago

    The concept you need here is "Statistical Power".

    The ELI5 version is that there are two mistakes you can make when looking at a P value:

    Type I error, where your P value is falsely low. In the experiment being discussed here, it would lead one to conclude that AI code is worse. Otherwise known as a false positive.

    Type II error, where your P value is falsely high, leading you to conclude that AI code is no different. Otherwise known as a false negative.

    https://en.wikipedia.org/wiki/Power_(statistics)

    One can calculate statistical power for a given experimental protocol.

    My hunch is that if you did this, you would find this experiment is grossly under-powered.

    This means you can't make the "absence of evidence" claim.
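
    As a rough illustration of what such a power calculation could look like, here is a simulation sketch. Everything in it is an assumption chosen for the example: normally distributed bug rates, 30 historical releases, 2 Claude releases, a one-sided exact permutation test, and the hypothetical helper names. It is not the article's protocol.

```python
import random
from itertools import combinations


def permutation_p(claude, historical):
    """One-sided exact permutation p-value for H1: Claude mean is higher."""
    pooled = claude + historical
    observed = sum(claude)
    sums = [sum(s) for s in combinations(pooled, len(claude))]
    return sum(1 for s in sums if s >= observed) / len(sums)


def estimate_power(effect, n_claude=2, n_historical=30,
                   alpha=0.05, n_sims=500, seed=0):
    """Fraction of simulated experiments in which the test rejects at alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        # Historical releases centered at 0, Claude releases shifted by `effect`
        # (both in units of one standard deviation).
        historical = [rng.gauss(0.0, 1.0) for _ in range(n_historical)]
        claude = [rng.gauss(effect, 1.0) for _ in range(n_claude)]
        if permutation_p(claude, historical) < alpha:
            rejections += 1
    return rejections / n_sims


power_no_effect = estimate_power(0.0)  # false positive rate under the null
power_one_sd = estimate_power(1.0)     # power if Claude were a full SD worse

print(power_no_effect, power_one_sd)
```

    Under these assumptions, even a true effect of a full standard deviation should be detected well short of the conventional 80% of the time, which is the sense in which a two-release comparison is under-powered.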

    • davrosthedalek 18 hours ago

      He can't make the evidence of absence claim, but he can absolutely make the absence of evidence claim.

      • xmddmx 18 hours ago

        Perhaps in an “everyday language” way, but not in the technical, statistical sense.

        In an underpowered statistical study, a claim that two experimental conditions did not differ is not persuasive.

        • davrosthedalek 17 hours ago

          No. It's a description of the result of the maybe underpowered study. The underpowered study did not find evidence. Evidence is absent. Because it is underpowered, it's not evidence that the effect is absent.

          The claim is not "two experimental conditions did not differ". The claim is "The data do not show evidence that the experimental conditions did differ".

          • xmddmx 17 hours ago

            You say "the underpowered study did not find evidence". Not true, it found quite a bit of evidence - many statistics were presented. There is no absence of evidence. The author wrote about the evidence, presenting P values and other statistics.

            Of course the critical part is not the numbers, but what they mean.

            So, what does the evidence mean?

            The author interprets it to mean that there is no difference. They state this several times:

            "46% EXACT PERMUTATION TEST P-VALUE (ONE-SIDED, H₁: CLAUDE MEAN > HISTORICAL)[...] What this p-value tells us is There's nothing unusual about the Claude group."

            "74% ONE-SIDED P-VALUE (H₁: CLAUDE MORE LIKELY ABOVE MEDIAN) Fisher's exact test asks: if we split all releases at the historical median (0.74 sev/10c), are these Claude releases significantly buggy than previous releases (more likely to land above the median)? With a p-value of 74%, the answer is a decisive no. "

            In an under-powered study, when a P value is above your alpha level cutoff (.05, .01, whatever was chosen) you can't distinguish between "no effect" and "could be an effect, but I didn't see one".

            • davrosthedalek 17 hours ago

              Many statistics were presented. In the view of the author (and I think he is correct), none of them show evidence for an increased bug rate from Claude. That is absence of evidence (...for the increased bug rate).

              The two examples you bring are not claims of absence of evidence, but claims of evidence of absence. The author takes the result as evidence that there is no effect. As I wrote, the author shouldn't do that, because indeed you cannot distinguish between "no effect exists" and "no effect observed". But again, these are (wrong) claims for evidence of absence.

              The author can absolutely claim: I did these statistical tests, and none showed evidence that there is an effect. Absence of evidence. It's not a claim that there will never be evidence. Just that there is none from these tests.

              Edit: To convert the absence of evidence into evidence for absence, indeed you need to understand the statistical power of your test, and how it is affected by alternate hypotheses. And for that, without having done the math, having only two data points seems very thin.