> If you were a Mercor contractor and you believe your voice may already be in circulation, ORAVYS will analyze the first three suspect samples free of charge.
Awesome, if you're a victim of an AI company having your voice, you can help yourself by sending another AI company your voice!
> Audio is never used to train commercial models without explicit consent
I'm sure Mercor has explicit consent as well, legal teams are reasonably good at legally covering their asses with license terms.
Reminds me of my experience trying to delete my Airbnb account: they required scans of both sides of my ID card. I said fuck it and will never touch this company again.
This would eliminate the credit report, monitoring and fixing industry, which would be a good thing.
Court records are public in the US. If creditors want to know if you’ve been in financial trouble, they should check for bankruptcies and lawsuits, not the extrajudicial version of those that the credit reporting companies run based on hearsay.
Credit reporting is better in some ways than alternative systems of “vouching” for someone.
It’s not better in all ways, of course, but the alternative is not “everyone gets cheap credit extended to them” but rather “people who rich people know and trust get cheap credit extended to them, some others get more expensive credit, and some get no credit extended”. It’s not obvious to me that that’s better.
> Selling the solution to the problem you caused ought to be illegal.
Most tech solutions are built on the problems they created. This includes phones, cars, computers, every software upgrade, and almost every electronic gadget. You are forced to use them because the world around you is no longer compatible with the way of life that existed before these technologies were introduced.
My interpretation would be that cars are necessary to live in places where urban design assumes that we'll use cars to get around. Many cities are designed this way.
Similarly, phones are required now for some activities, like online banking. First it was an option, then it became the norm.
Per the WSJ article last week, I suspect Mercor's playing in a grey area of contracts. It wasn't just voice.[0]
A lot of people were basically wiretapping themselves AND their businesses!
While a lot of Mercor "contractors" claim Mercor over-reached with data gathering via Insightful, it's kind of smart because people are too afraid to complain too much knowing they'll not only lose their primary job, but also open themselves up to uncapped liability for willful misconduct.
The irony runs deeper than the free analysis offer. The whole Mercor contractor relationship was this exact pattern: hand over studio-quality voice recordings and ID scans to get paid for data labeling work that didn't require either. "Explicit consent" was buried in the terms, and people clicked through because they needed the paycheck.
Now 40k people have learned that biometrics aren't passwords. You can't rotate your voice.
There's this myth (from pop culture) that you end up sounding like Tom Waits.
In reality, some phlegm aside, the voice is still the same in any way that matters.
If you knew people who didn't smoke and then started (not uncommon in the 80s and 90s; quite a few people I know started smoking in university, or after the stress of a first job, some even later), and also the inverse, you can trivially hear it for yourself.
My voice is exactly the same as before I started smoking heavily, and I have never had any of the associated problems that most people seem to have (lung capacity, stamina, infections, phlegm etc) - pot luck I guess, like most things
Do you have a source for that? I can tell with pretty good accuracy whether my students smoke from their voices (adult language learners, we take smoke breaks together and they have no reason to conceal it), and would be very surprised if I’m just that lucky and there’s nothing a person can pick up on acoustically.
Do you need to calibrate it to be able to repeat it, and does that calibration change if you are at a different altitude and in different conditions, such as humidity?
Does merely changing altitude (or ambient pressure) change voice enough to be considered different by a recognition or synthesizing system?
Vocal lessons are both a lot of fun and a lot of work. I haven't been using any voiceprint systems but I know most humans are unable to tell that my trained voice is the same physical person as my old voice. Would be curious to find out if an AI voiceprint system can discern whether it's the same or not.
Are you talking about singing lessons, or actual talking training? Singing lessons helped me sing but didn't change the way i talked at all, but i was only able to afford them for a summer so maybe it takes more time than that
I'm referring to speaking, not singing. After a _lot_ of work, I can speak passably as a woman or man and switch freely between the two. Depending on context I generally choose just one for the entire conversation, as switching tends to cause whiplash in the listener (^_^).
The ability to switch mid-sentence is mostly just something I discovered I can do and is fun. But the ability to pass as my real gender is something that helps me feel safe. And being able to occasionally pass as my prior gender when needed (e.g., when calling my bank until I can change my name/gender legally) is also quite useful.
When I was in NYC a while back, I met a woman at a friend's dinner party. She sounded totally American, but was in fact Brazilian. She worked as a lawyer, and said that she'd had to get extensive voice training in order to sound American so that people would take her more seriously professionally. I have no idea if the professional part worked, but the accent, mannerisms etc were amazing - I would never have guessed.
You’ll really like this then: it’s a clip of Phil Hendrie, whom I recently discovered. He does tons of voices and sound effects; his studio has multiple microphones and he switches between them for different speakers.
Here is a clip of him when someone called his studio thinking they were the local Pizza Hut. Phil does all the other voices, including the phone system.
Also: it’s not just the first-order effects of smoking; respiratory issues, increased chance of illness, and chronic coughing can all damage your voice's presentation.
I have been telling people for years that biometrics (face, fingerprint, voice) are your username, not your password. But people are easily swayed by convenience.
> Now 40k people have learned that biometrics aren't passwords. You can't rotate your voice.
Voices aren't strong.
There just aren't that many unique characteristic parameters behind a voice - it's largely dictated by an evolutionarily shared larynx and vocal tract. They aren't fingerprints.
The fact that human voice impersonation is not only widely possible but popular should give you an indication of this. Prosody, intonation, range, etc. - it's all flexible and can be learned and duplicated.
The signals are simple too, because we have to encode and decode them quickly. You may or may not be able to picture and rotate an apple tree in your head, but you can easily read this sentence in the voice of David Attenborough.
Moreover, you can easily fine-tune a voice model to fit any other speaker. You can store the unique speaker embeddings in a very thin layer. Zero- and few-shot sampling of unseen speakers can even come close to full reproduction. You can measure this all quantitatively.
Voices are not, and never have been, fingerprints. They're just not that unique.
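The speaker-embedding point is easy to see in a toy sketch (pure Python, with random vectors standing in for real embeddings such as x-vectors; every number here is made up): verification reduces to a cosine-similarity threshold over fixed-size vectors.

```python
import math
import random

def cosine_similarity(a, b):
    """Cosine similarity between two speaker-embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def is_same_speaker(emb_a, emb_b, threshold=0.7):
    # Real verification systems do essentially this, with carefully tuned thresholds.
    return cosine_similarity(emb_a, emb_b) >= threshold

rng = random.Random(0)
enrolled = [rng.gauss(0, 1) for _ in range(192)]   # enrollment "voiceprint"
same = [x + rng.gauss(0, 0.2) for x in enrolled]   # same speaker, new session
other = [rng.gauss(0, 1) for _ in range(192)]      # a different speaker

print(is_same_speaker(enrolled, same))    # high similarity -> True
print(is_same_speaker(enrolled, other))   # near-orthogonal -> False
```

Which is also why a leaked high-quality sample is so damaging: anyone holding it can produce an embedding that clears the threshold.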
You can rotate your voice with substantial effort. Just speak differently: higher or lower pitch, a different accent. Your friends may look at you funny for the first few years.
This is an important point with biometrics that most people don't realize. When I say that biometrics aren't good security, most people are perplexed because they have seen movies and such that are high-tech where iris scans or fingerprints are the pinnacle of security.
I like to tell them this story that I read somewhere a decade or so ago. It might not be a true story (I never checked) but it's a helpful way of thinking about it.
Bob landed a great job and decided to celebrate by buying a new luxury car (a BMW in my recollection, but could be wrong) that had a thumbprint authentication for unlocking and for starting it, so you never have to carry external keys. One day a thief decided to steal Bob's car. They broke into his house and tied him up. When they demanded the keys and he said there weren't any, they decided to cut off his thumb and use it as the key. Now Bob has no thumb and his car still got stolen.
The story I remember is French police units specifically launching focused investigation on the sudden explosion of crypto people / family members getting kidnapped and having a finger or more chopped off.
I did find your story, from 2005, about a man having his finger chopped off once the thieves realized they would need his appendage every time in order to start the car.
I think "CYA" is maybe a misleading or overflowery term.
In the idealized world, the legal system is meant to provide an accessible alternative to violence for reconciling disputes, but it's increasingly wielded as an impossibly kafkaesque system meant to maintain corporate power over individuals.
I think "CYA" is an overly-flowery term for the reality that they're blocking every avenue for legal recourse, while a variety of other avenues still exist for which adding friction requires the maintenance of expensive and ongoing costs (owning multiple residences, hiring security, etc.)
(To be clear, I am advocating for a more accessible and level legal system, not for UHC-style violence.)
I'm taking some college courses, and one of them explicitly suggests to keep maybe-not-okay communications off of email so that "you don't expose your company to risks of litigation."
Ah, I see. So, when discussing ways to ensure customers cannot utilize our warranty process, I'll make sure to do so in ways that are not traceable and won't show up in discovery.
The underlying reason is that employees don't always know what they're talking about, but their nonsense could be useful to the other side in a court case.
The bigger the company, the more speculation there is about stuff people don't actually understand.
Did you go to high school? A sister of a friend of a friend says blah blah blah and everybody knows that yadda yadda. Same thing happens in big companies, especially among people who are out of the loop but wish they knew all the inside details. I see this all the time and sometimes it sounds like something that would be pretty damaging in a court case.
In other cases I have heard people who ought to know better speculating about “what if” they didn’t have to follow the letter of some corporate policy that was rooted in risk avoidance. Again, it looks bad but it doesn’t mean anything concrete (except that the person might have iffy judgment).
I said this based on my years of working at companies on projects specifically to do things like delete all data as soon as it was legally permissible so it could never come up in court again.
And most of my “let’s take this offline” chats have led to discussions around doing illegal shit.
Hell, I had one manager give me handwritten code on paper and instructions to commit it under my name. The code in question would cause sales to go through without the discounts presented to customers because the discount service was buggy and his metrics were based on successfully completed sales. Even threatened to fire me when I said no, and only backed down when I put the paper in my pocket and asked if he would like for anyone else outside the room to see it or if he would not use me as a fall guy.
If your employees “don’t know what they’re talking about” then either they are not representative of the company's views and have no power to enact illegal policies for the company, or they do and you don’t have controls. Trying to hide that shit by default means you don’t get the benefit of the doubt, like you are giving them.
Sorry, I shouldn't have said it that way. What I meant was "remember high school?" Reading back I think it could look like "are you someone who didn't go to high school, because you sound uneducated?" Not my intent, and not something I would say.
The situations you describe are not what I have experienced, which I guess makes me lucky.
My point was that in discovery, the idle chatter of know-nothings looks bad. But if there are companies that really have something to hide, well I guess that's what discovery is for. And as for your manager pal, if someone did that I'd be looking for work that very afternoon.
>Sorry, I shouldn't have said it that way. What I meant was "remember high school?" Reading back I think it could look like "are you someone who didn't go to high school, because you sound uneducated?" Not my intent, and not something I would say.
apology accepted and I rescind my insult.
> My point was that in discovery, the idle chatter of know-nothings looks bad. But if there are companies that really have something to hide, well I guess that's what discovery is for. And as for your manager pal, if someone did that I'd be looking for work that very afternoon.
I did, switched jobs a few weeks after that. Did keep the paper and let him know I still had it, just to fuck with him during those weeks, however.
> The situations you describe are not what I have experienced, which I guess makes me lucky.
It may be the opposite and I was just unlucky, but I have run into multiple situations with companies making 100s of millions to billions a year where that sort of behavior occurred, so if people are being trained to hide unfortunate conversations then I am going to assume the worst barring large amounts of contrary evidence.
This is just companies fighting back against the ever-expanding powers of state surveillance.
Back when the relevant laws were written, most communication was oral and in-person, and writing was reserved for the "important stuff". We now apply laws that were designed for memos to messages on Slack, which are more like conversations than permanent documents.
That makes a lot of sense to me, thank you. I was probably projecting a lot of my own fears and feelings into the interpretation of a lot of what some of my courses are trying to teach me.
The general rule for email, text, and all other communications I've heard is: "Don't write anything that you wouldn't be comfortable seeing on the front page of the New York Times."
Heard that first from a US mil commander who once ran for a minor political office like state rep.
I’ve also been told to preface all of my written communications with “dear lawyers and the FDA” at a job. Not that we did anything illegal, but sometimes you catch yourself writing statements that would be really easy to misconstrue.
> In the idealized world, the legal system is meant to provide an accessible alternative to violence for reconciling disputes, but it's increasingly wielded as an impossibly kafkaesque system meant to maintain corporate power over individuals.
This is an overly flowery way of saying: violence.
The worst of the consequences are the same. In the worst cases, people end up dead, destitute, and/or with long-term health consequences, unable to enjoy the fruits of their labor. In the milder cases I think I'd prefer a bruise for a week to a huge financial loss.
There are plenty of nonviolent extralegal options, ranging from sit-ins and protests, to destruction of property, to many examples in the CIA's subtle-sabotage field guide, like running meetings poorly.
They're saying due to the real world effects, the current system isn't meaningfully different from violence. They aren't advocating for violence in turn.
This reminds me of all the new companies that want to "help" you get your public information out of $CORPORATE hands, as if these companies will somehow not succumb to either enshittification or breach.
The good thing about the grift economy is it grifts itself, like the turtles!
I remember an AI dataset tool asking candidates to record a 1 minute self intro video for interview purposes in 2022. I was wondering if they were manually watching all of them.
Author here. Wrote this after watching Lapsus$ post the Mercor archive on their leak site earlier this month. The thing that struck me is the combination: voice samples paired with ID document scans. Most breaches leak one or the other. This one ships a deepfake-ready kit. Tried to keep the writeup practical: what an attacker can actually do with this combo (banking voiceprint bypass, Arup-style video calls, insurance fraud), and a 5-step checklist for the contractors who were in the dump.
Happy to discuss the forensic detection side: AudioSeal watermarks, AASIST anti-spoofing, and how the detection landscape changes once voice biometrics start leaking at scale.
Interesting - thanks for the rabbit hole today. ;)
Mercor hasn't released many public statements about the incident. Social media posts aren't necessarily public, but I did find this breach notification sample filed with CA: https://oag.ca.gov/ecrime/databreach/reports/sb24-621099 . I guess we'll see if our legislators finally take data privacy seriously.
HSBC offered voice verification years ago and I just laughed and said nope.
I don’t even use biometrics on apple devices, I use a 6 digit pin.
It was always a stupid idea.
The thing about being willing to trade convenience for security is that you get called paranoid, and then when the other shoe does drop, you still get called paranoid for the current thing you're not doing that “everyone does”.
Assuming Apple is truthful on this matter (so far it seems so), Apple devices store a mathematical representation of the data, not the data itself (i.e. not a picture of your finger) and keep it only on device on a special hardware section designed for extra security. When apps ask for authentication, they can never inspect the data, they can only ask “does this match?”.
Even if you were somehow able to exfiltrate the data and find some way to transform it for something nefarious, you’d still need to first attack and bypass a specific hardware feature of the target’s device.
So sure, not having any representation of the data anywhere is technically more secure (maybe, as typing your code could be intercepted by a shoulder surfer or a camera), but biometrics on Apple devices are fundamentally not the same as having your raw data available on a random server somewhere.
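The "does this match?" boundary described above can be sketched in a few lines. This is a deliberately toy model: real Secure Enclave matching is fuzzy template comparison in dedicated hardware, not exact bytes, and every name here is hypothetical. The point is only the API shape: the template never leaves the object, and callers only ever get a yes/no answer.

```python
import hashlib
import hmac
import os

class ToyEnclave:
    """Conceptual sketch of the 'match-only' pattern.

    The stored template never leaves this object; callers can only
    ask whether a candidate matches. (Real hardware enclaves are far
    more involved, and real biometric matching is fuzzy, not exact.)
    """

    def __init__(self, biometric_template: bytes):
        self._key = os.urandom(32)  # device-local secret, never exported
        self._stored = hmac.new(self._key, biometric_template,
                                hashlib.sha256).digest()

    def matches(self, candidate: bytes) -> bool:
        probe = hmac.new(self._key, candidate, hashlib.sha256).digest()
        return hmac.compare_digest(self._stored, probe)

enclave = ToyEnclave(b"alice-fingerprint-features")
print(enclave.matches(b"alice-fingerprint-features"))  # True
print(enclave.matches(b"mallory-fingerprint"))         # False
```

Contrast this with a server-side voiceprint database, where the raw template itself is what leaks.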
Also, given how many times you enter a 6-digit number over a day, it's absolutely trivial to steal it. Let alone basic patterns people use, smudges etc.
In the use case of a mobile phone, apple's face id absolutely improves security several-fold.
> Self-audit your public audio footprint. Search YouTube, podcast directories, and old Zoom recording
This is suggestion #1 on your list of remediation steps for victims, but you didn't provide any information on how anyone would actually do that. How exactly would I search the internet for copies of my voice?
Please don't tell me the solution is giving an embedding of my voice to another third party.
Great question. There's no "reverse voice search" yet the way there is for images — that's genuinely a tool the world needs. In the meantime, the most useful thing is searching your name across YouTube and podcast platforms to map out what's already public. And for Mercor contractors specifically, the California AG breach notice gives you a solid legal basis to request full deletion. Worth doing today.
Note, this comment and your other one (https://news.ycombinator.com/item?id=47931838) were autokilled by HN, because it (rightly) detected that you're using AI to write your comments. I vouched this one to unkill it before I realized it was AI and supposed to be dead. I unvouched it, but your comment's still alive. So now I'm leaving a note saying mea culpa, and to suggest not using AI in your comments unless you want to be autokilled.
One more data point for why suing companies should also lead to the CEO getting prison time. And ideally we'd invent some kind of equivalent of prison for non-human persons like organisations.
Because right now the incentives to do what's right are so low. Taking risks with other people's lives is becoming the norm for companies.
I miss the pre-LLM days when you could make a decent argument that having any unnecessary data was just a liability. Now all anybody thinks is “more data for the AI!”
We have thumb drives that can store petabytes of data?
Or did you mean the "big data" crowd which thought 500GB was noteworthy? I don't think anyone took those seriously, neither in the 2010s nor now. That was always "small" data.
Most companies using the term "big data" had datasets in the TB range. One company I had a gig at had a full Hadoop cluster set up, and their whole dataset was 40GB. Their marketing had all the big-data-adjacent keywords over the brochures for clients.
To some degree, IMO, big data is still a mindset: it starts when a normal SQL query might take a day to process your data. Some tech doesn't scale to the data size for all use cases, and you need different solutions.
Hell you mean a decade ago? I still see businesses running losses left right and center saying that they're gonna monetize user data, any day now.
Related "monetizing user data" seems to just mean ads. Ads on everything, forever, until the userbase gets fed up and moves to a new service that definitely won't do that, and the cycle repeats about every 3 years.
I see this whenever an LLM’s impact is assessed. We know. The issue is scale and the ability for smaller and smaller groups (down to individuals) to execute at scale.
Fake news always existed. Now one dude in India can flood multiple sock puppet media accounts with right wing content/images (actual example) at a scale previously unimaginable.
Yes. This is pretty well established. Neural networks in general are considerably less sample-efficient than traditional ML methods. The reason they became so successful is that they scale better as you increase training data and model size. But only with modern compute power they became useful outside of academic toy model applications.
That’s not the issue I’m hitting here primarily but yes.
My concern is that I can open up chatGPT and even with a free, “anonymous” account run an assembly line generating tens of thousands of words a day to pump to Twitter that are good enough to prop up multiple fake accounts and cause mayhem.
Now make it thousands of people like me doing it. Now add funding and political orgs. Add company leadership that turns a blind eye so long as it drives engagement. This scale and pipeline wasn’t possible 5 years ago, even if we clearly see the throughline.
I’m not even getting into fake images either. That used to require some know how. There are basically no hurdles and even if most people learn it’s fake, millions likely won’t. If you’re a little lucky, less scrupulous “news” outlets will amplify it for you as well for free.
Unfortunately the answer is usually people just want to hand wave away the critique for one reason or another. “People already do that” is an easy truism for stifling discussion.
> Now one dude in India can flood multiple sock puppet media accounts with right wing content/images (actual example) at a scale previously unimaginable.
I have the faintest possible hope that such things are going to be the death knell of social media. Yeah, a lot of credulous idiots are happily giving AI thirst traps their money for stroking their confirmation bias, but that's just who's left at this point.

It feels like every social media app I use is gradually bleeding users who aren't hopelessly addicted to the dopamine treadmill, because what's left is just plain unappealing to them, which selects for the people who are most vulnerable to AI shit, which is far from ideal, but also means those platforms are comprised ever more of that vulnerable population and nobody else.

And the problem with all these businesses going through that is without a diverse, growing audience, you just become InfoWars, slinging the same slop to the same people every day, and every ounce of said slop is great for what's left of your audience, but absolute garbage for getting anyone new in it. And it just goes on that way until you sputter out and die (or harass the wrong group of parents I guess).
I wish all social media sites a very haha die in a fire.
Mate, you're on a social media site right now that often has AI-generated content displayed at the top of what's "trending". Sure, the general user base does a better job here flagging that sort of stuff, as AI seems to be a shared interest in much of the community, but it still sneaks its way by.
You’re technically right but I think we can all agree HN is significantly different from the major players. The vast majority of us see the same posts and comments, for starters. The churn of posts is also much slower. You log on 2-3 times spread out in a day and you see 90% of the main posts. Top posts linger for 24-48hrs regularly.
No media uploading, memes are few and far between (usually punished), etc.
10+ years ago companies were hoovering up data for ML - trying to find correlations in high-dimensionality data. Mostly the results were garbage but occasionally you hit on a real, unexpected phenomenon.
Nowadays you just throw all the data into a black box and believe whatever it says blindly.
Data can never be stolen, because it is not a physical thing. Data can be copied, and it can be erased - sometimes both happens at the same time. Data can be lost, that is when its last existing copy was erased.
Pedantic and relevant. If they lost the voice samples, they wouldn't have it for training new models. If they were copied, then they have lost nothing in terms of training.
I don't know if it's the reason you imply. In the 70s, there were big debates in Germany about privacy and data storage. They spoke of one's data shadow (Datenschatten). I suspect this word comes from that tradition. The reason the word exists would then be the reckoning (Vergangenheitsbewältigung) with WW2.
If only our past 20 year old self data could be so ephemeral…
Who doesn’t want that old post going extinct forever when they were shit faced outside of a bar in Nashville but now they are in their mid-life and are “respectable” members of society.
The West-German debate in the 70s came from the realization that the sheer size of the Holocaust/Shoah was in no small degree due to bureaucratic record keeping. Storing someone's ethnicity is potentially dangerous for that person.
There's also the other implication that the (East) Germans were under Soviet rule just 35 years ago.
But yes. We Americans know Germans more for their silly big words. But statements like that can be misinterpreted, as the German perspective on themselves doesn't quite match the American stereotypes.
My understanding was that it was more that words can be concatenated into new words in German, which is not so much a stereotype as a misunderstanding of fact. I.e., you wouldn't think much about something like enjoyable-comeuppance, but schadenfreude looks more impressive without the hyphen.
I would argue it's not the exact same thing. Sure, when overdone you would get the same effect. But as it is, commonly used concatenated words are words, not just hyphenated phrases. They are used as words, and without extra thought people don't parse them into separate parts, the way they do with a list of hyphenated words.
E.g. you don't think of firefighter as fire-fighter in ordinary usage.
Yeah, so Germany had a ton of secret police files and of course learned very well what happens when a bunch of people start collecting dossiers.
So yeah, of course they've developed that type of distrust. Americans should have too, after the 50s-60s paranoia of the Red Scare, Black people, etc. Instead they just spent a few decades building an anti-social state.
If you had a company, why not just tell all customers that their data is safe but not spend any money on security at all: in case of a breach, just write an apology email to your clients, promise a full investigation, and move on.
Obviously, you don't have to face any legal consequences, so why worry?
Sorry for the rant... but I just find this lack of liability frustrating.
I like this. I'm genuinely curious whether you could create a Delve [0] for security. Companies could pay for the "security review and package and dashboard" virtue signal, put an impressively-secure-looking logo on their site, and effectively whitewash needing to do anything else. I suspect a sufficiently expensive law firm could draft the requisite legals to shield SecCo's principals from the eventual unveiling, but not before SecCo could make hundreds of millions and the rest of the industry could save hundreds of millions on their shit-as-fuck security practices anyway. Call a spade a spade.
So, they should all just rotate their voices ... right?
I jest but the majority of the "normal" people I know are happy to hand over biometrics because _it's easier_. We need to start branding biometrics as "forever passwords" or something to help people understand just what they're handing over when they validate access to their checking account or enter Disney World or whatever else.
One of the problems is that "forever passwords" is a term used positively when I worked in banking, as it was a password that the customer could not forget and would not need support using.
So I could easily see a lot of people viewing this as a positive.
That's a really good point. It lays bare some of my biases when it comes to thinking about and communicating with "normal people" about this sort of thing.
the "it's easier" people operate in a fundamentally different way than you or I. They thrive in the world of plausible deniability and social trust. They almost don't care what happens to them as long as it isn't their fault. And they do not consider putting themselves at risk to be the same as being at fault.
In a certain light, it's kind of admirable. They live like the world is the way it should be.
Functionally, biometrics are closer to a username than a password.
Fingerprints, DNA, iris scans, gait patterns, etc. are all something you can't change (much like a permanent account ID) and are constantly being presented to the world (much like an email address). In addition under US law, police can compel presentation of fingerprints, but passwords are protected under the 5th amendment.
That's fair. Though, thinking about it this way, I'd argue they're even more like a permanent API key. Again, messaging completely lost on people who don't spend time worrying these things.
Mercor had a SOC 2, an MSA, all the right clauses. Voices still leaked. The apology email writes itself.
Why is voice and biometric stuff still server-side at all in 2026? Whisper.cpp runs on a phone. WebGPU works. Half these "we keep your voice secure" pipelines could run in the browser today.
The real reason isn't capability. It's cost. Centralised compute is cheaper to run, but that math only holds if you don't price in the periodic breach. Which nobody does until it's their own employees on the leak list.
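That pricing argument is just an expected-value calculation. With made-up numbers (every figure below is hypothetical), the cheaper-looking centralized option loses once breach risk is priced in:

```python
def annualized_cost(infra_cost, breach_prob, breach_cost):
    """Expected yearly cost = steady infra spend + expected breach loss."""
    return infra_cost + breach_prob * breach_cost

# Hypothetical numbers: centralized compute is cheaper to operate,
# but concentrating everyone's biometrics raises breach probability.
central = annualized_cost(infra_cost=200_000,
                          breach_prob=0.05,
                          breach_cost=10_000_000)
on_device = annualized_cost(infra_cost=450_000,
                            breach_prob=0.001,
                            breach_cost=10_000_000)

print(central)    # 700000.0
print(on_device)  # 460000.0
```

The comparison obviously flips with different inputs; the point is only that the "centralized is cheaper" claim depends on quietly setting the breach term to zero.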
Man that’s pretty shitty that Mercor tricked 40k contractors, and then did a poor job of securing their data. There should be stronger consequences for stuff like this.
What happens now is that a lot of clueless CTOs who didn't know about this company now know its name. So the outcome of this mess is probably more business for Mercor.
I mean, just look at what happened to Crowdstrike....
I was floating near some ex-agency and GS-15 folks yesterday in Houston; they explained to me that the Israeli cybersecurity apparatus has had everyone's voicemails for the last 20 years because they inserted themselves into the supply chain of voicemails somehow or another.
Kind of nuts all the ways audio data can be used now.
I wonder how many of the current text-to-speech ML models have large parts of leaked or "stolen" data in their training data? Almost none of the TTS releases seem to talk about exactly where they get their training data from, for some reason. I also wonder if we'll see an explosion in SOTA TTS in ~6 months from now.
Not really, Mozilla Common Voice (the ImageNet of speech) is larger than this. Their English database has 3814 hours, 1.6 million sentences, from 100k speakers.
GOOG-411 was "competing" with a strong company (1-800-FREE411) by serving no ads in a category worth ~$3.5B at the time. It was inexplicable at the time, but they did this to get voice samples, way back when. For reasons like that, I expect this category of training data is already baked in, but I don't have current domain knowledge, fwiw.
If this is real, the bigger issue might not even be the leak itself. It could be that we are quietly moving into a world where voice plus ID is enough to fully impersonate someone, and most systems are still not built for that reality.
There is also an ugly labor story here. The people labeling and training these systems are often the least protected when the data pipeline itself turns into the attack surface.
>Set up a verbal codeword with family and finance contacts. Pick a phrase that has never been spoken on a recording and never typed in chat. Brief the people who handle money on your behalf. If a call ever asks for a transfer, the codeword is mandatory.
good luck with this. most finance people deal with hundreds to thousands of clients. they obviously can't remember everyone's code word. commonly used finance systems aren't set up to securely store these codewords. they don't have processes or policies in place to implement or adhere to any sort of codeword verification.
>Rotate where voiceprints are still in use. [...] Do that now, ideally from a new recording in a different acoustic environment than the leaked sample.
would this even have an effect? i have never heard of "rotating" a voice print. isn't the whole point of a voice print that you can't really change it? if simply switching your environment completely changes your voice print, that would make voice prints utterly useless to begin with.
Yeah, seems like nonsense advice. Have a code word that was never recorded? I don't see how that would stop anything. The whole point of these systems is that they can say stuff you never said, convincingly.
The idea is that the attacker doesn't know the codeword. If the attacker finds out about the codeword then the attacker could indeed fake it. Hence why you shouldn't say/write it in recordings or chat messages.
Someone who has hundreds or thousands of clients presumably couldn't remember every client's voice either, so no meaningful security is lost. They are approximately as secure or insecure as before.
>presumably couldn't remember every client's voice either, so no meaningful security is lost
there are automated systems for this already. my bank, isp, etc. use them when you call in to skip the traditional verification steps. this fact is also highlighted in the article.
the problem is that there isn't typically a system in place for setting up or validating code words, so the advice given is not practical to implement.
With most US banks, you can ask them to put a note on your account file for a code word; it will show up anytime the account file is pulled up. Now, whether or not a customer service agent will know to check it is another question. Maybe as attack vectors like this are utilized more often, it will become part of their SOP. Or just stop using voice verification. In my experience, even if you pass voice verification, it only grants access to check the account balance and transactions, and anything more still requires a PIN or a code sent to the app or phone number. There are attack vectors for these as well, but they're not guaranteed.
The other use cases (like calling payroll, etc) likely don’t have the same protections and probably would be more effective.
The biometric pairing is what makes this particularly bad. A leaked password is recoverable. A leaked voiceprint combined with ID scans is permanent; you cannot rotate your voice.
The deeper problem is that most of these companies collected this data because they could, not because they needed it for the core service. 'Datensparsamkeit' is the right frame: the voice samples were a liability sitting on a server waiting for exactly this.
I'm pretty sure Google and Apple already have some decent examples of a LOT of people's voices in concert with other data collation. Google Voice IIRC was bought for audio sampling voicemail in the first place. Not sure if Apple has done similar, but would be more surprised if they didn't... Let alone the voice search options for both.
I've been doing similar things on a different platform because, as a uni student, the pay is kinda nice, but I limit myself to tasks without voice/video, with just input from mouse/keyboard, doing reinforcement learning/data tagging. No way I'm trusting these companies or the companies they contract the work with.
Is this post not just an ad for a vibe coded site / product? It adds no new info on the mercor breach and advertises something which I presume has even worse safety practices
I'm curious: if I create an online sample from my voice, might this make it a lot harder for an AI model to identify me if every training dataset contains my particular voice sample?
Isn’t this going to immediately become daily news?
Half the time I call a company they say “we are recording your voice for security / authentication purposes”.
The companies that do that have all the information on me that they require for me to set up an account, so their data breaches will be just like this one, but 1000x larger.
Can we just fast forward through the part where this works for ID theft, past the firefox age verification plugin that uses these datasets, and even through the part where people in the plugin dataset are digital outcasts (this voice has been used too many times. Want to try another?)
At the end of this dark predictable tunnel, maybe there will be a ban on biometrics for important stuff, a repeal of the age verification laws, and actual privacy legislation with teeth.
Where I live there was a common scam to manipulate voice recordings from phone calls. I was very careful back then with phone calls when I ran my own business. Like 15 years ago. Kinda crazy that any service would use voice recognition today as stated.
I'm the founder of a company that runs deepfake phishing simulations for enterprises, so I'm biased on this one. But the operational thing the piece misses is that this is the first widely circulated dump where the voice, government ID, and selfie all came from the same onboarding session; most enterprise call center auth still treats those as three independent factors.
The scarier piece is that an attacker pulls a contractor from the dump, finds their employer on LinkedIn, then calls that company's IT helpdesk for a password reset with the cloned voice.
Great point about the helpdesk vector. The LinkedIn-to-IT-reset path is a brilliant illustration of how social engineering chains work. And you're right that audio is the frontier: video deepfake detection has gotten really good, with lots of great tools out there. Audio is the next wave, and the teams building solutions for real-world call quality are going to unlock a massive market. Exciting space to be in.
This kind of event is the best argument against needless data hoarding. But it would help if the law better provided for some kind of consequences for negligence.
40k people are not under threat. I am getting AI contractor job offers every month on UpWork; I am glad I haven't accepted more than one, as it is just not worth doing.
You could have seen this coming a mile away. So far I have gotten away with never uploading my ID and/or interacting with one of those companies (though one idiot working for some VC thought it was OK to sign a document on my behalf by uploading my signature! never mind a bit of fraud), but it is getting harder and harder. Banks and in some cases even governments forcing you to send data to these operators is a very bad idea. But hey, who ever got hurt by some security theater?
I had to open a bank account for a company here a few years ago, right on the bubble of this happening, and they still had an option to come by in person with the proper documentation, which I did. Now it is all outsourced.
These companies are the fattest targets and they're run by incompetents. You should assume that anything you give them will eventually be part of some hack.
Tell us more about that fraud story! Was the person your attorney or accountant? Or just some "smart" person who decided to wisely save time by doing fraud?
It was a fund administrator. I still find it unbelievable that they would so casually do this. And yes, they thought they were very smart... and helpful too...
Because historically that's how it worked, but officials just looked at the document and verified that it was the real thing. Then photocopiers came along and it became normalized to take copies of the documents. Then digital copies happened, and coupled with networking technology that changed things completely. What the officials in charge don't seem to understand is that by making digital copies in networked environments, the IDs themselves lost their value completely: after all, if the digital copy serves any purpose at all as a stand-in for the original, then it has become the original.
It feels like a dead end because it's being used wrong. "Is this John's voice?" is the wrong question. "Does this call look like how John normally calls?" is way more interesting. Same device, same time of day, same way of starting a sentence. That whole pattern is much harder to fake than a voice alone. Authentication isn't dead, it just needs to grow up from a single check into a full picture.
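A toy version of that "full picture" scoring, with made-up signal names and weights purely for illustration (a real system would learn these from historical call data):

```python
# Hypothetical signals and weights; illustrative only.
SIGNALS = {
    "known_device": 0.35,      # call comes from a device/number seen before
    "typical_hours": 0.15,     # within the caller's usual calling window
    "typical_greeting": 0.20,  # opening phrasing matches past calls
    "voice_match": 0.30,       # voiceprint similarity - deliberately NOT sufficient alone
}

def risk_score(observed: dict[str, bool]) -> float:
    """Return 0.0 (every signal matches) .. 1.0 (nothing matches)."""
    matched = sum(w for name, w in SIGNALS.items() if observed.get(name, False))
    return round(1.0 - matched, 2)

def decision(observed: dict[str, bool], step_up_threshold: float = 0.4) -> str:
    # Above the threshold, require an out-of-band factor before anything sensitive.
    return "step-up" if risk_score(observed) >= step_up_threshold else "allow"
```

Note the design choice: a perfect voice match on its own (score 0.7) still triggers a step-up, which is exactly the property a cloned voice defeats in single-check systems.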
Mercor is the most scummy company out there, run by a bunch of sleazeball 20 somethings who are getting a lot of press as the youngest billionaires in the making.
Fidelity seemed to sign you up for this when you called them on the phone almost automatically. Ridiculous since it was defeated easily in a hacker movie from the 1990s using a tape recorder.
they literally handed over their voice, their face, and their government ID to train AI models for peanuts - and now Lapsus$ is sitting on 4TB of 'you' that you can never change like a password
> If you were a Mercor contractor and you believe your voice may already be in circulation, ORAVYS will analyze the first three suspect samples free of charge.
Awesome, if you're a victim of an AI company having your voice, you can help yourself by sending another AI company your voice!
> Audio is never used to train commercial models without explicit consent
I'm sure Mercor has explicit consent as well, legal teams are reasonably good at legally covering their asses with license terms.
Reminds me of my experience when trying to remove my Airbnb account: they required ID card scans of both sides. I said fuck it and never touched this company again.
Quickly, extract some more money from this customer and hold their data hostage!
This reminds me of those identity theft settlements, where you need to prove your identity to claim the reward
Has your identity been stolen? Try our free credit monitoring for a month!
Selling the solution to the problem you caused ought to be illegal.
This would eliminate the credit report, monitoring and fixing industry, which would be a good thing.
Court records are public in the US. If creditors want to know if you’ve been in financial trouble, they should check for bankruptcies and lawsuits, not the extrajudicial version of those that the credit reporting companies run based on hearsay.
Credit reporting is better in some ways than alternative systems of “vouching” for someone.
It’s not better in all ways, of course, but the alternative is not “everyone gets cheap credit extended to them” but rather “people who rich people know and trust get cheap credit extended to them, some others get more expensive credit, and some get no credit extended”. It’s not obvious to me that that’s better.
> Selling the solution to the problem you caused ought to be illegal.
Most tech solutions are built on the problems they created. This includes phones, cars, computers, every software upgrade, and almost every electronic gadget. You are forced to use them because the world around you is no longer compatible with the way of life that existed before these technologies were introduced.
I probably agree with you but what on earth are phones and cars doing in this list? They solve obvious physical problems not caused by a company.
My interpretation would be that cars are necessary to live in places where urban design assumes that we'll use cars to get around. Many cities are designed this way.
Similarly, phones are required now for some activities, like online banking. First it was an option, then it became the norm.
Exactly.
General Motors contributed significantly to the decline of passenger rail in the USA.
See https://en.wikipedia.org/wiki/General_Motors_streetcar_consp...
Per the WSJ article last week, I suspect Mercor's playing in a grey area of contracts. It wasn't just voice.[0]
A lot of people were basically wiretapping themselves AND their businesses!
While a lot of Mercor "contractors" claim Mercor over-reached with data gathering via Insightful, it's kind of smart because people are too afraid to complain too much knowing they'll not only lose their primary job, but also open themselves up to uncapped liability for willful misconduct.
[0] https://www.wsj.com/tech/ai/mercor-ai-startup-personal-data-...
The irony runs deeper than the free analysis offer. The whole Mercor contractor relationship was this exact pattern: hand over studio-quality voice recordings and ID scans to get paid for data labeling work that didn't require either. "Explicit consent" was buried in the terms, and people clicked through because they needed the paycheck.
Now 40k people have learned that biometrics aren't passwords. You can't rotate your voice.
> biometrics aren't passwords. You can't rotate your voice.
"My voice is my passport. Verify me."
I have to renew my passport every 10 years or so. How do I do that with my voice? I guess it's time to take some vocal lessons.
just take up smoking heavily
Despite popular belief, even heavy smoking does not alter your voice in a significant way.
Depends on what you're smoking
and mostly how.
bacon!
> Despite popular belief, even heavy smoking does not alter your voice in a significant way.
I guess you don't listen to Sinatra.
Or John Mellencamp, who repeatedly states in interviews that he likes what smoking does to his singing voice.
Source: it came to me in a dream.
There's this myth (that came to you in pop culture) that you end up sounding like Tom Waits.
In reality, some phlegm aside, their voice is still the same in any way that matters.
If you knew people who didn't smoke and then started (not uncommon in the 80s and 90s; quite a few people I know started smoking in university, or after the stress of a first job, some even later), or the inverse, you could trivially hear the difference for yourself.
My voice is exactly the same as before I started smoking heavily, and I have never had any of the associated problems that most people seem to have (lung capacity, stamina, infections, phlegm etc) - pot luck I guess, like most things
Do you have a source for that? I can tell with pretty good accuracy whether my students smoke from their voices (adult language learners, we take smoke breaks together and they have no reason to conceal it), and would be very surprised if I’m just that lucky and there’s nothing a person can pick up on acoustically.
20 years of heavy smoking :)
Although it does seem to affect some people more than others for sure, I guess it depends how and what you're smoking.
Easier to inhale an undisclosed amount of helium before recording your voice password
Excellent idea!
Do you need to calibrate it to be able to repeat it, and does that calibration change if you are at a different altitude and in different conditions, such as humidity?
Does merely changing altitude (or ambient pressure) change voice enough to be considered different by a recognition or synthesizing system?
I recommend sulfur hexafluoride for something harder to replicate. Nothing like making hackers risk their life to impersonate you
Or skip the half measures and go straight for the dioxygen difluoride.
https://www.science.org/content/blog-post/things-i-won-t-wor...
Reminds me of the Interrail data breach [https://stateofsurveillance.org/news/eurail-data-breach-3080...]
The fediverse take on that was "customers are advised to rotate their faces and birthdays."
Vocal lessons are both a lot of fun and a lot of work. I haven't been using any voiceprint systems but I know most humans are unable to tell that my trained voice is the same physical person as my old voice. Would be curious to find out if an AI voiceprint system can discern whether it's the same or not.
Are you talking about singing lessons, or actual talking training? Singing lessons helped me sing but didn't change the way i talked at all, but i was only able to afford them for a summer so maybe it takes more time than that
I'm referring to speaking, not singing. After a _lot_ of work, I can speak passably as a woman or man and switch freely between the two. Depending on context I generally choose just one for the entire conversation, as switching tends to cause whiplash in the listener (^_^).
I'm curious as to what prompted you to pursue this ability.
There is a common enough need for this for some
I'm trans.
The ability to switch mid-sentence is mostly just something I discovered I can do and is fun. But the ability to pass as my real gender is something that helps me feel safe. And when needed, being able to occasionally pass as my prior gender (e.g., when calling my bank until I can change my name/gender legally), it also quite useful.
Does the f0 change? Or does the power distribution of the harmonics change? Or is it something else?
When I was in NYC a while back, I met a woman at a friend's dinner party. She sounded totally American, but was in fact Brazilian. She worked as a lawyer, and said that she'd had to get extensive voice training in order to sound American so that people would take her more seriously professionally. I have no idea if the professional part worked, but the accent, mannerisms etc was amazing - I would never have guessed.
I always answer my likely spam calls in a weird high pitched fake voice just in case.
You’ll really like this then. It’s a clip of Phil Hendrie, who I recently discovered. He does tons of voices and sound effects; his studio has multiple microphones and he switches between them for different speakers.
Here is a clip of him when someone called his studio thinking they were the local Pizza Hut. Phil does all the other voices, including the phone system.
https://share.google/QHNkgsOdvGj7tapfk
>> "My voice is my passport. Verify me."
Well met, fellow Uplinker!!
>Well met, fellow Uplinker!!
I'm pretty sure this person worked at Playtronics.
Smoke 40 cigarettes a day, your voice will be unrecognisable in no time
Also: it’s not just the first-order effects of smoking; respiratory issues, increased chance of illness, and chronic coughing can damage your voice's presentation.
Biometrics are "what you are", not "what you know" or "what you have".
Voice fingerprinting is essentially useless because it is easily recorded and reproduced.
I have been telling people for years that biometrics (face, fingerprint, voice) is your username, not your password. But people are easily swayed by convenience.
If your user name is tattooed on your forehead, yes.
> Now 40k people have learned that biometrics aren't passwords. You can't rotate your voice.
Voices aren't strong.
There just aren't that many unique characteristic parameters behind a voice - it's largely dictated by an evolutionarily shared larynx and vocal tract. Voices aren't fingerprints.
The fact that human voice impersonation is not only widely possible but popular should give you an indication of this. Prosody, intonation, range, etc. - it's all flexible and can be learned and duplicated.
The signals are simple too, because we have to encode and decode them quickly. You may or may not be able to picture and rotate an apple tree in your head, but you can easily read this sentence in the voice of David Attenborough.
Moreover, you can easily fine tune a voice model to fit any other speaker. You can store the unique speaker embeddings in a very thin layer. Zero and few shot unseen sampling can even come close to full reproduction. You can measure this all quantitatively.
Voices are not, and never have been, fingerprints. They're just not that unique.
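A sketch of how thin that speaker-embedding layer really is. These are toy 4-dimensional vectors (real systems use a few hundred dimensions, still tiny relative to the model), invented purely to illustrate the point that a new session from the same speaker and a good clone can land equally close to the enrolled vector:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Standard cosine similarity: dot product over the product of norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy embeddings for illustration only.
alice_enrolled = [0.90, 0.10, 0.30, 0.20]
alice_today    = [0.85, 0.15, 0.28, 0.22]   # same speaker, new session
clone_of_alice = [0.88, 0.12, 0.31, 0.19]   # a decent clone lands just as close

same_speaker = cosine_similarity(alice_enrolled, alice_today)
cloned       = cosine_similarity(alice_enrolled, clone_of_alice)
# Both scores sit above any usable acceptance threshold - which is the whole problem.
```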
> Now 40k people have learned that biometrics aren't passwords. You can't rotate your voice.
The problem is that even if you know that, you still get bombarded by banking apps promising "biometrics are more secure than passwords, switch now!"
I doubt 1% of the 40k will learn anything.
also, this took me way too long to realize it had nothing to do with Warhammer.
This comment is pure LLM.
I feel like we're right on the threshold where we give up and start interacting with slop like it's human written.
You can rotate your voice with substantial effort. Just speak differently: higher or lower pitch, a different accent. Your friends may look at you funny for the first few years.
This is an important point with biometrics that most people don't realize. When I say that biometrics aren't good security, most people are perplexed because they have seen movies and such that are high-tech where iris scans or fingerprints are the pinnacle of security.
I like to tell them this story that I read somewhere a decade or so ago. It might not be a true story (I never checked) but it's a helpful way of thinking about it.
Bob landed a great job and decided to celebrate by buying a new luxury car (a BMW in my recollection, but could be wrong) that had thumbprint authentication for unlocking and for starting it, so you never have to carry external keys. One day a thief decided to steal Bob's car. They broke into his house and tied him up. When they demanded the keys and he said there weren't any, they decided to cut off his thumb and use it as the key. Now Bob has no thumb and his car still got stolen.
The story I remember is French police units launching a focused investigation into the sudden explosion of crypto people / family members getting kidnapped and having a finger or more chopped off.
I did find your story, from 2005, about a man having his finger chopped off once the thieves realized they would need his appendage every time in order to start the car [2].
https://news.sky.com/story/french-police-investigating-serie...
[2] https://www.newscientist.com/article/mg18624943-600-finger-c...
I think "CYA" is maybe a misleading or overflowery term.
In the idealized world, the legal system is meant to provide an accessible alternative to violence for reconciling disputes, but it's increasingly wielded as an impossibly kafkaesque system meant to maintain corporate power over individuals.
I think "CYA" is an overly-flowery term for the reality that they're blocking every avenue for legal recourse, while a variety of other avenues still exist for which adding friction requires the maintenance of expensive and ongoing costs (owning multiple residences, hiring security, etc.)
(To be clear, I am advocating for a more accessible and level legal system, not for UHC-style violence.)
I'm taking some college courses, and one of them explicitly suggests to keep maybe-not-okay communications off of email so that "you don't expose your company to risks of litigation."
Ah, I see. So, when discussing ways to ensure customers cannot utilize our warranty process, I'll make sure to do so in ways that are not traceable and won't show up in discovery.
The underlying reason is that employees don't always know what they're talking about, but their nonsense could be useful to the other side in a court case.
The bigger the company, the more speculation there is about stuff people don't actually understand.
That’s not the underlying reason.
The underlying reason is to break the law and not get caught. Let’s be real here.
Did you go to high school? A sister of a friend of a friend says blah blah blah and everybody knows that yadda yadda. Same thing happens in big companies, especially among people who are out of the loop but wish they knew all the inside details. I see this all the time and sometimes it sounds like something that would be pretty damaging in a court case.
In other cases I have heard people who ought to know better speculating about “what if” they didn’t have to follow the letter of some corporate policy that was rooted in risk avoidance. Again, it looks bad but it doesn’t mean anything concrete (except that the person might have iffy judgment).
> Did you go to high school?
Hey, fuck you too buddy.
I said this based on my years of working at companies on projects specifically to do things like delete all data as soon as it was legally permissible so it could never come up in court again.
And most of my “let’s take this offline” chats have led to discussions around doing illegal shit.
Hell, I had one manager give me handwritten code on paper and instructions to commit it under my name. The code in question would cause sales to go through without the discounts presented to customers, because the discount service was buggy and his metrics were based on successfully completed sales. He even threatened to fire me when I said no, and only backed down when I put the paper in my pocket and asked if he would like anyone outside the room to see it, or if he would rather not use me as a fall guy.
If your employees "don't know what they're talking about," then either they are not representative of the company's views and have no power to enact illegal policies for the company, or they do and you don't have controls. Trying to hide that shit by default means you don't get the benefit of the doubt like you are giving them.
Sorry, I shouldn't have said it that way. What I meant was "remember high school?" Reading back, I think it could look like "are you someone who didn't go to high school, because you sound uneducated?" Not my intent, and not something I would say.
The situations you describe are not what I have experienced, which I guess makes me lucky.
My point was that in discovery, the idle chatter of know-nothings looks bad. But if there are companies that really have something to hide, well I guess that's what discovery is for. And as for your manager pal, if someone did that I'd be looking for work that very afternoon.
>Sorry, I shouldn't have said it that way. What I meant was "remember high school?" Reading back, I think it could look like "are you someone who didn't go to high school, because you sound uneducated?" Not my intent, and not something I would say.
apology accepted and I rescind my insult.
> My point was that in discovery, the idle chatter of know-nothings looks bad. But if there are companies that really have something to hide, well I guess that's what discovery is for. And as for your manager pal, if someone did that I'd be looking for work that very afternoon.
I did, and switched jobs a few weeks after that. I did keep the paper and let him know I still had it, just to fuck with him during those weeks, however.
> The situations you describe are not what I have experienced, which I guess makes me lucky.
It may be the opposite and I was just unlucky, but I have run into multiple situations with companies making 100s of millions to billions a year where that sort of behavior occurred, so if people are being trained to hide unfortunate conversations then I am going to assume the worst barring large amounts of contrary evidence.
This is just companies fighting back against the ever-expanding powers of state surveillance.
Back when the relevant laws were written, most communication was oral and in-person; writing was reserved for the "important stuff". We now apply laws that were designed for memos to messages on Slack, which are a lot more like conversations than permanent documents.
That makes a lot of sense to me, thank you. I was probably projecting a lot of my own fears and feelings into the interpretation of a lot of what some of my courses are trying to teach me.
The general rule for email, text, and all other communications I've heard is: "Don't write anything that you wouldn't be comfortable seeing on the front page of the New York Times."
Heard that first from a US mil commander who once ran for a minor political office like state rep.
I’ve also been told to preface all of my written communications with “dear lawyers and the FDA” at a job. Not that we did anything illegal, but sometimes you catch yourself writing statements that would be really easy to misconstrue.
> In the idealized world, the legal system is meant to provide an accessible alternative to violence for reconciling disputes, but it's increasingly wielded as an impossibly kafkaesque system meant to maintain corporate power over individuals.
This is an overly flowery way of saying: violence.
The worst of the consequences are the same. People end up dead, destitute, and/or with long-term health consequences, unable to enjoy the fruits of their labor in the worst cases. In the milder cases I think I'd prefer a bruise for a week to a huge financial loss.
There are plenty of nonviolent extralegal options, ranging from sit-ins and protests, to destruction of property, to many examples in the CIA's subtle sabotage field guide, like running meetings poorly.
They're saying due to the real world effects, the current system isn't meaningfully different from violence. They aren't advocating for violence in turn.
This reminds me of all the new companies that want to "help" you get your public information out of $CORPORATE hands, as if these companies will somehow not succumb to either enshittification or breach.
The good thing about the grift economy is it grifts itself, like the turtles!
I remember an AI dataset tool asking candidates to record a 1 minute self intro video for interview purposes in 2022. I was wondering if they were manually watching all of them.
Author here. Wrote this after watching Lapsus$ post the Mercor archive on their leak site earlier this month. The thing that struck me is the combination: voice samples paired with ID document scans. Most breaches leak one or the other. This one ships a deepfake-ready kit. Tried to keep the writeup practical: what an attacker can actually do with this combo (banking voiceprint bypass, Arup-style video calls, insurance fraud), and a 5-step checklist for the contractors who were in the dump.
Interesting - thanks for the rabbit hole today. ;)
Mercor hasn't released many public statements about the incident. Social media posts aren't necessarily public, but I did find this breach notification sample filed with CA: https://oag.ca.gov/ecrime/databreach/reports/sb24-621099 . I guess we'll see if our legislators finally take data privacy seriously.
Didn't this happen three weeks ago?
Mercor has definitely released statements with boilerplate "investigations are underway."
HSBC offered voice verification years ago and I just laughed and said nope.
I don’t even use biometrics on apple devices, I use a 6 digit pin.
It was always a stupid idea.
The thing about being willing to trade convenience for security is that you get called paranoid, and then when the other shoe does drop and you are still making that trade, you still get called paranoid for the current thing you're not doing that "everyone does".
Paraphrasing Franklin and Churchill, those who trade some security for some convenience may soon find themselves possessed of neither at all.
> I don’t even use biometrics on apple devices
Assuming Apple is truthful on this matter (so far it seems so), Apple devices store a mathematical representation of the data, not the data itself (i.e. not a picture of your finger) and keep it only on device on a special hardware section designed for extra security. When apps ask for authentication, they can never inspect the data, they can only ask “does this match?”.
Even if you were somehow able to exfiltrate the data and find some way to transform it for something nefarious, you’d still need to first attack and bypass a specific hardware feature of the target’s device.
So sure, not having any representation of the data anywhere is technically more secure (maybe, as typing your code could be intercepted by a shoulder surfer or a camera), but biometrics on Apple devices are fundamentally not the same as having your raw data available on a random server somewhere.
Also, given how many times you enter a 6-digit number over a day, it's absolutely trivial to steal it. Let alone basic patterns people use, smudges etc.
In the use case of a mobile phone, Apple's Face ID absolutely improves security several-fold.
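To make the "match-only" design concrete, here's a toy Python sketch of the idea (my own illustration, not Apple's implementation; real biometric matching is fuzzy feature comparison against a threshold, not exact hashing): the stored template never leaves the enclave object, and callers can only ask yes/no.

```python
# Toy sketch of a match-only biometric store: the template is private
# to the object and the only exposed operation is "does this match?".
import hashlib
import hmac

class SecureEnclaveSketch:
    def __init__(self, template_bytes: bytes):
        # In real hardware this would live in isolated storage;
        # here we just keep it private to the object.
        self._template = hashlib.sha256(template_bytes).digest()

    def matches(self, candidate_bytes: bytes) -> bool:
        # Constant-time comparison; the stored digest is never returned.
        probe = hashlib.sha256(candidate_bytes).digest()
        return hmac.compare_digest(self._template, probe)

enclave = SecureEnclaveSketch(b"fingerprint-features")
print(enclave.matches(b"fingerprint-features"))  # True
print(enclave.matches(b"someone-else"))          # False
```

Even if an app could query this object freely, it could only brute-force guesses, never read the template back out.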
> Self-audit your public audio footprint. Search YouTube, podcast directories, and old Zoom recording
This is suggestion #1 on your list of remediation steps for victims, but you didn't provide any information on how anyone would actually do that. How exactly would I search the internet for copies of my voice?
Please don't tell me the solution is giving an embedding of my voice to another third party.
Great question. There's no "reverse voice search" yet the way there is for images — that's genuinely a tool the world needs. In the meantime, the most useful thing is searching your name across YouTube and podcast platforms to map out what's already public. And for Mercor contractors specifically, the California AG breach notice gives you a solid legal basis to request full deletion. Worth doing today.
Note, this comment and your other one (https://news.ycombinator.com/item?id=47931838) were autokilled by HN, because it (rightly) detected that you're using AI to write your comments. I vouched this one to unkill it before I realized it was AI and supposed to be dead. I unvouched it, but your comment's still alive. So now I'm leaving a note saying mea culpa, and to suggest not using AI in your comments unless you want to be autokilled.
Thanks for saving me the tokens.
One more data point for why suing companies should also lead to the CEO getting prison time. And ideally we'd invent some kind of equivalent of prison for non-human persons like organisations.
Because right now the incentives to do what's right are so low. Taking risks with other people's lives is becoming the norm for companies.
The only data that cannot be stolen or leaked is data that doesn't exist. Hard lesson for both users and companies.
Germans (because of course) have a word for this: "Datensparsamkeit". Being frugal with your data.
I miss the pre-LLM days when you could make a decent argument that having any unnecessary data was just a liability. Now all anybody thinks is “more data for the AI!”
Were you not around for the Big Data heyday a decade ago?
Once thumb drives became large enough to fit most datasets, it stopped being Big Data. Just normal data.
We have thumb drives that can store petabytes of data?
Or did you mean the "big data" crowd which thought 500GB was noteworthy? I don't think anyone took them seriously, neither in the 2010s nor now. That was always "small" data.
Most companies using the term "big data" had datasets in the TB range. One company I had a gig at had a full Hadoop cluster set up, and their whole dataset was 40GB. Their marketing had all the big-data-adjacent keywords all over the brochures for clients.
That's a decent quality 3 hours movie :D
> We have thumb drives that can store petabytes of data
We do?
Please provide a link.
You would need 4 and change of these 245tb Kioxias to hold 1 petabyte, and an entire server grade computer to run them.
https://www.tomshardware.com/pc-components/ssds/kioxia-unvei...
Or 250 of these ~$400 4tb flash drives and an insane number of dongles to connect them all:
https://www.slashgear.com/1847725/largest-usb-thumb-drive-hi...
Plus one more for your parity drive.
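A quick sanity check of the drive counts above (using decimal units, 1 PB = 1000 TB):

```python
# How many drives of each size it takes to hold one petabyte.
import math

PB_IN_TB = 1000
print(math.ceil(PB_IN_TB / 245))  # 245 TB Kioxias: 5 ("4 and change", rounded up)
print(math.ceil(PB_IN_TB / 4))    # 4 TB flash drives: 250
```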
It was a question; you've just edited out the punctuation. You're asking the exact same thing as the person you replied to.
My rule of thumb was "can it fit in RAM on a server?" If it can, then it's not big data.
500GB is in the "fits" category.
You can quadruple that and could still fit in server RAM
To some degree IMO big data is still a mindset when it might take a day to process your data in a normal SQL query. Some tech doesn't scale to the data size for all use cases, and you need different solutions.
Hell you mean a decade ago? I still see businesses running losses left right and center saying that they're gonna monetize user data, any day now.
Relatedly, "monetizing user data" seems to just mean ads. Ads on everything, forever, until the userbase gets fed up and moves to a new service that definitely won't do that, and the cycle repeats about every 3 years.
Data hoarding predates LLMs. There were other machine learning methods which also needed data for training.
“Before LLM’s there was_____”
I see this whenever an LLM’s impact is assessed. We know. The issue is scale and the ability for smaller and smaller groups (down to individuals) to execute at scale.
Fake news always existed. Now one dude in India can flood multiple sock puppet media accounts with right wing content/images (actual example) at a scale previously unimaginable.
Do LLMs require that much more data than the traditional ML approaches we've seen over the years?
Yes. This is pretty well established. Neural networks in general are considerably less sample-efficient than traditional ML methods. The reason they became so successful is that they scale better as you increase training data and model size. But only with modern compute power they became useful outside of academic toy model applications.
That’s not the issue I’m hitting here primarily but yes.
My concern is that I can open up chatGPT and even with a free, “anonymous” account run an assembly line generating tens of thousands of words a day to pump to Twitter that are good enough to prop up multiple fake accounts and cause mayhem.
Now make it thousands of people like me doing it. Now add funding and political orgs. Add company leadership that turns a blind eye so long as it drives engagement. This scale and pipeline wasn’t possible 5 years ago, even if we clearly see the throughline.
I’m not even getting into fake images either. That used to require some know how. There are basically no hurdles and even if most people learn it’s fake, millions likely won’t. If you’re a little lucky, less scrupulous “news” outlets will amplify it for you as well for free.
I really hate this when it's something negative that humans also do. It's like, yeah, people do do that, but why are we automating {negativeTrait}?
Unfortunately the answer is usually people just want to hand wave away the critique for one reason or another. “People already do that” is an easy truism for stifling discussion.
> Now one dude in India can flood multiple sock puppet media accounts with right wing content/images (actual example) at a scale previously unimaginable.
I have the faintest possible hope that such things are going to be the death knell of social media. Yeah a lot of credulous idiots are happily giving AI thirst traps their money for stroking their confirmation bias, but that's just who's left at this point. It feels like every social media app I use is gradually bleeding users who aren't hopelessly addicted to the dopamine treadmill, because what's left is just plain unappealing to them, which selects for the people who are most vulnerable to AI shit, which is far from ideal, but also means those platforms are comprised ever more of that vulnerable population and nobody else. And the problem with all these businesses going through that is without a diverse, growing audience, you just become InfoWars, slinging the same slop to the same people every day, and every ounce of said slop is great for what's left of your audience, but absolute garbage for getting anyone new in it. And it just goes on that way until you sputter out and die (or harass the wrong group of parents I guess).
I wish all social media sites a very haha die in a fire.
Mate, you're on a social media site right now that often has AI-generated content displayed at the top of what's "trending". Sure, the general user base does a better job here flagging that sort of stuff, as AI seems to be a shared interest in much of the community, but it still sneaks its way by.
You’re technically right but I think we can all agree HN is significantly different from the major players. The vast majority of us see the same posts and comments, for starters. The churn of posts is also much slower. You log on 2-3 times spread out in a day and you see 90% of the main posts. Top posts linger for 24-48hrs regularly.
No media uploading, memes are few and far between (usually punished), etc.
10+ years ago companies were hoovering up data for ML - trying to find correlations in high-dimensionality data. Mostly the results were garbage but occasionally you hit on a real, unexpected phenomenon.
Nowadays you just throw all the data into a black box and believe whatever it says blindly.
Data that is publicly available also can't be stolen or leaked. Nobody can steal Mozilla's common voice dataset.
Data can never be stolen, because it is not a physical thing. Data can be copied, and it can be erased - sometimes both happens at the same time. Data can be lost, that is when its last existing copy was erased.
pedantic and true. What was stolen was not data, but future revenue based on exclusive access to that data.
Pedantic and relevant. If they lost the voice samples, they wouldn't have it for training new models. If they were copied, then they have lost nothing in terms of training.
The use of "steal" for non-physical things pre-dates the use of "data" in the modern sense [1]. Policing language incorrectly is not reasonable.
[0] https://www.opensourceshakespeare.org/views/plays/play_view....
[1] https://www.etymonline.com/word/data
Money is not a physical thing.
> Germans (because of course)
I don't know if it's for the reason you imply. In the 70s, there were big debates in Germany about privacy and data storage. They spoke of one's data shadow (Datenschatten). I suspect this word comes from that tradition. The reason the word exists would then be the reckoning (Vergangenheitsbewältigung) with WW2.
Love it, also love how Datenschatten can also imply that it disappears when someone shines light on it
If only our past 20 year old self data could be so ephemeral…
Who doesn’t want that old post going extinct forever when they were shit faced outside of a bar in Nashville but now they are in their mid-life and are “respectable” members of society.
The Stasi would be the obvious cultural context.
In the US of course the government buys this sort of information legally from corporations.
> The Stasi would be the obvious cultural context.
There is also the rather famous example of how earlier census data was used in the 40’s.
Once the government has your data, they have it. The next generation of representatives may not follow all the same rules and norms
The stasi could only dream of the kind of surveillance the NSA et al has today.
Or Facebook or Equifax.
The West-German debate in the 70s came from the realization that the sheer size of the Holocaust/Shoah was in no small degree due to bureaucratic record keeping. Storing someone's ethnicity is potentially dangerous for that person.
I took the "because of course" to be about having a word for everything - a stereotypical idea about the German language.
There's also the other implication that the (East) Germans were Soviet just 35 years ago.
But yes. We Americans know Germans more for their silly big words. But statements like that can be misinterpreted, as the German perspective on themselves doesn't quite match the American stereotypes.
East Germany was not Soviet. Under influence/control of the Soviets, yes, but not part of the Soviet Union.
I was implying all 3 of the above:
- we learned the hard way that data will be used to kill people, during the Nazi regime
- we learned it again in the GDR with the Stasi being a little less obvious but still ruining people's livelihoods
- and German comes up with compound words for such things
That's like saying that English (because of course) is able to describe the concept by a combination of words.
My understanding was that it was more that words can be concatenated into new words in German, which is not so much a stereotype as a misunderstanding of fact. I.e. you wouldn't think much about something like enjoyable-comeuppance, but schadenfreude looks more impressive without the hyphen.
I would argue it's not the exact same thing. Sure, when overdone you would get the same effect. But as it is, commonly used concatenated words are words, not just hyphenated words. They are used as words, and without extra thought people don't parse them into separate parts, the way they do with a list of hyphenated words.
E.g. you don't think of firefighter as fire-fighter in ordinary usage.
Germany resisted Google Street View until 2023, which was something I thought was very impressive.
Yeah, so Germany had a ton of secret police files and of course learned very well what happens when a bunch of people start collecting dossiers.
So yeah, of course they've developed that type of distrust. Americans should have as well, after the 50s-60s paranoia about the red scare, black people, etc. Instead they just spent a few decades building an anti-social state.
> The only data that cannot be stolen or leaked is data that doesn't exist. Hard lesson for both users and companies.
Except no company is learning this lesson.
The enterprise threat model includes "our own users", and the modus operandi is to maintain as much information on that threat as possible.
The only winning move is not to play.
Seems a bit like blaming the victim? Your voice (like DNA) is kind of ambient data that's hard to hide.
Or you could put it in a box with no connection to the internet.
Introducing… The Hooli Box!
Do Germans have lots of words or just a lack of spaces?
If you had a company, why not just tell all customers that their data is safe but not spend any money on security at all: in case of a breach, just write an apology email to your clients, promise a full investigation, and move on.
Obviously, you don't have to face any legal consequences, so why worry?
Sorry for the rant... but I just find this lack of liability frustrating.
I like this. I'm genuinely curious whether you could create a Delve [0] for security. Companies could pay for the "security review and package and dashboard" virtue signal, put an impressively secure-looking logo on their site, and effectively whitewash needing to do anything else. I suspect a sufficiently expensive law firm could draft the requisite legals to shield SecCo's principals from the eventual unveiling, but not before SecCo could make hundreds of millions and the rest of the industry could save hundreds of millions on their shit-as-fuck security practices anyway. Call the spade a spade.
0 - https://techcrunch.com/2026/03/22/delve-accused-of-misleadin...
So, they should all just rotate their voices ... right?
I jest but the majority of the "normal" people I know are happy to hand over biometrics because _it's easier_. We need to start branding biometrics as "forever passwords" or something to help people understand just what they're handing over when they validate access to their checking account or enter Disney World or whatever else.
One of the problems is that "forever passwords" is a term used positively when I worked in banking, as it was a password that the customer could not forget and would not need support using.
So I could easily see a lot of people viewing this as a positive.
That's a really good point. It lays bare some of my biases when it comes to thinking about and communicating with "normal people" about this sort of thing.
People having a bad memory is an enormous cost to institutions, which is why biometrics are so appealing in the first place.
Them being forever passwords is the value prop. The risk scene has changed, but that was essentially always the pitch.
Biometrics are fine. My dog has an ID chip injected under his skin.
Your dog has nothing worth stealing and is not responsible for anything.
the "it's easier" people operate in a fundamentally different way than you or I. they thrive in the world of plausible deniability and social trust. They almost don't care what happens to them as long as it isn't their fault. And they do not consider putting themselves at risk to be the same as being at fault
in a certain light, it's kind of admirable. they live like the world is the way it should be
That’s HN users towards politics and the environment. Sitting smugly with their yubikey and encrypted laptop while the world around them crumbles.
Functionally, biometrics are closer to a username than a password.
Fingerprints, DNA, iris scans, gait patterns, etc. are all something you can't change (much like a permanent account ID) and are constantly being presented to the world (much like an email address). In addition under US law, police can compel presentation of fingerprints, but passwords are protected under the 5th amendment.
That's fair. Though, thinking about it this way, I'd argue they're even more like a permanent API key. Again, messaging completely lost on people who don't spend time worrying these things.
Mercor had a SOC 2, an MSA, all the right clauses. Voices still leaked. The apology email writes itself.
Why is voice and biometric stuff still server-side at all in 2026? Whisper.cpp runs on a phone. WebGPU works. Half these "we keep your voice secure" pipelines could run in the browser today.
The real reason isn't capability. It's cost. Centralised compute is cheaper to run, but that math only holds if you don't price in the periodic breach. Which nobody does until it's their own employees on the leak list.
Man that’s pretty shitty that Mercor tricked 40k contractors, and then did a poor job of securing their data. There should be stronger consequences for stuff like this.
What happens now is that a lot of clueless CTO that didn't know about this company now know it's name. So the outcome of this mess is probably more business for Mercor
I mean, just look at what happened to Crowdstrike....
Mercor has around 5 customers that make up 95% of its revenue. Anybody who needs to know about them already does.
At minimum, collecting voiceprints should come with much stricter consent, retention and security requirements than ordinary "training data"
It looks more like the purpose of such a company was to steal such data.
Look at their privacy policies. It absolutely is. They are harvesting video, voice, and much more.
> What does an attacker actually do with thirty seconds of someone's clean read voice plus a scan of their driver's license?
I could think of quite a few things. I know that my bank and brokerage use voice ID.
I was floating near some ex agency and GS15 folks yesterday in Houston, they explained to me that the Israeli cybersecurity apparatus has had everyone's voicemails for the last 20 years because they inserted themselves into the supply chain of voicemails somehow or another.
Kind of nuts all the ways audio data can be used now.
A few Israeli companies supply the software used to record phone calls when you call customer service.
I wonder how many of the current text-to-speech ML models have large parts of leaked or "stolen" data in their training data? Almost none of the TTS releases seem to talk about exactly where they get their training data from, for some reason. I also wonder if we'll see an explosion in SOTA TTS in ~6 months from now.
It's already there. And keeps moving.
Even have a nice UI on top.
https://voicebox.sh/
Not really, Mozilla Common Voice (the ImageNet of speech) is larger than this. Their English database has 3814 hours, 1.6 million sentences, from 100k speakers.
https://commonvoice.mozilla.org/en/languages
GOOG-411 was "competing" with a strong company (1-800-FREE411) by serving no ads in a category worth ~$3.5B at the time. It was inexplicable at the time, but they did this to get voice samples, way back when. For reasons like that, I expect that this category of training is baked — but I don't have current domain knowledge fwiw.
Yep, the silence around provenance is probably the most suspicious part
If this is real, the bigger issue might not even be the leak itself. It could be that we are quietly moving into a world where voice plus ID is enough to fully impersonate someone, and most systems are still not built for that reality.
There is also an ugly labor story here. The people labeling and training these systems are often the least protected when the data pipeline itself turns into the attack surface.
>Set up a verbal codeword with family and finance contacts. Pick a phrase that has never been spoken on a recording and never typed in chat. Brief the people who handle money on your behalf. If a call ever asks for a transfer, the codeword is mandatory.
good luck with this. most finance people deal with hundreds to thousands of clients. they obviously can't remember everyone's code word. commonly used finance systems aren't set up to securely store these codewords. they don't have processes or policies in place to implement or adhere to any sort of codeword verification.
>Rotate where voiceprints are still in use. [...] Do that now, ideally from a new recording in a different acoustic environment than the leaked sample.
would this even have an effect? i have never heard of "rotating" a voice print. isn't the whole point of a voice print that you can't really change it? if simply switching your environment completely changes your voice print, that would make voice prints utterly useless to begin with.
Yeah, seems like nonsense advice. Have a code word that was never recorded? I don't see how that would do anything. Like, the point of these systems is they can say stuff you never said convincingly.
The idea is that the attacker doesn't know the codeword. If the attacker finds out about the codeword then the attacker could indeed fake it. Hence why you shouldn't say/write it in recordings or chat messages.
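One way this could be done without leaving the codeword sitting in plaintext notes (a toy sketch of my own, not any bank's actual system): store only a salted hash of the phrase, so a leaked account-notes database doesn't expose the codeword itself.

```python
# Enroll/verify a codeword using a salted PBKDF2 hash, so the stored
# record reveals nothing about the phrase if it leaks.
import hashlib
import hmac
import os

def enroll(codeword: str) -> tuple[bytes, bytes]:
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", codeword.encode(), salt, 100_000)
    return salt, digest

def verify(codeword: str, salt: bytes, digest: bytes) -> bool:
    probe = hashlib.pbkdf2_hmac("sha256", codeword.encode(), salt, 100_000)
    return hmac.compare_digest(probe, digest)

salt, stored = enroll("october teapot")   # hypothetical phrase
print(verify("october teapot", salt, stored))  # True
print(verify("wrong phrase", salt, stored))    # False
```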
Someone who has hundreds or thousands of clients presumably couldn't remember every client's voice either, so no meaningful security is lost. They are approximately as secure or insecure as before
>presumably couldn't remember every client's voice either, so no meaningful security is lost
there are automated systems for this already. my bank, isp, etc. use them when you call in to skip the traditional verification steps. this fact is also highlighted in the article.
the problem is that there isn't typically a system in place for setting up or validating code words, so the advice given is not practical to implement.
With most US banks, you can ask them to put in a note on your account file for a code word, it will show up anytime the account file is pulled up. Now, whether or not a customer service agent will know to do so is another question. Maybe as attack vectors like this are utilized more often it will become part of their SOP. Or just stop using voice verification. In my experience, even if you pass voice verification, it only grants you access to the account and check balance and txs but still requires information like PIN or a code sent in the app or phone number. There are attack vectors for these as well but not guaranteed.
The other use cases (like calling payroll, etc) likely don’t have the same protections and probably would be more effective.
The biometric pairing is what makes this particularly bad. A leaked password is recoverable. A leaked voiceprint combined with ID scans is permanent; you cannot rotate your voice.
The deeper problem is that most of these companies collected this data because they could, not because they needed it for the core service. 'Datensparsamkeit' is the right frame: the voice samples were a liability sitting on a server waiting for exactly this.
I'm pretty sure Google and Apple already have some decent examples of a LOT of people's voices in concert with other data collation. Google Voice IIRC was bought for audio sampling voicemail in the first place. Not sure if Apple has done similar, but would be more surprised if they didn't... Let alone the voice search options for both.
> How to check if your voice is being misused
I love that the answer here is basically.. - you don't -
But maybe mitigate at unreasonable personal costs.
How about services simply stop taking public information as proof of identity?
I've been doing similar things on a different platform because as a uni student the pay is kinda nice, but I limit myself to tasks without voice/video, just input from mouse/keyboard, doing reinforcement learning/data tagging. No way I'm trusting these companies or the companies they contract the work with.
Is this post not just an ad for a vibe-coded site/product? It adds no new info on the Mercor breach and advertises something which I presume has even worse safety practices.
I'm curious: if I create an online sample from my voice, might this make it a lot harder for an AI model to identify me if all the training data contains my particular voice sample?
I saw the red flags immediately when I stumbled across them a year ago maybe. I'm really not surprised.
Isn’t this going to immediately become daily news?
Half the time I call a company they say “we are recording your voice for security / authentication purposes”.
The companies that do that have all the information on me that they require for me to set up an account, so their data breaches will be just like this one, but 1000x larger.
Can we just fast-forward through the part where this works for ID theft, past the Firefox age verification plugin that uses these datasets, and even through the part where people in the plugin dataset are digital outcasts ("this voice has been used too many times. Want to try another?")?
At the end of this dark predictable tunnel, maybe there will be a ban on biometrics for important stuff, a repeal of the age verification laws, and actual privacy legislation with teeth.
Where I live there was a common scam to manipulate voice recordings from phone calls. I was very careful back then with phone calls when I ran my own business. Like 15 years ago. Kinda crazy that any service would use voice recognition today as stated.
I'm the founder of a company that runs deepfake phishing simulations for enterprises, so biased on this one .. but the operational thing the piece misses is that this is the first widely circulated dump where voice, govt ID, and selfie all came from the same onboarding session, i.e. most enterprise call center auth still treats those as 3 independent factors ..
The scarier piece is that an attacker pulls a contractor from the dump, finds their employer on LinkedIn, then calls that company's IT helpdesk for a password reset with the cloned voice.
Fwiw we put up a free realtime face swap demo a while back at https://www.callstrike.ai/deepfake-security-training .. worth a look if you want to actually feel how trivial this has gotten.
Great point about the helpdesk vector. The LinkedIn-to-IT-reset path is a brilliant illustration of how social engineering chains work. And you're right that audio is the frontier: video deepfake detection has gotten really good, with lots of great tools out there. Audio is the next wave, and the teams building solutions for real-world call quality are going to unlock a massive market. Exciting space to be in.
This kind of event is the best argument against needless data hoarding. But it would help if the law better provided for some kind of consequences for negligence.
40k people are not the only ones under threat; I am getting AI contractor job offers every month on UpWork. I am glad I haven't accepted more than one, as it is just not worth doing.
You could have seen this coming a mile away. So far I have gotten away with never uploading my ID and/or interacting with one of those companies (though one idiot working for some VC thought it was ok to sign a document on my behalf by uploading my signature!!, never mind a bit of fraud) but it is getting harder and harder. Banks and in some cases even governments forcing you to send data to these operators is a very bad idea. But hey, who ever got hurt by some security theater?
I had to open a bank account for a company here a few years ago, right on the bubble of this happening, and they still had an option to come by in person with the proper documentation, which I did. Now it is all outsourced.
These companies are the fattest targets and they're run by incompetents. You should assume that anything you give them will eventually be part of some hack.
Tell us more about that fraud story! Was the person your attorney or accountant? Or just some "smart" person who decided to wisely save time by doing fraud?
It was a fund administrator. I still find it unbelievable that they would so casually do this. And yes, they thought they were very smart... and helpful too...
Why is the ID treated as a hidden secret that can be used for anything security-related in the first place?
Because historically that's how it worked, but officials just looked at the document and verified that it was the real thing. Then photocopiers came along and it became normalized to take copies of the documents. Then digital copies happened and that changed things completely when coupled with networking technology. What the officials in charge don't seem to understand is that by making digital copies in networked environments the IDs themselves lost their value completely, after all if the digital copy serves any purpose at all as a stand-in for the original then they have become that original.
I love how the "check if you're affected" involves giving a voice sample to whatever the fuck that website is.
It's like those "have I been pwned" knockoff websites, where you type in your name and email and they grab your IP, location, and anything else to sell it off.
This is exactly why "voice as authentication" feels like a dead end to me
It feels like a dead end because it's being used wrong. "Is this John's voice?" is the wrong question. "Does this call look like how John normally calls?" is way more interesting. Same device, same time of day, same way of starting a sentence. That whole pattern is much harder to fake than a voice alone. The authentication isn't dead, it just needs to grow up from a single check into a full picture.
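As a rough illustration of that "full picture" idea (a toy scoring sketch of my own invention, with made-up signal names and weights, not any vendor's system): combine several weak signals and require a threshold, rather than trusting the voiceprint alone.

```python
# Toy risk scoring: each matching signal contributes its weight,
# and calls below the threshold get stepped-up verification.
def call_risk_score(signals: dict) -> float:
    # Weights are illustrative assumptions, not tuned values.
    weights = {
        "known_device": 0.35,
        "usual_hour": 0.15,
        "usual_carrier": 0.15,
        "voiceprint_match": 0.35,
    }
    return sum(w for k, w in weights.items() if signals.get(k))

signals = {"known_device": True, "usual_hour": True,
           "usual_carrier": False, "voiceprint_match": True}
score = call_risk_score(signals)
print(score >= 0.7)  # True: enough corroborating signals to proceed
```

A cloned voice alone would score 0.35 here and fail the threshold, which is the point: the attacker has to fake the whole context, not just the audio.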
I'm at the point where I might start professionally using a voice changer. I mean what in the world, my guy?
Open Source now?
Mercor is the most scummy company out there, run by a bunch of sleazeball 20 somethings who are getting a lot of press as the youngest billionaires in the making.
Can't wait for them to crash and burn.
30 under 30 doing 10 to 20 candidates right there.
Youngest totally self-made billionaires.
Now open Chinese models can catch up
"My voice is my passport. Verify Me."
:)
HSBC did that. I could never understand that - the exact phrase was in the movie!
Someone probably did it for an internal demo, as a joke. Then people pushed it upwards, until someone clueless approved it.
Fidelity seemed to sign you up for this when you called them on the phone almost automatically. Ridiculous since it was defeated easily in a hacker movie from the 1990s using a tape recorder.
Much cleaner than keeping a finger on you to bypass the print reader.
not to be conspiratorial but stolen? or given away...
they literally handed over their voice, their face, and their government ID to train AI models for peanuts - and now Lapsus$ is sitting on 4TB of "you" that you can never change like a password