> Some users may knowingly install this software on their devices, lured by the promise of “monetizing” their spare bandwidth.
Sounds like they’re targeting networks even when the users are OK with participating, which is precisely the case you’re saying is fine.
As for malware enrolling people into the network, it depends on whether the operator is doing it or whether the malware is third parties trying to get a portion of the cash flow. In the latter case the network itself would be a victim, doubly victimized when Google attacks it as well.
Users are OK with acting as proxies because they don't understand all the shady stuff their proxy is being used for. Also consumer ISPs generally ban this.
But then would you make the same arguments for running a tor node (presumably, you don't know what shady stuff is there, but you know there's shady stuff)?
> These SDKs, which are offered to developers across multiple mobile and desktop platforms, surreptitiously enroll user devices into the IPIDEA network.
> These SDKs, which are offered to developers across multiple mobile and desktop platforms.
> other actors then surreptitiously enroll user devices into the IPIDEA network using these frameworks.
I’m not saying Google did the wrong thing, but it is one private entity essentially handing out a death sentence on its own. The only mitigating factors are that a) the technical disruptions were confined to Google's own infrastructure, and b) the legal judgments were then enforced with cooperation from others like Cloudflare. But it’s not clear what the legal proceedings were actually like.
Am I the only one cynically thinking that "Russia, Iran, DPRK, PRC, etc" is the "But think of the chiiildren!!!" excuse for doing this?
And when Google says
"IPIDEA’s proxy infrastructure is a little-known component of the digital ecosystem leveraged by a wide array of bad actors."
What they really mean is " ... leveraged by actors indiscriminately scraping the web and ignoring copyright - that are not us."
I can't help but feel this is just Google trying to pull the ladder up behind them and make it more difficult for other companies to collect training data.
>I can't help but feel this is just Google trying to pull the ladder up behind them and make it more difficult for other companies to collect training data.
I can very easily see this as being Google's reasoning for these actions, but let's not pretend that clandestine residential proxies aren't used for nefarious things. The vast majority of social media networks will ban - or, more generally and insidiously, shadow ban - accounts/IPs that use known proxy IPs. This means that they are gating access to their platforms behind residential IPs (on top of their other various black boxes and heuristics like fingerprinting). Operators of bot networks thus rely on residential proxy services to engage in their work, which ranges from mundane things like engagement farming to outright dangerous things like political astroturfing, sentiment manipulation, and propaganda dissemination.
LLMs and generative image and video models have made the creation of biased and convincing content trivial and cheap, if not free. The days of "troll farms" is over, and now the greatest expense for a bad actor wishing to influence the world with fake engagement and biased opinions is their access to platforms, which means accounts and internet connections that aren't blacklisted or shadow banned. Account maturity and reputation farming is also feeling a massive boon due to these tools, but as an independent market it also similarly requires internet connections that aren't blacklisted or shadow banned. Residential proxies are the bottleneck for the vast majority of bad actors.
> The vast majority of social media networks will ban - or more generally and insiously - shadow ban accounts/IPs that use known proxy IPs. This means that they are gating access to their platforms behind residential IPs (on top of their other various blackboxes and heuristics like fingerprinting)
Social media will ban proxy IPs, yet gleefully force you to provide your ID if you happen to connect from the wrong patch of land. I find it difficult not to support any and all attempts to bypass such measures.
The fact is that there's now a perfectly legitimate use for residential proxies, and the demand is just going to keep growing as more websites decide to "protect their content", and more governments decide to pass tyrannical laws that force people to mask their IPs. And with demand, comes supply, so don't expect them to go away any time soon.
This really just sounds like a rehash of the argument against encryption. "Bad people use it, so it should go away" - never mind that there are completely legitimate uses for it. Never mind that using a residential proxy might be the only way to get any privacy at all in a future where everyone blocks VPNs and Tor, a future where you may not even be able to post online without an ID depending on where you live, a future which we're swiftly approaching.
It's already here, in fact. Imgur blocks UK users, but it also blocks VPNs and Tor. The only way somebody living in the UK can access Imgur is through a residential proxy.
> The only way somebody living in the UK can access Imgur is through a residential proxy.
And very little of value was lost.
> This really just sounds like a rehash of the argument against encryption. "Bad people use it, so it should go away" - never mind that there are completely legitimate uses for it.
Except that almost everything that uses encryption has some legitimate use. There are pretty much no legitimate uses for residential proxies, and their use in flooding the Internet with crap greatly outweighs whatever legitimate use remains.
If I plumbed a 30cm sewage line straight into your living room would you be happy with it? Okay, well, tell you what, let's make it totally legit - I'll drop a tasty ripe strawberry into the stream of effluent every so often, how about that?
No, what they're saying is what they said; what you're implying reveals a strange bias. Web scraping through residential proxies? Please think your argument through. There are much more effective and efficient ways to do that. Multiple bad actors, like ransomware affiliates, have been caught using residential proxy networks. But by all means, don't let facts and cyber threat intelligence get in the way.
> Am I the only one cynically thinking that "Russia, Iran, DPRK, PRC, etc" is the "But think of the chiiildren!!!" excuse for doing this?
Maybe. But until I dropped all traffic from pretty much every mobile network provider in Russia and Israel, I'd get up every morning to a couple of thousand new users of whom a couple of hundred had consistently within a few hundred milliseconds created an account, clicked on the activation link, and then posted a bunch of messages in every forum category spreading hate speech.
Getting rid of malware is good. A private for-profit company exercising its power over the Internet, not so much. We should have appropriate organizations for this.
These proxies are the reason you get spam in your Google search results, spam in the Play Store (by means of fake good reviews), and basically spam in anything user-generated.
It directly affects Google and you, I don’t see why they should not do this.
Spam in Google search results is due to Google happily taking money from the spammers in exchange for promoting their spam, or that the spam sites benefit Google indirectly by embedding Google Ads/Analytics.
I don't see any spam in Kagi, so clearly there is a way to detect and filter it out. Google is simply not doing so because it would cut into their profits.
"SEO spammers being more advanced than multi-billion-dollar search conglomerate" is a myth. Spam sites have an obvious objective: display ads, shill affiliate links or sell products. All these have to be visible, since an ad or product you can't see/buy is worthless. It is trivial to train a classifier to detect these.
But let's play devil's advocate and say you are right and spammers are successfully outsmarting Google - well, Kagi does use Google results via SerpAPI by their own admission, meaning they too should have those spam results. Yet they somehow manage to filter them out with a fraction of the resources available to Google itself with no negative impact on search quality.
Many are "compensated" (in the way of software they didn't pay for), so the real question is that of disclosure (in which case many software vendors check the box in the most minimal way possible by including it as fine print during the install)
No, the question is not just disclosure. People have their bandwidth stolen, and sometimes internet access revoked due to this kind of fraud and misuse - disclosure wouldn’t solve that
Also, as a website owner, these residential proxies are a real pain. Tons and tons of abusive traffic, including people trying to exploit vulnerabilities and patently broken crawlers that send insane numbers of requests, and no real way to block it.
It's just nasty stuff. Intent matters, and if you're selling a service that's used only by the bad guys, you're a bad guy too. This is not some dual-use, maybe-we-should-accept-the-risks deal that you have with Tor.
What I learn: proxy networks run by large corps are good; the true internet is bad. I understand that we are often talking about malware/worms that enable this. Still, I find it disturbing to hear so much libertarian speech from the tech scene while, on the other hand, the same people feel very comfortable taking over state powers like policing in an effort to save the world.
> Ones which you pay for and which are running legitimately, with the knowledge (and compensation) of those who run them.
The problem is, it is by default unethical to have residential users be exit nodes for VPNs - unless these users are lawyers or technical experts.
No matter what you do as a "residential proxy" company, you cannot prevent your service from being used by CSAM peddlers, and thus you cannot ensure your exit nodes aren't the ones whose IP addresses show up when the FBI comes knocking.
Residential proxies are the only way to crawl and scrape. It's ironic for this article to come from the biggest scraping company that ever existed!
If you crawl at 1 Hz per crawled IP, no reasonable server would suffer from it. It's the few bad apples (impatient people who don't rate limit) who ruin the internet for users and hosters alike. And then there's Google.
First off: Google has not once crashed one of our sites with GoogleBot. They have never tried to bypass our caching, and they are open and honest about their IP ranges, allowing us to rate-limit if needed.
Residential proxies are not needed if you behave. My take is that you want to scrape stuff that site owners do not want to give you, and you don't want to be told no or perhaps pay for a license. That is the only case where I can see you needing residential proxies.
>The residential proxies are not needed, if you behave
I'm starting to think that some users on Hacker News either don't 'behave', or at least think they don't, and are providing an alibi for those who don't 'behave'.
The 'hacker' in Hacker News attracts not just hackers as in 'hacking together features' but also hackers as in 'illegitimately gaining access to servers/data'.
As far as I can tell, as a hacker who hacks features together, resi proxies are something the enemy uses. Whenever I boot up a server and get 1000 login requests per second, plus requests for commonly exploited files, from Russian and Chinese IPs, those come from resi IPs no doubt. There are two sides to this match, no more.
> You can’t get much crawling done from published cloud IPs.
Think about why that might be. I'm sorry, but if you legitimately need to crawl the net and do so from a cloud provider, your industry screwed you over with bad behaviour. Go get hosting with a company that cares about who its customers are; you're hanging out with a bad crowd.
No, no they really aren't, but I was using the "scraping industry" in the sense that it's a thing. Getting hosting in smaller datacenters is simple enough, but you may need to manage your own hardware or VMs. Many will help you get your own IP ranges and ASN, and that's going to go a long way if you don't want to get bundled in with the bad bots.
It differs obviously, but having an ASN in our case means that we can deal with you, contact you, and assume that you're better than random bot number 817.
One thing about Google is that many anti-scraping services explicitly allow access to Google and maybe couple of other search engines. Everybody else gets to enjoy CloudFlare captcha, even when doing crawling at reasonable speeds.
Just today I wanted to get a list of locations of various art events around the city, which are all listed on the same website, but the site does not provide a page showing all events happening this month on a map. I need a single map to figure out what I want to visit based on the distance I have to travel. Unfortunately that's not an option; the only option is to go through hundreds of items and hope whatever I picked is near me.
Do you think this is such a horrible thing to scrape? I can't do it manually since there are a few hundred locations. I could write some Python script which uses Playwright to scrape things using my desktop browser in order to avoid Cloudflare. Or, which I am much more familiar with, I could write a Python script that uses BeautifulSoup to extract all the relevant locations once for me. I would have been perfectly happy fetching 1 page/sec or even 1 page/2 seconds and would still be done within 20 minutes, if only there was no anti-scraping protection.
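The polite, rate-limited approach described above can be sketched with just the standard library (using `html.parser` in place of BeautifulSoup so the example is dependency-free; the User-Agent string and the `location` class name are made up for illustration and would need adjusting for the real site's markup):

```python
import time
import urllib.request
from html.parser import HTMLParser

class LocationExtractor(HTMLParser):
    """Collects the text of elements whose class contains 'location'.
    (The class name is a guess; adjust for the real site's markup.)"""
    def __init__(self):
        super().__init__()
        self._grab = False
        self.locations = []

    def handle_starttag(self, tag, attrs):
        if any(k == "class" and "location" in (v or "") for k, v in attrs):
            self._grab = True

    def handle_data(self, data):
        if self._grab and data.strip():
            self.locations.append(data.strip())
            self._grab = False

def fetch_politely(urls, delay=1.0):
    """Fetch each page no faster than one request per `delay` seconds,
    with an honest User-Agent so the site owner can identify us."""
    results = {}
    for url in urls:
        req = urllib.request.Request(
            url, headers={"User-Agent": "art-map-scraper (personal use)"})
        with urllib.request.urlopen(req) as resp:
            parser = LocationExtractor()
            parser.feed(resp.read().decode("utf-8", errors="replace"))
            results[url] = parser.locations
        time.sleep(delay)  # throttle: ~1 page/sec, as described above
    return results
```

With a list of a few hundred event URLs this finishes in well under 10 minutes at 1 page/sec, which no reasonable server would even notice.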
Scraping is a perfectly legal activity, after all. Except thanks to overly-eager scraping bots and clueless/malicious people who run them there's very little chance for anyone trying to compete with Google or even do small scale scraping to make their life and life of local art enthusiasts easier. Google owns search. Google IS search and no competition is allowed, it seems.
do we think a scraper should be allowed to take whatever means necessary to scrape a site if that site explicitly denies that scraper access?
if someone is abusing my site, and i block them in an attempt to stop that abuse, do we think that they are correct to tell me it doesn’t matter what i think and to use any methods they want to keep abusing it?
I'd still like the ability to just block a crawler by its IP range, but these days nope.
1 Hz is 86400 hits per day, or 600k hits per week. That's just one crawler.
Just checked my access log... 958k hits in a week from 622k unique addresses.
95% of it is fetching random links from the u-boot repository that I host. I blocked all of the GCP/AWS/Alibaba and of course Azure cloud IP ranges.
It's now almost all coming from "residential" and "mobile" IP address space in completely random places all around the world. I'm pretty sure my u-boot fork is not that popular. :-D
Every request is a new IP address, and available IP space of the crawler(s) is millions of addresses.
I don't host a popular repo. I host a bot attraction.
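Blocking those cloud ranges boils down to a CIDR membership check; a minimal sketch with Python's `ipaddress` module (the two ranges below are illustrative placeholders, not the providers' real published lists, which you'd load from their machine-readable range files instead):

```python
import ipaddress

# Hypothetical blocklist; real cloud providers publish their ranges
# (e.g. as JSON files) which you'd load and parse instead.
BLOCKED_NETS = [ipaddress.ip_network(n) for n in (
    "34.0.0.0/8",   # stand-in for a GCP-style range (illustrative only)
    "52.0.0.0/8",   # stand-in for an AWS-style range (illustrative only)
)]

def is_blocked(ip: str) -> bool:
    """True if `ip` falls inside any blocked datacenter range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_NETS)
```

The catch, as noted above, is that this does nothing against traffic arriving from residential and mobile address space.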
Yep, that’s why that’s all over the place now. The cookie thing is more of a first line of defense. It turns away a lot of shoddy scrapers with nearly no resources on my side. Anubis knocks out almost all of the remainder.
> These efforts to help keep the broader digital ecosystem safe supplement the protections we have to safeguard Android users on certified devices. We ensured Google Play Protect, Android’s built-in security protection, automatically warns users and removes applications known to incorporate IPIDEA SDKs, and blocks any future install attempts.
Nice to see Google Play Protect actually serving a purpose for once.
If I'm not mistaken, the plaintiffs in the US v Google antitrust litigation in the DC Circuit tried to argue that website operators are biased toward allowing Google to crawl and against allowing other search engines to do the same
The Court rejected this argument because the plaintiffs did not present any evidence to support it
For someone who does not follow the web's history, how would one produce direct evidence that the bias exists
Yup exactly. Google must be the only one allowed to scrape the web. Google can't have any other competition. Calling it in "user's best interest" is just like their other marketing cons: "play integrity for user's security" etc
This does nothing against your ability to scrape the web the Google way, AKA from your own assigned IP range, obeying robots.txt, and with a user agent that explicitly says what you're doing and gives website owners a way to opt out.
What Google doesn't want (and I don't think that's a bad thing) is competitors scraping the web in bad faith, without disclosing what they're doing to site owners and without giving them the ability to opt out.
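The "Google way" described above (honest user agent, robots.txt opt-out) is straightforward to implement; a minimal sketch using Python's `urllib.robotparser`, parsing a sample policy offline (the crawler name is invented, and in practice you'd fetch the live robots.txt with `set_url()` and `read()`):

```python
from urllib import robotparser

# An invented crawler name; a real one should link to a page explaining
# what the bot does and how to block it.
USER_AGENT = "ExampleCrawler"

def allowed(rp: robotparser.RobotFileParser, url: str) -> bool:
    """Check whether the site's robots.txt permits us to fetch this URL."""
    return rp.can_fetch(USER_AGENT, url)

# Normally: rp.set_url("https://example.com/robots.txt"); rp.read()
# Here we parse a sample policy offline for illustration.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])
```

A crawler that checks `allowed()` before every fetch gives site owners exactly the opt-out mechanism this comment describes.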
If Google doesn't stop these proxies, unscrupulous parties will have a competitive advantage over Google, it's that simple. Then Google will have to decide between just giving up (unlikely) or becoming unscrupulous themselves.
LLMs aren't a good indicator of success here because an LLM trained on 80% of the data is just as good as one trained on 100%, assuming the type/category of data is distributed evenly. Proxies help when you do need to get access to 100% of the data including data behind social media loginwalls.
Have you got any proof of Google scraping from residential proxies users don't know about, rather than from their clearly labelled AS? Otherwise you're mixing entirely different things into one claim.
That's the whole point. Websites that try to block scraping attempts will let google scrape without any hurdle because of google's ads and search network. This gives google some advantage over new players because as a new name brand you are hardly going to convince a website to allow scraping even if your product may actually be more advantageous to the website (for example assume you made a search engine that doesn't suck like google, and aggregates links instead of copying content from your website).
Proxies in comparison can allow new players to have some playing chance. That said I doubt any legitimate & ethical business would use proxies.
I don't think parent post is claiming that Google is using other people's networks to scrape the web only that they have a strong incentive to keep other players from doing that.
No, there are other scrapers that Google doesn't block or interact with. You can even run scraping from GCP. This has nothing to do with "only Google is allowed to scrape".
They even host apps which exist for scraping data, like https://play.google.com/store/apps/details?id=com.sociallead...
Play Protect blocks malicious apps, not network traffic, so no, it obviously doesn't interfere with Google's apps.
AFAIK it also left SmartTube (an alternative YouTube client) alone until the developer got pwned and the app trojanized with this kind of SDK, and the clean versions are AFAIK again being left alone. No guarantee that it won't change in the future, of course, but so far they seem to not be abusing it.
Malicious here means "most people who aren't trying to argue semantics or otherwise be smartasses about it would consider it malware". That's why the example I gave is semi-popular software that allows watching YouTube without ads and without a premium subscription; i.e., at least in the case I observed, I don't believe this was weaponized against apps that interfere with their business model.
As for "intrusive advertising is malicious", see the second part of the first sentence.
My understanding is that routing through residential IPs is a part of the business of some VPN providers. I don't know how above board they are on this (as in notifying customers that this may happen, however buried in the usage agreement, or even allowing them to opt out).
But, my main point, is that the whole business is "on the up and up" vs some dark botnet.
> While operators of residential proxies often extol the privacy and freedom of expression benefits of residential proxies, Google Threat Intelligence Group’s (GTIG) research shows that these proxies are overwhelmingly misused by bad actors
Mullvad seems to be one of those VPN providers. [1] Though I very much doubt they would sneakily make end-users devices exit nodes. Though, as a historical side note, let's not forget Skype used to make users computers act as a relay as well during its more decentralized days.
Anyone could scrape the net; then modern scrapers came along with their shitty code and absolutely no respect. The reason so many of us block or throttle scrapers is that they misbehave. They don't back off, they try to bypass caches, and if they crash a site they don't adjust; they just pound it into the ground again when it's back. We managed to talk to one large AI company who didn't really want to fix anything, but told us they'd be fine with us just rate limiting them, as if we somehow owed them anything. They just get a stupidly low rps now, even though we'd let them go faster if they'd just fix their bot.
Some sites don't want you scraping, but it's their content, their rules. We don't really care, but we have to because of the number and quality of the bots we're seeing. This is, in my mind, a 100% self-imposed problem on the scrapers' part.
I'm actually a little shocked seeing that there was a WebOS variant of the residential proxying SDK endpoint. Does that mean there might be a bit more unchecked malware lurking behind the scenes in the LG ecosystem?
Personally I'm surprised they didn't have a Samsung option.
I keep my brand new LG C5 totally disconnected from the internet and use my Apple TV for movie watching. I’m not going to trust a company like LG to secure their devices.
Google shows a sample of the IOCs, but Google Trust Services has issued a number of the SSL certs for those domains, which have not been revoked (yet?).
Only looking at the:
- a8d3b9e1f5c7024d6e0b7a2c9f1d83e5.com
- af4760df2c08896a9638e26e7dd20aae.com
- cfe47df26c8eaf0a7c136b50c703e173.com
Looks like a standard MD5 hash domain pattern of which currently there are:
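For anyone wanting to flag this pattern in their own logs, a minimal detector (assuming the pattern really is 32 lowercase hex characters plus `.com`, as the samples above suggest):

```python
import re

# The IOC domains above look like a 32-hex-char label (an MD5-style
# digest) under .com. Flag anything matching that shape.
MD5_DOMAIN = re.compile(r"^[0-9a-f]{32}\.com$")

def looks_like_md5_domain(domain: str) -> bool:
    return bool(MD5_DOMAIN.match(domain.lower()))
```

This will obviously miss the same pattern under other TLDs, so widen the suffix if you're hunting beyond these samples.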
They have a robust KYC that appears to serve, at least in large part, as a way to stay off the shit list of companies with the resources to pursue recourse.
Source: went through that process, ended up going a different route. The rep was refreshingly transparent about where they get the data and why they have the KYC process (aside from regulatory compliance).
Ended up going with a different provider who has been cheaper and very reliable, so no complaints.
Yeah, they make you do a Skype interview (or probably Zoom interview nowadays). You could call this KYC or collateral, depending on your view of the company. It does limit the nefariousness of their clientele but I doubt they do much, or any, monitoring of actual traffic after onboarding (not for compliance reasons, anyway).
I think they should have requested KYC when I was complaining about being unable to log into gmail, but I’m not going to complain as long as the service works.
I don’t use Luminati for anything illegal though, so it’s possible they just have some super amazing abuse detection algorithms that know this.
I've helped multiple people remove residential proxy malware that was turning their network into a brightdata exit node and they had no idea / did not consent to it. Why is google selectively targeting one provider while letting others operate freely?
No, he is referencing Google going after the Chinese company, not the Israel-based one. That does not mean there is bias with the commenter at all, just that the companies operate differently and are treated differently. The country of origin is important, as Israel-based companies are more integrated into the western business world and tend to at least show an effort in keeping spam and other abuse off their platforms.
Now I do agree that they are both bad companies that should not be allowed to operate the way they do. I would say the same thing about the other 1000 scrapers hitting websites everyday as well (including Google).
What they did not comment directly on, is how many apps / games they might have actually removed from the Playstore with the removal of the SDKs, which would be the actual interesting data.
FWIW, a couple of years ago I was involved in a court case where a subpoena was sent to Luminati to figure out whether or not a specific request had originated from their network; Luminati's lawyers replied that they do not keep any logs whatsoever, as they aren't required to under Israeli law.
Hard to imagine any serious anti-abuse efforts by Luminati if they don't monitor what their users are doing, but this is probably a deliberate effort to avoid potential liability arising from knowing what their users are doing.
Personally, I don’t think either of them are actually meaningfully bad. A bit naughty, maybe?
I do think the disparity in attention is fascinating. These new Chinese players have been getting nonstop press while everyone ignores the established giant.
I've had enough of companies saying "you're connecting from an AWS IP address, therefore you aren't allowed in, or must buy enterprise licensing". Reddit is an example that totally blocks all data to non-residential IPs.
I want exactly the same content visible no matter who you are or where you are connecting from, and a robust network of residential proxies is a stepping stone to achieving that.
If you look at the article, the network they disrupted pays software vendors per-download to sneakily turn their users into residential proxy endpoints. I'm sure that at least some of the time the user is technically agreeing to some wording buried in the ToS saying they consent to this, but it's certainly unethical. I wouldn't want to proxy traffic from random people through my home network, that's how you get legal threats from media companies or the police called to your house.
> that's how you get legal threats from media companies or the police called to your house.
Or residential proxies get so widespread that almost every house has a proxy in, and it becomes the new way the internet works - "for privacy, your data has been routed through someone else's connection at random".
> Or residential proxies get so widespread that almost every house has a proxy in, and it becomes the new way the internet works - "for privacy, your data has been routed through someone else's connection at random".
in a way, yes - the weakness of Tor is realistically its lack of widespread adoption. Tor traffic is identifiable and blockable due to the relatively small number of exit nodes (which also makes it dangerous to run exit nodes, as you become "liable").
Ingraining the ideas of Tor into regular users' internet usage is what would prevent the internet from being controlled and blocked by any actor (except perhaps draconian government overreach, which, while it can happen, is harder in the west).
Of course they're pitching it like everything's above board, but from the article:
> While many residential proxy providers state that they source their IP addresses ethically, our analysis shows these claims are often incorrect or overstated. Many of the malicious applications we analyzed in our investigation did not disclose that they enrolled devices into the IPIDEA proxy network. Researchers have previously found uncertified and off-brand Android Open Source Project devices, such as television set top boxes, with hidden residential proxy payloads.
I love how it's the "evil" Open Source project devices and "other app stores" that are the problem, not the hundreds of spyware-ridden crap apps available for download from the Play Store. It would be interesting to know how many copies of the SDK were found and removed from their own platform.
I live in the UK and can't view a large portion of the internet without having to submit my ID to _every_ site serving anything deemed "not safe for the children". I had a question about a new piercing and couldn't get info on it from Reddit because of that. I tried using a VPN and they're blocked too. Luckily, I work at a company selling proxies so I've got free proxies whenever I want, but I shouldn't _need_ to use them.
I find it funny that companies like Reddit, who make their money entirely from content produced by users for free (and often sourced from other parts of the internet without permission), are so against their site being scraped that they objectively ruin the site for everyone using it. See the API changes and the killing off of third-party apps.
Obviously, it's mostly for advertising purposes, but they love to talk about the load scraping puts on their site, even suing AI companies and SerpApi for it. If it's truly that bad, just offer a free API for the scrapers to use - or even an API that works out just slightly cheaper than using proxies...
My ideal internet would look something like that, all content free and accessible to everyone.
> that they have to objectively ruin the site for everyone using it. See the API changes and killing off of third party apps.
Third party app users were a very small but vocal minority. The API changes didn't drop their traffic at all. In fact, it's only gone up since then.
The datacenter IP address blocks aren't just for scrapers, it's an anti-bot measure across the board. I don't spend much time on Reddit but even the few subreddits I visited were starting to become infiltrated by obvious bot accounts doing weird karma farming operations.
Even HN routinely gets AI posting bots. It's a common technique to generate upvote rings - Make the accounts post comments so they look real enough, have the bots randomly upvote things to hide activity, and then when someone buys upvotes you have a selection of the puppet accounts upvote the targeted story. Having a lot of IP addresses and generating fake activity is key to making this work, so there's a lot of incentive to do it.
I agree that write-actions should be protected, especially now when every other person online is a bot. As for read-actions, I'll continue to profit off those being protected too but I wouldn't be too bothered if something suddenly changed and all content across the internet was a lot easier to access programmatically. I think only harm can come from that data being restricted to the huge (nefarious) companies that can pay for that data or negotiate backroom deals.
Have you considered that it’s because a new industry popped up that decided it was okay to slurp up the entire internet, repackage it, and resell it? Surely that couldn’t be why sites are trying to keep non humans out.
> I live in the UK and can't view a large portion of the internet without having to submit my ID to _every_ site serving anything deemed "not safe the for the children".
Really? Because I live in the UK and I've never been asked for my ID for anything.
> I want exactly the same content visible no matter who you are or where you are connecting from
The reason those IP addresses get blocked is not because of "who" is connecting, but "what"
Traffic from datacenter address ranges to sites like Reddit is almost entirely bots and scrapers. They can put a tremendous load on your site because many will try to run their queries as fast as they can with as many IPs as they can get.
Blocking these IP addresses catches a few false positives, but it's an easy step to make botting and scraping a little more expensive. Residential proxies aren't all that expensive, but now there's a little line item bill that comes with their request volume that makes them think twice.
> We need more residential proxies, not less
Great, you can always volunteer your home IP address as a start. There are services that will pay you a nominal amount for it, even.
That's already the case (irrespective of residential proxies) because content only serves as bait for someone to hand over personal information (during signup/login) and then engage with ads.
Proxies actually help with that by facilitating mass account registration and scraping of the content without wasting a human's time "engaging" with ads.
Amazon.com now only shows you a few reviews. To see the rest you must login. Social media websites have long gated the carrots behind a login. Anandtech just took their ball and went home by going offline.
There's a company that pays you to keep their box connected to your residential router. I assume it sells residential proxy services, maybe also DDoS services, I don't know. It's aptly named Absurd Computing.
Agreed. They do this with things people paid for, and use our wifi data to build their "positioning DBs" that you can't block or turn off on your phone without "rooting" your own device.
I don’t know. I wouldn’t have thought of myself as proxying other people’s traffic by carrying my iPhone around. (For one thing, it’s my own phone that initiates all the activity- it monitors for Apple devices, the devices don’t reach out to my phone.) I can see how you could frame it that way, though. I just thought they might be referring to something else that I didn’t know about.
I remain skeptical. I can understand how one might see it that way, but I think it’s stretching the word proxy too far.
Devices on Apple’s Find My aren’t broadcasting anything like packets that get forwarded to a destination of their choosing. I would think that would be a necessity to call it “proxying”.
They’re just broadcasting basic information about themselves into the void. The phones report back what they’ve picked up.
That doesn’t fit the definition to me.
I absolutely don’t mind the fact that my phone is doing that. The amount of data is ridiculously minuscule. And it’s sort of a tit for tat thing. Yeah my phone does it, but so does theirs. So just like I may be helping you locate your AirTag, you would be helping me locate mine. Or any other device I own that shows up on Find My.
It’s a very close to a classic public good, with the only restriction being that you own a relevant device.
I still "run" a small ISP with a few thousand residential ips from my scraping days. The requirements are laughable and costs were negligible in the early 2000s.
This blog post comes from the company that used to promise "don't be evil", one that steals water for data centers from villages and towns via shady deals, whose whole premise is stealing other people's stuff, claiming it as their own, locking them out, and selling their data. Who made them the arbiter of the internet? No one!!!
They just stole all of this and now get on their high horse to tell people how to use the internet? You can eff right off, Google.
Have you tried it? Every new account will be shadowbanned and if it's shared you often get blank page 429. None of this was true before the API shutdown.
That’s not my experience, using various VPNs, public networks, Cloudflare and Apple private relays. A captcha is common when logged out but that’s about it, I have not encountered any shadow bans. I create a new account each week.
That's not the same as "blocks all data to non-residential IP's"?
>if it's shared you often get blank page 429. None of this was true before the API shutdown.
See my other comment. I agree there's a non-zero amount of VPNs that are banned from reddit, but it's also not particularly hard to find a VPN that's not banned on reddit.
Private VPS for personal VPN in Netherlands (digital ocean), then Hungary (some small local DC) — both are blocked from day one.
> You've been blocked by network security. To continue, log in to your Reddit account or use your developer token. If you think you've been blocked by mistake, file a ticket below and we'll look into it.
Proton VPN sometimes (mostly?) has this issue too. It's a bit hit or miss there iirc, but I have definitely seen the last message of your comment.
That's just mullvad's IP pool being banned. The other VPN providers I use aren't banned, or at least are only intermittently banned that I can easily switch to another server.
I have never interacted with a reddit employee who wasn't actively gaslighting me about the platform. Do you even use the site? I talked to a PM recently who genuinely thought the phone app was something people liked.
everything on Reddit is so locked down it’s useless. even if you do get to post something useful, some basement dwelling mod will block it for an arcane interpretation of one of the subreddit’s 14 rules.
I haven't looked at any court documents, but the WSJ article from Wednesday reported that "Last year, Google sued the anonymous operators of a network of more than 10 million internet-connected televisions, tablets and projectors, saying they had secretly pre-installed residential proxy software on them... an Ipidea spokeswoman acknowledged in an email that the company and its partners had engaged in “relatively aggressive market expansion strategies” and “conducted promotional activities in inappropriate venues (e.g., hacker forums)...”"
There was also a botnet, Kimwolf, that apparently leveraged an exploit to use the residential proxy service, so it may be related to Ipidea not shutting them down.
The need for proxies in any legitimate context became obsolete with starlink being so widespread. Throw up a few terminals and you have about 500-2k cgnat IP addresses to do whatever you like.
The actual secret is to use IPv6 with varied source IPs in the same subnet, you get an insane number of IPs and 90% of anti-scraping software is not specialized enough to realize that any IP in a /64 is the same as a single IP in a /32 in IPv4.
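The /64 trick described above can be sketched in a few lines. This is a toy illustration: it assumes the whole prefix is routed to your host (e.g. Linux AnyIP via `ip route add local <prefix> dev lo`), and the prefix used is the RFC 3849 documentation range, not a real assignment:

```python
import ipaddress
import random

def random_source(prefix="2001:db8:1:2::/64"):
    """Pick a fresh random source address inside the prefix for each request.

    With a /64 there are 64 free host bits, i.e. ~1.8e19 distinct addresses,
    which is why per-address rate limiting is useless against this.
    """
    net = ipaddress.ip_network(prefix)
    host_bits = net.max_prefixlen - net.prefixlen   # 64 free bits in a /64
    return ipaddress.ip_address(int(net.network_address)
                                + random.getrandbits(host_bits))
```

The flip side, for defenders: bucket IPv6 rate limits by /64 (or coarser), not per address, which is exactly the specialization most anti-scraping software lacks.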
> any IP in a /64 is the same as a single IP in a /32 in IPv4
This is very commonly true but sadly not 100%. I am suffering from a shared /64 that my VPS sits on, where other folks have sent out spam - so no more SMTP for me.
If they're CGNAT then unless Starlink actively provides assistance to block them it won't matter.
As someone who wants the internet to maintain as much anarchy as possible I think it would be nice to see a large ISP that actively rotated its customer IPv6 assignments on a tight schedule.
I'm surprised by the negative takes...
Yes, proxies are good. Ones which you pay for and which are running legitimately, with the knowledge (and compensation) of those who run them.
Malware in random apps running on your device without your knowledge is bad.
> Some users may knowingly install this software on their devices, lured by the promise of “monetizing” their spare bandwidth.
Sounds like they’re targeting networks even if the users are OK with participating, which is precisely what you’re saying is fine.
As for malware enrolling people into the network, it depends if the operator is doing it or if the malware is 3rd parties trying to get a portion of the cash flow. In the latter case the network would be the victim that’s double victimized by Google also attacking them.
Users are OK with acting as proxies because they don't understand all the shady stuff their proxy is being used for. Also consumer ISPs generally ban this.
But then would you make the same arguments for running a tor node (presumably, you don't know what shady stuff is there, but you know there's shady stuff)?
Running a tor node is pretty stupid from a liability perspective, but at least you have more deniability and you are making an informed choice.
These residential proxies are pretty much universally shady. I doubt most of the users understand what they are consenting to.
tor nodes are zero risk as long as they're not an exit
been running nodes since 2017 on two providers and zero issues
yes, but i was likening being part of the proxy network to being a tor exit node. I should've made my comment clearer.
That's totally something you should consider, even if you decide to run the tor node anyway in the end.
You could say the same about google’s terms of service.
A thousand times yes.
Why would the users care either way?
Some people care about ethics, and try to avoid doing bad stuff, or helping the bad stuff.
Sure, but that only answers why some users might care.
> These SDKs, which are offered to developers across multiple mobile and desktop platforms, surreptitiously enroll user devices into the IPIDEA network.
?
Here’s an alternate spin
> These SDKs, which are offered to developers across multiple mobile and desktop platforms.
> other actors then surreptitiously enroll user devices into the IPIDEA network using these frameworks.
I’m not saying Google did the wrong thing, but it is one private entity essentially handing out a death sentence on its own. The only mitigating thing is that the disruptions were either a) technical actions on their own infra or b) legal judgments they then enforced with cooperation from others like Cloudflare. But it’s not clear what the legal proceedings were actually like.
> Malware in random apps running on your device without your knowledge is bad.
And ones that have all the indicators of compromise of Russia, Iran, DPRK, PRC, etc
Am I the only one cynically thinking that "Russia, Iran, DPRK, PRC, etc" is the "But think of the chiiildren!!!" excuse for doing this?
And when Google say
"IPIDEA’s proxy infrastructure is a little-known component of the digital ecosystem leveraged by a wide array of bad actors."
What they really mean is " ... leveraged by actors indiscriminately scraping the web and ignoring copyright - that are not us."
I can't help but feel this is just Google trying to pull the ladder up behind them and make it more difficult for other companies to collect training data.
>I can't help but feel this is just Google trying to pull the ladder up behind them and make it more difficult for other companies to collect training data.
I can very easily see this as being Google's reasoning for these actions, but let's not pretend that clandestine residential proxies aren't used for nefarious things. The vast majority of social media networks will ban - or more generally and insidiously, shadow ban - accounts/IPs that use known proxy IPs. This means that they are gating access to their platforms behind residential IPs (on top of their other various black boxes and heuristics like fingerprinting). Operators of bot networks thus rely on residential proxy services to engage in their work, which ranges from mundane things like engagement farming to outright dangerous things like political astroturfing, sentiment manipulation, and propaganda dissemination.
LLMs and generative image and video models have made the creation of biased and convincing content trivial and cheap, if not free. The days of "troll farms" are over, and now the greatest expense for a bad actor wishing to influence the world with fake engagement and biased opinions is their access to platforms, which means accounts and internet connections that aren't blacklisted or shadow banned. Account maturity and reputation farming is also seeing a massive boon due to these tools, but as an independent market it similarly requires internet connections that aren't blacklisted or shadow banned. Residential proxies are the bottleneck for the vast majority of bad actors.
> The vast majority of social media networks will ban - or more generally and insidiously, shadow ban - accounts/IPs that use known proxy IPs. This means that they are gating access to their platforms behind residential IPs (on top of their other various black boxes and heuristics like fingerprinting)
Social media will ban proxy IPs, yet gleefully force you to provide your ID if you happen to connect from the wrong patch of land. I find it difficult not to support any and all attempts to bypass such measures.
The fact is that there's now a perfectly legitimate use for residential proxies, and the demand is just going to keep growing as more websites decide to "protect their content", and more governments decide to pass tyrannical laws that force people to mask their IPs. And with demand, comes supply, so don't expect them to go away any time soon.
This really just sounds like a rehash of the argument against encryption. "Bad people use it, so it should go away" - never mind that there are completely legitimate uses for it. Never mind that using a residential proxy might be the only way to get any privacy at all in a future where everyone blocks VPNs and Tor, a future where you may not even be able to post online without an ID depending on where you live, a future which we're swiftly approaching.
It's already here, in fact. Imgur blocks UK users, but it also blocks VPNs and Tor. The only way somebody living in the UK can access Imgur is through a residential proxy.
> The only way somebody living in the UK can access Imgur is through a residential proxy.
And very little of value was lost.
> This really just sounds like a rehash of the argument against encryption. "Bad people use it, so it should go away" - never mind that there are completely legitimate uses for it.
Except that almost everything that uses encryption has some legitimate use. There are pretty much no legitimate uses for residential proxies, and their use in flooding the Internet with crap greatly outweighs that.
If I plumbed a 30cm sewage line straight into your living room would you be happy with it? Okay, well, tell you what, let's make it totally legit - I'll drop a tasty ripe strawberry into the stream of effluent every so often, how about that?
It's another type of proxy. Legitimate uses are the same as for other types of proxies.
No, what they're saying is what they said; what you're implying reveals a strange bias. Web scraping through residential proxies? Please think your thoughts through more. There are much more effective and efficient ways to do so. Multiple bad actors, like ransomware affiliates, have been caught using residential proxy networks. But by all means, don't let facts and cyber threat intelligence get in the way.
Residential proxies aren't used for scraping? That doesn't align well with my experience...
What are the much more effective and efficient ways — since you said it ?
>let facts and cyber threat intelligence get in the way
Appeal to authority by way of invoking the megacorp-branded "threat intelligence" capability (targeted PR exercise).
> Am I the only one cynically thinking that "Russia, Iran, DPRK, PRC, etc" is the "But think of the chiiildren!!!" excuse for doing this?
Maybe. But until I dropped all traffic from pretty much every mobile network provider in Russia and Israel, I'd get up every morning to a couple of thousand new users of whom a couple of hundred had consistently within a few hundred milliseconds created an account, clicked on the activation link, and then posted a bunch of messages in every forum category spreading hate speech.
Getting rid of malware is good. A private for-profit company exercising its power over the Internet, not so much. We should have appropriate organizations for this.
Proxies are the reason why you get spam in your Google search results, spam in the Play store (by means of fake good reviews) - basically spam in anything user generated.
It directly affects Google and you, I don’t see why they should not do this.
Spam in Google search results is due to Google happily taking money from the spammers in exchange for promoting their spam, or that the spam sites benefit Google indirectly by embedding Google Ads/Analytics.
I don't see any spam in Kagi, so clearly there is a way to detect and filter it out. Google is simply not doing so because it would cut into their profits.
The reason you don't see spam in Kagi is because nobody is targeting Kagi specifically.
They can probably get away with a lot of stupid rules that would backfire if anybody tried to cater to them specifically.
"SEO spammers being more advanced than a multi-billion-dollar search conglomerate" is a myth. Spam sites have an obvious objective: display ads, shill affiliate links or sell products. All of these have to be visible, since an ad or product you can't see/buy is worthless. It is trivial to train a classifier to detect these.
But let's play devil's advocate and say you are right and spammers are successfully outsmarting Google - well, Kagi does use Google results via SerpAPI by their own admission, meaning they too should have those spam results. Yet they somehow manage to filter them out with a fraction of the resources available to Google itself with no negative impact on search quality.
Okay. You get right on that. In the meantime, would you rather they did nothing? What do you actually want, in concrete terms?
Many are "compensated" (in the form of software they didn't pay for), so the real question is that of disclosure (in which case many software vendors check the box in the most minimal way possible by including it as fine print during the install)
No, the question is not just disclosure. People have their bandwidth stolen, and sometimes internet access revoked due to this kind of fraud and misuse - disclosure wouldn’t solve that
Also, as a website owner, these residential proxies are a real pain. Tons and tons of abusive traffic, including people trying to exploit vulnerabilities and patently broken crawlers that send insane numbers of requests, and no real way to block it.
It's just nasty stuff. Intent matters, and if you're selling a service that's used only by the bad guys, you're a bad guy too. This is not some dual-use, maybe-we-should-accept-the-risks deal that you have with Tor.
If they're lucky. Sometimes people have their doors kicked in by armed police.
What I learn: proxy networks run by large corps are good, the true internet is bad. I understand that we are often talking about malware/worms etc. that enable this. Still, I find it disturbing to hear so much libertarian speech from the tech scene while, on the other hand, they feel very comfortable taking over state powers like policing efforts to save the world.
> Ones which you pay for and which are running legitimately, with the knowledge (and compensation) of those who run them.
The problem is, it is by default unethical to have residential users be exit nodes for VPNs - unless these users are lawyers or technical experts.
No matter what you do as a "residential proxy" company, you cannot prevent your service being used by CSAM peddlers, and thus you cannot prevent your exit nodes being the ones whose IP addresses show up when the FBI comes knocking.
Residential proxies are the only way to crawl and scrape. It's ironic for this article to come from the biggest scraping company that ever existed!
If you crawl at 1Hz per crawled IP, no reasonable server would suffer from this. It's the few bad apples (impatient people who don't rate limit) who ruin the internet for both users and hosters alike. And then there's Google.
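The 1 Hz-per-host discipline described above is easy to implement. A minimal sketch, with made-up names — real crawlers would also honor robots.txt and `Retry-After`:

```python
import time
from urllib.parse import urlsplit

class PoliteScheduler:
    """Per-host pacing: callers ask how long to sleep before the next request."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval  # seconds between hits to one host
        self.last = {}                    # host -> time its next request fires

    def delay_for(self, url, now=None):
        now = time.monotonic() if now is None else now
        host = urlsplit(url).hostname
        wait = max(0.0, self.last.get(host, float("-inf"))
                   + self.min_interval - now)
        self.last[host] = now + wait      # moment the request will actually go out
        return wait
```

Usage would be `time.sleep(sched.delay_for(url))` before each fetch; different hosts proceed at full speed, the same host never sees more than one request per `min_interval`.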
First of: Google has not once crashed one of our sites with GoogleBot. They have never tried to by-pass our caching and they are open and honest about their IP ranges, allowing us to rate-limit if needed.
The residential proxies are not needed, if you behave. My take is that you want to scrape stuff that site owners do not want to give you and you don't want to be told no or perhaps pay a license. That is the only case where I can see you needing a residential proxies.
>The residential proxies are not needed, if you behave
I'm starting to think that some users on Hacker News do not 'behave', or at least suspect they do not 'behave', and are providing an alibi for those who definitely do not.
The 'hacker' in Hacker News attracts not just hackers as in 'hacking together features' but also hackers as in 'illegitimately gaining access to servers/data'.
As far as I can tell, as a hacker that hacks features together, resi proxies are something the enemy uses. Whenever I boot up a server and get 1000 log in requests per second and requests for commonly exploited files from russian and chinese IPs, those come from resi IPs no doubt. There's 2 sides to this match, no more.
You can’t get much crawling done from published cloud IPs. Residential proxies are the only way to do most crawls today.
That said, I support Google working to shut these networks down, since they are almost universally bad.
It’s just a shame that there’s no where to go for legitimate crawling activities.
> You can’t get much crawling done from published cloud IPs.
Think about why that might be. I'm sorry, but if you legitimately need to crawl the net and do so from a cloud provider, your industry screwed you over with bad behaviour. Go get hosting with a company that cares about who their customers are; you're hanging out with a bad crowd.
what industry is that? Every industry is on the cloud.
No, no they really aren't, but I was thinking of the "scraping industry", in the sense that that's a thing. Getting hosting in smaller datacenters is simple enough, but you may need to manage your own hardware or VMs. Many will help you get your own IP ranges and ASN; that's going to go a long way if you don't want to get bundled in with the bad bots.
This differs obviously, but having an ASN in our case means that we can deal with you, contact you, and assume that you're better than random bot number 817.
Scraping isn’t an industry. There are legitimate and illegitimate scraping pursuits.
There are lots of healthy / productive businesses in the cloud and lots of scumbags, just like any enterprise.
I still have no idea about your point, by the way.
One thing about Google is that many anti-scraping services explicitly allow access to Google and maybe a couple of other search engines. Everybody else gets to enjoy CloudFlare captcha, even when crawling at reasonable speeds.
Rules For Thee but Not for Me
> many anti-scraping services explicitly allow access to Google and maybe a couple of other search engines.
because google (and the couple of other search engines) provide enough value to offset the crawler's resource consumption.
That's cool, but it's impossible for anyone to ever build a competitor that'd replace google without bypassing such services.
You say this like robots.txt doesn't exist.
it almost sounds like they’re saying the contents of robots.txt shouldn’t matter… because google exists? or something?
implying “robots.txt explicitly says i can’t scrape their site, well i want that data, so im directing my bot to take it anyway.”
so many things flat out ignore it in 2026 let's be real
Why are you scraping sites in the first place? What legitimate reason is there for you doing that?
Just today I wanted to get a list of locations of various art events around the city which are all located on the same website, but which does not provide a page with all events happening this month on a map. I need a single map to figure out what I want to visit based on distance I have to travel, unfortunately that's not an option - only option is to go through hundreds of items and hope whatever I picked is near me.
Do you think this is such a horrible thing to scrape? I can't do it manually since there are a few hundred locations. I could write some python script which uses Playwright to scrape things using my desktop browser in order to avoid CloudFlare. Or, which I am much more familiar with, I could write a python script that uses BeautifulSoup to extract all the relevant locations once for me. I would have been perfectly happy fetching 1 page/sec or even 1 page/2 seconds and would still be done within 20 minutes if only there was no anti-scraping protection.
Scraping is a perfectly legal activity, after all. Except thanks to overly-eager scraping bots and clueless/malicious people who run them there's very little chance for anyone trying to compete with Google or even do small scale scraping to make their life and life of local art enthusiasts easier. Google owns search. Google IS search and no competition is allowed, it seems.
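The kind of one-off script described above is genuinely tiny. A sketch of the parsing half, using only the stdlib `html.parser` in place of BeautifulSoup so it runs anywhere; the `event-venue` class name is an assumed selector, since the real site's markup is unknown:

```python
from html.parser import HTMLParser

class VenueExtractor(HTMLParser):
    """Collect the text inside any element with class="event-venue"."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # >0 while inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1             # nested element inside the venue div
        elif "event-venue" in classes:
            self.depth = 1              # entered the venue element

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

def extract_venue(html):
    parser = VenueExtractor()
    parser.feed(html)
    text = " ".join("".join(parser.chunks).split())  # collapse whitespace
    return text or None
```

The fetch loop would just call this on each event page, sleeping a second between requests — exactly the pace the comment proposes.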
If you want the data, why not contact the organisation with the website?
Why is hammering the everloving fuck out of their website okay?
1 request per second is nowhere even close to hammering a website.
They made the data available on the website already, there's no reason to contact them when you can just load it from their website.
Dunno, building a Google competitor? How do you think Google got started?
I've used change detection for in-stock alerts before, or event updates. Plenty of legitimate uses.
do we think a scraper should be allowed to take whatever means necessary to scrape a site if that site explicitly denies that scraper access?
if someone is abusing my site, and i block them in an attempt to stop that abuse, do we think that they are correct to tell me it doesn’t matter what i think and to use any methods they want to keep abusing it?
that seems wrong to me.
Saying the quiet part out loud...Shhhs
I'd still like the ability to just block a crawler by its IP range, but these days nope.
1 Hz is 86400 hits per day, or 600k hits per week. That's just one crawler.
Just checked my access log... 958k hits in a week from 622k unique addresses.
95% of it is fetching completely random links from the u-boot repository that I host. I blocked all of the GCP/AWS/Alibaba and of course Azure cloud IP ranges.
It's now almost all coming off "residential" and "mobile" IP address space from completely random places all around the world. I'm pretty sure my u-boot fork is not that popular. :-D
Every request is a new IP address, and available IP space of the crawler(s) is millions of addresses.
I don't host a popular repo. I host a bot attraction.
I’ve been enduring that exact same traffic pattern.
I used Anubis and a cookie redirect to cut the load on my Forgejo server by around 3 orders of magnitude: https://honeypot.net/2025/12/22/i-read-yann-espositos-blog.h...
Aha, that's where the anime girl is from. What sort of traffic was getting past that but still thwarted by the cookie tactic?
I guess the bots are all spoofing consumer browser UAs and just the slightest friction outside of well-known tooling will deter them completely.
Yep, that’s why that’s all over the place now. The cookie thing is more of a first line of defense. It turns away a lot of shoddy scrapers with nearly no resources on my side. Anubis knocks out almost all of the remainder.
> These efforts to help keep the broader digital ecosystem safe supplement the protections we have to safeguard Android users on certified devices. We ensured Google Play Protect, Android’s built-in security protection, automatically warns users and removes applications known to incorporate IPIDEA SDKs, and blocks any future install attempts.
Nice to see Google Play Protect actually serving a purpose for once.
Yeah, it serves the purpose of blocking this kind of proxy traffic that isn't in Google's personal best interests.
Only Google is allowed to scrape the web.
"Only Google is allowed to scrape the web."
If I'm not mistaken, the plaintiffs in the US v Google antitrust litigation in the DC Circuit tried to argue that website operators are biased toward allowing Google to crawl and against allowing other search engines to do the same
The Court rejected this argument because the plaintiffs did not present any evidence to support it
For someone who does not follow the web's history, how would one produce direct evidence that the bias exists?
> For someone who does not follow the web's history, how would one produce direct evidence that the bias exists
Take a bunch of websites, fetch their robots.txt file and check how many allow GoogleBot but not others?
Common Crawl provides gzipped robots.txt collections
Yup exactly. Google must be the only one allowed to scrape the web. Google can't have any other competition. Calling it in "user's best interest" is just like their other marketing cons: "play integrity for user's security" etc
Google does not use residential proxies.
This does nothing against your ability to scrape the web the Google way, AKA from your own assigned IP range, obeying robots.txt, and with a user agent that explicitly says what you're doing and gives website owners a way to opt out.
What Google doesn't want (and I don't think that's a bad thing) is competitors scraping the web in bad faith, without disclosing what they're doing to site owners and without giving them the ability to opt out.
If Google doesn't stop these proxies, unscrupulous parties will have a competitive advantage over Google, it's that simple. Then Google will have to decide between just giving up (unlikely) or becoming unscrupulous themselves.
> This does nothing against your ability to scrape the web the Google way
I thought that Google has access to significant portions of the internet that non-Google bots won’t have access to?
Their crawler has known IPs that get a white-glove treatment by every site with a paywall for example
This is demonstrably false by the success of many scrapers from AI companies.
LLMs aren't a good indicator of success here because an LLM trained on 80% of the data is just as good as one trained on 100%, assuming the type/category of data is distributed evenly. Proxies help when you do need to get access to 100% of the data including data behind social media loginwalls.
Have you got any proof of Google scraping from residential proxies users don't know about, rather than from their clearly labelled AS? Otherwise you're mixing entirely different things into one claim.
That's the whole point. Websites that try to block scraping attempts will let google scrape without any hurdle because of google's ads and search network. This gives google some advantage over new players because as a new name brand you are hardly going to convince a website to allow scraping even if your product may actually be more advantageous to the website (for example assume you made a search engine that doesn't suck like google, and aggregates links instead of copying content from your website).
Proxies in comparison can allow new players to have some playing chance. That said I doubt any legitimate & ethical business would use proxies.
I don't think parent post is claiming that Google is using other people's networks to scrape the web only that they have a strong incentive to keep other players from doing that.
No, there are other scrapers that Google doesn't block or interact with. You can even run scraping from GCP. This has nothing to do with "only Google is allowed to scrape". They even host apps which exist for scraping data, like https://play.google.com/store/apps/details?id=com.sociallead...
Does it also block unwanted traffic from Google apps or does it have a particular hatred for companies that interfere with Google's business model?
Play Protect blocks malicious apps, not network traffic, so no, it obviously doesn't interfere with Google's apps.
AFAIK it also left SmartTube (an alternative YouTube client) alone until the developer got pwned and the app trojanized with this kind of SDK, and the clean versions are AFAIK again being left alone. No guarantee that it won't change in the future, of course, but so far they seem to not be abusing it.
Does malicious mean interfering with Google's business model, or does it include intrusive advertising?
Malicious here means "most people who aren't trying to argue semantics or otherwise be smartasses about it would consider it malware". That's why the example I gave is a semi-popular piece of software that allows watching YouTube without ads without a premium subscription, i.e. at least in the case I observed, I don't believe this was weaponized against apps that interfere with their business model.
As for "intrusive advertising is malicious", see the second part of the first sentence.
malicious ≠ intrusive.
but intrusive advertising is malicious
My understanding is that routing through residential IPs is a part of the business of some VPN providers. I don't know how above board they are on this (as in notifying customers that this may happen, however buried in the usage agreement, or even allowing them to opt out).
But, my main point, is that the whole business is "on the up and up" vs some dark botnet.
Oxylabs sells proxies for scrapers, I suppose you can use the socks-proxy as a VPN, and they claim to use Honeygain.
Honeygain is a platform where people sell their residential internet connection and bandwidth to these companies for money.
For comparison Honeygain pays someone 10 cents per GB, and Oxylabs sells it for $8/GB.
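A quick sanity check on that spread, using the per-GB figures quoted in the comment above (figures are from the thread, not independently verified):

```python
# Per-GB figures quoted in the thread (illustrative, not verified).
buy_price = 0.10   # USD/GB Honeygain reportedly pays the user
sell_price = 8.00  # USD/GB Oxylabs reportedly charges its customers

markup = sell_price / buy_price
print(f"markup: {markup:.0f}x")  # 80x
```

Even allowing for infrastructure and payment overhead, the reseller keeps almost the entire per-gigabyte price.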
We need a better market. I'd sell for $7/GB.
That takes buying low and selling high to a whole new level
FTA
> While operators of residential proxies often extol the privacy and freedom of expression benefits of residential proxies, Google Threat Intelligence Group’s (GTIG) research shows that these proxies are overwhelmingly misused by bad actors
Google's definition of a "bad actor" is someone who wants to use Google without seeing the ads. Or Kagi. Or an AI other than Gemini.
Mullvad seems to be one of those VPN providers. [1] Though I very much doubt they would sneakily make end-users devices exit nodes. Though, as a historical side note, let's not forget Skype used to make users computers act as a relay as well during its more decentralized days.
[1] Using the website mentioned by user Rasbora https://news.ycombinator.com/item?id=46837806
> I don't know how above board they are on this
Saying you don't know something in the comments of an article that explains that thing is a bold strategy
so that only google and anthropic are allowed to scrape the web. No one else may have workarounds
Anyone could scrape the net; then modern scrapers came along with their shitty code and absolutely no respect. The reason so many of us block or throttle scrapers is that they misbehave. They don't back off, they try to bypass caches, and if they crash a site they don't adjust; they just pound it into the ground again when it's back. We managed to talk to one large AI company who didn't really want to fix anything, but told us they'd be fine with us just rate limiting them, as if we somehow owed them anything. They just get a stupidly low rps now, even though we'd let them go faster if they'd just fix their bot.
Some sites don't want you scraping, but it's their content, their rules. We don't really care, but we have to due to the number and quality of the bots we're seeing. This is in my mind a 100% self-imposed problem from the scrapers.
Exactly. This is just google building a "moat" around their shady business.
100%
I'm actually a little shocked seeing that there was a WebOS variant of the residential proxying SDK endpoint. Does that mean there might be a bit more unchecked malware lurking behind the scenes in the LG ecosystem?
Personally I'm surprised they didn't have a Samsung option.
I keep my brand new LG C5 totally disconnected from the internet and use my Apple TV for movie watching. I’m not going to trust a company like LG to secure their devices.
> trust a company like LG to secure their devices.
They have an interest in securing their devices so they can sell proxy service themselves.
Why would webOS App Store be any different than the iOS or Android App Store which also have monetization frameworks for bandwidth sharing.
Google shows a sample of the IOCs, but Google Trust Services has issued a number of the SSL certs for those domains that have not been revoked (yet?).
Only looking at the:
- a8d3b9e1f5c7024d6e0b7a2c9f1d83e5.com
- af4760df2c08896a9638e26e7dd20aae.com
- cfe47df26c8eaf0a7c136b50c703e173.com
Looks like a standard MD5-hash domain pattern, of which currently there are:
If you look at some of the others (not listed in Google's IOCs), they tend to follow a pattern with their SSL certs, e.g.:
- 0e6f931862947ad58bf3d1a0c5a6f91f.com
- 17e4435ad10c15887d1faea64ee7eac4.com
Would there be any reason any of these would be legitimate?
This was easy because it's a Chinese company.
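The hash-style naming in those IOCs is easy to screen for mechanically. A minimal sketch (the domain list is taken from the comments above; the regex encodes my assumption about the pattern: 32 lowercase hex characters followed by `.com`):

```python
import re

# 32 lowercase hex characters plus ".com", i.e. an MD5-digest-shaped label.
MD5_DOMAIN = re.compile(r"^[0-9a-f]{32}\.com$")

iocs = [
    "a8d3b9e1f5c7024d6e0b7a2c9f1d83e5.com",
    "af4760df2c08896a9638e26e7dd20aae.com",
    "cfe47df26c8eaf0a7c136b50c703e173.com",
    "0e6f931862947ad58bf3d1a0c5a6f91f.com",
    "17e4435ad10c15887d1faea64ee7eac4.com",
    "google.com",  # control: should not match
]

hash_domains = [d for d in iocs if MD5_DOMAIN.match(d)]
print(hash_domains)  # all five hash-style domains, control excluded
```

Of course a 32-hex label doesn't prove malice on its own, but combined with shared cert issuers and registration patterns it's a cheap first-pass filter.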
The largest companies in this space that do similar things (Oxylabs, Bright Data, etc.) use similar tactics but are based in different locations.
Bright Data = Israel, I think. Oxylabs = Lithuania, a child of NordVPN.
Why are they leaving Bright Data (aka Illuminati aka Hola VPN) untouched? They are doing this exact scheme on an industrial scale.
They have a robust KYC that appears to serve, at least in large part, as a way to stay off the shit list of companies with the resources to pursue recourse.
Source: I went through that process and ended up going a different route. The rep was refreshingly transparent about where they get the data and why they have the KYC process (aside from regulatory compliance).
Ended up going with a different provider who has been cheaper and very reliable, so no complaints.
Yeah, they make you do a Skype interview (or probably Zoom interview nowadays). You could call this KYC or collateral, depending on your view of the company. It does limit the nefariousness of their clientele but I doubt they do much, or any, monitoring of actual traffic after onboarding (not for compliance reasons, anyway).
I’ve certainly never been asked to do KYC with Luminati after using them for hundreds of terabytes over the years.
It’s not like I’m using some bigco email address or given them any other reason to skip KYC either.
They probably would if they would see your username here!
I think they should have requested KYC when I was complaining about being unable to log into gmail, but I’m not going to complain as long as the service works.
I don’t use Luminati for anything illegal though, so it’s possible they just have some super amazing abuse detection algorithms that know this.
They do KYC when you want to unblock certain domains.
Also not my experience, even though I’ve had to email them for whitelisting.
It might just be because my account is very old?
Maybe, or more likely you’re not trying to pull in content that is considered high risk to them, such as YouTube transcripts.
[flagged]
Since I was also tracking this proxy network as part of my side project, I wrote a short blog post and give access to IoCs for 16M+ proxy IPs that belong to this network: https://deviceandbrowserinfo.com/learning_zone/articles/insi...
Note that even after the disruption, I'm still able to route millions of requests/day through IP IDEA's network
Thanks google for saving us. I guess this is the equivalent of rival narcos fighting each other.
But would make for a much less interesting dramatic series. I bookmarked that link for the next time I have insomnia.
I've helped multiple people remove residential proxy malware that was turning their network into a brightdata exit node and they had no idea / did not consent to it. Why is google selectively targeting one provider while letting others operate freely?
You can check if your network is infected here: https://layer3intel.com/is-my-network-a-residential-proxy
It’s interesting that when Luminati, an Israeli company, does this, it’s fine.
When the Chinese do this? Very bad.
They are both bad. You are showing your own bias.
No, he is referencing Google going after the Chinese company, not the Israel based one. That does not mean there is bias with the commenter at all, just that the companies operate differently and are treated differently. The country of origin is important as Israel based companies are more integrated into the western business world, and tend to at least try to show an effort in keeping spam and other things off their platforms. Now I do agree that they are both bad companies that should not be allowed to operate the way they do. I would say the same thing about the other 1000 scrapers hitting websites everyday as well (including Google).
What they did not comment on directly is how many apps/games they actually removed from the Play Store along with the SDKs, which would be the actually interesting data.
FWIW, a couple of years ago I was involved in a court case where a subpoena was sent to Luminati to figure out whether or not a specific request had originated from their network. Luminati's lawyers replied that they do not keep any logs whatsoever, as they aren't required to do so under Israeli law.
Hard to imagine any serious anti-abuse efforts by Luminati if they don't monitor what their users are doing, but this is probably a deliberate effort to avoid potential liability arising from knowing what their users are doing.
Personally, I don’t think either of them are actually meaningfully bad. A bit naughty, maybe?
I do think the disparity in attention is fascinating. These new Chinese players have been getting nonstop press while everyone ignores the established giant.
I couldn't help but notice that they didn't list the affected apps, other than a couple domains for shady VPN providers...
We need more residential proxies, not less.
I've had enough of companies saying "you're connecting from an AWS IP address, therefore you aren't allowed in, or must buy enterprise licensing". Reddit is an example, which totally blocks all data to non-residential IPs.
I want exactly the same content visible no matter who you are or where you are connecting from, and a robust network of residential proxies is a stepping stone to achieving that.
If you look at the article, the network they disrupted pays software vendors per-download to sneakily turn their users into residential proxy endpoints. I'm sure that at least some of the time the user is technically agreeing to some wording buried in the ToS saying they consent to this, but it's certainly unethical. I wouldn't want to proxy traffic from random people through my home network, that's how you get legal threats from media companies or the police called to your house.
> that's how you get legal threats from media companies or the police called to your house.
Or residential proxies get so widespread that almost every house has a proxy in, and it becomes the new way the internet works - "for privacy, your data has been routed through someone else's connection at random".
> Or residential proxies get so widespread that almost every house has a proxy in, and it becomes the new way the internet works - "for privacy, your data has been routed through someone else's connection at random".
Is this a re-invention of tor, maybe I2P?
> Is this a re-invention of tor
in a way, yes - the weakness of tor is realistically its lack of widespread adoption. Tor traffic is identifiable and blockable due to the relatively small number of exit nodes (which also makes it dangerous to run exit nodes, as you become "liable").
Ingraining the ideas of tor into regular users' internet usage is what would prevent the internet from being controlled and blocked by any actor (except perhaps draconian gov't overreach, which, while it can happen, is harder in the west).
An IP address tumbler? To wit: playing the shell game to obstruct direct attribution.
They provide an SDK for mobile developers. Here is a video of how it works. [0] They don't even hide it.
[0] https://www.youtube.com/watch?v=1a9HLrwvUO4&t=15s
Of course they're pitching it like everything's above board, but from the article:
> While many residential proxy providers state that they source their IP addresses ethically, our analysis shows these claims are often incorrect or overstated. Many of the malicious applications we analyzed in our investigation did not disclose that they enrolled devices into the IPIDEA proxy network. Researchers have previously found uncertified and off-brand Android Open Source Project devices, such as television set top boxes, with hidden residential proxy payloads.
If popup ads that open the play store are ethical, this is ethical.
I love how it's the "evil" Open Source project devices and "other app stores" that are the problem, not the hundreds of spyware-ridden apps available for download from the Play Store. It would be interesting to know how many copies of the SDK were found and removed from their own platform.
I live in the UK and can't view a large portion of the internet without having to submit my ID to _every_ site serving anything deemed "not safe for the children". I had a question about a new piercing and couldn't get info on it from Reddit because of that. I try using a VPN and they're blocked too. Luckily, I work at a company selling proxies, so I've got free proxies whenever I want, but I shouldn't _need_ to use them.
I find it funny that companies like Reddit, who make their money entirely from content produced by users for free (which is also often sourced from other parts of the internet without permission), are so against their site being scraped that they have to objectively ruin the site for everyone using it. See the API changes and killing off of third party apps.
Obviously, it's mostly for advertising purposes, but they love to talk about the load scraping puts on their site, even suing AI companies and SerpApi for it. If it's truly that bad, just offer a free API for the scrapers to use - or even an API that works out just slightly cheaper than using proxies...
My ideal internet would look something like that, all content free and accessible to everyone.
> that they have to objectively ruin the site for everyone using it. See the API changes and killing off of third party apps.
Third party app users were a very small but vocal minority. The API changes didn't drop their traffic at all. In fact, it's only gone up since then.
The datacenter IP address blocks aren't just for scrapers, it's an anti-bot measure across the board. I don't spend much time on Reddit but even the few subreddits I visited were starting to become infiltrated by obvious bot accounts doing weird karma farming operations.
Even HN routinely gets AI posting bots. It's a common technique to generate upvote rings - Make the accounts post comments so they look real enough, have the bots randomly upvote things to hide activity, and then when someone buys upvotes you have a selection of the puppet accounts upvote the targeted story. Having a lot of IP addresses and generating fake activity is key to making this work, so there's a lot of incentive to do it.
I agree that write-actions should be protected, especially now when every other person online is a bot. As for read-actions, I'll continue to profit off those being protected too but I wouldn't be too bothered if something suddenly changed and all content across the internet was a lot easier to access programmatically. I think only harm can come from that data being restricted to the huge (nefarious) companies that can pay for that data or negotiate backroom deals.
Reddit's traffic is almost exclusively propaganda bots.
Have you considered that it’s because a new industry popped up that decided it was okay to slurp up the entire internet, repackage it, and resell it? Surely that couldn’t be why sites are trying to keep non humans out.
Fix your government.
Thanks lad. Will get right on it.
Scrapping First-past-the-Post is probably a good start.
Good luck!
> I live in the UK and can't view a large portion of the internet without having to submit my ID to _every_ site serving anything deemed "not safe the for the children".
Really? Because I live in the UK and I've never been asked for my ID for anything.
> I want exactly the same content visible no matter who you are or where you are connecting from
The reason those IP addresses get blocked is not because of "who" is connecting, but "what"
Traffic from datacenter address ranges to sites like Reddit is almost entirely bots and scrapers. They can put a tremendous load on your site because many will try to run their queries as fast as they can with as many IPs as they can get.
Blocking these IP addresses catches a few false positives, but it's an easy step to make botting and scraping a little more expensive. Residential proxies aren't all that expensive, but now there's a little line item bill that comes with their request volume that makes them think twice.
> We need more residential proxies, not less
Great, you can always volunteer your home IP address as a start. There are services that will pay you a nominal amount for it, even.
Okay. So what does ten million requests cost, then? Like... a dollar? Is it a dollar? Is it two dollars if they splurge?
Because if the deterrent here is a line item so small it shows up as 'miscellaneous vibes' on a balance sheet, that's not a barrier. That's a tip jar.
Residential proxies are often priced by Gigabyte and the pricing is several orders of magnitude higher than your estimate.
You can run one, something like ByteLixir, Traffmonetizer, Honeygain, Pawns, there are lots more, just google "share my internet for money"
What will you be proxying? Nobody knows! I haven't had the police at my house yet.
Seems a great way to say "fuck you" to companies that block IP addresses.
You may see a few more CAPTCHAs. If you have a dynamic IP address, not many.
How much can you make if you run all of them at the same time?
Doesn't the ISP detect them?
like $3 a month
and why would they
> I've had enough of companies saying "you're connecting from an AWS IP address
I run a honeypot and the amount of bot traffic coming from AWS is insane. It's like 80% before filtering, and it's 100% illegitimate.
> it's 100% illegitimate.
Based on what?
I think perhaps you merely meant to say that more than 99% of it is illegitimate?
Most of them abuse the ip pool attached to lambda from my experience.
The end game of that is no useful content being accessible without login, or needing some sort of other proof-of-legitimacy.
That's already the case (irrespective of residential proxies) because content only serves as bait for someone to hand over personal information (during signup/login) and then engage with ads.
Proxies actually help with that by facilitating mass account registration and scraping of the content without wasting a human's time "engaging" with ads.
Amazon.com now only shows you a few reviews. To see the rest you must login. Social media websites have long gated the carrots behind a login. Anandtech just took their ball and went home by going offline.
There's a company that pays you to keep their box connected to your residential router. I assume it sells residential proxy services, maybe also DDoS services, I don't know. It's aptly named Absurd Computing.
I'm reading reddit.com from a Tor node, they also have a .onion domain you could use.
Anyone know how to create a usable reddit account from the .onion domain?
I've tried it, and my account was shadowbanned a few hours after I created it. It's very obnoxious.
Reddit bots shadowban almost everyone who posts before they have enough comment karma. Nothing to do with Tor or VPNs.
I didn't try posting, I tried commenting.
Also, nevermind the tech companies building their own proxy networks, such as Find My or Amazon Sidewalk.
Agreed. And they do it with things people paid for, using our wifi data to build their "positioning DBs", which you can't block or turn off on your phone without rooting your own device.
How is Find My a proxy network?
In the literal sense. Your traffic is proxied through devices belonging to unwilling strangers.
By “your traffic” you mean device location reports? Or something else?
Yes. It's "edge routing" that happens to be restricted to a single operator.
The data that powers the app tracking your devices, shown on your devices, yes.
(What else?)
I don’t know. I wouldn’t have thought of myself as proxying other people’s traffic by carrying my iPhone around. (For one thing, it’s my own phone that initiates all the activity- it monitors for Apple devices, the devices don’t reach out to my phone.) I can see how you could frame it that way, though. I just thought they might be referring to something else that I didn’t know about.
I remain skeptical. I can understand how one might see it that way, but I think it's stretching the word proxy too far.
Devices on Apple’s Find My aren’t broadcasting anything like packets that get forwarded to a destination of their choosing. I would think that would be a necessity to call it “proxying”.
They’re just broadcasting basic information about themselves into the void. The phones report back what they’ve picked up.
That doesn’t fit the definition to me.
I absolutely don’t mind the fact that my phone is doing that. The amount of data is ridiculously minuscule. And it’s sort of a tit for tat thing. Yeah my phone does it, but so does theirs. So just like I may be helping you locate your AirTag, you would be helping me locate mine. Or any other device I own that shows up on Find My.
It’s a very close to a classic public good, with the only restriction being that you own a relevant device.
> aren’t broadcasting anything like packets that get forwarded to a destination of their choosing
Protocol insists the data only goes back to owner device or Apple server.
I still "run" a small ISP with a few thousand residential ips from my scraping days. The requirements are laughable and costs were negligible in the early 2000s.
This blog post comes from the company that used to promise "don't be evil", one that steals water for data centers from villages and towns via shady deals, whose whole premise is stealing other people's stuff, claiming it as their own, locking them out, and selling their data. Who made them the arbiter of the internet? No one!!!
They just stole this and get on their high horse to tell people how to use the internet? You can eff right off, Google.
[flagged]
Have you tried it? Every new account will be shadowbanned and if it's shared you often get blank page 429. None of this was true before the API shutdown.
That’s not my experience, using various VPNs, public networks, Cloudflare and Apple private relays. A captcha is common when logged out but that’s about it, I have not encountered any shadow bans. I create a new account each week.
>Every new account will be shadowbanned
That's not the same as "blocks all data to non-residential IP's"?
>if it's shared you often get blank page 429. None of this was true before the API shutdown.
See my other comment. I agree there's a non-zero amount of VPNs that are banned from reddit, but it's also not particularly hard to find a VPN that's not banned on reddit.
Probably not hard, but my poor little innocent VPS at Hetzner that I have had for years is denied, and that makes me sad.
Yes you do.
Private VPS for personal VPN in Netherlands (digital ocean), then Hungary (some small local DC) — both are blocked from day one.
> You've been blocked by network security. To continue, log in to your Reddit account or use your developer token. If you think you've been blocked by mistake, file a ticket below and we'll look into it.
Sounds like you just need to sign in or use the api?
I don’t have a reddit account; I open it few times a month from search results
Proton VPN sometimes (mostly?) has this issue too. It's a bit hit or miss there, IIRC, but I have definitely seen the last message of your comment.
Try browsing from any Mullvad vpn. You will be "blocked by network security"
I use mullvad regularly & visit reddit from that connection - it works. But! You have to sign-in.
That's just mullvad's IP pool being banned. The other VPN providers I use aren't banned, or at least are only intermittently banned that I can easily switch to another server.
... if you're logged out. Log in so they don't have to lump you in with every scraper you're sharing a subnet with.
I have never interacted with a reddit employee who wasn't actively gaslighting me about the platform. Do you even use the site? I talked to a PM recently who genuinely thought the phone app was something people liked.
They probably get paid by how many people believe their nonsense.
There are people who actively like it.
I don’t. But they 100% exist.
everything on Reddit is so locked down it's useless. even if you do get to post something useful, some basement-dwelling mod will block it for an arcane interpretation of one of the subreddit's 14 rules.
Have you tried using it logged out on a vpn? It is impossible.
there are several times where I've had to disable PIA to access reddit's login page
[flagged]
All of this sounds legal, so on what basis did they get them shut down?
I haven't looked at any court documents, but the WSJ article from Wednesday reported that "Last year, Google sued the anonymous operators of a network of more than 10 million internet-connected televisions, tablets and projectors, saying they had secretly pre-installed residential proxy software on them... an Ipidea spokeswoman acknowledged in an email that the company and its partners had engaged in “relatively aggressive market expansion strategies” and “conducted promotional activities in inappropriate venues (e.g., hacker forums)...”"
There was also a botnet, Kimwolf, that apparently leveraged an exploit to use the residential proxy service, so it may be related to Ipidea not shutting them down.
Google does much worse in Google-branded devices and apps, like the wifi location data harvesting.
neat, so let’s stop them too.
the answer is stop all the bad actors, not “well jimmy does it!”
How do you stop mobile proxies operating through similar nefarious business models... CGNAT prevents you from easily identifying the exit nodes.
Working with network operators.
Network operators have zero reason to care, they get paid per the GB for the bandwidth.
$5-9 a GB; it's an infinite money glitch, actually.
I'll betcha Google uses a lot of residential proxies themselves to scrape data and don't want competitors doing it.
I'll betcha you're scraping for Google simply by using Chrome.
The big players are not taken out. This is sand thrown at our faces.
> attackers can mask their malicious activity by hijacking these IP addresses.
Sounds like "malicious activity" == "scraping activities that don't come from Google"
This problem isn't going away until we find a better solution to scraping than using your IP address as a passport.
Of course brightdata doesn't get touched.
objectively good news. thanks, google.
I see Google is doing their best to stamp out the competition.
[dead]
[flagged]
The need for proxies in any legitimate context became obsolete with Starlink being so widespread. Throw up a few terminals and you have about 500-2k CGNAT IP addresses to do whatever you like with.
2k IPs is not enough to do most enterprise scale scraping. Starlink's entire ASN doesn't seem to have enough V4 addresses to handle it even.
The actual secret is to use IPv6 with varied source IPs in the same subnet, you get an insane number of IPs and 90% of anti-scraping software is not specialized enough to realize that any IP in a /64 is the same as a single IP in a /32 in IPv4.
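To illustrate why a single /64 defeats naive per-IP rate limiting: a /64 leaves 64 host bits free, i.e. 2^64 distinct source addresses. A minimal sketch using Python's stdlib `ipaddress` module (the prefix below is a documentation address standing in for your own, and this assumes your provider actually routes the whole /64 to you):

```python
import ipaddress
import random

def random_address_in_64(prefix: str) -> ipaddress.IPv6Address:
    """Return a random interface address inside a routed /64."""
    net = ipaddress.IPv6Network(prefix)
    if net.prefixlen != 64:
        raise ValueError("expected a /64")
    # The low 64 bits are free to vary: 2**64 possible source addresses.
    return net.network_address + random.getrandbits(64)

prefix = "2001:db8:1234:5678::/64"  # documentation prefix, stand-in for yours
for _ in range(3):
    print(random_address_in_64(prefix))
```

A defender keying limits on the full 128-bit address sees every request arrive from a "new" client; grouping counters by /64 (roughly the IPv6 analogue of one IPv4 address) closes the hole.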
> any IP in a /64 is the same as a single IP in a /32 in IPv4
This is very commonly true, but sadly not 100%. I'm suffering from a shared /64 that my VPS sits on, where other folks have sent out spam, so no more SMTP for me.
If they're CGNAT then unless Starlink actively provides assistance to block them it won't matter.
As someone who wants the internet to maintain as much anarchy as possible I think it would be nice to see a large ISP that actively rotated its customer IPv6 assignments on a tight schedule.