Just provide a full dump.zip plus incremental daily dumps and they won’t have to scrape?
Isn’t that an obvious solution? I mean, it’s public data, it’s out there, do you want it public or not?
Do you want it only on OpenAI and Google but nowhere else? If so, good luck with the piranhas.
The Wikimedia Foundation does just that, and still, their infrastructure is under stress because of AI scrapers.
Dumps or no dumps, these AI companies don’t care. They feel entitled to take, or outright steal, whatever they want.
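For reference, the dumps really are one HTTP request away. Here’s a minimal sketch that checks the size and age of the English Wikipedia articles dump without downloading it; the filename follows the usual naming on dumps.wikimedia.org and is assumed here, so check the index if it 404s.

    import urllib.request

    # Check the English Wikipedia articles dump without downloading it.
    # The filename is the conventional one from dumps.wikimedia.org and
    # is assumed here; adjust it if the index lists something else.
    URL = ("https://dumps.wikimedia.org/enwiki/latest/"
           "enwiki-latest-pages-articles.xml.bz2")

    req = urllib.request.Request(URL, method="HEAD")
    with urllib.request.urlopen(req, timeout=30) as resp:
        size_gb = int(resp.headers.get("Content-Length", 0)) / 1e9
        print("Last-Modified:", resp.headers.get("Last-Modified"))
        print(f"Size: {size_gb:.1f} GB")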
That’s crazy, it makes no sense: it takes as much bandwidth and processing power on the scraper’s side to process and use the data as it takes to serve it.
They also have an open API that makes scraping entirely unnecessary.
Here are the relevant quotes from the article you posted
“Scraping has become so prominent that our outgoing bandwidth has increased by 50% in 2024.”
“At least 65% of our most expensive requests (the ones that we can’t serve from our caching servers and which are served from the main databases instead) are performed by bots.”
“Over the past year, we saw a significant increase in the amount of scraper traffic, and also of related site-stability incidents: Site Reliability Engineers have had to enforce on a case-by-case basis rate limiting or banning of crawlers repeatedly to protect our infrastructure.”
And it’s Wikipedia! The entire data set is trained INTO the models already; it’s not like encyclopedic facts change that often to begin with!
The only thing I can imagine is that it’s part of a larger ecosystem issue: dumps and API access are so rare, and so untrustworthy, that the scrapers just scrape everything rather than take the time to save bandwidth by relying on dumps.
Maybe it’s a consequence of the 2023 API wars, where it became clear that data repositories would leverage their position as pools of knowledge to extract rent from search and AI; places like Wikipedia and other wikis and forums are getting hammered as a result of this war.
If the internet wasn’t becoming a warzone, there really wouldn’t be a need for more than one scraper to scrape a site. Even if the site was hostile, like Facebook, it would only need to be scraped once, and then the data could be shared efficiently over a torrent swarm.
The problem isn’t that the data is already public.
The problem is that the AI crawlers want to check on it every 5 minutes, even if you try to tell all crawlers that the file is updated daily, or that the file hasn’t been updated in a month.
AI crawlers don’t care about robots.txt or other helpful hints about what’s worth crawling and when it’s a good time to crawl again.
Yeah, but there wouldn’t be scrapers if the robots file just pointed to a dump file.
Then the scraper could just spot-check a few dozen random pages to verify the dump is actually up to date and complete, and then they’d know they don’t need to waste any time there and can move on.
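For what it’s worth, honoring those update hints is already a solved problem at the HTTP level. A minimal sketch, assuming a hypothetical dump URL: the client sends If-Modified-Since with the time of its last fetch, and a 304 reply means there is nothing new to download.

    import email.utils
    import os
    import urllib.error
    import urllib.request

    # Hypothetical dump location -- stands in for whatever a site's
    # robots.txt or documentation would point to.
    DUMP_URL = "https://example.org/dumps/full.zip"
    LOCAL = "full.zip"

    req = urllib.request.Request(DUMP_URL)
    if os.path.exists(LOCAL):
        # Tell the server when we last fetched the dump; a 304 reply means
        # "nothing changed", so there is no need to download anything again.
        last_fetch = os.path.getmtime(LOCAL)
        req.add_header("If-Modified-Since",
                       email.utils.formatdate(last_fetch, usegmt=True))

    try:
        with urllib.request.urlopen(req, timeout=60) as resp:
            with open(LOCAL, "wb") as f:
                f.write(resp.read())
        print("dump updated, re-import it")
    except urllib.error.HTTPError as err:
        if err.code == 304:
            print("dump unchanged, nothing to crawl")
        else:
            raise

The spot check is the same idea applied to a handful of random pages: compare what the live site serves against what the dump contains, and if they match, leave the site alone.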
Given that they already ignore robots.txt, I don’t think we can assume any sort of good manners on their part. These AI crawlers are like locusts, scouring and eating everything in their path.
Crawlers are expensive and annoying to run, not to mention unreliable, and they produce low-quality data. If there really were a site dump available, I don’t see why it would make sense to crawl the website, except to spot-check that the dump is actually complete. This used to be standard, and it came with open API access for all, before the Silicon Valley royals put the screws on everyone.
Dunno, I feel you’re giving way too much credit to these companies.
They have the resources. Why bother with a more proper solution when a single crawler works on every site they want?
Is there even a standard for providing site dumps? If not, every site could require a custom software solution to use its dump. And I can guarantee you no one will bother implementing any dump-checking logic.
If you have contrary examples I’d love to see some references or sources.
The internet came together to define the robots.txt standard; it could just as easily come up with a standard API for database dumps. But it chose war instead, ever since the 2023 API wars, and now we’re going to see all the small websites die while Facebook gets even more powerful.
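No such standard exists today, but robots.txt already carries Sitemap: lines, so a dump-discovery convention wouldn’t be a stretch. A sketch, with the Dump: directive entirely made up for illustration:

    import urllib.request

    # Purely hypothetical: no "Dump:" directive exists in any spec today.
    # robots.txt already carries "Sitemap:" lines, so this just imagines
    # the analogous convention for database dumps.
    def find_dump_url(site):
        with urllib.request.urlopen(f"{site}/robots.txt", timeout=30) as resp:
            robots = resp.read().decode("utf-8", "ignore")
        for line in robots.splitlines():
            if line.strip().lower().startswith("dump:"):
                return line.split(":", 1)[1].strip()
        return None  # nothing advertised -> fall back to polite crawling

    print(find_dump_url("https://example.org"))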
Well, there you have it. Although I still feel weird that it’s somehow “the internet” that’s supposed to solve a problem that’s fully caused by AI companies and their web crawlers.
If a crawler keeps spamming and breaking a site, I see it as nothing short of a DoS attack.
Not to mention that robots.txt is completely voluntary and, as far as I know, mostly ignored by these companies. So then what makes you think any of them are acting in good faith?
To me that is the core issue, and why your position feels so outlandish. It’s like having a bully at school who constantly takes your lunch, and your solution being: “Just bring them a lunch as well, maybe they’ll stop.”
The solution is breaking intellectual property and making sharing public data easy and efficient. A top-down imposition DESIGNED to crush the giants back down to the level playing field of the small players, into a system where cooperation empowers the small and places the burdens on the big, with the understanding that all public data is “our” data and nobody, including its custodian, should get between US and IT. Something designed by actually competent and clever politicians who will anticipate and counter all the dirty tricks big tech would try in order to regain the upper hand. I want big tech permanently losing, on a field designed to disadvantage anything that accumulates power.
My guess is that sociopathic “leaders” are burning their resources (funding and people) as fast as possible in the hopes that even a 1% advantage might be the thing that makes them the next billionaire rather than just another asshole nobody.
Spoiler for you bros: It will never be enough.
I wish I was still capable of the same belief in the goodness of others.
They don’t have to scrape, especially if robots.txt tells them not to.
Hey, she was wearing a miniskirt, she wanted it, right?
No no no, you don’t get to invoke grape imagery to defend copyright.
I know, it hurts when the human shields like wikipedia and the openwrt forums are getting hit, especially when they hand over the goods in dumps. But behind those human shields stand facebook, xitter, amazon, reddit and the rest of big tech garbage and I want tanks to run through them.
So go back to your drawing board and find a solution where the tech platform monopolists are made to relinquish our data back to us and the human shields also survive.
My own mother is prisoner in the Zuckerberg data hive and the only way she can get out is brute zucking force into facebook’s poop chute.
Luigi them.
Can’t use laws against them anyway…
I think the issue is that the scrapers are collecting text fully automatically, jumping from link to link like a search engine indexer.
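That’s the whole mechanism, roughly. A stripped-down sketch of that kind of link-following crawler (no robots.txt check, no caching, no rate limiting, which is exactly what makes it so abusive at scale):

    from html.parser import HTMLParser
    from urllib.parse import urljoin
    import urllib.request

    # A stripped-down indexer-style crawler: fetch a page, collect every
    # link, queue them, repeat. Note what's missing: robots.txt, caching,
    # rate limits -- which is exactly the behaviour being complained about.
    class LinkCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links.extend(v for k, v in attrs if k == "href" and v)

    def crawl(start_url, limit=10):
        queue, seen = [start_url], set()
        while queue and len(seen) < limit:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                with urllib.request.urlopen(url, timeout=10) as page:
                    html = page.read().decode("utf-8", "ignore")
            except Exception:
                continue
            collector = LinkCollector()
            collector.feed(html)
            queue.extend(urljoin(url, link) for link in collector.links)
        return seen

    print(crawl("https://example.org"))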