Reddit Updates Robots.txt to Combat Unauthorised Data Scraping by AI Startups
- tech360.tv
- Jun 27, 2024
- 2 min read
- Reddit to update web standard to block automated website scraping
- AI startups accused of plagiarising content from publishers
- Reddit to update Robots Exclusion Protocol to prevent unauthorised data scraping

The move comes after reports surfaced that AI startups were bypassing the existing web standard to gather content for their AI systems without permission or proper credit.
The Robots Exclusion Protocol, commonly known as "robots.txt," is a widely accepted standard used to determine which parts of a website can be crawled. Reddit plans to update this protocol to prevent unauthorised data scraping. Additionally, the platform will maintain rate-limiting techniques to control the number of requests from a single entity and will block unknown bots and crawlers from scraping data on its website.
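For readers unfamiliar with the standard, the sketch below shows how a compliant crawler consults robots.txt before fetching a page, using Python's built-in urllib.robotparser. The directives are illustrative only, not Reddit's actual file, and the allow-listed archival user agent is a hypothetical example.

```python
from urllib import robotparser

# Illustrative robots.txt of the kind described in the article; these
# directives are hypothetical, not Reddit's actual file. One archival
# crawler is allow-listed while all other agents are blocked.
RULES = """\
User-agent: ia_archiver
Allow: /

User-agent: *
Disallow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(RULES.splitlines())

# A compliant crawler checks can_fetch() before requesting a page.
url = "https://www.reddit.com/r/technology/"
print(parser.can_fetch("ia_archiver", url))  # True: allow-listed agent
print(parser.can_fetch("SomeAIBot", url))    # False: blocked by default rule
```

Compliance with robots.txt is voluntary, since the file only signals intent, which is why Reddit is pairing the updated protocol with server-side measures such as rate limiting and the blocking of unknown bots.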
The issue of AI firms plagiarising content from publishers to create AI-generated summaries has been a growing concern. Publishers have been using robots.txt as a tool to prevent tech companies from using their content without permission. However, recent incidents have shown that some AI startups have found ways to bypass this web standard.
A letter from content licensing startup TollBit to publishers highlighted the problem of AI firms circumventing robots.txt to scrape publisher sites. This was further confirmed by a Wired investigation, which revealed that AI search startup Perplexity had likely bypassed efforts to block its web crawler.
One notable case involved business media publisher Forbes, which accused Perplexity of plagiarising its investigative stories for use in generative AI systems without giving proper credit. These incidents have raised concerns about the ethical use of content by AI startups.
In response to these issues, Reddit has taken a proactive approach by updating its web standard. The platform aims to protect the rights of publishers and ensure that content is used responsibly. However, Reddit has clarified that researchers and organisations like the Internet Archive will still have access to its content for non-commercial use.
Source: REUTERS