  • tech360.tv

Reddit Updates Robots.txt to Combat Unauthorised Data Scraping by AI Startups

Reddit to update web standard to block automated website scraping. AI startups accused of plagiarising content from publishers. Reddit to update Robots Exclusion Protocol to prevent unauthorised data scraping.

Image: Reddit. Credit: REUTERS

Reddit's move comes after reports surfaced that AI startups were bypassing robots.txt rules to gather content for their systems without permission or proper credit.


The Robots Exclusion Protocol, commonly known as "robots.txt," is a widely accepted standard that tells automated crawlers which parts of a website they may access. Reddit plans to update its robots.txt file to prevent unauthorised data scraping. Additionally, the platform will continue to use rate limiting to control the number of requests from a single entity and will block unknown bots and crawlers from scraping data on its website.
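The two mechanisms described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the rules in `EXAMPLE_ROBOTS_TXT`, the bot name `ExampleArchiveBot`, the `example.com` URL and the limiter settings are hypothetical and do not reflect Reddit's actual configuration. The first function uses the standard library's `urllib.robotparser` to show how a compliant crawler would check robots.txt rules; the class shows one simple way a server might limit requests per client.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

# Hypothetical rules, invented for illustration; NOT Reddit's real robots.txt.
EXAMPLE_ROBOTS_TXT = """\
User-agent: ExampleArchiveBot
Allow: /

User-agent: *
Disallow: /
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = {}  # client id -> deque of recent request timestamps

    def allow(self, client_id: str, now: float) -> bool:
        hits = self._hits.setdefault(client_id, deque())
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: reject this request
        hits.append(now)
        return True


if __name__ == "__main__":
    url = "https://example.com/r/python/"
    print(may_crawl(EXAMPLE_ROBOTS_TXT, "ExampleArchiveBot", url))
    print(may_crawl(EXAMPLE_ROBOTS_TXT, "UnknownBot", url))

    limiter = SlidingWindowLimiter(max_requests=2, window_seconds=60.0)
    print([limiter.allow("client-a", t) for t in (0.0, 1.0, 2.0)])
```

Note that the robots.txt check runs on the crawler's side, so it only restrains crawlers that choose to honour it, whereas a rate limiter runs on the server and applies regardless of the crawler's cooperation.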


The issue of AI firms plagiarising content from publishers to create AI-generated summaries has been a growing concern. Publishers have been using robots.txt as a tool to prevent tech companies from using their content without permission. However, compliance with robots.txt is voluntary, and recent incidents have shown that some AI startups have bypassed this web standard.


A letter from content licensing startup TollBit to publishers highlighted the problem of AI firms circumventing robots.txt to scrape publisher sites. This was further confirmed by a Wired investigation, which revealed that AI search startup Perplexity had likely bypassed efforts to block its web crawler.


One notable case involved business media publisher Forbes, which accused Perplexity of plagiarising its investigative stories for use in generative AI systems without giving proper credit. These incidents have raised concerns about the ethical use of content by AI startups.


In response to these issues, Reddit has taken a proactive approach by updating its robots.txt file. The platform aims to protect the rights of publishers and ensure that content is used responsibly. However, Reddit has clarified that researchers and organisations like the Internet Archive will still have access to its content for non-commercial use.

 


Source: REUTERS

As technology advances and has a greater impact on our lives than ever before, being informed is the only way to keep up.  Through our product reviews and news articles, we want to be able to aid our readers in doing so. All of our reviews are carefully written, offer unique insights and critiques, and provide trustworthy recommendations. Our news stories are sourced from trustworthy sources, fact-checked by our team, and presented with the help of AI to make them easier to comprehend for our readers. If you notice any errors in our product reviews or news stories, please email us at editorial@tech360.tv.  Your input will be important in ensuring that our articles are accurate for all of our readers.
