Amazon Web Services is investigating Perplexity AI over potential violations of its terms of service. The inquiry centers on allegations that the AI search startup has been scraping websites that explicitly prohibit such access through the Robots Exclusion Protocol, a standard web mechanism for limiting automated access.
Web scraping is the practice of using bots to extract content and data from a website. After pulling a site's underlying HTML code and any data stored in its database, a scraper can replicate the entire site's content elsewhere.
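To make the mechanics concrete, here is a minimal sketch of a scraper written in Python using only the standard library; the URL and user-agent string are placeholders chosen for illustration, not details drawn from the investigation.

```python
# Minimal scraping sketch: fetch a page's HTML and pull out its links.
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class LinkExtractor(HTMLParser):
    """Collects every href found in an anchor tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


url = "https://example.com/"  # placeholder target
# Crawlers typically identify themselves with a User-Agent header; sites use
# this string, among other signals, to decide how to treat the request.
request = Request(url, headers={"User-Agent": "example-crawler/0.1"})
html = urlopen(request).read().decode("utf-8", errors="replace")

parser = LinkExtractor()
parser.feed(html)
print(f"Found {len(parser.links)} links on {url}")
```

A real scraper would follow those links recursively and store the extracted HTML or data, which is how a site's content ends up replicated elsewhere.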
Amazon Web Services Investigates Scraping Practices of Perplexity AI
A spokesperson for Amazon Web Services confirmed to WIRED that an investigation into Perplexity AI is underway. Earlier reports indicated that the startup, which is backed by the Jeff Bezos family fund and Nvidia and valued at $3 billion, allegedly relied on content scraped from websites despite restrictions set by the Robots Exclusion Protocol.
The Robots Exclusion Protocol works by placing a plain-text file, robots.txt, at the root of a domain to specify which pages should be off-limits to automated bots and crawlers. The protocol itself is not legally binding, but a site's terms of service typically are.
Although compliance is voluntary, many scrapers have historically respected the protocol. The Amazon spokesperson emphasized that customers of the company's cloud division must abide by robots.txt directives when crawling websites.
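As a sketch of what that looks like in practice, the snippet below parses an illustrative robots.txt file with Python's standard-library robotparser and checks whether a hypothetical crawler may fetch two URLs; the rules, crawler name, and domain are invented for the example.

```python
from urllib.robotparser import RobotFileParser

# An illustrative robots.txt, as it might be served at
# https://example.com/robots.txt. "User-agent: *" addresses every crawler;
# each Disallow line names a path bots are asked not to fetch.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A well-behaved crawler consults these rules before requesting a page.
print(parser.can_fetch("example-crawler", "https://example.com/private/report.html"))  # False
print(parser.can_fetch("example-crawler", "https://example.com/news/story.html"))      # True
```

The check is purely advisory: nothing stops a crawler from ignoring the file and fetching the disallowed pages anyway, which is why enforcement ultimately rests on terms of service rather than the protocol itself.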
The spokesperson noted that Amazon Web Services' terms of service prohibit customers from using its services for unlawful activities, and that customers are responsible for complying with those terms and all applicable laws.
Perplexity AI first came under scrutiny after an earlier report alleged that the startup had made unauthorized use of at least one of the publishing outlet's articles. Subsequent investigations corroborated those claims, uncovering additional instances of improper scraping and plagiarism tied to Perplexity's AI-driven search chatbot.
Condé Nast engineers blocked Perplexity's crawler across all of the company's websites using a robots.txt file, yet WIRED discovered a server at an undisclosed IP address (44.221.181.252) accessing its content. That server visited Condé Nast sites hundreds of times over the past three months, suggesting ongoing scraping activity.
Controversial Practices of Perplexity AI
The machine tied to Perplexity AI appears to be extensively scanning news websites that explicitly block bots from accessing their content. Representatives from The Guardian, Forbes, and The New York Times have all reported detecting the IP address linked to Perplexity's servers on multiple occasions.
WIRED traced the IP address to an Elastic Compute Cloud (EC2) instance hosted on Amazon Web Services. After WIRED inquired whether using Amazon Web Services infrastructure to scrape websites that prohibit such activity violated its terms of service, the company launched an investigation.
In response to WIRED's investigation, Perplexity CEO Aravind Srinivas asserted that the questions posed reflected a "fundamental misunderstanding" of how the company and the internet work.
He later told Fast Company that the IP address used to scrape Condé Nast websites and a test site was managed by a third-party company providing web crawling and indexing services.
He declined to name the company, citing a nondisclosure agreement. When asked whether he would instruct the third party to stop crawling WIRED, Srinivas said it was complicated.
Sara Platnick, a spokesperson for Perplexity, said the company had responded to Amazon's inquiries and described the investigation as routine. She clarified that Perplexity has not made any changes to its operations in response to Amazon's concerns.