
How to Prevent AI From Scraping Your Website


Search engines such as Google use crawlers – otherwise known as bots – to tirelessly process millions of pages per day, creating a vast digital index which makes up their search results.

For the most part, this is a positive, as it means your content can appear in their search results and earn traffic.

However, it’s not just search engines that use these bots: anyone with the skills and infrastructure can run their own crawler.

The ethics of these bots vary, from SEO tools like Ahrefs collecting data for their reports to spam bots that harvest email addresses.

In recent years, a new, industrial-scale breed of bot has emerged: the AI scraper.

What are AI scrapers?

These bots, run by the companies behind popular tools such as ChatGPT and Google Gemini, go beyond collecting information to help with indexing.

Their objective is to scoop up your website’s data to then train their AI models for everything from providing chatbot answers to generating text or creating photorealistic images.

This has begun to cause some controversy across the industry, with some publishers not happy with perceived privacy and intellectual property violations from bots taking their content.

Getty Images and a group of artists are mounting legal challenges against AI image generators, whilst in 2023 Google was hit with a class-action suit over its use of AI-scraped data.

Some companies, such as Tumblr and WordPress owner Automattic, have seen the economic opportunity here and agreed deals that allow AI companies to use their data for training purposes.

Why stop AI bots scraping your website?

The biggest concern most publishers have regarding AI bots is the devaluation of their own content and a loss of traffic.

Once bots have your content, they can begin serving it within chatbot answers – often without providing credit to the source. This usually means that the user has less reason to ever visit the original place that content was published.

As mentioned above, there are also concerns surrounding intellectual property. If you publish anything particularly unique such as artwork or data that you have IP protections over, then you certainly don’t want AI serving this data as its own.

AI bots may also affect areas such as bandwidth. Every time someone accesses your website, data is transferred, which uses up bandwidth. AI bots with free rein to crawl a website may be a drain on resources if they become particularly aggressive and crawl content excessively, leading to slower loading times for legitimate visitors.

Should you block ChatGPT from scraping your website?

This all depends on your views on AI, and also the type of content that you are posting.

By allowing ChatGPT to access your content, you are helping to train the AI and improve future versions – which is great if you’re an AI fanatic like some members of the 20i team!

There is a risk, however, that your content may be served elsewhere without credit to you. Therefore, if you are writing particularly unique or insightful content which can only be found on your website, it is better to keep it that way.

It is also worth considering whether your website contains any sensitive information which, again, you would prefer to keep out of AI’s reach.

Blocking ChatGPT is becoming a more popular practice. As of September 2023, 26% of the top 100 websites in the world had blocks in place. News giant the New York Times is one of the many websites that block GPTBot via robots.txt.

How to stop AI bots from scraping your website

There are multiple methods you can use to try to deter AI bots from scraping your website, including some specific to each crawler.

Preventing ChatGPT from scraping your website is done in the same way as blocking most other mainstream bots.

ChatGPT’s developers, OpenAI, respect the robots.txt protocol. This means you can instruct their crawler, GPTBot, not to access your website by adding the following lines to your robots.txt file.

User-agent: GPTBot
Disallow: /

This tells GPTBot (the user-agent) not to crawl any page on your website. To block only specific pages or subfolders, replace the / with the path you want to exclude.
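
Before uploading a rule like this, it can be worth checking that it parses the way you expect. The sketch below uses the robots.txt parser in Python's standard library; `is_allowed` is a helper name made up for this example and example.com is a placeholder domain. It only shows how one parser reads the rule, and each crawler's own parser may differ in edge cases.

```python
from urllib.robotparser import RobotFileParser

# The same two lines shown above, as they would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""

def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# example.com is a placeholder; substitute your own domain.
print(is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/blog/"))     # False
print(is_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/blog/"))  # True
```

Only GPTBot is refused here. Crawlers that are not named in the file, such as Googlebot, are unaffected.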

Should you block Google AI from scraping your website?

Keen to avoid being blocked by as many websites as ChatGPT has been, Google has tried to appeal more to website owners by stating its commitment to transparency and showing a desire to provide website credit links as often as possible.

With that said, whether you allow Google AI to crawl your website still comes down to your overall view of AI.

Your data could be used to train AI tools to be more useful in the future. However, there remains the risk of your content becoming available without the need for people to visit your website.

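It’s important to know that Google-Extended is separate from Googlebot, the crawler behind Google Search, so blocking it poses no risk to your organic search rankings. For the same reason, it will not stop Google’s Search Generative Experience (SGE), which is part of Search, from using your website.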

How to stop Google AI from scraping your website

As with ChatGPT, Google’s AI tools can be prevented from scraping your website using the robots.txt file.

In late 2023, Google announced a new standalone product token known as Google-Extended. Placing the following in your robots.txt file tells Google not to use your website content for AI training purposes.

User-agent: Google-Extended
Disallow: /

This applies to both Bard (since renamed Gemini), Google’s conversational AI, and the Vertex AI generative APIs, which are used for building and deploying generative AI applications. Google-Extended is now part of Google’s crawler overview documentation.

The above snippet is designed to disallow the crawler from your entire website. If you would only like to block certain pages – for example those with sensitive or valuable data – then replace the / with the URL or directory of your choice.
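
As a rough sketch of what a partial block might look like, the example below lists two Disallow lines under the Google-Extended token and checks them with Python's standard-library robots.txt parser. The paths `/research/` and `/downloads/data.csv` and the domain example.com are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# /research/ and /downloads/data.csv are hypothetical paths for illustration.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /research/
Disallow: /downloads/data.csv
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for path in ("/research/report.html", "/downloads/data.csv", "/blog/"):
    # example.com is a placeholder domain.
    allowed = parser.can_fetch("Google-Extended", "https://example.com" + path)
    print(path, "allowed" if allowed else "blocked")
```

A directory rule such as `/research/` covers everything beneath it, while the rest of the site stays open to the token.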

Blocking other AI bots

If you’d rather keep your website copy away from AI, then on top of the “big two” OpenAI and Google, you may also want to consider Common Crawl.

Common Crawl is one of the largest public datasets used for AI training, with the models behind ChatGPT and many other large language models drawing on it. The dataset is collected by Common Crawl’s own crawler and hosted on Amazon’s S3 service. If the Common Crawl bot (otherwise known as CCBot) has your data, then there’s a chance that AI bots will have access to that information too.

Because of this, CCBot is the second most blocked AI bot.

As with GPTBot and Google-Extended, you can prevent CCBot from scraping your content by using the robots.txt exclusion protocol. Add the lines below into your robots.txt file to halt its crawling activities:

User-agent: CCBot
Disallow: /

Similarly, the crawler for Claude, another popular AI tool developed by Anthropic, can be blocked by adding the following to robots.txt:

User-agent: ClaudeBot
Disallow: /
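
The four rule groups shown so far can all live in the same robots.txt file, separated by blank lines. As a minimal sketch, the helper below (the name `build_robots_txt` is invented for this example) generates them from a list, which makes it easier to add a new crawler name later.

```python
# The user-agent tokens named in this article.
AI_CRAWLERS = ["GPTBot", "Google-Extended", "CCBot", "ClaudeBot"]

def build_robots_txt(user_agents, path="/"):
    """Return robots.txt text that disallows `path` for each user agent."""
    groups = [f"User-agent: {agent}\nDisallow: {path}" for agent in user_agents]
    # A blank line separates each group of rules.
    return "\n\n".join(groups) + "\n"

print(build_robots_txt(AI_CRAWLERS))
```

If your robots.txt already contains rules for search engines, append the generated groups rather than replacing the file.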

Whilst blocking the larger AI bots from crawling your content is relatively easy, new smaller bots are popping up all the time, which means that blocking them via robots.txt isn’t always the answer.

Other methods to restrict AI bot access to your content:

.htaccess

Bots can be blocked within your .htaccess file by using RewriteCond directives to forbid certain bots based on their User-Agent header. As with the robots.txt method though, this is only effective if you know the exact name of the bot.

Web Application Firewall (WAF)

Install and configure a WAF to filter website traffic. This will enable you to block requests from AI bots by targeting specific user agents or IP addresses. All websites hosted with 20i receive automatic protection by our free WAF.
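
Both the .htaccess and WAF approaches come down to inspecting the User-Agent header of each request and refusing the ones you recognise. The sketch below shows that idea as a small piece of Python WSGI middleware. It is an illustration of the matching logic rather than a replacement for a real WAF, and `block_ai_bots`, `site` and `status_for` are names invented for this example.

```python
# Lower-case substrings to look for in the User-Agent header.
# Google-Extended is left out on purpose: it is a robots.txt token only and,
# per Google's documentation, has no user-agent string of its own.
BLOCKED_AGENTS = ("gptbot", "ccbot", "claudebot")

def block_ai_bots(app, blocked=BLOCKED_AGENTS):
    """Wrap a WSGI app so requests from listed user agents receive a 403."""
    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(name in agent for name in blocked):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware

def site(environ, start_response):
    """Stand-in for the real website."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello, human visitor\n"]

def status_for(user_agent):
    """Send one fake request through the middleware and return its status."""
    statuses = []
    app = block_ai_bots(site)
    app({"HTTP_USER_AGENT": user_agent}, lambda status, headers: statuses.append(status))
    return statuses[0]

# Illustrative user-agent strings, not exact copies of what each bot sends.
print(status_for("Mozilla/5.0 (compatible; GPTBot/1.0)"))       # 403 Forbidden
print(status_for("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))  # 200 OK
```

Like the other user-agent methods, this only stops bots that announce themselves honestly.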

Use CAPTCHAs or Proof of Work

Implementing these can deter automated bots by requiring a human-like response or computational proof.
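
To give a feel for what "computational proof" means, here is a minimal hashcash-style sketch. The server hands out a challenge string, the visitor's browser must find a number (a nonce) whose hash starts with a set number of zeros, and the server verifies the answer with a single hash. The challenge format, the difficulty value and the function names are all assumptions made for this example, not a description of any particular product.

```python
import hashlib
from itertools import count

def is_valid(challenge: str, nonce: int, difficulty: int) -> bool:
    """Cheap server-side check: does the hash start with enough zeros?"""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)

def solve(challenge: str, difficulty: int) -> int:
    """Costly client-side search: try nonces until one passes the check."""
    return next(n for n in count() if is_valid(challenge, n, difficulty))

# "session-abc123" is a made-up challenge; a real one would be random per visit.
nonce = solve("session-abc123", difficulty=3)
print(is_valid("session-abc123", nonce, 3))  # True
```

A single visitor barely notices the delay, but a bot requesting thousands of pages has to repeat the search for every one of them.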

HTTP Authentication

This adds a username and password layer to your website. While not foolproof, and unsuitable for a lot of websites, it deters basic bots designed to navigate publicly accessible websites.

IP Blocking

IP blocking can feel like fighting a losing battle at times, as bots can easily switch IPs. However, it can still be effective. Monitor your server logs or create honeypot crawler traps to identify specific IPs which may be crawling your site excessively, and then block them. IPs can also be blocked with the 20iCDN.
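
As a starting point for monitoring, the sketch below counts requests per IP address in an access log and lists the addresses above a chosen threshold. It assumes the common Apache/Nginx log layout in which the client IP is the first field on each line. The sample lines, the IP addresses (taken from reserved documentation ranges) and the threshold are made up for illustration.

```python
from collections import Counter

# Made-up sample in the common "combined" access log layout.
SAMPLE_LOG = """\
203.0.113.7 - - [21/Apr/2024:16:37:01 +0000] "GET /blog/ HTTP/1.1" 200 5120 "-" "ExampleBot/1.0"
203.0.113.7 - - [21/Apr/2024:16:37:02 +0000] "GET /about/ HTTP/1.1" 200 2048 "-" "ExampleBot/1.0"
198.51.100.4 - - [21/Apr/2024:16:37:03 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0"
203.0.113.7 - - [21/Apr/2024:16:37:04 +0000] "GET /contact/ HTTP/1.1" 200 1024 "-" "ExampleBot/1.0"
"""

def heavy_hitters(log_text, threshold):
    """Return {ip: request_count} for IPs with at least `threshold` requests."""
    counts = Counter(
        line.split(" ", 1)[0]  # the client IP is the first field on each line
        for line in log_text.splitlines()
        if line.strip()
    )
    return {ip: n for ip, n in counts.items() if n >= threshold}

print(heavy_hitters(SAMPLE_LOG, threshold=3))  # {'203.0.113.7': 3}
```

On a real site you would read the log file from disk and pick a threshold that suits your normal traffic, then review the flagged addresses before blocking them.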

Legal Measures

This is an extreme measure, likely reserved for enterprise-level publishers, but if you feel that AI crawling is infringing your intellectual property rights, then seeking legal advice can be the best course of action.

Final thoughts

AI provides opportunities and challenges like the web has never seen before. However, by understanding the landscape and using the right strategies, you can ensure that you keep as much control of your content as possible.

What are your thoughts on AI using your content? Are you happy to help these bots learn and become more useful, or do you feel that it is a threat to content producers? Let us know in the comments below.

Want to know more about AI? Read our guide on using AI for writing website copy.


3 comments


publisher says:

April 21, 2024 at 4:37 pm

Publishers provided access to bots so their content can be listed on search engines. Now the Gen AI bots are misusing that access meant for search engines to steal content.

A lot of these Gen AI content scraper bots do not even identify themselves as bots and hide behind cloud providers IP addresses – AWS, GCP, Huawei Cloud etc. There needs to be an easy way to block all web traffic for cloud providers.

John Behling says:

April 4, 2024 at 6:59 pm

I am not a fan of AI. In the context of web scraping, it’s not so much artificial intelligence as it is accumulated intelligence. There is already an overabundance of shallow, superficial articles on the web either authored by someone with only a casual familiarity with the topic, or an AI-accumulated body of text scraped from those same posts. Propagating more and more low-value content makes it perpetually harder for the consumer to find valuable information hidden amongst the noise. Quality content takes good research and hard work. If a person is too lazy to put in the work, they should not post for the sake of posting. When I personally have published content, it has drawn on years of experience, learning, and research, in addition to many hours spent writing and revising until I felt the quality was the best possible. It irks me to think that all of my hard work can be so easily copied and plagiarized by some lazy person with an “Easy Button” (credit to Staples for that term). The output generated from the method is not typically subject to scrutiny and revision, because the new “author” lacks the in-depth knowledge to do so. AI content can be easy to spot, as it often lacks depth and good writing flow, and stated facts lack context and credibility. I’m all for a better Internet, but AI takes us in the wrong direction.

David Saunders says:

March 11, 2024 at 12:47 pm

Superb article. Well written and informative.

Thank you


© Copyright 2024 20i Ltd. All Rights Reserved.