How to Prevent Web Scraping

- 30 mins

A guide to preventing web scraping

(or at least making it harder)


Essentially, hindering scraping means making it difficult for scripts and machines to get the data they want from your website, while not making it difficult for real users and search engines.

Unfortunately this is hard, and you will need to make trade-offs between preventing scraping and degrading the accessibility for real users and search engines.

In order to hinder scraping (also known as web scraping, screen scraping, web data mining, web harvesting, or web data extraction), it helps to know how these scrapers work and what prevents them from working well, and that is what this guide is about.

Generally, these scraper programs are written to extract specific information from your site, such as articles, search results, product details, or artist and album information. Usually, people scrape websites for specific data in order to reuse it on their own site (and make money out of your content!), to build alternative frontends for your site (such as mobile apps), or even just for private research or analysis purposes.

Essentially, there are various types of scraper, and each works differently: spiders and crawlers, shell scripts built around wget or curl, HTML parsers, screenscrapers driven by real or headless browsers, web scraping services, and plain old human copy-and-paste.

There is a lot of overlap between these different kinds of scraper, and many scrapers will behave similarly, even though they use different technologies and methods to get your content.

This collection of tips is mostly my own ideas, various difficulties that I’ve encountered while writing scrapers, and bits of information and ideas from around the interwebs.

## How to prevent scraping

Some general methods to detect and deter scrapers:

### Monitor your logs & traffic patterns; limit access if you see unusual activity

Check your logs regularly, and if you see unusual activity indicative of automated access (scrapers), such as many similar actions from the same IP address, block or limit access.

Specifically, some ideas:

- Rate limiting: only allow a limited number of actions (searches, page views) in a certain time window per IP address or per user, and show a captcha or block further requests beyond that.
- Detect unusual activity: many similar requests from one IP address, an excessive number of page views, pages requested in rapid succession or in a strictly sequential pattern, and requests that never fetch assets are all indicators of automated access.
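As a rough illustration of the rate-limiting idea, here is a minimal sketch of a per-IP sliding window. The names and thresholds are assumptions for illustration, and a real deployment would use shared storage such as Redis rather than an in-process dict:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 100   # beyond this, show a captcha or block

_recent = defaultdict(deque)    # ip -> timestamps of recent requests

def is_rate_limited(ip: str) -> bool:
    now = time.time()
    timestamps = _recent[ip]
    timestamps.append(now)
    # Drop timestamps that have fallen outside the window.
    while timestamps and now - timestamps[0] > WINDOW_SECONDS:
        timestamps.popleft()
    return len(timestamps) > MAX_REQUESTS_PER_WINDOW

# In a request handler (names are placeholders):
# if is_rate_limited(client_ip):
#     return show_captcha_or_block()
```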

### Require registration & login

Require account creation in order to view your content, if this is feasible for your site. This is a good deterrent for scrapers, but is also a good deterrent for real users.

In order to avoid scripts creating many accounts, you should:

- Require an email address for registration, and verify it by sending a confirmation link that must be opened to activate the account. Allow only one account per email address.
- Require a captcha to be solved during registration or account creation.

Requiring account creation to view content will drive users and search engines away; if you require account creation in order to view an article, users will go elsewhere.

### Block access from cloud hosting and scraping service IP addresses

Sometimes, scrapers will be run from web hosting services, such as Amazon Web Services or Google App Engine, or from VPSes. Limit access to your website (or show a captcha) for requests originating from the IP addresses used by such cloud hosting services. You can also block access from IP addresses used by scraping services.
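A rough sketch of such a check, using only the Python standard library. The CIDR blocks below are placeholders; real provider ranges are published by the providers themselves (for example AWS publishes ip-ranges.json) and change over time:

```python
import ipaddress

# Placeholder ranges for illustration only; substitute real, regularly
# refreshed cloud-provider and scraping-service ranges.
CLOUD_RANGES = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def is_cloud_ip(ip: str) -> bool:
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in CLOUD_RANGES)

# if is_cloud_ip(client_ip):
#     return show_captcha()   # or block / rate limit
```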

Similarly, you can also limit access from IP addresses used by proxy or VPN providers, as scrapers may use such proxy servers to avoid having their many requests detected.

Beware that by blocking access from proxy servers and VPNs, you will negatively affect real users.

### Make your error message nondescript if you do block

If you do block or limit access, you should ensure that you don’t tell the scraper what caused the block, thereby giving them clues as to how to fix their scraper. So a bad idea would be to show error pages with text like “Too many requests from your IP address, please try again later” or “Error, User Agent header not present”.

Instead, show a friendly error message that doesn’t tell the scraper what caused it. Something like “Sorry, something went wrong. You can contact support via helpdesk@example.com, should the problem persist” is much better.

This is also a lot more user friendly for real users, should they ever see such an error page. You should also consider showing a captcha for subsequent requests instead of a hard block, in case a real user sees the error message, so that you don’t lock out legitimate users and force them to contact you.

### Use captchas if you suspect that your website is being accessed by a scraper

Captchas (“Completely Automated Public Turing test to tell Computers and Humans Apart”) are very effective at stopping scrapers. Unfortunately, they are also very effective at irritating users.

As such, they are useful when you suspect a possible scraper, and want to stop the scraping, without also blocking access in case it isn’t a scraper but a real user. You might want to consider showing a captcha before allowing access to the content if you suspect a scraper.

Things to be aware of when using captchas:

- Don’t roll your own; use something like Google’s reCAPTCHA, which is both harder to break and more user friendly than anything homegrown.
- Don’t include the solution to the captcha in the HTML markup itself, or a scraper can simply read it out of the page.
- Captcha-solving services exist, where humans solve captchas in bulk for a small fee, so captchas are a hurdle rather than a guarantee.
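If you do use reCAPTCHA, the server-side verification step looks roughly like this. This is a sketch that assumes the third-party requests package and a secret key of your own; the siteverify endpoint and the secret/response parameters are Google’s documented verification API:

```python
import requests

RECAPTCHA_SECRET = "your-secret-key"   # assumption: your own reCAPTCHA secret
VERIFY_URL = "https://www.google.com/recaptcha/api/siteverify"

def captcha_passed(captcha_response: str, client_ip: str) -> bool:
    """Verify the token the reCAPTCHA widget posted from the suspicious client."""
    result = requests.post(VERIFY_URL, data={
        "secret": RECAPTCHA_SECRET,
        "response": captcha_response,
        "remoteip": client_ip,
    }, timeout=10).json()
    return bool(result.get("success"))
```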

### Serve your text content as an image

You can render text into an image server-side, and serve that to be displayed, which will hinder simple scrapers extracting text.

However, this is bad for screen readers, search engines, performance, and pretty much everything else. It’s also illegal in some places (due to accessibility laws, e.g. the Americans with Disabilities Act), and it’s easy to circumvent with some OCR, so don’t do it.

You can do something similar with CSS sprites, but that suffers from the same problems.

### Don’t expose your complete dataset

If feasible, don’t provide a way for a script or bot to get all of your dataset. As an example: you have a news site with lots of individual articles. You could make those articles accessible only by searching for them via the on-site search; if you don’t list all the articles and their URLs anywhere, they can only be reached through the search feature. This means that a script wanting to get all the articles off your site would have to search for every possible phrase which might appear in your articles in order to find them all, which will be time-consuming, horribly inefficient, and will hopefully make the scraper give up.

This will be ineffective if:

- The bot or script doesn’t want or need the full dataset anyway.
- Your articles are served from URLs that follow an enumerable pattern, such as example.com/article.php?articleId=12345, so a scraper can simply iterate over all the article IDs and request each one.
- There is some other way to eventually find all the articles, such as a sitemap or a search engine’s index of your site.

### Don’t expose your APIs, endpoints, and similar things

Make sure you don’t expose any APIs, even unintentionally. For example, if you are using AJAX or network requests from within Adobe Flash or Java Applets (God forbid!) to load your data, it is trivial to look at the network requests from the page, figure out where those requests are going, and then reverse engineer and use those endpoints in a scraper program. Make sure you obfuscate your endpoints and make them hard for others to use, as described in the obfuscation section below.

To deter HTML parsers and scrapers:

Since HTML parsers work by extracting content from pages based on identifiable patterns in the HTML, we can intentionally change those patterns in order to break these scrapers, or even screw with them. Most of these tips also apply to other scrapers, like spiders and screenscrapers, too.

### Frequently change your HTML

Scrapers which process HTML directly do so by extracting contents from specific, identifiable parts of your HTML page. For example: if all pages on your website have a div with an id of article-content which contains the text of the article, then it is trivial to write a script that visits all the article pages on your site and extracts the text of the article-content div on each one. Voilà, the scraper has all the articles from your site in a format that can be reused elsewhere.

If you change the HTML and the structure of your pages frequently, such scrapers will no longer work.
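To make the fragility concrete, here is roughly what such a scraper looks like. This is a sketch that assumes the third-party requests and beautifulsoup4 packages; the URL is hypothetical, and the article-content id mirrors the example above:

```python
import requests
from bs4 import BeautifulSoup

def scrape_article(url: str) -> str:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    # Relies entirely on the stable id "article-content"; rename or
    # restructure that element and this line silently breaks.
    return soup.find(id="article-content").get_text(strip=True)

print(scrape_article("https://example.com/articles/some-article"))
```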

Things to be aware of:

- Doing this well is time-consuming and hard to maintain, and it can interfere with caching.
- Clever scrapers can still locate the desired content by inferring where it is, for example by grabbing the largest block of text on the page, so changing ids and class names alone may not be enough.

Essentially, make sure that it is not easy for a script to find the actual, desired content for every similar page.

See also How to prevent crawlers depending on XPath from getting page contents for details on how this can be implemented in PHP.
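The linked answer covers a PHP implementation; as a language-agnostic illustration of the same idea, here is a rough Python sketch that derives class names from a server-side secret so the markup changes every day. All names here (the secret, the helper, the logical class names) are assumptions for illustration:

```python
import datetime
import hashlib

SECRET = "keep-this-server-side"   # assumption: a private value only you know

def obfuscated_class(logical_name: str) -> str:
    """Derive a CSS class name that changes every day but is stable within a day."""
    today = datetime.date.today().isoformat()
    digest = hashlib.sha256(f"{SECRET}:{logical_name}:{today}".encode()).hexdigest()
    return "c" + digest[:10]

# In a (hypothetical) template helper and CSS generator, always go through
# obfuscated_class("article-content") instead of hard-coding the class name,
# so the markup and stylesheet change together while templates stay readable.
```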

### Change your HTML based on the user’s location

This is sort of similar to the previous tip. If you serve different HTML based on your user’s location / country (determined by IP address), this may break scrapers which are delivered to users. For example, if someone is writing a mobile app which scrapes data from your site, it will work fine initially, but break when it’s actually distributed to users, as those users may be in a different country, and thus get different HTML, which the embedded scraper was not designed to consume.

### Frequently change your HTML, actively screw with the scrapers by doing so!

An example: You have a search feature on your website, located at example.com/search?query=somesearchquery, which returns the following HTML:

<div class="search-result">
  <h3 class="search-result-title">Stack Overflow has become the world's most popular programming Q & A website</h3>
  <p class="search-result-excerpt">The website Stack Overflow has now become the most popular programming Q & A website, with 10 million questions and many users, which...</p>
  <a class"search-result-link" href="/stories/stack-overflow-has-become-the-most-popular">Read more</a>
</div>
(And so on, lots more identically structured divs with search results)

As you may have guessed, this is easy to scrape: all a scraper needs to do is hit the search URL with a query and extract the desired data from the returned HTML. In addition to periodically changing the HTML as described above, you could also leave the old markup in with the old ids and classes, hide it with CSS, and fill it with fake data, thereby poisoning the scraper. Here’s how the search results page could be changed:

<div class="the-real-search-result">
  <h3 class="the-real-search-result-title">Stack Overflow has become the world's most popular programming Q & A website</h3>
  <p class="the-real-search-result-excerpt">The website Stack Overflow has now become the most popular programming Q & A website, with 10 million questions and many users, which...</p>
  <a class"the-real-search-result-link" href="/stories/stack-overflow-has-become-the-most-popular">Read more</a>
</div>

<div class="search-result" style="display:none">
  <h3 class="search-result-title">Visit example.com now, for all the latest Stack Overflow related news !</h3>
  <p class="search-result-excerpt">EXAMPLE.COM IS SO AWESOME, VISIT NOW! (Real users of your site will never see this, only the scrapers will.)</p>
  <a class"search-result-link" href="http://example.com/">Visit Now !</a>
</div>
(More real search results follow)

This will mean that scrapers written to extract data from the HTML based on classes or IDs will continue to seemingly work, but they will get fake data or even ads, data which real users will never see, as they’re hidden with CSS.

### Screw with the scraper: insert fake, invisible honeypot data into your page

Adding on to the previous example, you can add invisible honeypot items to your HTML to catch scrapers. An example which could be added to the previously described search results page:

<div class="search-result" style="display:none">
  <h3 class="search-result-title">This search result is here to prevent scraping</h3>
  <p class="search-result-excerpt">If you're a human and see this, please ignore it. If you're a scraper, please click the link below :-)
  Note that clicking the link below will block access to this site for 24 hours.</p>
  <a class"search-result-link" href="/scrapertrap/scrapertrap.php">I'm a scraper !</a>
</div>
(The actual, real, search results follow.)

A scraper written to get all the search results will pick this up, just like any of the other, real search results on the page, and visit the link, looking for the desired content. A real human will never even see it in the first place (since it’s hidden with CSS), and won’t visit the link. A genuine and desirable spider such as Google’s will not visit the link either, because you disallowed /scrapertrap/ in your robots.txt (don’t forget this!).

You can make your scrapertrap.php do something like block access for the IP address that visited it or force a captcha for all subsequent requests from that IP.
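The article names scrapertrap.php; the following is just a rough, framework-agnostic sketch of the same logic in Python, with hypothetical helper names and a simplified in-memory blocklist:

```python
import time

BLOCK_SECONDS = 24 * 60 * 60
_blocked_until = {}                 # ip -> unix timestamp when the block expires

def scraper_trap_hit(ip: str) -> None:
    """Call this from the handler for the honeypot URL (e.g. /scrapertrap/...)."""
    _blocked_until[ip] = time.time() + BLOCK_SECONDS

def is_blocked(ip: str) -> bool:
    return _blocked_until.get(ip, 0) > time.time()

# And in robots.txt, so legitimate crawlers never touch the trap:
#   User-agent: *
#   Disallow: /scrapertrap/
```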

### Serve fake and useless data if you detect a scraper

If you detect what is obviously a scraper, you can serve up fake and useless data; this will corrupt the data the scraper gets from your website. You should also make it impossible to distinguish such fake data from real data, so that scrapers don’t know that they’re being screwed with.

As an example: if you have a news website and you detect a scraper, instead of blocking access, just serve up fake, randomly generated articles; this will poison the data the scraper gets. If you make your fake data or articles indistinguishable from the real thing, you’ll make it hard for scrapers to get what they want, namely the actual, real articles.
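A minimal sketch of the idea, with hypothetical helpers standing in for your real rendering and data access. The key point is that the fake article goes through the same template as a real one, so the output looks identical:

```python
import random

FAKE_TITLES = ["Local man does thing", "Study finds surprising result", "New gadget announced"]

def render_article(article: dict) -> str:             # stand-in for your real rendering
    return f"<h1>{article['title']}</h1><p>{article['body']}</p>"

def load_real_article(article_id: str) -> dict:       # stand-in for your real data access
    return {"title": f"Real article {article_id}", "body": "The real content."}

def fake_article() -> dict:
    """Build a plausible-looking but worthless article for flagged clients."""
    return {
        "title": random.choice(FAKE_TITLES),
        "body": "Randomly generated filler text goes here...",
    }

def article_response(article_id: str, client_is_scraper: bool) -> str:
    if client_is_scraper:
        return render_article(fake_article())          # same template as real articles
    return render_article(load_real_article(article_id))
```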

### Don’t accept requests if the User Agent is empty / missing

Often, lazily written scrapers will not send a User Agent header with their request, whereas all browsers as well as search engine spiders will.

If you get a request where the User Agent header is not present, you can show a captcha, or simply block or limit access. (Or serve fake data as described above, or something else.)

It’s trivial to spoof, but as a measure against poorly written scrapers it is worth implementing.

### Don’t accept requests if the User Agent is a common scraper one; blacklist ones used by scrapers

In some cases, scrapers will use a User Agent which no real browser or search engine spider uses, such as the default User Agent strings of tools and libraries like curl, wget, python-urllib, python-requests, or Scrapy.

If you find that a specific User Agent string is used by scrapers on your site, and it is not used by real browsers or legitimate spiders, you can also add it to your blacklist.
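A sketch covering both of these checks, rejecting requests with no User Agent at all as well as ones matching known scraping tools. The substrings listed are common default identifiers, not an exhaustive blacklist, and the handler snippet at the end is illustrative:

```python
from typing import Optional

# Common default identifiers sent by scraping tools and libraries; not exhaustive.
BLACKLISTED_UA_SUBSTRINGS = ["curl/", "wget/", "python-requests/", "python-urllib/", "scrapy/"]

def user_agent_suspicious(user_agent: Optional[str]) -> bool:
    if not user_agent:                     # header missing or empty
        return True
    ua = user_agent.lower()
    return any(bad in ua for bad in BLACKLISTED_UA_SUBSTRINGS)

# if user_agent_suspicious(request.headers.get("User-Agent")):
#     return show_captcha_or_block()       # or serve fake data instead
```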

### Check the Referer header

Adding on to the previous item, you can also check the Referer header (yes, it’s Referer, not Referrer), as lazily written scrapers may not send it, or may always send the same thing (sometimes “google.com”). As an example, if the user comes to an article page from an on-site search results page, check that the Referer header is present and points to that search results page.

Beware that:

- Real browsers don’t always send it either; privacy tools, browser extensions, and referrer policies can strip it, so legitimate users may arrive without a Referer.
- It is trivial for a scraper to spoof, since the scraper controls all of its own request headers.

Again, as an additional measure against poorly written scrapers it may be worth implementing.

### If it doesn’t request assets (CSS, images), it’s not a real browser

A real browser will (almost always) request and download assets such as images and CSS. HTML parsers and scrapers won’t as they are only interested in the actual pages and their content.

You could log requests to your assets, and if you see lots of requests for only the HTML, it may be a scraper.
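A rough sketch of that logging idea: count page (HTML) requests versus asset requests per IP and flag IPs that only ever fetch HTML. The thresholds, extension list, and in-memory storage are simplified assumptions:

```python
from collections import defaultdict

ASSET_EXTENSIONS = (".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".svg", ".woff2")

_page_hits = defaultdict(int)
_asset_hits = defaultdict(int)

def record_request(ip: str, path: str) -> None:
    if path.lower().endswith(ASSET_EXTENSIONS):
        _asset_hits[ip] += 1
    else:
        _page_hits[ip] += 1

def looks_like_scraper(ip: str) -> bool:
    # Many page views but (almost) no asset requests is suspicious.
    return _page_hits[ip] > 50 and _asset_hits[ip] == 0
```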

Beware that search engine bots, ancient mobile devices, screen readers and misconfigured devices may not request assets either.

### Use and require cookies; use them to track user and scraper actions

You can require cookies to be enabled in order to view your website. This will deter inexperienced and newbie scraper writers, however it is easy for a scraper to send cookies. If you do use and require them, you can track user and scraper actions with them, and thus implement rate-limiting, blocking, or showing captchas on a per-user instead of a per-IP basis.

For example: when the user performs a search, set a unique identifying cookie. When the result pages are viewed, verify that cookie. If the user opens all of the search results (you can tell from the cookie), then it’s probably a scraper.
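A minimal sketch of that search-cookie example using only the standard library. The cookie is signed so a scraper can’t simply forge its own identifiers; the names, storage, and “opened everything” threshold are assumptions for illustration:

```python
import hashlib
import hmac
import secrets
from collections import defaultdict

COOKIE_SECRET = b"replace-with-a-real-secret"
_results_viewed = defaultdict(set)           # search_id -> result URLs opened so far

def _sign(search_id: str) -> str:
    return hmac.new(COOKIE_SECRET, search_id.encode(), hashlib.sha256).hexdigest()

def new_search_cookie() -> str:
    """Set this cookie when the user performs a search."""
    search_id = secrets.token_hex(8)
    return f"{search_id}.{_sign(search_id)}"

def record_result_view(cookie, result_url: str, total_results: int) -> bool:
    """Return True if this search cookie has now opened suspiciously many results."""
    if not cookie or cookie.count(".") != 1:
        return True                           # missing or malformed cookie
    search_id, sig = cookie.split(".")
    if not hmac.compare_digest(sig, _sign(search_id)):
        return True                           # tampered cookie
    _results_viewed[search_id].add(result_url)
    return len(_results_viewed[search_id]) >= total_results
```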

Using cookies may be ineffective, as scrapers can send the cookies with their requests too, and discard them as needed. You will also prevent access for real users who have cookies disabled, if your site only works with cookies.

Note that if you use JavaScript to set and retrieve the cookie, you’ll block scrapers which don’t run JavaScript, since they can’t retrieve and send the cookie with their request.

### Use JavaScript + Ajax to load your content

You could use JavaScript + AJAX to load your content after the page itself loads. This will make the content inaccessible to HTML parsers which do not run JavaScript. This is often an effective deterrent to newbie and inexperienced programmers writing scrapers.

Be aware of:

- Search engines and other legitimate crawlers may not run JavaScript, so content loaded this way can hurt your indexing and accessibility.
- Real users with JavaScript disabled won’t see your content either, and performance will be worse for everyone.
- Scrapers built on real or headless browsers do run JavaScript, so this only stops the simpler HTML-parsing scrapers.

### Obfuscate your markup, network requests from scripts, and everything else

If you use Ajax and JavaScript to load your data, obfuscate the data which is transferred. As an example, you could encode your data on the server (with something as simple as base64, or something more complex with multiple layers of obfuscation, bit-shifting, and maybe even encryption), and then decode and display it on the client, after fetching it via Ajax. This will mean that someone inspecting network traffic will not immediately see how your page works and loads data, and it will be tougher for someone to directly request data from your endpoints, as they will have to reverse-engineer your descrambling algorithm.
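A toy sketch of the server-side half of this: XOR the JSON payload with a key and base64-encode it before sending it over Ajax, then reverse the process in the page’s JavaScript before rendering. The key and scheme here are arbitrary assumptions, and this is obfuscation, not security:

```python
import base64
import json

OBFUSCATION_KEY = b"not-a-real-secret"

def obfuscate_payload(data: dict) -> str:
    raw = json.dumps(data).encode("utf-8")
    # XOR each byte with the repeating key, then base64 so it survives transport.
    xored = bytes(b ^ OBFUSCATION_KEY[i % len(OBFUSCATION_KEY)] for i, b in enumerate(raw))
    return base64.b64encode(xored).decode("ascii")

# The Ajax endpoint would return obfuscate_payload({...}); the client-side
# JavaScript would base64-decode and XOR with the same key before rendering.
```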

There are several disadvantages to doing something like this, though:

- It’s extra work to implement, maintain, and debug.
- It adds a performance cost, since the client has to decode the data before displaying it.
- It does nothing against scrapers that drive a real or headless browser, which simply see the final, decoded page.
- A determined scraper author can still reverse-engineer the descrambling logic, since it ships to the client.

Non-Technical:

### Your hosting provider may provide bot and scraper protection

For example, CloudFlare provides anti-bot and anti-scraping protection which you just need to enable, and so does AWS. There is also mod_evasive, an Apache module which lets you implement rate-limiting easily.

### Tell people not to scrape, and some will respect it

You should tell people not to scrape your site, e.g. in your Terms and Conditions or Terms of Service. Some people will actually respect that, and won’t scrape data from your website without permission.

### Find a lawyer

They know how to deal with copyright infringement, and can send a cease-and-desist letter. The DMCA is also helpful in this regard.

This is the approach Stack Overflow and Stack Exchange use.

### Make your data available, provide an API

This may seem counterproductive, but you could make your data easily available and require attribution and a link back to your site. Maybe even charge $$$ for it.

Again, Stack Exchange provides an API, but with attribution required.

Miscellaneous:

### What’s the most effective way?

In my experience of writing scrapers and helping people write scrapers on Stack Overflow, the most effective methods are:

- Changing your HTML frequently.
- Honeypots and fake data.
- Using obfuscated JavaScript, Ajax, and cookies.
- Rate limiting, scraper detection, and blocking.


Good luck on the perilous journey of protecting your content…

Please comment your views/suggestions below.

Sunil Tatipelly

Supposedly Engineer. Major Geek. Food Freak. Proud IITian. Quirkyalone.
