Sunday, September 10, 2017

How To Scrape Any Website And Not Feel Bad About It

In my world, screen scraping is essential to making a living. For years, I've built applications for clients that automate tasks on major websites and, in most cases, if not all, scraping data from these sites was a big part of how the application worked. If screen scraping somehow became illegal, my business would fold. But that won't ever happen.

Periodically I read about this hilarious idea that a website has the power to somehow block scrapers. It's an understandable goal from a competitive standpoint -- after all, if you're trying to make a buck, then the last thing you want is any random joe downloading your entire website, re-branding it and taking your customers away. Including anti-scrape clauses in your terms of service makes sense, and typically websites that do will block IPs and suspend accounts after discovering the scraping. But to say they can stop the scraping altogether is ignoring how the web works in the first place.

If you can read a website in a browser, then you can scrape it, period.
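To make that claim concrete, here is a minimal sketch using only Python's standard library. A browser requests a page over HTTP and parses the HTML that comes back, and a scraper does the same thing. The names `LinkCollector`, `extract_links`, and `fetch_html`, and the `example-scraper/0.1` User-Agent string, are made up for illustration and are not from any particular tool.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return every link target found in an HTML string, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def fetch_html(url, user_agent="example-scraper/0.1"):
    """Download a page the way a browser would: one HTTP GET with a User-Agent.

    The User-Agent value is a hypothetical placeholder.
    """
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


if __name__ == "__main__":
    # Parse an inline sample so the sketch runs without network access.
    sample = '<p><a href="/about">About</a> <a href="https://example.com/">Home</a></p>'
    print(extract_links(sample))  # ['/about', 'https://example.com/']
```

Pointing `fetch_html` at a real URL and passing the result to `extract_links` is all it takes for a static page. Pages that build their content with JavaScript need a headless browser instead, but the principle is the same.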

The only real way to stop scraping is to stop allowing access to the website, for example by keeping certain sections public and putting the rest behind an official account, which is exactly what LinkedIn did -- before getting sued for it.

Now, I'm not going to say that I agree with the lawsuit. In fact, I take LinkedIn's side, considering they are not just protecting their business but protecting the privacy of their users, both of which I value. Plus, the stress that heavy scraping can put on servers can add extra costs to the business model. That could actually be grounds for a lawsuit against those doing the scraping.

However, it really doesn't matter if LinkedIn, or any website, forces users to log in to access the data. So long as they don't charge for the access (and LinkedIn's free tier is barely limited at all), all a programmer has to do is create a free account, then work from within that account. And with all the proxy solutions available nowadays, it's really not difficult for any tech startup, or any random joe sitting in their mom's basement, to go to town automating and scraping LinkedIn.
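As a rough illustration of "working from within an account" through a proxy, the sketch below builds a standard-library HTTP client that keeps cookies between requests (which is how a logged-in session persists) and optionally routes traffic through a proxy. The function name `build_session_opener` and the proxy address are hypothetical, and nothing here is specific to LinkedIn or any other site.

```python
import http.cookiejar
import urllib.request


def build_session_opener(proxy_url=None):
    """Return (opener, cookie_jar) for making requests that share cookies.

    proxy_url is an assumed placeholder such as "http://127.0.0.1:8080".
    When it is None, urllib falls back to its default proxy handling.
    """
    cookie_jar = http.cookiejar.CookieJar()
    # Cookies set by one response are sent back on later requests.
    handlers = [urllib.request.HTTPCookieProcessor(cookie_jar)]
    if proxy_url is not None:
        # Send both http and https traffic through the given proxy.
        handlers.append(
            urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
        )
    opener = urllib.request.build_opener(*handlers)
    return opener, cookie_jar


if __name__ == "__main__":
    # Building the opener does not touch the network.
    opener, jar = build_session_opener("http://127.0.0.1:8080")
    print(len(jar))  # 0, no cookies until a request is made
```

Every request made with `opener.open(...)` afterwards carries whatever cookies the site has set so far, so from the server's side it looks like one continuous browsing session.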

Web crawlers like Googlebot, Bing, and many, many others from major websites that compile data from the web are screen scrapers at the core. It's simply how a large part of the World Wide Web functions. Couple this with server-side JS and you're looking at a vast landscape of screen scraping that is not going anywhere any time soon, no matter how hard big sites try to stop it.

