Sunday, September 10, 2017

How To Scrape Any Website And Not Feel Bad About It

In my world, screen scraping is essential to making a living. For years, I've built applications for clients that automate tasks on major websites and, in most cases, if not all, scraping data from these sites was a big part of how the application worked. If screen scraping somehow became illegal, my business would fold. But that won't ever happen.

Periodically I read about this hilarious idea that a website has the power to somehow block scrapers. It's an understandable goal from a competitive standpoint -- after all, if you're trying to make a buck, then the last thing you want is any random joe downloading your entire website, re-branding it and taking your customers away. Including anti-scrape clauses in your terms of service makes sense, and websites that do so will typically block IPs and suspend accounts after discovering the scraping. But to say they can stop the scraping altogether is to ignore how the web works in the first place.

If you can read a website in a browser, then you can scrape it, period.
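To make that concrete, here is a minimal sketch of what "scraping" boils down to: request the same HTML a browser would receive, then pull out the pieces you care about. It uses only the Python standard library. The names `LinkCollector`, `extract_links` and `fetch`, and the `example-scraper/0.1` user agent, are made up for illustration, and real sites will usually need more care (JavaScript-rendered pages, rate limits, and so on).

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs for every <a href=...> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag we are currently inside
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Parse an HTML string and return the links found in it."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def fetch(url):
    """Download a page as text, the same way a browser requests it."""
    # The user agent string is a placeholder, not a real product.
    req = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(req, timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

Usage would look something like `extract_links(fetch("https://example.com/"))`. Everything past the download step is ordinary text processing, which is why it is so hard to prevent.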

The only real way to stop scraping is to restrict access to the website itself, for example by keeping certain sections public and putting the rest behind an official account login, which is exactly what LinkedIn did -- before getting sued for it.

Now, I'm not going to say that I agree with the lawsuit. In fact, I take LinkedIn's side, considering they are not just protecting their business but protecting the privacy of their users, both of which I value. Plus, the stress that heavy scraping can put on servers can add extra costs to the business model. That could actually be grounds for a lawsuit against those doing the scraping.

However, it really doesn't matter if LinkedIn, or any website, forces users to log in to access the data. So long as they don't charge for the access (and LinkedIn's free tier is barely limited at all), all a programmer has to do is create a free account and then work from within that account. And with all the proxy solutions nowadays, it's really not difficult at all for any tech startup, or any random joe sitting in their mom's basement, to go to town automating and scraping LinkedIn.

Web crawlers like Googlebot, Bing, and many, many others from major websites that compile data from the web are screen scrapers at the core; it's simply how a large part of the World Wide Web functions. Couple this with server-side JS and you're looking at a vast landscape of screen scraping that is not going anywhere any time soon, no matter how hard big sites try to stop it.
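One practical difference between a well-behaved crawler and a scraper that gets its IP blocked is that crawlers like Googlebot check a site's robots.txt before fetching anything. The sketch below shows that check with Python's standard `urllib.robotparser`. The rules in `SAMPLE_ROBOTS`, the `example-crawler` agent name and the `build_rules` helper are all invented for illustration; a real crawler would download the file from the site it is visiting.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch this
# from https://<site>/robots.txt before requesting any other page.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_rules(robots_txt):
    """Parse robots.txt text into a rule set that can be queried per URL."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


rules = build_rules(SAMPLE_ROBOTS)

# Ask whether our (made-up) crawler may fetch each URL.
allowed_public = rules.can_fetch("example-crawler", "https://example.com/public")
allowed_private = rules.can_fetch("example-crawler", "https://example.com/private/x")

# Seconds the site asks crawlers to wait between requests, if specified.
delay = rules.crawl_delay("example-crawler")
```

Honoring the crawl delay also keeps a scraper from putting the kind of load on a server that I mentioned above.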


Wednesday, September 6, 2017

Alt-right Bloggers Target Tech Company Tekoso Media After DMCA Takedown Request

Angry alt-right bloggers tend to make things up to support their narrative against liberals, and a recent smear campaign against New York tech company Tekoso Media is par for the course.

The most basic example of how they do this is simply inventing a story and posting it all over the internet. It doesn't matter whether it's true or not; what matters is that it's out there. The self-described "journalists" run into a very big but simple problem that anyone with half a brain could see from a distance if they actually thought before they acted: when people figure out that you're lying, your credibility is completely destroyed, and it takes years to recover, if it recovers at all.

When it comes to fake news, alt-right vloggers and their baseless conspiracy theories dominated YouTube until Google's ad network got tired of it and stopped showing ads on their videos over the past year. This happened because advertisers decided to boycott fake news and hate speech on the Google network, which caused a significant drop in revenue for the publishers of videos containing this type of content.

The vloggers caught for copyright infringement claim that they did nothing wrong, but Tekoso Media claims its copyright infringement takedown request was legit. The truth is that the process of a DMCA takedown request through Google is very stringent. First, they ask you to fill out a very detailed report of the infringing content on their network. Next, they ask for proof from the claimant that they are indeed the owner of the content being infringed. Once verified, the content is removed from YouTube. It is assumed that Google would not follow through with taking anything down from YouTube if it didn't see the infringing content on the video somewhere, so they must be complying because they saw the infringement happening.

Many so-called "independent journalists" from the alt-right have gotten into libel lawsuits, usually as defendants, but sometimes attempting to turn things around and claim that other people who write about their lies and conspiracy theories are making things up. No alt-right blogger has ever won a libel case, though, while many have lost or settled.

The company says it has contacted the vloggers and bloggers and that legal proceedings are in the works.