The Art of Invisibility for Ninja Web Spider

Most webmasters welcome webbots from search engines such as Google and Bing. Their content gets indexed, and users can easily find their websites. However, they surely do not welcome your web spider extracting data from their sites. Maybe you're up to no good, such as how products from etsy.com were used to drive traffic to a Hong Kong website. Most likely your IP will be blocked if webmasters detect an unknown web spider aggressively crawling their websites. Back around 2000, eBay took legal action against Bidder's Edge, an auction scraping site, for “deep linking” into its listings and bombarding its service. Craigslist has a throttle mechanism to prevent web crawlers from overwhelming the site with requests.

Even big spiders like Google have mechanisms to prevent others from scraping their content. Try searching for a keyword and, on the results page, click page 1, then page 2, page 3, and so on. At page 20 (in my case), Google stops displaying search results and wants to find out whether you are a human or a webbot reading the page. If you are unable to enter the correct captcha, your IP will eventually be blocked.

captcha from google search

So how can we fly under the radar when scraping websites? First, let's check how our web spider can be detected by running the earlier script against our own website. Log in to cPanel and click on "Latest Visitors".

cPanel

You will see a table with the information below. For this tutorial, I only display IP, URL, Time and User Agent.

IP, URL, Time and User Agent

I executed the web spider script to crawl this website. Click on the "Time" menu and here is the output.

Detect web spiders

My dynamic IP was 203.106.151.122 when running the script. Any webmaster can easily spot the same IP, with no user agent, requesting many pages within a minute. We can do a few things to hide our web spider.
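To make the webmaster's side of this concrete, here is a minimal sketch of how such traffic could be flagged from a visitor log. It is only an illustration: the function name findSuspiciousIps(), the entry layout (ip, time as a Unix timestamp, agent) and the threshold of 10 requests per minute are all assumptions of mine, not something cPanel provides.

```php
<?php
// Hypothetical detector (not part of cPanel or the earlier script).
// Each $entries item is assumed to look like:
//   ['ip' => '203.106.151.122', 'time' => 1478154120, 'agent' => '']
// Returns an array keyed by IP, listing why that IP looks like a bot.
function findSuspiciousIps(array $entries, int $maxPerMinute = 10): array
{
    $perMinute = [];   // request count per "ip|minute" bucket
    $flags = [];

    foreach ($entries as $entry) {
        $ip = $entry['ip'];

        // Real browsers almost always send a User-Agent header.
        if (trim($entry['agent']) === '') {
            $flags[$ip]['empty_agent'] = true;
        }

        // Count requests from this IP inside the same clock minute.
        $bucket = $ip . '|' . intdiv($entry['time'], 60);
        $perMinute[$bucket] = ($perMinute[$bucket] ?? 0) + 1;
        if ($perMinute[$bucket] > $maxPerMinute) {
            $flags[$ip]['burst'] = true;
        }
    }

    return $flags;
}
```

An IP that appears in the result with both empty_agent and burst set matches exactly the pattern described above, which is what the rest of this tutorial tries to avoid.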

Here are some rules and changes to our previous script:

1) Respect the website you are scraping

Do not overload the targeted server or consume too much bandwidth. The web admin can detect your IP and User Agent from the log file, and will see continuous requests to the server at very short intervals.

We can insert a random waiting time between requests, from a few seconds to minutes. Your script will take much longer to complete, but with less risk of detection.

In our previous script, extract.php:

for($i=1; $i <= $lastPage; $i++) { 
	$targetPage = $target . $i;
	$pages->get($targetPage, PARSE_CONTENT);
	sleep(rand(15, 45));   // delay 15 to 45 seconds to prevent detection by system admin	
}

 

 In this example, sleep(rand(15, 45)); is added to the loop, so the script pauses for between 15 and 45 seconds before sending the next request.
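If you want to try the delay logic without waiting for real and without touching a live server, the same idea can be pulled into two small helpers. This is only a sketch: pickDelay() and crawlPolitely() are names I made up for illustration, and the fetch and pause steps are passed in as callables so they can be swapped for the real $pages->get() and sleep().

```php
<?php
// Hypothetical helper: choose a random pause, in seconds, between two requests.
function pickDelay(int $minSeconds, int $maxSeconds): int
{
    if ($minSeconds < 0 || $maxSeconds < $minSeconds) {
        throw new InvalidArgumentException('Expected 0 <= min <= max');
    }
    return random_int($minSeconds, $maxSeconds);
}

// Hypothetical helper: request every URL in order, pausing between requests.
// $fetch receives a URL; $pause receives the number of seconds to wait.
// Returns the list of delays that were used.
function crawlPolitely(array $urls, callable $fetch, callable $pause, int $min = 15, int $max = 45): array
{
    $urls = array_values($urls);
    $lastIndex = count($urls) - 1;
    $delays = [];

    foreach ($urls as $i => $url) {
        $fetch($url);
        if ($i < $lastIndex) {          // no need to wait after the final page
            $delay = pickDelay($min, $max);
            $pause($delay);
            $delays[] = $delay;
        }
    }

    return $delays;
}
```

In the real script you would pass something like function ($url) use ($pages) { $pages->get($url, PARSE_CONTENT); } as $fetch and 'sleep' as $pause.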

web spider with delay

As you can see, my dynamic IP was still 203.106.151.122, but there are now other requests (either human or webbots) between my two requests. Now we move on to the user agent.

2) Randomize user agent for every request 

Many ISPs assign addresses dynamically with the Dynamic Host Configuration Protocol (DHCP) or put many customers behind one shared address, so the same IP can be used by multiple users. We can make detection harder for the website admin by randomizing the user agent name for every request to the server. This makes it look like multiple users are browsing the same website from one ISP.

	private function getRandomAgent() {
		$agents = array('Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36', 
					 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
					 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0',
					 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 718; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)',
					 'Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)',
					 'Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9) Gecko Minefield/3.0',
					 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6',
					 'Mozilla/5.0 (iPhone; CPU iPhone OS 5_1_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Mobile/9B206',
					 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; MATP; MATP)',
					 'Mozilla/5.0 (Windows NT 6.1; rv:22.0) Gecko/20100101 Firefox/22.0');
		 $agent = $agents[array_rand($agents)]; 
		 return $agent;
	}

 

 Add the function getRandomAgent() to class HttpCurl in httpcurl.php.

    protected function request($url) {
        $ch = curl_init($url);
		$agent = $this->getRandomAgent();
		curl_setopt($ch, CURLOPT_USERAGENT, $agent);		
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
        curl_setopt($ch, CURLOPT_MAXREDIRS, 5);   
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
        curl_setopt($ch, CURLOPT_URL, $url);
        $this->_body = curl_exec($ch);
        $this->_info  = curl_getinfo($ch);
        $this->_error = curl_error($ch);
        curl_close($ch);      
    }

 

Two lines, $agent = $this->getRandomAgent() and curl_setopt($ch, CURLOPT_USERAGENT, $agent), are added to the function request().

Now the output looks like this.

web spider with fake user agents

Our spider's requests now look just like any other requests in the log file, except for the IP address.

3) Use a proxy to hide your IP

We are not able to spoof the IP address at the PHP programming level. That is, we cannot send a fake IP address to the targeted server and still get a response with PHP/cURL, because the server sends its reply to the address the request came from. If you don't want your IP to be logged at the targeted server, you can run the web scraping script through a proxy server. Here is a list of free proxies you can use. However, you might encounter connection stability, speed and other issues with free proxies. That is no issue for one-off scraping or small projects, but you can use a paid proxy if you are serious about this business. Michael Schrenk has a good write-up on proxies (Chapter 27) in his book, Webbots, Spiders, and Screen Scrapers (2nd Edition).

If you have a list of proxy servers, you can route each request through a different, randomly chosen server.

	private function getRandomProxy() {
		$proxies = array('202.187.160.140:3128', 
					 '175.139.208.131:3128',
					 '60.51.218.180:8080');
		 $proxy = $proxies[array_rand($proxies)]; 
			return $proxy;
	}

 

Add a new function getRandomProxy() to class HttpCurl in httpcurl.php.

    protected function request($url) {
        $ch = curl_init($url);
		$agent = $this->getRandomAgent();
		curl_setopt($ch, CURLOPT_USERAGENT, $agent);	
		$proxy = $this->getRandomProxy();
		curl_setopt($ch, CURLOPT_PROXY, $proxy);		
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
        curl_setopt($ch, CURLOPT_MAXREDIRS, 5);   
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
        curl_setopt($ch, CURLOPT_URL, $url);
        $this->_body = curl_exec($ch);
        $this->_info  = curl_getinfo($ch);
        $this->_error = curl_error($ch);
        curl_close($ch);      
    }

 

Now request() calls getRandomProxy() and configures cURL with curl_setopt($ch, CURLOPT_PROXY, $proxy). Here is the output:

web spider with proxy

Now the system admin can no longer find your IP address in the log!

But where is IP 60.51.218.180 in the log file? It is missing because cURL was not able to connect to that proxy. In such a case we cannot get the page source: curl_exec($ch) returns false, and curl_error($ch) describes the failure. So our script needs to handle this situation and retry with a different proxy.

If you don't have a proxy server, a non-technical method is to buy a coffee at a cafe with free wifi, sit in a CCTV blind spot and run your script quietly!
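A retry could look roughly like the sketch below. It is not the code from httpcurl.php: fetchViaProxy() and fetchWithRetry() are hypothetical standalone functions, the 10 second connect timeout is my own assumption, and the fetch step can be replaced by a fake so the retry logic can be tried without any real proxy.

```php
<?php
// Hypothetical helper: fetch $url through a single proxy with cURL.
// Returns the page body as a string, or false if the request failed.
function fetchViaProxy(string $url, string $proxy, int $timeout = 10)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_PROXY, $proxy);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 5);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout); // give up on dead proxies
    $body = curl_exec($ch);                              // false on failure
    curl_close($ch);
    return $body;
}

// Hypothetical helper: try the proxies in random order until one works.
// Returns ['proxy' => ..., 'body' => ...] on success, or false if every
// proxy failed. $fetch defaults to fetchViaProxy() and takes ($url, $proxy).
function fetchWithRetry(string $url, array $proxies, ?callable $fetch = null)
{
    $fetch = $fetch ?? 'fetchViaProxy';
    shuffle($proxies);   // works on a copy; the caller's list is untouched

    foreach ($proxies as $proxy) {
        $body = $fetch($url, $proxy);
        if ($body !== false) {
            return ['proxy' => $proxy, 'body' => $body];
        }
    }

    return false;
}
```

Each proxy is tried at most once per URL, so a list full of dead proxies ends with false instead of looping forever.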

There are many other steps you can fine-tune. For example, we are sending requests sequentially, from page 1, 2, 3 to the last page. We can pre-generate or collect the targeted URLs and store them in MySQL. The script can then fetch URLs in random order.

However, please do not think that no one can trace you with the above techniques. Well-established websites have advanced tools to analyze their traffic. Ninja or no ninja, you can still be found.

Use the tools constructively and good luck!
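As a rough sketch of the random-order idea, here is a version that skips MySQL and simply builds the list in memory. The function name buildShuffledUrls() is hypothetical; $target and $lastPage play the same role as in extract.php.

```php
<?php
// Hypothetical helper: build every page URL up front, then shuffle the list
// so the pages are not requested in the order 1, 2, 3, ...
function buildShuffledUrls(string $target, int $lastPage): array
{
    $urls = [];
    for ($i = 1; $i <= $lastPage; $i++) {
        $urls[] = $target . $i;
    }
    shuffle($urls);
    return $urls;
}
```

The loop in extract.php could then iterate over the returned array instead of counting from 1 to $lastPage, keeping the random delay between requests.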

Last modified on Thursday, 03 November 2016 06:22