PHP Web Spider to Crawl Web Pages Pagination and Extract Emails (3)

[Image: email web crawler]

In this post, I will show you how to further modify our email extractor script to add crawling capability and collect as many targeted email addresses as possible.

The trick is quite simple: we are not going to crawl the entire website and check each and every web page for email addresses. Doing that would consume a lot of bandwidth and time. We only need to crawl the web pages that contain the targeted email listing, so as long as we know the total number of pages to crawl, we can loop from the first page to the last page and the job is done!

First, inspect the pagination of our target website. In our example, it has page 1, 2, 3,... and a "Last" page button. Pressing this button takes us to the last page, which is page 169. Each page lists 10 email addresses, so we can collect almost 1690 emails from this website. The total number of pages (169 at present) may change in the future, so if we want to reuse our email extractor script with minimal manual editing, it needs to auto-detect the total number of pages, as sketched below.

[Image: crawling pagination]
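Putting the idea together, here is a minimal standalone sketch. It uses file_get_contents and an inlined copy of the last-page pattern for brevity; the full script below does this properly with cURL:

<?php
// Minimal sketch of the crawl flow. file_get_contents is used here for
// brevity only; the full script below does the same thing with cURL.
$target = "http://<domain of target website>/negotiators?page=";

// Fetch page 1 and auto-detect the last page number from the "Last" button
$firstPage = file_get_contents($target . "1");
if (preg_match('~<li class="pager-last last"><a href="/negotiators\?page=(.*?)"~', $firstPage, $m)) {
    $lastPage = (int) $m[1];              // 169 at the time of writing
    for ($i = 1; $i <= $lastPage; $i++) { // loop from page 1 to the last page
        $html = file_get_contents($target . $i);
        // ... extract names, emails and phone numbers from $html ...
    }
}
?>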

Now we look at the HTML source of this pagination.

[Image: source code of pagination]

Before we extract emails from this website, we need to parse the last page number. Here is the updated code for httpcurl.php:

<?php
define('TARGET_BLOCK','~<div class="negotiators-wrapper">(.*?)</div>(\r\n)</div>~s');
define('NAME', '~<div class="negotiators-name"><a href="/negotiator/(.*?)">(.*?)</a></div>~');
define('EMAIL', '~<div class="negotiators-email">(.*?)</div>~');
define('PHONE', '~<div class="negotiators-phone">(.*?)</div>~');
define('LASTPAGE', '~<li class="pager-last last"><a href="/negotiators\?page=(.*?)"~');
define('PARSE_CONTENT', TRUE);
 
interface HttpScraper
{
    public function parse($body, $head);
}
  
class Scraper implements HttpScraper
{
    public function parse($body, $head) {
        if ($head == 200) {
            // Grab every agent block on the page
            $p = preg_match_all(TARGET_BLOCK, $body, $blocks);
            if ($p) {
                foreach ($blocks[0] as $block) {
                    $agent['name']  = $this->matchPattern(NAME, $block, 2);
                    $agent['email'] = $this->matchPattern(EMAIL, $block, 1);
                    $agent['phone'] = $this->matchPattern(PHONE, $block, 1);
                    echo "<pre>"; print_r($agent); echo "</pre>";
                }
            }
        }
    }
     
    public function matchPattern($pattern, $content, $pos) {
        if (preg_match($pattern, $content, $match)) {
            return $match[$pos];
        }
        return null;
    }
}
  
class HttpCurl {
    protected $_cookie, $_parser, $_timeout;
    private $_ch, $_info, $_body, $_error;
      
    public function __construct($p = null) {
        if (!function_exists('curl_init')) {
            throw new Exception('cURL not enabled!');
        } 
        $this->setParser($p);
    }
  
    public function get($url, $status = FALSE) {
        $this->request($url);
        if ($status === TRUE) {
            return $this->runParser($this->_body, $this->getStatus());
        }
    }
  
    protected function request($url) {
        $ch = curl_init($url);   // curl_init() already sets CURLOPT_URL
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
        curl_setopt($ch, CURLOPT_MAXREDIRS, 5);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
        $this->_body  = curl_exec($ch);
        $this->_info  = curl_getinfo($ch);
        $this->_error = curl_error($ch);
        curl_close($ch);
    }
  
    public function getStatus() {
        return $this->_info['http_code'];
    }
      
    public function getHeader() {
        return $this->_info;
    }
  
    public function getBody() {
        return $this->_body;
    }
      
    public function __destruct() {
    } 
      
    public function setParser($p)   {
        if ($p === null || $p instanceof HttpScraper || is_callable($p))  
            $this->_parser = $p;
    }
  
    public function runParser($content, $header)    {
        if ($this->_parser !== null)
        {
            if ($this->_parser instanceof HttpScraper)
                $this->_parser->parse($content, $header);
            else
                call_user_func($this->_parser, $content, $header);
        }
    } 
}
  
?>


How it works:

First, I added two definitions:

define('LASTPAGE', '~<li class="pager-last last"><a href="/negotiators\?page=(.*?)"~');
define('PARSE_CONTENT', TRUE);

The LASTPAGE pattern parses the last page number from the URL. Note that there is a "?" in the URL, so we need to escape it as "\?" (in a regular expression, an unescaped "?" is a quantifier).
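To see the pattern in action, here is a quick test. The HTML fragment is my reconstruction of the "Last" button markup, inferred from the pattern itself, so the real page may differ slightly:

<?php
// Reconstructed pagination markup; the real page may differ slightly
$html = '<li class="pager-last last"><a href="/negotiators?page=169">Last</a></li>';

// Same pattern as the LASTPAGE constant
if (preg_match('~<li class="pager-last last"><a href="/negotiators\?page=(.*?)"~', $html, $match)) {
    echo $match[1];   // prints: 169
}
?>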

PARSE_CONTENT tells the get() function whether to execute runParser() or not. By default, $status is set to FALSE and get() only executes request(). If PARSE_CONTENT is passed as $status, then get() will additionally execute runParser(), which scrapes the email, name and contact number list from the web page. A short usage sketch follows the function below.

    public function get($url, $status = FALSE) {
        $this->request($url);
        if ($status === TRUE) {
            return $this->runParser($this->_body, $this->getStatus());
        }
    }
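To make the two modes concrete, here is a small usage sketch (the URL is the same placeholder as in test.php):

<?php
include 'httpcurl.php';

$crawler = new HttpCurl(new Scraper);

// Fetch only: the response body is stored, but nothing is parsed
$crawler->get("http://<domain of target website>/negotiators?page=1");
echo $crawler->getStatus();   // e.g. 200

// Fetch and parse: runParser() hands the body and HTTP status to Scraper::parse()
$crawler->get("http://<domain of target website>/negotiators?page=1", PARSE_CONTENT);
?>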


Next, here are the changes in test.php:

<?php
include 'httpcurl.php';
   
$target = "http://<domain of target website>/negotiators?page=";
$startPage = $target . "1";

$scrapeContent = new Scraper;
$firstPage = new HttpCurl();
$firstPage->get($startPage);

$lastPage = 0;
if ($firstPage->getStatus() === 200) {
	$lastPage = (int) $scrapeContent->matchPattern(LASTPAGE, $firstPage->getBody(), 1);
}

$pages = new HttpCurl($scrapeContent);

// Loop from page 1 to the last page and parse each one
for ($i = 1; $i <= $lastPage; $i++) {
	$targetPage = $target . $i;
	$pages->get($targetPage, PARSE_CONTENT);
}
  
?>

How it works:

First, there are 169 pages of real estate agents' information. Each page's URL is structured as

<domain of target website>/negotiators?page=1, 2, 3, 4, ... 169. So we can create the URL for each page with $target . $i.

To get the first page, $startPage = $target . "1", which is

<domain of target website>/negotiators?page=1. 

We instantiate two objects:

$scrapeContent = new Scraper;
$firstPage = new HttpCurl();

When we scrape the first page and match it against the LASTPAGE pattern, the number 169 is returned to $lastPage. With this info, we can create the loop

for ($i = 1; $i <= $lastPage; $i++)

Now we instantiate another object, $pages, and loop over every page until the last one. I am currently getting 1687 email addresses and related information from this website. It is time to manage this large amount of data.

Next, we need to store this information in MySQL. We can collect email addresses, even up to a few hundred thousand, from many websites and store them in MySQL. We can also write this information directly to a text file in CSV format and import it into email management software. Then we can blast out mass email to our targets (but no SPAM, please). This will be discussed in Part 4 of this article.
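As a small preview of the CSV option, here is a minimal sketch. It assumes Scraper::parse() is modified to collect the agents into an array (named $agents here, a hypothetical name) instead of echoing them:

<?php
// Hypothetical: $agents holds the scraped records instead of printing them,
// e.g. collected by a modified Scraper::parse(). Sample data for illustration.
$agents = [
    ['name' => 'John Doe', 'email' => 'john@example.com', 'phone' => '012-3456789'],
];

$fp = fopen('agents.csv', 'w');
fputcsv($fp, ['name', 'email', 'phone']);   // header row
foreach ($agents as $agent) {
    fputcsv($fp, [$agent['name'], $agent['email'], $agent['phone']]);
}
fclose($fp);
?>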
