PHP Email Extractor Script with cURL and Regular Expression (2)

[Image: targeted email extractor]

In this post, we make a small modification to the previous PHP script for targeted email extraction.

First, we need to revisit the source of the targeted web page. We can see that there are repeated blocks of agent contacts, each with a name, email and phone number, for a total of 10 blocks per page.

The plan is to use the script to "cut out" each block and store it in an array, then extract the name, email and phone number from each block.

As you can see, each block starts with a <div class="negotiators-wrapper"> tag and ends with </div></div>. Note that the two </div> tags are separated by a carriage return and a line feed (\r\n) in this example.

[Screenshot: source file of the targeted web page]

So here is the code for this example:

<?php
define('TARGET_BLOCK','~<div class="negotiators-wrapper">(.*?)</div>(\r\n)</div>~s');
define('NAME', '~<div class="negotiators-name"><a href="/negotiator/(.*?)">(.*?)</a></div>~');
define('EMAIL', '~<div class="negotiators-email">(.*?)</div>~');
define('PHONE', '~<div class="negotiators-phone">(.*?)</div>~');

interface HttpScraper
{
    public function parse($body, $head);
}
 
class Scraper implements HttpScraper
{
    // Called by HttpCurl with the response body and the HTTP status code.
    public function parse($body, $head) {
        if ($head == 200) {
            // Cut out every agent block; $blocks[0] holds the full matches.
            $p = preg_match_all(TARGET_BLOCK, $body, $blocks);
            if ($p) {
                foreach ($blocks[0] as $block) {
                    $agent = array();
                    $agent['name'] = $this->matchPattern(NAME, $block, 2);
                    $agent['email'] = $this->matchPattern(EMAIL, $block, 1);
                    $agent['phone'] = $this->matchPattern(PHONE, $block, 1);
                    echo "<pre>"; print_r($agent); echo "</pre>";
                }
            }
        }
    }

    // Returns capture group $pos of the first match, or null if nothing matched.
    public function matchPattern($pattern, $content, $pos) {
        if (preg_match($pattern, $content, $match)) {
            return $match[$pos];
        }
        return null;
    }
}
 
class HttpCurl {
    protected $_cookie, $_parser, $_timeout;
    private $_ch, $_info, $_body, $_error;
     
    public function __construct($p = null) {
        if (!function_exists('curl_init')) {
            throw new Exception('cURL not enabled!');
        }  
        $this->setParser($p);
    }
 
    public function get($url) {
        return $this->request($url);
    }
 
    protected function request($url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
        curl_setopt($ch, CURLOPT_MAXREDIRS, 5);    
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
        curl_setopt($ch, CURLOPT_URL, $url);
        $this->_body = curl_exec($ch);
        $this->_info  = curl_getinfo($ch);
        $this->_error = curl_error($ch);
        curl_close($ch);
 
        $this->runParser($this->_body, $this->getStatus());               
    }
 
    public function getStatus() {
        return $this->_info['http_code'];
    }
     
    public function getHeader() {
        return $this->_info;
    }
 
    public function getBody() {
        return $this->_body;
    }
     
    public function __destruct() {
    }  
     
    public function setParser($p)   {
        if ($p === null || $p instanceof HttpScraper || is_callable($p))   
            $this->_parser = $p;
    }
 
    public function runParser($content, $header)    {
        if ($this->_parser !== null)
        {
            if ($this->_parser instanceof HttpScraper)
                $this->_parser->parse($content, $header);
            else
                call_user_func($this->_parser, $content, $header);
        }
    }  
}
 
?>

 

How it works:

First, I define TARGET_BLOCK to get the block as discussed earlier.

define('TARGET_BLOCK','~<div class="negotiators-wrapper">(.*?)</div>(\r\n)</div>~s');

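The carriage return and line feed are matched with \r\n, which works for me when running XAMPP under Windows 7. Keep in mind that the line endings come from the server that sends the page, so a page served with Unix line endings (\n only) would need the pattern adjusted. The "s" modifier at the end of the pattern makes the dot (.) match newline characters as well, so (.*?) can span all the lines of a block.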
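The effect of the "s" modifier and of the hard-coded \r\n can be checked on a small hand-written sample. This is only a sketch: the HTML below is made up for illustration (it is not the real page), and the \s* variant is a suggested alternative that is not part of the original script.

```php
<?php
// Hypothetical sample: two simplified agent blocks with Windows (CRLF) line endings.
$block = '<div class="negotiators-wrapper">' . "\r\n"
       . '<div class="negotiators-email">a@example.com</div>' . "\r\n"
       . '</div>';
$html = $block . "\r\n" . $block;

// The pattern from the post, the same pattern without "s", and a looser variant.
$strict  = '~<div class="negotiators-wrapper">(.*?)</div>(\r\n)</div>~s';
$noSFlag = '~<div class="negotiators-wrapper">(.*?)</div>(\r\n)</div>~';
$loose   = '~<div class="negotiators-wrapper">(.*?)</div>\s*</div>~s';

// With CRLF input and the "s" modifier, both blocks are found.
$crlfStrict = preg_match_all($strict, $html, $m);

// Without "s" the dot cannot cross a line break, so nothing matches.
$crlfNoS = preg_match_all($noSFlag, $html, $m);

// The same markup with Unix (LF) line endings.
$unixHtml = str_replace("\r\n", "\n", $html);
$lfStrict = preg_match_all($strict, $unixHtml, $m); // literal \r\n is missing
$lfLoose  = preg_match_all($loose, $unixHtml, $m);  // \s* accepts any whitespace

printf("CRLF strict: %d, CRLF without s: %d, LF strict: %d, LF loose: %d\n",
    $crlfStrict, $crlfNoS, $lfStrict, $lfLoose);
```

If the target page might be served with either kind of line ending, the looser \s* form is one way to avoid depending on it.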

To get the name, I define NAME as

define('NAME', '~<div class="negotiators-name"><a href="/negotiator/(.*?)">(.*?)</a></div>~');

Note that there is a URL before ">(Name)</a></div>. The URL is different in each block. The preg_match() function will grab two sets of data: the first is the part of the URL after /negotiator/, and the second is the targeted name. We ignore the URL information in this case, which is why the script reads capture group 2.
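To see what the two capture groups hold, here is a minimal sketch that runs the NAME pattern against a single made-up line. The slug "jane-doe-123" and the name "Jane Doe" are invented sample data, not values from the real page.

```php
<?php
$pattern = '~<div class="negotiators-name"><a href="/negotiator/(.*?)">(.*?)</a></div>~';

// Hypothetical line in the same shape as the target page's markup.
$line = '<div class="negotiators-name"><a href="/negotiator/jane-doe-123">Jane Doe</a></div>';

$slug = null;
$name = null;
if (preg_match($pattern, $line, $match)) {
    // $match[0] is the whole matched string,
    // $match[1] is the first group (the URL part after /negotiator/),
    // $match[2] is the second group (the agent's name).
    $slug = $match[1];
    $name = $match[2];
}

echo "URL part: $slug\n";
echo "Name: $name\n";
```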

 

Getting the email and phone number is straightforward.

define('EMAIL', '~<div class="negotiators-email">(.*?)</div>~');
define('PHONE', '~<div class="negotiators-phone">(.*?)</div>~');
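Both patterns have a single capture group, so one small helper can serve either of them. The sketch below is an illustration only: extractField() is a hypothetical helper (not part of the original script), the sample block is made up, and the nullable return type assumes PHP 7.1 or later. It also trims the captured text and returns null when a field is missing.

```php
<?php
$emailPattern = '~<div class="negotiators-email">(.*?)</div>~';
$phonePattern = '~<div class="negotiators-phone">(.*?)</div>~';

// Hypothetical helper: returns the trimmed first capture group, or null if no match.
function extractField(string $pattern, string $block): ?string
{
    if (preg_match($pattern, $block, $match)) {
        return trim($match[1]);
    }
    return null;
}

// Made-up block that has an email (with stray spaces) but no phone number.
$block = '<div class="negotiators-email"> jane@example.com </div>';

$email = extractField($emailPattern, $block);
$phone = extractField($phonePattern, $block);

// Optional sanity check on the extracted address.
$isValid = $email !== null && filter_var($email, FILTER_VALIDATE_EMAIL) !== false;

var_dump($email, $phone, $isValid);
```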

 

 There is just a slight change to class Scraper:

class Scraper implements HttpScraper
{
    // Called by HttpCurl with the response body and the HTTP status code.
    public function parse($body, $head) {
        if ($head == 200) {
            // Cut out every agent block; $blocks[0] holds the full matches.
            $p = preg_match_all(TARGET_BLOCK, $body, $blocks);
            if ($p) {
                foreach ($blocks[0] as $block) {
                    $agent = array();
                    $agent['name'] = $this->matchPattern(NAME, $block, 2);
                    $agent['email'] = $this->matchPattern(EMAIL, $block, 1);
                    $agent['phone'] = $this->matchPattern(PHONE, $block, 1);
                    echo "<pre>"; print_r($agent); echo "</pre>";
                }
            }
        }
    }

    // Returns capture group $pos of the first match, or null if nothing matched.
    public function matchPattern($pattern, $content, $pos) {
        if (preg_match($pattern, $content, $match)) {
            return $match[$pos];
        }
        return null;
    }
}

 

First, the parse() function matches each block and copies it into an array; then we extract the name, email and phone number from each block. The rest of the code remains unchanged.

For this tutorial, we run test.php and print out the results:
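The same two-step idea (cut out the blocks, then pull the fields from each one) can also be tried without any HTTP request. The following is a sketch under assumptions: collectAgents() is a hypothetical standalone function, the two sample agents are invented, and it returns the results as an array instead of printing them as parse() does. The foreach destructuring assumes PHP 7.1 or later.

```php
<?php
// Hypothetical standalone version of the block-then-field extraction.
function collectAgents(string $body): array
{
    $blockPattern = '~<div class="negotiators-wrapper">(.*?)</div>(\r\n)</div>~s';
    // Field name => [pattern, capture group that holds the value].
    $fields = [
        'name'  => ['~<div class="negotiators-name"><a href="/negotiator/(.*?)">(.*?)</a></div>~', 2],
        'email' => ['~<div class="negotiators-email">(.*?)</div>~', 1],
        'phone' => ['~<div class="negotiators-phone">(.*?)</div>~', 1],
    ];

    $agents = [];
    if (preg_match_all($blockPattern, $body, $blocks)) {
        foreach ($blocks[0] as $block) {
            $agent = [];
            foreach ($fields as $key => [$pattern, $group]) {
                $agent[$key] = preg_match($pattern, $block, $m) ? $m[$group] : null;
            }
            $agents[] = $agent;
        }
    }
    return $agents;
}

// Builds one made-up block in the shape described in the post (CRLF line endings).
$makeBlock = function (string $slug, string $name, string $email, string $phone): string {
    return '<div class="negotiators-wrapper">' . "\r\n"
         . '<div class="negotiators-name"><a href="/negotiator/' . $slug . '">' . $name . '</a></div>' . "\r\n"
         . '<div class="negotiators-email">' . $email . '</div>' . "\r\n"
         . '<div class="negotiators-phone">' . $phone . '</div>' . "\r\n"
         . '</div>';
};

$body = $makeBlock('jane-1', 'Jane Doe', 'jane@example.com', '012-3456789')
      . "\r\n"
      . $makeBlock('john-2', 'John Roe', 'john@example.com', '019-8765432');

$agents = collectAgents($body);
print_r($agents);
```

Returning an array rather than echoing makes it easier to save the results to a file or database later.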

[Screenshot: the extracted name, email and phone number]

That's it!!

 

With all this information, you can write a more personalized email to your targeted recipients. You can use email management software such as ListMailPro, or any up-to-date autoresponder, to import your list and send out mass email.

So far our script is able to extract emails from one page. To extract a large quantity of emails, our script needs to be able to crawl all of the targeted pages and grab as much information as possible. This will be discussed next.

Last modified on Thursday, 03 November 2016 06:31