Email Extractor Script with PHP cURL and Regular Expression (1)


In this post, I will explain how to use PHP/cURL to extract (harvest) email addresses from websites. The script uses regular expressions to pick the data we want out of a page's HTML source.

If we send out an email that addresses the person as "Dear Sir" or "Dear Madam", it will most likely be treated as spam. So we do not want to extract email addresses alone, but also the information related to each address, such as name, telephone number, company and job position. When we send email to the collected list, we want to address the contact person in as much detail as possible, for example by name and by job position in the company.

Of course, please do not abuse email extraction to send unwanted spam or product/service advertising, to violate copyright law, or to waste network bandwidth. If you get into trouble, please talk to your lawyer.

First, we look at a very simple email extractor. We are going to use the HttpCurl class created earlier.

<?php

define('EMAIL_PATTERN', '/([\s]*)([_a-zA-Z0-9-]+(\.[_a-zA-Z0-9-]+)*([ ]+|)@([ ]+|)([a-zA-Z0-9-]+\.)+([a-zA-Z]{2,}))([\s]*)/i');

interface HttpScraper
{
	public function parse($body, $head);
}

class Scraper implements HttpScraper
{
	// $body is the page source, $head is the HTTP status code.
	public function parse($body, $head) {
		if ($head == 200) {
			$p = preg_match_all(EMAIL_PATTERN, $body, $matches);
			if ($p) {
				foreach ($matches[0] as $emails) {
					echo "<pre>";
					print_r($emails);
					echo "</pre>";
				}
			}
		}
	}
}

class HttpCurl {
	protected $_cookie, $_parser, $_timeout;
	private $_ch, $_info, $_body, $_error;
	
	public function __construct($p = null) {
		if (!function_exists('curl_init')) {
			throw new Exception('cURL not enabled!');
		}
		$this->setParser($p);
	}

	public function get($url) {
		return $this->request($url);
	}

	protected function request($url) {
		$ch = curl_init($url);
		curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
		curl_setopt($ch, CURLOPT_MAXREDIRS, 5);
		curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
		curl_setopt($ch, CURLOPT_URL, $url);
		$this->_body  = curl_exec($ch);
		$this->_info  = curl_getinfo($ch);
		$this->_error = curl_error($ch);
		curl_close($ch);

		$this->runParser($this->_body, $this->getStatus());
	}

	public function getStatus() {
		// The array key must be a quoted string.
		return $this->_info['http_code'];
	}
	
	public function getHeader() {
		return $this->_info;
	}

	public function getBody() {
		return $this->_body;
	}
	
	public function __destruct() {
	}	
	
	public function setParser($p)	{
		if ($p === null || $p instanceof HttpScraper || is_callable($p))	
			$this->_parser = $p;
	}

	public function runParser($content, $header)	{
		if ($this->_parser !== null)
		{
			if ($this->_parser instanceof HttpScraper)
				$this->_parser->parse($content, $header);	
			else
				call_user_func($this->_parser, $content, $header);
		}
	}	
}

?>

 

How it works:

1. First, define a regular expression that matches an email address. This pattern is used in this example only; we will switch to working patterns in the next article.

define('EMAIL_PATTERN', '/([\s]*)([_a-zA-Z0-9-]+(\.[_a-zA-Z0-9-]+)*([ ]+|)@([ ]+|)([a-zA-Z0-9-]+\.)+([a-zA-Z]{2,}))([\s]*)/i');
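
To see what the pattern captures, here is a minimal sketch that runs it against a made-up string. The sample text and addresses are assumptions for illustration, not data from a real page. Note that the full match in $matches[0] includes any surrounding whitespace, while capture group 2 holds the address itself.

```php
<?php
// Same pattern as in the article.
define('EMAIL_PATTERN', '/([\s]*)([_a-zA-Z0-9-]+(\.[_a-zA-Z0-9-]+)*([ ]+|)@([ ]+|)([a-zA-Z0-9-]+\.)+([a-zA-Z]{2,}))([\s]*)/i');

// Hypothetical sample text, for illustration only.
$sample = 'Contact: john.doe@example.com or sales@example.org.';

preg_match_all(EMAIL_PATTERN, $sample, $matches);

// Group 2 is the address without the surrounding whitespace.
$emails = $matches[2];
print_r($emails);
```

Reading group 2 instead of group 0 avoids having to trim() each result afterwards.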

2. Then define an interface called HttpScraper with a public method parse().

interface HttpScraper
{
	public function parse($body, $head);
}

3. Next, we implement the interface in a class called Scraper. Two pieces of information are passed to parse(): the content of the web page ($body) and the HTTP status code of that page ($head). If $head is 200, we match the email pattern (the EMAIL_PATTERN defined above) against the page content using preg_match_all(). For this tutorial, we simply print the results on the screen.

class Scraper implements HttpScraper
{
	// $body is the page source, $head is the HTTP status code.
	public function parse($body, $head) {
		if ($head == 200) {
			$p = preg_match_all(EMAIL_PATTERN, $body, $matches);
			if ($p) {
				foreach ($matches[0] as $emails) {
					echo "<pre>";
					print_r($emails);
					echo "</pre>";
				}
			}
		}
	}
}
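
Printing is fine for a tutorial, but any class that implements HttpScraper can do something else with the matches. The following is only a sketch of one possible variant that stores the addresses instead of echoing them; the class name CollectingScraper, the method getEmails() and the sample page content are assumptions for illustration and are not part of the original script.

```php
<?php
define('EMAIL_PATTERN', '/([\s]*)([_a-zA-Z0-9-]+(\.[_a-zA-Z0-9-]+)*([ ]+|)@([ ]+|)([a-zA-Z0-9-]+\.)+([a-zA-Z]{2,}))([\s]*)/i');

interface HttpScraper
{
	public function parse($body, $head);
}

// Hypothetical variant of Scraper: stores addresses instead of printing them.
class CollectingScraper implements HttpScraper
{
	private $emails = array();

	public function parse($body, $head) {
		if ($head != 200) {
			return; // only parse pages that were fetched successfully
		}
		if (preg_match_all(EMAIL_PATTERN, $body, $matches)) {
			foreach ($matches[2] as $email) {
				// Use the lower-cased address as the key to drop duplicates.
				$this->emails[strtolower($email)] = true;
			}
		}
	}

	public function getEmails() {
		return array_keys($this->emails);
	}
}

// Made-up page content; a real run would receive this from HttpCurl.
$scraper = new CollectingScraper();
$scraper->parse('<p>Mail a@example.com or A@example.com</p>', 200);
$scraper->parse('<p>b@example.com</p>', 404); // ignored: status is not 200
$found = $scraper->getEmails();
print_r($found);
```

Keeping the results in the object makes it possible to parse several pages with the same scraper and read the combined list at the end.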

 

4. Here you can see some minor changes to the HttpCurl class that we created earlier.

class HttpCurl {
	protected $_cookie, $_parser, $_timeout;
	private $_ch, $_info, $_body, $_error;
	
	public function __construct($p = null) {
		if (!function_exists('curl_init')) {
			throw new Exception('cURL not enabled!');
		}
		$this->setParser($p);
	}

	public function get($url) {
		return $this->request($url);
	}

	protected function request($url) {
		$ch = curl_init($url);
		curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
		curl_setopt($ch, CURLOPT_MAXREDIRS, 5);
		curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
		curl_setopt($ch, CURLOPT_URL, $url);
		$this->_body  = curl_exec($ch);
		$this->_info  = curl_getinfo($ch);
		$this->_error = curl_error($ch);
		curl_close($ch);

		$this->runParser($this->_body, $this->getStatus());
	}

	public function getStatus() {
		// The array key must be a quoted string.
		return $this->_info['http_code'];
	}
	
	public function getHeader() {
		return $this->_info;
	}

	public function getBody() {
		return $this->_body;
	}
	
	public function __destruct() {
	}	
	
	public function setParser($p)	{
		if ($p === null || $p instanceof HttpScraper || is_callable($p))	
			$this->_parser = $p;
	}

	public function runParser($content, $header)	{
		if ($this->_parser !== null)
		{
			if ($this->_parser instanceof HttpScraper)
				$this->_parser->parse($content, $header);	
			else
				call_user_func($this->_parser, $content, $header);
		}
	}	
}

 

5. In the constructor of HttpCurl, we added a call to setParser(). setParser() stores the object or callback function passed in $p in the protected property $_parser. Any other kind of value is ignored.

	public function setParser($p)	{
		if ($p === null || $p instanceof HttpScraper || is_callable($p))	
			$this->_parser = $p;
	}

 

6. The function runParser() is added to HttpCurl. When called, it runs the callback function or object stored in $_parser. In this case, I am using an object.

	public function runParser($content, $header)	{
		if ($this->_parser !== null)
		{
			if ($this->_parser instanceof HttpScraper)
				$this->_parser->parse($content, $header);	
			else
				call_user_func($this->_parser, $content, $header);
		}
	}
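
Because setParser() also accepts anything that passes is_callable(), a closure can stand in for a Scraper object. The sketch below reproduces only the dispatch logic of runParser() in a standalone function, so that it runs without a network request. The function name dispatchParser() and the sample inputs are assumptions for illustration.

```php
<?php
interface HttpScraper
{
	public function parse($body, $head);
}

// Hypothetical standalone copy of the runParser() dispatch logic.
function dispatchParser($parser, $content, $header) {
	if ($parser !== null) {
		if ($parser instanceof HttpScraper) {
			$parser->parse($content, $header);
		} else {
			call_user_func($parser, $content, $header);
		}
	}
}

// A closure used as the parser; it records what it was called with.
$seen = array();
$callback = function ($body, $head) use (&$seen) {
	$seen[] = $head . ':' . strlen($body);
};

dispatchParser($callback, '<html></html>', 200);
dispatchParser(null, 'ignored', 200); // no parser set: nothing happens
print_r($seen);
```

With the class from this article, the same closure could be passed to the constructor as new HttpCurl($callback).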

 

7. The runParser() function is called in request() after cURL has fetched the web page source.

	protected function request($url) {
		$ch = curl_init($url);
		curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
		curl_setopt($ch, CURLOPT_MAXREDIRS, 5);
		curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
		curl_setopt($ch, CURLOPT_URL, $url);
		$this->_body  = curl_exec($ch);
		$this->_info  = curl_getinfo($ch);
		$this->_error = curl_error($ch);
		curl_close($ch);

		$this->runParser($this->_body, $this->getStatus());
	}

 

Now that we have our code ready, we can test the script.

Assume you have finished developing a property listing website and released it to a live server. You need property agents to register as members and upload their properties to your website. Waiting for property agents to find your website through search engines will take forever. Instead, we can go to other property websites and find their contacts. For example, this website has more than one thousand agents.

[Image: extract email of property agents]

If you right-click in the browser and select "View Source", you can see the web page source.

[Image: view source of property agents page]

To test out our script, here is our new test.php.

<?php
include 'httpcurl.php';
  
$target = "http://<domain name>";
 
$up = new Scraper;
$test = new HttpCurl($up);
 
$test->get($target);
 
?>

Here is the output of test.php.

[Image: output of email extraction script]

So now we are able to extract email addresses from a web page. However, we should not send out email without being able to address the receivers properly. Starting an email with "Dear Sir" or "Hello Madam" will make it look like spam.

In the next article, we will modify the script a bit to grab additional information, such as name and telephone number, from the example.

Last modified on Thursday, 03 November 2016 06:34