Create MySQL Database for PHP Web Spider Extracted Emails Addresses (4)

store extracted email to database

In this final part of PHP/cURL email extractor, I will show you how to store extracted data into MySQL database. You can store email addresses and contact information collected not just from one website, but also from various websites into the same database.

You might want to store email collected based on your purpose. For example, if you have a real estate website and a internet shopping website, then information collected should be stored into two different categories (tables in MySQL database).

First, you need to activate XAMPP on your PC, both Apache and MySQL. At browser URL, go to "http://localhost/phpmyadmin/". Go to top menu bar and select "Database". To create a new database for our tutorial, enter "email_collection" and press "Create" button, as shown in the picture below.

You can download the source file for PHP cURL Email Extractor from here.

Note: Check out the sample code at bottom of this article.

 

Create database for email extractor

Go to "email_collection" database and you will find that it is still empty. Select "SQL", cut and paste query below into blank space and click "Go". 

CREATE TABLE IF NOT EXISTS `contact_info` (
  `id` int(12) NOT NULL AUTO_INCREMENT,
  `name` varchar(128) NOT NULL,
  `email` varchar(128) NOT NULL,
  `phone` varchar(128) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `email` (`email`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8;

 mysql query to create new database

A table "contact_info" is created under database "email_collection". 

empty new database Click "contact_info" and there are four data columns. Here we create a column "id" that will automatically increment each time a new entry is added to the table. "name", "email" and "phone" will store data collected. The data column is defined as "UNIQUE KEY" so that no the same emails will store more than once in the database. For example, you have collected and stored an email into database, the same email might be found when you run the script on another website. In this case, we just keep the first and ignore the second one.

tables in the database

So now here is the action: I modified the httpcurl.php file.How it works:

<?php

define('TARGET_BLOCK','~<div class="negotiators-wrapper">(.*?)</div>(\r\n)</div>~s');
define('NAME', '~<div class="negotiators-name"><a href="/negotiator/(.*?)">(.*?)</a></div>~');
define('EMAIL', '~<div class="negotiators-email">(.*?)</div>~');
define('PHONE', '~<div class="negotiators-phone">(.*?)</div>~');
define('LASTPAGE', '~<li class="pager-last last"><a href="/negotiators\?page=(.*?)"~');
define('PARSE_CONTENT', TRUE);
 
interface MySQLTable	{
	public function addData($info);	
}

class EmailDatabase extends mysqli implements MySQLTable	{
	private $_table = 'contact_info';

	public function __construct() 	{
		$host = 'localhost';
		$user = 'root';
		$pass = '';
		$dbname = 'email_collection';
		parent::__construct($host, $user, $pass, $dbname);
	}
	
	public function setTableName($name)  {
		$this->_table = $name;
	}

	public function addData($info)	{
		$sql = 'INSERT IGNORE INTO ' . $this->_table . ' (name, email, phone) ';
		$sql .= 'VALUES (\'' . $info[name] . '\', \'' . $info[email] . '\', \'' . $info[phone]. '\')';
		return $this->query($sql);
	}

	public function query($query, $mode = MYSQLI_STORE_RESULT)	{
		$this->ping();
		$res = parent::query($query, $mode);
		return $res;
	}
}


interface HttpScraper
{
    public function parse($body, $head);
}
  
class Scraper implements HttpScraper
{
	private $_table;

    public function __construct($t = null) {
        $this->setTable($t);
    }	 
	
	public function __destruct()	{
		if ($this->_table !== null) {
			$this->_table = null;
		}
	}

    public function setTable($t)   {
        if ($t === null || $t instanceof MySQLTable)  
            $this->_table = $t;
    }
	
	public function getTable()  {
		return $this->_table;
	}
	
    public function parse($body, $head) {
       if ($head == 200) {    
        $p = preg_match_all(TARGET_BLOCK, $body, $blocks);         
            if ($p) {
                foreach($blocks[0] as $block) {
                    $agent[name] = $this->matchPattern(NAME, $block, 2);
                    $agent[email] = $this->matchPattern(EMAIL, $block, 1);
                    $agent[phone] = $this->matchPattern(PHONE, $block, 1);
//                    echo "<pre>"; print_r($agent); echo "</pre>";
					$this->_table->addData($agent);
               }
            }
        }
    }
     
    public function matchPattern($pattern, $content, $pos) {
        if (preg_match($pattern, $content, $match)) {
            return $match[$pos];
        }  
    }
}
  
class HttpCurl {
    protected $_cookie, $_parser, $_timeout;
    private $_ch, $_info, $_body, $_error;
      
    public function __construct($p = null) {
        if (!function_exists('curl_init')) {
            throw new Exception('cURL not enabled!');
        } 
        $this->setParser($p);
    }
  
    public function get($url, $status = FALSE) {
		$this->request($url);	
		if ($status === TRUE) {
			return $this->runParser($this->_body, $this->getStatus()); 
		}		
    }
  
    protected function request($url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
        curl_setopt($ch, CURLOPT_MAXREDIRS, 5);   
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
        curl_setopt($ch, CURLOPT_URL, $url);
        $this->_body = curl_exec($ch);
        $this->_info  = curl_getinfo($ch);
        $this->_error = curl_error($ch);
        curl_close($ch);      
    }
  
    public function getStatus() {
        return $this->_info[http_code];
    }
      
    public function getHeader() {
        return $this->_info;
    }
  
    public function getBody() {
        return $this->_body;
    }
      
    public function __destruct() {
    } 
      
    public function setParser($p)   {
        if ($p === null || $p instanceof HttpScraper || is_callable($p))  
            $this->_parser = $p;
    }
  
    public function runParser($content, $header)    {
        if ($this->_parser !== null)
        {
            if ($this->_parser instanceof HttpScraper)
                $this->_parser->parse($content, $header);
            else
                call_user_func($this->_parser, $content, $header);
        }
    } 
}
  
?>

 A new class "EmailDatabase" has been added to our code. This class implements MySQLTable which has one function addData() in the interface. 

The constructor of EmailDatase class will connect to MySQL through username and password given. Under XAMPP environment, host will be localhost, username is root and no password needed.

	public function __construct() 	{
		$host = 'localhost';
		$user = 'root';
		$pass = '';
		$dbname = 'email_collection';
		parent::__construct($host, $user, $pass, $dbname);
	}

 

The addData() will write data array $info that has name, email and phone collected, into table "contact_info". 

	public function addData($info)	{
		$sql = 'INSERT IGNORE INTO ' . $this->_table . ' (name, email, phone) ';
		$sql .= 'VALUES (\'' . $info[name] . '\', \'' . $info[email] . '\', \'' . $info[phone]. '\')';
		return $this->query($sql);
	}

	public function query($query, $mode = MYSQLI_STORE_RESULT)	{
		$this->ping();
		$res = parent::query($query, $mode);
		return $res;
	}
}

 The Scraper class also changed to handle the new database.

	private $_table;

    public function __construct($t = null) {
        $this->setTable($t);
    }	 
	
	public function __destruct()	{
		if ($this->_table !== null) {
			$this->_table = null;
		}
	}

    public function setTable($t)   {
        if ($t === null || $t instanceof MySQLTable)  
            $this->_table = $t;
    }

 The constructor of Scraper class will set the database passed during instantiation. 

Slight changes to function parse() where we insert addData() function to write to database.

$this->_table->addData($agent);

 For our test.php:

<?php
include 'httpcurl.php';
   
$target = "http://<domain name>/negotiators?page=";
$startPage = $target . "1";

$scrapeContent = new Scraper;
$firstPage = new HttpCurl();
$firstPage->get($startPage);

if ($firstPage->getStatus() === 200) {
	$lastPage = $scrapeContent->matchPattern(LASTPAGE, $firstPage->getBody(), 1);
}

$db = new EmailDatabase;
$scrapeContent = new Scraper($db);
$pages = new HttpCurl($scrapeContent);

for($i=1; $i <= $lastPage; $i++) { 
	$targetPage = $target . $i;
	$pages->get($targetPage, PARSE_CONTENT);
}

?>

 Only minor changes with $db object created and passded to $scrapeContent.

$db = new EmailDatabase;
$scrapeContent = new Scraper($db);
$pages = new HttpCurl($scrapeContent);

 When we run the program, all data will be stored into database.

Extracted email stored into databaseIn fact, there are 1600++ data stored in the database!

What that, most likely you just need to change this portion to extract different websites.

define('TARGET_BLOCK','~<div class="negotiators-wrapper">(.*?)</div>(\r\n)</div>~s');
define('NAME', '~<div class="negotiators-name"><a href="/negotiator/(.*?)">(.*?)</a></div>~');
define('EMAIL', '~<div class="negotiators-email">(.*?)</div>~');
define('PHONE', '~<div class="negotiators-phone">(.*?)</div>~');
define('LASTPAGE', '~<li class="pager-last last"><a href="/negotiators\?page=(.*?)"~');
define('PARSE_CONTENT', TRUE);

 That is all for PHP email extractor. Remember, don't spam your email recipients!

 So far we are scrapping on text. I will use the same code to demonstrate how to download and store images from websites in next article.

 

 

Code:

1. httpcurl.php

<?php

/********************************************************
* These are website specific matching pattern           *
* Change these matching patterns for each websites      *
* Else you will not get any results                     *
********************************************************/
define('TARGET_BLOCK','~<div class="negotiators-wrapper">(.*?)</div>(\r\n)</div>~s');
define('NAME', '~<div class="negotiators-name"><a href="/negotiator/(.*?)">(.*?)</a></div>~');
define('EMAIL', '~<div class="negotiators-email">(.*?)</div>~');
define('PHONE', '~<div class="negotiators-phone">(.*?)</div>~');
define('LASTPAGE', '~<li class="pager-last last"><a href="/negotiators\?page=(.*?)"~');
define('PARSE_CONTENT', TRUE);
 
// Interface MySQLTable
interface MySQLTable	{
	public function addData($info);	
}

// Class EmailDatabase
// Use the code below to crease table
/*****************************************************
  CREATE TABLE IF NOT EXISTS `contact_info` (
  `id` int(12) NOT NULL AUTO_INCREMENT,
  `name` varchar(128) NOT NULL,
  `email` varchar(128) NOT NULL,
  `phone` varchar(128) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `email` (`email`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8;
*******************************************************/
class EmailDatabase extends mysqli implements MySQLTable	{
	private $_table = 'contact_info';     // set default table

	// Connect to database
	public function __construct() 	{
		$host = 'localhost';
		$user = 'root';
		$pass = '';
		$dbname = 'email_collection';
		parent::__construct($host, $user, $pass, $dbname);
	}
	
	// Use this function to change to another table	
	public function setTableName($name)  {
		$this->_table = $name;
	}

	// Write data to table
	public function addData($info)	{
		$sql = 'INSERT IGNORE INTO ' . $this->_table . ' (name, email, phone) ';
		$sql .= 'VALUES (\'' . $info[name] . '\', \'' . $info[email] . '\', \'' . $info[phone]. '\')';
		return $this->query($sql);
	}

	// Execute MySQL query here
	public function query($query, $mode = MYSQLI_STORE_RESULT)	{
		$this->ping();
		$res = parent::query($query, $mode);
		return $res;
	}
}


// Interface HttpScraper
interface HttpScraper
{
    public function parse($body, $head);
}
  
 // Class Scraper
class Scraper implements HttpScraper	{
	private $_table;	

	// Store MySQL table if want to write to database.
    public function __construct($t = null) {
        $this->setTable($t);
    }	 
	
	// Delete table info at descructor
	public function __destruct()	{
		if ($this->_table !== null) {
			$this->_table = null;
		}
	}

	// Set table info to private variable $_table
    public function setTable($t)   {
        if ($t === null || $t instanceof MySQLTable)  
            $this->_table = $t;
    }
	
	// Get table info
	public function getTable()  {
		return $this->_table;
	}
	
	// Parse function
    public function parse($body, $head) {
       if ($head == 200) {    
        $p = preg_match_all(TARGET_BLOCK, $body, $blocks);         
            if ($p) {
                foreach($blocks[0] as $block) {
                    $agent[name] = $this->matchPattern(NAME, $block, 2);
                    $agent[email] = $this->matchPattern(EMAIL, $block, 1);
                    $agent[phone] = $this->matchPattern(PHONE, $block, 1);
//                    echo "<pre>"; print_r($agent); echo "</pre>";
					$this->_table->addData($agent);
               }
            }
        }
    }
     
	// Return matched info
    public function matchPattern($pattern, $content, $pos) {
        if (preg_match($pattern, $content, $match)) {
            return $match[$pos];
        }  
    }
}
  
 // Class HttpCurl
class HttpCurl {
    protected $_cookie, $_parser, $_timeout;
    private $_ch, $_info, $_body, $_error;
      
	// Check curl activated
	// Set Parser as well
    public function __construct($p = null) {
        if (!function_exists('curl_init')) {
            throw new Exception('cURL not enabled!');
        } 
        $this->setParser($p);
    }
  
	// Get web page and run parser
    public function get($url, $status = FALSE) {
		$this->request($url);	
		if ($status === TRUE) {
			return $this->runParser($this->_body, $this->getStatus()); 
		}		
    }
  
	// Run cURL to get web page source file
    protected function request($url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
        curl_setopt($ch, CURLOPT_MAXREDIRS, 5);   
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
        curl_setopt($ch, CURLOPT_URL, $url);
        $this->_body = curl_exec($ch);
        $this->_info  = curl_getinfo($ch);
        $this->_error = curl_error($ch);
        curl_close($ch);      
    }
  
	// Get http_code
    public function getStatus() {
        return $this->_info[http_code];
    }
      
	// Get web page header information
    public function getHeader() {
        return $this->_info;
    }
  
	// Get web page content
    public function getBody() {
        return $this->_body;
    }
      
    public function __destruct() {
    } 
      
	// set parser, either object or callback function
    public function setParser($p)   {
        if ($p === null || $p instanceof HttpScraper || is_callable($p))  
            $this->_parser = $p;
    }
  
	// Execute parser
    public function runParser($content, $header)    {
        if ($this->_parser !== null)
        {
            if ($this->_parser instanceof HttpScraper)
                $this->_parser->parse($content, $header);
            else
                call_user_func($this->_parser, $content, $header);
        }
    } 
}
  
?>

 

2. test.php

<?php
include 'httpcurl.php';	// include lib file
   
$target = "http://<website domain>/negotiators?page=";	// Set our target's url, remember not to include nu,ber in pagination
$startPage = $target . "1";	// Set first page

$scrapeContent = new Scraper;
$firstPage = new HttpCurl();
$firstPage->get($startPage);   // get first page content

if ($firstPage->getStatus() === 200) {
	$lastPage = $scrapeContent->matchPattern(LASTPAGE, $firstPage->getBody(), 1);	// get total page info from first page
}

$db = new EmailDatabase;	// can be excluded if do not want to write to database
$scrapeContent = new Scraper($db);	// // can be excluded as well
$pages = new HttpCurl($scrapeContent);

// Looping from first page to last and parse each and every pages to database
for($i=1; $i <= $lastPage; $i++) { 
	$targetPage = $target . $i;
	$pages->get($targetPage, PARSE_CONTENT);
}

?>

 

Last modified on Tuesday, 29 December 2020 07:22
Rate this item
(0 votes)
back to top