Log in Register

Log in

Create MySQL Database for PHP Web Spider Extracted Emails Addresses (4)

store extracted email to database

In this final part of PHP/cURL email extractor, I will show you how to store extracted data into MySQL database. You can store email addresses and contact information collected not just from one website, but also from various websites into the same database.

You might want to store email collected based on your purpose. For example, if you have a real estate website and a internet shopping website, then information collected should be stored into two different categories (tables in MySQL database).

First, you need to activate XAMPP on your PC, both Apache and MySQL. At browser URL, go to "http://localhost/phpmyadmin/". Go to top menu bar and select "Database". To create a new database for our tutorial, enter "email_collection" and press "Create" button, as shown in the picture below.

You can download the source file for PHP cURL Email Extractor from here.

 

 

Create database for email extractor

Go to "email_collection" database and you will find that it is still empty. Select "SQL", cut and paste query below into blank space and click "Go". 

CREATE TABLE IF NOT EXISTS `contact_info` (
  `id` int(12) NOT NULL AUTO_INCREMENT,
  `name` varchar(128) NOT NULL,
  `email` varchar(128) NOT NULL,
  `phone` varchar(128) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `email` (`email`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8;

 mysql query to create new database

A table "contact_info" is created under database "email_collection". 

empty new database Click "contact_info" and there are four data columns. Here we create a column "id" that will automatically increment each time a new entry is added to the table. "name", "email" and "phone" will store data collected. The data column is defined as "UNIQUE KEY" so that no the same emails will store more than once in the database. For example, you have collected and stored an email into database, the same email might be found when you run the script on another website. In this case, we just keep the first and ignore the second one.

tables in the database

So now here is the action: I modified the httpcurl.php file.How it works:

<?php

define('TARGET_BLOCK','~<div class="negotiators-wrapper">(.*?)</div>(\r\n)</div>~s');
define('NAME', '~<div class="negotiators-name"><a href="/negotiator/(.*?)">(.*?)</a></div>~');
define('EMAIL', '~<div class="negotiators-email">(.*?)</div>~');
define('PHONE', '~<div class="negotiators-phone">(.*?)</div>~');
define('LASTPAGE', '~<li class="pager-last last"><a href="/negotiators\?page=(.*?)"~');
define('PARSE_CONTENT', TRUE);
 
interface MySQLTable	{
	public function addData($info);	
}

class EmailDatabase extends mysqli implements MySQLTable	{
	private $_table = 'contact_info';

	public function __construct() 	{
		$host = 'localhost';
		$user = 'root';
		$pass = '';
		$dbname = 'email_collection';
		parent::__construct($host, $user, $pass, $dbname);
	}
	
	public function setTableName($name)  {
		$this->_table = $name;
	}

	public function addData($info)	{
		$sql = 'INSERT IGNORE INTO ' . $this->_table . ' (name, email, phone) ';
		$sql .= 'VALUES (\'' . $info[name] . '\', \'' . $info[email] . '\', \'' . $info[phone]. '\')';
		return $this->query($sql);
	}

	public function query($query, $mode = MYSQLI_STORE_RESULT)	{
		$this->ping();
		$res = parent::query($query, $mode);
		return $res;
	}
}


interface HttpScraper
{
    public function parse($body, $head);
}
  
class Scraper implements HttpScraper
{
	private $_table;

    public function __construct($t = null) {
        $this->setTable($t);
    }	 
	
	public function __destruct()	{
		if ($this->_table !== null) {
			$this->_table = null;
		}
	}

    public function setTable($t)   {
        if ($t === null || $t instanceof MySQLTable)  
            $this->_table = $t;
    }
	
	public function getTable()  {
		return $this->_table;
	}
	
    public function parse($body, $head) {
       if ($head == 200) {    
        $p = preg_match_all(TARGET_BLOCK, $body, $blocks);         
            if ($p) {
                foreach($blocks[0] as $block) {
                    $agent[name] = $this->matchPattern(NAME, $block, 2);
                    $agent[email] = $this->matchPattern(EMAIL, $block, 1);
                    $agent[phone] = $this->matchPattern(PHONE, $block, 1);
//                    echo "<pre>"; print_r($agent); echo "</pre>";
					$this->_table->addData($agent);
               }
            }
        }
    }
     
    public function matchPattern($pattern, $content, $pos) {
        if (preg_match($pattern, $content, $match)) {
            return $match[$pos];
        }  
    }
}
  
class HttpCurl {
    protected $_cookie, $_parser, $_timeout;
    private $_ch, $_info, $_body, $_error;
      
    public function __construct($p = null) {
        if (!function_exists('curl_init')) {
            throw new Exception('cURL not enabled!');
        } 
        $this->setParser($p);
    }
  
    public function get($url, $status = FALSE) {
		$this->request($url);	
		if ($status === TRUE) {
			return $this->runParser($this->_body, $this->getStatus()); 
		}		
    }
  
    protected function request($url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
        curl_setopt($ch, CURLOPT_MAXREDIRS, 5);   
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
        curl_setopt($ch, CURLOPT_URL, $url);
        $this->_body = curl_exec($ch);
        $this->_info  = curl_getinfo($ch);
        $this->_error = curl_error($ch);
        curl_close($ch);      
    }
  
    public function getStatus() {
        return $this->_info[http_code];
    }
      
    public function getHeader() {
        return $this->_info;
    }
  
    public function getBody() {
        return $this->_body;
    }
      
    public function __destruct() {
    } 
      
    public function setParser($p)   {
        if ($p === null || $p instanceof HttpScraper || is_callable($p))  
            $this->_parser = $p;
    }
  
    public function runParser($content, $header)    {
        if ($this->_parser !== null)
        {
            if ($this->_parser instanceof HttpScraper)
                $this->_parser->parse($content, $header);
            else
                call_user_func($this->_parser, $content, $header);
        }
    } 
}
  
?>

 A new class "EmailDatabase" has been added to our code. This class implements MySQLTable which has one function addData() in the interface. 

The constructor of EmailDatase class will connect to MySQL through username and password given. Under XAMPP environment, host will be localhost, username is root and no password needed.

	public function __construct() 	{
		$host = 'localhost';
		$user = 'root';
		$pass = '';
		$dbname = 'email_collection';
		parent::__construct($host, $user, $pass, $dbname);
	}

 

The addData() will write data array $info that has name, email and phone collected, into table "contact_info". 

	public function addData($info)	{
		$sql = 'INSERT IGNORE INTO ' . $this->_table . ' (name, email, phone) ';
		$sql .= 'VALUES (\'' . $info[name] . '\', \'' . $info[email] . '\', \'' . $info[phone]. '\')';
		return $this->query($sql);
	}

	public function query($query, $mode = MYSQLI_STORE_RESULT)	{
		$this->ping();
		$res = parent::query($query, $mode);
		return $res;
	}
}

 The Scraper class also changed to handle the new database.

	private $_table;

    public function __construct($t = null) {
        $this->setTable($t);
    }	 
	
	public function __destruct()	{
		if ($this->_table !== null) {
			$this->_table = null;
		}
	}

    public function setTable($t)   {
        if ($t === null || $t instanceof MySQLTable)  
            $this->_table = $t;
    }

 The constructor of Scraper class will set the database passed during instantiation. 

Slight changes to function parse() where we insert addData() function to write to database.

$this->_table->addData($agent);

 For our test.php:

<?php
include 'httpcurl.php';
   
$target = "http://<domain name>/negotiators?page=";
$startPage = $target . "1";

$scrapeContent = new Scraper;
$firstPage = new HttpCurl();
$firstPage->get($startPage);

if ($firstPage->getStatus() === 200) {
	$lastPage = $scrapeContent->matchPattern(LASTPAGE, $firstPage->getBody(), 1);
}

$db = new EmailDatabase;
$scrapeContent = new Scraper($db);
$pages = new HttpCurl($scrapeContent);

for($i=1; $i <= $lastPage; $i++) { 
	$targetPage = $target . $i;
	$pages->get($targetPage, PARSE_CONTENT);
}

?>

 Only minor changes with $db object created and passded to $scrapeContent.

$db = new EmailDatabase;
$scrapeContent = new Scraper($db);
$pages = new HttpCurl($scrapeContent);

 When we run the program, all data will be stored into database.

Extracted email stored into databaseIn fact, there are 1600++ data stored in the database!

What that, most likely you just need to change this portion to extract different websites.

define('TARGET_BLOCK','~<div class="negotiators-wrapper">(.*?)</div>(\r\n)</div>~s');
define('NAME', '~<div class="negotiators-name"><a href="/negotiator/(.*?)">(.*?)</a></div>~');
define('EMAIL', '~<div class="negotiators-email">(.*?)</div>~');
define('PHONE', '~<div class="negotiators-phone">(.*?)</div>~');
define('LASTPAGE', '~<li class="pager-last last"><a href="/negotiators\?page=(.*?)"~');
define('PARSE_CONTENT', TRUE);

 That is all for PHP email extractor. Remember, don't spam your email recipients!

 So far we are scrapping on text. I will use the same code to demonstrate how to download and store images from websites in next article.

 

 

Last modified on Thursday, 03 November 2016 06:29
Rate this item
(0 votes)

2 comments

  • ali

    when i try this script i have yhis two error
    Notice: Use of undefined constant http_code - assumed 'http_code' in C:\xampp\htdocs\ff\httpcurl.php on line 159

    Notice: Undefined variable: lastPage in C:\xampp\htdocs\ff\test.php on line 20

    can help me

    posted by ali Monday, 18 July 2016 05:24 Comment Link
  • Chin-Hock Tan

    Hello Ali,
    This is not really a problem.
    Insert this code at the second line at test.php :
    error_reporting(E_ERROR | E_PARSE);

    The notification will go away.

    posted by Chin-Hock Tan Monday, 18 July 2016 11:01 Comment Link

Leave a comment

Make sure you enter all the required information, indicated by an asterisk (*). HTML code is not allowed.



Anti-spam: complete the task
back to top