Download and Save Images with PHP/cURL Web Scraper Script


In this article, I will discuss how to download and save image files with a PHP/cURL web scraper. I will use the email extractor script created earlier as an example. With some modification, the same script can be used to extract product information and images from Internet shopping websites such as ebay.com or amazon.com into your own database. It can also extract business information, both text and images, from directory websites for use on your own website.

There are a few concerns to consider before we scrape image files from websites.

1) A website may use several file formats (JPEG, PNG, GIF, etc.). Even a single web page can mix formats.

If we want to build a common database for all collected images (from various websites), our PHP web scraper script needs to be able to convert them to the file format we prefer.

2) Images come in different sizes.

Some images are very large and some very small. Our PHP web scraping script needs to be able to resize a large image to a smaller one. Shrinking a large image is not a problem, but enlarging a small one gives a poor-quality image.

3) We need a naming convention for image files.

Different websites name image files differently. Some use long names, some short. Before storing image files in our folder, we need to rename them according to our own naming convention.

4) We need to add one column to the MySQL database to link each image to its related information.

So here we go...

Note: Check out the sample code at the bottom of this article.

First, we look at the page source to find the delimiters for a pattern that matches the image file.


So I added an IMAGE pattern to the scraper.php file.

define('TARGET_BLOCK','~<div class="negotiators-wrapper">(.*?)</div>(\r\n)</div>~s');
define('NAME', '~<div class="negotiators-name"><a href="/negotiator/(.*?)">(.*?)</a></div>~');
define('EMAIL', '~<div class="negotiators-email">(.*?)</div>~');
define('PHONE', '~<div class="negotiators-phone">(.*?)</div>~');
define('LASTPAGE', '~<li class="pager-last last"><a href="/negotiators\?page=(.*?)"~');
define('IMAGE', '~<div class="negotiators-photo"><a href="/negotiator/(.*?)"><img src="/(.*?)"~');
define('PARSE_CONTENT', TRUE);
define('IMAGE_DIR', 'c:\\xampp\\htdocs\\scraper\\image\\');

I also added IMAGE_DIR, the directory where I want to store the image files.
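To see what the IMAGE pattern captures, here is a minimal sketch that runs it against a made-up HTML fragment. The fragment, the agent name and the file path are assumptions for illustration only; the real page markup may differ.

```php
<?php
// Same IMAGE pattern as in scraper.php, kept in a variable for this demo.
$imagePattern = '~<div class="negotiators-photo"><a href="/negotiator/(.*?)"><img src="/(.*?)"~';

// Hypothetical HTML fragment, invented for illustration only.
$sampleBlock = '<div class="negotiators-photo"><a href="/negotiator/jane-doe">'
             . '<img src="/files/photos/jane-doe.jpg" alt="Jane Doe" /></a></div>';

if (preg_match($imagePattern, $sampleBlock, $match)) {
    // $match[1] is the profile slug, $match[2] is the image path
    // without its leading slash.
    echo $match[1], "\n"; // jane-doe
    echo $match[2], "\n"; // files/photos/jane-doe.jpg
}
```

Note that the second capture group is a site-relative path, not a full URL.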

A new class, Image, is added to our script in the file image.php to process the images. The original class was written by Simon Jarvis in 2006; you can find the code at http://www.white-hat-web-design.co.uk/blog/resizing-images-with-php. I modified the code to fit into our script.

<?php

class Image {   
private $_image; 
private $_imageFormat;   

public function load($imageFile) {   
	$imageInfo = getImageSize($imageFile); 
	$this->_imageFormat = $imageInfo[2]; 
	if( $this->_imageFormat === IMAGETYPE_JPEG ) {   
		$this->_image = imagecreatefromjpeg($imageFile); 
	} elseif( $this->_imageFormat === IMAGETYPE_GIF ) {  
		$this->_image = imagecreatefromgif($imageFile); 
	} elseif( $this->_imageFormat === IMAGETYPE_PNG ) {  
		$this->_image = imagecreatefrompng($imageFile); 
	} 
} 

public function save($imageFile, $_imageFormat=IMAGETYPE_JPEG, $compression=75, $permissions=null) {   
	if( $_imageFormat == IMAGETYPE_JPEG ) { 
		imagejpeg($this->_image,$imageFile,$compression); 
	} elseif ( $_imageFormat == IMAGETYPE_GIF ) {   
		imagegif($this->_image,$imageFile); 
	} elseif ( $_imageFormat == IMAGETYPE_PNG ) {   
		imagepng($this->_image,$imageFile); 
	} 
	if( $permissions != null) {   
		chmod($imageFile,$permissions); 
	} 
} 
	

public function getWidth() {   
	return imagesx($this->_image); 
} 

public function getHeight() {   
	return imagesy($this->_image); 
} 

public function resizeToHeight($height) {   
	$ratio = $height / $this->getHeight(); 
	$width = $this->getWidth() * $ratio; 
	$this->resize($width,$height); 
}   

public function resizeToWidth($width) { 
	$ratio = $width / $this->getWidth(); 
	$height = $this->getheight() * $ratio; 
	$this->resize($width,$height); 
}   

public function scale($scale) { 
	$width = $this->getWidth() * $scale/100; 
	$height = $this->getheight() * $scale/100; 
	$this->resize($width,$height); 
}   

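private function resize($width, $height) { 
	// GD works in whole pixels, so round the computed sizes (minimum 1)
	$width = max(1, (int) round($width)); 
	$height = max(1, (int) round($height)); 
	$newImage = imagecreatetruecolor($width, $height); 
	imagecopyresampled($newImage, $this->_image, 0, 0, 0, 0, $width, $height, $this->getWidth(), $this->getHeight()); 
	$this->_image = $newImage; 
}   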

}

?>

 

We can load an image file with the load() function and get its width and height with getWidth() and getHeight(). Before saving the image with save(), we can resize it with resizeToWidth(), resizeToHeight() or scale(). This addresses concern #2.
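The arithmetic behind resizeToWidth() can be checked without GD. The sketch below uses a hypothetical helper, scaledSize(), which is not part of the Image class; it only reproduces the ratio calculation and rounds the result to whole pixels.

```php
<?php
// Hypothetical helper: compute the size resizeToWidth() would aim for.
// It keeps the aspect ratio and returns whole pixels.
function scaledSize(int $width, int $height, int $targetWidth): array
{
    $ratio = $targetWidth / $width;
    // Round to whole pixels and never go below 1.
    return [$targetWidth, max(1, (int) round($height * $ratio))];
}

print_r(scaledSize(130, 130, 100)); // 100 x 100
print_r(scaledSize(400, 300, 100)); // 100 x 75
```

A square 130x130 source stays square at 100x100, while a 400x300 source becomes 100x75.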

The save() function can convert the image to the file format we want, which addresses concern #1. You can experiment with the function on your own.

To address concern #4, we add the column "image" to the table "contact_info" in MySQL.

We then add the value $info['image'] to the addData() function in class EmailDatabase.

class EmailDatabase extends mysqli implements MySQLTable	{
	private $_table = 'contact_info';     // set default table

	// Connect to database
	public function __construct() 	{
		$host = 'localhost';
		$user = 'root';
		$pass = '';
		$dbname = 'email_collection';
		parent::__construct($host, $user, $pass, $dbname);
	}
	
	// Use this function to change to another table	
	public function setTableName($name)  {
		$this->_table = $name;
	}

	// Write data to table
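	public function addData($info)	{
		// Escape each value so quotes in scraped text cannot break (or inject into) the query
		$name  = $this->real_escape_string((string) $info['name']);
		$email = $this->real_escape_string((string) $info['email']);
		$phone = $this->real_escape_string((string) $info['phone']);
		$image = $this->real_escape_string((string) $info['image']);
		$sql = 'INSERT IGNORE INTO ' . $this->_table . ' (name, email, phone, image) ';
		$sql .= "VALUES ('$name', '$email', '$phone', '$image')";
		return $this->query($sql);
	}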

	// Execute MySQL query here
	public function query($query, $mode = MYSQLI_STORE_RESULT)	{
		$this->ping();
		$res = parent::query($query, $mode);
		return $res;
	}
}

A new saveImage() function is added to class Scraper.

class Scraper implements HttpScraper	{
	private $_table;	

	// Store MySQL table if want to write to database.
    public function __construct($t = null) {
        $this->setTable($t);
    }	 
	
	// Delete table info at destructor
	public function __destruct()	{
		if ($this->_table !== null) {
			$this->_table = null;
		}
	}

	// Set table info to private variable $_table
    public function setTable($t)   {
        if ($t === null || $t instanceof MySQLTable)  
            $this->_table = $t;
    }
	
	// Get table info
	public function getTable()  {
		return $this->_table;
	}
	
	// Parse function
    public function parse($body, $head) {
       if ($head == 200) {    
        $p = preg_match_all(TARGET_BLOCK, $body, $blocks);         
            if ($p) {
                foreach($blocks[0] as $block) {
                    $agent['name'] = $this->matchPattern(NAME, $block, 2);
                    $agent['email'] = $this->matchPattern(EMAIL, $block, 1);
                    $agent['phone'] = $this->matchPattern(PHONE, $block, 1);
                    $originalImagePath = $this->matchPattern(IMAGE, $block, 2);
                    $agent['image'] = $this->saveImage($originalImagePath, IMAGETYPE_GIF);
//                    echo "<pre>"; print_r($agent); echo "</pre>";
                    $this->_table->addData($agent);
               }
            }
        }
    }
     
	// Return matched info
    public function matchPattern($pattern, $content, $pos) {
        if (preg_match($pattern, $content, $match)) {
            return $match[$pos];
        }  
    }
	

	public function saveImage($imageUrl, $imageType = IMAGETYPE_GIF) {		
		if (!file_exists(IMAGE_DIR)) {
			mkdir(IMAGE_DIR, 0777, true);
		}
		
		if( $imageType === IMAGETYPE_JPEG ) { 
			$fileExt = 'jpg';
		} elseif ( $imageType === IMAGETYPE_GIF ) {   
			$fileExt = 'gif';
		} elseif ( $imageType === IMAGETYPE_PNG ) {   
			$fileExt = 'png';
		} 
	
		
		
		$newImageName = md5($imageUrl). '.' . $fileExt;
			
		$image = new Image(); 
		$image->load($imageUrl); 
		$image->resizeToWidth(100); 
		$image->save( IMAGE_DIR . $newImageName,  $imageType );  		
		return $newImageName;
	}	
	
	
}

The saveImage() function first checks whether the image directory exists and, if not, creates it.

		if (!file_exists(IMAGE_DIR)) {
			mkdir(IMAGE_DIR, 0777, true);
		}
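A slightly more defensive version of this check is sketched below. The helper name ensureImageDir() is hypothetical, and the 0755 mode is my assumption rather than the article's 0777; adjust it to your server setup.

```php
<?php
// Hypothetical helper: make sure the target directory exists and is writable.
function ensureImageDir(string $dir): string
{
    // The second is_dir() check covers the case where another process
    // created the directory between the first check and mkdir().
    if (!is_dir($dir) && !mkdir($dir, 0755, true) && !is_dir($dir)) {
        throw new RuntimeException("Cannot create image directory: $dir");
    }
    if (!is_writable($dir)) {
        throw new RuntimeException("Image directory is not writable: $dir");
    }
    return $dir;
}

// Demonstration in the system temp directory rather than the real IMAGE_DIR.
$demoDir = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'scraper_image_demo';
ensureImageDir($demoDir);
```

Failing early with an exception makes a permissions problem visible before any image is downloaded.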

 

The default file format for this tutorial is GIF (you can change it to your preferred format). You can convert all images to another format by specifying it in $imageType.

		if( $imageType === IMAGETYPE_JPEG ) { 
			$fileExt = 'jpg';
		} elseif ( $imageType === IMAGETYPE_GIF ) {   
			$fileExt = 'gif';
		} elseif ( $imageType === IMAGETYPE_PNG ) {   
			$fileExt = 'png';
		} 
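The same mapping can also be written as a lookup table that fails loudly for an unsupported type, instead of leaving $fileExt undefined. This is only a sketch; extensionFor() is a hypothetical helper name, not part of the original script.

```php
<?php
// Hypothetical helper: map an IMAGETYPE_* constant to a file extension.
function extensionFor(int $imageType): string
{
    $map = [
        IMAGETYPE_JPEG => 'jpg',
        IMAGETYPE_GIF  => 'gif',
        IMAGETYPE_PNG  => 'png',
    ];
    if (!isset($map[$imageType])) {
        throw new InvalidArgumentException("Unsupported image type: $imageType");
    }
    return $map[$imageType];
}

echo extensionFor(IMAGETYPE_PNG), "\n"; // png
```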

 

We then rename the file and give it the new extension.

$newImageName = md5($imageUrl). '.' . $fileExt;

Here, I am using the PHP function md5() to hash the image URL and append the file extension to create the new file name. You can instead use <time stamp>.<file extension> with the time() function, or even <website source>_<name>.<file extension>, whichever suits you.
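The three naming schemes can be sketched in one function. Everything here is illustrative: buildImageName(), its $scheme values and the sample inputs are hypothetical and not part of the original script.

```php
<?php
// Hypothetical helper showing the three naming schemes mentioned above.
function buildImageName(string $imageUrl, string $fileExt, string $scheme = 'md5',
                        string $source = '', string $name = ''): string
{
    switch ($scheme) {
        case 'time':
            // Timestamps can collide when several images are saved in the same second.
            return time() . '.' . $fileExt;
        case 'source':
            // Keep only characters that are safe in file names.
            $safe = preg_replace('~[^A-Za-z0-9_-]+~', '-', $source . '_' . $name);
            return strtolower(trim($safe, '-')) . '.' . $fileExt;
        default:
            // Same scheme as the article: hash of the image URL.
            return md5($imageUrl) . '.' . $fileExt;
    }
}

echo buildImageName('files/photos/jane-doe.jpg', 'gif'), "\n";
echo buildImageName('', 'gif', 'source', 'example.com', 'Jane Doe'), "\n"; // example-com_jane-doe.gif
```

The md5() scheme has the advantage that the same URL always maps to the same file name, so re-running the scraper overwrites a file instead of duplicating it.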

We then load the image into memory with the load() function. In this example, the original image is 130x130 pixels. I resized it to 100x100 by calling resizeToWidth(100), then saved the image into the desired directory. The function returns the new file name of the image.

		$image = new Image(); 
		$image->load($imageUrl); 
		$image->resizeToWidth(100); 
		$image->save( IMAGE_DIR . $newImageName,  $imageType );  		
		return $newImageName;
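Because the IMAGE pattern captures the path after the leading slash, load() may receive a site-relative path rather than a full URL. If that is the case on your target site, the path needs the site address in front of it before loading. A minimal sketch, assuming a hypothetical helper absoluteImageUrl() and a placeholder base URL:

```php
<?php
// Hypothetical helper: turn the path captured by the IMAGE pattern into a full URL.
// $baseUrl is an assumed placeholder for the scraped site's address.
function absoluteImageUrl(string $baseUrl, string $path): string
{
    // Leave the path alone if it is already a full URL.
    if (preg_match('~^https?://~i', $path)) {
        return $path;
    }
    return rtrim($baseUrl, '/') . '/' . ltrim($path, '/');
}

echo absoluteImageUrl('http://www.example.com/', 'files/photos/jane-doe.jpg'), "\n";
// http://www.example.com/files/photos/jane-doe.jpg
```

The result can then be passed to saveImage() in place of the raw captured path.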

The returned file name is then stored in MySQL through the parse() function.

    public function parse($body, $head) {
       if ($head == 200) {    
        $p = preg_match_all(TARGET_BLOCK, $body, $blocks);         
            if ($p) {
                foreach($blocks[0] as $block) {
                    $agent['name'] = $this->matchPattern(NAME, $block, 2);
                    $agent['email'] = $this->matchPattern(EMAIL, $block, 1);
                    $agent['phone'] = $this->matchPattern(PHONE, $block, 1);
                    $originalImagePath = $this->matchPattern(IMAGE, $block, 2);
                    $agent['image'] = $this->saveImage($originalImagePath, IMAGETYPE_GIF);
                    $this->_table->addData($agent);
               }
            }
        }
    }

 

After you run the script, you can see that MySQL stores the image file name for each real estate agent.

Then go to the image directory; you will find all the images downloaded at 100x100 in GIF format.

 

Code: 

1. httpcurl.php

<?php
 
 // Class HttpCurl
class HttpCurl {
    protected $_cookie, $_parser, $_timeout;
    private $_ch, $_info, $_body, $_error;
      
	// Check curl activated
	// Set Parser as well
    public function __construct($p = null) {
        if (!function_exists('curl_init')) {
            throw new Exception('cURL not enabled!');
        } 
        $this->setParser($p);
    }
  
	// Get web page and run parser
    public function get($url, $status = FALSE) {
		$this->request($url);	
		if ($status === TRUE) {
			return $this->runParser($this->_body, $this->getStatus()); 
		}		
    }
  
	// Run cURL to get web page source file
    protected function request($url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
        curl_setopt($ch, CURLOPT_MAXREDIRS, 5);   
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
        curl_setopt($ch, CURLOPT_URL, $url);
        $this->_body = curl_exec($ch);
        $this->_info  = curl_getinfo($ch);
        $this->_error = curl_error($ch);
        curl_close($ch);      
    }
  
	// Get http_code
    public function getStatus() {
        return $this->_info['http_code'];
    }
      
	// Get web page header information
    public function getHeader() {
        return $this->_info;
    }
  
	// Get web page content
    public function getBody() {
        return $this->_body;
    }
      
    public function __destruct() {
    } 
      
	// set parser, either object or callback function
    public function setParser($p)   {
        if ($p === null || $p instanceof HttpScraper || is_callable($p))  
            $this->_parser = $p;
    }
  
	// Execute parser
    public function runParser($content, $header)    {
        if ($this->_parser !== null)
        {
            if ($this->_parser instanceof HttpScraper)
                $this->_parser->parse($content, $header);
            else
                call_user_func($this->_parser, $content, $header);
        }
    } 
}
  
?>

 

2. image.php

<?php

class Image {   
private $_image; 
private $_imageFormat;   

public function load($imageFile) {   
	$imageInfo = getImageSize($imageFile); 
	$this->_imageFormat = $imageInfo[2]; 
	if( $this->_imageFormat === IMAGETYPE_JPEG ) {   
		$this->_image = imagecreatefromjpeg($imageFile); 
	} elseif( $this->_imageFormat === IMAGETYPE_GIF ) {  
		$this->_image = imagecreatefromgif($imageFile); 
	} elseif( $this->_imageFormat === IMAGETYPE_PNG ) {  
		$this->_image = imagecreatefrompng($imageFile); 
	} 
} 

public function save($imageFile, $_imageFormat=IMAGETYPE_JPEG, $compression=75, $permissions=null) {   
	if( $_imageFormat == IMAGETYPE_JPEG ) { 
		imagejpeg($this->_image,$imageFile,$compression); 
	} elseif ( $_imageFormat == IMAGETYPE_GIF ) {   
		imagegif($this->_image,$imageFile); 
	} elseif ( $_imageFormat == IMAGETYPE_PNG ) {   
		imagepng($this->_image,$imageFile); 
	} 
	if( $permissions != null) {   
		chmod($imageFile,$permissions); 
	} 
} 
	

public function getWidth() {   
	return imagesx($this->_image); 
} 

public function getHeight() {   
	return imagesy($this->_image); 
} 

public function resizeToHeight($height) {   
	$ratio = $height / $this->getHeight(); 
	$width = $this->getWidth() * $ratio; 
	$this->resize($width,$height); 
}   

public function resizeToWidth($width) { 
	$ratio = $width / $this->getWidth(); 
	$height = $this->getheight() * $ratio; 
	$this->resize($width,$height); 
}   

public function scale($scale) { 
	$width = $this->getWidth() * $scale/100; 
	$height = $this->getheight() * $scale/100; 
	$this->resize($width,$height); 
}   

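private function resize($width, $height) { 
	// GD works in whole pixels, so round the computed sizes (minimum 1)
	$width = max(1, (int) round($width)); 
	$height = max(1, (int) round($height)); 
	$newImage = imagecreatetruecolor($width, $height); 
	imagecopyresampled($newImage, $this->_image, 0, 0, 0, 0, $width, $height, $this->getWidth(), $this->getHeight()); 
	$this->_image = $newImage; 
}   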

}

?>

 

3. scraper.php

<?php

/********************************************************
* These are website specific matching pattern           *
* Change these matching patterns for each websites      *
* Else you will not get any results                     *
********************************************************/
define('TARGET_BLOCK','~<div class="negotiators-wrapper">(.*?)</div>(\r\n)</div>~s');
define('NAME', '~<div class="negotiators-name"><a href="/negotiator/(.*?)">(.*?)</a></div>~');
define('EMAIL', '~<div class="negotiators-email">(.*?)</div>~');
define('PHONE', '~<div class="negotiators-phone">(.*?)</div>~');
define('LASTPAGE', '~<li class="pager-last last"><a href="/negotiators\?page=(.*?)"~');
define('IMAGE', '~<div class="negotiators-photo"><a href="/negotiator/(.*?)"><img src="/(.*?)"~');
define('PARSE_CONTENT', TRUE);
define('IMAGE_DIR', 'c:\\xampp\\htdocs\\scraper\\image\\');
 
// Interface MySQLTable
interface MySQLTable	{
	public function addData($info);	
}

// Class EmailDatabase
// Use the code below to create the table, then add the "image" column described in the article
/*****************************************************
  CREATE TABLE IF NOT EXISTS `contact_info` (
  `id` int(12) NOT NULL AUTO_INCREMENT,
  `name` varchar(128) NOT NULL,
  `email` varchar(128) NOT NULL,
  `phone` varchar(128) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `email` (`email`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8;
*******************************************************/
class EmailDatabase extends mysqli implements MySQLTable	{
	private $_table = 'contact_info';     // set default table

	// Connect to database
	public function __construct() 	{
		$host = 'localhost';
		$user = 'root';
		$pass = '';
		$dbname = 'email_collection';
		parent::__construct($host, $user, $pass, $dbname);
	}
	
	// Use this function to change to another table	
	public function setTableName($name)  {
		$this->_table = $name;
	}

	// Write data to table
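	public function addData($info)	{
		// Escape each value so quotes in scraped text cannot break (or inject into) the query
		$name  = $this->real_escape_string((string) $info['name']);
		$email = $this->real_escape_string((string) $info['email']);
		$phone = $this->real_escape_string((string) $info['phone']);
		$image = $this->real_escape_string((string) $info['image']);
		$sql = 'INSERT IGNORE INTO ' . $this->_table . ' (name, email, phone, image) ';
		$sql .= "VALUES ('$name', '$email', '$phone', '$image')";
		return $this->query($sql);
	}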

	// Execute MySQL query here
	public function query($query, $mode = MYSQLI_STORE_RESULT)	{
		$this->ping();
		$res = parent::query($query, $mode);
		return $res;
	}
}


// Interface HttpScraper
interface HttpScraper
{
    public function parse($body, $head);
}
  
 // Class Scraper
class Scraper implements HttpScraper	{
	private $_table;	

	// Store MySQL table if want to write to database.
    public function __construct($t = null) {
        $this->setTable($t);
    }	 
	
	// Delete table info at destructor
	public function __destruct()	{
		if ($this->_table !== null) {
			$this->_table = null;
		}
	}

	// Set table info to private variable $_table
    public function setTable($t)   {
        if ($t === null || $t instanceof MySQLTable)  
            $this->_table = $t;
    }
	
	// Get table info
	public function getTable()  {
		return $this->_table;
	}
	
	// Parse function
    public function parse($body, $head) {
       if ($head == 200) {    
        $p = preg_match_all(TARGET_BLOCK, $body, $blocks);         
            if ($p) {
                foreach($blocks[0] as $block) {
                    $agent['name'] = $this->matchPattern(NAME, $block, 2);
                    $agent['email'] = $this->matchPattern(EMAIL, $block, 1);
                    $agent['phone'] = $this->matchPattern(PHONE, $block, 1);
                    $originalImagePath = $this->matchPattern(IMAGE, $block, 2);
                    $agent['image'] = $this->saveImage($originalImagePath, IMAGETYPE_GIF);
//                    echo "<pre>"; print_r($agent); echo "</pre>";
                    $this->_table->addData($agent);
               }
            }
        }
    }
     
	// Return matched info
    public function matchPattern($pattern, $content, $pos) {
        if (preg_match($pattern, $content, $match)) {
            return $match[$pos];
        }  
    }
	

	public function saveImage($imageUrl, $imageType = IMAGETYPE_GIF) {		
		if (!file_exists(IMAGE_DIR)) {
			mkdir(IMAGE_DIR, 0777, true);
		}
		
		if( $imageType === IMAGETYPE_JPEG ) { 
			$fileExt = 'jpg';
		} elseif ( $imageType === IMAGETYPE_GIF ) {   
			$fileExt = 'gif';
		} elseif ( $imageType === IMAGETYPE_PNG ) {   
			$fileExt = 'png';
		} 
	
		$path_parts = pathinfo($imageUrl);
		
		$newImageName = md5($imageUrl). '.' . $fileExt;
			
		$image = new Image(); 
		$image->load($imageUrl); 
		$image->resizeToWidth(100); 
		$image->save( IMAGE_DIR . $newImageName,  $imageType );  		
		return $newImageName;
	}	
	
	
}
 

?>

 

4. extract.php

<?php
include 'image.php';
include 'scraper.php';
include 'httpcurl.php';	// include lib file
   
$target = "http://<domain name>/negotiators?page=";	// Set our target's URL; remember not to include the page number
$startPage = $target . "1";	// Set first page

$scrapeContent = new Scraper;
$firstPage = new HttpCurl();
$firstPage->get($startPage);   // get first page content

$lastPage = 0;	// stays 0 (no pages scraped) if the first request fails
if ($firstPage->getStatus() === 200) {
	$lastPage = $scrapeContent->matchPattern(LASTPAGE, $firstPage->getBody(), 1);	// get total page info from first page
}

$db = new EmailDatabase();	// can be excluded if do not want to write to database
$scrapeContent = new Scraper($db);	// can be excluded as well
$pages = new HttpCurl($scrapeContent);

// Looping from first page to last and parse each and every pages to database
for($i=1; $i <= $lastPage; $i++) { 
	$targetPage = $target . $i;
	$pages->get($targetPage, PARSE_CONTENT);
}

?>

 

Last modified on Tuesday, 29 December 2020 07:18