Download and Save Images with PHP/cURL Web Scraper Script


In this article, I will discuss how to download and save image files with a PHP/cURL web scraper. I will use the email extractor script created earlier as an example. With some modification, the same script can then be used to extract product information and images from Internet shopping websites such as ebay.com or amazon.com into your own database. We can also extract business information, both text and images, from directory websites into your website.

There are several points to consider before we scrape image files from websites.

1) A website could use various file formats (JPEG, PNG, GIF, etc.). Even a single web page could mix several formats.

If we want to build a common database for all collected images (from various websites), then our PHP web scraper script needs to be able to convert them to the file format we prefer.

2) Each image could have a different size.

Some images can be very large and some very small. Our PHP web scraping script needs to be able to resize a large image to a smaller size. Shrinking a large image is not a problem, but enlarging a small image gives a poor-quality result.

3) We need a naming convention for image files.

Different websites name their image files differently. Some use long names, some short. Before storing image files in our folder, we need to rename them according to our own naming convention.

4) We need to add one column to the MySQL database table to link each image to its related information.
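The four points above can be sketched in a few small helpers. This is only a rough illustration, not the script from the article: the function names, the 800-pixel limit, the choice of JPEG as the common format, and the naming scheme are all assumptions, and saveAsJpeg() requires PHP's GD extension. The file name returned by buildImageName() is the kind of value that could be stored in the extra MySQL column.

```php
<?php
// Hypothetical helpers; names, sizes, and the naming scheme are assumptions.

/**
 * Compute dimensions that fit inside $maxSide while keeping the aspect ratio.
 * Images that are already small enough are returned unchanged (never upscaled).
 */
function scaledDimensions(int $width, int $height, int $maxSide): array
{
    $longest = max($width, $height);
    if ($longest <= $maxSide) {
        return [$width, $height];
    }
    $ratio = $maxSide / $longest;
    return [
        max(1, (int) round($width * $ratio)),
        max(1, (int) round($height * $ratio)),
    ];
}

/**
 * Build a uniform file name such as "ebaycom_000042_<hash>.jpg".
 * The short hash of the source URL keeps several images of one record distinct.
 */
function buildImageName(string $site, int $recordId, string $sourceUrl): string
{
    $slug = preg_replace('/[^a-z0-9]+/', '', strtolower($site));
    return sprintf('%s_%06d_%s.jpg', $slug, $recordId, substr(md5($sourceUrl), 0, 8));
}

/**
 * Convert raw image bytes (JPEG, PNG, GIF, ...) into a downscaled JPEG file.
 * Requires the GD extension. Returns true on success, false otherwise.
 */
function saveAsJpeg(string $bytes, string $path, int $maxSide = 800): bool
{
    $src = imagecreatefromstring($bytes);
    if ($src === false) {
        return false; // not a supported or valid image
    }
    [$w, $h] = scaledDimensions(imagesx($src), imagesy($src), $maxSide);
    $dst = imagecreatetruecolor($w, $h);
    // White background so transparent PNG/GIF areas do not turn black in JPEG.
    imagefill($dst, 0, 0, imagecolorallocate($dst, 255, 255, 255));
    imagecopyresampled($dst, $src, 0, 0, 0, 0, $w, $h, imagesx($src), imagesy($src));
    return imagejpeg($dst, $path, 85);
}
```

In a scraper, the bytes passed to saveAsJpeg() would come from the cURL download of the image URL, and the target path would be the image folder plus the name from buildImageName().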

So here we go...


PHP Web Spider to Crawl Web Pages Pagination and Extract Emails (3)


In this post, I will show you how to further modify our email extractor script to add crawling capability and collect as many targeted email addresses as possible.

The trick is quite simple: we do not crawl the entire website and check every web page for email addresses. Doing that would consume a lot of bandwidth and time. We only need to crawl the web pages that contain the targeted email listing, so as long as we know the total number of pages to crawl, we loop from the first page to the last page and the job is done!

First, inspect the pagination of our targeted website. In the example we use, it has page 1, 2, 3,... and a "Last" page button. Pressing this button takes us to the last page, which is page 169. Each page has 10 email addresses, so we can get up to 1690 emails from this website. The total number of pages (169 at present) may change in the future. If we want to reuse our email extractor script with minimum manual editing, it needs to be able to detect the total number of pages automatically.
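One way to detect the last page automatically is to read the page number from the link behind the "Last" button. The sketch below is an illustration under assumptions rather than the script from this series: it assumes the button is an a element whose text is exactly "Last" and whose href carries the number in a page query parameter, and the function names are hypothetical. It uses PHP's DOM extension.

```php
<?php
// Assumptions: the "Last" button is an <a> with the text "Last", and its
// href carries the page number in a "page" query parameter.

/**
 * Return the number of the last page, or 1 if no "Last" link is found.
 */
function detectLastPage(string $html): int
{
    if (trim($html) === '') {
        return 1;
    }
    $dom = new DOMDocument();
    // Suppress warnings caused by imperfect real-world markup.
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    $links = $xpath->query('//a[normalize-space(.)="Last"]/@href');
    if ($links === false || $links->length === 0) {
        return 1; // no pagination found: treat as a single page
    }
    $query = (string) parse_url($links->item(0)->nodeValue, PHP_URL_QUERY);
    parse_str($query, $params);
    return max(1, (int) ($params['page'] ?? 1));
}

/**
 * Build the list of listing URLs from page 1 to $lastPage.
 */
function buildPageUrls(string $baseUrl, int $lastPage): array
{
    $urls = [];
    for ($page = 1; $page <= $lastPage; $page++) {
        $urls[] = $baseUrl . '?page=' . $page;
    }
    return $urls;
}
```

The crawler would fetch the first listing page with cURL, pass its HTML to detectLastPage(), and then loop over the URLs from buildPageUrls(), running the email extraction on each page.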


Books on Screen Scraping with PHP

There are a few books that are worth reading if you are serious about learning how to write screen scrapers or webbots using PHP/cURL. Of course, you can also find lots of information on the Internet, for example on Stack Overflow and GitHub.

Currently I have a few books on screen scraping, and three of them use PHP/cURL programming. I highly recommend these three books to those who want to learn screen scraping using PHP/cURL.

Webbots, Spiders, and Screen Scrapers, written by Michael Schrenk
