Log in Register

Log in

PHP Web Spider to Crawl Web Pages Pagination and Extract Emails (3)

email web crawler

In this post, I will show you how to further modify our email extractor script to inject crawling capability and collect as many targeted email list as possible.

The trick is quite simple - we are not crawling entire website and check each and every web page for email addresses. Doing this will consume a lot of bandwidth and our time. We just need to crawl web pages with targeted email listing: so as long as we know total pages to crawl, then looping from first page to last page and job is done!

First, inspect the pagination of our targeted website. In the example we use, it has page 1, 2, 3,... and "Last" page button. Pressing this button will take us to the last page, which is page 169. Each page has 10 email addresses, so we can get almost 1690 emails from this website. The total number of pages (169 at present) can be changed in future. If we want to reuse our email extractor script with minimum manual editing, it needs to be able to auto detect total pages.


PHP Email Extractor Script with cURL and Regular Expression (2)

targeted email extractor

In this post, we need to make a small modification to previous PHP script for targeted email extraction.

First, we need to revisit the source file of the targeted web page, we can see that there are repeated blocks of agent contacts, with name, email and phone number. Total of 10 blocks per page. 

The plan is to use the script to "cut out" each block and stores into array, then extract name, email and phone number from each block.

As you can see, each block starts with <div class="negotiators-wrapper"> tag and ends with </div></div>. Note that both </div> tags are separated by carriage return and new line feed in this example. 


Email Extractor Script with PHP cURL and Regular Expression (1)

email spider

In this post, I will explain how to use PHP/cURL to extract / harvest email addresses from websites. The script will involve regular expression to match HTML tag for extraction.

If we send out email and address the person as "Dear Sir" or "Dear Madam", most likely the email will end up as spam. We do not want to just extract email addresses only, but also other information related to the email addresses, such as name, telephone, company, job position etc. When we send out email from the list collected, we want to be able to address the contact person as detail as possible, such as with his/her name, job position in the company, contact number etc.

Of course, please do not abuse the ability of email extraction and send out unwanted spam mails, products/services advertising, violate copyright law or disturbing network bandwidth etc. If you get into trouble, talk to your lawyer please.

Subscribe to this RSS feed