PHP Web Spider to Crawl Web Pages Pagination and Extract Emails (3)


In this post, I will show you how to further modify our email extractor script to add crawling capability and collect as many targeted email addresses as possible.

The trick is quite simple: we do not crawl the entire website and check every web page for email addresses. Doing that would consume a lot of bandwidth and time. We only need to crawl the web pages that contain the targeted email listing. As long as we know the total number of pages, we loop from the first page to the last page and the job is done!

First, inspect the pagination of the targeted website. The example we use has page links 1, 2, 3, ... and a "Last" page button. Pressing this button takes us to the last page, which is page 169. Each page lists 10 email addresses, so we can collect almost 1,690 emails from this website. The total number of pages (169 at present) may change in the future. If we want to reuse our email extractor script with minimal manual editing, it needs to detect the total number of pages automatically.
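The idea of detecting the last page and then looping from page 1 to that page could be sketched roughly as follows. This is only an illustration under assumptions: the "Last" button is assumed to be a link whose href carries a "page" query parameter, and the function names (detectTotalPages, buildPageUrls, fetchPage) and the example.com URL are hypothetical, not taken from the original script.

```php
<?php
// Assumption: the "Last" button looks like <a href="...?page=169">Last</a>.
// Real sites differ, so the pattern must be adapted after inspecting the HTML.
function detectTotalPages(string $html): int
{
    $pattern = '/<a\s[^>]*href="[^"]*[?&](?:amp;)?page=(\d+)[^"]*"[^>]*>\s*Last\s*<\/a>/i';
    if (preg_match($pattern, $html, $m)) {
        return (int) $m[1];
    }
    return 1; // no "Last" link found: treat the listing as a single page
}

// Build the list of page URLs from the first page to the last page.
// Assumption: pages are addressed as <baseUrl>?page=N.
function buildPageUrls(string $baseUrl, int $totalPages): array
{
    $urls = [];
    for ($page = 1; $page <= $totalPages; $page++) {
        $urls[] = $baseUrl . '?page=' . $page;
    }
    return $urls;
}

// Download one page with cURL (defined here, but not called in this sketch).
function fetchPage(string $url): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    $body = curl_exec($ch);
    return $body === false ? '' : $body;
}

// Sample pagination HTML standing in for the first page of the listing.
$sampleHtml = '<a href="list.php?page=2">2</a> <a href="list.php?page=3">3</a> '
            . '<a href="list.php?page=169">Last</a>';

$totalPages = detectTotalPages($sampleHtml);
$urls = buildPageUrls('https://example.com/list.php', $totalPages);
```

In a real run, the first page would be downloaded with fetchPage(), passed to detectTotalPages(), and each URL from buildPageUrls() would then be fetched in turn, ideally with a short pause between requests to limit bandwidth use.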

Read more...

Email Extractor Script with PHP cURL and Regular Expression (1)


In this post, I will explain how to use PHP/cURL to extract (harvest) email addresses from websites. The script uses regular expressions to match the HTML tags that contain the data we want to extract.

If we send out an email and address the person as "Dear Sir" or "Dear Madam", it will most likely be treated as spam. We do not want to extract email addresses alone, but also other information related to them, such as name, telephone number, company, and job position. When we send out email to the collected list, we want to address the contact person in as much detail as possible, for example by name, job position in the company, and contact number.
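As a rough illustration of matching HTML tags with a regular expression to pull out an email address together with a related field, here is a minimal sketch. It assumes each contact appears as a mailto link whose link text is the person's name; that markup, the function name extractContacts, and the sample data are hypothetical, and a real page will need its own pattern.

```php
<?php
// Assumption: contacts are listed as <a href="mailto:ADDRESS">NAME</a>.
function extractContacts(string $html): array
{
    // Group 1: the address (up to a quote or "?"), group 2: the link text.
    $pattern = '/<a\s[^>]*href="mailto:([^"?]+)[^"]*"[^>]*>(.*?)<\/a>/is';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $contacts = [];
    foreach ($matches as $m) {
        $email = strtolower(trim($m[1]));
        // Skip anything that is not a syntactically valid address.
        if (filter_var($email, FILTER_VALIDATE_EMAIL) === false) {
            continue;
        }
        // Key by address so duplicates on the page are stored only once.
        $contacts[$email] = [
            'email' => $email,
            'name'  => trim(strip_tags($m[2])),
        ];
    }
    return array_values($contacts);
}

// Sample listing HTML standing in for a downloaded page.
$sampleListing = '<li><a href="mailto:Jane.Doe@example.com">Jane Doe</a></li>'
               . '<li><a href="mailto:bob@example.org?subject=Hi"><b>Bob Lee</b></a></li>'
               . '<li><a href="mailto:not-an-email">Broken</a></li>';

$contacts = extractContacts($sampleListing);
```

Other fields such as telephone, company, or job position would need additional patterns matched against whatever tags the target page uses for them.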

Of course, please do not abuse email extraction to send unwanted spam or product/service advertising, violate copyright law, or waste network bandwidth. If you get into trouble, please talk to your lawyer.

Read more...