PHP Web Scraper

How to keep Selenium WebDriver from being detected as a bot or web spider

Before we start using php-webdriver and Selenium for web scraping and social media auto-posting, we need to change some settings in code, or modify some files, so that our script is not detected as a web bot or spider. I have listed some ways to hide Selenium automation below. The methods apply to any programming language. Please note that this is not a complete list, and website operators may find new ways to detect and block Selenium automation over time. All we can do is factor every known method into our scripts to reduce the chance of detection.

1. Remove browser control flag

2. Remove signature in javascript

3. Set User-Agent

4. Avoid using headless browser

5. Use maximum resolution

6. Follow page flow

7. Use proxy or VPN

8. Insert random delay

9. Use cookies to login
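
A few of the items above (1, 3, 5 and 8) can be sketched in plain PHP. This is only an illustration under stated assumptions, not the full method from the article: the function names are hypothetical, the User-Agent string is a placeholder, and the array is the "goog:chromeOptions" capability that ChromeDriver reads. With php-webdriver, values like these are typically passed to the browser through its ChromeOptions class.

```php
<?php
// Hypothetical helper: builds the "goog:chromeOptions" capability as a plain array.
function buildStealthChromeOptions(string $userAgent): array
{
    return [
        'args' => [
            // Item 3: send a regular desktop User-Agent (caller supplies the value).
            '--user-agent=' . $userAgent,
            // Item 5: open the browser window maximized.
            '--start-maximized',
            // Item 4: note that no "--headless" argument is added.
        ],
        // Item 1: drop the switch behind the "controlled by automated test software" banner.
        'excludeSwitches' => ['enable-automation'],
        'useAutomationExtension' => false,
    ];
}

// Item 8: pick a random pause length (in seconds) between two page actions.
function randomDelaySeconds(float $min = 2.0, float $max = 6.0): float
{
    return $min + (mt_rand() / mt_getrandmax()) * ($max - $min);
}

// Sleeps for a random interval; call it between navigation steps.
function pauseLikeAHuman(float $min = 2.0, float $max = 6.0): void
{
    usleep((int) round(randomDelaySeconds($min, $max) * 1000000));
}

// Placeholder User-Agent value, for illustration only.
$options = buildStealthChromeOptions(
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
);
$capabilities = ['goog:chromeOptions' => $options];
```

The remaining items (JavaScript signature, page flow, proxy and cookies) depend on the target site and the driver setup, so they are left out of this sketch.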

How to install php-webdriver + Selenium for screen scraping and auto-posting

Today I looked at the content of php8legs.com and realized that I have not written anything on this website for more than four years. It was a busy four years. As Malaysia is implementing the MCO (Movement Control Order) because of the wide spread of the Covid-19 virus, I have the chance to take some time to discuss the topics that interest me: web scraping and auto-posting.

In these articles, I want to discuss more advanced scraping techniques, such as scraping websites with infinite scroll, and using WebDriver to log in to social media websites automatically and perform auto-posting. All of this can be done with Selenium. There are already many articles on Selenium + WebDriver in Python, Java, Ruby and so on, so I want to cover the topic using PHP. To run Selenium with PHP on Windows 10, assuming you already have XAMPP installed (with PHP 7 or above), these are the software packages required:

1. Java - installation

2. Composer - installation

3. php-webdriver from github.com  - installation

4. Selenium Standalone Server - download

5. Chromedriver - download

If you already have Java and Composer installed, just perform the installation in step 3 and download the packages in steps 4 and 5.

The Art of Invisibility for Ninja Web Spider

Most webmasters welcome webbots from search engines such as Google and Bing. The content of their websites gets indexed by the search engines, and users can easily find their websites. However, they surely do not welcome your web spider extracting data from their sites. Maybe you are up to no good, as when products from etsy.com were used to drive traffic to a Hong Kong website. Your IP will most likely be blocked if webmasters detect an unknown web spider aggressively crawling their websites. Back in 1999, eBay filed legal action against Bidder's Edge, an auction scraping site, for “deep linking” into its listings and bombarding its service. Craigslist has a throttle mechanism to prevent web crawlers from overwhelming the site with requests.

Even big spiders like Google have mechanisms to prevent others from scraping their content. Try searching for a keyword and, on the results page, click page 1, then page 2, page 3, and so on. At page 20 (in my case), Google stops displaying search results and wants to find out whether the one reading the page is a human or a webbot. If you are unable to enter the correct captcha, your IP will eventually be blocked.

Download and Save Images with PHP/cURL Web Scraper Script

In this article, I will discuss how to download and save image files with a PHP/cURL web scraper. I will use the email extractor script created earlier as an example. With some modification, the same script can be used to extract product information and images from Internet shopping websites such as ebay.com or amazon.com into your own database. We can also extract business information from directory websites, both text and images, into your website.

There are a few concerns to consider before we scrape image files from websites.

1) A website can use various file formats (jpeg, png, gif etc). Even a single web page can contain several file formats.

If we want to build a common database for all collected images (from various websites), our PHP web scraper script needs to be able to convert them to the file format we prefer.

2) Each image can have a different file size.

Some images are very large and some very small. Our PHP web scraping script needs to be able to resize a large image to a smaller size. Resizing from large to small is not a problem, but resizing from small to large gives a poor quality image.

3) We need a naming convention for image files.

Different websites name their image files differently. Some use long names, some short. Before storing image files in our folder, we need to rename them according to our naming convention.

4) We need to add one column to the MySQL database to link the images to the related information.

So here we go...
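
As a rough sketch of concerns 2) and 3), the two helpers below compute a target size that only ever shrinks an image, and build a file name from the source site and a record id. Both function names and the naming pattern are assumptions made for illustration, not the convention used in the full article. The actual pixel work (decoding, resampling, saving in the preferred format) would need an image library such as PHP's GD extension.

```php
<?php
// Hypothetical helper: scale (width, height) to fit inside a bounding box.
// The scale factor is capped at 1.0, so small images are never enlarged.
function fitWithin(int $width, int $height, int $maxWidth, int $maxHeight): array
{
    $scale = min($maxWidth / $width, $maxHeight / $height, 1.0);
    return [
        max(1, (int) round($width * $scale)),
        max(1, (int) round($height * $scale)),
    ];
}

// Hypothetical naming convention: <site-slug>_<record id>_<image index>.<ext>
// The record id is what would tie the file to its row in the database (concern 4).
function buildImageName(string $sourceHost, int $recordId, int $index, string $extension = 'jpg'): string
{
    $slug = trim(preg_replace('/[^a-z0-9]+/', '-', strtolower($sourceHost)), '-');
    return sprintf('%s_%06d_%02d.%s', $slug, $recordId, $index, strtolower($extension));
}

// A 4000x3000 photo is reduced to fit an 800x800 box; a 200x100 icon is left alone.
list($newWidth, $newHeight) = fitWithin(4000, 3000, 800, 800);
$fileName = buildImageName('www.Example.com', 42, 3);
```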

Create MySQL Database for PHP Web Spider Extracted Email Addresses (4)

In this final part of the PHP/cURL email extractor series, I will show you how to store the extracted data in a MySQL database. You can store email addresses and contact information collected not just from one website, but from various websites, in the same database.

You might want to store the collected emails according to your purpose. For example, if you have a real estate website and an Internet shopping website, the information collected should be stored in two different categories (tables in the MySQL database).

First, start XAMPP on your PC, both Apache and MySQL. In the browser, go to "http://localhost/phpmyadmin/". Go to the top menu bar and select "Database". To create a new database for our tutorial, enter "email_collection" and press the "Create" button.
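
Once the "email_collection" database exists, one table per category could be created and filled from PHP along these lines. This is a minimal sketch, not the script from the full article: the column layout, the function names and the table names "real_estate" and "shopping" are assumptions, and the connection line in the comment assumes XAMPP's default root account with an empty password.

```php
<?php
// Hypothetical helper: reject anything that is not a plain identifier,
// because table names cannot be bound as prepared-statement parameters.
function assertValidTableName(string $table): void
{
    if (!preg_match('/^[A-Za-z_][A-Za-z0-9_]*$/', $table)) {
        throw new InvalidArgumentException('Invalid table name: ' . $table);
    }
}

// Builds the CREATE TABLE statement for one category of contacts.
function createContactsTableSql(string $table): string
{
    assertValidTableName($table);
    return "CREATE TABLE IF NOT EXISTS `$table` ("
        . "id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY, "
        . "name VARCHAR(190) NOT NULL, "
        . "email VARCHAR(190) NOT NULL UNIQUE, "
        . "phone VARCHAR(50) DEFAULT NULL"
        . ") ENGINE=InnoDB DEFAULT CHARSET=utf8mb4";
}

// Inserts one contact; the UNIQUE email column plus INSERT IGNORE skips duplicates.
function storeContact(PDO $pdo, string $table, string $name, string $email, ?string $phone): void
{
    assertValidTableName($table);
    $stmt = $pdo->prepare("INSERT IGNORE INTO `$table` (name, email, phone) VALUES (?, ?, ?)");
    $stmt->execute([$name, $email, $phone]);
}

// Usage (assumed local XAMPP defaults, not executed here):
//   $pdo = new PDO('mysql:host=localhost;dbname=email_collection;charset=utf8mb4', 'root', '');
//   $pdo->exec(createContactsTableSql('real_estate'));
//   $pdo->exec(createContactsTableSql('shopping'));
//   storeContact($pdo, 'real_estate', 'Jane Doe', 'jane@example.com', '012-3456789');
$sql = createContactsTableSql('real_estate');
```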

You can download the source file for PHP cURL Email Extractor from here.

PHP Web Spider to Crawl Web Pages Pagination and Extract Emails (3)

In this post, I will show you how to modify our email extractor script further, adding crawling capability so that it collects as many targeted email addresses as possible.

The trick is quite simple: we do not crawl the entire website and check each and every web page for email addresses. Doing so would consume a lot of bandwidth and time. We only need to crawl the web pages that contain the targeted email listing. As long as we know the total number of pages to crawl, we loop from the first page to the last page and the job is done!

First, inspect the pagination of the targeted website. In the example we use, it has page 1, 2, 3,... and a "Last" page button. Pressing this button takes us to the last page, which is page 169. Each page has 10 email addresses, so we can get almost 1690 emails from this website. The total number of pages (169 at present) may change in the future. If we want to reuse our email extractor script with minimal manual editing, it needs to detect the total number of pages automatically.
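
The auto-detection idea can be sketched as follows. The sketch assumes that the pagination links carry the page number in a "page=N" query parameter, which may not match the real target site; the function names and URLs are hypothetical. It takes the largest page number found in any link, which would include the "Last" button.

```php
<?php
// Hypothetical helper: find the highest "page=N" value among the links in the HTML.
// Returns 1 when no pagination links are found.
function detectLastPage(string $html): int
{
    $pattern = '/href="[^"]*(?:[?&]|&amp;)page=(\d+)/i';
    if (!preg_match_all($pattern, $html, $matches)) {
        return 1;
    }
    return max(array_map('intval', $matches[1]));
}

// Builds the list of page URLs to loop over, from page 1 to the last page.
function buildPageUrls(string $baseUrl, int $lastPage): array
{
    $urls = [];
    for ($page = 1; $page <= $lastPage; $page++) {
        $urls[] = $baseUrl . '?page=' . $page;
    }
    return $urls;
}

// Sample pagination markup, made up for this example.
$sample = '<a href="/agents?page=2">2</a> <a href="/agents?page=3">3</a> '
        . '<a href="/agents?type=x&amp;page=169">Last</a>';
$lastPage = detectLastPage($sample);
$urls = buildPageUrls('https://example.com/agents', $lastPage);
```

Each URL in the list would then be fetched and passed to the email extraction step, ideally with a pause between requests.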

PHP Email Extractor Script with cURL and Regular Expression (2)

In this post, we make a small modification to the previous PHP script for targeted email extraction.

First, we revisit the source of the targeted web page. We can see repeated blocks of agent contacts, each with a name, email and phone number. There are 10 blocks per page.

The plan is to have the script "cut out" each block and store it in an array, then extract the name, email and phone number from each block.

As you can see, each block starts with the <div class="negotiators-wrapper"> tag and ends with </div></div>. Note that the two </div> tags are separated by a carriage return and line feed in this example.
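
The "cut out" step might look like the sketch below. The start tag and the closing </div></div> pair come from the description above; the markup inside the block in the sample is invented for the example, and both function names are hypothetical. The pattern allows any whitespace, including a carriage return and line feed, between the two closing tags.

```php
<?php
// Hypothetical helper: return the inner HTML of every negotiators-wrapper block.
function cutBlocks(string $html): array
{
    // "s" lets "." match newlines; ".*?" stops at the first </div> ... </div> pair.
    $pattern = '~<div class="negotiators-wrapper">(.*?)</div>\s*</div>~s';
    preg_match_all($pattern, $html, $matches);
    return $matches[1];
}

// Hypothetical helper: turn one block into a list of non-empty text lines.
function blockToLines(string $block): array
{
    // Put every tag on its own line so that stripping tags leaves one value per line.
    $text = strip_tags(str_replace('<', "\n<", $block));
    $lines = array_map('trim', explode("\n", $text));
    return array_values(array_filter($lines, function ($line) {
        return $line !== '';
    }));
}

// Sample page fragment with two blocks; the inner markup is an assumption.
$sample = "<div class=\"negotiators-wrapper\"><div class=\"info\">"
        . "<p>Jane Doe</p><p>jane@example.com</p><p>012-3456789</p>\r\n</div>\r\n</div>"
        . "<div class=\"negotiators-wrapper\"><div class=\"info\">"
        . "<p>John Tan</p><p>john@example.com</p><p>019-8765432</p>\r\n</div>\r\n</div>";

$contacts = array_map('blockToLines', cutBlocks($sample));
```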

Email Extractor Script with PHP cURL and Regular Expression (1)

In this post, I will explain how to use PHP/cURL to extract (harvest) email addresses from websites. The script uses regular expressions to match HTML tags for extraction.
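
In its simplest form, the idea is to fetch a page with cURL and run a regular expression over the HTML. The sketch below is an illustration only: the function names are hypothetical, and the email pattern is a simplified one that will miss some valid addresses and obfuscated ones.

```php
<?php
// Hypothetical helper: download a page with cURL and return its HTML.
function fetchPage(string $url): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_TIMEOUT        => 30,    // give up after 30 seconds
    ]);
    $html = curl_exec($ch);
    if ($html === false) {
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }
    return $html;
}

// Hypothetical helper: collect unique, lower-cased email addresses from HTML.
function extractEmails(string $html): array
{
    preg_match_all('/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/', $html, $matches);
    return array_values(array_unique(array_map('strtolower', $matches[0])));
}

// Usage with a real page (not executed here):
//   $emails = extractEmails(fetchPage('https://example.com/agents'));
$emails = extractEmails('<a href="mailto:Jane@Example.com">Jane@Example.com</a> or bob@test.org');
```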

If we send out an email and address the person as "Dear Sir" or "Dear Madam", the email will most likely be treated as spam. We do not want to extract email addresses only, but also the other information related to them, such as name, telephone, company, job position etc. When we send out email to the list collected, we want to be able to address the contact person in as much detail as possible, with his or her name, job position in the company, contact number etc.

Of course, please do not abuse email extraction to send out unwanted spam mail or product/service advertising, violate copyright law or overload network bandwidth. If you get into trouble, please talk to your lawyer.
