Chin-Hock Tan

I am a full-time internet retailer, selling physical products through my own websites and various internet marketplaces. In my free time I write PHP web bots and screen scraper scripts: for email marketing to increase web traffic, for scraping products from one website to another to minimize manual data entry, for aggregating content for new websites, and so on.

I am available for hire as a freelance PHP coder for web bots, screen scrapers and data mining. I quote a fixed price for your project if the requirements are clearly outlined.

I also help customers build and host Joomla-based business/content/blogging websites, and shopping carts with VirtueMart, PrestaShop, OpenCart, ECShop, ECMall and others. The first year of hosting is free.

I accept payment via PayPal. If you would like to contact me, please write to freeman [a] php8legs.com. Thank you.

 


Website URL: http://php8legs.com

Price Comparison Engine using SphinxSearch

Sphinx Search Server

My last article was written in February this year. I have been extremely busy with my internet retail business over the last eight months, while also putting a lot of effort into releasing a price comparison engine for Malaysia's e-commerce market, BijakMall.com. Bijak Mall collects product information from various internet malls in Malaysia and indexes it into a database. Potential buyers are then able to search and compare prices through Bijak Mall's search engine. I started this project under a XAMPP environment so that I could use my laptop to test the scripts. I wrote web spiders in PHP to collect product data and store it in a MySQL database. Sphinx Search Server (Windows version) is used to index the database and respond to user queries. Sphinx is a free, open source full-text search engine designed to provide full-text search functionality to client applications.
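For the query side, here is a minimal sketch of how such an engine can talk to Sphinx over SphinxQL, which speaks the MySQL protocol. The index name `products` and the default searchd port 9306 are assumptions; the escaping helper mirrors what SphinxAPI's `EscapeString` does for extended query syntax.

```php
<?php
// Escape characters that have special meaning in Sphinx extended query
// syntax, so user input cannot break the MATCH() expression.
function sphinxEscape($query) {
    $from = array('\\', '(', ')', '|', '-', '!', '@', '~', '"', '&', '/', '^', '$', '=');
    $to   = array('\\\\', '\(', '\)', '\|', '\-', '\!', '\@', '\~', '\"', '\&', '\/', '\^', '\$', '\=');
    return str_replace($from, $to, $query);
}

// SphinxQL uses the MySQL wire protocol, so the regular mysqli client works;
// searchd listens on port 9306 by default (index name is an assumption):
// $db   = new mysqli('127.0.0.1', '', '', '', 9306);
// $stmt = $db->prepare("SELECT id, price FROM products WHERE MATCH(?) ORDER BY price ASC LIMIT 20");
// $kw   = sphinxEscape($_GET['q']);
// $stmt->bind_param('s', $kw);
// $stmt->execute();
```

Note that the backslash is escaped first, so the backslashes added for the other characters are not double-escaped.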

Facebook Remote Status Update with PHP/cURL Bot

  • 04 February 2014 |
  • Published in Facebook

Facebook login form

** The script in this post is no longer working properly. I have updated the script and posted it at New and Updated! Facebook Remote Status Update with PHP/cURL Bot **

 

In previous posts, the discussion mainly focused on getting a web page's source file and scraping the resulting text. In this post, I will show you how I use PHP/cURL to log into a Facebook account and post a status update to the Facebook wall. Once we know how to post remotely using PHP/cURL, we can do many things, such as auto-posting comments to a forum or blog, filling in a contact form and sending email to our target, or logging into a website and pulling out the information we need.

For this test case, I am using XAMPP on the PC and logging into the Facebook mobile interface, which is much simpler than the desktop version. You can compare the Facebook login source files for the desktop and mobile versions yourself.

From the browser, go to http://m.facebook.com and note that there is only one login form.
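The general shape of the approach can be sketched as follows. Login forms ship hidden inputs (Facebook uses tokens such as "lsd") that must be posted back along with the credentials, so the first step is to pull those fields out of the form's HTML; the field names `email` and `pass` in the commented POST are assumptions about the mobile form.

```php
<?php
// Collect all hidden <input> fields from a form's HTML so they can be
// re-submitted together with the visible login fields.
function extractHiddenFields($html) {
    $fields = array();
    preg_match_all('/<input[^>]+type="hidden"[^>]+name="([^"]+)"[^>]+value="([^"]*)"/i',
                   $html, $matches, PREG_SET_ORDER);
    foreach ($matches as $input) {
        $fields[$input[1]] = $input[2];
    }
    return $fields;
}

// Sketch of the login POST; the cookie jar keeps the session between requests:
// $post = extractHiddenFields($loginPageHtml);
// $post['email'] = 'you@example.com';   // assumption: mobile form field names
// $post['pass']  = 'secret';
// $ch = curl_init('https://m.facebook.com/login.php');
// curl_setopt($ch, CURLOPT_POST, true);
// curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($post));
// curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookies.txt');
// curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookies.txt');
// curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
// $response = curl_exec($ch);
```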

The Art of Invisibility for Ninja Web Spider

Art of Invisibility

Most webmasters welcome web bots from search engines such as Google and Bing: the contents of their websites get indexed, and users can easily find their sites. However, they surely do not welcome your web spider extracting data from their sites; maybe you're up to no good, such as when products from etsy.com were used to drive traffic to a Hong Kong website. Most likely your IP will be blocked if webmasters detect an unknown web spider aggressively crawling their websites. Way back in 2001, eBay filed legal action against Bidder's Edge, an auction scraping site, for "deep linking" into its listings and bombarding its service; Craigslist has a throttle mechanism to prevent web crawlers from overwhelming the site with requests.

Even a big spider like Google has mechanisms to prevent others from scraping its content. Try searching for a keyword and, on the results page, click page 1, then page 2, page 3, and so on. At page 20 (in my case), Google stops displaying search results and wants to find out whether a human or a web bot is reading the page. If you are unable to enter the correct captcha, your IP will eventually be blocked.
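Two of the simplest ways to stay invisible are to identify your spider as an ordinary browser and to pause randomly between requests, so your traffic pattern does not look machine-generated. A minimal sketch, where the user-agent strings and the delay range are just examples:

```php
<?php
// Pick a realistic browser User-Agent at random for each request.
function randomUserAgent() {
    $agents = array(
        'Mozilla/5.0 (Windows NT 6.1; rv:25.0) Gecko/20100101 Firefox/25.0',
        'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9) AppleWebKit/537.71 (KHTML, like Gecko) Version/7.0 Safari/537.71',
    );
    return $agents[mt_rand(0, count($agents) - 1)];
}

// Pick a random number of seconds to wait between page fetches.
function politeDelay($minSeconds, $maxSeconds) {
    return mt_rand($minSeconds, $maxSeconds);
}

// Usage inside the crawl loop:
// curl_setopt($ch, CURLOPT_USERAGENT, randomUserAgent());
// $html = curl_exec($ch);
// sleep(politeDelay(5, 15));   // pause 5-15 seconds between pages
```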

USD1.60 To Put Facebook Under Your Feet!

facebook slippers 1

I found these counterfeit products in Danok, Thailand yesterday - Facebook, Yahoo and Line. The asking price is USD1.60 per pair, and you can walk away with Facebook or Yahoo. I was looking for Google and Apple, but they had no stock at the moment.

Must have Component - Xmap - Dynamic Sitemap Generator for Joomla!

  • 17 December 2013 |
  • Published in Joomla

Joomla Xmap Logo

Search engines like Google and Bing want us to submit a sitemap file in their webmaster tools so that they can use it to crawl and analyze our website. If you still have a pure HTML website from the 90s or early 00s, most likely you have to run sitemap generator software on your PC, crawling and collecting URLs from your web pages and saving them into a sitemap file. The major problem is that you need to generate a new sitemap file every time you update your website, then upload it to the webmaster tools.

With Joomla, you can use Xmap, a free component created by Guillermo Vargas and available since Joomla 1.x, to regenerate a dynamic sitemap automatically, so that you can totally forget about the sitemap file after the initial setup. Xmap creates the sitemap based on the structure of the menus in your Joomla website. You can add or remove menus at any time and Xmap will dynamically regenerate the sitemap accordingly, with additional metadata. You can also create any number of sitemaps with different options. However, Xmap comes with poor documentation and poor support, as you can see from the JED comments. If you encounter any problems, you need to search for the solution in forums, as happened when Xmap stopped working after the recent Joomla 3.2 upgrade.
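To appreciate what Xmap automates, here is roughly what a hand-rolled generator has to produce for a list of URLs, following the sitemaps.org protocol (the URLs in the comment are placeholders):

```php
<?php
// Build a minimal sitemap.xml document from an array of page URLs.
function buildSitemap(array $urls) {
    $xml  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
    $xml .= '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
    foreach ($urls as $url) {
        // htmlspecialchars() escapes & and other XML-special characters.
        $xml .= "  <url><loc>" . htmlspecialchars($url) . "</loc></url>\n";
    }
    $xml .= '</urlset>';
    return $xml;
}

// file_put_contents('sitemap.xml', buildSitemap(array(
//     'http://example.com/',
//     'http://example.com/blog?id=1&page=2',
// )));
```

With a static site you would rerun this after every content change; Xmap's point is that it rebuilds the equivalent output from your Joomla menus on the fly.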

Download and Save Images with PHP/cURL Web Scraper Script

extract image5

In this article, I will discuss how to download and save image files with a PHP/cURL web scraper, using the email extractor script created earlier as an example. With some modification, the same script can be used to extract product information and images from internet shopping websites such as ebay.com or amazon.com into your desired database. We can also extract business information, both text and images, from directory websites into your own website.

There are a few concerns and considerations before we scrape image files from websites.

1) Various file formats (JPEG, PNG, GIF etc.) may be used on a website. Even a single web page can mix several formats.

If we want to build a common database for all collected images (from various websites), then our PHP web scraper script needs to be able to convert them to the file format we prefer.

2) Images come in different file sizes.

Some images can be very large and some very small. Our PHP web scraping script needs to be able to resize a large file to a smaller size. Resizing a large image down is not a problem; enlarging a small one gives us a poor quality image.

3) We need a naming convention for image files.

Different websites name image files differently; some names are long, some short. Before storing image files in our folder, we need to rename them according to our naming convention.

4) We need to add a column in the MySQL database to link the images to the related information.

So here we go...
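As a preview, the points above can be sketched like this. The naming convention here (md5 of the source URL plus a normalized extension) is just one possible choice, and the commented download/resize steps show where cURL and GD would fit in.

```php
<?php
// Derive a local filename from an image URL: md5 hash plus a normalized
// extension, falling back to jpg when the URL has no recognizable one.
function imageFilename($imageUrl) {
    $ext = strtolower(pathinfo(parse_url($imageUrl, PHP_URL_PATH), PATHINFO_EXTENSION));
    if (!in_array($ext, array('jpg', 'jpeg', 'png', 'gif'))) {
        $ext = 'jpg';                       // fall back to our preferred format
    }
    if ($ext === 'jpeg') { $ext = 'jpg'; }  // normalize jpeg -> jpg
    return md5($imageUrl) . '.' . $ext;
}

// Sketch of the download step:
// function saveImage($imageUrl, $folder) {
//     $ch = curl_init($imageUrl);
//     curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
//     $data = curl_exec($ch);
//     curl_close($ch);
//     file_put_contents($folder . '/' . imageFilename($imageUrl), $data);
//     // For converting/resizing (points 1 and 2), load $data with
//     // imagecreatefromstring() and re-save with imagejpeg().
// }
```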

JCE Text Editor - Must Have Component in Joomla

  • 23 November 2013 |
  • Published in Joomla

jce logo

The standard text editor in Joomla, from 1.5 through 3.x, is TinyMCE. It has all the basic functions you need to write content (text and images) and publish your blog. In fact, the text editor is one of the most important components in Joomla. The TinyMCE editor is like a simplified word processor with limited features.

However, I am very used to a third-party text editor, JCE - Joomla Content Editor, which is a much better editor than TinyMCE. I use JCE on all my websites. Furthermore, JCE is FREE, so you should give it a try.

Create MySQL Database for PHP Web Spider Extracted Email Addresses (4)

store extracted email to database

In this final part of the PHP/cURL email extractor series, I will show you how to store the extracted data in a MySQL database. You can store email addresses and contact information collected not just from one website, but from various websites, in the same database.

You might want to store the collected emails according to your purpose. For example, if you have a real estate website and an internet shopping website, then the information collected should be stored in two different categories (tables in the MySQL database).

First, you need to start XAMPP on your PC, both Apache and MySQL. In the browser URL bar, go to "http://localhost/phpmyadmin/". Go to the top menu bar and select "Databases". To create a new database for our tutorial, enter "email_collection" and press the "Create" button, as shown in the picture below.
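Once the database exists, the storage step can be sketched as below. The table and column names are my own choices for illustration; validating with `filter_var()` keeps junk regex matches out of the database, and the UNIQUE index on the email column rejects duplicates across crawls.

```php
<?php
// Reject strings that are not syntactically valid email addresses before
// they ever reach the database.
function isValidEmail($email) {
    return filter_var($email, FILTER_VALIDATE_EMAIL) !== false;
}

// Sketch of the table and the insert loop (names are assumptions):
// $db = new mysqli('localhost', 'root', '', 'email_collection');
// $db->query("CREATE TABLE IF NOT EXISTS real_estate (
//     id INT AUTO_INCREMENT PRIMARY KEY,
//     name VARCHAR(100),
//     email VARCHAR(100) UNIQUE,      -- UNIQUE silently rejects duplicates
//     phone VARCHAR(30),
//     source_url VARCHAR(255)
// )");
// $stmt = $db->prepare("INSERT IGNORE INTO real_estate (name, email, phone, source_url)
//                       VALUES (?, ?, ?, ?)");
// foreach ($contacts as $c) {
//     if (!isValidEmail($c['email'])) { continue; }
//     $stmt->bind_param('ssss', $c['name'], $c['email'], $c['phone'], $c['url']);
//     $stmt->execute();
// }
```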

You can download the source file for PHP cURL Email Extractor from here.

PHP Web Spider to Crawl Web Pages Pagination and Extract Emails (3)

email web crawler

In this post, I will show you how to further modify our email extractor script to add crawling capability and collect as large a targeted email list as possible.

The trick is quite simple - we are not crawling the entire website and checking each and every web page for email addresses; doing so would consume a lot of bandwidth and time. We just need to crawl the web pages with the targeted email listing: as long as we know the total number of pages to crawl, we loop from the first page to the last page and the job is done!

First, inspect the pagination of our targeted website. In the example we use, it has pages 1, 2, 3,... and a "Last" page button. Pressing this button takes us to the last page, which is page 169. Each page has 10 email addresses, so we can get almost 1690 emails from this website. The total number of pages (169 at present) can change in the future, so if we want to reuse our email extractor script with minimal manual editing, it needs to be able to auto-detect the total number of pages.
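One way to auto-detect the total is to read the page number out of the "Last" link's URL. A sketch, assuming the pagination links carry a `page=N` query parameter (adjust the regular expression to the target site's actual URLs):

```php
<?php
// Extract the page number from the "Last" pagination link, falling back
// to 1 when no pagination is found.
function detectLastPage($html) {
    if (preg_match('/href="[^"]*page=(\d+)[^"]*"[^>]*>\s*Last/i', $html, $m)) {
        return (int)$m[1];
    }
    return 1;
}

// The crawl loop then becomes:
// $last = detectLastPage($firstPageHtml);
// for ($page = 1; $page <= $last; $page++) {
//     $html = fetchPage("http://example.com/agents?page=" . $page); // cURL fetch
//     // ... extract emails from $html ...
// }
```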

PHP Email Extractor Script with cURL and Regular Expression (2)

targeted email extractor

In this post, we will make a small modification to the previous PHP script for targeted email extraction.

First, revisiting the source file of the targeted web page, we can see that there are repeated blocks of agent contacts, each with a name, email and phone number - a total of 10 blocks per page.

The plan is to use the script to "cut out" each block and store it into an array, then extract the name, email and phone number from each block.

As you can see, each block starts with a <div class="negotiators-wrapper"> tag and ends with </div></div>. Note that the two </div> tags are separated by a carriage return and line feed in this example.
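The cutting step can be sketched with `preg_match_all` using those delimiters; the `\s*` absorbs the CRLF between the two closing tags. The email pattern is a generic one for illustration, and the same idea applies to the name and phone fields once their surrounding markup is known.

```php
<?php
// Cut the page into the repeated contact blocks, returning the inner HTML
// of each <div class="negotiators-wrapper"> ... </div></div> section.
function extractBlocks($html) {
    preg_match_all('/<div class="negotiators-wrapper">(.*?)<\/div>\s*<\/div>/s',
                   $html, $m);
    return $m[1];
}

// Pull the first email address out of one block (generic pattern).
function extractEmail($block) {
    if (preg_match('/[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/i', $block, $m)) {
        return $m[0];
    }
    return '';
}

// foreach (extractBlocks($html) as $block) {
//     echo extractEmail($block), "\n";   // likewise for name and phone
// }
```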
