Facebook Remote Status Update with PHP/cURL Bot

Facebook login form

** The script in this post is no longer working properly. I have updated the script and posted it at New and Updated! Facebook Remote Status Update with PHP/cURL Bot **

 

In previous posts, the discussion mainly focused on getting a web page's source and scraping the resulting text. In this post, I will show you how I use PHP/cURL to log in to a Facebook account and post a status update to the Facebook wall. Once we know how to post remotely using PHP/cURL, we can do many things, such as auto-posting comments to forums or blogs, filling in contact forms to send email to our targets, or logging in to a website and pulling out the information we need.

For this test case, I am using XAMPP on the PC and logging in to the Facebook mobile interface, which is much simpler than the desktop version. You can compare the Facebook login source files for the desktop and mobile versions to see the difference.

From the browser, go to http://m.facebook.com and you will see that there is only one login form.
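
Before walking through the full bot, here is a minimal PHP/cURL sketch of the login step only. The login URL, the "email" and "pass" field names, and the hidden-field scraping are assumptions based on how the mobile form looked at the time; as noted above, the form changes over time, so treat this as an outline rather than a working script.

```php
<?php
// Minimal sketch of the m.facebook.com login step (illustrative only).
$cookieFile = __DIR__ . '/fb_cookies.txt';

// Step 1: fetch the mobile login page and keep any cookies Facebook sets.
$ch = curl_init('https://m.facebook.com/');
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_COOKIEJAR      => $cookieFile,
    CURLOPT_COOKIEFILE     => $cookieFile,
    CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; demo-bot)',
]);
$loginPage = curl_exec($ch);

// Step 2: collect every hidden input from the login form and re-submit it
// along with our credentials (hidden field names vary over time).
preg_match_all('/<input type="hidden" name="([^"]+)" value="([^"]*)"/', $loginPage, $m);
$postFields = array_combine($m[1], $m[2]);
$postFields['email'] = 'you@example.com';   // assumption: field name "email"
$postFields['pass']  = 'your-password';     // assumption: field name "pass"

// Step 3: submit the form; the session cookie lands in $cookieFile and is
// reused automatically by later requests made with the same handle.
curl_setopt($ch, CURLOPT_URL, 'https://m.facebook.com/login.php');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($postFields));
$response = curl_exec($ch);
curl_close($ch);
```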

Read more...

The Art of Invisibility for Ninja Web Spider

Art of Invisibility

Most webmasters welcome webbots from search engines such as Google and Bing: the contents of their websites get indexed, and users can easily find their websites. However, they are certainly not as welcoming to your web spider extracting data from their sites; maybe you're up to no good, such as when products from etsy.com were used to drive traffic to a Hong Kong website. Most likely your IP will be blocked if webmasters detect an unknown web spider aggressively crawling their website. Back in 2001, eBay filed legal action against Bidder's Edge, an auction scraping site, for "deep linking" into its listings and bombarding its service, and Craigslist has a throttle mechanism to prevent web crawlers from overwhelming the site with requests.

Even big spiders like Google have mechanisms to prevent others from scraping their content. Try searching for a keyword and, on the results page, clicking page 1, then page 2, page 3, and so on. At around page 20 (in my case), Google stops displaying search results and wants to find out whether a human or a webbot is reading the page. If you cannot enter the correct captcha, your IP will eventually be blocked.
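
The simplest way to stay under the radar is to slow down and not send every request with the same fingerprint. Here is a small PHP/cURL sketch of that idea; the URL list, delay range, and User-Agent strings are arbitrary examples, not recommendations from the original post.

```php
<?php
// Polite-spider sketch: random delays between requests and a rotating
// User-Agent header, so the request pattern looks less like a bot.
$urls = ['http://www.example.com/page1', 'http://www.example.com/page2'];

$userAgents = [
    'Mozilla/5.0 (Windows NT 6.1; rv:26.0) Gecko/20100101 Firefox/26.0',
    'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0 Safari/537.36',
];

foreach ($urls as $i => $url) {
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_USERAGENT      => $userAgents[$i % count($userAgents)],
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 30,
    ]);
    $html = curl_exec($ch);
    curl_close($ch);

    // ... scrape $html here ...

    // Pause 5-15 seconds before the next request.
    sleep(rand(5, 15));
}
```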

Read more...

Must-Have Component - Xmap - Dynamic Sitemap Generator for Joomla!

Joomla Xmap Logo

Search engines like Google and Bing want us to submit a sitemap file in their webmaster tools so that they can use it to crawl and analyze our website. If you still ran a pure HTML-based website during the 90's and early 00's, most likely you had to run sitemap generator software from your PC, crawl and collect URLs from your web pages, and save them into a sitemap file. The major problem is that you need to generate a new sitemap file every now and then, whenever you update your website, and then upload it to webmaster tools.

With Joomla, you can use Xmap, a free component created by Guillermo Vargas and available since Joomla 1.x, to regenerate the sitemap dynamically and automatically, so you can totally forget about the sitemap file after the initial setup. Xmap builds the sitemap from the structure of the menus in your Joomla website. You can add or remove menus anytime and Xmap will regenerate the sitemap accordingly, with additional metadata. You can also create any number of sitemaps with different options. However, Xmap comes with poor documentation and poor support, as you can see from the JED comments. If you encounter any problems, you need to search for the solution in the forums, as happened when Xmap stopped working after the recent Joomla 3.2 upgrade.
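
For reference, the file that a generator like Xmap serves follows the standard sitemaps.org protocol; a minimal example looks like this (the URLs, dates, and priorities below are placeholders, not Xmap's actual output):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2014-01-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>http://www.example.com/blog/some-article</loc>
    <changefreq>monthly</changefreq>
    <priority>0.5</priority>
  </url>
</urlset>
```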

Read more...

Download and Save Images with PHP/cURL Web Scraper Script

In this article, I will discuss how to download and save image files with a PHP/cURL web scraper, using the email extractor script created earlier as the example. With some modification, the same script can be used to extract product information and images from internet shopping websites such as ebay.com or amazon.com into your desired database. We can also extract business information from directory websites, both text and images, into your own website as well.

There are some concerns and considerations before we scrape image files from websites.

1) Various file formats (jpeg, png, gif, etc.) are used across websites. Even a single web page can mix several formats.

If we want to build a common database for all collected images (from various websites), then our PHP web scraper script needs to be able to convert them to the file format we prefer.

2) Each image can have a different file size.

Some images are very large and some very small. Our PHP web scraping script needs to be able to resize large files down to a smaller size. Shrinking a large image is not a problem, but enlarging a small one gives us a poor quality image.

3) We need a naming convention for the image files.

Different websites name their image files differently: some names are long, some short. Before storing the image files in our folder, we need to rename them according to our own naming convention.

4) We need to add one column to the MySQL database to link each image to its related information.

So here we go...
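
As a preview of how points 1) to 3) fit together, here is a download-convert-rename sketch using PHP/cURL and GD. The folder name, the 400px width limit, and the img_0001.jpg naming scheme are my own choices for illustration, not fixed parts of the script from the article.

```php
<?php
// Sketch: download one image, normalise it to JPEG, cap its width, and save
// it under our own sequential name.
function save_image($url, $index, $dir = 'images', $maxWidth = 400) {
    // 1) Download the raw image bytes with cURL.
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 30,
    ]);
    $data = curl_exec($ch);
    curl_close($ch);
    if ($data === false) {
        return false;
    }

    // 2) Let GD detect the source format (jpeg, png, gif, ...) and decode it.
    $src = imagecreatefromstring($data);
    if ($src === false) {
        return false;
    }

    // 3) Resize only if the image is wider than our limit (never enlarge).
    $w = imagesx($src);
    $h = imagesy($src);
    if ($w > $maxWidth) {
        $newH = (int) round($h * $maxWidth / $w);
        $dst  = imagecreatetruecolor($maxWidth, $newH);
        imagecopyresampled($dst, $src, 0, 0, 0, 0, $maxWidth, $newH, $w, $h);
        imagedestroy($src);
        $src = $dst;
    }

    // 4) Save as JPEG under our own naming convention, e.g. img_0001.jpg.
    if (!is_dir($dir)) {
        mkdir($dir, 0755, true);
    }
    $filename = sprintf('%s/img_%04d.jpg', $dir, $index);
    imagejpeg($src, $filename, 85);
    imagedestroy($src);

    return $filename;   // store this path in the MySQL column (point 4)
}

// Example call with a placeholder URL:
echo save_image('http://www.example.com/photo.png', 1);
```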

Read more...

JCE Text Editor - Must-Have Component in Joomla

jce logo

The standard text editor in Joomla from 1.5 through 3.x is TinyMCE. It has all the basic functions you need to write content (text and images) and publish your blog. In fact, the text editor is one of the most important components in Joomla. However, TinyMCE is like a simplified word processor, with limited features.

I am very used to a third-party text editor, JCE - Joomla Content Editor, which is a much better text editor than TinyMCE, and I use it on all my websites. Furthermore, JCE is FREE, so you should give it a try.

Read more...

Create MySQL Database for PHP Web Spider Extracted Email Addresses (4)

store extracted email to database

In this final part of the PHP/cURL email extractor series, I will show you how to store the extracted data in a MySQL database. You can store email addresses and contact information collected not just from one website, but from various websites, in the same database.

You might want to store the collected emails according to your purpose. For example, if you have a real estate website and an internet shopping website, then the information collected should be stored in two different categories (tables in the MySQL database).

First, start XAMPP on your PC, with both Apache and MySQL running. In the browser, go to "http://localhost/phpmyadmin/". Go to the top menu bar and select "Database". To create a new database for our tutorial, enter "email_collection" and press the "Create" button, as shown in the picture below.
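
The same database, plus a table to hold the extracted addresses, can also be created from a script instead of clicking through phpMyAdmin. A minimal sketch with mysqli follows; the table and column names (emails, address, source_site, image_file) and the default XAMPP credentials are my own assumptions, not something fixed by the tutorial.

```php
<?php
// Sketch: create the email_collection database and a table for extracted
// emails via mysqli, then insert one address with a prepared statement.
$db = new mysqli('localhost', 'root', '');   // default XAMPP credentials

$db->query('CREATE DATABASE IF NOT EXISTS email_collection');
$db->select_db('email_collection');

$db->query('
    CREATE TABLE IF NOT EXISTS emails (
        id           INT AUTO_INCREMENT PRIMARY KEY,
        address      VARCHAR(255) NOT NULL UNIQUE,
        source_site  VARCHAR(255),
        image_file   VARCHAR(255),  -- links a saved image file to this row
        collected_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
');

// Insert one extracted address; duplicates are skipped by the UNIQUE key.
$stmt = $db->prepare('INSERT IGNORE INTO emails (address, source_site) VALUES (?, ?)');
$email = 'someone@example.com';
$site  = 'www.example.com';
$stmt->bind_param('ss', $email, $site);
$stmt->execute();
$stmt->close();
$db->close();
```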

You can download the source file for PHP cURL Email Extractor from here.

Read more...

PHP Web Spider to Crawl Web Pages Pagination and Extract Emails (3)

email web crawler

In this post, I will show you how to further modify our email extractor script to add crawling capability and collect as large a targeted email list as possible.

The trick is quite simple: we do not crawl the entire website and check each and every web page for email addresses, since doing so would consume a lot of bandwidth and time. We only need to crawl the web pages with the targeted email listing, so as long as we know the total number of pages to crawl, we loop from the first page to the last page and the job is done!

First, inspect the pagination of our targeted website. In the example we use, it has page 1, 2, 3,... and a "Last" page button. Pressing this button takes us to the last page, which is page 169. Each page has 10 email addresses, so we can get almost 1690 emails from this website. The total number of pages (169 at present) can change in the future, so if we want to reuse our email extractor script with minimum manual editing, it needs to be able to auto-detect the total number of pages.
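
The loop itself can look something like the sketch below. The listing URL, its "page" query parameter, and the regex used to find the "Last" link are assumptions about the target site's markup, so adjust them to whatever the real pagination looks like.

```php
<?php
// Sketch of the pagination loop: detect the last page number from the
// "Last" button, then crawl page 1..N and collect every email address.
function fetch($url) {
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; demo-bot)',
    ]);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}

$base = 'http://www.example.com/directory?page=';

// Auto-detect the total number of pages from the "Last" link on page 1.
$firstPage  = fetch($base . '1');
$totalPages = 1;
if (preg_match('/page=(\d+)"[^>]*>\s*Last/i', $firstPage, $m)) {
    $totalPages = (int) $m[1];
}

$emails = [];
for ($page = 1; $page <= $totalPages; $page++) {
    $html = ($page === 1) ? $firstPage : fetch($base . $page);

    // Collect every email address on this page.
    preg_match_all('/[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/i', $html, $found);
    $emails = array_merge($emails, $found[0]);

    sleep(2);   // small pause between pages, as discussed in the invisibility post
}

$emails = array_unique($emails);
echo count($emails) . " emails collected\n";
```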

Read more...