I am looking for a PHP script that scrapes all of EzineArticles and saves each article as a MySQL entry that includes the URL, Title, Category, and Article Text.
Your script should find, scrape, and store every single article on EzineArticles (there will be millions of them). With that in mind, it should use some form of threading to speed things up.
I have thousands of private proxies, so I also need to be able to provide a text file of proxies. Some of the proxies will have usernames and passwords and some won't, so you will need to account for both. I would recommend running several hundred threads with some sort of proxy switcher in place.
A good way to do this (without getting IPs banned) is to have a universal list that keeps track of which proxies are currently being used by a thread and which ones aren't. Then, every couple of articles, each thread pulls a new IP that isn't currently in use.
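The shared proxy list described above could be sketched roughly as follows. This is only an illustration in Python (the posting prefers PHP); the class name `ProxyPool` and its methods `acquire`, `release`, and `swap` are hypothetical names chosen for this sketch, not part of any library.

```python
import threading
from collections import deque


class ProxyPool:
    """Hypothetical shared list of proxies, split into idle and in-use."""

    def __init__(self, proxies):
        self._idle = deque(proxies)      # proxies no thread is using
        self._in_use = set()             # proxies currently held by a thread
        self._lock = threading.Lock()    # guards both collections

    def acquire(self):
        """Hand out an idle proxy, or None if every proxy is in use."""
        with self._lock:
            if not self._idle:
                return None
            proxy = self._idle.popleft()
            self._in_use.add(proxy)
            return proxy

    def release(self, proxy):
        """Return a proxy to the back of the idle queue."""
        with self._lock:
            if proxy in self._in_use:
                self._in_use.remove(proxy)
                self._idle.append(proxy)

    def swap(self, proxy):
        """Trade the held proxy for a different idle one, if any is free."""
        with self._lock:
            if not self._idle:
                return proxy             # nothing else free, keep the current one
            new_proxy = self._idle.popleft()
            self._in_use.discard(proxy)
            self._idle.append(proxy)
            self._in_use.add(new_proxy)
            return new_proxy
```

A worker thread would call `acquire()` once at startup and `swap()` after every couple of articles, so no two threads hold the same IP at the same time.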
If a page fails to load properly (either because EzineArticles rate limited you or because the proxy itself was having issues) you should have it try again using a different proxy. If a page fails 10 consecutive times, have it save in the database that it failed (make everything blank but the URL) and then continue.
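The retry rule above might look something like this. It is a sketch in Python for illustration only; `scrape_article`, `fetch_and_parse`, and `next_proxy` are hypothetical names. `fetch_and_parse(url, proxy)` is assumed to return a `(title, category, article)` tuple or raise an exception on any failure, and `next_proxy()` is assumed to return a proxy different from the previous one.

```python
def scrape_article(url, fetch_and_parse, next_proxy, max_attempts=10):
    """Try a URL with a fresh proxy each time; give up after max_attempts."""
    for _ in range(max_attempts):
        proxy = next_proxy()  # a different proxy for every attempt
        try:
            title, category, article = fetch_and_parse(url, proxy)
        except Exception:
            # Rate limiting and proxy errors are treated the same way here:
            # move on to the next attempt with another proxy.
            continue
        return {"url": url, "title": title, "category": category, "article": article}
    # Every attempt failed: record the URL with all other fields blank.
    return {"url": url, "title": "", "category": "", "article": ""}
```

The returned dictionary has the same four fields as the database row, so a failed page is stored with only its URL filled in and the crawl continues.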
Lastly, it needs to save its progress, so that if the script is closed for some reason it can continue where it left off. This can be controlled by data in a MySQL database as well.
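One possible way to keep progress in the database is sketched below. It is written in Python with the standard-library `sqlite3` module standing in for MySQL purely so the example is self-contained; the `queue` table, its `done` column, and the function names are assumptions made for this sketch. The SQL uses SQLite syntax (`INSERT OR IGNORE`, `INSERT OR REPLACE`), which would need adapting for MySQL.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open the database and create the tables if they do not exist yet."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "url TEXT PRIMARY KEY, title TEXT, category TEXT, article TEXT)"
    )
    # Hypothetical bookkeeping table: every discovered URL and whether it is done.
    db.execute(
        "CREATE TABLE IF NOT EXISTS queue ("
        "url TEXT PRIMARY KEY, done INTEGER NOT NULL DEFAULT 0)"
    )
    return db


def enqueue(db, url):
    """Remember a discovered article URL; duplicates are ignored."""
    db.execute("INSERT OR IGNORE INTO queue (url) VALUES (?)", (url,))
    db.commit()


def save_article(db, row):
    """Store the article and mark its URL as done in one transaction."""
    with db:
        db.execute(
            "INSERT OR REPLACE INTO articles (url, title, category, article) "
            "VALUES (:url, :title, :category, :article)",
            row,
        )
        db.execute("UPDATE queue SET done = 1 WHERE url = ?", (row["url"],))


def pending_urls(db):
    """URLs still to be scraped; this is what a restarted script resumes from."""
    return [r[0] for r in db.execute("SELECT url FROM queue WHERE done = 0 ORDER BY url")]
```

On startup the script would call `pending_urls()` and carry on from there. Because the article row and the `done` flag are written in the same transaction, an interruption should not leave a URL marked done without its article stored.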
MySQL structure:
| URL | Title | Category | Article |
Proxy document structure:
IP:Port:Username:Password
IP:Port
So : separates IP and Port (and Username and Password if it exists). Proxies are separated by newline.
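As an illustration of the file format above, a parser could look roughly like this. It is a sketch in Python (the posting prefers PHP), and the names `Proxy`, `parse_proxy_line`, and `parse_proxy_text` are hypothetical. It assumes usernames never contain a colon; any colon after the third one is kept as part of the password.

```python
from typing import List, NamedTuple, Optional


class Proxy(NamedTuple):
    host: str
    port: int
    username: Optional[str] = None   # None when the proxy needs no login
    password: Optional[str] = None


def parse_proxy_line(line: str) -> Optional[Proxy]:
    """Parse 'IP:Port' or 'IP:Port:Username:Password'; blank lines give None."""
    line = line.strip()
    if not line:
        return None
    parts = line.split(":", 3)       # at most four fields
    if len(parts) == 2:
        return Proxy(parts[0], int(parts[1]))
    if len(parts) == 4:
        return Proxy(parts[0], int(parts[1]), parts[2], parts[3])
    raise ValueError(f"unrecognised proxy line: {line!r}")


def parse_proxy_text(text: str) -> List[Proxy]:
    """Parse a whole proxy file's contents, one proxy per line."""
    parsed = (parse_proxy_line(line) for line in text.splitlines())
    return [proxy for proxy in parsed if proxy is not None]
```

For example, `parse_proxy_text(open("proxies.txt").read())` would load a file in this format, where `proxies.txt` is a placeholder file name.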
When testing, you do not need to run it all the way to completion, since you will only have a couple of proxies to test with. Once it is done, however, I will need to run it myself and make sure it is working (and that it will find every article) before paying.
The actual scraping/parsing should be relatively easy as the articles are always in a very well defined tag.
A good way to find every article is to go through each category page and then through every single listing on it.
I will accept applications from people who don't use PHP. Let me know your language and I will decide. However, PHP is preferred.
When contacting me, please let me know how you will be scraping the site (what framework).
I am a seasoned scraper writer with hundreds of scrapers scraping millions of pages each day.
Please check my reviews to learn more about my work:
http://www.freelancer.com/u/sayno2bugs.html
More in PM.
Cheers,
SayNo2Bugs