The software to be developed scans rating sites such as [login to view URL] and writes user comments into a database. It looks only for certain products on the rating sites, so it does not need to scan a whole site. However, it should be able to refresh its data on a daily basis.
The software should handle the following sites:
[login to view URL]
[login to view URL]
[login to view URL]
[login to view URL]
[login to view URL]
[login to view URL]
[login to view URL]
## Deliverables
Detailed Spec:
The program should parse the category pages of the different rating sites (e.g. MP3 Players - [login to view URL] at [login to view URL]). It must be able to parse a complete category, which may be spread over several pages (e.g. 504 pages for the MP3 Players section at [login to view URL]).
The program can be called from the command line on Linux and Windows machines with the following arguments:
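The multi-page traversal could be sketched roughly as follows. This is only an illustration: `iter_category_products`, the `PageParser` type and the `FAKE_SITE` data are hypothetical names, and the real per-site parser (which would download and parse HTML) is replaced by an in-memory stand-in.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# Hypothetical contract: a page parser takes a page URL and returns
# (product links found on that page, URL of the next page or None).
PageParser = Callable[[str], Tuple[List[str], Optional[str]]]


def iter_category_products(start_url: str, parse_page: PageParser,
                           max_pages: int = 10000) -> Iterator[str]:
    """Yield every product link in a category, following next-page links."""
    url: Optional[str] = start_url
    seen = set()  # guards against pagination loops
    pages = 0
    while url is not None and url not in seen and pages < max_pages:
        seen.add(url)
        pages += 1
        links, url = parse_page(url)
        yield from links


# In-memory stand-in for a three-page category (assumed data, not a real site).
FAKE_SITE = {
    "cat?page=1": (["p1", "p2"], "cat?page=2"),
    "cat?page=2": (["p3"], "cat?page=3"),
    "cat?page=3": (["p4"], None),
}
products = list(iter_category_products("cat?page=1", FAKE_SITE.__getitem__))
```

Keeping the traversal separate from the per-site parser means each rating site only needs its own `parse_page` function, while the loop and its safeguards are shared.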
-date: user comments to be extracted must be newer than this date
-categorylink: is the link to the list of products of a certain category
(-ratingtype: is the type of rating site to use the right piece of code for extraction)
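One possible way to accept these arguments is Python's `argparse`, which allows single-dash option names such as `-date`. The sketch below makes two assumptions the spec does not state: the date is given as `YYYY-MM-DD`, and `-ratingtype` is optional (it is listed in parentheses). The example URL is a placeholder.

```python
import argparse
from datetime import datetime


def parse_date(value: str) -> datetime:
    # Assumed input format YYYY-MM-DD; the spec does not fix one.
    try:
        return datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        raise argparse.ArgumentTypeError(f"expected YYYY-MM-DD, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Extract user comments from a rating site category.")
    parser.add_argument("-date", type=parse_date, required=True,
                        help="only extract comments newer than this date")
    parser.add_argument("-categorylink", required=True,
                        help="link to the product list of a category")
    parser.add_argument("-ratingtype", default=None,
                        help="rating site type, selects the extraction code")
    return parser


# Example invocation with placeholder values.
args = build_parser().parse_args(
    ["-date", "2008-01-31",
     "-categorylink", "http://example.com/mp3-players"])
```

Because `argparse` is part of the standard library, the same command line works unchanged on Linux and Windows.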
The following contents are to be extracted for each user comment
date of entry DATE_TIME
product title VARCHAR(255)
content MEDIUMTEXT
content title VARCHAR(255)
user name VARCHAR(255)
rating site (identifier for rating site) VARCHAR(255)
containslink (whether comment contains links) TINYINT(1)
position (number of comment in the list) INT
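The fields above could map to a table along the following lines. This is a sketch under several assumptions: the table name `user_comment` and all column names are hypothetical, SQLite (via Python's `sqlite3`) stands in for the target database, the date column is written as `DATETIME` (the MySQL spelling of the type), and the link check is a simple substring heuristic.

```python
import sqlite3

# Hypothetical table and column names; types follow the field list above.
COMMENT_DDL = """
CREATE TABLE IF NOT EXISTS user_comment (
    entry_date       DATETIME,
    product_title    VARCHAR(255),
    content          MEDIUMTEXT,
    content_title    VARCHAR(255),
    user_name        VARCHAR(255),
    rating_site      VARCHAR(255),
    containslink     TINYINT(1),
    comment_position INT
)
"""


def contains_link(text: str) -> int:
    """Return 1 if the comment text appears to contain a link, else 0."""
    lowered = text.lower()
    return int(any(m in lowered for m in ("http://", "https://", "www.")))


conn = sqlite3.connect(":memory:")  # in-memory stand-in for the real DB
conn.execute(COMMENT_DDL)

body = "Great player, see http://example.com/review for details."
comment = {
    "entry_date": "2008-01-31 12:00:00",
    "product_title": "Example MP3 Player",
    "content": body,
    "content_title": "Great player",
    "user_name": "some_user",
    "rating_site": "siteA",
    "containslink": contains_link(body),
    "comment_position": 1,
}
conn.execute(
    "INSERT INTO user_comment VALUES (:entry_date, :product_title, :content, "
    ":content_title, :user_name, :rating_site, :containslink, "
    ":comment_position)",
    comment,
)
```

Using bound parameters rather than string formatting keeps quotes and other special characters in user comments from breaking the statement.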
For performance reasons, a second table should be created that stores each categorylink together with the date of its most recent user comment, so that a re-extraction can continue from there:
categorylink VARCHAR(255)
lastupdate DATE_TIME
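A minimal sketch of how this checkpoint table might be read and updated is shown below. The table name `category_checkpoint`, the helper names and the primary key on `categorylink` are assumptions; SQLite again stands in for the real database, and `INSERT OR REPLACE` is SQLite syntax (MySQL offers `REPLACE INTO` or `INSERT ... ON DUPLICATE KEY UPDATE` instead).

```python
import sqlite3
from typing import Optional


def init_checkpoint(conn: sqlite3.Connection) -> None:
    # One row per category; PRIMARY KEY is an assumption, not in the spec.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS category_checkpoint ("
        "categorylink VARCHAR(255) PRIMARY KEY, lastupdate DATETIME)")


def get_last_update(conn: sqlite3.Connection,
                    categorylink: str) -> Optional[str]:
    """Return the stored timestamp for a category, or None on first run."""
    row = conn.execute(
        "SELECT lastupdate FROM category_checkpoint WHERE categorylink = ?",
        (categorylink,)).fetchone()
    return row[0] if row else None


def set_last_update(conn: sqlite3.Connection, categorylink: str,
                    lastupdate: str) -> None:
    """Insert or overwrite the checkpoint for a category."""
    conn.execute(
        "INSERT OR REPLACE INTO category_checkpoint "
        "(categorylink, lastupdate) VALUES (?, ?)",
        (categorylink, lastupdate))


conn = sqlite3.connect(":memory:")
init_checkpoint(conn)
link = "http://example.com/mp3-players"  # placeholder category link
first_run = get_last_update(conn, link)
set_last_update(conn, link, "2008-01-31 12:00:00")
set_last_update(conn, link, "2008-02-01 09:30:00")
resume_from = get_last_update(conn, link)
```

On a re-run, the extractor would pass the later of `resume_from` and the `-date` argument as its cutoff; timestamps stored as `YYYY-MM-DD HH:MM:SS` strings compare correctly as plain text.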
Error handling:
Error handling must be robust. All errors should be logged, and the application must be able to continue and finish its run rather than crash when an error occurs. Connection errors should be reported. DB statements should be logged for later verification.
Info messages must be logged when content has been successfully extracted.
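The log-and-continue behaviour described here might look roughly like the sketch below, using Python's standard `logging` module. The function `process_pages`, the logger name and the fake `extract`/`save` callables are hypothetical, and the built-in `ConnectionError` stands in for whatever exception the chosen HTTP client actually raises.

```python
import logging

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("rating_scraper")  # hypothetical logger name


def process_pages(urls, extract, save):
    """Process each URL independently; log failures and keep going."""
    failed = []
    for url in urls:
        try:
            comments = extract(url)
            save(comments)
        except ConnectionError as exc:
            # Connection problems are reported and the run continues.
            logger.error("Connection error for %s: %s", url, exc)
            failed.append(url)
        except Exception:
            # Any other error is logged with a traceback, then skipped.
            logger.exception("Unexpected error for %s", url)
            failed.append(url)
        else:
            logger.info("Extracted %d comments from %s", len(comments), url)
    return failed


# Stand-in callables: page "b" simulates a dropped connection.
def fake_extract(url):
    if url == "b":
        raise ConnectionError("connection reset")
    return [f"comment from {url}"]


saved = []
failed = process_pages(["a", "b", "c"], fake_extract, saved.extend)
```

Returning the list of failed URLs lets a later run retry only those pages instead of starting the whole category again.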