We are looking for a PoC app that shows how to extract the main content from an arbitrary HTML page, stripping out everything else (navigation, banners, sidebars, etc.).
Similar to what Instapaper does with an arbitrary content page.
I have attached a list of random HTML pages covering a similar topic. The resulting application should intelligently extract only the main content from each page.
!!! To be considered for the job, please outline the general direction you would take.
1) Will the program download the content itself, or do you already have something that does the download? Just wanted to know.
2) If you want the program to download the page, the way to do it is with URLConnection.
3) Then you need an HTML-to-XML conversion API. Many libraries do a good job of this, but some pages might throw an error during conversion because their markup cannot be parsed properly. The likelihood is low, perhaps even zero. Either way you have to rely on the library; you have no choice.
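As a rough illustration of step 2, here is a minimal sketch of the download using only the JDK. The class name PageFetcher, the 10-second timeouts and the UTF-8 decoding are assumptions made for the example; a real page may declare a different charset, and the HTML-to-XML conversion from step 3 is not shown because it depends on whichever library is chosen.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any existing framework.
class PageFetcher {

    /** Opens a URLConnection to the given address and returns the response body as text. */
    static String fetch(String address) {
        try {
            URLConnection conn = URI.create(address).toURL().openConnection();
            conn.setConnectTimeout(10_000); // assumed timeout values, in milliseconds
            conn.setReadTimeout(10_000);
            try (InputStream in = conn.getInputStream()) {
                return readBody(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the whole stream and decodes it as UTF-8 (an assumption; pages may use other charsets). */
    static String readBody(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling PageFetcher.fetch with the address of one of the attached pages would return the raw HTML as a string, ready to hand to the conversion library.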
Then apply XSL to convert that XML into another XML that is more readable (or maybe this is not needed; it depends on you, so let's not complicate things unnecessarily). Then you use a Java program to read the data out of the XML.
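The XSL step might look roughly like the sketch below, which uses the JDK's built-in javax.xml.transform API. It assumes the page has already been converted to well-formed XML, and the stylesheet is purely hypothetical: it keeps only the children of a div with id="content", which is an assumption about one page's layout and not a general rule. Other pages would need their own templates.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper class for illustration only.
class MainContentXslt {

    // Assumed stylesheet: copies the children of <div id="content"> into an <article> element
    // and drops everything else (navigation, footer, and so on).
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'>"
      + "<article><xsl:copy-of select=\"//div[@id='content']/node()\"/></article>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the given XSLT to well-formed XML and returns the result as a string. */
    static String transform(String xml, String xslt) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }
}
```

Given a converted page with a navigation div, a content div and a footer div, MainContentXslt.transform(page, MainContentXslt.STYLESHEET) should return only the content wrapped in an article element. The Java program can then read the data out of that smaller XML.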
For this price, or in fact any price, you can get a framework that can parse one page. Then you have to learn how to construct the XSLT yourself. I can teach you for an hour or so, but it is up to you to learn. Otherwise no program will do any good.