I need a Lucene- or Solr-based implementation that does the following:
1) Provides a web-based UI to fetch a web document or a PDF from a defined source (e.g. a URL)
2) Allows the user to define what to look for (i.e. keywords, phrases)
3) Scans the document for specific information, e.g. <Field A>, <Field B>, etc.
4) Extracts the information and outputs it to a CSV file
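To make steps 2 to 4 concrete, here is a minimal sketch in plain Python. It does not use Lucene or Solr and it skips fetching and PDF text extraction; it only illustrates "define fields, scan text, write CSV". The names `extract_fields`, `to_csv` and `SAMPLE_TEXT` are hypothetical, and the sample values are invented placeholders, not taken from any real document. It also assumes each field appears on its own line as `Label: value`.

```python
import csv
import io
import re


def extract_fields(text, field_labels):
    """Return {label: value} for each label that appears as 'Label: value' on one line."""
    found = {}
    for label in field_labels:
        # Label, optional colon, then capture the remainder of the same line.
        pattern = re.compile(
            r"^[ \t]*" + re.escape(label) + r"[ \t]*:?[ \t]*(.+?)[ \t]*$",
            re.IGNORECASE | re.MULTILINE,
        )
        match = pattern.search(text)
        found[label] = match.group(1) if match else ""  # empty when not found
    return found


def to_csv(rows, field_labels):
    """Serialise a list of {label: value} dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=field_labels, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder text standing in for the text extracted from a fetched document.
SAMPLE_TEXT = """Quarterly summary
Field A: 123
Field B: 45.6
"""

labels = ["Field A", "Field B"]  # what the user asked to look for
csv_text = to_csv([extract_fields(SAMPLE_TEXT, labels)], labels)
print(csv_text)
```

In a real implementation the label list would come from the web UI, and the text would come from whatever fetch and PDF-to-text step is chosen.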
I have attached a sample PDF file. The script would:
a) Go to the Apple website (<[login to view URL]>) and follow each link:
- <[login to view URL]>
- <[login to view URL]>
and so on. Each link leads to a downloadable PDF document from which I would like to extract information.
b) Allow me to define what to look for. For Apple, I want to extract:
- iPhone and Related Products and Services - Units and Revenue for each period (Q2 2012, Q3 2011 and Q3 2012)
- iPad and Related Products and Services - Units and Revenue for each period (Q2 2012, Q3 2011 and Q3 2012)
c) Output the search results as a CSV file
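As a rough illustration of steps a) to c), the sketch below collects PDF links from a page and turns product rows into CSV lines, using only the Python standard library. Everything here is an assumption for illustration: the class and function names are hypothetical, the HTML and text samples are invented, the numbers are placeholders rather than real Apple figures, and the row layout (product name followed by units and revenue for each period, in the order listed) is a guess that would need checking against the attached PDF. Downloading the pages and converting PDFs to text are not shown.

```python
import csv
import io
import re
from html.parser import HTMLParser


class PdfLinkCollector(HTMLParser):
    """Collect href values ending in .pdf from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.lower().endswith(".pdf"):
                self.links.append(href)


def parse_product_row(text, product, periods):
    """Find the line starting with `product` and pair its numbers with `periods`.

    Assumed layout: each period contributes two numbers, units then revenue.
    """
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith(product.lower()):
            rest = stripped[len(product):]
            # Grab numbers such as 1,000 or 12.5 and drop thousands separators.
            numbers = [n.replace(",", "") for n in re.findall(r"\d[\d,]*(?:\.\d+)?", rest)]
            rows = []
            for i, period in enumerate(periods):
                pair = numbers[2 * i: 2 * i + 2]
                if len(pair) == 2:
                    rows.append({"product": product, "period": period,
                                 "units": pair[0], "revenue": pair[1]})
            return rows
    return []  # product line not found


# Invented stand-ins for a fetched page and for text extracted from one PDF.
SAMPLE_HTML = '<a href="/reports/q3.pdf">Q3</a> <a href="/about.html">About</a>'
SAMPLE_PDF_TEXT = (
    "iPhone and Related Products and Services 100 1,000 200 2,000 300 3,000\n"
    "iPad and Related Products and Services 10 500 20 600 30 700\n"
)

collector = PdfLinkCollector()
collector.feed(SAMPLE_HTML)

products = ["iPhone and Related Products and Services",
            "iPad and Related Products and Services"]
periods = ["Q2 2012", "Q3 2011", "Q3 2012"]

rows = []
for product in products:
    rows.extend(parse_product_row(SAMPLE_PDF_TEXT, product, periods))

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["product", "period", "units", "revenue"],
                        lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

One row per product and period keeps the CSV easy to filter; a wide layout with one column per period would work equally well if that suits the downstream use better.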