The aim of this project is to create a news aggregator in Python.
1. The engine should be built on top of Scrapy and needs to be well structured, scalable, and well optimized
2. The engine should crawl websites provided via a JSON file or a database (the source should be flexible, because we haven’t decided yet)
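Since the crawl targets may come from either a JSON file or a database, the loader can sit behind one small interface so the backend is swappable later. A minimal sketch; the class names and the file/table layout are assumptions, not part of the spec:

```python
import json
import sqlite3
from abc import ABC, abstractmethod


class SiteSource(ABC):
    """Abstract source of websites to crawl; backends are interchangeable."""

    @abstractmethod
    def load(self):
        """Return a list of site dicts, e.g. {"name": ..., "url": ...}."""


class JsonSiteSource(SiteSource):
    def __init__(self, path):
        self.path = path

    def load(self):
        with open(self.path, encoding="utf-8") as fh:
            return json.load(fh)


class SqliteSiteSource(SiteSource):
    def __init__(self, db_path):
        self.db_path = db_path

    def load(self):
        with sqlite3.connect(self.db_path) as conn:
            rows = conn.execute("SELECT name, url FROM sites").fetchall()
        return [{"name": n, "url": u} for n, u in rows]
```

Whichever storage is chosen later, only the `SiteSource` subclass changes; the engine keeps calling `load()`.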
3. Spiders should be built in a way that scales easily, e.g. a base spider that site-specific spiders can inherit from
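In Scrapy this maps to a shared spider base class with overridable hooks. A framework-agnostic sketch of the inheritance pattern (class and hook names are illustrative; a real implementation would subclass `scrapy.Spider` and use its selectors):

```python
class BaseNewsSpider:
    """Shared crawling behaviour; site-specific spiders override only details."""

    name = "base"
    start_urls = []

    def parse(self, html):
        # Default parsing pipeline: subclasses override extract_* hooks as needed.
        return {
            "title": self.extract_title(html),
            "spider": self.name,
        }

    def extract_title(self, html):
        # Naive default; real spiders would use CSS/XPath selectors instead.
        start = html.find("<title>")
        end = html.find("</title>")
        if start == -1 or end == -1:
            return None
        return html[start + len("<title>"):end].strip()


class ExampleSiteSpider(BaseNewsSpider):
    name = "example"
    start_urls = ["https://example.com/news"]
    # Inherits parse()/extract_title(); override only what differs per site.
```

Adding a new site then means adding one small subclass rather than a whole new spider.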
4. When a spider hits a website for the first time, it should check whether the site exposes an RSS feed or a sitemap, and use it to track and scrape the latest news/content. If there is no RSS feed or sitemap, a fallback spider should take over and find the latest news itself
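The first-contact discovery step could probe well-known feed and sitemap locations before falling back. A sketch with the fetch function injected so it stays testable offline; the candidate paths are common conventions, not guarantees (many sites advertise feeds via `<link>` tags instead):

```python
from urllib.parse import urljoin

# Conventional locations only; not every site uses them.
FEED_PATHS = ["/feed", "/rss", "/rss.xml", "/atom.xml"]
SITEMAP_PATHS = ["/sitemap.xml", "/sitemap_index.xml"]


def discover_source(base_url, fetch):
    """Return ("rss" | "sitemap" | "fallback", url_or_None).

    `fetch(url)` should return the response body on HTTP 200 and None
    otherwise (e.g. a thin wrapper around urllib or requests).
    """
    for path in FEED_PATHS:
        url = urljoin(base_url, path)
        if fetch(url):
            return "rss", url
    for path in SITEMAP_PATHS:
        url = urljoin(base_url, path)
        if fetch(url):
            return "sitemap", url
    # No structured source found: signal that the fallback spider should run.
    return "fallback", None
```

The returned kind decides which spider class handles the site from then on.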
5. A Machine Learning model / AI techniques should be used to identify the page structure and extract the content automatically
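Before (or alongside) a trained model, a text-density heuristic is a common baseline for locating the main content: the article body tends to be the region with the most text per markup tag. A toy sketch of the idea on raw strings; a real system would score nodes of a parsed DOM tree, and this is not a substitute for a model:

```python
import re


def densest_block(html):
    """Pick the chunk with the highest text-to-markup ratio.

    Splits on <div>/<article>/<section> boundaries and scores each chunk
    by visible text length divided by tag count.
    """
    chunks = re.split(r"(?i)</?(?:div|article|section)[^>]*>", html)
    best, best_score = "", 0.0
    for chunk in chunks:
        text = re.sub(r"<[^>]+>", " ", chunk).strip()  # strip remaining tags
        n_tags = chunk.count("<")
        score = len(text) / (1 + n_tags)
        if score > best_score:
            best, best_score = text, score
    return best
```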
6. The content should be cleaned with the best techniques possible, extracting all the important information from an article (title, image(s), videos, date, author(s), body content, etc.)
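The fields listed above suggest one normalized article schema, so every spider emits the same shape regardless of source. A sketch; the exact field names are assumptions:

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class Article:
    """Normalized output record that every spider yields."""

    url: str
    title: Optional[str] = None
    content: Optional[str] = None
    date: Optional[str] = None          # e.g. an ISO 8601 string once normalized
    authors: List[str] = field(default_factory=list)
    images: List[str] = field(default_factory=list)
    videos: List[str] = field(default_factory=list)

    def to_dict(self):
        return asdict(self)
```

A fixed schema keeps the cleaning, storage, and monitoring stages independent of any particular site's markup.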
7. The output should be flexible, with options to save the articles to files (locally or in remote storage such as Amazon or Microsoft Azure Blob Storage) or to a database
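Flexible output maps naturally onto a storage interface with interchangeable backends (local files, Amazon/Azure object storage, database). A sketch with only the local backend implemented; the cloud backends would implement the same interface using their SDKs:

```python
import json
from abc import ABC, abstractmethod
from pathlib import Path


class ArticleStore(ABC):
    """Storage interface; S3/Azure/DB backends would implement save() too."""

    @abstractmethod
    def save(self, article_id, article_dict):
        ...


class LocalJsonStore(ArticleStore):
    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def save(self, article_id, article_dict):
        path = self.root / f"{article_id}.json"
        path.write_text(json.dumps(article_dict, ensure_ascii=False),
                        encoding="utf-8")
        return path
```

The backend can then be chosen from configuration without touching spider code.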
8. Every spider should be monitored by gathering information from it (e.g. statistics on how many articles it has scraped from a website); if a spider fails to crawl a specific website, it should raise a flag and report the reason for the failure
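Per-spider monitoring can be a small stats collector that spiders report into. Scrapy ships its own stats collection and signals, so this stdlib sketch only illustrates the shape of the data being tracked:

```python
from collections import defaultdict


class SpiderMonitor:
    """Tracks per-spider scrape counts and failure reasons."""

    def __init__(self):
        self.scraped = defaultdict(int)
        self.failures = defaultdict(list)

    def record_article(self, spider_name):
        self.scraped[spider_name] += 1

    def record_failure(self, spider_name, url, reason):
        # The "flag": a stored failure entry that a reporter can alert on.
        self.failures[spider_name].append({"url": url, "reason": reason})

    def report(self):
        return {
            name: {"articles": self.scraped[name],
                   "failed": len(self.failures[name])}
            for name in set(self.scraped) | set(self.failures)
        }
```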
9. Please do not hesitate to suggest ideas or better ways of doing things, especially using AI and Machine Learning
- The programming language that should be used in this project is Python 3
- The project should be well-structured, with clean code, and built with a scaling strategy in mind
- Modular/component-based
- Speed and accuracy are of utmost importance
- The service should be optimized to run on low-performance servers and still achieve good results
- External dependencies should be kept to a minimum. If you can't avoid external dependencies, please list them in a text file named [login to view URL]