News Aggregator -- 2

Closed. Posted 3 years ago. Paid on delivery.
The aim of this project is to create a news aggregator in Python.

Specifications:

1. The engine should be built on top of Scrapy and needs to be well structured, scalable, and well optimized.

2. The engine should crawl websites listed in a JSON file or a database. The source must be swappable, because we haven't decided yet which one we will use.

3. Spiders should be built so that they scale up easily, for example as a baseline spider that other spiders inherit from.

4. When a spider hits a website for the first time, it should check whether the website has RSS feeds or a sitemap and use them to track and scrape the latest news/content. If there is no RSS feed or sitemap, it should hand over to a fallback spider, which will find all the latest news itself.

5. Machine learning models or AI techniques should be used to identify the page structure and extract the content automatically.

6. The content should be cleaned with the best techniques available, and all the important information should be extracted from each article (title, image/images, videos, date, author/authors, content, etc.).

7. The output should be flexible, with options to save the articles to files (locally or in remote storage such as Amazon or Microsoft Azure Blob Storage) or to a database.

8. Every spider should be monitored by gathering information from it, such as statistics on how many articles it has scraped from each website. If a spider fails to crawl a specific website, it should raise a flag and report the reason for the failure.

9. Please do not hesitate to suggest ideas or better ways of doing it, especially using AI and Machine learning.
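As an illustration of specification 4, the choice between RSS, sitemap, and fallback crawling could be sketched roughly as follows. This is a minimal sketch using only the standard library, not part of the brief: the names `find_feeds`, `find_sitemaps`, and `choose_strategy` are hypothetical, and the sketch assumes the homepage HTML and the site's `robots.txt` have already been downloaded (for example by a Scrapy spider). Scrapy also ships a `SitemapSpider` class that could take over once a sitemap URL is known.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types commonly used to advertise RSS/Atom feeds in <link> tags.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collects feed URLs advertised via <link rel="alternate" ...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        mime = (attr.get("type") or "").lower()
        href = attr.get("href")
        if "alternate" in rel and mime in FEED_TYPES and href:
            # Resolve relative hrefs against the page URL.
            self.feeds.append(urljoin(self.base_url, href))


def find_feeds(html, base_url):
    """Return absolute feed URLs found in the given HTML (hypothetical helper)."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds


def find_sitemaps(robots_txt):
    """Return sitemap URLs declared on 'Sitemap:' lines of a robots.txt body."""
    sitemaps = []
    for line in robots_txt.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            sitemaps.append(value.strip())
    return sitemaps


def choose_strategy(html, robots_txt, base_url):
    """Pick 'rss', then 'sitemap', then 'fallback', in that order of preference."""
    feeds = find_feeds(html, base_url)
    if feeds:
        return "rss", feeds
    sitemaps = find_sitemaps(robots_txt)
    if sitemaps:
        return "sitemap", sitemaps
    return "fallback", [base_url]
```

The returned strategy name could then be used to select which spider subclass of the baseline spider (specification 3) handles the site on later runs.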

Requirements:

- The programming language that should be used in this project is Python 3

- The project should be well structured, with clean code, and should be built with the scaling strategy in mind

- Modular/component based

- Speed and accuracy are of utmost importance

- The service should be optimized to run on low-performance servers and still achieve good results

- External dependencies should be kept to a minimum. If you can't avoid external dependencies, please list them in a text file named [login to view URL]
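One possible way to meet the "Modular/component based" requirement together with the flexible output of specification 7 is a small storage interface that each backend implements. The sketch below is an assumption-laden illustration using only the standard library: `ArticleSink`, `JsonLinesFileSink`, `MemorySink`, and `build_sink` are hypothetical names, and remote backends (blob storage, database) are deliberately left out because they would need external dependencies.

```python
import json
from abc import ABC, abstractmethod
from pathlib import Path


class ArticleSink(ABC):
    """Common interface every storage backend implements (hypothetical)."""

    @abstractmethod
    def save(self, article: dict) -> None:
        """Persist a single scraped article."""


class JsonLinesFileSink(ArticleSink):
    """Appends each article as one JSON object per line to a local file."""

    def __init__(self, path):
        self.path = Path(path)

    def save(self, article):
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(article, ensure_ascii=False) + "\n")


class MemorySink(ArticleSink):
    """Keeps articles in a list; handy for tests and dry runs."""

    def __init__(self):
        self.items = []

    def save(self, article):
        self.items.append(article)


# Registry mapping a config value to a backend class. A remote-storage or
# database backend would be registered here as another ArticleSink subclass.
SINKS = {"jsonl": JsonLinesFileSink, "memory": MemorySink}


def build_sink(config):
    """Create a sink from a config dict such as {"type": "jsonl", "path": "out.jsonl"}."""
    options = dict(config)  # copy so the caller's dict is not modified
    kind = options.pop("type")
    return SINKS[kind](**options)
```

In a Scrapy project, a sink like this would typically be called from an item pipeline's `process_item` method, so switching the output target only requires a configuration change.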

Python, Software Architecture, Web Scraping, Scrapy, Machine Learning (ML)

Project ID: #29367861

About the project

8 proposals. Remote project. Active 3 years ago.

8 freelancers are bidding an average of $343 for this job

umg536

Hi there, I'm bidding on your project "News Aggregator -- 2". I am a data scientist and an expert in machine learning and artificial intelligence, and I can do this project for you. Please leave a message on my chat so ...

$450 USD in 5 days
(40 reviews)
7.2
Venkat2011sri

Hello, this is Sree. I'm a programmer and have worked for many clients on this type of platform. I write programs that can read data from websites or files and produce the exact output you want. I will write pro ...

$500 USD in 4 days
(143 reviews)
7.3
zivkovicdevelop1

Hello. Which site do you want to scrape? I am a web scraping expert and have finished many scraping jobs in Python. I have many ideas for scraping the website. I am sure I can scrape the data exactly as you wish. Hope y ...

$200 USD in 3 days
(33 reviews)
6.2
Demenntor

Dear Employer, I have seen your job description for News Aggregator. I have good experience in Python, machine learning, and Scrapy, and have done similar projects. Kindly message me so that we can discuss more about the wo ...

$300 USD in 5 days
(26 reviews)
4.9
MohammedJARROU

Hello, I'm a data scientist and Python developer with extensive experience in web scraping, machine learning, and deep learning projects with Python. I have several certificates in the same field (you can cons ...

$275 USD in 5 days
(11 reviews)
4.1
gobyweb2

Hello, I have read your description and I can - 1. The engine should be build on top of Scrapy and needs to be well structured, scalable and well optimized 2. The engine should crawl websites that will be provided fr ...

$400 USD in 7 days
(5 reviews)
4.2