Website scraper

$750-1500 USD

Closed

Posted over 10 years ago

Paid on delivery

We are looking for a scraper to extract main text content from the site and ignore things like ads, menus, etc. The scraper should be a generic one such that we do not need to alter it in any way for each additional website. The goal is to build a scraper that is both smart and efficient. The details of the implementation will be up to the developer. But on a high level, we have a number of requests and recommendations regarding architecture and minimum requirements. The main requirement is that we should be able to give this scraper almost any URL to an HTML/XHTML-based website and have it automatically scrape the entire site.

Relevant Pages

While doing so, it needs to understand which pages are relevant (contain primary/useful content), which pages are secondary (e.g. “about us”, “terms of service”, “careers”, etc.), and which are completely irrelevant. Most importantly, we need to know which ones are relevant, basically to filter out everything else.

Information Architecture

The scraper needs to make note of the hierarchy of data on the website and how it is organized. For example, it should take into account any hierarchical categorization and how pages on the site fit into those categories. Keep in mind that a page could be associated with multiple categories, and there could be more than one kind of taxonomy per site.

Parsing Page Content

The scraper should determine which parts of the webpage are actual unique content, what the page title is, and what all the “irrelevant” portions are (sidebars, headers, footers, ads, miscellaneous widgets). This is highly important because we want to save only the title and unique page content in our database (and of course categories that it fits into, etc.). In other words, we want to filter out sidebars, headers, footers, ads, etc.
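To make the storage target concrete, here is a minimal Java sketch of one possible per-page record holding the title, the unique content, a relevance level, and categories drawn from more than one taxonomy. Every name in it (`PageRecordSketch`, `PageRecord`, `Relevance`, `addCategory`, the example URL and taxonomy names) is a hypothetical illustration, not part of the posting, and the actual schema is left to the developer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PageRecordSketch {

    /** The three relevance levels described in the posting (names are assumptions). */
    enum Relevance { PRIMARY, SECONDARY, IRRELEVANT }

    /** What would be saved for one page: title, unique content, and its categories. */
    static final class PageRecord {
        final String url;
        final String title;
        final String uniqueContent;
        final Relevance relevance;
        // taxonomy name -> category paths (root to leaf). A page may have several
        // paths within one taxonomy and may appear in several taxonomies.
        final Map<String, List<List<String>>> categories = new LinkedHashMap<>();

        PageRecord(String url, String title, String uniqueContent, Relevance relevance) {
            this.url = url;
            this.title = title;
            this.uniqueContent = uniqueContent;
            this.relevance = relevance;
        }

        /** Attaches the page to one category path inside the named taxonomy. */
        void addCategory(String taxonomy, String... path) {
            categories.computeIfAbsent(taxonomy, k -> new ArrayList<>())
                      .add(Arrays.asList(path));
        }
    }

    public static void main(String[] args) {
        PageRecord page = new PageRecord(
                "http://example.com/recipes/42", "Lentil soup",
                "Rinse the lentils ...", Relevance.PRIMARY);
        page.addCategory("section", "Recipes", "Soups");
        page.addCategory("section", "Recipes", "Vegetarian"); // second category, same taxonomy
        page.addCategory("tags", "Winter");                   // a second taxonomy
        System.out.println(page.title + " -> " + page.categories);
    }
}
```

Storing full paths rather than single labels keeps the site hierarchy recoverable from each record, at the cost of some duplication.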
Suggested Architecture

We have a recommendation about how to parse page content as described above: to determine which parts of the webpage are unique and which should be filtered out (sidebars, headers, footers, ads, widgets). It would probably make most sense to maintain a hierarchical object of the DOM so that every node in the HTML (and the HTML within it) would have its own object inside this DOM array/object, recursively. Basically the equivalent of [login to view URL] (except we don’t want to use PHP because it’s very inefficient at parsing).

Then our goal is to determine which content is unique on each page, and which content reoccurs regularly on other pages (sidebars, headers, etc.). One way to achieve this is to generate an MD5 hash for the HTML contained within each HTML node and create a database of these hashes (and how many times each of them occurs). Then we can look at each hash on the current page and compare it with all existing hashes in the database to see if this is something unique or commonly reoccurring.

More specifically, when generating MD5s (if that’s the method we use), we should strip from the HTML tags most of their attributes because things like “class”, “id”, “style”, etc. might appear in menus of the site and could actually vary from page to page, such as when needing to mark a menu item as current/selected. So to avoid these irregularities, we want to look at mainly the structure of the DOM and the content inside it, ignoring many or most of the HTML attributes inside tags.

However, this is just a suggestion, and other ideas are welcome if you can think of something better.

In terms of architectural requirements, please remember that performance is a very high priority for us. The choice of programming language is largely up to the developer. However, we highly recommend using a lower level language like C, C++, or maybe something a bit higher level like Java or C#. PHP, Ruby, etc. probably wouldn't work. We need as much power as we can get.
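The hash-counting suggestion can be sketched roughly as follows. This is a minimal illustration in Java, not a definitive implementation. It assumes well-formed XHTML input, parsed with the JDK's built-in XML parser (real-world HTML would need a tolerant parser). It drops all attributes rather than "most" of them, and it keeps the counts in an in-memory map instead of a database. The names `NodeHashSketch`, `canonical`, `hashesOf`, and `record` are hypothetical.

```java
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class NodeHashSketch {

    /** Parses well-formed XHTML and returns its root element (assumption: no tag soup). */
    static Node parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
    }

    /** Serializes a node as tag names plus trimmed text, ignoring every attribute. */
    static String canonical(Node node) {
        if (node.getNodeType() == Node.TEXT_NODE) return node.getNodeValue().trim();
        if (node.getNodeType() != Node.ELEMENT_NODE) return "";
        StringBuilder sb = new StringBuilder("<").append(node.getNodeName()).append(">");
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) sb.append(canonical(children.item(i)));
        return sb.append("</").append(node.getNodeName()).append(">").toString();
    }

    static String md5Hex(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is required on every Java platform
        }
    }

    /** Returns the distinct hashes of all elements under root (one hash per subtree). */
    static Set<String> hashesOf(Node root) {
        Set<String> hashes = new HashSet<>();
        collect(root, hashes);
        return hashes;
    }

    // Re-serializes each subtree for clarity; a faster version would hash bottom-up.
    private static void collect(Node node, Set<String> out) {
        if (node.getNodeType() != Node.ELEMENT_NODE) return;
        out.add(md5Hex(canonical(node)));
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) collect(children.item(i), out);
    }

    /** Adds one page's hashes to the running "seen on N pages" counts. */
    static void record(Set<String> pageHashes, Map<String, Integer> pageCounts) {
        for (String h : pageHashes) pageCounts.merge(h, 1, Integer::sum);
    }

    public static void main(String[] args) {
        String pageA = "<html><body><ul class=\"nav\"><li class=\"current\">Home</li><li>News</li></ul>"
                + "<div id=\"a\"><p>First article.</p></div></body></html>";
        String pageB = "<html><body><ul class=\"nav\"><li>Home</li><li class=\"current\">News</li></ul>"
                + "<div id=\"b\"><p>Second article.</p></div></body></html>";
        Map<String, Integer> pageCounts = new HashMap<>();
        record(hashesOf(parse(pageA)), pageCounts);
        record(hashesOf(parse(pageB)), pageCounts);
        String menu = md5Hex("<ul><li>Home</li><li>News</li></ul>");
        String article = md5Hex("<div><p>Second article.</p></div>");
        System.out.println("menu seen on " + pageCounts.get(menu) + " page(s)");       // 2
        System.out.println("article seen on " + pageCounts.get(article) + " page(s)"); // 1
    }
}
```

Because the "current" class is stripped before hashing, the menu yields the same hash on both pages, while each article body hashes differently. Counting pages per hash (instead of raw occurrences) keeps a block that repeats within a single page from looking like site-wide boilerplate.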

C Programming

C++ Programming

Java

Project ID: 5049946

About the project

15 proposals

Remote project

Active 10 years ago

15 freelancers are bidding an average of $1,207 USD for this job

@samitXI

Hi Sir, I am ready to work for you. I have 9 years of experience in C/C++ and Java. Please see some of my work and check my reviews to get a better idea of my skills. I deliver quality work on time. Please visit my profile. Thanks and regards, Amit

$1,443 USD in 15 days

4.9

(196 reviews)

7.4

@szymszteinsl

Hi! I am a professional C/C++/C#/Java programmer. I can do this project with the highest quality. Best regards, Szymszteinsl

$1,500 USD in 3 days

4.9

(121 reviews)

7.3

@nani01029x

I have done some web scraper projects and received positive feedback from clients. I have a very high completion rate. You can check my profile for more information. Let me help you. Tinh Nguyen.

$1,000 USD in 7 days

5.0

(145 reviews)

6.0

@dimplex

Hi, thank you for considering my bid. Based on my experience with YP across various countries, I can offer a proven pattern that solves a number of problems not mentioned here. The language is Java, and I'd recommend storing the results in a database. Regarding content recognition, I am highly skeptical that this can be done in the general case. My approach is to leave the implementation open and create plugins for the particular sites. Kind regards, Rumen

$1,500 USD in 30 days

5.0

(23 reviews)

4.6

@artursharipov

Hi, I'm a professional software developer with machine learning skills. I've created several generic website scrapers and know what you want to achieve. I suggest applying machine learning, so that the scraper can learn different aspects from different websites. It is not possible to create a 100% correct generic scraper, but with machine learning it is possible to control and improve its accuracy.

$1,444 USD in 15 days

5.0

(12 reviews)

4.6

@omanasoft

APPLICATION: The application will scrape and extract relevant information from multiple websites of different genres and use the data to populate a database. Relevance is to be determined on the basis of being core content of the pages, ignoring content common across pages (menus, ads, etc.).

MY WORK: I have developed (in Java) a generic web-scraper tool called OpenMana Web Information Miner (OmanaWIM) that can be configured to scrape any information from any website. It can log in, process JavaScript/AJAX call results, follow multi-level links, post search forms, and handle pagination; can accept and process responses in XML; can download images and files; is multi-threaded in a configurable way; can use proxies; and supports user-specifiable filters. Scraped information can be delivered as JSON or XML, or posted to a database or Excel/CSV. With a large number of sites, hand-writing the configuration will not be feasible, so automated generation of the configuration has been developed. The configuration generator implements a relevance algorithm and involves natural language processing (NLP).

MY SOLUTION: I propose a custom solution in Java, based on some components from my OmanaWIM tool. The solution uses an open-source HTML-parser library like HtmlUnit or Selenium.

ME: 1. I am a full-time freelancer with 15+ years of experience in software development. 2. I have expertise in UML/OOAD, Java, and NLP. 3. I am certified in Business English.

$1,333 USD in 20 days

4.3

(3 reviews)

3.9

@indianpws

Hi, I am an IT professional with more than 15 years of experience. I am a SCJP (Sun Certified Java Professional), OCEJWCD (Oracle Certified Enterprise Java Web Components Developer), SCEA (Sun Certified Enterprise Architect), and PMP (PMI Certified Project Management Professional). I have extensive experience in web scraping with Java using HtmlUnit. HtmlUnit provides a DOM of the web page, and with that most of the work is done by default. I have taken up and completed many such jobs, and I assure you of quick, quality delivery. Regards, Nitin

$888 USD in 10 days

5.0

(2 reviews)

2.2

@Askarali82

Hi, I am a C++ programmer. I can try to write your web scraper in C++ using socket programming. I haven't written a scraper before, but I have experience in socket programming: I wrote a Windows service using sockets. So I can work with you if you like. Best regards, Askarali Azimov.

$1,333 USD in 30 days

0.0

(0 reviews)

0.0

@skivsoft

Hi! This is quite an interesting problem. I can implement the requested features as a multi-threaded application in C# or Java. Best regards, Skiv

$1,500 USD in 30 days

0.0

(0 reviews)

0.0

@rohithr1990

I have already done something like this in Java using the Simplescrap library. I can modify the code according to your problem statement. I also have experience in scraping using Python with the Scrapy framework.

$1,250 USD in 12 days

0.0

(0 reviews)

1.0

@damo303030

Hi, I am a Java and C# expert. I will have no problem developing a web scraper to extract the data you require. I can complete the project in less than 15 days. Let me know if you want to discuss requirements further.

$750 USD in 15 days