
A selection of webscraping tools

$30-250 USD

Closed
Posted over 14 years ago


Paid on delivery
We require several webscraping tools to be designed and built in order to obtain various information about multiple URLs. The tools will be used by the client, so the interface must be clear and simple; ideally the various tools will share a similar interface.

All tools are likely to follow a similar format: a list of URLs is entered, possibly with the tool referring to a .txt or similar document; the tool then scrapes the relevant data and presents it in an Excel or CSV format, with the extracted data in columns alongside the original input data. Please see the attached Excel file “example output – see note different [login to view URL]”, which gives possible examples of what the data might look like.

We need a quick turnaround on this work: the completed, working tools must be sent to us and approved by 23 December. If the quality of the tools is high, there may be further opportunities for similar ad hoc projects in the future.

Technorati Authority Webscraper

The input will be a list of URLs, perhaps in a .txt document, separated by carriage returns:

[login to view URL]
[login to view URL]
[login to view URL]
[login to view URL]
[login to view URL]
[login to view URL]

The tool will need to be capable of giving up to 10,000 results at a time. The scraper will need to scrape the Technorati database for the Technorati Authority and present the results with the URLs in the first column and the corresponding Technorati Authority number in the second column. See the “Technorati Authority” tab on the “example output” spreadsheet file. Not all blogs are registered with Technorati; the output should return a blank value when the blog URL is not recognised by the Technorati database. You may need to use proxies.
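Not part of the brief, but to illustrate the shared shape of the single-metric tools (Technorati Authority, PageRank): each one reads a carriage-return-separated URL list and writes a two-column CSV, so they could share one skeleton. The `lookup` callable here is a hypothetical placeholder for the actual per-URL scrape; the live services mentioned may need proxies or may no longer expose this data at all.

```python
import csv

def run_tool(url_file, output_file, lookup):
    """Shared skeleton: read a carriage-return-separated URL list,
    call a per-URL lookup (the real scrape would go there), and write
    each URL alongside its result.  `lookup` returning None models a
    blog the directory does not recognise, which the brief says should
    produce a blank cell."""
    with open(url_file, encoding="utf-8") as f:
        urls = [line.strip() for line in f if line.strip()]
    with open(output_file, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for url in urls:
            result = lookup(url)
            writer.writerow([url, "" if result is None else result])
```

A real implementation would batch and throttle the 10,000-URL runs; this only shows the input/output contract.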
Google PageRank calculator tool

Similar to the Technorati Authority tool, this tool will take a raw list of URLs, calculate the Google PageRank, and generate an output file with the URL in the first column and the corresponding PageRank score in the second column.

Inbound/Outbound link tool

We require a scraping tool to compile the numbers of inbound and outbound links for multiple URLs and present them in an Excel or CSV format. The input will be a list of URLs, perhaps in a .txt document or similar, separated by carriage returns:

[login to view URL]
[login to view URL]
[login to view URL]
[login to view URL]
[login to view URL]
[login to view URL]

The output spreadsheet file will present the URL in the first column, the number of inbound links for that URL in the second column, and the number of outbound links in the third column. See the “inbound & outbound links” tab on the “example output” spreadsheet file. The tool will need to be capable of giving up to 10,000 results at a time.

Email scraping tool

We require a tool to extract all email address information from a blog page, or anything that looks like an email; for example, many websites disguise their email contacts in the form “example at example dot com”. The tool will need to perform several iterations, as the information may appear on the page in different formats:

1. Extract any mailto: hyperlink and format it as the hyperlinked text only.
2. Extract any word which contains letters, followed by “@”, followed by more letters, i.e. _____@______. This will have to take into account any dots, hyphens or similar and NOT treat these as “separators”; instead, a space or carriage return is the separator.
3. Extract anything in the format “WORD WORD WORD AT WORD WORD WORD” and format it as “wordwordword@wordwordword”.
4. Extract anything in the format “WORD WORD WORD [AT] WORD WORD WORD” and format it as “wordwordword@wordwordword”.
5. Extract anything in the format “WORD WORD WORD @ WORD WORD WORD” and format it as “wordwordword@wordwordword”.

The tool will run from a list of URLs and populate the results onto the same row as the originating URL, as follows:

URL  email  email
URL  email
URL  email  email  email  email
URL
URL  email

See the email tab on the “example output” file.

Blogflux scraping tool

This tool will scrape the Blogflux directory ([login to view URL]) and extract all the URLs contained in the directory, along with the corresponding “categories” to which they are assigned within the directory. As the directory will update as time goes on, we will need ongoing access to the tool; this will not be a single-use tool. Each blog within the directory has an assigned page within Blogflux, for example [login to view URL], for which the blog address is http://ornamentalist.net. The output file will only present the outbound link (e.g. [login to view URL]). The data will be presented in a spreadsheet with all the URLs on different rows, and the categories to which the blogs have been assigned populated across columns. See the “Blogflux” tab on the “example output” spreadsheet file.

Blog Catalog scraping tool

See [login to view URL], a directory which we would like to scrape for a complete list of URLs with associated information. For an example, see [login to view URL]. The information which we require is:

- Blog URL (in this case [login to view URL])
- Rating of the blog from users (scroll down to the comments; in this case currently 5.00)
- Number of fans on Blog Catalog (see [login to view URL]; in this case 13)
- Description (in this case “The fine arts, decorative arts and architecture of Europe, North America and Australia, 1650-1933”)
- “Listed in” tags (in this case Art History, Architecture). These will use a different cell for each category (see the example on the attached file).

The tool will scrape the entire directory at once and will therefore produce a large output file with over 100,000 rows of data.
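The de-obfuscation rules in the Email scraping tool above are concrete enough to sketch. This is a minimal illustration of rules 2-5 only (rule 1, mailto: links, would be handled separately by pulling mailto: hrefs out of the page HTML before this pass); the three-words-per-side limit and the extra “dot” → “.” normalisation are assumptions drawn from the “example at example dot com” example, not part of the brief.

```python
import re

# Tokens standing in for "@": a literal @, the word "at", or "[at]" (rules 3-5).
AT = r"(?:\[\s*at\s*\]|\bat\b|@)"
# Tokens standing in for ".": the word "dot" or "[dot]".
DOT = re.compile(r"\s*(?:\[\s*dot\s*\]|\bdot\b)\s*", re.IGNORECASE)
# Up to three space-separated words on either side of the AT token.
SPACED = re.compile(
    r"((?:[\w.-]+ ){0,2}[\w.-]+)\s*" + AT + r"\s*((?:[\w.-]+ ){0,2}[\w.-]+)",
    re.IGNORECASE,
)
# Rule 2: dots and hyphens are part of the word, space/newline separates.
PLAIN = re.compile(r"[\w.-]+@[\w-]+(?:\.[\w-]+)+")

def extract_emails(text):
    """Normalise obfuscated forms, then collect anything shaped like
    word@word."""
    norm = DOT.sub(".", text)                 # "example dot com" -> "example.com"
    norm = SPACED.sub(                        # join the words and insert "@"
        lambda m: m.group(1).replace(" ", "") + "@" + m.group(2).replace(" ", ""),
        norm,
    )
    return PLAIN.findall(norm)
```

On a real page this pass would be restricted to candidate spans, since a bare English “at” between ordinary words is a false positive for the whole-text substitution shown here.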
See the “Blog catalog” tab on the “example output” spreadsheet file. Note in particular that “Art History” and “Architecture” are in different cells.

Google search tool

We need a tool to search Google for given search terms and scrape the results into a spreadsheet. The tool will need to be capable of searching multiple terms at once. The input will be a list of search terms, perhaps in a .txt file, for example:

food
“mobile phones”
technology blogs
“luxury jewellers” London

The tool will then populate a spreadsheet with columns containing: (1) the search term searched for; (2) the meta title of the link; (3) the URL; (4) the home URL or parent website; (5) the PageRank; (6) the Google description. The tool will need a feature whereby the user can limit the number of search results scraped per search term (for example, the user might want to specify 100 results for each of the 4 example terms given above, giving a total of 400 results). The tool should give the option of searching using either [login to view URL] or [login to view URL]. See the “Google search tool” tab on the “example output” file; in this case, the user would have specified 3 results per search term on google.com. The tool will also need to work with search terms like related:[login to view URL] (Google related search).

[login to view URL] scraper

We need a tool to scrape the entire [login to view URL] database of profiles and populate all the information from the profile pages into an output file. This information will be: (1) profile URL, (2) name, (3) gender, (4) email address if shown, (5) My Web Page URL, (6) IM username, (7) city/town, (8) region/state, (9) country, (10) industry, (11) occupation, (12) About Me, (13) My Blogs, populated across multiple columns if applicable. See the “[login to view URL] scraper” tab on the attached file for an example (note that in this example, not all the columns are populated, as some of the information does not exist).
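Several of the outputs described above share one ragged-row shape: the email rows, the Blogflux categories, and the Blog Catalog “Listed in” tags all put a URL in the first column and a variable number of values across the remaining columns. A minimal sketch of that output step, with invented example data:

```python
import csv

def write_ragged(path, rows):
    """Write one row per input URL: the URL in the first column and a
    variable number of scraped values (emails, categories, tags, ...)
    spread across the remaining columns.  CSV permits rows of
    different lengths, and Excel opens them with the short rows simply
    left blank on the right."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for url, values in rows:
            writer.writerow([url, *values])
```

This matches the layout shown in the email tab of the example file: a URL with no hits still gets its own row, just with no cells after the first.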
I need a VERY QUICK TURNAROUND for these projects - with tools to be delivered by 23 December. Please contact me with your proposal and costings.
Project ID: 571739

Project information

4 proposals
Remote project
Active 14 years ago

4 freelancers are bidding an average of $233 USD for this job
willing to start right away
$250 USD in 10 days
5.0 (24 reviews)
6.1
Hi sir, I can do this for you. Thanks, Kimi.
$230 USD in 5 days
4.9 (61 reviews)
6.0
Hello, Please have a look in PMB for more details. Regards, Bhavik
$250 USD in 0 days
0.0 (1 review)
4.0

About this client

London, United Kingdom
5.0
42
Member since Sept 13, 2007

Client verification

Freelancer ® is a registered Trademark of Freelancer Technology Pty Limited (ACN 142 189 759)
Copyright © 2024 Freelancer Technology Pty Limited (ACN 142 189 759)