Distinguishing gender and filtering emails using HDFS/NLTK/Python

Closed · Posted 7 years ago · Paid on delivery

Problem Statement 1:

We plan to take live streaming data such as Twitter tweets, analyze the tweets, differentiate tweets from male and female users, and represent the results graphically. As a rough heuristic, names ending in a, e, or i are most likely female names, so we will use this rule to distinguish male and female names.
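The name-ending rule above could be sketched in Python roughly as follows. This is only an illustration of the stated heuristic, not a reliable gender classifier; the function name `guess_gender`, the "unknown" label, and the choice to look only at the first word of the display name are assumptions made for the example.

```python
FEMALE_ENDINGS = ("a", "e", "i")  # endings named in the problem statement


def guess_gender(display_name: str) -> str:
    """Return 'female', 'male', or 'unknown' for a display name.

    Hypothetical helper: applies the posting's rule to the first
    whitespace-separated word of the name.
    """
    tokens = display_name.strip().split()
    if not tokens:
        return "unknown"
    first_name = tokens[0].lower()
    # Handles, numbers, and emoji give no usable name to test.
    if not first_name.isalpha():
        return "unknown"
    return "female" if first_name.endswith(FEMALE_ENDINGS) else "male"
```

A call such as `guess_gender("Maria Lopez")` returns `"female"`, while any alphabetic first name with another ending falls through to `"male"`, which is exactly as coarse as the rule itself.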

Instructions:

1. Extract data from Twitter (using Flume).

2. Store it in a text file (store the extracted files in HDFS).

3. Do sentiment analysis using NLTK (use the files in HDFS as input).

4. Distinguish gender by the Twitter names.

5. Create a graphical report using visualization.

6. Store the visualization report in HDFS using the sandbox.

Note: steps 3, 4, and 5 should be written in Python.
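As a rough sketch of how the Python steps might consume the stored tweets, the function below tallies tweets per label. It assumes the Flume output has been copied out of HDFS as a text file with one JSON tweet object per line and a `user.name` field; the actual format depends on how the Flume source and sink are configured. `tally_by_label` and its `classify` parameter are hypothetical names.

```python
import json
from collections import Counter
from typing import Callable, Iterable


def tally_by_label(lines: Iterable[str], classify: Callable[[str], str]) -> Counter:
    """Count tweets per label, where classify maps a display name to a label."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip records that are not valid JSON
        if not isinstance(tweet, dict):
            continue
        user = tweet.get("user")
        # Assumed layout: {"user": {"name": "..."}, "text": "..."}
        name = user.get("name", "") if isinstance(user, dict) else ""
        counts[classify(name)] += 1
    return counts
```

The resulting counts could be passed to whichever plotting library is chosen for the graphical report.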

Problem Statement 2:

People are often frustrated to find spam in their mailbox when they are expecting an important email. Emails containing words such as "won", "rewards", "lottery", and "lucky" are mostly spam, so we would like to use this idea to filter emails. We plan to use Gmail data for streaming.

Instructions:

1. Extract data from a selected Gmail account.

2. Store it in HDFS using the sandbox.

3. Extract the text data from the sandbox and analyze it using NLTK.

4. Filter using keywords such as won, rewards, lottery, jackpot, rebate, mailin rebate, lucky, and winner, and classify each email as spam or not spam.

5. Use Python and NLTK to visualize the filtered data.

6. Save the data in HDFS using the sandbox.

Conditions:

If the mail contains the phrases "won lottery", "won jackpot", "lucky winner", or "winner of the day", it is treated as 100% likely to be spam.

If the mail contains the keywords "lucky" or "rebate", it is treated as 90% likely to be spam.

If the mail contains the keywords "mailin rebate" or "rewards", it is treated as 80% likely to be spam.
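One possible reading of these conditions is a rule table checked from the highest score down, as in the sketch below. The names `RULES` and `spam_score`, the whole-word matching, and the score of 0 for mail matching no rule are assumptions of this example rather than part of the brief.

```python
import re

# (score, phrases) pairs taken from the conditions, highest score first.
RULES = [
    (100, ("won lottery", "won jackpot", "lucky winner", "winner of the day")),
    (90, ("lucky", "rebate")),
    (80, ("mailin rebate", "rewards")),
]


def spam_score(text: str) -> int:
    """Return the highest matching spam percentage, or 0 if nothing matches."""
    # Lowercase and collapse whitespace so phrases match across line breaks.
    normalized = " ".join(text.lower().split())
    for score, phrases in RULES:
        for phrase in phrases:
            if re.search(r"\b" + re.escape(phrase) + r"\b", normalized):
                return score
    return 0
```

Note that under this reading "mailin rebate" always scores 90, because it contains the 90% keyword "rebate". If the 80% rule is meant to take precedence for that phrase, the rule order would need to change.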

Implementation:

We will:

1. Collect the live streaming data.

2. Write it into a text file.

3. Give the text file as input to a Python mapper and reducer.

4. Use NLTK and Python to visualize the data.
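The mapper and reducer in step 3 might look roughly like the sketch below, written in the style Hadoop Streaming expects: the mapper emits tab-separated `key<TAB>1` lines, and the reducer sums counts over input already sorted by key. `map_lines`, `reduce_lines`, and the `classify` callable are hypothetical names; `classify` stands in for whatever labeling rule is used (gender guess or spam check).

```python
from itertools import groupby
from typing import Callable, Iterable, Iterator


def map_lines(lines: Iterable[str], classify: Callable[[str], str]) -> Iterator[str]:
    """Mapper: emit 'label<TAB>1' for every non-empty input line."""
    for line in lines:
        line = line.strip()
        if line:
            yield f"{classify(line)}\t1"


def reduce_lines(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Reducer: sum the counts for each key; input must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{key}\t{total}"
```

In an actual job, each function would sit in its own script that reads `sys.stdin` and prints its results, with the scripts passed to Hadoop Streaming through the `-mapper` and `-reducer` options.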

Note:

The project should be completed in 3 days.

Execution should be demonstrated over TeamViewer.

Hadoop, Natural Language, Python

Project ID: #12250178

About the project

4 proposals · Remote project · Active 7 years ago

4 freelancers are bidding an average of $67 for this job

origami07

Hello, I am an experienced developer with knowledge of machine learning/statistics and text analytics. I have completed work in the areas of finance, news and sports. I recently wrote an academic paper on sentiment an…

$77 USD in 3 days
(6 reviews)
3.1

Valuesolutions

Hello, how are you? I hope you have a bright day/evening from your side. I have read the details provided, but please contact me so that we can discuss more on the project. I believe I have the required skills in this…

$34 USD in 1 day
(3 reviews)
4.3

din7esh

As I am working on the same project and technology, this is similar to my existing project. Also, I am very familiar with Python and Flume, which makes my work easy. As I don't have time to develop this project, I have re…

$25 USD in 2 days
(0 reviews)
0.0