Data wrangling is the process of transforming and mapping data from “raw” data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics.
Collecting Data

Data had to be collected from three data sources:

A file at hand, available as is in the resources tab of the Udacity Nanodegree classroom. It had the major chunk of the data about tweets of the WeRateDogs account from 2015 to 2017.

A file that was to be programmatically downloaded from the Udacity servers, which had the results of the machine learning algorithm performed on the images from the WeRateDogs account.
I downloaded this file using the Python library requests.
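A minimal sketch of that download step, with a placeholder URL standing in for the real link (which was provided in the classroom):

```python
import os

import requests

# Placeholder URL — the real link was given in the Udacity classroom.
url = "https://example.com/image-predictions.tsv"

def download_file(url, folder="."):
    """Download `url` into `folder`, keeping the file's original name."""
    response = requests.get(url)
    response.raise_for_status()  # fail loudly on a bad status code
    path = os.path.join(folder, url.split("/")[-1])
    with open(path, "wb") as f:
        f.write(response.content)
    return path
```

Writing `response.content` in binary mode keeps the file byte-for-byte identical to what the server sent.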
The third source for gathering data was scraping Twitter through its Tweepy API, using the tweet IDs found in the file at hand.
The Tweepy API is an easy-to-use Python library that connects to a Twitter account using secret and public keys. Once authenticated, one can easily scrape tweets off Twitter. To get started, follow the documentation.
For those who are new to web scraping, here’s a very good guide written by a fellow Bertelsmann Scholar.
Overall, I had three files with an ample amount of data to analyze.
Cleaning Data

Of course, since there were three different data sources, there were bound to be problems between the three files.
The task at hand was to find and clean at least 8 data quality and tidiness issues.
I managed to find and clean 12 issues.
Heat map plotting the Null Values in the main data file.
My approach to finding these data issues was to first find basic information about the three data sets.
Then I did a visual analysis, i.e. I simply looked at specific columns of the data and found issues, such as a lot of missing values in a few columns, which I then validated via programmatic analysis, for example using the info() function.
The issues I found ranged from major ones, such as missing values, bad data types, and bad data within the entries, to smaller ones, such as dog stages spread across 4 different columns that could fit into one, and a different number of records within the three data files, including duplication and retweets that had to be removed.
The final list of issues found can be divided into two main types:

Quality issues are found because of dirty data, i.e. data that has issues with its content. Common data quality issues include missing data, invalid data, inaccurate data, and inconsistent data.

Tidiness issues are found due to the structure of the data, which can also be referred to as messy data. A good guide to tidy and untidy data is found here.
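As an example of a tidiness fix, the four dog-stage columns mentioned earlier can be collapsed into one. A sketch with toy data, mimicking how the archive stores the stage name or the string "None" in each column:

```python
import pandas as pd

# Toy frame mimicking the four dog-stage columns in the archive.
df = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo":   ["doggo", "None", "None"],
    "floofer": ["None", "None", "None"],
    "pupper":  ["None", "pupper", "None"],
    "puppo":   ["None", "None", "None"],
})

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Collapse the four indicator columns into a single 'stage' column;
# rows with no stage at all become missing values.
df["stage"] = (
    df[stage_cols]
    .replace("None", pd.NA)
    .apply(lambda row: ", ".join(row.dropna()), axis=1)
    .replace("", pd.NA)
)
df = df.drop(columns=stage_cols)
```

Joining with ", " also keeps the rare rows where a dog was tagged with two stages at once, instead of silently dropping one.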
Analysis

Once I had completed my assessment of the data and fixed all the issues I found, I was left with two clean and tidy data sets: one on tweet data and one on tweet images.
I had a lot of fun with doing analysis on this clean data.
My approach to analysis was to ask a question and then try to answer the question with the data that I had.
The burning question on my mind was: how well did the model perform? A describe() on the model predictions gave the following result:

Running describe on the model predictions.
tweet_id and img_num are to be ignored.
Oddly, there was a maximum probability of 1.0, seen in p1_conf. This means that the model was 100% sure in at least one of its predictions.
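The check itself is a one-liner; a toy version with made-up confidence values standing in for the real p1_conf column:

```python
import pandas as pd

# Made-up confidence values standing in for the real p1_conf column.
p1_conf = pd.Series([0.21, 0.95, 1.0, 0.63, 0.48])

# describe() reports count, mean, std, min, the quartiles, and max —
# the max row is what revealed the suspicious 1.0.
summary = p1_conf.describe()
```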
The following is the record where the prediction next to that probability was 'jigsaw puzzle'.

Row with a 1.0 confidence level. Source: WeRateDogs Twitter

To evaluate whether this was the correct prediction or not, I checked the URL and was led to the image on the left.
The model seems to have completely overlooked the dog in the picture and only considers the jigsaw puzzle which makes up most of the image.
A few things to note here: I have not made the model, nor do I know if any optimization techniques have been run on it, nor do I even know if this was treated as a single-class or multi-class problem.
This is analysis from the data provided.
Thus, based on this information, this model can vastly be improved.
To validate this conclusion, a few other rows of interest, i.e. the ones with a prediction of 'not a dog', were checked.
Out of 2075 entries, there are 832 entries with a chance of the image not being a dog. However, there were many false negatives within these parameters, such as the following, which had the prediction 'shopping cart':

Source: WeRateDogs Twitter.
Obviously, there are major problems with the model that have to be fixed to give better results: the dog in the above picture takes up most of the frame, yet the prediction is still not a dog.
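Filtering for those candidate rows can be sketched as below; the boolean column names (p1_dog, p2_dog, p3_dog) are an assumption based on the p1_conf column discussed above, and the data is a toy stand-in:

```python
import pandas as pd

# Toy stand-in for the image-predictions file; pN_dog flags whether
# the N-th prediction is a dog breed (column names assumed).
preds = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "p1": ["golden_retriever", "shopping_cart", "pembroke", "jigsaw_puzzle"],
    "p1_dog": [True, False, True, False],
    "p2_dog": [True, False, True, False],
    "p3_dog": [True, True, False, False],
})

# Rows where not even one of the three predictions is a dog — the
# candidates for "this image is probably not a dog".
not_dog = preds[~(preds.p1_dog | preds.p2_dog | preds.p3_dog)]
```

Eyeballing a sample of these rows against the actual tweet images is what surfaced the false negatives.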
Top ten names for dogs, apart from None.
The next question: what are the most common dog names? This was easy.
The dog names had been corrected during the data cleaning part.
‘An’, ‘a’, and ‘the’ were all in the list of names and had to be cleaned out before any analysis could be done. The resulting list showed the top ten names people have given their dogs. ‘None’ appears when no name was found.
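The cleaning relied on a pattern in the archive: real names are capitalized, while parsing artifacts like ‘a’, ‘an’ and ‘the’ are lowercase. A sketch with toy names:

```python
import pandas as pd

# Toy name column: real names are capitalized; lowercase entries are
# parsing artifacts ('a', 'an', 'the') that should not count as names.
names = pd.Series(["Charlie", "a", "Lucy", "the", "Charlie", "an", "None"])

# Drop the lowercase artifacts, then count what remains.
cleaned = names[~names.str.islower()]
top_names = cleaned.value_counts()
```

On the real data, `top_names.head(10)` gives the top-ten list shown above, with "None" standing in for nameless dogs.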
Visualization

The last requirement of the project was to create visualizations from the data. A similar approach to the analysis part was used, i.e. a question was asked with hopes of finding an answer via visualization.
Question 1: How did the retweet count and favorite count improve over time? Did they increase as the popularity of the account increased?

Scatter plot of retweets and favorites over time.
An obvious trend is seen: in the beginning, the favorite counts and the retweet counts are at a similar level, yet there are more tweets per unit of time. As 2016 and 2017 progress, the number of tweets per unit of time decreases (seen via the low number of blue and red dots), but the favorite counts and retweet counts become higher and higher. Another trend noticed is that favorite counts increase drastically, going up to 10000 for a few tweets, yet the retweet counts remain below 5000 for the entire duration.
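A scatter plot like the one above can be produced with matplotlib; this sketch uses toy data and assumed column names (timestamp, retweet_count, favorite_count):

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import pandas as pd

# Toy tweet data; the real frame has one row per tweet.
df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2015-11-01", "2016-06-01", "2017-06-01"]),
    "retweet_count": [300, 1500, 4000],
    "favorite_count": [500, 4000, 9500],
})

# One colored dot series per metric, plotted against time.
fig, ax = plt.subplots()
ax.scatter(df.timestamp, df.retweet_count, color="blue", alpha=0.5, label="retweets")
ax.scatter(df.timestamp, df.favorite_count, color="red", alpha=0.5, label="favorites")
ax.set_xlabel("date")
ax.set_ylabel("count")
ax.legend()
fig.savefig("retweets_favorites.png")
```

The alpha transparency helps where many dots overlap in the dense early period.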
Question 2: What types of dogs are there in the tweets?

Top ten dog types predicted.

The major prediction seen is the golden retriever, which seems to be a popular choice for a pet dog, followed by the Pembroke and the Labrador.
It is a mix of small and big dogs!

Conclusion

Since this was primarily a data wrangling project, there were a few major takeaways:

New functions within the pandas library, such as melt() and pivot(), which help combine and de-clutter the data.

Rating Distribution

The importance of visually checking the data you have.
For a lot of people, checking or looking at data manually is a task they struggle with.
However, it was only when I looked at the actual data, and not an abstract version of it, that I found faults in it. It also helped me become an expert on my data set, so by the time I moved on to my analysis, I knew my data set inside out, which is highly recommended by expert data analysts.
Usually, finding duplicates is as easy as using the pandas duplicated() function. However, for this data that did not work! I found duplicates by the URLs, as they were the only unique identifier we had.
Each tweet was supposed to have a unique URL associated with it, so any two records with the same URL meant the underlying tweet was the same and the record had to be removed.
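That de-duplication step can be sketched on toy data; the URL column name (expanded_urls) is an assumption for illustration:

```python
import pandas as pd

# Toy frame: two records share the same tweet URL but differ elsewhere,
# so duplicated() on the whole frame would miss them.
df = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "expanded_urls": [
        "https://twitter.com/dog_rates/status/1",
        "https://twitter.com/dog_rates/status/1",
        "https://twitter.com/dog_rates/status/3",
    ],
})

# De-duplicate on the URL column alone, keeping the first record seen.
deduped = df.drop_duplicates(subset="expanded_urls", keep="first")
```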
How important it is to have an initial analysis to find any and all data issues with the data set.
This analysis helps in further planning and understanding of the data set and helps in optimizing the data cleaning process.
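The melt() and pivot() pair mentioned in the takeaways works roughly like this on toy data: one reshapes wide columns into tidy rows, and the other reverses it:

```python
import pandas as pd

# Wide table: one column per metric (toy data).
wide = pd.DataFrame({
    "tweet_id": [1, 2],
    "retweet_count": [300, 1500],
    "favorite_count": [500, 4000],
})

# melt() turns the two count columns into tidy (metric, count) rows...
tidy = wide.melt(id_vars="tweet_id", var_name="metric", value_name="count")

# ...and pivot() reverses the operation, one column per metric again.
back = tidy.pivot(index="tweet_id", columns="metric", values="count").reset_index()
```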
The data wrangling project was one of the most fun projects that I have done to date.
It is always fun to sit down with a data set and try to gauge what it is telling you.
Along with the fun bit, there was also quite a bit of learning in this project.
It is available on my Github, and I would love to hear what you think of it!