The Data Science course focused on data science and machine learning in Python, so importing the data into Python (I used Anaconda/Jupyter notebooks) and cleaning it seemed like a logical next step.
Speak to any data scientist, and they’ll tell you that cleaning data is a) the most tedious part of their job and b) the part of their job that takes up 80% of their time.
Cleaning is dull, but it is also critical for extracting meaningful results from the data.
I created a folder, into which I dropped all 9 data files, then wrote a little script to cycle through these, import them to the environment and add each JSON file to a dictionary, with the keys being each person’s name.
I also split the “Usage” data and the message data into two separate dictionaries, so as to make it easier to conduct analysis on each dataset separately.
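The loading loop might look something like the sketch below. The folder path, filename convention and the "Usage"/"Messages" keys are assumptions about the export format, not the exact script I used:

```python
import json
from pathlib import Path

# Hypothetical folder containing one exported JSON file per person,
# named e.g. "alice.json" -- adjust the path and naming to your data
DATA_DIR = Path("tinder_data")

all_data, usage, messages = {}, {}, {}
for path in DATA_DIR.glob("*.json"):
    name = path.stem  # use the filename as the person's key
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    all_data[name] = data
    # Split the swipe/app-open stats and the conversations into
    # separate dictionaries for independent analysis
    # (key names here are assumptions about the export structure)
    usage[name] = data.get("Usage", {})
    messages[name] = data.get("Messages", [])
```

Keeping one dictionary per dataset, both keyed by person, makes it easy to line up a person's usage stats with their messages later.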
Problem 4: Different email addresses lead to different datasets

When you sign up for Tinder, the vast majority of people use their Facebook account to log in, but more cautious people just use their email address.
Alas, I had one of these people in my dataset, meaning I had two sets of files for them.
This was a bit of a pain, but overall not too difficult to deal with.
Having imported the data into dictionaries, I then iterated through the JSON files and extracted each relevant data point into a pandas dataframe, looking something like this:

[Table: Usage data with names removed]

[Table: Message data with names removed]

Before anyone gets worried about including the id in the above dataframe, Tinder published this article, stating that it is impossible to look up users unless you're matched with them:

https://www.help.tinder.com/hc/en-us/articles/115003359366-Can-I-search-for-a-specific-person-on-Tinder-

Now that the data was in a nice format, I managed to produce a few high-level summary statistics.
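Flattening the nested per-person JSON into a tidy dataframe might look like this sketch. The `usage` structure and field names (`swipes_likes`, `swipes_passes`) are assumptions about the export, and the sample values are made up:

```python
import pandas as pd

# Hypothetical structure: usage[name][metric] is a {date: count} dict
usage = {
    "alice": {"swipes_likes": {"2019-01-01": 40, "2019-01-02": 25},
              "swipes_passes": {"2019-01-01": 310, "2019-01-02": 280}},
}

rows = []
for name, stats in usage.items():
    for metric, by_date in stats.items():
        for date, count in by_date.items():
            rows.append({"name": name, "date": pd.to_datetime(date),
                         "metric": metric, "count": count})

# One row per person/date/metric -- easy to group, pivot and plot from here
df = pd.DataFrame(rows)
```

A long ("tidy") layout like this makes the later per-user and over-time aggregations one-liners with `groupby`.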
The dataset contained:

- 2 girls
- 7 guys
- 9 participants
- 502 one-message conversations
- 1,330 unique conversations
- 6,344 matches
- 6,750 messages received
- 8,755 messages sent
- 34,233 app opens
- 94,027 right swipes
- 403,149 left swipes

Great, I had a decent amount of data, but I hadn't actually taken the time to think about what an end product would look like.
In the end, I decided that the end product would be a list of recommendations on how to improve one's chances of success with online dating.
And thus, with the data in a nice format, the exploration could begin!

The Exploration

I started off looking at the "Usage" data, one person at a time, purely out of nosiness.
I did this by plotting a few charts, ranging from simple aggregated metric plots, such as the below:

[Chart: aggregated usage metrics]

to more involved, derived metric plots, such as the aptly named 'Loyalty Plot', shown below:

[Chart: the 'Loyalty Plot']

The first chart is fairly self-explanatory, but the second may need some explaining.
Essentially, each row/horizontal line represents a unique conversation, with the start date of each line being the date of the first message sent within the conversation, and the end date being the last message sent in the conversation.
The idea of this plot was to try to understand how people use the app in terms of messaging more than one person at once.
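The span behind each line of the Loyalty Plot is just the first and last message date per conversation. A minimal sketch, with made-up column names and toy data:

```python
import pandas as pd

# Toy message data: one row per message (column names are assumptions)
msgs = pd.DataFrame({
    "conversation_id": [1, 1, 2, 2, 2, 3],
    "sent_date": pd.to_datetime([
        "2019-01-01", "2019-01-10", "2019-01-03",
        "2019-01-04", "2019-02-01", "2019-01-20"]),
})

# Each conversation becomes one horizontal line, from its first
# message date to its last
spans = msgs.groupby("conversation_id")["sent_date"].agg(
    first="min", last="max")

# How many conversations were "active" on a given day, i.e. how many
# people someone was messaging at once
day = pd.Timestamp("2019-01-04")
active = ((spans["first"] <= day) & (spans["last"] >= day)).sum()
```

Plotting one horizontal line per row of `spans` (e.g. with matplotlib's `hlines`) reproduces the chart; counting overlaps, as in `active`, quantifies the "messaging more than one person at once" behaviour directly.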
Whilst interesting, I didn’t really see any obvious trends or patterns that I could interrogate further, so I turned to the aggregate “Usage” data.
I initially started looking at various metrics over time, split out by user, to try to determine any high-level trends:

[Chart: usage metrics over time, by user]

but nothing immediately stood out.
I then decided to look deeper into the message data, which, as mentioned before, came with a handy time stamp.
Having aggregated the count of messages up by day of week and hour of day, I realised that I had stumbled upon my first recommendation.
The First Recommendation:

9pm on a Sunday is the best time to 'Tinder', shown below as the time/date at which the largest volume of messages was sent within my sample.

[Chart: messages sent, by day of week and hour of day]

Here, I have used the volume of messages sent as a proxy for the number of users online at each time, so 'Tindering' at this time will ensure you have the largest audience.
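The day-of-week/hour-of-day aggregation behind this can be sketched in a few lines of pandas; the timestamps below are toy data, not mine:

```python
import pandas as pd

# Toy message timestamps; the real data used every message's sent time
ts = pd.Series(pd.to_datetime([
    "2019-01-06 21:05", "2019-01-06 21:40", "2019-01-06 20:15",
    "2019-01-07 09:00", "2019-01-08 21:30",
]))

# Count messages per (day of week, hour of day) slot
counts = (
    pd.DataFrame({"dow": ts.dt.day_name(), "hour": ts.dt.hour})
    .groupby(["dow", "hour"]).size()
)

# The busiest slot is the best time to be online, by this proxy
peak_dow, peak_hour = counts.idxmax()
```

Pivoting `counts` into a day-by-hour grid and feeding it to a heatmap gives the chart above.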
I then started looking at length of message in terms of both words and letters, as well as number of messages per conversation.
Initially, you can see below that there wasn't much that jumped out (here a 'success' is red):

[Chart: message length distributions, successes in red]

But once you start digging, there are a few clear trends:

- longer messages are more likely to generate a success (up to a point)
- the average number of messages into a conversation at which a 'success' is found is 27, with a median of 21
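The length and per-conversation message-count comparisons can be sketched like this, with toy data and assumed column names (`success` marking conversations that led to a number or date):

```python
import pandas as pd

# Toy per-message data; column names are assumptions
msgs = pd.DataFrame({
    "conversation_id": [1, 1, 1, 2, 2],
    "text": ["hey", "fancy a drink sometime?", "great, Thursday?",
             "hi", "u ok"],
    "success": [True, True, True, False, False],
})

# Length in words and in letters, per message
msgs["words"] = msgs["text"].str.split().str.len()
msgs["letters"] = msgs["text"].str.len()

# Average message length, split by conversation outcome
by_outcome = msgs.groupby("success")["words"].mean()

# Messages per conversation, for the mean/median-to-success stats
per_conv = msgs.groupby("conversation_id").agg(
    n_messages=("text", "size"), success=("success", "first"))
```

From `per_conv`, the mean and median of `n_messages` over the successful conversations give the "27 on average, median 21" style figures.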
These observations led to my second and third recommendations.
The Second Recommendation:

Spend more time constructing your messages, and for the love of god don't use text speak… generally, longer words are better words.
One caveat here is that the data contains links, which count as long words, so this may skew the results.
The Third Recommendation:

Don't be too hasty when trying to get a number.
‘hey, ur fit, what’s ur number’ is probably the worst thing you can say in terms of your chances.
Equally, don’t leave it too long.
Anywhere between your 20th and 30th message is best.
[Chart: average message count of successful vs unsuccessful conversations]

Having looked into length of word/message/conversation rather extensively, I then decided to look into sentiment.
But I knew absolutely nothing about how to do that.
During the course, we'd covered a bit of natural language processing (bag of words, one-hot encoding, all the pre-processing required etc., along with various classification algorithms), but hadn't touched on sentiment.
I spent some time researching the topic, and discovered that nltk's sentiment.vader SentimentIntensityAnalyzer would be a pretty good shout.
This works by giving the user four scores, based on the percentage of the input text that was:

- positive
- neutral
- negative
- a combination of the three (the 'compound' score)

Luckily, it also deals with things such as word context, slang and even emojis.
As I was looking at sentiment, no pre-processing (lower-casing, removal of punctuation etc.) was done, in order not to remove any hidden context.
I started this analysis by feeding each whole conversation into the analyser, but quickly realised this didn't really work: the conversation sentiment quickly tended to 1 after the first few messages, and I struggle to believe that a 100-message conversation was 100% positive the whole time.
I then split the conversations down into their constituent messages and fed them through one at a time, averaging the scores up to conversation level.
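This per-message scoring, averaged up to conversation level, can be sketched as below. The real analysis used VADER's compound score, i.e. `SentimentIntensityAnalyzer().polarity_scores(msg)["compound"]` from `nltk.sentiment.vader`; to keep this sketch self-contained (no lexicon download), a trivial stand-in scorer with a made-up word list takes its place:

```python
def compound_score(msg: str) -> float:
    # Stand-in scorer: +1 per "positive" word, -1 per "negative" word,
    # scaled to [-1, 1] like VADER's compound score. The word sets are
    # illustrative only -- VADER uses a full lexicon.
    pos, neg = {"love", "great", "fun"}, {"hate", "boring", "awful"}
    words = msg.lower().split()
    raw = sum(w in pos for w in words) - sum(w in neg for w in words)
    return max(-1.0, min(1.0, raw / max(len(words), 1)))

def conversation_sentiment(messages: list) -> float:
    # Score each message separately, then average up to conversation
    # level (feeding a whole conversation in at once saturated to 1)
    scores = [compound_score(m) for m in messages]
    return sum(scores) / len(scores)
```

Swapping `compound_score` for the real VADER call changes nothing structurally; the averaging is what fixed the saturation problem.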
This produced a much more realistic outcome, in my opinion:

[Chart: conversation-level sentiment scores]

Splitting this data up by 'Success' or 'No Success', I quickly saw a pattern emerging:

[Chart: sentiment split by conversation outcome]

This teed up my fourth recommendation.
The Fourth Recommendation:

Be positive, but not too positive. The average sentiment for a successful conversation was 0.31, vs 0.20 for a non-successful conversation. Having said that, being too positive is almost as bad as being too negative.
The final alley I explored was what effect various details about the first message had on the success of the conversation.
Initial thoughts of things that could have an effect were:

- length
- whether a name was used
- sentiment
- presence of emojis
- explicit content

As expected, the longer the first message, the greater the likelihood that the conversation will continue to a 'Success'.
As an extension, you double your probability of success by not using a one-word opener, i.e. not just saying 'hey' or 'hi' or 'daayyuumm' (real example).
Somewhat more surprisingly, using a name in the first message had very little effect on the 'Success Ratio' (no. of successes divided by no. of conversations).
First-message sentiment turned out to be about 0.09 higher for "Successful" conversations than "Unsuccessful" conversations, which wasn't really a surprise… if you insult someone in a first message, they're intuitively less likely to reply.
Analysing emojis was a task I hadn't really thought about, and had the potential to be tricky.
Luckily, a package called ‘emoji’ exists, which automatically picks up the presence of emojis within text.
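In the analysis this was the third-party `emoji` package; as a self-contained approximation, a first-message flag can also be sketched by checking codepoints against the main emoji Unicode blocks (the ranges below are a simplification, not the full emoji spec):

```python
# Rough emoji codepoint ranges -- an approximation of what the
# 'emoji' package detects properly via its emoji database
EMOJI_RANGES = [
    (0x1F300, 0x1FAFF),  # pictographs, emoticons, supplemental blocks
    (0x2600, 0x27BF),    # misc symbols and dingbats
    (0x1F1E6, 0x1F1FF),  # regional indicators (flags)
]

def contains_emoji(text: str) -> bool:
    # Flag a message if any character falls in an emoji block
    return any(lo <= ord(ch) <= hi
               for ch in text for lo, hi in EMOJI_RANGES)
```

Either way, the output is a per-message boolean that can be grouped by conversation outcome, exactly as with the other first-message features.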
Unfortunately, and much to my dismay, it appears using an emoji in a first message increases one’s probability of obtaining a ‘Success’.
Now onto explicit content… another one that had the potential to be quite tricky, as there are no built-in libraries (that I know of) that pick up use of expletives etc.
Luckily, I stumbled upon this:

[Link: a downloadable list of explicit words]

I can assure you, there are some absolute crackers contained within it.
I then checked to see which first messages contained a word from this list, 40 of which did.
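The check itself is a whole-word match of each first message against the list. A sketch with a tiny placeholder word list standing in for the real (much longer, much worse) one:

```python
import pandas as pd

# Placeholder entries standing in for the downloaded expletive list;
# matching on whole words avoids false positives such as "classic"
# containing "ass"
explicit_words = {"damn", "hell"}

# Toy first messages (the real ones came from the message dataframe)
first_messages = pd.Series([
    "hey how was your weekend",
    "damn you look good",
    "what the hell is that in your third photo",
])

def is_explicit(msg: str) -> bool:
    return any(w in explicit_words for w in msg.lower().split())

flagged = first_messages[first_messages.apply(is_explicit)]
```

Splitting on whitespace before matching is what surfaces the edge cases: a word from the list only counts when it appears on its own, not buried inside another word.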
As is always the case with things like this, I found some interesting edge cases:

[Screenshot: FYI, this was a bloke talking about his rowing leggings…]

Results? It turns out that none of the first messages that contained explicit content led to a 'Success'. This led me to my fifth and final recommendation.
The Fifth Recommendation:

When sending a first message:

- Be positive
- 8 words is optimal
- Use an emoji or two
- Don't be explicit

SO TO SUM UP

- Use Tinder at 9pm on a Sunday for maximum audience
- Spend time constructing messages and don't use text speak
- Prepare to ask for a number or a date between the 20th and 30th message
- Be positive, but not too positive
- Send something other than 'hey' as a first message, aim for around 8 words, maybe use an emoji, and don't be explicit

A few pitfalls of the data:

- My dataset is a very, very small sample, rendering most insights useless
- The dataset is biased towards the type of people I know, as well as being biased towards men
- The dataset only contains one side of the conversation
- The message and usage stats don't necessarily line up, due to users uninstalling and reinstalling the app
- No NLP technique will be perfect, due to sarcasm/variations in the way people speak

A few ideas for future work:

- Gather more data
- Do more to determine statistically significant results vs observations
- Look into conversation analysis by topic: what type of messages make up the good and bad sentiment
- Try to look into sarcasm
- Investigate other apps (Bumble, Hinge etc.)
- Some sort of classification analysis, if more data were included, as we only had 70-ish successes
- Look more into gender splits, if more data were included

A few interesting factoids from the data:

- Most swipes by a single person in a single day: 8,096
- Guys are more likely to leave a long time (7-ish days) before sending a second message
- Asking a question in a first message actually decreases your chance of a success
- Women swipe right on average 1% of the time, whereas men do so ~50% of the time
- Per app open, women swipe 3x as many times as men

Further reading:

- A paper was published called 'A First Look at User Activity on Tinder', link here
- There is a Tinder API, but unfortunately it is only for people using the app, rather than giving access to a database of some kind.
Anyhow, using it to test certain hypotheses could be interesting.
Tinderbox is a piece of software that can learn who you’re attracted to via dimensionality reduction.
It also has a chatbot built in, if you really want to automate the process…

Thanks for reading, any ideas for future work would be much appreciated!