" as the splitting character.
Lastly, we’ll convert the return value of split to a list using toList and show the number of lines.
You can go back to the ScalaFiddle widget and modify our processing code to be as follows.
You’ll see that we have roughly 11,000 different Reddit posts in this data.
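As a rough sketch of this counting step, assume the raw text sits in a string named rawData (a hypothetical name) with one post per line, and that "\n" is the splitting character. The two sample lines are taken from the output shown later in this article.

```scala
// Hypothetical stand-in for the raw data: one post per line.
val rawData: String =
  "Cuphead,[deleted],Just finished Cuphead in an hour and fifteen minutes.,0\n" +
    "CryptoCurrency,Pseudoname87,Why does binance show a different price than other sites?,1"

// Split on the newline character, then convert the resulting Array to a List.
val lines: List[String] = rawData.split("\n").toList

// Show the number of lines, which is the number of posts.
println(lines.length)
```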
Now we still have some more work to do in preparing our data because each Reddit submission contains multiple attributes.
Let’s look at a few examples by modifying our processing code in ScalaFiddle to be:

Note we’re using the list methods take and foreach to create a sublist with just the first five elements and to call println on each post, respectively.
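A minimal sketch of take and foreach together, using a made-up list of placeholder strings in place of the real lines (the names lines and firstFive are assumptions for illustration):

```scala
// Hypothetical placeholder lines standing in for the raw posts.
val lines: List[String] = (1 to 7).toList.map(i => s"post $i")

// take(5) builds a new list holding only the first five elements.
val firstFive: List[String] = lines.take(5)

// foreach calls println once for each element of that sublist.
firstFive.foreach(println)
```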
Here’s the output.
MarioAI,nicolasrene,recreation of Mario facing left since i havent got the original picture not the real thing but he did face left once in 1-2.,1
u_seksualios,seksualios,Seksuali gundanti ištvirkėlė Alektra Blue juodais drabužiais,1
Cuphead,[deleted],Just finished Cuphead in an hour and fifteen minutes.,0
videos,lonemonk,Trump On The Traps – Calvin Dick (2017),1
CryptoCurrency,Pseudoname87,Why does binance show a different price than other sites?,1

We can see that each post is a comma-separated string of values.
In order, the fields are:

- subreddit: Subreddit to which the post was submitted
- author: The user account that posted the submission
- title: The title of the post
- score: The voting score of the submission at the time the data was pulled

We can again use String.split to access the individual fields of each post, this time using "," as the splitting character.
We’ll want a way to group all of the different fields of a single post, and for that we can create a Scala class.
At a high level, classes let us group related data elements, like the different attributes of a Reddit post.
Here’s how we define a class for this data.
Here’s an example of how we can use the Post class.
We’ll cover classes in more detail in the future.
For now, it’s good enough to know that this class defines a new data type to represent Reddit posts, with each post having four attributes: subreddit, author, title, and score.
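A class with these four attributes might be sketched as follows. Using a case class is an assumption on my part; the field names come from the list of fields above, and the sample values come from one of the posts shown earlier.

```scala
// A case class is an assumption; it gives us a constructor, field accessors,
// and a readable toString without extra code.
case class Post(subreddit: String, author: String, title: String, score: Int)

// Create a Post and read back some of its attributes.
val post = Post("Cuphead", "[deleted]", "Just finished Cuphead in an hour and fifteen minutes.", 0)

println(post.subreddit)
println(post.score)
```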
Let’s now modify our processing code to parse our raw data into a list of posts, i.e., create a List[Post].
Now we have a list of posts in var posts: List[Post].
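One way the parsing step could be written is sketched below. The helper name parsePost and the variable rawData are hypothetical, and the sketch assumes every line has exactly four comma-separated fields, so a title that itself contains a comma would not parse correctly.

```scala
case class Post(subreddit: String, author: String, title: String, score: Int)

// Parse one comma-separated line into a Post.
// Assumes exactly four fields per line, with the score as the last one.
def parsePost(line: String): Post = {
  val fields = line.split(",")
  Post(fields(0), fields(1), fields(2), fields(3).toInt)
}

// Hypothetical stand-in for the raw data: one post per line.
val rawData: String =
  "Cuphead,[deleted],Just finished Cuphead in an hour and fifteen minutes.,0\n" +
    "CryptoCurrency,Pseudoname87,Why does binance show a different price than other sites?,1"

// Split into lines, then parse each line into a Post.
val posts: List[Post] = rawData.split("\n").toList.map(parsePost)

posts.foreach(println)
```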
So let’s start exploring this Reddit post data!

First, let’s just count the number of posts for a single subreddit.
You can add the following code to our analysis function in the last ScalaFiddle widget.
Here we’re using the filter method of List to create a new list that just contains the posts that are from the subreddit “AskReddit”.
In general, we can use filter with functions we write to select elements of a List and thereby create a new List that just contains those elements.
Feel free to modify the code to count the number of posts in our sample for any subreddit you’re interested in.
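A small sketch of counting with filter, assuming a Post case class with the four fields described earlier. The three posts here are invented for illustration.

```scala
case class Post(subreddit: String, author: String, title: String, score: Int)

// Hypothetical sample posts.
val posts: List[Post] = List(
  Post("AskReddit", "user_a", "Hypothetical question one", 1),
  Post("videos", "user_b", "Hypothetical video", 1),
  Post("AskReddit", "user_c", "Hypothetical question two", 3)
)

// filter keeps only the elements for which the given function returns true.
val askRedditPosts: List[Post] = posts.filter(post => post.subreddit == "AskReddit")

println(askRedditPosts.length)
```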
Building off that exercise, let’s compute the number of posts for every single subreddit.
In this computation, we’ll build a data structure that associates the number of posts with each subreddit name.
This will require us to introduce a new Scala data structure called Map.
Map[K, V] is an association from keys of type K to values of type V.
In this exercise, we’ll be building a Map[String, Int] to associate each subreddit name with its count of posts.
Here are some examples of how we can work with such a map.
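A minimal sketch of a few basic operations on an immutable Map[String, Int]. The keys and counts are made up for illustration.

```scala
// Build a small immutable map from subreddit name to post count.
val counts: Map[String, Int] = Map("AskReddit" -> 2, "videos" -> 1)

// updated returns a new map; the original map is left unchanged.
val updatedCounts: Map[String, Int] = counts.updated("Cuphead", 1)

println(counts.contains("Cuphead"))  // false: the original has no such key
println(updatedCounts("Cuphead"))    // 1

// get returns an Option: Some(value) if the key is present, None otherwise.
println(counts.get("videos"))        // Some(1)
println(counts.get("missing"))       // None
```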
If you’re familiar with other programming languages, you may be surprised to see that Map.updated returns a new Map.
In general, Scala encourages us to use immutable data structures and therefore we avoid modifying anything in place.
Instead, we create new data structures to represent the results of any change.
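Scala’s immutable collections make such updates efficient in both processing time and memory usage through structural sharing: behind the scenes, the new map reuses most of the old map’s internal structure instead of copying it.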
Note that Map[K, V].get(key) returns an Option[V].
Option is used to account for the fact that it’s possible we don’t have a value for the given key.
Option is a general data type that can take two forms, Some(value) and None.
Some(value) denotes that we do have a value for the key and we can access that value through Some.
None, by contrast, denotes that there is no value for the key.
Option has a very useful method getOrElse(alternative).
When called on Some, it returns the value contained in Some and ignores alternative.
When called on None, getOrElse(alternative) returns the value of alternative.
(Option also has a method named orElse, but that one returns another Option rather than the unwrapped value.)
We’ll use this method in our code to replace None with zero when we fetch the count for a subreddit that we don’t yet have a count for.
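As a small sketch of replacing a missing count with zero, here is the standard library’s Option.getOrElse applied to the result of Map.get. The map contents and the variable names are assumptions for illustration.

```scala
val counts: Map[String, Int] = Map("AskReddit" -> 254)

// get returns Some(254) for a known key and None for an unknown one.
val known: Option[Int] = counts.get("AskReddit")
val unknown: Option[Int] = counts.get("scala")

// getOrElse unwraps Some, and falls back to the given default for None.
val knownCount: Int = known.getOrElse(0)
val unknownCount: Int = unknown.getOrElse(0)

println(knownCount)
println(unknownCount)
```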
Using Map, here is how we can compute the number of posts for each subreddit.
Note how we’re using foldLeft, a variant of fold, to process each post and increment the count for the corresponding subreddit in the map.
You may recall that fold is used to aggregate all values in a list down to a single value.
The folding function is called for each value in the list and in each call, the folding function also receives the current value of the aggregation.
Our folding function returns the updated value of the aggregate.
Fold methods such as foldLeft return the final value of the aggregation after processing every element in the list with our folding function.
At the end, we get a Map[String, Int] that contains the post count for each subreddit.
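The counting fold described above might be sketched like this, assuming a Post case class and a handful of invented posts. The name subredditCount matches the name used later in this article; everything else is a hypothetical choice.

```scala
case class Post(subreddit: String, author: String, title: String, score: Int)

// Hypothetical sample posts.
val posts: List[Post] = List(
  Post("AskReddit", "user_a", "Hypothetical question one", 1),
  Post("videos", "user_b", "Hypothetical video", 1),
  Post("AskReddit", "user_c", "Hypothetical question two", 3)
)

// Start from an empty map and, for each post, return a new map in which
// the count for that post's subreddit has been incremented by one.
val subredditCount: Map[String, Int] =
  posts.foldLeft(Map.empty[String, Int]) { (counts, post) =>
    val current = counts.get(post.subreddit).getOrElse(0)
    counts.updated(post.subreddit, current + 1)
  }

println(subredditCount)
```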
Spend some time reviewing this code to see if you can reason through how we’re computing the number of posts for each subreddit.
As an exercise, can you modify this example code to instead count the number of posts for each user?

While the current results are nice, we’re more interested in knowing the results for the top subreddits with the most posts in this sample of Reddit posts.
To that end, we’ll need a way to order the subreddits by the number of posts so that we can select the top few subreddits to show.
In computer science terminology, such a process is referred to as sorting.
You can add the following code to the preceding example to sort the subreddits by post count and then show the top 10 in this sample of Reddit posts.
There are a few new things going on here.
First, we’re converting subredditCount from Map[String, Int] to List[(String, Int)] using the Map.
This introduces a new concept called tuples in that (String, Int) is the type for a length-two tuple where the first element is a string and the second element is an integer.
Tuples are a general data type in Scala that can be used to represent a fixed-length collection of elements in which each position has a fixed type.
For example, (String, String, String, Int) is a length-four tuple.
We could’ve used this tuple type instead of the class Post to represent the data in a single Reddit post.
In general, classes are a more legible way to group together related elements.
Tuples can be useful in some cases, particularly in cases where we want to write generic algorithms that use placeholder types.
This is the case in wanting a general method to convert a Map[K, V] into a list of associated pairs, List[(K, V)].
Next, we’re using the method sortBy to sort our list of tuples.
The method takes a function that computes a ranking score for each element of the list.
The elements of the list are sorted by score and a new list is returned by sortBy in which the elements are ordered.
You can see that our ranking function just fetches the count for each subreddit by accessing the second element of the tuple, Tuple.
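A sketch of the sorting step, assuming the conversion is done with toList and the second tuple element is read with _2. Because sortBy orders elements from lowest to highest score by default, this sketch negates the count to put the largest counts first; that is one option among several (reversing the sorted list is another). The four counts are taken from the results shown in this article.

```scala
val subredditCount: Map[String, Int] =
  Map("AskReddit" -> 254, "videos" -> 64, "The_Donald" -> 84, "CryptoCurrency" -> 71)

// Convert the map to a list of (name, count) tuples, sort by descending count,
// and keep only the first three entries.
val topSubreddits: List[(String, Int)] =
  subredditCount.toList.sortBy(pair => -pair._2).take(3)

topSubreddits.foreach(println)
```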
The results for our sample of Reddit posts are as follows.
(AskReddit,254)
(AutoNewspaper,214)
(The_Donald,84)
(CryptoCurrency,71)
(SteamTradingCards,69)
(RocketLeagueExchange,65)
(newsbotbot,65)
(videos,64)
(GlobalOffensiveTrade,59)
(PewdiepieSubmissions,58)

In thinking about the numbers, we should remember that this is a small random sample of all Reddit posts so the counts are going to be much smaller than the full number of posts.
From a statistical perspective, the numbers are still useful because the sample still tells us which subreddits are submitted to more frequently.
With my passing familiarity with Reddit, I’d say these results seem consistent with my intuition about popular subreddits like “AskReddit”.
Can you modify your earlier exercise code that computes the number of posts per author so that the results are sorted? Who are the top authors in this sample of Reddit posts? Are those results reasonable?

Next, let’s see if our sample includes any posts with Scala in the title.
You can add the following snippet to the previous ScalaFiddle widget to answer this question.
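A keyword search over titles might be sketched as follows, assuming a simple case-sensitive substring check with String.contains. The two posts with “Hypothetical” in the title are invented for illustration.

```scala
case class Post(subreddit: String, author: String, title: String, score: Int)

// Sample posts; the first two are made up.
val posts: List[Post] = List(
  Post("scala", "user_a", "Hypothetical: Learning Scala with Reddit data", 5),
  Post("travel", "user_b", "Hypothetical: La Scala opera house in Milan", 2),
  Post("Cuphead", "[deleted]", "Just finished Cuphead in an hour and fifteen minutes.", 0)
)

// Keep only the posts whose title contains the text "Scala".
val scalaPosts: List[Post] = posts.filter(post => post.title.contains("Scala"))

scalaPosts.foreach(println)
```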
What do you think about these posts? Is every one of them about Scala or is there a deficiency in using this heuristic to identify relevant posts?

What other words are interesting to you? Modify the code as you’d like to look for other posts that have certain keywords.
In many ways, we’re building a simple, custom search engine to find posts relevant to our interests across this small sample.
Note, I myself have already discovered a non-trivial amount of obscene language.
As an exercise, you could write some Scala code to compute the frequency of curse words in the titles for each subreddit to identify the subreddits with the highest frequency of obscenity.
I’m not including example code for this because I don’t want to have a list of curse words on my blog.
In general, we’d be interested in computing the frequency of different words in post titles across each subreddit.
Here’s some moderately sophisticated code that accomplishes such an analysis.
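One way such an analysis could be sketched is shown below. The helper words, the name wordFrequency, and the choice to lowercase titles and split on anything that isn’t a letter or digit are all assumptions on my part, and the second Cuphead post is invented.

```scala
case class Post(subreddit: String, author: String, title: String, score: Int)

val posts: List[Post] = List(
  Post("Cuphead", "[deleted]", "Just finished Cuphead in an hour and fifteen minutes.", 0),
  Post("Cuphead", "user_a", "Hypothetical: finished Cuphead again", 2),
  Post("CryptoCurrency", "Pseudoname87", "Why does binance show a different price than other sites?", 1)
)

// Lowercase a title and split it into words on any run of characters
// that are not lowercase letters or digits.
def words(title: String): List[String] =
  title.toLowerCase.split("[^a-z0-9]+").toList.filter(_.nonEmpty)

// For each subreddit, count how often each word appears in its titles,
// with the most frequent words first.
val wordFrequency: Map[String, List[(String, Int)]] =
  posts.groupBy(_.subreddit).map { case (subreddit, subredditPosts) =>
    val counts = subredditPosts
      .flatMap(post => words(post.title))
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.length) }
      .toList
      .sortBy { case (_, count) => -count }
    (subreddit, counts)
  }

wordFrequency.foreach(println)
```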
Take some time and read through this example.
It’s a good review of many of the concepts we’ve considered so far in this and past articles.
As a reminder, you can click “Edit on ScalaFiddle” on the widget to open the example in a separate window, which avoids the widget’s horizontal compression and makes the code easier to read.
The example code includes some concepts we haven’t yet explored.
If you’d like, you could explore these concepts on your own — ahead of our shared journey through Scala — using resources at scala-lang.
In general, that website is a great place to learn about Scala concepts.
And, of course, a general Google search can also turn up some useful resources, including StackOverflow questions and answers.
One thing I’d like to explain at present is how this code example uses pattern matching in two places to deconstruct data structures.
In these cases, we’re accessing the elements of a tuple through pattern matching deconstruction.
Here’s an isolated example of deconstructing tuples in a function.
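A minimal sketch of this kind of deconstruction: a function value whose body is a pattern-matching block rather than a named parameter. The name describe and the message format are hypothetical.

```scala
// The function takes a single (String, Int) tuple; the case pattern
// binds the tuple's two elements to the names subreddit and count.
val describe: ((String, Int)) => String = {
  case (subreddit, count) => s"$subreddit has $count posts"
}

println(describe(("AskReddit", 254)))

// The same pattern used in an anonymous function passed to map.
val labels: List[String] =
  List(("videos", 64), ("The_Donald", 84)).map { case (name, count) => s"$name: $count" }

labels.foreach(println)
```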
You can see we’ve defined the function in a non-standard fashion.
In general, we’d primarily use this pattern in anonymous functions, which is what we’ve done in the word frequency example.
In addition to analyzing the Scala code, what do you think of these results? Do the word frequency results seem appropriate given your knowledge of Reddit? Are there any surprising results in highly frequent words for certain subreddits?

You can modify and extend these examples however you’d like to compute anything that interests you about Reddit posts.
Here are some ideas for things you might want to compute for this sample of Reddit posts.
- What posts have the highest score?
- Which Redditors have the highest average scores?
- Who are the most prolific-posting Redditors in each subreddit?
- What words are frequent in high scoring posts? Versus what words are frequent in low scoring posts?
- Which words have a generally low frequency across all posts, but a high frequency in specific subreddits? I.e., what words are uniquely characteristic of a given subreddit? We can quantify this by taking the ratio of subredditWordFrequency/generalWordFrequency for each subreddit/word pair and looking for high ratios. (This one is a good challenge to further develop your Scala proficiency.)

I hope you’re enjoying applying Scala to analyze Reddit and learn a bit about Redditors.
In the near future, I’ll be showing you how to perform this analysis on the full set of Reddit posts for the month of October 2018.
There are 11,306,843 posts in this month so we’ll need to learn how to apply Scala using the “big data” technology Spark.
I’m gonna see if I can find a way to do that through the browser.
You’ll be surprised to see how the code we write in those examples is no more complicated than the code we’ve written in today’s exercises.
Processing “big data” can be just as easy as small data when we use powerful technologies like Spark and Scala.
And thank you for working through another series of Scala exercises with me.
I hope this one has been particularly fun because we’re getting to learn about real-world data.
I’ll do what I can to create more exercises with this structure.
I’d like to thank pushshift.io for hosting Reddit data dumps.
This is a really interesting source of data and it was easy for me to take a small sample from the full dataset for October 2018 for these examples and exercises.
You can download this data yourself and write your own Scala code to start performing more sophisticated analyses.