At this point it was time to get to work.
Dataset
I was looking specifically for r/MachineLearning submissions, so the only option in my mind was to scrape titles directly from this subreddit.
There are two great APIs for this: PRAW and Pushshift.
I decided to go with the latter, since it let me scrape submission titles from any time period.
All you need to do is build a URL with the required search parameters.
In my case, I had to specify the time period and the subreddit.
Also, to get both high- and low-scoring submissions, I sorted results by score in descending order and asked for a batch of 500 posts from each period.
The Pushshift API is limited to 200 requests per minute, but I went with one request per second, just to be on the safe side.
This way, I scraped post titles and scores from the last three years.
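As a rough sketch of the URL-building step described above: the endpoint and parameter names below follow the Pushshift API as it was publicly documented, and the weekly window size and the helper names `build_url` and `time_windows` are my assumptions for illustration, not the original scraper.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

# Assumption: endpoint and parameter names as documented by Pushshift;
# check the current documentation before relying on them.
BASE_URL = "https://api.pushshift.io/reddit/search/submission/"


def build_url(subreddit, after, before, size=500):
    """Build a search URL for one time window, highest-scoring posts first."""
    params = {
        "subreddit": subreddit,
        "after": int(after.timestamp()),    # window start, epoch seconds
        "before": int(before.timestamp()),  # window end, epoch seconds
        "sort_type": "score",
        "sort": "desc",
        "size": size,
    }
    return BASE_URL + "?" + urlencode(params)


def time_windows(start, end, step):
    """Yield consecutive (after, before) pairs covering [start, end)."""
    current = start
    while current < end:
        yield current, min(current + step, end)
        current += step


# Hypothetical example: three one-week windows in January 2018.
start = datetime(2018, 1, 1, tzinfo=timezone.utc)
end = datetime(2018, 1, 22, tzinfo=timezone.utc)
urls = [build_url("MachineLearning", a, b)
        for a, b in time_windows(start, end, timedelta(days=7))]
```

Each URL would then be requested in turn, with a one-second pause (for example `time.sleep(1)`) between requests to stay well under the rate limit.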
Preprocessing
This is always the most time-consuming part of any ML/DL project, but it can make or break your model.
My aim was to build a binary classifier that would work as a recommendation system and suggest submissions worth reading, so I split the dataset into two categories: submissions with 10 or more votes (labeled 1) and the rest (labeled 0).
These binary labels would become my target variable.
However, this setup left my dataset unbalanced, with 83% of observations in class 0 and 17% in class 1.
[Figure: Number of observations in each class]
To fix this, I used simple majority downsampling to match the minority class.
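A minimal sketch of the labeling and downsampling step, using only the standard library. The function names, the fixed random seed and the toy data are assumptions made for illustration; the original code may well have used pandas or another tool.

```python
import random


def label(score, threshold=10):
    """Return 1 for submissions with `threshold` or more votes, else 0."""
    return 1 if score >= threshold else 0


def downsample_majority(rows, seed=0):
    """Keep every minority-class row plus an equally sized random sample
    of the majority class. `rows` is a list of (title, label) pairs."""
    positives = [r for r in rows if r[1] == 1]
    negatives = [r for r in rows if r[1] == 0]
    minority, majority = sorted([positives, negatives], key=len)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced


# Toy data (made up): 17 high-scoring and 83 low-scoring titles.
scores = [25] * 17 + [3] * 83
rows = [(f"title {i}", label(s)) for i, s in enumerate(scores)]
balanced = downsample_majority(rows)
```

With the toy data, `balanced` holds 34 rows, half of them from each class.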
All observations from the minority class were retained, and the majority class was resampled to get a final ratio of 50:50.
Now for the fun part.
Each submission title in the dataset looked like this:
[D] What is the best ML paper you read in 2018 and why?
Notice the tag at the beginning of the sentence.
As regular readers of r/MachineLearning know, this falls under rule number 3 of the subreddit, which states:
Posts without appropriate tag will be removed.
The currently available tags are:
“[Discussion]”, “[D]”
“[News]”, “[N]”
“[Research]”, “[R]”
“[Project]”, “[P]”
My initial thought was just to tokenize the words in each title and then use a Word2Vec embedding as the first layer of the network, but seeing this, I wondered whether an additional feature could improve accuracy (spoiler alert: yes) at relatively low cost (remember, we have to run this on a Pi!).
To encode tags as features, I used regular expressions to extract the text within square brackets and then normalized it to check whether the tag was valid.
All incorrect or missing tags were encoded as ‘[X]’.
After that, I removed all tags from the original titles.
As I mentioned, my plan was to use Word2Vec embeddings, so the final step of preprocessing was to tokenize the words in each title.
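The tag handling and tokenization described above could be sketched like this. The alias table, the regular expressions and the helper names `split_tag` and `tokenize` are assumptions for illustration; in particular, only the first bracketed tag is inspected, and the tokenizer keeps the original casing.

```python
import re

# Long and short tag forms from the subreddit rules, mapped to one code.
TAG_ALIASES = {
    "discussion": "D", "d": "D",
    "news": "N", "n": "N",
    "research": "R", "r": "R",
    "project": "P", "p": "P",
}
TAG_PATTERN = re.compile(r"\[([^\]]*)\]")


def split_tag(title):
    """Return (tag, title_without_tags); invalid or missing tags become '[X]'."""
    match = TAG_PATTERN.search(title)
    code = TAG_ALIASES.get(match.group(1).strip().lower()) if match else None
    tag = f"[{code}]" if code else "[X]"
    # Remove every bracketed chunk, then collapse leftover whitespace.
    cleaned = TAG_PATTERN.sub(" ", title)
    return tag, " ".join(cleaned.split())


def tokenize(title):
    """Split a title into word tokens, keeping the original casing."""
    return re.findall(r"[A-Za-z0-9']+", title)


tag, text = split_tag("[D] What is the best ML paper you read in 2018 and why?")
tokens = tokenize(text)
```

Here `tag` is `"[D]"` and `text` is the title with the tag stripped; a title such as `"[Meta] hi"` or one without brackets is encoded as `"[X]"`.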
Network
With the data prepared, I could finally get to the neat parts of this project: training the network and running it on the Pi.
First things first: the neural net.
As I mentioned before, I wanted to use Word2Vec embeddings and decided to reuse weights from a pretrained model, like this one from Google.
With vector representations of 3 million words, it was just a matter of building an embedding layer: a lookup table that transforms the words tokenized in the preprocessing stage into vector representations, with weights transferred from Google’s model.
This layer would then be followed by a GlobalAveragePooling layer, a fully connected layer and dropout.
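To make the embedding lookup and the pooling step concrete, here is a small pure-Python illustration of what those two layers compute. The 3-dimensional vectors are made up, and mapping unknown words to a zero vector is an assumption; the real layer would hold the pretrained Word2Vec weights and run inside the network.

```python
# Toy lookup table with made-up 3-dimensional vectors.
EMBEDDINGS = {
    "best": [1.0, 0.0, 2.0],
    "ML": [0.0, 2.0, 2.0],
    "paper": [2.0, 1.0, 2.0],
}
UNKNOWN = [0.0, 0.0, 0.0]  # assumption: out-of-vocabulary words map to zeros


def embed(tokens):
    """Embedding layer: replace each token with its vector."""
    return [EMBEDDINGS.get(token, UNKNOWN) for token in tokens]


def global_average_pool(vectors):
    """Average the word vectors position-wise into one title vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


title_vector = global_average_pool(embed(["best", "ML", "paper"]))
# title_vector == [1.0, 1.0, 2.0]
```

Whatever the title length, pooling yields a single fixed-size vector, which is what lets a plain fully connected layer follow it.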
But this is just half the story.
I still had the feature vector with encoded tags.
It would be fed into the network through a separate input, followed by a fully connected layer with LeakyReLU activation and dropout.
The two tensors would be concatenated and fed into the final layer.
I trained this setup with early stopping for 8 epochs, which resulted in ~80% accuracy on the test set.
Not bad for a model running only on submission titles and encoded tags.
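For readers unfamiliar with early stopping, the idea can be sketched in a few lines: training halts once the validation loss has failed to improve for a set number of epochs. The patience value and the loss numbers below are made up for illustration and are not the values from this project.

```python
def stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)  # never triggered: train for all epochs


# Made-up validation losses: improvement stalls after epoch 6.
losses = [0.69, 0.60, 0.55, 0.52, 0.50, 0.49, 0.50, 0.51, 0.53]
stopped_at = stopping_epoch(losses)
# stopped_at == 8
```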
At this point my plan was to convert this model into a .tflite flat buffer file using the TensorFlow Lite Converter, but I eventually ran into problems caused by an unsupported operation.
[Figure: Training and validation loss]
Regardless of the conversion failure, I decided to use the .h5 model to see if it worked on the Pi, and to get back to the flat buffer model once I start trying to improve accuracy.
Running this on Pi
I was starting with a Raspberry Pi Zero v1.3 with a fresh Raspbian install and needed to get the necessary libraries.
Keras, pandas, praw, gensim and h5py did not cause any issues; I could get them with a simple pip3 install. TensorFlow, however, had to be compiled, and there is just not enough memory on the Pi to handle that.
Fortunately, there are pre-built binaries available, and all you have to do is:
sudo apt install libatlas-base-dev
pip3 install tensorflow
I prepared my own code as a simple package (by the way, it’s available on my GitHub), which I could then copy to site-packages and use with a standard import statement.
The last thing to do was to create main.py, which every now and then scans the new page in search of fresh submissions, runs them through the model and outputs recommendations.
To avoid any delays while scraping for new submissions, I decided to switch the API to PRAW.
To use it, you must register your application on Reddit and get your API key.
I used this doc as a guide.
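The scanning loop in main.py could look roughly like the sketch below. It is not the original script: `fetch_new` and `score` are hypothetical placeholders for the Reddit call and the model, and the threshold and polling interval are assumed values. The fakes at the bottom only demonstrate the control flow.

```python
import time


def poll(fetch_new, score, threshold=0.5, interval=60, rounds=None,
         sleep=time.sleep):
    """Yield (title, probability) for unseen submissions the model recommends.

    fetch_new: callable returning (submission_id, title) pairs (hypothetical).
    score:     callable mapping a title to a probability (hypothetical).
    rounds:    number of scans to run; None means run forever.
    """
    seen = set()
    completed = 0
    while rounds is None or completed < rounds:
        for submission_id, title in fetch_new():
            if submission_id in seen:
                continue  # already processed in an earlier scan
            seen.add(submission_id)
            probability = score(title)
            if probability >= threshold:
                yield title, probability
        completed += 1
        if rounds is None or completed < rounds:
            sleep(interval)  # wait before scanning the new page again


# Stand-ins for the real Reddit call and the trained model.
def fake_fetch():
    return [("a1", "[R] Good paper"), ("a2", "[D] Meh")]


def fake_score(title):
    return 0.9 if "Good" in title else 0.1


results = list(poll(fake_fetch, fake_score, rounds=2, sleep=lambda s: None))
```

In a real run, `fetch_new` might wrap PRAW's `reddit.subreddit("MachineLearning").new(limit=...)` listing, and `score` would call the loaded model's prediction on the preprocessed title.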
And here is the result!
Were those recommendations correct?
Yes!
Conclusions
I was really glad to see that the Pi could handle inference, although limited to two submissions at once.
This encourages me to search for better solutions and new ways to optimize my models (tflite conversion is definitely still in my plans).
All in all, working with limited resources was a very valuable learning experience for me.
I hope you enjoyed reading this short write-up as much as I enjoyed writing it.