Github Autocompletion with Machine Learning

The answer is yes, especially if we have some historical data from a GitHub repository.

Problem statement

The question we try to address in this article is: can we create an ML model to suggest the squad and owner of a GitHub work item based upon its title and other characteristics?

Tools

Throughout this article, we will use the R programming language.

The following R packages are required:

```r
suppressWarnings({
  library(tm)
  library(zoo)
  library(SnowballC)
  library(wordcloud)
  library(plotly)
  library(rword2vec)
  library(text2vec)
  library(reshape)
  library(nnet)
  library(randomForest)
})
```

The dataset

GitHub provides different work item characteristics such as the id, title, type, severity, squad, author, state, date, etc.

The title will be our main data source since it is always required and probably has the highest relevance; it’s not hard to imagine that, for example, if the work item title is "Installer fails when trying to deploy Docker instance", it should probably be assigned to the installer squad.

Or, a title such as "Documentation is missing for feature XYZ", suggests that the work item is likely to be assigned to the documentation squad.

Below is a sample of the GitHub dataset.

```r
# Load the dataset from a CSV file
workItems <- read.csv('….csv')

# Show the dataset
show(workItems)
```

Note that both the squad and assignee (i.e., owner), which are the ground truths, are given in the historical data. This means we can approach this as a classification problem.

Now, since the work item title is given as free text, some Natural Language Processing techniques can be used to derive features from it.

Natural Language Processing (NLP) basics

Let us introduce some NLP terminology:

- Our dataset (a collection of work item titles) will be called the corpus.
- Each work item title is a document.
- The set of all distinct words in the corpus is the dictionary.

A very simple way to extract features from free text is to compute term frequency (TF), i.e., count how many times each word of the dictionary appears in each of the documents. The higher the count, the more relevant the word is assumed to be. This results in a document-term matrix (DTM), which has one row per document and as many columns as there are words in the dictionary. Position (i, j) of this matrix represents how many times word j appears in title i.
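To make the DTM concrete, here is a minimal sketch (in Python rather than R, purely for illustration; the two titles are made up for the example):

```python
from collections import Counter

# Toy corpus: two made-up work item titles (already lower-cased and tokenized)
docs = ["installer fails docker deploy", "documentation missing installer"]

# Dictionary: all distinct words in the corpus
dictionary = sorted({w for d in docs for w in d.split()})

# DTM: one row per document, one column per dictionary word;
# entry (i, j) counts occurrences of word j in document i
dtm = [[Counter(d.split())[w] for w in dictionary] for d in docs]

print(dictionary)  # ['deploy', 'docker', 'documentation', 'fails', 'installer', 'missing']
print(dtm)         # [[1, 1, 0, 1, 1, 0], [0, 0, 1, 0, 1, 1]]
```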

You can immediately see that the resulting feature set will be very sparse (i.e., it has lots of zero values), as there may be thousands of words in the dictionary but each document (i.e., title) will only contain a few dozen of them.

A common issue with TF is that words such as “the”, “a”, “in”, etc., tend to appear very frequently yet may not be relevant. This is why TF-IDF instead normalizes the frequency of a word in a document by dividing it by a function of the word’s frequency in the entire corpus.

In this way, the most relevant words will be the ones that appear in the document but are not common in the entire corpus.
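As a back-of-the-envelope illustration (in Python with a toy two-document corpus; the exact weighting and normalization used by text2vec differ in the details), a word that appears in every document gets an IDF of log(N/df) = log(1) = 0, so its TF-IDF weight vanishes no matter how frequent it is:

```python
import math

# Toy corpus of tokenized titles (made up for the example)
docs = [["the", "installer", "fails"], ["the", "docs", "are", "missing"]]
N = len(docs)

def tf_idf(word, doc):
    tf = doc.count(word) / len(doc)    # term frequency within the document
    df = sum(word in d for d in docs)  # number of documents containing the word
    return tf * math.log(N / df)       # idf = log(N / df)

print(tf_idf("the", docs[0]))        # appears in every document -> weight 0.0
print(tf_idf("installer", docs[0]))  # distinctive word -> positive weight
```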

Data curation

Now, before applying any of the NLP techniques, some text curation is needed. This includes removing stop words (e.g., prepositions, articles, etc.), case, and punctuation, as well as stemming the documents, which refers to reducing inflected/derived words to their base or root form.

The code below performs the required text preprocessing:

```r
preprocess <- function(text) {
  corpus <- VCorpus(VectorSource(tolower(text)))
  corpus <- tm_map(corpus, PlainTextDocument)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeWords, stopwords('english'))
  corpus <- tm_map(corpus, stemDocument)
  data.frame(text = unlist(sapply(corpus, `[`, "content")), stringsAsFactors = FALSE)
}
curatedText <- preprocess(workItems$TITLE)
```

Table 2: Results of text curation after removing stop words, punctuation, and case, as well as stemming the documents.

Feature extraction

The following code creates features by applying TF-IDF to our curated text. The resulting DTM will have one column per word in the dictionary.

```r
# Create a tokenizer
it <- itoken(curatedText$text, progressbar = FALSE)

# Create a vectorizer
v <- create_vocabulary(it) %>%
  prune_vocabulary(doc_proportion_max = 0.1, term_count_min = 5)
vectorizer <- vocab_vectorizer(v)

# Create a document-term matrix (DTM)
dtmCorpus <- create_dtm(it, vectorizer)
tfidf <- TfIdf$new()
dtm_tfidf <- fit_transform(dtmCorpus, tfidf)
featuresTFIDF <- as.data.frame(as.matrix(dtm_tfidf))

# Add a prefix to column names since there could be names starting with numbers
colnames(featuresTFIDF) <- paste0("word_", colnames(featuresTFIDF))

# Append the squad and type to the feature set for classification
featureSet <- cbind(featuresTFIDF, "SQUAD" = workItems$SQUAD, "TYPE" = workItems$TYPE)
```

Now we have a feature set where each row is a work item and each column holds the TF-IDF score of one word. We also have the type of work item (i.e., either a task or a defect) and the ground truth (i.e., the squad).

Text classification

Next, we will create splits for the training and testing sets:

```r
random <- runif(nrow(featureSet))
train <- featureSet[random > 0.2, ]
trainRaw <- workItemsFiltered[random > 0.2, ]
test <- featureSet[random < 0.2, ]
testRaw <- workItemsFiltered[random < 0.2, ]
```

Random Forests

R offers the randomForest package, which allows us to train a Random Forest classifier as follows:

```r
# Train a Random Forest model
> model <- randomForest(SQUAD ~ ., train, ntree = 500)

# Compute predictions
> predictions <- predict(model, test)

# Compute overall accuracy
> sum(predictions == test$SQUAD) / length(predictions)
[1] 0.59375
```

Note that accuracy is below 60%, which is, for most purposes, pretty bad.

However, predicting the exact squad a work item should be assigned to, based upon its title only, is a very challenging task, even for humans.

Therefore, let’s instead provide the user with two or three suggestions of the most likely squads for a given work item.

To this end, let us use the per-class probabilities provided by randomForest.

All we need to do is rank these probabilities and pick the classes with the highest values.
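The ranking idea itself is language-agnostic; here is a sketch in Python (the class names and probabilities are hypothetical, purely for illustration):

```python
# Hypothetical per-class probabilities for one work item
probs = {"installer": 0.41, "docs": 0.22, "core": 0.18, "ui": 0.12, "perf": 0.07}

def top_k(probs, k=3):
    """Return the k class labels with the highest predicted probability."""
    return [label for label, _ in sorted(probs.items(), key=lambda kv: -kv[1])[:k]]

print(top_k(probs))  # ['installer', 'docs', 'core']
```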

The following code does exactly that:

```r
# A function for ranking numbers
ranks <- function(d) {
  data.frame(t(apply(-d, 1, rank, ties.method = 'min')))
}

# Score the Random Forest model and return probabilities
rfProbs <- predict(model, test, type = "prob")

# Compute probability ranks
probRanks <- ranks(rfProbs)
cbind("Title" = testRaw$TITLE, probRanks, "SQUAD" = testRaw$SQUAD, "PRED" = predictions)

# Keep the three top-ranked squads per work item
rfSquadsPreds <- as.data.frame(t(apply(probRanks, MARGIN = 1,
  FUN = function(x) names(head(sort(x, decreasing = FALSE), 3)))))
rfSquadsPreds$SQUAD <- testRaw$SQUAD  # attach the ground truth for evaluation

# Compute the accuracy of any of the two recommendations being correct
> sum(rfSquadsPreds$V1 == rfSquadsPreds$SQUAD |
      rfSquadsPreds$V2 == rfSquadsPreds$SQUAD) / nrow(rfSquadsPreds)
[1] 0.76

# Compute the accuracy of any of the three recommendations being correct
> sum(rfSquadsPreds$V1 == rfSquadsPreds$SQUAD |
      rfSquadsPreds$V2 == rfSquadsPreds$SQUAD |
      rfSquadsPreds$V3 == rfSquadsPreds$SQUAD) / nrow(rfSquadsPreds)
[1] 0.87
```

Note that with two suggestions, the probability of any of them being correct is 76%, while with three suggestions it becomes 87%, which makes the model much more useful.

Other algorithms

We also explored Logistic Regression, XGBoost, GloVe, and RNNs/LSTMs.

However, the results were not significantly better than for Random Forests.

Feature importance

Feature importance (given by XGBoost)

Deployment

To put this model into production, we first need to export (1) the model itself and (2) the TF-IDF transformations. The former will be used for scoring, whereas the latter is needed to extract the same features (i.e., words) that were used for training.

Exporting the assets

```r
# Save the TF-IDF transformations
saveRDS(vectorizer, "vectorizer.rds")

# Save the DTM
saveRDS(dtmCorpus, "dtmCorpus_training_data.rds")

# Save the model
saveRDS(model, "squad_prediction_rf.rds")
```

Docker and plumber

Docker can be a very useful tool to turn our assets into a containerized application.

This will help us ship, build, and run the application anywhere.

As with most software services, an API endpoint is the best way to consume a predictive model. We explored options like OpenCPU and plumber. Plumber seemed simpler yet quite powerful for reading CSV files and running analytics smoothly, hence it was our choice. Plumber’s code style (i.e., using decorators) was also more intuitive, which made it easier to manage endpoint URLs, HTTP headers, and response payloads.

A sample Dockerfile is below:

```dockerfile
FROM trestletech/plumber

# Install required system libraries
RUN apt-get install -y libxml2-dev

# Install the required R packages
RUN R -e 'install.packages(c("tm", "text2vec", "plotly", "randomForest", "SnowballC"))'

# Copy the model and scoring script
RUN mkdir /model
WORKDIR /model

# plumb and run the server
EXPOSE 8000
ENTRYPOINT ["R", "-e", "pr <- plumber::plumb('/model/squad_prediction_score.R'); pr$run(host='0.0.0.0', port=8000)"]
```

A snippet of the scoring file squad_prediction_score.R is below:

```r
x <- c("tm", "text2vec", "plotly", "randomForest", "SnowballC")
lapply(x, require, character.only = TRUE)

# Load the TF-IDF transformations
vectorizer <- readRDS("/model/vectorizer.rds")
dtmCorpus_training_data <- readRDS("/model/dtmCorpus_training_data.rds")
tfidf <- TfIdf$new()
tfidf$fit_transform(dtmCorpus_training_data)

# Load the model
squad_prediction_rf <- readRDS("/model/squad_prediction_rf.rds")

#* @param df data frame of variables
#* @serializer unboxedJSON
#* @post /score
score <- function(req, df) {
  curatedText <- preprocess(df$TITLE)
  df$CURATED_TITLE <- curatedText$text
  featureSet <- feature_extraction(df)
  rfProbs <- predict(squad_prediction_rf, featureSet, type = "prob")
  probRanks <- ranks(rfProbs)
  rfSquadsPreds <- as.data.frame(t(apply(probRanks, MARGIN = 1,
    FUN = function(x) names(head(sort(x, decreasing = FALSE), 3)))))
  result <- list("1" = rfSquadsPreds$V1, "2" = rfSquadsPreds$V2, "3" = rfSquadsPreds$V3)
  result
}

#* @param df data frame of variables
#* @post /train
train <- function(req, df) { ... }

preprocess <- function(text) { ... }

feature_extraction <- function(df) { ... }
```

Now, to run the model against your own repository, you just need to build your own Docker image and hit the endpoints:

```shell
docker build -t squad_pred_image .
docker run --rm -p 8000:8000 squad_pred_image
```

Once the Docker image is running, a sample API call would look like this:

```shell
curl -X POST http://localhost:8000/score \
  -H 'Content-Type: application/json' \
  -H 'cache-control: no-cache' \
  -d '{ "df": [{ "ID": "4808", "TITLE": "Data virtualization keeps running out of memory", "TYPE": "type: Defect" }] }'
```

A sample API call output is below:

```json
{
  "1": "squad.core",
  "2": "squad.performance",
  "3": "squad.dv"
}
```

Try it on your own

Would you like to help your development organization be more productive with GitHub? Give our code a try with your own dataset.

Let us know your results.
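If you prefer scripting the call instead of using curl, the request body can be assembled in Python as well (a sketch, purely for illustration; the endpoint URL and field names come from the example above, and the commented-out request assumes the container is running locally):

```python
import json

# Request body matching the /score endpoint's expected shape
payload = {"df": [{"ID": "4808",
                   "TITLE": "Data virtualization keeps running out of memory",
                   "TYPE": "type: Defect"}]}
body = json.dumps(payload)

# To actually hit the containerized endpoint (requires the server to be running):
# import urllib.request
# req = urllib.request.Request("http://localhost:8000/score", data=body.encode(),
#                              headers={"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read().decode())

print(body)
```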

About the authors

Óscar D. Lara Yejas is a Senior Data Scientist and one of the founding members of the IBM Machine Learning Hub. He works closely with some of the largest enterprises in the world on applying ML to their specific use cases, including healthcare, financial, manufacturing, government, and retail. He has also contributed to the IBM Big Data portfolio, particularly in the large-scale Machine Learning area, being an Apache Spark and Apache SystemML contributor. Óscar holds a Ph.D. in Computer Science and Engineering from the University of South Florida. He is the author of the book “Human Activity Recognition: Using Wearable Sensors and Smartphones” and a number of research/technical papers on Big Data, Machine Learning, human-centric sensing, and Combinatorial Optimization.

Ankit Jha is a Data Scientist working on the IBM Cloud Private for Data platform. He is also part of the platform’s serviceability team and works on log collection and analysis using ML techniques. Ankit is a seasoned software professional who also holds a Master’s in Analytics from the University of Cincinnati.
