R Pokemon Legendary?

Mewtwo [Image [0] Credit: http://pavbca.com]

Akshaj Verma, Feb 9

Here we will train machine learning models to classify Pokemon as legendary or not.

We will use machine learning models such as SVM and XGBoost as well as a deep neural network for binary classification.

We’ll implement the machine learning models using libraries such as Caret and Keras in R.

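Let’s dig in!

Import the libraries.

    # Packages providing the functions called below:
    # dplyr (select, tbl_df), caret (preProcess, dummyVars, createDataPartition,
    # confusionMatrix), e1071 (naiveBayes, svm), randomForest, xgboost, psych (pairs.panels)
    library(dplyr)
    library(caret)
    library(e1071)
    library(randomForest)
    library(xgboost)
    library(psych)
    library(data.table)
    library(Matrix)
    library(keras)

Correct the spelling of the classification column.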

All non-numerical columns are imported as factors instead of characters by default (this holds for R versions before 4.0.0, where stringsAsFactors defaults to TRUE).

    df = read.csv("….csv")
    df = tbl_df(df)
    colnames(df)[25] <- "classification"
    head(df)

First 6 rows of the df [Image [1]]

Pre-processing

Select the required columns.

Select a subset of df as variable classify_legendary and convert the capture_rate into numeric type.
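Converting a factor to numeric has a well-known pitfall in R, so here is a minimal sketch on toy values (hypothetical, not taken from the Pokemon data). Whether capture_rate is affected depends on how read.csv imported it; the sketch only shows what to check for.

```r
# Toy stand-in for a numeric-looking column that was read as a factor
# (hypothetical values, not taken from the Pokemon data).
capture_rate <- factor(c("45", "255", "3", "45"))

# as.numeric() on a factor returns the internal level codes ...
codes <- as.numeric(capture_rate)

# ... while going through as.character() recovers the printed values.
values <- as.numeric(as.character(capture_rate))

print(codes)
print(values)
```

If the two results differ for a real column, the as.character() route is usually the one that matches the values shown in the CSV.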

    classify_legendary = select(df, is_legendary, hp, weight_kg, height_m, speed, attack, defense, sp_attack, sp_defense, type1, type2, generation, capture_rate, experience_growth, percentage_male, base_happiness, base_egg_steps)
    classify_legendary$is_legendary <- as.factor(classify_legendary$is_legendary)
    classify_legendary$generation <- as.factor(classify_legendary$generation)
    # Note: if capture_rate was read as a factor, as.numeric() returns its level codes;
    # as.numeric(as.character(...)) would recover the printed values instead.
    classify_legendary$capture_rate <- as.numeric(classify_legendary$capture_rate)
    head(classify_legendary)

classify_legendary dataframe [Image [2]]

View the number of NA values in the dataframe.


    colSums(is.na(classify_legendary))

Number of NA values in the df [Image [3]]

We observe that NA values are present only in the height_m, weight_kg, and percentage_male columns.

We will replace the NA values in the height and weight columns by 0.

This makes sense because the NA values exist for Pokemon that do not have a discernible height and weight, such as gaseous Pokemon.

For the percentage_male column, we’ll predict the value based on the other attributes using kNN.


    classify_legendary$weight_kg[is.na(classify_legendary$weight_kg)] <- 0
    classify_legendary$height_m[is.na(classify_legendary$height_m)] <- 0
    colSums(is.na(classify_legendary))

NA values handled [Image [4]]

For the percentage_male column, there are 98 missing values.

There are multiple ways to handle missing values.


1. Delete the rows with NA values
We usually do this if we have a lot of data and all the columns are sufficiently represented. This is not the case here, so we won’t do this.

2. Delete the variable (column)
percentage_male might turn out to be an important parameter during prediction. Also, there are only 98 missing values, so dropping the rows with missing values would cost less than dropping the entire column.

3. Impute missing values with the mean/median/mode value of the column
This is a crude way of handling missing values. It works reasonably well only if the variation in the data is very low.

4. Predict the missing values
This is a more advanced method of handling missing values, and it often gives better results than the simpler options above because it uses the other attributes of each observation.

We will predict the missing values using the K Nearest Neighbours (kNN) algorithm.

What kNN imputation does, in simpler terms, is as follows: for every observation to be imputed, it identifies the k closest observations based on Euclidean distance and computes the weighted average (weighted by distance) of these k observations.
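The description above can be sketched in a few lines of base R. This is an illustration under stated assumptions, not caret's implementation: the helper name knn_impute_column, the toy data frame, and the inverse-distance weighting are all hypothetical choices made for the example.

```r
# Minimal kNN imputation for one numeric column (hypothetical helper,
# not caret's implementation).
knn_impute_column <- function(data, target, k = 3) {
  predictors <- setdiff(names(data), target)
  # Scale predictors so no single column dominates the distance.
  x <- scale(as.matrix(data[, predictors, drop = FALSE]))
  y <- data[[target]]
  missing_rows <- which(is.na(y))
  known_rows <- which(!is.na(y))

  for (i in missing_rows) {
    # Euclidean distance from row i to every row with a known target.
    diffs <- sweep(x[known_rows, , drop = FALSE], 2, x[i, ])
    d <- sqrt(rowSums(diffs^2))
    nearest <- order(d)[seq_len(min(k, length(d)))]
    # Inverse-distance weights; the small constant avoids division by zero.
    w <- 1 / (d[nearest] + 1e-8)
    y[i] <- sum(w * data[[target]][known_rows[nearest]]) / sum(w)
  }
  data[[target]] <- y
  data
}

# Toy data with two missing percentage_male values (made-up numbers).
toy <- data.frame(
  hp = c(45, 60, 80, 39, 58, 78),
  speed = c(45, 60, 80, 65, 80, 100),
  percentage_male = c(88.1, 88.1, NA, 88.1, 50, NA)
)
imputed <- knn_impute_column(toy, "percentage_male", k = 2)
print(imputed)
```

Each imputed value is a weighted average of its neighbours' known values, so it always falls within the range of the observed data.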

    # Note: in caret, "knnImpute" also centers and scales the numeric columns.
    pre_process_missing_data <- preProcess(classify_legendary, method = c("knnImpute"))
    # You can also use the "bagImpute" algorithm.
    pre_process_missing_data

Imputed df [Image [5]]

Let’s now use this model to predict the missing values in the classify_legendary data frame.

    classify_legendary <- predict(pre_process_missing_data, newdata = classify_legendary)
    # Check for any NA values in the data frame
    anyNA(classify_legendary)

NAs in classify_legendary [Image [6]]

    head(classify_legendary)

First 6 rows of classify_legendary df [Image [7]]

Next up, we’ll one hot encode the categorical attributes.

We are saving the column that is to be predicted in a variable y.

After one hot encoding, we’ll put this variable back in the data frame.

    y <- classify_legendary$is_legendary
    # One hot encode the columns
    dummies_model <- dummyVars(is_legendary ~ ., data = classify_legendary)
    data_mat <- predict(dummies_model, newdata = classify_legendary)
    classify_legendary_ohe <- data.frame(data_mat)
    classify_legendary_ohe

One Hot Encoded Vars [Image [8]]

At this point, we have our categorical data one-hot-encoded and missing data filled.
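To make the encoding step concrete, here is a small base R sketch of the same idea: set the target aside, expand a factor into indicator columns, then put the target back. It uses model.matrix rather than caret's dummyVars, and the toy data frame is made up for illustration.

```r
# Toy data: one numeric column, one categorical column, and the target.
toy <- data.frame(
  is_legendary = factor(c(0, 0, 1, 0)),
  hp = c(45, 60, 106, 39),
  type1 = factor(c("grass", "fire", "psychic", "fire"))
)

# Set the target aside, as the post does with y.
y <- toy$is_legendary

# "~ . - 1" drops the intercept, so type1 gets one indicator column per level.
ohe <- as.data.frame(model.matrix(~ . - 1, data = toy[, c("hp", "type1")]))

# Put the target back after encoding.
ohe$is_legendary <- y
print(ohe)
```

One difference worth knowing: with several factors, model.matrix gives a full set of indicators only to the first one, whereas dummyVars by default creates a column for every level of every factor.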

We’ll normalize our dataset so that the values are between 0 and 1.

    pre_process_normalize <- preProcess(classify_legendary, method = "range")
    pre_process_normalize_ohe <- preProcess(classify_legendary_ohe, method = "range")
    classify_legendary <- predict(pre_process_normalize, newdata = classify_legendary)
    classify_legendary_ohe <- predict(pre_process_normalize_ohe, newdata = classify_legendary_ohe)
    classify_legendary$is_legendary <- y # add the is_legendary column back into the df
    classify_legendary_ohe$is_legendary <- y # add the is_legendary column back into the df
    head(classify_legendary)

Non One Hot Encoded df [Image [9]]

    str(classify_legendary)

Structure of the dataset [Image [10]]

Train/Test Split

Divide the dataset (both normal and one hot encoded) into train and test.

80% of the data goes into the train set and the remaining 20% into the test set.

You should never standardize/normalize the data and then split it into train and test sets.

Always split the data first, fit the preprocessing on the train set only, and then apply that fitted transformation to both sets; otherwise information from the test set leaks into training (data leakage).

For the purposes of this blogpost however, we’ll do the former to keep things light and breezy.
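For reference, the leak-free order can be sketched in base R as follows. The toy data, the column names, and the hand-rolled stratified split are assumptions for illustration; with caret one would fit preProcess on the train set and call predict on both sets.

```r
set.seed(42)
# Toy data: 16 ordinary and 4 legendary rows (made-up values).
toy <- data.frame(
  is_legendary = factor(rep(c(0, 1), times = c(16, 4))),
  hp = round(runif(20, 30, 120)),
  speed = round(runif(20, 20, 150))
)

# Stratified 80/20 split: sample 80% of the row indices within each class.
train_idx <- unlist(lapply(
  split(seq_len(nrow(toy)), toy$is_legendary),
  function(rows) rows[sample.int(length(rows), floor(0.8 * length(rows)))]
))
train <- toy[train_idx, ]
test <- toy[-train_idx, ]

# Learn the min and max of each numeric column from the train set only.
num_cols <- c("hp", "speed")
mins <- sapply(train[num_cols], min)
maxs <- sapply(train[num_cols], max)

# Apply the train-set range to any data frame.
rescale <- function(df) {
  for (col in num_cols) {
    df[[col]] <- (df[[col]] - mins[[col]]) / (maxs[[col]] - mins[[col]])
  }
  df
}
train_scaled <- rescale(train)
test_scaled <- rescale(test)
```

Because the range comes from the train set, test values can fall slightly outside 0 to 1; that is expected and is the price of keeping the test set unseen.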

    train_row_numbers <- createDataPartition(classify_legendary$is_legendary, p = 0.8, list = FALSE)
    train_classify_legendary <- classify_legendary[train_row_numbers, ]
    test_classify_legendary <- classify_legendary[-train_row_numbers, ]
    train_row_numbers_ohe <- createDataPartition(classify_legendary_ohe$is_legendary, p = 0.8, list = FALSE)
    # Note: the rows below are indexed with train_row_numbers, so both versions of the
    # data share one split and train_row_numbers_ohe goes unused.
    train_classify_legendary_ohe <- classify_legendary_ohe[train_row_numbers, ]
    test_classify_legendary_ohe <- classify_legendary_ohe[-train_row_numbers, ]

Machine Learning Models

Onto the good stuff now.

Naive Bayes

Naive Bayes assumes that the predictors are conditionally independent given the class, so the independent variables should not be highly correlated.

We’ll use the naive bayes model as our baseline.
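Besides inspecting a correlation plot, one can list the highly correlated pairs numerically. The sketch below uses simulated columns and a cutoff of 0.8; both are illustrative assumptions, not values from the post.

```r
set.seed(1)
attack <- rnorm(100, mean = 80, sd = 20)
toy <- data.frame(
  attack = attack,
  sp_attack = attack + rnorm(100, sd = 5),  # built to track attack closely
  speed = rnorm(100, mean = 65, sd = 25)    # independent of the others
)

cor_mat <- cor(toy)

# List each pair once (upper triangle) whose absolute correlation exceeds a cutoff.
cutoff <- 0.8  # illustrative threshold, not from the post
high <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)
flagged <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  correlation = cor_mat[high]
)
print(flagged)
```

Pairs that show up here are candidates for dropping or combining before fitting Naive Bayes.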


    pairs.panels(classify_legendary[-1])

Correlation plot [Image [11]]

Train

    nb_model <- naiveBayes(is_legendary ~ ., data = train_classify_legendary_ohe)
    predict_train_nb <- predict(nb_model, train_classify_legendary_ohe)

Training Confusion Matrix

    confmat_train_nb <- table(predict_train_nb, train_classify_legendary_ohe$is_legendary)
    confmat_train_nb

Training Confusion Matrix - Naive Bayes [Image [12]]

Training Accuracy

    (confmat_train_nb[1, 1] + confmat_train_nb[2, 2]) / sum(confmat_train_nb) * 100

Train Accuracy - Naive Bayes [Image [13]]

Test

    predict_test_nb <- predict(nb_model, test_classify_legendary_ohe)

Test Confusion Matrix

    confmat_test_nb <- table(predict_test_nb, test_classify_legendary_ohe$is_legendary)
    confmat_test_nb

Test Confusion Matrix - Naive Bayes [Image [14]]

Test Accuracy

    (confmat_test_nb[1, 1] + confmat_test_nb[2, 2]) / sum(confmat_test_nb) * 100

Test Accuracy - Naive Bayes [Image [15]]

Support Vector Machine

Train

    model_svm <- svm(is_legendary ~ ., data = train_classify_legendary_ohe)
    summary(model_svm)

SVM model summary [Image [16]]

    predict_train_svm <- predict(model_svm, train_classify_legendary_ohe)

Train Confusion Matrix

    confmat_train_svm <- table(Predicted = predict_train_svm, Actual = train_classify_legendary_ohe$is_legendary)
    confmat_train_svm

Train Confusion Matrix - SVM [Image [17]]

Train Accuracy

    (confmat_train_svm[1, 1] + confmat_train_svm[2, 2]) / sum(confmat_train_svm) * 100

Train Accuracy - SVM [Image [18]]

Test

    predict_test_svm <- predict(model_svm, test_classify_legendary_ohe)

Test Confusion Matrix

    confmat_test_svm <- table(Predicted = predict_test_svm, Actual = test_classify_legendary_ohe$is_legendary)
    confmat_test_svm

Test Confusion Matrix - SVM [Image [19]]

Test Accuracy

    (confmat_test_svm[1, 1] + confmat_test_svm[2, 2]) / sum(confmat_test_svm) * 100

Test Accuracy - SVM [Image [20]]

Random Forest

Train

    model_rf <- randomForest(is_legendary ~ ., data = train_classify_legendary_ohe)
    model_rf

Train - RF [Image [21]]

    predict_train_rf <- predict(model_rf, train_classify_legendary_ohe)
    confusionMatrix(predict_train_rf, train_classify_legendary_ohe$is_legendary)

Train Confusion Matrix - RF [Image [22]]

Test

    predict_test_rf <- predict(model_rf, test_classify_legendary_ohe)
    confusionMatrix(predict_test_rf, test_classify_legendary_ohe$is_legendary)

Test Confusion Matrix - RF [Image [23]]

XGBoost

    # "- 1" drops the intercept column from the design matrix
    trainm <- sparse.model.matrix(is_legendary ~ . - 1, data = train_classify_legendary)
    testm <- sparse.model.matrix(is_legendary ~ . - 1, data = test_classify_legendary)
    train_label <- as.matrix(train_classify_legendary[, "is_legendary"])
    test_label <- as.matrix(test_classify_legendary[, "is_legendary"])
    train_matrix <- xgb.DMatrix(data = as.matrix(trainm), label = train_label)
    test_matrix <- xgb.DMatrix(data = as.matrix(testm), label = test_label)
    nc <- length(unique(train_label))

Train

    model_xgb <- xgboost(data = train_matrix,            # the data
                         nrounds = 26,                   # max number of boosting iterations
                         objective = "binary:logistic")  # the objective function

Train - xgboost [Image [24]]

Test

    pred_xgb <- predict(model_xgb, test_matrix)
    # Predictions are probabilities; threshold at 0.5 to get class labels
    acc <- mean(as.numeric(pred_xgb > 0.5) == test_label)
    print(paste("test-accuracy=", acc))

Test - xgboost [Image [25]]

Neural Network

Network Architecture

    model_nn <- keras_model_sequential()
    model_nn %>%
      layer_dense(units = 10, activation = "relu", input_shape = c(57)) %>%
      layer_dense(units = 20, activation = "relu") %>%
      layer_dense(units = 20, activation = "relu") %>%
      layer_dense(units = 20, activation = "relu") %>%
      layer_dense(units = 1, activation = "sigmoid")

Model Compilation

    model_nn %>% compile(
      loss = 'binary_crossentropy',
      optimizer = 'adam',
      metrics = c('accuracy')
    )

Convert data frames to matrix form.
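One detail of this conversion is worth seeing on a toy example: as.matrix() keeps numeric feature columns numeric, but applied to a factor it produces a character matrix. The sketch below (made-up data, hypothetical variable names) shows an explicit conversion to numeric 0/1 labels, which avoids relying on any later coercion.

```r
# Toy data with two scaled features and a factor target (made-up values).
toy <- data.frame(
  hp = c(0.2, 0.9, 0.4),
  speed = c(0.5, 1.0, 0.1),
  is_legendary = factor(c(0, 1, 0))
)

# as.matrix() on a factor gives a character matrix ...
y_chr <- as.matrix(toy$is_legendary)

# ... so converting via as.character() first yields numeric 0/1 labels.
y_num <- as.matrix(as.numeric(as.character(toy$is_legendary)))

# Feature columns are all numeric, so as.matrix() keeps them numeric.
X <- as.matrix(toy[, c("hp", "speed")])

str(y_chr)
str(y_num)
str(X)
```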

    X_train <- train_classify_legendary_ohe[, 1:57]
    y_train <- train_classify_legendary_ohe[, 58]
    X_test <- test_classify_legendary_ohe[, 1:57]
    y_test <- test_classify_legendary_ohe[, 58]
    X_train = as.matrix(X_train)
    X_test = as.matrix(X_test)
    y_train = as.matrix(y_train)
    y_test = as.matrix(y_test)

Train

    history <- model_nn %>% fit(
      as.matrix(X_train), as.matrix(y_train),
      epochs = 20, batch_size = 4,
      validation_data = list(as.matrix(X_test), as.matrix(y_test))
    )
    plot(history)

Train/Test loss - NN [Image [26]]

    # score is not defined earlier in the post; evaluating on the test set is assumed here.
    # The metric may be named "accuracy" rather than "acc" depending on the Keras version.
    score <- model_nn %>% evaluate(X_test, y_test)
    cat('Test accuracy:', score$acc, ".")

Test Accuracy - NN [Image [27]]

Thank you for reading! Constructive feedback is welcome.

