Active Learning on MNIST — Saving on Labeling

The idea is to select each next batch for labeling from the LEAST CONFIDENT samples.

```python
# do the same in a loop for 400 samples
for i in range(39):
    label_manually(10)
    fit()
    # evaluate the softmax output on the unlabeled pool
    # (sess / y_sm: the TF session and softmax tensor defined
    # earlier in the notebook; the names are assumed here)
    res = sess.run(y_sm, feed_dict={x: x_unlabeled})
    pmax = np.amax(res, axis=1)
    pidx = np.argsort(pmax)
    x_unlabeled = x_unlabeled[pidx]
    y_unlabeled = y_unlabeled[pidx]
```

```
Labels: 20 Accuracy: 0.4975
Labels: 30 Accuracy: 0.535
Labels: 40 Accuracy: 0.5475
Labels: 50 Accuracy: 0.59
Labels: 60 Accuracy: 0.64
Labels: 70 Accuracy: 0.6475
Labels: 80 Accuracy: 0.6925
Labels: 90 Accuracy: 0.6975
Labels: 100 Accuracy: 0.73
Labels: 110 Accuracy: 0.745
Labels: 120 Accuracy: 0.7625
Labels: 130 Accuracy: 0.7725
Labels: 140 Accuracy: 0.7725
Labels: 150 Accuracy: 0.7725
Labels: 160 Accuracy: 0.7875
Labels: 170 Accuracy: 0.7875
Labels: 180 Accuracy: 0.8175
Labels: 190 Accuracy: 0.8225
Labels: 200 Accuracy: 0.8225
Labels: 210 Accuracy: 0.825
Labels: 220 Accuracy: 0.8425
Labels: 230 Accuracy: 0.8425
Labels: 240 Accuracy: 0.845
Labels: 250 Accuracy: 0.8525
Labels: 260 Accuracy: 0.8525
Labels: 270 Accuracy: 0.8525
Labels: 280 Accuracy: 0.8525
Labels: 290 Accuracy: 0.8525
Labels: 300 Accuracy: 0.8525
Labels: 310 Accuracy: 0.86
Labels: 320 Accuracy: 0.86
Labels: 330 Accuracy: 0.86
Labels: 340 Accuracy: 0.86
Labels: 350 Accuracy: 0.865
Labels: 360 Accuracy: 0.8775
Labels: 370 Accuracy: 0.8825
Labels: 380 Accuracy: 0.8825
Labels: 390 Accuracy: 0.885
Labels: 400 Accuracy: 0.8975
```

After running this procedure for 40 batches of 10 samples, the resulting accuracy is almost 90%. This is far more than the 83.75% achieved with randomly labeled data.

What to do with the rest of the unlabeled data

```python
# pass the rest of the unlabeled data through the model and try to auto-label it
# (sess / y_sm as above; names assumed)
res = sess.run(y_sm, feed_dict={x: x_unlabeled})
y_autolabeled = res.argmax(axis=1)
x_labeled = np.concatenate([x_labeled, x_unlabeled])
y_labeled = np.concatenate([y_labeled, y_autolabeled])
# train on 400 samples labeled by active learning and 3600 stochastically auto-labeled ones
fit()
```

```
Labels: 4000 Accuracy: 0.8975
```

The classical way
would be to run the rest of the dataset through the existing model and automatically label the data. Then, pushing it into the training process might help to tune the model further. In our case, though, it did not give any better result.

My approach is to do the same but, as in the active learning, to take the confidence into consideration:

```python
# pass the rest of the unlabeled (3600) data through the model for automatic
# labeling and show the most confident samples
# (sess / y_sm: session and softmax tensor from earlier; names assumed)
res = sess.run(y_sm, feed_dict={x: x_unlabeled})
y_autolabeled = res.argmax(axis=1)
pmax = np.amax(res, axis=1)
pidx = np.argsort(pmax)  # sort by confidence
x_unlabeled = x_unlabeled[pidx]
y_autolabeled = y_autolabeled[pidx]
plt.plot(pmax[pidx])
```

(The plot shows the sorted confidence values of the remaining unlabeled samples.)

```python
# automatically label the 10 most confident samples and train on them
x_labeled = np.concatenate([x_labeled, x_unlabeled[-10:]])
y_labeled = np.concatenate([y_labeled, y_autolabeled[-10:]])
x_unlabeled = x_unlabeled[:-10]
fit()
```

```
Labels: 410 Accuracy: 0.8975
```

Here we run the rest of the unlabeled data through model evaluation, and we can see that the confidence still differs across the remaining samples. Thus, the idea is to take a batch of the ten MOST CONFIDENT samples and train the model on it.

```python
# process the rest of the unlabeled samples, starting from the most confident
for i in range(359):
    res = sess.run(y_sm, feed_dict={x: x_unlabeled})
    y_autolabeled = res.argmax(axis=1)
    pmax = np.amax(res, axis=1)
    pidx = np.argsort(pmax)
    x_unlabeled = x_unlabeled[pidx]
    y_autolabeled = y_autolabeled[pidx]
    x_labeled = np.concatenate([x_labeled, x_unlabeled[-10:]])
    y_labeled = np.concatenate([y_labeled, y_autolabeled[-10:]])
    x_unlabeled = x_unlabeled[:-10]
    fit()
```

```
Labels: 420 Accuracy: 0.8975
Labels: 430 Accuracy: 0.8975
Labels: 440 Accuracy: 0.8975
Labels: 450 Accuracy: 0.8975
Labels: 460 Accuracy: 0.8975
Labels: 470 Accuracy: 0.8975
Labels: 480 Accuracy: 0.8975
Labels: 490 Accuracy: 0.8975
Labels: 500 Accuracy: 0.8975
Labels: 510 Accuracy: 0.8975
Labels: 520 Accuracy: 0.8975
```

This process takes some time and gives us an extra 0.8% of
accuracy.

Results

| Experiment | Accuracy |
|---|---|
| 4000 samples | 92.25% |
| 400 random samples | 83.75% |
| 400 active learned samples | 89.75% |
| + auto-labeling | 90.50% |

Conclusion

Of course, this approach has its drawbacks, such as the heavy use of computation resources and the fact that a special procedure is required for data labeling mixed with early model evaluation. Also, the data needs to be labeled for testing purposes as well. However, if the cost of a label is high (especially for NLP and CV projects), this method can save a significant amount of resources and drive better project results.

Author: Andy Bosyi, CEO/Lead Data Scientist at MindCraft
Information Technology & Data Science. More details
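As a closing aside, the two selection rules at the heart of this post — send the LEAST confident samples to a human, auto-label the MOST confident ones — can be sketched framework-agnostically with NumPy alone. This is a minimal illustration, not the notebook's code: `probs` stands in for the model's softmax output on the unlabeled pool, and the function names are made up for this sketch.

```python
import numpy as np

def least_confident(probs, k=10):
    """Indices of the k samples whose top class probability is lowest."""
    return np.argsort(probs.max(axis=1))[:k]

def most_confident(probs, k=10):
    """Indices of the k samples whose top class probability is highest."""
    return np.argsort(probs.max(axis=1))[-k:]

# Toy example: 5 samples, 3 classes.
probs = np.array([
    [0.90, 0.05, 0.05],  # very confident
    [0.40, 0.35, 0.25],  # uncertain
    [0.34, 0.33, 0.33],  # most uncertain
    [0.80, 0.10, 0.10],
    [0.50, 0.30, 0.20],
])
print(least_confident(probs, 2))  # -> [2 1]  (send these to a human)
print(most_confident(probs, 2))   # -> [3 0]  (safest to auto-label)
```

Sorting once by the top-class probability, as the notebook does with `np.argsort(pmax)`, gives both ends of this ranking in a single pass: the head of the sorted array feeds manual labeling, the tail feeds auto-labeling.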
