ktrain: A Lightweight Wrapper for Keras to Help Train Neural Networks

Arun Maiya, Jun 3

ktrain is a library to help build, train, debug, and deploy neural networks in the deep learning software framework, Keras.

Inspired by the fastai library, ktrain allows you, with only a few lines of code, to easily:

- estimate an optimal learning rate for your model given your data using a learning rate finder
- employ learning rate schedules such as the triangular learning rate policy, 1cycle policy, and SGDR to more effectively train your model
- employ fast and easy-to-use pre-canned models for both text classification (e.g., NBSVM, fastText, GRU with pretrained word embeddings) and image classification (e.g., ResNet, Wide Residual Networks, Inception)
- load and preprocess text and image data from a variety of formats
- inspect data points that were misclassified to help improve your model
- leverage a simple prediction API for saving and deploying both models and data-preprocessing steps to make predictions on new raw data

ktrain is open-source and available on GitHub.

It requires Python 3 and can be installed with pip as follows:

    pip3 install ktrain

We will demonstrate a few use cases for ktrain by example.

Wrapping Your Model and Data in a Learner Object

ktrain is designed to work seamlessly with Keras.

Here, we load data and define a model just as you would normally do in Keras.

The following code was copied directly from the Keras fastText text classification example.

It loads the IMDb movie review dataset and defines a simple text classification model to infer the sentiment of a movie review.

    # load and prepare data as you normally would in Keras
    from keras.preprocessing import sequence
    from keras.datasets import imdb

    NUM_WORDS = 20000
    MAXLEN = 400

    def load_data():
        (x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=NUM_WORDS)
        x_train = sequence.pad_sequences(x_train, maxlen=MAXLEN)
        x_test = sequence.pad_sequences(x_test, maxlen=MAXLEN)
        return (x_train, y_train), (x_test, y_test)

    (x_train, y_train), (x_test, y_test) = load_data()

    # build a fastText-like model as you normally would in Keras
    from keras.models import Sequential
    from keras.layers import Dense, Embedding, GlobalAveragePooling1D

    def get_model():
        model = Sequential()
        model.add(Embedding(NUM_WORDS, 50, input_length=MAXLEN))
        model.add(GlobalAveragePooling1D())
        model.add(Dense(1, activation='sigmoid'))
        model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
        return model

    model = get_model()

To use ktrain, we simply wrap our model and data in a ktrain.Learner object using the get_learner function:

    import ktrain
    learner = ktrain.get_learner(model, train_data=(x_train, y_train), val_data=(x_test, y_test))

The default batch size is 32, but this can be changed by supplying a batch_size argument to get_learner.

The Learner object facilitates training your neural network in various ways.

For instance, invoking the fit method of the Learner object allows you to train interactively at different learning rates:

    # train for three epochs at 0.005
    learner.fit(5e-3, 3)

    # train for an additional three epochs at 0.0005
    learner.fit(5e-4, 3)

The underlying Keras model wrapped by the Learner object is always directly accessible as follows:

    learner.model

Next, we show that the Learner object can also be used to find a good initial learning rate and easily employ a variety of learning rate schedules that vary the learning rate automatically during training.

Tuning the Learning Rate

The learning rate is one of the most important hyperparameters to set in a neural network.

The default learning rates for various optimizers like Adam and SGD may not always be appropriate for a given problem.

Training a neural network involves minimizing a loss function.

If the learning rate is too low, training will be slow or can stall.

If the learning rate is too high, the loss can oscillate or diverge instead of being minimized.

Both cases adversely affect a model’s performance.
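Both failure modes can be seen on a toy problem. The following is a minimal sketch, not ktrain code: it runs plain gradient descent on f(x) = x**2, and the function name, step count, and three learning rates are chosen purely for illustration.

```python
def gradient_descent(lr, steps=20, x0=1.0):
    """Minimize f(x) = x**2 with plain gradient descent and return the final loss."""
    x = x0
    for _ in range(steps):
        grad = 2.0 * x       # derivative of x**2
        x = x - lr * grad    # standard gradient descent update
    return x * x

loss_too_low = gradient_descent(lr=0.001)   # barely moves from the starting loss of 1.0
loss_good = gradient_descent(lr=0.4)        # shrinks rapidly towards 0
loss_too_high = gradient_descent(lr=1.1)    # every step overshoots, so the loss grows
```

With the smallest rate the loss is still above 0.9 after 20 steps, while the largest rate ends with a loss far above where it started.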

To find an optimal learning rate for your model, one can simulate the training by starting with a low learning rate and gradually increasing it.

Leslie Smith showed that, when plotting the learning rate versus the loss, the maximal learning rate associated with a still falling loss is a good choice for training.

Following a similar syntax to that of the fastai library, this can be done in ktrain as follows:

    learner.lr_find()
    learner.lr_plot()

The code above displays a plot of the loss against the learning rate for the model and data loaded above.

We must select the maximal learning rate where the loss is still falling prior to divergence.

Based on the plot, a learning rate of 0.005 appears to be a reasonable choice, as the loss begins to diverge at higher learning rates.
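The idea behind a learning rate finder can be sketched in a few lines. This is an illustration under assumptions rather than ktrain's implementation: the helper names are hypothetical, the losses are made up, and picking the rate with the lowest recorded loss is just one simple stand-in for the visual inspection described above.

```python
def lr_sweep(start_lr, end_lr, num_steps):
    """Return num_steps learning rates spaced exponentially from start_lr to end_lr."""
    ratio = (end_lr / start_lr) ** (1.0 / (num_steps - 1))
    return [start_lr * ratio ** i for i in range(num_steps)]

def lr_at_lowest_loss(lrs, losses):
    """Return the learning rate at which the recorded loss was lowest.

    For a curve that falls and then diverges, this is the last point
    at which the loss was still falling."""
    best = min(range(len(losses)), key=losses.__getitem__)
    return lrs[best]

lrs = lr_sweep(1e-5, 1.0, 6)                     # roughly 1e-5, 1e-4, ..., 1.0
losses = [0.70, 0.69, 0.60, 0.40, 0.90, 3.00]    # made-up losses, one per learning rate
chosen = lr_at_lowest_loss(lrs, losses)          # the fourth rate, roughly 1e-2
```

In practice the recorded losses are noisy, so a real finder would usually smooth the curve first, and many practitioners pick a rate somewhat below the minimum to leave a safety margin.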

Learning Rate Schedules

A number of studies have shown that varying the learning rate during training can improve the performance of a neural model, both by lowering the loss and by raising validation accuracy.

For instance, the benefits of a 1cycle learning rate schedule with cyclical momentum were demonstrated in this experiment by Sylvain Gugger.

ktrain allows you to easily employ several different learning rate policies.

Here, we show some examples of different ways to train a model in ktrain:

    # employs a static learning rate of 0.005 for 3 epochs
    learner.fit(0.005, 3)

    # employs an SGDR schedule with a cycle length of one epoch.
    # learning rate is varied between 0.005 and near-zero value.
    learner.fit(0.005, 3, cycle_len=1)

    # employs an SGDR schedule with a cycle length
    # that increases by a factor of 2 each cycle
    learner.fit(0.005, 3, cycle_len=1, cycle_mult=2)

    # employs the 1cycle learning rate policy
    learner.fit_onecycle(0.005, 3)

    # employs a triangular learning rate policy with automatic stopping
    learner.autofit(0.005)

    # employs a triangular learning rate policy with both maximum
    # and base learning rates reduced when validation loss stalls
    learner.autofit(0.005, 20, reduce_on_plateau=3)

We cover each of these methods in more detail below and begin with the SGDR learning rate policy.

The SGDR Learning Rate Policy

Stochastic Gradient Descent with Restarts (or SGDR) cycles the learning rate between an initial learning rate identified with the aforementioned learning rate finder and a near-zero learning rate.

The learning rate is decayed using cosine annealing.

The fit method allows you to employ an SGDR learning rate policy using a syntax similar to that of the fastai library.

When the cycle_len argument is supplied, cosine annealing is used to decay the learning rate for the duration of the cycle.
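To make the shape of this schedule concrete, here is a minimal sketch of cosine annealing with restarts. It is not ktrain's code: the function name and the steps_per_epoch and min_lr parameters are assumptions made for illustration, while cycle_len and cycle_mult mirror the arguments to fit.

```python
import math

def sgdr_lr(step, max_lr, steps_per_epoch, cycle_len=1, cycle_mult=1, min_lr=0.0):
    """Learning rate at a given training step under cosine annealing with restarts."""
    cycle_steps = cycle_len * steps_per_epoch
    # skip over completed cycles, lengthening the cycle after each restart
    while step >= cycle_steps:
        step -= cycle_steps
        cycle_steps *= cycle_mult
    progress = step / cycle_steps    # fraction of the current cycle completed, in [0, 1)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

start_of_cycle = sgdr_lr(0, 0.005, steps_per_epoch=100)    # begins at max_lr
end_of_cycle = sgdr_lr(99, 0.005, steps_per_epoch=100)     # near zero just before the restart
after_restart = sgdr_lr(100, 0.005, steps_per_epoch=100)   # jumps back to max_lr
```

With cycle_mult=2 the first cycle lasts one epoch, the second two epochs, and so on, so each restart is followed by a longer and gentler decay.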

Here, we show two cycles with a length of one epoch:

SGDR: learner.fit(0.005, 2, cycle_len=1)

The cycle_mult argument increases the length of the cycle by a specified factor.

Here, the length of the cycle doubles with each cycle (cycle_mult=2):

SGDR: learner.fit(0.005, 3, cycle_len=1, cycle_mult=2)

The 1cycle and Triangular Learning Rate Policies

In addition to fit, there is also the autofit method (which employs a triangular learning rate policy) and the fit_onecycle method (which employs a 1cycle policy).

Both were proposed by Leslie Smith of the Naval Research Laboratory (NRL).

The fit_onecycle method increases the learning rate from a base rate to a maximum rate for the first half of training and decays the learning rate to a near-zero value for the second half of training.

The maximum learning rate is set using the aforementioned learning rate finder.
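The description above can be sketched as a simple piecewise-linear function. This is an illustration under assumptions, not ktrain's implementation: the function name is hypothetical, the ramps are assumed to be linear, and the default base rate of one tenth of the maximum is a guess made only so the example has a starting point.

```python
def onecycle_lr(step, total_steps, max_lr, base_lr=None, final_lr=0.0):
    """Piecewise-linear 1cycle schedule: ramp up for the first half, decay for the second."""
    if base_lr is None:
        base_lr = max_lr / 10.0    # assumed starting rate, not taken from ktrain
    half = total_steps / 2.0
    if step <= half:
        # first half: climb from base_lr to max_lr
        return base_lr + (max_lr - base_lr) * (step / half)
    # second half: fall from max_lr to final_lr
    return max_lr + (final_lr - max_lr) * ((step - half) / half)

first = onecycle_lr(0, 100, 0.005)      # the base rate
peak = onecycle_lr(50, 100, 0.005)      # the maximum rate at the midpoint
last = onecycle_lr(100, 100, 0.005)     # the final, near-zero rate
```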

1cycle policy: learner.fit_onecycle(0.005, 3)

In addition, if using the Adam, Nadam, or Adamax optimizer with fit_onecycle, the momentum is cycled between 0.95 and 0.85 such that the momentum is high when the learning rate is low and low when the learning rate is high.

Varying the momentum in this way was proposed in this paper and was shown to speed up convergence.
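The inverse relationship between momentum and learning rate can be written as a small mapping. The sketch below is an assumption-laden illustration rather than ktrain's code: the function name is hypothetical and the interpolation is assumed to be linear, with only the 0.95 and 0.85 bounds taken from the text above.

```python
def cyclical_momentum(lr, base_lr, max_lr, max_momentum=0.95, min_momentum=0.85):
    """Map a learning rate to a momentum that moves in the opposite direction."""
    fraction = (lr - base_lr) / (max_lr - base_lr)    # 0 at base_lr, 1 at max_lr
    fraction = min(max(fraction, 0.0), 1.0)           # clamp rates outside the range
    return max_momentum - (max_momentum - min_momentum) * fraction

at_base = cyclical_momentum(0.0005, base_lr=0.0005, max_lr=0.005)   # highest momentum
at_peak = cyclical_momentum(0.005, base_lr=0.0005, max_lr=0.005)    # lowest momentum
```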

(Figure: cyclical momentum in the 1cycle policy)

The autofit method simply executes a 1cycle policy each epoch (which can be considered a variation of the triangular policy):

Triangular Policy: learner.autofit(0.005, 2)

Executing one cycle per epoch like this is better suited for use with the built-in Keras training callbacks, which have proven effective in practice.

Such Keras callbacks can easily be enabled through method arguments to autofit such as early_stopping (EarlyStopping callback), reduce_on_plateau (ReduceLROnPlateau callback), and checkpoint_folder (ModelCheckpoint callback).

For instance, when reduce_on_plateau is enabled, the peak and base learning rates are both reduced (or annealed) periodically if there is no improvement in validation loss, which can help improve performance:

Triangular Policy with ReduceLROnPlateau: learner.autofit(0.005, 8, reduce_on_plateau=2)

If the number of epochs is not supplied to autofit, an EarlyStopping callback is automatically enabled and training will continue until the validation loss no longer improves.
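How the two patience settings interact can be seen by replaying a list of validation losses. The following is a rough sketch and not the Keras callbacks themselves: simulate_callbacks is a hypothetical helper, the losses are made up, and the reduction factor of 0.5 is an arbitrary assumption.

```python
def simulate_callbacks(val_losses, lr, early_stopping=5, reduce_on_plateau=2, factor=0.5):
    """Replay per-epoch validation losses and return (epochs_run, final_lr)."""
    best = float("inf")
    epochs_since_best = 0      # drives early stopping
    epochs_since_change = 0    # drives learning rate reduction
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_since_best = 0
            epochs_since_change = 0
        else:
            epochs_since_best += 1
            epochs_since_change += 1
            if epochs_since_change >= reduce_on_plateau:
                lr *= factor               # anneal after a short plateau
                epochs_since_change = 0
            if epochs_since_best >= early_stopping:
                return epoch, lr           # give up after a long plateau
    return len(val_losses), lr

val_losses = [0.50, 0.40, 0.35, 0.36, 0.37, 0.36, 0.38, 0.39, 0.40, 0.41]
epochs_run, final_lr = simulate_callbacks(val_losses, lr=0.005)
```

In this run the best loss occurs at epoch 3, the learning rate is halved at epochs 5 and 7, and training stops at epoch 8 after five epochs without improvement.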

There are also additional arguments to autofit to fine-tune the training process even further.

Type help(learner.autofit) in a Jupyter notebook for more details.

Finally, although not shown here, the autofit method (like the 1cycle policy) cycles the momentum between 0.95 and 0.85.


In the previous sections, we manually defined a model and loaded data outside of ktrain.

ktrain exposes a number of convenience functions to easily load data from a variety of sources and effortlessly employ the use of some very strong baseline models.

We will show an example for both image classification and text classification — each of which requires only a few lines of code.

Image Classification: Classifying Dogs and Cats

A standard dataset used in introductions to image classification and deep learning is the Dogs vs. Cats dataset.

We will use this dataset as an example of image classification in ktrain.

In the following code block, the images_from_folder function is used to load the training and validation images as Keras DirectoryIterator objects with data augmentation for training images.

The image_classifier function is then used to build a ResNet50 model pretrained on ImageNet.

We select 7e-5 as the learning rate after visual inspection of the plot generated by lr_plot.

Since we have not specified the number of epochs when invoking autofit in this example, training will automatically stop when the validation loss fails to improve.

By default, the EarlyStopping patience is 5 and the ReduceLROnPlateau patience is 2.

These can be changed using the early_stopping and reduce_on_plateau arguments to autofit.

This code block typically achieves an accuracy of between 99.35% and 99.55%, as shown in this notebook.

    # import ktrain modules
    import ktrain
    from ktrain import vision as vis

    # get default data augmentation with
    # horizontal_flipping as only modification
    data_aug = vis.get_data_aug(horizontal_flip=True)

    # load the data as Keras DirectoryIterator generators
    (trn, val, preproc) = vis.images_from_folder(
        datadir='data/dogscats',
        data_aug=data_aug,
        train_test_names=['train', 'valid'],
        target_size=(224,224), color_mode='rgb')

    # build a pre-trained ResNet50 model and freeze first 15 layers
    model = vis.image_classifier('pretrained_resnet50', trn, val, freeze_layers=15)

    # wrap model and data in a Learner object
    learner = ktrain.get_learner(model=model, train_data=trn, val_data=val,
                                 workers=8, use_multiprocessing=False, batch_size=64)
    learner.lr_find()  # simulate training to find good learning rate
    learner.lr_plot()  # visually identify best learning rate

    # train with triangular learning rate policy
    # ReduceLROnPlateau and EarlyStopping automatically enabled.
    # ModelCheckpoint callback explicitly enabled.
    learner.autofit(7e-5, checkpoint_folder='/tmp')

By invoking learner.view_top_losses(preproc, n=3) after training, we can view the top n examples in the validation set that are the most severely misclassified.

This can shed light on how to improve your model or data-processing pipeline and whether to prune the dataset of “garbage” data.
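The underlying idea is to rank validation examples by their individual loss. Here is a minimal sketch for the binary case, assuming binary cross-entropy as the loss; the helper name and the sample labels and probabilities are invented for illustration, and ktrain's own view_top_losses may work differently.

```python
import math

def top_losses(y_true, y_prob, n=3):
    """Return (index, loss) pairs for the n examples with the highest binary cross-entropy."""
    eps = 1e-7    # keeps log() away from zero
    losses = []
    for i, (label, prob) in enumerate(zip(y_true, y_prob)):
        prob = min(max(prob, eps), 1.0 - eps)
        loss = -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))
        losses.append((i, loss))
    return sorted(losses, key=lambda pair: pair[1], reverse=True)[:n]

y_true = [1, 0, 1, 0, 1]                 # made-up labels
y_prob = [0.9, 0.2, 0.05, 0.8, 0.6]      # made-up predicted probabilities of class 1
worst = top_losses(y_true, y_prob, n=2)  # examples 2 and 3: confident and wrong
```

Examples that are both wrong and confidently predicted rise to the top, which is why this ranking tends to surface mislabeled or ambiguous data.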

For instance, in the Dogs vs. Cats dataset, the following image is one of the most misclassified examples in the validation set:

(Figure: a misclassified example in the validation set)

As can be seen, the image is labeled as “cat” despite featuring both a dog and a cat, with the dog being featured more prominently.

This can be problematic, given that this dataset treats classes as mutually exclusive.

Problems in which the classes are not mutually exclusive are called multi-label classification problems and are discussed later in this article.

Predictions on New Data

With a trained model in hand, we can wrap our model and the preproc object returned by images_from_folder in a Predictor object to easily classify new raw images.

The Predictor object uses the preproc object to preprocess and transform raw data automatically before making predictions.

The Predictor object can be saved to disk and re-loaded later as part of a deployed application.

For detailed explanations and results, please see our tutorial notebook on image classification.
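The value of keeping preprocessing and model together can be illustrated without ktrain. The sketch below uses a hypothetical TinyPredictor class with a toy text model and JSON storage, none of which reflects ktrain's actual Predictor; it only shows why a saved predictor should carry its preprocessing settings with it.

```python
import json
import os
import tempfile

class TinyPredictor:
    """Hypothetical stand-in that keeps preprocessing settings and model weights together."""

    def __init__(self, vocab, weights, bias, maxlen):
        self.vocab = vocab        # word -> integer id
        self.weights = weights    # one weight per word id
        self.bias = bias
        self.maxlen = maxlen

    def preprocess(self, text):
        """Lowercase, split on whitespace, map known words to ids, truncate to maxlen."""
        ids = [self.vocab[w] for w in text.lower().split() if w in self.vocab]
        return ids[:self.maxlen]

    def predict(self, text):
        """Classify raw text; the caller never handles the preprocessing step."""
        score = self.bias + sum(self.weights[i] for i in self.preprocess(text))
        return "pos" if score > 0 else "neg"

    def save(self, path):
        with open(path, "w") as f:
            json.dump(self.__dict__, f)

    @classmethod
    def load(cls, path):
        with open(path) as f:
            return cls(**json.load(f))

predictor = TinyPredictor({"great": 0, "awful": 1}, [1.5, -2.0], 0.0, maxlen=400)
path = os.path.join(tempfile.mkdtemp(), "predictor.json")
predictor.save(path)
reloaded = TinyPredictor.load(path)    # ready to classify raw text again
```

Because the vocabulary and maximum length are stored alongside the weights, the reloaded object transforms new raw input exactly as it was transformed during training.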

Text Classification: Identifying Toxic Online Comments

The Toxic Comment Classification Challenge on Kaggle involves classifying Wikipedia comments into one or more categories of so-called toxic comments.

Categories of toxic online behavior include toxic, severe_toxic, obscene, threat, insult, and identity_hate.

Unlike the previous examples, this is a multi-label classification problem in that the classes are not mutually-exclusive.

For instance, a single comment can belong to multiple categories of toxic online behavior.

ktrain automatically detects multi-label classification problems from the data and configures built-in models appropriately.
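One plausible way to detect a multi-label problem from the data is to check whether every row of the label matrix has exactly one active class. The sketch below is a guess at the general idea, not ktrain's actual logic, and both helper names are hypothetical; the activation and loss names are the standard Keras identifiers.

```python
def is_multilabel(y):
    """True if any row of a 0/1 label matrix does not have exactly one active class."""
    return any(sum(row) != 1 for row in y)

def output_config(y):
    """Pick an output activation and loss that match the kind of label matrix."""
    if is_multilabel(y):
        # independent yes/no decision per class
        return {"activation": "sigmoid", "loss": "binary_crossentropy"}
    # exactly one class per example
    return {"activation": "softmax", "loss": "categorical_crossentropy"}

single_label = [[1, 0], [0, 1]]
multi_label = [[1, 1, 0], [0, 0, 0]]    # one comment with two labels, one with none
```

A sigmoid output scores each class independently, which is what allows a single comment to receive several labels or none at all.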

The dataset can be downloaded from the competition site in the form of a CSV file (i.e., download the file train.csv).

We will load the data using the texts_from_csv method, which assumes the label_columns fields are already one-hot-encoded in the spreadsheet (as is the case with the train.csv from Kaggle).
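To show what this layout looks like, here is a small sketch that reads texts and 0/1 label columns with the standard csv module. The helper is hypothetical and much simpler than texts_from_csv (it does no tokenization or train/validation split), and the two sample rows are invented.

```python
import csv
import io

def read_labeled_texts(csv_file, text_column, label_columns):
    """Read texts and their 0/1 label vectors from an open CSV file."""
    texts, labels = [], []
    for row in csv.DictReader(csv_file):
        texts.append(row[text_column])
        labels.append([int(row[col]) for col in label_columns])
    return texts, labels

# two made-up rows with one text column and two already-encoded label columns
sample = io.StringIO(
    "id,comment_text,toxic,insult\n"
    "1,thanks for the edit,0,0\n"
    "2,you are an idiot,1,1\n"
)
texts, labels = read_labeled_texts(sample, "comment_text", ["toxic", "insult"])
```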

Then, we will use the text_classifier method to load a fastText-like model.

Finally, we use the autofit method to train our model.

In this second example, we explicitly specify the number of epochs as 8.

A triangular learning rate policy is used, so 8 triangular-shaped cycles are executed.
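The claim that 8 epochs yield 8 triangular cycles can be checked with a small sketch of a one-cycle-per-epoch triangular schedule. This is illustrative only: the function name, the 100 steps per epoch, and the base rate of one tenth of the maximum are assumptions, not values taken from ktrain.

```python
def triangular_lr(step, steps_per_epoch, max_lr, base_lr=None):
    """Learning rate for a triangular policy that completes one cycle per epoch."""
    if base_lr is None:
        base_lr = max_lr / 10.0                               # assumed base rate
    position = (step % steps_per_epoch) / steps_per_epoch     # progress through the epoch
    height = 1.0 - abs(2.0 * position - 1.0)                  # 0 at the edges, 1 at the midpoint
    return base_lr + (max_lr - base_lr) * height

steps_per_epoch, epochs, max_lr = 100, 8, 0.0007
schedule = [triangular_lr(s, steps_per_epoch, max_lr) for s in range(steps_per_epoch * epochs)]
# count the steps at which the schedule touches its maximum: one per epoch
peaks = sum(1 for lr in schedule if abs(lr - max_lr) < 1e-12)
```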

    import ktrain
    from ktrain import text as txt

    DATA_PATH = 'data/toxic-comments/train.csv'
    NUM_WORDS = 50000
    MAXLEN = 150
    label_columns = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

    (x_train, y_train), (x_test, y_test), preproc = txt.texts_from_csv(
        DATA_PATH, 'comment_text',
        label_columns=label_columns,
        val_filepath=None,
        max_features=NUM_WORDS,
        maxlen=MAXLEN,
        ngram_range=1)

    # define a fastText-like architecture using ktrain
    model = txt.text_classifier('fasttext', (x_train, y_train))

    # wrap model and data in Learner object
    learner = ktrain.get_learner(model, train_data=(x_train, y_train), val_data=(x_test, y_test))

    # find a good learning rate
    learner.lr_find()
    learner.lr_plot()

    # train using triangular learning rate policy
    learner.autofit(0.0007, 8)

The code block above achieves a ROC-AUC of roughly 0.98 with only 6 minutes of training on a Titan V GPU.
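For readers unfamiliar with the metric, ROC-AUC can be computed as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below is a plain-Python illustration with made-up scores; averaging over label columns is an assumption about how a single figure is obtained for a multi-label problem, and the function names are hypothetical.

```python
def roc_auc(y_true, y_score):
    """ROC-AUC as the probability that a random positive outscores a random negative."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5    # ties count as half a win
    return wins / (len(positives) * len(negatives))

def mean_roc_auc(y_true_columns, y_score_columns):
    """Average the per-label ROC-AUC over all label columns."""
    aucs = [roc_auc(t, s) for t, s in zip(y_true_columns, y_score_columns)]
    return sum(aucs) / len(aucs)

label_a = roc_auc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.6])    # 3 of 4 pairs ranked correctly
label_b = roc_auc([1, 0], [0.8, 0.2])                    # perfectly ranked
average = mean_roc_auc([[1, 0, 1, 0], [1, 0]], [[0.9, 0.1, 0.4, 0.6], [0.8, 0.2]])
```

This pairwise version is quadratic in the number of examples and requires at least one positive and one negative per label, so it is suited to explanation rather than to large validation sets.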

As shown in this example notebook on our GitHub project, even better results can be obtained using a Bidirectional GRU with pretrained word vectors (called ‘bigru’ in ktrain).

As in the previous example, we can instantiate a Predictor object to easily make predictions on new raw data.

More Information

For more information and details on ktrain, please see the tutorial notebooks on GitHub:

- Tutorial Notebook 1: Introduction to ktrain
- Tutorial Notebook 2: Tuning Learning Rates
- Tutorial Notebook 3: Image Classification
- Tutorial Notebook 4: Text Classification
- Tutorial Notebook A1: Additional Tricks on miscellaneous topics such as inspecting misclassifications and the use of built-in callbacks in ktrain
- Additional Examples
