Choosing between TensorFlow/Keras, BigQuery ML, and AutoML Natural Language for text classification

Comparing text classification done three ways on Google Cloud Platform

Lak Lakshmanan, Jan 4

Google Cloud Platform offers you three¹ ways to carry out machine learning:

- Keras with a TensorFlow backend to build custom, deep learning models that are trained on Cloud ML Engine
- BigQuery ML to build linear models on structured data using just SQL
- Auto ML to train state-of-the-art deep learning models on your data without writing any code

Choose between them based on your skill set, how important additional accuracy is, and how much time/effort you are willing to devote to the problem.

Use BigQuery ML for quick experimentation and easy, low-cost machine learning.

Once you identify a viable ML problem using BQML, use Auto ML for code-free, state-of-the-art models.

Hand-roll your own custom models only for problems where you have lots of data and enough time/effort to devote.

Choosing the ML method that is right for you depends on how much time and effort you are willing to put in, what kind of accuracy you need, and what your skillset is.

In this article, I will compare the three approaches on a text classification problem so that you can see why I’m recommending what I am recommending.

1. CNN + Embedding + Dropout in Keras

I explain the problem and the deep learning solution in detail elsewhere, so this section will be very brief.

The task is that given the title of an article, I want to be able to identify where it was published.

The training dataset comes from articles posted on Hacker News (there’s a public dataset of these in BigQuery).

For example, here are some of the titles whose source is GitHub:

[Figure: Training dataset]

The model code to create a Keras model that uses a word embedding layer, convolutional layers, and dropout:

    model = models.Sequential()
    num_features = min(len(word_index) + 1, TOP_K)
    model.add(Embedding(input_dim=num_features,
                        output_dim=embedding_dim,
                        input_length=MAX_SEQUENCE_LENGTH))
    model.add(Dropout(rate=dropout_rate))
    model.add(Conv1D(filters=filters,
                     kernel_size=kernel_size,
                     activation='relu',
                     bias_initializer='random_uniform',
                     padding='same'))
    model.add(MaxPooling1D(pool_size=pool_size))
    model.add(Conv1D(filters=filters * 2,
                     kernel_size=kernel_size,
                     activation='relu',
                     bias_initializer='random_uniform',
                     padding='same'))
    model.add(GlobalAveragePooling1D())
    model.add(Dropout(rate=dropout_rate))
    model.add(Dense(len(CLASSES), activation='softmax'))

    # Compile model with learning parameters.
    optimizer = tf.keras.optimizers.Adam(lr=learning_rate)
    model.compile(optimizer=optimizer,
                  loss='sparse_categorical_crossentropy',
                  metrics=['acc'])
    estimator = tf.keras.estimator.model_to_estimator(keras_model=model,
                                                      model_dir=model_dir,
                                                      config=config)

This is then trained on Cloud ML Engine as shown in this Jupyter notebook:

    gcloud ml-engine jobs submit training $JOBNAME \
      --region=$REGION \
      --module-name=trainer.task \
      --package-path=${PWD}/txtclsmodel/trainer \
      --job-dir=$OUTDIR \
      --scale-tier=BASIC_GPU \
      --runtime-version=$TFVERSION \
      -- \
      --output_dir=$OUTDIR \
      --train_data_path=gs://${BUCKET}/txtcls/train.tsv \
      --eval_data_path=gs://${BUCKET}/txtcls/eval.tsv \
      --num_epochs=5

It took me a couple of days to develop the original TensorFlow model, my colleague vijaykr a day to modify it to use Keras, and maybe a day to train it and troubleshoot it.

We got about 80% accuracy.

To do better, we’d probably need a lot more data (92k examples is insufficient to gain the benefits of using a custom deep learning model) and would perhaps need to incorporate more preprocessing (such as removing stop words, stemming words, using a reusable embedding, etc.).
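The kind of extra preprocessing mentioned here could look something like the following minimal sketch. The stop-word list and the suffix-stripping rule are illustrative assumptions of mine, not part of the original notebook; a real pipeline would more likely use a library stemmer and a fuller stop-word list.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "for", "and", "is", "on"}

# Hypothetical suffixes for a crude stemmer; a real stemmer is far more careful.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip one common suffix, keeping at least three characters of the stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess_title(title):
    """Lowercase, tokenize on non-alphanumerics, drop stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    return [crude_stem(token) for token in tokens if token not in STOP_WORDS]


print(preprocess_title("Government shutdown leaves workers reeling"))
```

The cleaned tokens would then be fed to the tokenizer that builds the word index, in place of the raw title words.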

2. BigQuery ML for text classification

When using BigQuery ML, convolutional neural networks, embeddings, etc. are not (yet, anyway) an option, so I dropped down to using a linear model on a bag-of-words.

The point of BigQuery ML is to provide a quick, convenient way to build ML models on structured and semi-structured data.

I split the titles word-by-word and trained a logistic regression model (i.e., a linear classifier) on the first 5 words of the title (using more words doesn’t help all that much):

    #standardsql
    CREATE OR REPLACE MODEL advdata.txtclass
    OPTIONS(model_type='logistic_reg', input_label_cols=['source'])
    AS

    WITH extracted AS (
      SELECT
        source,
        REGEXP_REPLACE(LOWER(REGEXP_REPLACE(title, '[^a-zA-Z0-9 $.-]', ' ')), "  ", " ") AS title
      FROM (
        SELECT
          ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,
          title
        FROM `bigquery-public-data.hacker_news.stories`
        WHERE REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.com$')
          AND LENGTH(title) > 10
      )
    ),
    ds AS (
      SELECT
        ARRAY_CONCAT(SPLIT(title, " "), ['NULL', 'NULL', 'NULL', 'NULL', 'NULL']) AS words,
        source
      FROM extracted
      WHERE (source = 'github' OR source = 'nytimes' OR source = 'techcrunch')
    )

    SELECT
      source,
      words[OFFSET(0)] AS word1,
      words[OFFSET(1)] AS word2,
      words[OFFSET(2)] AS word3,
      words[OFFSET(3)] AS word4,
      words[OFFSET(4)] AS word5
    FROM ds

This was fast.

The SQL query above is the full enchilada.

There is nothing more to it.

The model training itself took only a few minutes.

I got 78% accuracy, which compares quite favorably to the 80% I got with the custom Keras CNN model.
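To make it concrete what features the query hands to the model, here is a rough Python mirror of its string handling. The function names are hypothetical, the regular expressions are copied from the SQL as written (including the unescaped dot in `.com$`), and I am assuming the second REGEXP_REPLACE is meant to collapse doubled spaces. It is a sketch for intuition, not a replacement for the query.

```python
import re


def extract_source(url):
    """Mimic ARRAY_REVERSE(SPLIT(host, '.'))[OFFSET(1)]: the label before 'com'."""
    match = re.search(r".*://(.[^/]+)/", url)
    if match is None:
        return None
    host = match.group(1)
    # Same pattern as the query; the dot is unescaped there, so it matches any character.
    if not re.search(r".com$", host):
        return None
    parts = host.split(".")
    return parts[-2] if len(parts) >= 2 else None


def title_features(title, num_words=5):
    """Clean the title as the query does and return its first `num_words` words."""
    cleaned = re.sub(r"[^a-zA-Z0-9 $.-]", " ", title).lower()
    # Assumption: the query's outer REGEXP_REPLACE collapses doubled spaces.
    cleaned = cleaned.replace("  ", " ")
    # Pad with the string 'NULL' so short titles still yield five features.
    words = cleaned.split(" ") + ["NULL"] * num_words
    return words[:num_words]


print(extract_source("https://www.nytimes.com/some/article"))
print(title_features("Show HN: Foo"))
```

Each title thus becomes five categorical columns (word1 through word5), which is what lets a linear model be trained on it with plain SQL.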

Once trained, batch predictions using BigQuery are easy:

    SELECT * FROM ML.PREDICT(MODEL advdata.txtclass, (
      SELECT 'government' AS word1, 'shutdown' AS word2, 'leaves' AS word3,
             'workers' AS word4, 'reeling' AS word5
    ))

BigQuery ML identifies the New York Times as the most likely source of an article that starts with the words “Government shutdown leaves workers reeling”.

Online predictions using BigQuery can be accomplished by exporting the weights into a web application.
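As a rough picture of what "exporting the weights" could mean for a linear model on categorical word features: each class gets a score equal to its intercept plus the weight of every (column, word) pair present, and a softmax turns the scores into probabilities. The weights below are invented placeholders, not values from the trained model, and the softmax form is my assumption about how the multiclass model combines them; in practice the numbers would be read out of the model (for example with ML.WEIGHTS) and shipped with the web application.

```python
import math

# Hypothetical weights, invented purely for illustration; real values would be
# exported from the trained model.
INTERCEPTS = {"github": 0.1, "nytimes": 0.0, "techcrunch": -0.1}
WEIGHTS = {
    "github": {("word1", "show"): 1.2, ("word2", "hn"): 0.8},
    "nytimes": {("word1", "government"): 1.5, ("word2", "shutdown"): 0.9},
    "techcrunch": {("word1", "startup"): 1.1, ("word3", "raises"): 1.4},
}


def predict(words):
    """Return class probabilities given the title's first five words."""
    scores = {}
    for label, intercept in INTERCEPTS.items():
        score = intercept
        for position, word in enumerate(words, start=1):
            # Unseen (column, word) pairs contribute nothing to the score.
            score += WEIGHTS[label].get((f"word{position}", word), 0.0)
        scores[label] = score
    # Softmax, shifted by the maximum score for numerical stability.
    top = max(scores.values())
    exps = {label: math.exp(score - top) for label, score in scores.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


probs = predict(["government", "shutdown", "leaves", "workers", "reeling"])
print(max(probs, key=probs.get), probs)
```

Because prediction reduces to a few dictionary lookups and one softmax, it can run inside the web application without a call to BigQuery.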

3. AutoML

The third option I tried is the code-free option that, nevertheless, uses state-of-the-art models and techniques underneath.

Because this is a text classification problem, the Auto ML approach to use is Auto ML Natural Language.

3a. Launch AutoML Natural Language

The first step is to launch Auto ML Natural Language from the GCP web console:

[Figure: Launch AutoML Natural Language from the GCP web console]

Follow the prompts and a bucket will be created to hold the dataset that you will use to train the model.

3b. Create CSV file and have it available on Google Cloud Storage

Where BigQuery ML requires you to know SQL, AutoML just requires that you create a dataset in one of the formats the tool understands.

The tool understands CSV files arranged as follows:

    text, label

The text itself can either be a URL to a file containing the actual text (this is useful if you have multi-line text, such as reviews or entire documents) or it can be the plain text item itself.

If you are providing the text item string directly, you need to put it in quotes.
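As a small illustration of the quoted form, the sketch below builds one `"text",label` line in Python. The helper name and the sample title are hypothetical, and doubling embedded double quotes follows the usual CSV convention, which I am assuming the tool accepts.

```python
import csv


def to_csv_line(text, label):
    """Build one `"text",label` line, doubling any embedded double quotes."""
    escaped = text.replace('"', '""')
    return f'"{escaped}",{label}'


# Hypothetical example title, not taken from the dataset.
line = to_csv_line("show hn a tiny text classifier", "github")
print(line)

# Round-trip through the standard csv module to check the quoting parses back.
parsed = next(csv.reader([line]))
print(parsed)
```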

So, our first step is to export a CSV file from BigQuery in the right format.

This was my query:

    WITH extracted AS (
      SELECT
        source,
        LOWER(REGEXP_REPLACE(title, '[^a-zA-Z0-9 $.-]', ' ')) AS title
      FROM (
        SELECT
          ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,
          title
        FROM `bigquery-public-data.hacker_news.stories`
        WHERE REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.com$')
          AND LENGTH(title) > 10
      )
    )

    SELECT CONCAT('"', title, '"') AS title, source
    FROM extracted
    WHERE (source = 'github' OR source = 'nytimes' OR source = 'techcrunch')
      AND LENGTH(title) > 0

Which yields the following dataset:

[Figure: Dataset for AutoML]

Note that I have stripped out punctuation and special characters (AutoML can accommodate these things, but I already had the cleanup code from the BigQuery section, so I ended up continuing to use it).

Because I’m going to be putting the text item directly in the CSV (instead of putting it in separate files on Cloud Storage), I have surrounded it with quotes.

I saved the result of the query as a table using the BigQuery UI:

[Figure: Save the query results as a table]

and then exported the table to a CSV file:

[Figure: Export the table data to the Auto ML bucket]

3c. Create Auto ML dataset

The next step is to use the Auto ML UI to create a dataset from the CSV file on Cloud Storage:

[Figure: Create a dataset from the CSV file on Cloud Storage]

The dataset takes about 20 minutes to ingest.

At the end, we get a screen full of text items:

[Figure: The dataset after loading]

The current Auto ML limit is 100k rows, so our 92k dataset is definitely pushing some boundaries.

A smaller dataset will get ingested faster.

Why do we have a label called “source” with only one example? The CSV file had a header line (source, title) and that too has been ingested! Fortunately, AutoML allows us to edit the text items in the GUI itself.

So, I deleted the extra label and its corresponding text.
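An alternative to fixing this in the GUI would be to strip the header line from the exported CSV before uploading it. The sketch below is one hypothetical way to do that on the file contents; the function name is mine, and it compares field names without regard to order so that it does not depend on the column order of the export.

```python
def drop_header_line(csv_text, header_fields):
    """Return csv_text without its first line if that line is the header."""
    lines = csv_text.splitlines(keepends=True)
    if not lines:
        return csv_text
    first_fields = {field.strip() for field in lines[0].split(",")}
    if first_fields == set(header_fields):
        return "".join(lines[1:])
    return csv_text


# Hypothetical two-line export: a header followed by one quoted text item.
exported = 'title,source\n"show hn a tiny text classifier",github\n'
print(drop_header_line(exported, ["source", "title"]))
```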

3d. Train

Training is as easy as clicking on a button.

Auto ML then proceeds to try various embeddings and architectures, and does hyperparameter tuning, to come up with a good solution to the problem.

It takes 5 hours.

3e. Evaluation

Once the model is trained, we get a bunch of evaluation statistics: precision, recall, AUC curve, etc.

But we also get the actual confusion matrix, from which we can compute anything else we want.

The overall accuracy is about 86%, higher even than our custom Keras CNN model.

Why? Because Auto ML is able to take advantage of transfer learning from models built on Google datasets on language use, i.e., it draws on data that we did not have available to our Keras model.

Also, because of the availability of all that data to transfer learn from, the model architecture can be more complex (read: deeper).
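Computing further metrics from a confusion matrix takes only a few lines. The counts below are made up for illustration and are not the article's results; rows are true labels and columns are predicted labels.

```python
# Hypothetical confusion matrix, invented for illustration (not the real results).
# Rows are true labels, columns are predicted labels, both in the order of LABELS.
LABELS = ["github", "nytimes", "techcrunch"]
CONFUSION = [
    [90, 4, 6],
    [5, 80, 15],
    [10, 10, 80],
]


def accuracy(matrix):
    """Fraction of all examples that lie on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def precision_recall(matrix, index):
    """Precision and recall for the class at position `index`."""
    true_positive = matrix[index][index]
    predicted = sum(row[index] for row in matrix)  # column sum
    actual = sum(matrix[index])                    # row sum
    return true_positive / predicted, true_positive / actual


print("accuracy:", accuracy(CONFUSION))
for i, label in enumerate(LABELS):
    print(label, precision_recall(CONFUSION, i))
```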

3f. Prediction

The trained AutoML model is already deployed and available for prediction.

We can send it a request and get back the predicted source of the article:

[Figure: Predictions from Auto ML]

Notice that the model is much more confident than the BQML one (although both gave the same correct answer), a confidence driven by the fact that this Auto ML model was trained on more data and is built specifically for text classification problems.

I tried another article title from today’s headlines and the model nailed it as being from TechCrunch:

[Figure: Correctly identifies the title as being from a TechCrunch article.]

Summary

While this article is primarily about text classification, the general conclusions and advice carry over to most ML problems:

- Use BigQuery ML for easy, low-cost machine learning and quick experimentation to see if ML is viable on your data. Sometimes, the accuracy you get with BQML is sufficient, and you will simply stop here.
- Once you identify a viable ML problem using BQML, use Auto ML for code-free, state-of-the-art models. Text classification, for example, is a very specialized field with high-dimensional inputs. So, you can do better with a customized solution (i.e., Auto ML Natural Language) than with a structured data approach that just uses bag-of-words.
- Hand-roll your own custom models only for problems where you have lots of data and enough time/effort to devote. Use AutoML as a benchmark. If you cannot beat Auto ML after some reasonable effort, stop wasting time. Just go with Auto ML.

¹ There are a few other ways to do machine learning on GCP.

You can do xgboost or scikit-learn in ML Engine.

The Deep Learning VM supports PyTorch.

Spark ML works well on Cloud Dataproc.

And of course, you can use Google Compute Engine or Google Kubernetes Engine and install any ML framework you want.

But in this article, I’ll focus on these three.
