How to build production-scale, cloud ML model in few minutes!Build, train and deploy a machine learning model in literally few minutes using only SQL on Google Cloud.
Nezar AssawielBlockedUnblockFollowFollowingMay 31Photo by Markus Spiske on UnsplashAs part of its effort to gain a larger market share in the cloud computing market, Google Cloud has been adding powerful tools to its platform, especially around AI.
One of these tools is BigQuery ML, which enables a user to develop and train ML models on large datasets within few minutes!If you are not familiar with it already, BigQuery is a serverless data warehouse that enables SQL queries where you only pay for query-data-usage and storage (It is “similar” to Redshift on AWS).
Since mid 2018, BigQuery has enabled users to build ML models with few lines of SQL code within few minutes.
This post will walk you through the steps needed to build such ML models.
Note:The example used in illustrating the steps is inspired by one of BigQuery’s official tutorials.
If you wish to execute the code presented, you can find it here.
1) Ingest your data into BigQuery:There are several ways for loading data into BigQuery if it is not already.
In fact, you don’t need to load data into BigQuery to query it as you can query external data.
However, if you are concerned about speed, it is a good idea to upload your data to BigQuery.
Here, we will use the natality dataset, which has details about the United States births registered between 1969 and 2008, to predict the weight of a new born baby.
It is already available in BigQuery as a public dataset here.
2) Explore and preprocess the data:Let us consider the last 10 years in the dataset, where the values are meaningful (i.
> 0) by running the following query:which yields the following table:BigQuery runtime: 2.
3s elapsed, 21.
9 GB processedAs you can see from the table above, there are several columns that seem to be suitable for creating features to predict the weight of a new born, namely:is_male,plurality,(one baby, twins, .
etc)mother_age,gestation_weeks,(duration of pregnancy in weeks)weight_gain_pounds (weight gained by the mother during the pregnancy in pounds)To examine whether these columns can be leveraged into features or not, one can plot these columns against the weight_pounds column (target of the prediction) in a Jupyter or a Datalab notebook, for example, and explore.
Checking for missing data, sparsity …etc.
is essential for good feature engineering.
For simplicity, let us skip this step and go with our intuition!3) Define and engineer features:With the columns that will be leveraged into features selected in the previous step, let us see how we can code our features in BigQuery.
Note that in BigQuery ML all strings are considered categorical features and all numeric values are considered continuous features.
As such, our 5 features will be defined as follows:Running the above query yields:BigQuery runtime: 2.
7s elapsed, 6.
82 GB processedNote the hashmonth column.
We made a hash value for every month in the last 10 years of the dataset (year>1998) to make a training/evaluation datasets split in the next step without leakage (i.
babies with the same birthdate should be either in the training or evaluation set).
3) Create and train ML model:Before creating the ML model, a BigQuery dataset is needed to store the model, which is easily created by clicking “create a dataset” from BiQuery web UI.
Let us name this dataset BigQDemo.
The SQL command CREATE MODEL creates and trains the model.
Since we are predicting the weight of a new born which is a continuous value, a linear regression model was chosen for the model_type in the following query.
Running the this query (which takes about 4 minutes) creates and trains a ML model named NewBornWeight_Model!The trained model appears under the BigQDemo dataset as a table:Training statistics, such as the loss and learning rate, can be viewed by exploring the NewBornWeight_Model table or by running the following query:SELECT * FROM ML.
NewBornWeight_Model);You can find all the other model types and optional parameters for BigQuery ML in the official documentations here.
4) Evaluate the trained model:The command ML.
EVALUATE is used to evaluate the trained model.
In our case, let us evaluate the model on the remaining 20% of the data that was not used during training (see the condition AND MOD (.
, 5)>= 4 in the query below):which yields the following results:BigQuery runtime: 2.
5s elapsed, 6.
82 GB processed5) Predict with your modelWhen you are satisfied with your model, you can use it to predict using the command ML.
Let us predict the baby weight for 50 examples from the last 10 years of the natality dataset:As you can see in the table below, when the ml.
PREDICT command is used, the prediction column is automatically named predicted_<label_column_name>.
BigQuery runtime: 1.
0s elapsed, 5.
80 GB processedWith that, you have a fully functioning ML model!BigQuery ML is not only useful in building ML models as part of a Google Cloud data pipeline or enabling non-technical staff to run ML models; but also in quickly testing the effects of newly designed complex ML features, for example, on large datasets.
Thus, BigQuery ML is particularly useful for ML developers when executed directly from the development environment, such as Jupyter notebooks, for example.
.. More details