Use BigQuery ML to predict penguin weightStay organized with collectionsSave and categorize content based on your preferences.
In this tutorial, you use alinear regression modelin BigQuery ML to predict the weight of a penguin based on the
penguin's demographic information. A linear regression is a type of regression
model that generates a continuous value from a linear combination of input
features.
In the Google Cloud console, on the project selector page,
select or create a Google Cloud project.
Roles required to select or create a project
Select a project: Selecting a project doesn't require a specific
IAM role—you can select any project that you've been
granted a role on.
Create a project: To create a project, you need the Project Creator role
(roles/resourcemanager.projectCreator), which contains theresourcemanager.projects.createpermission.Learn how to grant
roles.
To enable APIs, you need the Service Usage Admin IAM
role (roles/serviceusage.serviceUsageAdmin), which
contains theserviceusage.services.enablepermission.Learn how to grant
roles.
Create a linear regression model using the Analytics sample
dataset for BigQuery.
SQL
You can create a linear regression model by using theCREATE MODELstatementand specifyingLINEAR_REGfor the model type. Creating the model includes
training the model.
The following are useful things to know about theCREATE MODELstatement:
Theinput_label_colsoption specifies which column in theSELECTstatement
to use as the label column. Here, the label column isbody_mass_g. For
linear regression models, the label column must be real-valued, that is,
the column values must be real numbers.
This query'sSELECTstatement uses the following columns in thebigquery-public-data.ml_datasets.penguinstable to predict a penguin's
weight:
species: the species of penguin.
island: the island that the penguin resides on.
culmen_length_mm: the length of the penguin's culmen in millimeters.
culmen_depth_mm: the depth of the penguin's culmen in millimeters.
flipper_length_mm: the length of the penguin's flippers in millimeters.
sex: the sex of the penguin.
TheWHEREclause in this query'sSELECTstatement,WHERE body_mass_g IS
NOT NULL, excludes rows where thebody_mass_gcolumn isNULL.
Run the query that creates your linear regression model:
In the Google Cloud console, go to theBigQuerypage.
frombigframes.ml.linear_modelimportLinearRegressionimportbigframes.pandasasbpd# Load data from BigQuerybq_df=bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")# Drop rows with nulls to get training datatraining_data=bq_df.dropna(subset=["body_mass_g"])# Specify your feature (or input) columns and the label (or output) column:feature_columns=training_data.drop(columns=["body_mass_g"])label_columns=training_data[["body_mass_g"]]# Create the linear modelmodel=LinearRegression()model.fit(feature_columns,label_columns)model.to_gbq(your_model_id,# For example: "bqml_tutorial.penguins_model"replace=True,)
It takes about 30 seconds to create the model. To see the model, follow these
steps:
In the left pane, clickexploreExplorer:
In theExplorerpane, expand your project and clickDatasets.
Click thebqml_tutorialdataset.
Click theModelstab.
Get training statistics
To see the results of the model training, you can use theML.TRAINING_INFOfunction,
or you can view the statistics in the Google Cloud console. In this
tutorial, you use the Google Cloud console.
A machine learning algorithm builds a model by examining many examples and
attempting to find a model that minimizes loss. This process is called empirical
risk minimization.
Loss is the penalty for a bad prediction. It is a number indicating
how bad the model's prediction was on a single example. If the model's
prediction is perfect, the loss is zero; otherwise, the loss is greater. The
goal of training a model is to find a set of weights and biases that have low
loss, on average, across all examples.
See the model training statistics that were generated when you ran theCREATE MODELquery:
In the left pane, clickexploreExplorer:
In theExplorerpane, expand your project and clickDatasets.
Click thebqml_tutorialdataset.
Click theModelstab.
To open the model information pane, clickpenguins_model.
Click theTrainingtab, and then clickTable. The results should look
similar to the following:
TheTraining Data Losscolumn represents the loss metric calculated
after the model is trained on the training dataset. Since you performed a
linear regression, this column shows themean squared errorvalue. Anormal_equationoptimization strategy is automatically used for this training, so only one
iteration is required to converge to the final model. For more information
on setting the model optimization strategy, seeoptimize_strategy.
Evaluate the model
After creating the model, evaluate the model's performance by using theML.EVALUATEfunctionor thescoreBigQuery DataFrames functionto evaluate the predicted values generated by the model against the actual data.
SQL
For input, theML.EVALUATEfunction takes the trained model and a dataset
that matches the schema of the data that you used to train the model. In
a production environment, you should
evaluate the model on different data than the data you used to train the model.
If you runML.EVALUATEwithout providing input data, the function retrieves
the evaluation metrics calculated during training. These metrics are calculated
by using the automatically reserved evaluation dataset:
importbigframes.pandasasbpd# Select the model you will be evaluating. `read_gbq_model` loads model data from# BigQuery, but you could also use the `model` object from the previous steps.model=bpd.read_gbq_model(your_model_id,# For example: "bqml_tutorial.penguins_model")# Score the model with input data defined in an earlier step to compare# model predictions on feature_columns to true labels in label_columns.score=model.score(feature_columns,label_columns)# Expected output results:# index mean_absolute_error mean_squared_error mean_squared_log_error median_absolute_error r2_score explained_variance# 0 227.012237 81838.159892 0.00507 173.080816 0.872377 0.872377# 1 rows x 6 columns
The results should look similar to the following:
Because you performed a linear regression, the results include the following
columns:
mean_absolute_error
mean_squared_error
mean_squared_log_error
median_absolute_error
r2_score
explained_variance
An important metric in the evaluation results is theR2score.
The R2score is a statistical measure that determines if the linear
regression predictions approximate the actual data. A value of0indicates
that the model explains none of the variability of the response data around the
mean. A value of1indicates that the model explains all the variability of
the response data around the mean.
You can also look at the model's information pane in the Google Cloud console
to view the evaluation metrics:
Use the model to predict outcomes
Now that you have evaluated your model, the next step is to use it to predict
an outcome. You can run theML.PREDICTfunctionor thepredictBigQuery DataFrames functionon the model to predict the body mass in grams of all penguins that reside on
the Biscoe Islands.
SQL
For input, theML.PREDICTfunction takes the trained model and a dataset that
matches the schema of the data that you used to train the model, excluding the
label column.
Run theML.PREDICTquery:
In the Google Cloud console, go to theBigQuerypage.
# Select the model you'll use for predictions. `read_gbq_model` loads# model data from BigQuery, but you could also use the `model` object# object from previous steps.model=bpd.read_gbq_model(your_model_id,# For example: "bqml_tutorial.penguins_model",)# Load data from BigQuerybq_df=bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")# Use 'contains' function to filter by island containing the string# "Biscoe".biscoe_data=bq_df.loc[bq_df["island"].str.contains("Biscoe")]result=model.predict(biscoe_data)# Expected output results:# predicted_body_mass_g species island culmen_length_mm culmen_depth_mm body_mass_g flipper_length_mm sex# 23 4681.782896 Gentoo penguin (Pygoscelis papua) Biscoe <NA> <NA> <NA> <NA> <NA># 332 4740.7907 Gentoo penguin (Pygoscelis papua) Biscoe 46.2 14.4 214.0 4650.0 <NA># 160 4731.310452 Gentoo penguin (Pygoscelis papua) Biscoe 44.5 14.3 216.0 4100.0 <NA>
The results should look similar to the following:
Explain the prediction results
SQL
To understand why the model is generating these prediction results, you can use
theML.EXPLAIN_PREDICTfunction.
ML.EXPLAIN_PREDICTis an extended version of theML.PREDICTfunction.ML.EXPLAIN_PREDICTnot only outputs prediction results, but also outputs
additional columns to explain the prediction results. In practice, you can runML.EXPLAIN_PREDICTinstead ofML.PREDICT. For more information, seeBigQuery ML explainable AI overview.
Run theML.EXPLAIN_PREDICTquery:
In the Google Cloud console, go to theBigQuerypage.
# Use 'predict_explain' function to understand why the model is generating these prediction results.# 'predict_explain'is an extended version of the 'predict' function that not only outputs prediction results, but also outputs additional columns to explain the prediction results.# Using the trained model and utilizing data specific to Biscoe Island, explain the predictions of the top 3 featuresexplained=model.predict_explain(biscoe_data,top_k_features=3)# Expected results:# predicted_body_mass_g top_feature_attributions baseline_prediction_value prediction_value approximation_error species island culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g sex# 0 5413.510134 [{'feature': 'island', 'attribution': 7348.877... -5320.222128 5413.510134 0.0 Gentoo penguin (Pygoscelis papua) Biscoe 45.2 16.4 223.0 5950.0 MALE# 1 4768.351092 [{'feature': 'island', 'attribution': 7348.877... -5320.222128 4768.351092 0.0 Gentoo penguin (Pygoscelis papua) Biscoe 46.5 14.5 213.0 4400.0 FEMALE# 2 3235.896372 [{'feature': 'island', 'attribution': 7348.877... -5320.222128 3235.896372 0.0 Adelie Penguin (Pygoscelis adeliae) Biscoe 37.7 16.0 183.0 3075.0 FEMALE# 3 5349.603734 [{'feature': 'island', 'attribution': 7348.877... -5320.222128 5349.603734 0.0 Gentoo penguin (Pygoscelis papua) Biscoe 46.4 15.6 221.0 5000.0 MALE# 4 4637.165037 [{'feature': 'island', 'attribution': 7348.877... -5320.222128 4637.165037 0.0 Gentoo penguin (Pygoscelis papua) Biscoe 46.1 13.2 211.0 4500.0 FEMALE
For linear regression models, Shapley values are used to generate feature
attribution values for each feature in the model. The output includes
the top three feature attributions per row of thepenguinstable becausetop_k_featureswas set to3. These attributions are sorted by
the absolute value of the attribution in descending order. In all examples, the
featuresexcontributed the most to the overall prediction.
Globally explain the model
SQL
To know which features are generally the most important to determine penguin
weight, you can use theML.GLOBAL_EXPLAINfunction.
In order to useML.GLOBAL_EXPLAIN, you must retrain the model with theENABLE_GLOBAL_EXPLAINoption set toTRUE.
Retrain and get global explanations for the model:
In the Google Cloud console, go to theBigQuerypage.
# To use the `global_explain()` function, the model must be recreated with `enable_global_explain` set to `True`.model=LinearRegression(enable_global_explain=True)# The model must the be fitted before it can be saved to BigQuery and then explained.training_data=bq_df.dropna(subset=["body_mass_g"])X=training_data.drop(columns=["body_mass_g"])y=training_data[["body_mass_g"]]model.fit(X,y)model.to_gbq("bqml_tutorial.penguins_model",replace=True)# Explain the modelexplain_model=model.global_explain()# Expected results:# attribution# feature# island 5737.315921# species 4073.280549# sex 622.070896# flipper_length_mm 193.612051# culmen_depth_mm 117.084944# culmen_length_mm 94.366793
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this
tutorial, either delete the project that contains the resources, or keep the project and
delete the individual resources.
You can delete the project you created.
Or you can keep the project and delete the dataset.
Delete your dataset
Deleting your project removes all datasets and all tables in the project. If you
prefer to reuse the project, you can delete the dataset you created in this
tutorial:
If necessary, open the BigQuery page in the
Google Cloud console.
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2026-04-14 UTC."],[],[]]