# BigQuery ML model evaluation overview
This document describes how BigQuery ML supports machine learning (ML)
model evaluation.
## Overview of model evaluation
You can use ML model evaluation metrics for the following purposes:

- To assess the quality of the fit between the model and the data.
- To compare different models.
- To predict how accurately you can expect each model to perform on a specific
  dataset, in the context of model selection.
Supervised and unsupervised learning model evaluations work differently:
- For supervised learning models, model evaluation is well-defined. An
  evaluation set, which is data that the model hasn't seen during training, is
  typically excluded from the training set and then used to evaluate model
  performance. We recommend that you don't use the training set for
  evaluation, because doing so causes the model to perform poorly when
  generalizing its predictions to new data. This outcome is known as
  *overfitting*.
- For unsupervised learning models, model evaluation is less defined and
  typically varies from model to model. Because unsupervised learning models
  don't reserve an evaluation set, the evaluation metrics are calculated using
  the whole input dataset.

For information about the supported SQL statements and functions for each
model type, see
[End-to-end user journey for each model](/bigquery/docs/e2e-journey).
## Model evaluation offerings
BigQuery ML provides the following functions to calculate
evaluation metrics for ML models:

- `ML.EVALUATE`: calculates evaluation metrics for a model. You can use this
  function with most model types.
- `ML.CONFUSION_MATRIX`: returns a confusion matrix for classification
  models.
- `ML.ROC_CURVE`: returns metrics for different threshold values for binary
  classification models.
- `ML.ARIMA_EVALUATE`: calculates evaluation metrics for time series models.
  It also reports other information about seasonality, holiday effects,
  and spikes-and-dips outliers. This function doesn't require new data as
  input.
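For example, for a binary classification model you might retrieve a confusion
matrix and threshold-based metrics with queries like the following. This is a
minimal sketch; the model `mydataset.my_classifier` and the table
`mydataset.eval_data` are hypothetical.

```sql
-- Confusion matrix, calculated against the evaluation data that was
-- reserved when the model was created. Model name is hypothetical.
SELECT *
FROM ML.CONFUSION_MATRIX(MODEL `mydataset.my_classifier`);

-- ROC metrics at thresholds 0.4, 0.5, and 0.6, calculated against
-- new input data. Table name is hypothetical.
SELECT *
FROM ML.ROC_CURVE(
  MODEL `mydataset.my_classifier`,
  TABLE `mydataset.eval_data`,
  GENERATE_ARRAY(0.4, 0.6, 0.1));
```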
## Automatic evaluation in `CREATE MODEL` statements
BigQuery ML supports automatic evaluation during model creation.
Depending on the model type, the data split training options, and whether you're
using hyperparameter tuning, the evaluation metrics are calculated against
the reserved evaluation dataset, the reserved test dataset, or the entire input
dataset:
- For k-means, PCA, autoencoder, and ARIMA_PLUS models, BigQuery ML
  uses all of the input data as training data, and evaluation metrics are
  calculated against the entire input dataset.
- For linear and logistic regression, boosted tree, random forest, DNN,
  Wide-and-deep, and matrix factorization models, evaluation metrics are
  calculated against the dataset that's specified by the following
  `CREATE MODEL` options:

  - [`DATA_SPLIT_METHOD`](/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-glm#data_split_method)
  - [`DATA_SPLIT_EVAL_FRACTION`](/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-glm#data_split_eval_fraction)
  - [`DATA_SPLIT_COL`](/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-glm#data_split_col)

  When you train these types of models using hyperparameter tuning, the
  [`DATA_SPLIT_TEST_FRACTION`](/bigquery/docs/reference/standard-sql/bigqueryml-hyperparameter-tuning#data_split)
  option also helps define the dataset that the evaluation metrics are
  calculated against. For more information, see
  [Data split](/bigquery/docs/reference/standard-sql/bigqueryml-hyperparameter-tuning#data_split).
  A sketch of a `CREATE MODEL` statement that sets these options follows
  this list.

- For AutoML Tables models, see
  [how data splits are used](/automl-tables/docs/prepare#how_data_splits_are_used)
  for training and evaluation.
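The following example shows how the data split options shape the evaluation
dataset. It's a minimal sketch; the model name `mydataset.sample_model`, the
table `mydataset.training_data`, and its `label` column are hypothetical.

```sql
-- A minimal sketch: trains a logistic regression model and reserves a
-- random 20% of the input data as the evaluation set. Evaluation metrics
-- are then calculated against that reserved 20%.
CREATE OR REPLACE MODEL `mydataset.sample_model`
OPTIONS (
  model_type = 'LOGISTIC_REG',
  input_label_cols = ['label'],
  data_split_method = 'RANDOM',
  data_split_eval_fraction = 0.2
) AS
SELECT * FROM `mydataset.training_data`;
```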
To get evaluation metrics calculated during model creation, use evaluation
functions such as `ML.EVALUATE` on the model with no input data specified.
For an example, see
[`ML.EVALUATE` with no input data specified](/bigquery/docs/reference/standard-sql/bigqueryml-syntax-evaluate#mlevaluate_with_no_input_data_specified).
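For instance, a query like the following returns the metrics that
BigQuery ML calculated when the model was created. This is a minimal
sketch; the model name is hypothetical.

```sql
-- With no input data argument, ML.EVALUATE returns the evaluation
-- metrics calculated during model creation. Model name is hypothetical.
SELECT *
FROM ML.EVALUATE(MODEL `mydataset.sample_model`);
```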
## Evaluation with a new dataset
After model creation, you can specify new datasets for evaluation. To provide
a new dataset, use evaluation functions like `ML.EVALUATE` on the model with
input data specified. For an example, see
[`ML.EVALUATE` with a custom threshold and input data](/bigquery/docs/reference/standard-sql/bigqueryml-syntax-evaluate#mlevaluate_with_a_custom_threshold_and_input_data).
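For instance, a query like the following evaluates the model against a new
table; for a binary classification model, you can also pass a custom
classification threshold. This is a minimal sketch; the model and table names
are hypothetical.

```sql
-- Evaluates the model against new input data. The optional threshold
-- applies only to binary classification models. The model and table
-- names are hypothetical.
SELECT *
FROM ML.EVALUATE(
  MODEL `mydataset.sample_model`,
  TABLE `mydataset.new_data`,
  STRUCT(0.55 AS threshold));
```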
## What's next
For more information about supported SQL statements and functions for models
that support evaluation, see
[`ML.EVALUATE`](/bigquery/docs/reference/standard-sql/bigqueryml-syntax-evaluate)
and [End-to-end user journey for each model](/bigquery/docs/e2e-journey).
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-09-04 UTC."],[[["\u003cp\u003eBigQuery ML supports model evaluation to assess model-data fit, compare models, and predict model performance on new datasets.\u003c/p\u003e\n"],["\u003cp\u003eSupervised learning models utilize a separate evaluation set to prevent overfitting, while unsupervised learning models use the entire input dataset for evaluation.\u003c/p\u003e\n"],["\u003cp\u003eBigQuery ML offers a variety of \u003ccode\u003eML.EVALUATE\u003c/code\u003e functions to calculate evaluation metrics for supervised and unsupervised models such as regressions, classifications, and clustering, each providing model-specific results.\u003c/p\u003e\n"],["\u003cp\u003eModel evaluation can be done automatically during model creation using reserved evaluation or test datasets based on the model type and chosen data split options, or with new data after creation.\u003c/p\u003e\n"],["\u003cp\u003eSpecific functions such as \u003ccode\u003eML.CONFUSION_MATRIX\u003c/code\u003e and \u003ccode\u003eML.ROC_CURVE\u003c/code\u003e are available for a more granular evaluation, including confusion matrices and metrics for different threshold values, respectively.\u003c/p\u003e\n"]]],[],null,["# BigQuery ML model evaluation overview\n=====================================\n\nThis document describes how BigQuery ML supports machine learning (ML)\nmodel evaluation.\n\nOverview of model evaluation\n----------------------------\n\nYou can use ML model evaluation metrics for the following\npurposes:\n\n- To assess the quality of the fit between the model and the data.\n- To compare different models.\n- To predict how accurately you can expect each model to perform on a specific dataset, in the context of model selection.\n\nSupervised and unsupervised learning model evaluations work differently:\n\n- For supervised learning models, model evaluation is well-defined. An evaluation set, which is data that hasn't been analyzed by the model, is typically excluded from the training set and then used to evaluate model performance. We recommend that you don't use the training set for evaluation because this causes the model to perform poorly when generalizing the prediction results for new data. This outcome is known as *overfitting*.\n- For unsupervised learning models, model evaluation is less defined and typically varies from model to model. 
Because unsupervised learning models don't reserve an evaluation set, the evaluation metrics are calculated using the whole input dataset.\n\nFor information about the supported SQL statements and functions for each\nmodel type, see\n[End-to-end user journey for each model](/bigquery/docs/e2e-journey).\n\nModel evaluation offerings\n--------------------------\n\nBigQuery ML provides the following functions to calculate\nevaluation metrics for ML models:\n\nAutomatic evaluation in `CREATE MODEL` statements\n-------------------------------------------------\n\nBigQuery ML supports automatic evaluation during model creation.\nDepending on the model type, the data split training options, and whether you're\nusing hyperparameter tuning, the evaluation metrics are calculated upon\nthe reserved evaluation dataset, the reserved test dataset, or the entire input\ndataset.\n\n- For k-means, PCA, autoencoder, and ARIMA_PLUS models, BigQuery ML\n uses all of the input data as training data, and evaluation metrics are\n calculated against the entire input dataset.\n\n- For linear and logistic regression, boosted tree, random forest, DNN,\n Wide-and-deep, and matrix factorization models, evaluation metrics are\n calculated against the dataset that's specified by the following\n `CREATE MODEL` options:\n\n - [`DATA_SPLIT_METHOD`](/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-glm#data_split_method)\n - [`DATA_SPLIT_EVAL_FRACTION`](/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-glm#data_split_eval_fraction)\n - [`DATA_SPLIT_COL`](/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-glm#data_split_col)\n\n When you train these types of models using hyperparameter tuning, the\n [`DATA_SPLIT_TEST_FRACTION`](/bigquery/docs/reference/standard-sql/bigqueryml-hyperparameter-tuning#data_split) option also helps\n define the dataset that the evaluation metrics are calculated against. For\n more information, see\n [Data split](/bigquery/docs/reference/standard-sql/bigqueryml-hyperparameter-tuning#data_split).\n- For AutoML Tables models, see\n [how data splits are used](/automl-tables/docs/prepare#how_data_splits_are_used)\n for training and evaluation.\n\nTo get evaluation metrics calculated during model creation, use evaluation\nfunctions such as `ML.EVALUATE` on the model with no input data specified.\nFor an example, see\n[`ML.EVALUATE` with no input data specified](/bigquery/docs/reference/standard-sql/bigqueryml-syntax-evaluate#mlevaluate_with_no_input_data_specified).\n\nEvaluation with a new dataset\n-----------------------------\n\nAfter model creation, you can specify new datasets for evaluation. To provide\na new dataset, use evaluation functions like `ML.EVALUATE` on the model with\ninput data specified. For an example, see\n[`ML.EVALUATE` with a custom threshold and input data](/bigquery/docs/reference/standard-sql/bigqueryml-syntax-evaluate#mlevaluate_with_a_custom_threshold_and_input_data)."]]