The CREATE MODEL statement
To create a model in BigQuery, use the BigQuery ML CREATE MODEL statement. This statement is similar to the CREATE TABLE DDL statement. When you run a query that contains a CREATE MODEL statement, a query job is generated for you that processes the query. You can also create a model through the Google Cloud console UI (Preview).
For information about the model types that each SQL statement and function supports, and about all of the SQL statements and functions that each model type supports, see End-to-end user journey for each model.
Required permissions
- To create a dataset to store the model, you need the bigquery.datasets.create IAM permission.
- To create a model, you need the following permissions:
  - bigquery.jobs.create
  - bigquery.models.create
  - bigquery.models.getData
  - bigquery.models.updateData
  - bigquery.connections.delegate (for remote models)

Several predefined IAM roles, such as BigQuery Admin (roles/bigquery.admin), grant these permissions. For more information about IAM roles and permissions in BigQuery, see Introduction to IAM.
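If you manage access with SQL, a dataset-level grant might look like the following sketch. The dataset name and principal are placeholders, and roles/bigquery.dataEditor covers the model permissions listed above but not bigquery.jobs.create, which must be granted at the project level (for example, through the BigQuery Job User role):

```sql
-- A sketch of granting dataset-level access with BigQuery DCL.
-- `myproject.mydataset` and the user email are hypothetical.
GRANT `roles/bigquery.dataEditor`
ON SCHEMA `myproject.mydataset`
TO "user:data-scientist@example.com";
```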
CREATE MODEL syntax

```sql
{ CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL }
model_name
[TRANSFORM (select_list)]
[INPUT (field_name field_type) OUTPUT (field_name field_type)]
[REMOTE WITH CONNECTION {`connection_name` | DEFAULT}]
[OPTIONS (model_option_list)]
[AS {query_statement |
  (training_data AS (query_statement),
   custom_holiday AS (holiday_statement))}]

model_option_list:
  MODEL_TYPE = { 'LINEAR_REG' | 'LOGISTIC_REG' | 'KMEANS' | 'MATRIX_FACTORIZATION' |
                 'PCA' | 'AUTOENCODER' | 'AUTOML_CLASSIFIER' | 'AUTOML_REGRESSOR' |
                 'BOOSTED_TREE_CLASSIFIER' | 'BOOSTED_TREE_REGRESSOR' |
                 'RANDOM_FOREST_CLASSIFIER' | 'RANDOM_FOREST_REGRESSOR' |
                 'DNN_CLASSIFIER' | 'DNN_REGRESSOR' | 'DNN_LINEAR_COMBINED_CLASSIFIER' |
                 'DNN_LINEAR_COMBINED_REGRESSOR' | 'ARIMA_PLUS' | 'ARIMA_PLUS_XREG' |
                 'TENSORFLOW' | 'TENSORFLOW_LITE' | 'ONNX' | 'XGBOOST' |
                 'CONTRIBUTION_ANALYSIS' }
  [, MODEL_REGISTRY = { 'VERTEX_AI' } ]
  [, VERTEX_AI_MODEL_ID = string_value ]
  [, VERTEX_AI_MODEL_VERSION_ALIASES = string_array ]
  [, INPUT_LABEL_COLS = string_array ]
  [, MAX_ITERATIONS = int64_value ]
  [, EARLY_STOP = { TRUE | FALSE } ]
  [, MIN_REL_PROGRESS = float64_value ]
  [, DATA_SPLIT_METHOD = { 'AUTO_SPLIT' | 'RANDOM' | 'CUSTOM' | 'SEQ' | 'NO_SPLIT' } ]
  [, DATA_SPLIT_EVAL_FRACTION = float64_value ]
  [, DATA_SPLIT_TEST_FRACTION = float64_value ]
  [, DATA_SPLIT_COL = string_value ]
  [, OPTIMIZE_STRATEGY = { 'AUTO_STRATEGY' | 'BATCH_GRADIENT_DESCENT' | 'NORMAL_EQUATION' } ]
  [, L1_REG = float64_value ]
  [, L2_REG = float64_value ]
  [, LEARN_RATE_STRATEGY = { 'LINE_SEARCH' | 'CONSTANT' } ]
  [, LEARN_RATE = float64_value ]
  [, LS_INIT_LEARN_RATE = float64_value ]
  [, WARM_START = { TRUE | FALSE } ]
  [, AUTO_CLASS_WEIGHTS = { TRUE | FALSE } ]
  [, CLASS_WEIGHTS = struct_array ]
  [, INSTANCE_WEIGHT_COL = string_value ]
  [, NUM_CLUSTERS = int64_value ]
  [, KMEANS_INIT_METHOD = { 'RANDOM' | 'KMEANS++' | 'CUSTOM' } ]
  [, KMEANS_INIT_COL = string_value ]
  [, DISTANCE_TYPE = { 'EUCLIDEAN' | 'COSINE' } ]
  [, STANDARDIZE_FEATURES = { TRUE | FALSE } ]
  [, MODEL_PATH = string_value ]
  [, BUDGET_HOURS = float64_value ]
  [, OPTIMIZATION_OBJECTIVE = { string_value | struct_value } ]
  [, FEEDBACK_TYPE = { 'EXPLICIT' | 'IMPLICIT' } ]
  [, NUM_FACTORS = int64_value ]
  [, USER_COL = string_value ]
  [, ITEM_COL = string_value ]
  [, RATING_COL = string_value ]
  [, WALS_ALPHA = float64_value ]
  [, BOOSTER_TYPE = { 'gbtree' | 'dart' } ]
  [, NUM_PARALLEL_TREE = int64_value ]
  [, DART_NORMALIZE_TYPE = { 'tree' | 'forest' } ]
  [, TREE_METHOD = { 'auto' | 'exact' | 'approx' | 'hist' } ]
  [, MIN_TREE_CHILD_WEIGHT = float64_value ]
  [, COLSAMPLE_BYTREE = float64_value ]
  [, COLSAMPLE_BYLEVEL = float64_value ]
  [, COLSAMPLE_BYNODE = float64_value ]
  [, MIN_SPLIT_LOSS = float64_value ]
  [, MAX_TREE_DEPTH = int64_value ]
  [, SUBSAMPLE = float64_value ]
  [, ACTIVATION_FN = { 'RELU' | 'RELU6' | 'CRELU' | 'ELU' | 'SELU' | 'SIGMOID' | 'TANH' } ]
  [, BATCH_SIZE = int64_value ]
  [, DROPOUT = float64_value ]
  [, HIDDEN_UNITS = int_array ]
  [, OPTIMIZER = { 'ADAGRAD' | 'ADAM' | 'FTRL' | 'RMSPROP' | 'SGD' } ]
  [, TIME_SERIES_TIMESTAMP_COL = string_value ]
  [, TIME_SERIES_DATA_COL = string_value ]
  [, TIME_SERIES_ID_COL = { string_value | string_array } ]
  [, HORIZON = int64_value ]
  [, AUTO_ARIMA = { TRUE | FALSE } ]
  [, AUTO_ARIMA_MAX_ORDER = int64_value ]
  [, AUTO_ARIMA_MIN_ORDER = int64_value ]
  [, NON_SEASONAL_ORDER = (int64_value, int64_value, int64_value) ]
  [, DATA_FREQUENCY = { 'AUTO_FREQUENCY' | 'PER_MINUTE' | 'HOURLY' | 'DAILY' | 'WEEKLY' | ... } ]
  [, FORECAST_LIMIT_LOWER_BOUND = float64_value ]
  [, FORECAST_LIMIT_UPPER_BOUND = float64_value ]
  [, INCLUDE_DRIFT = { TRUE | FALSE } ]
  [, HOLIDAY_REGION = { 'GLOBAL' | 'NA' | 'JAPAC' | 'EMEA' | 'LAC' | 'AE' | ... } ]
  [, CLEAN_SPIKES_AND_DIPS = { TRUE | FALSE } ]
  [, ADJUST_STEP_CHANGES = { TRUE | FALSE } ]
  [, DECOMPOSE_TIME_SERIES = { TRUE | FALSE } ]
  [, HIERARCHICAL_TIME_SERIES_COLS = string_array ]
  [, ENABLE_GLOBAL_EXPLAIN = { TRUE | FALSE } ]
  [, APPROX_GLOBAL_FEATURE_CONTRIB = { TRUE | FALSE } ]
  [, INTEGRATED_GRADIENTS_NUM_STEPS = int64_value ]
  [, CALCULATE_P_VALUES = { TRUE | FALSE } ]
  [, FIT_INTERCEPT = { TRUE | FALSE } ]
  [, CATEGORY_ENCODING_METHOD = { 'ONE_HOT_ENCODING' | 'DUMMY_ENCODING' | 'LABEL_ENCODING' | 'TARGET_ENCODING' } ]
  [, ENDPOINT = string_value ]
  [, REMOTE_SERVICE_TYPE = { 'CLOUD_AI_VISION_V1' | 'CLOUD_AI_NATURAL_LANGUAGE_V1' | 'CLOUD_AI_TRANSLATE_V3' } ]
  [, XGBOOST_VERSION = { '0.9' | '1.1' } ]
  [, TF_VERSION = { '1.15' | '2.8.0' } ]
  [, NUM_TRIALS = int64_value ]
  [, MAX_PARALLEL_TRIALS = int64_value ]
  [, HPARAM_TUNING_ALGORITHM = { 'VIZIER_DEFAULT' | 'RANDOM_SEARCH' | 'GRID_SEARCH' } ]
  [, HPARAM_TUNING_OBJECTIVES = { 'R2_SCORE' | 'ROC_AUC' | ... } ]
  [, NUM_PRINCIPAL_COMPONENTS = int64_value ]
  [, PCA_EXPLAINED_VARIANCE_RATIO = float64_value ]
  [, SCALE_FEATURES = { TRUE | FALSE } ]
  [, PCA_SOLVER = { 'FULL' | 'RANDOMIZED' | 'AUTO' } ]
  [, TIME_SERIES_LENGTH_FRACTION = float64_value ]
  [, MIN_TIME_SERIES_LENGTH = int64_value ]
  [, MAX_TIME_SERIES_LENGTH = int64_value ]
  [, TREND_SMOOTHING_WINDOW_SIZE = int64_value ]
  [, SEASONALITIES = string_array ]
  [, PROMPT_COL = string_value ]
  [, LEARNING_RATE_MULTIPLIER = float64_value ]
  [, ACCELERATOR_TYPE = { 'GPU' | 'TPU' } ]
  [, EVALUATION_TASK = { 'TEXT_GENERATION' | 'CLASSIFICATION' | 'SUMMARIZATION' | 'QUESTION_ANSWERING' | 'UNSPECIFIED' } ]
  [, DOCUMENT_PROCESSOR = string_value ]
  [, SPEECH_RECOGNIZER = string_value ]
  [, KMS_KEY_NAME = string_value ]
  [, CONTRIBUTION_METRIC = string_value ]
  [, DIMENSION_ID_COLS = string_array ]
  [, IS_TEST_COL = string_value ]
  [, MIN_APRIORI_SUPPORT = float64_value ]
  [, PRUNING_METHOD = { 'NO_PRUNING' | 'PRUNE_REDUNDANT_INSIGHTS' } ]
  [, TOP_K_INSIGHTS_BY_APRIORI_SUPPORT = int64_value ]
```
CREATE MODEL
Creates and trains a new model in the specified dataset. If the model name exists, CREATE MODEL returns an error.
CREATE MODEL IF NOT EXISTS
Creates and trains a new model only if the model does not exist in the specified dataset.
CREATE OR REPLACE MODEL
Creates and trains a model and replaces an existing model with the same name in the specified dataset.
model_name
model_name is the name of the model that you're creating or replacing. The model name must be unique per dataset: no other model or table can have the same name. The model name must follow the same naming rules as a BigQuery table. A model name can:
- Contain up to 1,024 characters
- Contain letters (upper or lower case), numbers, and underscores

model_name is not case-sensitive.
If you don't have a default project configured, prepend the project ID to the model name in the following format, including backticks: `[PROJECT_ID].[DATASET].[MODEL]`; for example, `myproject.mydataset.mymodel`.
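For example, the following minimal statement creates a linear regression model by using a fully qualified model name. The project, dataset, table, and column names here are placeholders:

```sql
-- A minimal sketch; replace the names with your own project, dataset, and table.
CREATE MODEL `myproject.mydataset.mymodel`
  OPTIONS (model_type = 'LINEAR_REG', input_label_cols = ['label'])
AS SELECT feature1, feature2, label FROM `myproject.mydataset.training_table`;
```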
TRANSFORM
The TRANSFORM clause lets you specify all preprocessing during model creation and have that preprocessing automatically applied during prediction and evaluation.
For example, you can create the following model:
```sql
CREATE OR REPLACE MODEL `myproject.mydataset.mymodel`
  TRANSFORM(
    ML.FEATURE_CROSS(STRUCT(f1, f2)) AS cross_f,
    ML.QUANTILE_BUCKETIZE(f3) OVER() AS buckets,
    label_col)
  OPTIONS (model_type = 'linear_reg', input_label_cols = ['label_col'])
AS SELECT * FROM t
```
During prediction, you don't need to preprocess the input again, and the same transformations are automatically restored:
```sql
SELECT *
FROM ML.PREDICT(
  MODEL `myproject.mydataset.mymodel`,
  (SELECT f1, f2, f3 FROM table))
```
When the TRANSFORM clause is present, only the output columns of the TRANSFORM clause are used in training. Any results from query_statement that don't appear in the TRANSFORM clause are ignored.
The input columns of the TRANSFORM clause are the result of query_statement, so the final input used in training is the set of columns generated by the following query:
```sql
SELECT (select_list) FROM (query_statement);
```
Input columns of the TRANSFORM clause can be of any SIMPLE type or an ARRAY of a SIMPLE type. SIMPLE types are non-STRUCT and non-ARRAY data types.
During prediction (ML.PREDICT), you only need to pass in the original columns from query_statement that are used inside the TRANSFORM clause. The columns dropped in TRANSFORM don't need to be provided during prediction. TRANSFORM is automatically applied to the input data during prediction, including the statistics used in ML analytic functions (for example, ML.QUANTILE_BUCKETIZE).
To learn more about feature preprocessing, see Feature preprocessing overview, or try the Feature Engineering Functions notebook.
To try using the TRANSFORM clause, see the Use the BigQuery ML TRANSFORM clause for feature engineering tutorial or the Create Model With Inline Transpose notebook.
select_list
You can pass columns from query_statement through to model training without transformation by using `*`, `* EXCEPT()`, or by listing the column names directly.
Not all columns from query_statement are required to appear in the TRANSFORM clause, so you can drop columns appearing in query_statement by omitting them from the TRANSFORM clause.
You can transform inputs from query_statement by using expressions in select_list. select_list is similar to a normal SELECT statement. select_list supports the following syntax:
- `*`
- `* EXCEPT()`
- `* REPLACE()`
- expression
- expression.*

The following cannot appear inside select_list:
- Aggregation functions.
- Non-BigQuery ML analytic functions. For more information about supported functions, see Manual feature preprocessing.
- UDFs.
- Subqueries.
- Anonymous columns. For example, `a + b AS c` is allowed, while `a + b` isn't.
The output columns of select_list can be of any BigQuery supported data type.
If present, the following columns must appear in select_list without transformation:
- label
- data_split_col
- kmeans_init_col
- instance_weight_col

If these columns are returned by query_statement, you must reference them in select_list by column name outside of any expression, or by using `*`. You can't use aliases with these columns.
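The following sketch illustrates these rules with hypothetical column names: `* EXCEPT()` passes most columns through unchanged, the only expression is aliased, and the label column appears by name outside of any expression. ML.STANDARD_SCALER is one of the supported preprocessing functions:

```sql
CREATE OR REPLACE MODEL `mydataset.transform_demo`
  TRANSFORM(
    * EXCEPT(f4, label_col),                      -- pass f1, f2, f3 through unchanged
    ML.STANDARD_SCALER(f4) OVER() AS f4_scaled,   -- every expression must be aliased
    label_col)                                    -- label appears untransformed, with no alias
  OPTIONS (model_type = 'linear_reg', input_label_cols = ['label_col'])
AS SELECT f1, f2, f3, f4, label_col FROM `mydataset.t`;
```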
INPUT and OUTPUT
The INPUT and OUTPUT clauses specify the input and output format for remote models or XGBoost models.
field_name
For remote models, the INPUT and OUTPUT field names must be identical to the field names of the Vertex AI endpoint request and response. For examples, see the remote model INPUT and OUTPUT clause.
For XGBoost models, the INPUT field names must be identical to the names in the feature_names field if the feature_names field is populated in the XGBoost model file. For more details, see the XGBoost INPUT OUTPUT clause.
field_type
Remote models support the following BigQuery data types for the INPUT and OUTPUT clauses:
XGBoost models support the following BigQuery data types for the INPUT field type:
- Numeric type
- ARRAY<Numeric type>

XGBoost models only support FLOAT64 for the OUTPUT field type.
connection_name
BigQuery uses a CLOUD_RESOURCE connection to interact with your Vertex AI endpoint. You need to grant the Vertex AI User role to the connection's service account in your Vertex AI endpoint project. For examples, see the remote model CONNECTION statement.
To use a default connection, specify DEFAULT instead of the connection name.
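Putting the INPUT, OUTPUT, and connection clauses together, a remote model over a hosted Vertex AI endpoint might look like the following sketch. The connection, endpoint URL, and field names are placeholders and must match your endpoint's request and response schema:

```sql
-- A sketch of a remote model definition; all names and the endpoint are hypothetical.
CREATE OR REPLACE MODEL `myproject.mydataset.remote_model`
  INPUT (f1 FLOAT64, f2 STRING)
  OUTPUT (predicted_label STRING)
  REMOTE WITH CONNECTION `myproject.us.my_connection`
  OPTIONS (
    ENDPOINT = 'https://us-central1-aiplatform.googleapis.com/v1/projects/myproject/locations/us-central1/endpoints/1234');
```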
model_option_list
CREATE MODEL supports the following options:
MODEL_TYPE
Syntax
```sql
MODEL_TYPE = { 'LINEAR_REG' | 'LOGISTIC_REG' | 'KMEANS' | 'PCA' |
               'MATRIX_FACTORIZATION' | 'AUTOENCODER' | 'AUTOML_REGRESSOR' |
               'AUTOML_CLASSIFIER' | 'BOOSTED_TREE_CLASSIFIER' |
               'BOOSTED_TREE_REGRESSOR' | 'RANDOM_FOREST_CLASSIFIER' |
               'RANDOM_FOREST_REGRESSOR' | 'DNN_CLASSIFIER' | 'DNN_REGRESSOR' |
               'DNN_LINEAR_COMBINED_CLASSIFIER' | 'DNN_LINEAR_COMBINED_REGRESSOR' |
               'ARIMA_PLUS' | 'ARIMA_PLUS_XREG' | 'TENSORFLOW' | 'TENSORFLOW_LITE' |
               'ONNX' | 'XGBOOST' | 'CONTRIBUTION_ANALYSIS' }
```
Description
Specifies the model type. This argument is required.
Arguments
Specify one of the following model types:
- 'LINEAR_REG'
- 'LOGISTIC_REG'
- 'KMEANS'
- 'MATRIX_FACTORIZATION'
- 'PCA'
- 'AUTOENCODER'
- 'AUTOML_CLASSIFIER'
- 'AUTOML_REGRESSOR'
- 'BOOSTED_TREE_CLASSIFIER'
- 'BOOSTED_TREE_REGRESSOR'
- 'RANDOM_FOREST_CLASSIFIER'
- 'RANDOM_FOREST_REGRESSOR'
- 'DNN_CLASSIFIER'
- 'DNN_REGRESSOR'
- 'DNN_LINEAR_COMBINED_CLASSIFIER'
- 'DNN_LINEAR_COMBINED_REGRESSOR'
- 'ARIMA_PLUS' (previously 'ARIMA')
- 'ARIMA_PLUS_XREG'
- 'TENSORFLOW'
- 'TENSORFLOW_LITE'
- 'ONNX'
- 'XGBOOST'
- 'CONTRIBUTION_ANALYSIS'
Other model options
The following table provides a comprehensive list of model options, with a brief description of each and the model types it applies to. You can find a detailed description of each option in the model-specific CREATE MODEL statement, reachable by clicking the model type in the "Applied model types" column.
For supervised learning models, unless "regressor" or "classifier" is explicitly listed, a model option applies to both the regressor and the classifier. For example, "Boosted trees" means that the option applies to both the boosted tree regressor and the boosted tree classifier, while "Boosted tree classifier" means it applies only to the classifier.
Name | Description | Applied model types |
---|---|---|
MODEL_REGISTRY | Specifies the Model Registry destination. | All model types are supported. |
VERTEX_AI_MODEL_ID | The Vertex AI model ID to register the model with. | All model types are supported. |
VERTEX_AI_MODEL_VERSION_ALIASES | The Vertex AI model alias to register the model with. | All model types are supported. |
INPUT_LABEL_COLS | The label column names in the training data. | Linear & logistic regression, Boosted trees, Random forest, DNN, Wide & Deep, AutoML |
MAX_ITERATIONS | The maximum number of training iterations or steps. | Linear & logistic regression, Boosted trees, DNN, Wide & Deep, K-means, Matrix factorization, Autoencoder |
EARLY_STOP | Whether training should stop after the first iteration in which the relative loss improvement is less than the value specified for MIN_REL_PROGRESS. | Linear & logistic regression, Boosted trees, Random forest, DNN, Wide & Deep, K-means, Matrix factorization, Autoencoder |
MIN_REL_PROGRESS | The minimum relative loss improvement that is necessary to continue training when EARLY_STOP is set to true. | Linear & logistic regression, Boosted trees, Random forest, DNN, Wide & Deep, K-means, Matrix factorization, Autoencoder |
DATA_SPLIT_METHOD | The method to split input data into training and evaluation sets when not running hyperparameter tuning, or into training, evaluation, and test sets when running hyperparameter tuning. | Linear & logistic regression, Boosted trees, Random forest, DNN, Wide & Deep, Matrix factorization |
DATA_SPLIT_EVAL_FRACTION | Specifies the fraction of the data used for evaluation. Accurate to two decimal places. | Linear & logistic regression, Boosted trees, Random forest, DNN, Wide & Deep, Matrix factorization |
DATA_SPLIT_TEST_FRACTION | Specifies the fraction of the data used for testing when you are running hyperparameter tuning. Accurate to two decimal places. | Linear & logistic regression, Boosted trees, Random forest, DNN, Wide & Deep, Matrix factorization |
DATA_SPLIT_COL | Identifies the column used to split the data. | Linear & logistic regression, Boosted trees, Random forest, DNN, Wide & Deep, Matrix factorization |
OPTIMIZE_STRATEGY | The strategy to train linear regression models. | Linear regression |
L1_REG | The amount of L1 regularization applied. | Linear & logistic regression, Boosted trees, Random forest |
L2_REG | The amount of L2 regularization applied. | Linear & logistic regression, Boosted trees, Random forest, Matrix factorization, ARIMA_PLUS_XREG |
LEARN_RATE_STRATEGY | The strategy for specifying the learning rate during training. | Linear & logistic regression |
LEARN_RATE | The learning rate for gradient descent when LEARN_RATE_STRATEGY is set to CONSTANT. | Linear & logistic regression |
LS_INIT_LEARN_RATE | Sets the initial learning rate that LEARN_RATE_STRATEGY=LINE_SEARCH uses. | Linear & logistic regression |
WARM_START | Retrain a model with new training data, new model options, or both. | Linear & logistic regression, DNN, Wide & Deep, K-means, Autoencoder |
AUTO_CLASS_WEIGHTS | Whether to balance class labels using weights for each class in inverse proportion to the frequency of that class. | Logistic regression, Boosted tree classifier, Random forest classifier, DNN classifier, Wide & Deep classifier |
CLASS_WEIGHTS | The weights to use for each class label. This option can't be specified if AUTO_CLASS_WEIGHTS is specified. It takes an ARRAY of STRUCTs; each STRUCT is a (STRING, FLOAT64) pair representing a class label and the corresponding weight. A weight must be present for every class label. The weights are not required to add up to one. For example: CLASS_WEIGHTS = [STRUCT('example_label', .2)]. | Logistic regression, Boosted tree classifier, Random forest classifier, DNN classifier, Wide & Deep classifier |
INSTANCE_WEIGHT_COL | Identifies the column used to specify the weights for each data point in the training dataset. | Boosted trees, Random forest |
NUM_CLUSTERS | The number of clusters to identify in the input data. | K-means |
KMEANS_INIT_METHOD | The method of initializing the clusters. | K-means |
KMEANS_INIT_COL | Identifies the column used to initialize the centroids. | K-means |
DISTANCE_TYPE | The type of metric to compute the distance between two points. | K-means |
STANDARDIZE_FEATURES | Whether to standardize numerical features. | K-means |
BUDGET_HOURS | Sets the training budget, in hours. | AutoML |
OPTIMIZATION_OBJECTIVE | Sets the optimization objective function to use for AutoML. | AutoML |
MODEL_PATH | Specifies the location of the model to import. | Imported TensorFlow model, Imported TensorFlow Lite model, Imported ONNX model, Imported XGBoost model |
FEEDBACK_TYPE | Specifies the feedback type for matrix factorization models, which changes the algorithm that is used during training. | Matrix factorization |
NUM_FACTORS | Specifies the number of latent factors. | Matrix factorization |
USER_COL | The user column name. | Matrix factorization |
ITEM_COL | The item column name. | Matrix factorization |
RATING_COL | The rating column name. | Matrix factorization |
WALS_ALPHA | A hyperparameter for matrix factorization models with IMPLICIT feedback. | Matrix factorization |
BOOSTER_TYPE | For boosted tree models, specifies the booster type to use, with default value GBTREE. | Boosted trees |
NUM_PARALLEL_TREE | The number of parallel trees constructed during each iteration. | Boosted trees, Random forest |
DART_NORMALIZE_TYPE | The type of normalization algorithm for the DART booster. | Boosted trees |
TREE_METHOD | The type of tree construction algorithm. | Boosted trees, Random forest |
MIN_TREE_CHILD_WEIGHT | The minimum sum of instance weight needed in a child for further partitioning. | Boosted trees, Random forest |
COLSAMPLE_BYTREE | The subsample ratio of columns when constructing each tree. Subsampling occurs once for every tree constructed. | Boosted trees, Random forest |
COLSAMPLE_BYLEVEL | The subsample ratio of columns for each level. Subsampling occurs once for every new depth level reached in a tree. | Boosted trees, Random forest |
COLSAMPLE_BYNODE | The subsample ratio of columns for each node (split). Subsampling occurs once every time a new split is evaluated. | Boosted trees, Random forest |
MIN_SPLIT_LOSS | The minimum loss reduction required to make a further partition on a leaf node of the tree. | Boosted trees, Random forest |
MAX_TREE_DEPTH | The maximum depth of a tree. | Boosted trees, Random forest |
SUBSAMPLE | The subsample ratio of the training instances. | Boosted trees, Random forest |
ACTIVATION_FN | Specifies the activation function of the neural network. | DNN, Wide & Deep, Autoencoder |
BATCH_SIZE | Specifies the mini-batch size of samples that are fed to the neural network. | DNN, Wide & Deep, Autoencoder |
DROPOUT | Specifies the dropout rate of units in the neural network. | DNN, Wide & Deep, Autoencoder |
HIDDEN_UNITS | Specifies the hidden layers of the neural network. | DNN, Wide & Deep, Autoencoder |
OPTIMIZER | Specifies the optimizer for training the model. | DNN, Wide & Deep, Autoencoder |
TIME_SERIES_TIMESTAMP_COL | The timestamp column name for time series models. | ARIMA_PLUS, ARIMA_PLUS_XREG |
TIME_SERIES_DATA_COL | The data column name for time series models. | ARIMA_PLUS, ARIMA_PLUS_XREG |
TIME_SERIES_ID_COL | The ID column names for time series models. | ARIMA_PLUS, ARIMA_PLUS_XREG |
HORIZON | The number of time points to forecast. When forecasting multiple time series at once, this parameter applies to each time series. | ARIMA_PLUS, ARIMA_PLUS_XREG |
AUTO_ARIMA | Whether the training process should use auto.ARIMA. | ARIMA_PLUS, ARIMA_PLUS_XREG |
AUTO_ARIMA_MAX_ORDER | The maximum value for the sum of non-seasonal p and q. It controls the parameter search space in the auto.ARIMA algorithm. | ARIMA_PLUS, ARIMA_PLUS_XREG |
AUTO_ARIMA_MIN_ORDER | The minimum value for the sum of non-seasonal p and q. It controls the parameter search space in the auto.ARIMA algorithm. | ARIMA_PLUS, ARIMA_PLUS_XREG |
NON_SEASONAL_ORDER | The tuple of non-seasonal p, d, and q for the ARIMA_PLUS model. | ARIMA_PLUS, ARIMA_PLUS_XREG |
DATA_FREQUENCY | The data frequency of the input time series. | ARIMA_PLUS, ARIMA_PLUS_XREG |
FORECAST_LIMIT_LOWER_BOUND | The lower bound of the time series forecasting values. | ARIMA_PLUS |
FORECAST_LIMIT_UPPER_BOUND | The upper bound of the time series forecasting values. | ARIMA_PLUS |
INCLUDE_DRIFT | Whether the ARIMA_PLUS model should include a linear drift term. | ARIMA_PLUS, ARIMA_PLUS_XREG |
HOLIDAY_REGION | The geographical region based on which the holiday effect is applied in modeling. | ARIMA_PLUS, ARIMA_PLUS_XREG |
CLEAN_SPIKES_AND_DIPS | Whether the spikes and dips should be cleaned. | ARIMA_PLUS, ARIMA_PLUS_XREG |
ADJUST_STEP_CHANGES | Whether the step changes should be adjusted. | ARIMA_PLUS, ARIMA_PLUS_XREG |
DECOMPOSE_TIME_SERIES | Whether the separate components of both the history and the forecast parts of the time series (such as seasonal components) should be saved. | ARIMA_PLUS |
HIERARCHICAL_TIME_SERIES_COLS | The column names used to generate hierarchical time series forecasts. The column order represents the hierarchy structure. | ARIMA_PLUS |
ENABLE_GLOBAL_EXPLAIN | Specifies whether to compute global explanations by using explainable AI to evaluate global feature importance to the model. | Linear & logistic regression, Boosted trees, Random forest, DNN, Wide & Deep |
APPROX_GLOBAL_FEATURE_CONTRIB | Specifies whether to use fast approximation for feature contribution computation. | Boosted trees, Random forest |
INTEGRATED_GRADIENTS_NUM_STEPS | Specifies the number of steps to sample between the example being explained and its baseline for approximating the integral in integrated gradients attribution methods. | DNN, Wide & Deep |
CALCULATE_P_VALUES | Specifies whether to compute p-values for the model during training. | Linear & logistic regression |
FIT_INTERCEPT | Specifies whether to fit an intercept for the model during training. | Linear & logistic regression |
CATEGORY_ENCODING_METHOD | Specifies the default encoding method for categorical features. | Linear & logistic regression, Boosted trees |
ENDPOINT | Specifies the Vertex AI endpoint to use for a remote model. This can be the name of a Google model in Vertex AI or the HTTPS endpoint of a model deployed to Vertex AI. | Remote models over Google models in Vertex AI, Remote models over hosted models in Vertex AI |
REMOTE_SERVICE_TYPE | Specifies the Cloud AI service to use for a remote model. | Remote models over Cloud AI services |
XGBOOST_VERSION | Specifies the XGBoost version for model training. | Boosted trees, Random forest |
TF_VERSION | Specifies the TensorFlow (TF) version for model training. | DNN, Wide & Deep, Autoencoder |
NUM_TRIALS | Specifies the maximum number of submodels to train when you are running hyperparameter tuning. | Linear & logistic regression, Boosted trees, Random forest, DNN, Wide & Deep, K-means, Matrix factorization, Autoencoder |
MAX_PARALLEL_TRIALS | Specifies the maximum number of trials to run at the same time when you are running hyperparameter tuning. | Linear & logistic regression, Boosted trees, Random forest, DNN, Wide & Deep, K-means, Matrix factorization, Autoencoder |
HPARAM_TUNING_ALGORITHM | Specifies the algorithm used to tune the hyperparameters when you are running hyperparameter tuning. | Linear & logistic regression, Boosted trees, Random forest, DNN, Wide & Deep, K-means, Matrix factorization, Autoencoder |
HPARAM_TUNING_OBJECTIVES | Specifies the hyperparameter tuning objective for the model. | Linear & logistic regression, Boosted trees, Random forest, DNN, Wide & Deep, K-means, Matrix factorization, Autoencoder |
NUM_PRINCIPAL_COMPONENTS | The number of principal components to keep. | PCA |
PCA_EXPLAINED_VARIANCE_RATIO | The ratio for the explained variance. | PCA |
SCALE_FEATURES | Determines whether to scale the numerical features to unit variance. | PCA |
PCA_SOLVER | The solver to use to calculate the principal components. | PCA |
TIME_SERIES_LENGTH_FRACTION | The fraction of the interpolated length of the time series that's used to model the time series trend component. | ARIMA_PLUS, ARIMA_PLUS_XREG |
MIN_TIME_SERIES_LENGTH | The minimum number of time points that are used in modeling the trend component of the time series. | ARIMA_PLUS, ARIMA_PLUS_XREG |
MAX_TIME_SERIES_LENGTH | The maximum number of time points that are used in modeling the trend component of the time series. | ARIMA_PLUS, ARIMA_PLUS_XREG |
TREND_SMOOTHING_WINDOW_SIZE | The smoothing window size for the trend component. | ARIMA_PLUS, ARIMA_PLUS_XREG |
SEASONALITIES | The seasonality of the time series data; that is, variations that occur at regular intervals such as weekly, monthly, or quarterly. | ARIMA_PLUS |
PROMPT_COL | The name of the prompt column in the training data table to use when performing supervised tuning. | Remote models over Google models in Vertex AI |
LEARNING_RATE_MULTIPLIER | A multiplier to apply to the recommended learning rate when performing supervised tuning. | Remote models over Google models in Vertex AI |
EVALUATION_TASK | When performing supervised tuning, the type of task that you want to tune the model to perform. | Remote models over Google models in Vertex AI |
DOCUMENT_PROCESSOR | Identifies the document processor to use when the REMOTE_SERVICE_TYPE option value is CLOUD_AI_DOCUMENT_V1. | Remote models over Cloud AI services |
SPEECH_RECOGNIZER | Identifies the speech recognizer to use when the REMOTE_SERVICE_TYPE option value is CLOUD_AI_SPEECH_TO_TEXT_V2. | Remote models over Cloud AI services |
KMS_KEY_NAME | Specifies the Cloud Key Management Service customer-managed encryption key (CMEK) to use to encrypt the model. | Linear & logistic regression, Boosted trees, Random forest, DNN, Wide & Deep, AutoML, K-means, PCA, Autoencoder, Matrix factorization, ARIMA_PLUS, ARIMA_PLUS_XREG, ONNX, TensorFlow, TensorFlow Lite, XGBoost |
CONTRIBUTION_METRIC | The expression to use when performing contribution analysis. | Contribution analysis |
DIMENSION_ID_COLS | The names of the columns to use as dimensions when summarizing the contribution analysis metric. | Contribution analysis |
IS_TEST_COL | The name of the column to use to determine whether a given row is test data or control data. | Contribution analysis |
MIN_APRIORI_SUPPORT | The minimum apriori support threshold for including segments in the model output. | Contribution analysis |
TOP_K_INSIGHTS_BY_APRIORI_SUPPORT | The number of top insights by apriori support to include in the model output. | Contribution analysis |
PRUNING_METHOD | The pruning method to use for the contribution analysis model. | Contribution analysis |
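As a combined illustration, the following sketch sets several of these options for a boosted tree classifier; the table and column names are placeholders:

```sql
CREATE OR REPLACE MODEL `mydataset.boosted_tree_demo`
  OPTIONS (
    model_type = 'BOOSTED_TREE_CLASSIFIER',
    input_label_cols = ['label'],
    auto_class_weights = TRUE,          -- balance class labels by inverse frequency
    data_split_method = 'RANDOM',
    data_split_eval_fraction = 0.2,     -- hold out 20% of rows for evaluation
    max_iterations = 50)
AS SELECT * FROM `mydataset.training_table`;
```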
AS
All model types support the following AS clause syntax for specifying the training data:

```sql
AS query_statement
```

For time series forecasting models that have a DATA_FREQUENCY value of either DAILY or AUTO_FREQUENCY, you can optionally use the following AS clause syntax to perform custom holiday modeling in addition to specifying the training data:

```sql
AS (
  training_data AS (query_statement),
  custom_holiday AS (holiday_statement)
)
```
query_statement
The query_statement argument specifies the query that is used to generate the training data. For information about the supported SQL syntax of the query_statement clause, see GoogleSQL query syntax.
holiday_statement
The holiday_statement argument specifies the query that provides custom holiday modeling information for time series forecasting models. This query must return 50,000 rows or less and must contain the following columns:

- region: Required. A STRING value that identifies the region to target for holiday modeling. Use one of the following options:
  - An upper-case holiday region code. Use this option to overwrite or supplement the holidays for the specified region. You can see the holidays for a region by running `SELECT * FROM bigquery-public-data.ml_datasets.holidays_and_events_for_forecasting WHERE region = region`.
  - An arbitrary string. Use this option to specify a custom region that you want to model holidays for. For example, you could specify `London` if you are only modeling holidays for that city. Be sure not to use an existing holiday region code when you are trying to model a custom region. For example, if you want to model a holiday in California and specify `CA` as the region value, the service recognizes that as the holiday region code for Canada and targets that region. Because the argument is case-sensitive, you could specify `ca`, `California`, or some other value that isn't a holiday region code.
- holiday_name: Required. A STRING value that identifies the holiday to target for holiday modeling. Use one of the following options:
  - The holiday name as it is represented in the `bigquery-public-data.ml_datasets.holidays_and_events_for_forecasting` public table, including case. Use this option to overwrite or supplement the specified holiday.
  - A string that represents a custom holiday. The string must be a valid column name so that it can be used in ML.EXPLAIN_FORECAST output; for example, it can't contain spaces. For more information on column naming, see Column names.
- primary_date: Required. A DATE value that specifies the date the holiday falls on.
- preholiday_days: Optional. An INT64 value that specifies the start of the holiday window around the holiday that is taken into account when modeling. Must be greater than or equal to 1. Defaults to 1.
- postholiday_days: Optional. An INT64 value that specifies the end of the holiday window around the holiday that is taken into account when modeling. Must be greater than or equal to 1. Defaults to 1.

The preholiday_days and postholiday_days arguments together describe the holiday window around the holiday that is taken into account when modeling. The holiday window is defined as [primary_date - preholiday_days, primary_date + postholiday_days] and is inclusive of the pre- and post-holiday days. The value for each holiday window must be less than or equal to 30, and must be the same across all occurrences of a given holiday. For example, if you are modeling Arbor Day for several different years, you must specify the same holiday window for all of those years.
To achieve the best holiday modeling result, provide as much historical and forecast information about the occurrences of each included holiday as possible. For example, if you have time series data from 2018 to 2022 and would like to forecast for 2023, you get the best result by providing the custom holiday information for all of those years, similar to the following:
```sql
CREATE OR REPLACE MODEL `mydataset.arima_model`
  OPTIONS (
    model_type = 'ARIMA_PLUS',
    holiday_region = 'US',
    ...)
AS (
  training_data AS (
    SELECT * FROM `mydataset.timeseries_data`
  ),
  custom_holiday AS (
    SELECT
      'US' AS region,
      'Halloween' AS holiday_name,
      primary_date,
      5 AS preholiday_days,
      1 AS postholiday_days
    FROM
      UNNEST([
        DATE('2018-10-31'),
        DATE('2019-10-31'),
        DATE('2020-10-31'),
        DATE('2021-10-31'),
        DATE('2022-10-31'),
        DATE('2023-10-31')
      ]) AS primary_date
  )
)
```
Supported inputs
The CREATE MODEL statement supports the following data types for input label columns, data split columns, and input feature columns.
Supported input feature types
For the input feature types that BigQuery ML supports, see Supported input feature types.
Supported data types for input label columns
BigQuery ML supports different GoogleSQL data types depending on the model type. Supported data types for input_label_cols include:

Model type | Supported label types |
---|---|
Regression models | INT64, NUMERIC, BIGNUMERIC, FLOAT64 |
Classification models | Any groupable data type |
Supported data types for data split columns
BigQuery ML supports different GoogleSQL data types depending on the data split method. Supported data types for data_split_col include:

Data split method | Supported column types |
---|---|
CUSTOM | BOOL |
SEQ | INT64, NUMERIC, BIGNUMERIC, FLOAT64, TIMESTAMP |
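For example, a CUSTOM split reads a BOOL column from the query result. The following sketch uses placeholder names:

```sql
CREATE OR REPLACE MODEL `mydataset.custom_split_demo`
  OPTIONS (
    model_type = 'LINEAR_REG',
    input_label_cols = ['label'],
    data_split_method = 'CUSTOM',
    data_split_col = 'is_eval')   -- BOOL column; TRUE rows are used for evaluation
AS SELECT feature1, feature2, label, is_eval
FROM `mydataset.training_table`;
```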
Limitations
CREATE MODEL statements must comply with the following rules:
- Only one CREATE statement is allowed.
- When you use a CREATE MODEL statement, the size of the model must be 90 MB or less, or the query fails. Generally, if all categorical variables are short strings, a total feature cardinality (model dimension) of 5 to 10 million is supported. The dimensionality depends on the cardinality and length of the string variables.
- The label column can't contain NULL values. If the label column contains NULL values, then the query fails.
- The CREATE MODEL IF NOT EXISTS clause always updates the last modified timestamp of a model.
- Query statements used in the CREATE MODEL statement can't contain EXTERNAL_QUERY. If you want to use EXTERNAL_QUERY, materialize the query result first, and then use the CREATE MODEL statement with the newly created table.
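For example, to work around the EXTERNAL_QUERY limitation, you can materialize the federated result and then train on the new table. This sketch assumes a hypothetical Cloud SQL connection and source query:

```sql
-- Step 1: materialize the federated query result (connection and query are hypothetical).
CREATE OR REPLACE TABLE `mydataset.materialized_source` AS
SELECT *
FROM EXTERNAL_QUERY('myproject.us.my_cloudsql_connection', 'SELECT * FROM orders;');

-- Step 2: train the model on the materialized table.
CREATE OR REPLACE MODEL `mydataset.orders_model`
  OPTIONS (model_type = 'LINEAR_REG', input_label_cols = ['label'])
AS SELECT * FROM `mydataset.materialized_source`;
```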