This page shows you how to train a classification or regression model from a tabular dataset with Tabular Workflow for End-to-End AutoML.
Before you begin
Before you train a model, complete the following:
- Prepare your training data.
- Create a Vertex AI dataset.
- Enable the following APIs: Vertex AI, Dataflow, Compute Engine, Cloud Storage.
- Ensure that your project's service accounts have the necessary roles assigned to them. To view the service accounts and their associated roles, go to the IAM page and check the "Include Google-provided role grants" checkbox.
If you receive an error related to quotas while running Tabular Workflow for End-to-End AutoML, you might need to request a higher quota. To learn more, see Manage quotas for Tabular Workflows.
Get the URI of the previous hyperparameter tuning result
If you previously completed an End-to-End AutoML workflow run, use the hyperparameter tuning result from the previous run to save training time and resources. Find the previous hyperparameter tuning result by using the Google Cloud console or by loading it programmatically with the API.
Google Cloud console
To find the hyperparameter tuning result URI by using the Google Cloud console, perform the following steps:
- In the Google Cloud console, in the Vertex AI section, go to the Pipelines page.
- Select the Runs tab.
- Select the pipeline run you want to use.
- Select Expand Artifacts.
- Click component exit-handler-1.
- Click component stage_1_tuning_result_artifact_uri_empty.
- Find component automl-tabular-cv-trainer-2.
- Click the associated artifact tuning_result_output.
- Select the Node Info tab.
- Copy the URI for use in the Train a model step.
API: Python
The following sample code demonstrates how to load the hyperparameter tuning result by using the API. The variable `job` refers to the previous model training pipeline run.
```python
from typing import Any, Dict, List


def get_task_detail(
    task_details: List[Dict[str, Any]], task_name: str
) -> Dict[str, Any]:
    # Return the task detail whose name matches task_name.
    for task_detail in task_details:
        if task_detail.task_name == task_name:
            return task_detail


pipeline_task_details = job.gca_resource.job_detail.task_details

stage_1_tuner_task = get_task_detail(
    pipeline_task_details, "automl-tabular-stage-1-tuner"
)

stage_1_tuning_result_artifact_uri = (
    stage_1_tuner_task.outputs["tuning_result_output"].artifacts[0].uri
)
```
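If the `job` object from the previous run is no longer in memory, you can reload it by resource name before extracting the tuning result. A minimal sketch, where the project, region, and pipeline job ID are placeholders to replace with your own values:

```python
from google.cloud import aiplatform

aiplatform.init(project="your-project", location="us-central1")  # placeholders

# Reload the previous run so that job.gca_resource is populated.
job = aiplatform.PipelineJob.get(
    "projects/your-project/locations/us-central1/pipelineJobs/your-previous-run-id"
)
```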
Train a model
Google Cloud console
To train a model by using the Google Cloud console, perform the following steps:
- In the Google Cloud console, in the Vertex AI section, go to the Pipelines page.
- Select the Template Gallery tab.
- In the AutoML for Tabular Classification / Regression card, click Create run.
- In the Run details page, configure as follows:
  - Enter a pipeline run name.
  - Optional: If you want to set the Vertex AI Pipelines service account or the Dataflow worker service account, open the Advanced options. Learn more about service accounts.
  - Click Continue.
- In the Runtime configuration page, configure as follows:
  - Enter a Cloud Storage bucket or a folder within the bucket to use as the root output directory. This directory is used to save intermediate files, such as the materialized dataset and the model. Remember to clean up the directory after training is complete and the model and other important artifacts are copied to another Cloud Storage bucket. Alternatively, set a Time to Live (TTL) for the Cloud Storage bucket. The buckets for your project are listed in the Cloud Storage section of the Google Cloud console.
  - Click Continue.
- In the Training method page, configure as follows:
  - Select the name of the dataset you want to use to train your model.
  - Select your target column. The target column is the value that the model predicts. Learn more about target column requirements.
  - Enter the display name for your new model.
  - Optional: To choose how to split the data between the training, validation, and test sets, open the Advanced options. You can choose between the following data split options:
    - Random (Default): Vertex AI randomly selects the rows associated with each of the data sets. By default, Vertex AI selects 80% of your data rows for the training set, 10% for the validation set, and 10% for the test set. Set the percentage of data rows that you want to be associated with each of the data sets.
    - Manual: Vertex AI selects data rows for each of the data sets based on the values in a data split column. Provide the name of the data split column.
    - Chronological: Vertex AI splits data based on the timestamp in a time column. Provide the name of the time column. You can also set the percentage of data rows that you want to be associated with the training set, the validation set, and the test set.
    - Stratified: Vertex AI randomly selects the rows associated with each of the data sets, but preserves the distribution of target column values. Provide the name of the target column. You can also set the percentage of data rows that you want to be associated with the training set, the validation set, and the test set.
  - Optional: You can run the pipeline without the architecture search. If you choose Skip architecture search, you are prompted to provide a set of hyperparameters from a previous pipeline run on the Training options page.
  - Click Continue.
- In the Training options page, configure as follows:
  - Optional: Click Generate statistics. Generating statistics populates the Transformation dropdown menus.
  - Review your column list and exclude from training any columns that should not be used to train the model.
  - Review the transformations selected for your included features, along with whether invalid data is allowed, and make any required updates. Learn more about transformations and invalid data.
  - If you chose to skip the architecture search on the Training method page, provide the path to the hyperparameter tuning result from a previous pipeline run.
  - Optional: If you want to specify the weight column, open the Advanced options and make your selection. Learn more about weight columns.
  - Optional: If you want to change your optimization objective from the default, open the Advanced options and make your selection. Learn more about optimization objectives.
  - Optional: If you choose to perform the architecture search on the Training method page, you can specify the number of parallel trials. Open the Advanced options and enter your value.
  - Optional: You can provide fixed values for a subset of the hyperparameters, and Vertex AI searches for the optimal values of the remaining unfixed hyperparameters. This option is a good choice if you have a strong preference for the model type; you can choose between neural networks and boosted trees. Open the Advanced options and provide a study spec override in JSON format. For example, to set the model type to Neural Networks (NN), enter the following:
    [ { "parameter_id": "model_type", "categorical_value_spec": { "values": ["nn"] } } ]
  - Click Continue.
- In the Compute and pricing page, configure as follows:
  - Enter the maximum number of hours you want your model to train for. Learn more about pricing.
  - Optional: In the Compute Settings section, you can configure the machine types and the number of machines for each stage of the workflow. This option is a good choice if you have a large dataset and want to optimize the machine hardware accordingly.
- Click Submit.
API: Python
The following sample code demonstrates how to run a model training pipeline:
```python
job = aiplatform.PipelineJob(
    ...
    template_path=template_path,
    parameter_values=parameter_values,
    ...
)
job.run(service_account=SERVICE_ACCOUNT)
```
The optional `service_account` parameter in `job.run()` lets you set the Vertex AI Pipelines service account to an account of your choice.
The pipeline and the parameter values are defined by the following function. The training data can be either a CSV file in Cloud Storage or a table in BigQuery.
```python
template_path, parameter_values = automl_tabular_utils.get_automl_tabular_pipeline_and_parameters(...)
```
The following is a subset of the `get_automl_tabular_pipeline_and_parameters` parameters:
Parameter name | Type | Definition |
---|---|---|
`data_source_csv_filenames` | String | A URI for a CSV stored in Cloud Storage. |
`data_source_bigquery_table_path` | String | A URI for a BigQuery table. |
`dataflow_service_account` | String | (Optional) Custom service account to run Dataflow jobs. The Dataflow job can be configured to use private IPs and a specific VPC subnet. This parameter acts as an override for the default Dataflow worker service account. |
`prediction_type` | String | Choose `classification` to train a classification model or `regression` to train a regression model. |
`optimization_objective` | String | If you are training a binary classification model, the default objective is AUC ROC. If you are training a regression model, the default objective is RMSE. If you want a different optimization objective for your model, choose one of the options in Optimization objectives for classification or regression models. |
`enable_probabilistic_inference` | Boolean | If you are training a regression model and you set this value to `true`, Vertex AI models the probability distribution of the inference. Probabilistic inference can improve model quality by handling noisy data and quantifying uncertainty. If `quantiles` are specified, then Vertex AI also returns the quantiles of the distribution. |
`quantiles` | List[float] | Quantiles to use for probabilistic inference. A quantile indicates the likelihood that a target is less than a given value. Provide a list of up to five unique numbers between `0` and `1`, exclusive. |
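To see how these parameters fit together, the following sketch configures a regression pipeline that reads from BigQuery and enables probabilistic inference. The import path for `automl_tabular_utils`, the `transformations` and `train_budget_milli_node_hours` arguments, and all project-specific values are assumptions to verify against your installed version of `google-cloud-pipeline-components`:

```python
from google.cloud import aiplatform
# Assumption: the import path may vary across google-cloud-pipeline-components versions.
from google_cloud_pipeline_components.v1.automl.tabular import utils as automl_tabular_utils

aiplatform.init(project="your-project", location="us-central1")  # placeholders

template_path, parameter_values = automl_tabular_utils.get_automl_tabular_pipeline_and_parameters(
    project="your-project",                                   # placeholder
    location="us-central1",                                   # placeholder
    root_dir="gs://your-bucket/automl",                       # placeholder root output directory
    target_column="target",                                   # placeholder
    prediction_type="regression",
    optimization_objective="minimize-rmse",
    transformations="gs://your-bucket/transformations.json",  # assumed column transformation spec
    train_budget_milli_node_hours=1000,                       # assumed budget parameter (1 node hour)
    data_source_bigquery_table_path="bq://your-project.your_dataset.your_table",  # placeholder
    enable_probabilistic_inference=True,
    quantiles=[0.1, 0.5, 0.9],  # up to five unique values in (0, 1)
)

job = aiplatform.PipelineJob(
    display_name="automl-tabular-regression",  # placeholder
    template_path=template_path,
    parameter_values=parameter_values,
)
job.run()
```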
Workflow customization options
You can customize the End-to-End AutoML workflow by defining argument values that are passed in during pipeline definition. You can customize your workflow in the following ways:
- Override search space
- Configure hardware
- Distill the model
- Skip architecture search
Override search space
The following `get_automl_tabular_pipeline_and_parameters` parameter lets you provide fixed values for a subset of the hyperparameters. Vertex AI searches for the optimal values of the remaining unfixed hyperparameters. Use this parameter if you want to choose between neural networks and boosted trees for your model type.
Parameter name | Type | Definition |
---|---|---|
`study_spec_parameters_override` | List[Dict[String, Any]] | (Optional) Custom subset of hyperparameters. This parameter configures the `automl-tabular-stage-1-tuner` component of the pipeline. |
The following code demonstrates how to set the model type to Neural Networks (NN):
```python
study_spec_parameters_override = [
    {
        "parameter_id": "model_type",
        "categorical_value_spec": {
            # The default value is ["nn", "boosted_trees"]; this reduces the search space.
            "values": ["nn"]
        },
    }
]
```
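To apply the override, pass it to the parameters function. A sketch using `study_spec_parameters_override`, the parameter documented in the table above; the `...` elision follows the style of the earlier samples:

```python
template_path, parameter_values = automl_tabular_utils.get_automl_tabular_pipeline_and_parameters(
    ...,  # the same required arguments as in your base training run
    study_spec_parameters_override=study_spec_parameters_override,
)
```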
Configure hardware
The following `get_automl_tabular_pipeline_and_parameters` parameters let you configure the machine types and the number of machines for training. This option is a good choice if you have a large dataset and want to optimize the machine hardware accordingly.
Parameter name | Type | Definition |
---|---|---|
`stage_1_tuner_worker_pool_specs_override` | Dict[String, Any] | (Optional) Custom configuration of the machine types and the number of machines for training. This parameter configures the `automl-tabular-stage-1-tuner` component of the pipeline. |
`cv_trainer_worker_pool_specs_override` | Dict[String, Any] | (Optional) Custom configuration of the machine types and the number of machines for training. This parameter configures the `automl-tabular-cv-trainer` component of the pipeline. |
The following code demonstrates how to set the `n1-standard-8` machine type for the TensorFlow chief node and the `n1-standard-4` machine type for the TensorFlow evaluator node:
```python
worker_pool_specs_override = [
    {"machine_spec": {"machine_type": "n1-standard-8"}},  # override for TF chief node
    {},  # override for TF worker node; it's not used, so leave it empty
    {},  # override for TF ps node; it's not used, so leave it empty
    {
        "machine_spec": {
            "machine_type": "n1-standard-4"  # override for TF evaluator node
        }
    },
]
```
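The override list is then passed through the parameters described above. A sketch in the same elided style as the other samples; here the same list shape is reused for both components:

```python
template_path, parameter_values = automl_tabular_utils.get_automl_tabular_pipeline_and_parameters(
    ...,  # the same required arguments as in your base training run
    stage_1_tuner_worker_pool_specs_override=worker_pool_specs_override,
    cv_trainer_worker_pool_specs_override=worker_pool_specs_override,
)
```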
Distill the model
The following `get_automl_tabular_pipeline_and_parameters` parameter lets you create a smaller version of the ensemble model. A smaller model reduces latency and cost for inference.
Parameter name | Type | Definition |
---|---|---|
`run_distillation` | Boolean | If `TRUE`, creates a smaller version of the ensemble model. |
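For example, a sketch that turns distillation on, with the other arguments elided as in the samples above:

```python
template_path, parameter_values = automl_tabular_utils.get_automl_tabular_pipeline_and_parameters(
    ...,  # the same required arguments as in your base training run
    run_distillation=True,  # build a smaller, lower-latency version of the ensemble
)
```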
Skip architecture search
The following `get_automl_tabular_pipeline_and_parameters` parameter lets you run the pipeline without the architecture search and provide a set of hyperparameters from a previous pipeline run instead.
Parameter name | Type | Definition |
---|---|---|
`stage_1_tuning_result_artifact_uri` | String | (Optional) URI of the hyperparameter tuning result from a previous pipeline run. |
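This parameter pairs with the URI retrieved in the Get the URI of the previous hyperparameter tuning result section. A sketch, again eliding the base arguments:

```python
template_path, parameter_values = automl_tabular_utils.get_automl_tabular_pipeline_and_parameters(
    ...,  # the same required arguments as in your base training run
    stage_1_tuning_result_artifact_uri=stage_1_tuning_result_artifact_uri,  # URI from the previous run
)
```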
Optimization objectives for classification or regression models
When you train a model, Vertex AI selects a default optimization objective based on your model type and the data type used for your target column.
The following optimization objectives are available for classification models:

Optimization objective | API value | Use this objective if you want to... |
---|---|---|
AUC ROC | `maximize-au-roc` | Maximize the area under the receiver operating characteristic (ROC) curve. Distinguishes between classes. Default value for binary classification. |
Log loss | `minimize-log-loss` | Keep inference probabilities as accurate as possible. Only supported objective for multi-class classification. |
AUC PR | `maximize-au-prc` | Maximize the area under the precision-recall curve. Optimizes results for inferences on the less common class. |
Precision at Recall | `maximize-precision-at-recall` | Optimize precision at a specific recall value. |
Recall at Precision | `maximize-recall-at-precision` | Optimize recall at a specific precision value. |
The following optimization objectives are available for regression models:

Optimization objective | API value | Use this objective if you want to... |
---|---|---|
RMSE | `minimize-rmse` | Minimize root-mean-squared error (RMSE). Captures more extreme values accurately. Default value. |
MAE | `minimize-mae` | Minimize mean-absolute error (MAE). Views extreme values as outliers, with less impact on the model. |
RMSLE | `minimize-rmsle` | Minimize root-mean-squared log error (RMSLE). Penalizes error on relative size rather than absolute value. Useful when both predicted and actual values can be quite large. |
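For example, to favor the less common class in a binary classification run, you might pass the AUC PR objective from the table above, in the same elided style as the earlier sketches:

```python
template_path, parameter_values = automl_tabular_utils.get_automl_tabular_pipeline_and_parameters(
    ...,  # the same required arguments as in your base training run
    prediction_type="classification",
    optimization_objective="maximize-au-prc",  # optimize for the less common class
)
```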
What's next
- Learn about online inferences for classification and regression models.
- Learn about batch inferences for classification and regression models.
- Learn about pricing for model training.