Run training job with experiment tracking

Vertex AI provides a managed training service that lets you operationalize large-scale model training. You can enable experiment tracking using the Vertex AI SDK for Python to capture parameters and performance metrics when you submit a custom training job.

This feature isn't available when you:

  • submit a training job through the Google Cloud console or the Google Cloud CLI,
  • use a TPU in the training job,
  • use distributed training in the training job.

Both prebuilt training containers and custom containers are supported. Required: a google-cloud-aiplatform (Vertex AI SDK for Python) version higher than 1.24.1 must be installed. If you are training with TensorFlow, ensure that a protobuf version lower than 4.0 is installed to avoid conflicts.
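As a quick sanity check before submitting a job (not part of the official samples; it assumes the packaging library is available in your environment), you can verify the installed versions match the requirements above:

# Hypothetical pre-flight check for the version requirements described above.
from importlib.metadata import version
from packaging.version import Version  # assumes the 'packaging' package is installed

assert Version(version("google-cloud-aiplatform")) > Version("1.24.1")
# Only relevant when training with TensorFlow:
assert Version(version("protobuf")) < Version("4.0")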

There are two options for logging data to Vertex AI Experiments: autologging and manual logging.

Autologging is recommended if you are using one of these supported frameworks: Fastai, Gluon, Keras, LightGBM, PyTorch Lightning, scikit-learn, Spark, Statsmodels, XGBoost. If your framework isn't supported, or there are custom metrics you want to log to your experiment run, you can manually adapt your training script to log parameters, metrics, and artifacts.

Autolog data

To enable autologging, set enable_autolog=True (see from_local_script). Creating an experiment run is optional: if an experiment run name isn't specified, one is created for you.

The Vertex AI SDK for Python handles creating ExperimentRun resources for you.

Python

from typing import Optional

from google.cloud import aiplatform


def create_custom_job_with_experiment_autologging_sample(
    project: str,
    location: str,
    staging_bucket: str,
    display_name: str,
    script_path: str,
    container_uri: str,
    service_account: str,
    experiment: str,
    experiment_run: Optional[str] = None,
) -> None:
    aiplatform.init(
        project=project,
        location=location,
        staging_bucket=staging_bucket,
        experiment=experiment,
    )

    job = aiplatform.CustomJob.from_local_script(
        display_name=display_name,
        script_path=script_path,
        container_uri=container_uri,
        enable_autolog=True,
    )

    job.run(
        service_account=service_account,
        experiment=experiment,
        experiment_run=experiment_run,
    )
 
  • project : Your project ID. You can find these project IDs in the Google Cloud console welcome page.
  • location : See the List of available locations.
  • staging_bucket : The name you gave your bucket, for example, my_bucket.
  • display_name : The user-defined name of the CustomJob.
  • script_path : The path, relative to the working directory on your local file system, to the script that is the entry point for your training code.
  • container_uri : The URI of the training container image. It can be a Vertex AI prebuilt training container or a custom container.
  • service_account : See Create a service account with required permissions.
  • experiment : Provide a name for your experiment. The experiment must have a TensorBoard instance. You can find your list of experiments in the Google Cloud console by selecting Experiments in the section nav.
  • experiment_run : (Optional) Specify a run name. If not specified, a run is auto-created.
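
For illustration, a hypothetical call to the sample function above could look like the following. All values here are placeholders (project, bucket, script, service account, and container URI are not taken from this guide), so substitute your own:

create_custom_job_with_experiment_autologging_sample(
    project="my-project",                      # placeholder project ID
    location="us-central1",
    staging_bucket="gs://my_bucket",
    display_name="autolog-training-job",
    script_path="task.py",                     # local entry-point script
    container_uri="us-docker.pkg.dev/vertex-ai/training/sklearn-cpu.1-0:latest",  # example prebuilt container; check the current list
    service_account="training-sa@my-project.iam.gserviceaccount.com",
    experiment="my-experiment",
    experiment_run="run-1",
)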

Manually log data

Use the manual logging option by adapting your training script.

Here's how to change the training script:

import os
import pickle

import pandas as pd
from sklearn.linear_model import LinearRegression

# To use manual logging APIs, import aiplatform
from google.cloud import aiplatform

# Create Dataset
data = {'A': [1.1, 2.2, 4.1, 5.2],
        'B': [200, 212.12, 22, 123],
        'Y': [1, 0, 1, 0]}
df = pd.DataFrame(data)
X = df[['A', 'B']]
Y = df['Y']

# Train model
model = LinearRegression().fit(X, Y)

# Save the model to GCS (AIP_MODEL_DIR is set by Vertex AI and is
# accessible through the /gcs/ Cloud Storage FUSE mount)
model_dir = os.getenv('AIP_MODEL_DIR')
model_gcs = model_dir.replace('gs://', '/gcs/')
model_name = 'model.joblib'
os.makedirs(model_gcs, exist_ok=True)
with open(os.path.join(model_gcs, model_name), 'wb') as f:
    pickle.dump(model, f)

# Call aiplatform's logging APIs to save data to Vertex AI Experiments.
params = model.get_params()
aiplatform.log_params(params)

metrics = {"training_accuracy": model.score(X, Y)}
aiplatform.log_metrics(metrics)
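
Beyond one-time parameters and summary metrics, the same script can also log per-step metrics. The sketch below is illustrative rather than part of the sample above, and the metric values are dummies; log_time_series_metrics requires the experiment run to be backed by a Vertex AI TensorBoard instance.

# Illustrative only: log a metric per training epoch/step.
# Requires a TensorBoard-backed experiment run.
for epoch in range(3):
    aiplatform.log_time_series_metrics({"train_mse": 0.1 / (epoch + 1)}, step=epoch)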
 

Creating an experiment run is optional: if an experiment run name isn't specified, one is created for you.

To learn more, see Manually log data to an experiment run.

Python

from typing import Optional

from google.cloud import aiplatform


def create_custom_job_with_experiment_sample(
    project: str,
    location: str,
    staging_bucket: str,
    display_name: str,
    script_path: str,
    container_uri: str,
    service_account: str,
    experiment: str,
    experiment_run: Optional[str] = None,
) -> None:
    aiplatform.init(
        project=project,
        location=location,
        staging_bucket=staging_bucket,
        experiment=experiment,
    )

    job = aiplatform.CustomJob.from_local_script(
        display_name=display_name,
        script_path=script_path,
        container_uri=container_uri,
    )

    job.run(
        service_account=service_account,
        experiment=experiment,
        experiment_run=experiment_run,
    )
 
  • project : Your project ID. You can find these project IDs in the Google Cloud console welcome page.
  • location : See the List of available locations.
  • staging_bucket : The name you gave your bucket, for example, my_bucket.
  • display_name : The user-defined name of the CustomJob.
  • script_path : The path, relative to the working directory on your local file system, to the script that is the entry point for your training code.
  • container_uri : The URI of the training container image. It can be a Vertex AI prebuilt training container or a custom container. If you are using a custom container, make sure google-cloud-aiplatform>=1.24.0 is installed in it.
  • service_account : See Create a service account with required permissions.
  • experiment : Provide a name for your experiment. You can find your list of experiments in the Google Cloud console by selecting Experiments in the section nav.
  • experiment_run : Specify a run name. If not specified, a run is auto-created.
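
For illustration, a hypothetical call (placeholder values, as in the autologging sample) that omits experiment_run and lets the SDK auto-create the run:

create_custom_job_with_experiment_sample(
    project="my-project",                 # placeholder values throughout
    location="us-central1",
    staging_bucket="gs://my_bucket",
    display_name="manual-log-training-job",
    script_path="task.py",
    container_uri="us-docker.pkg.dev/vertex-ai/training/sklearn-cpu.1-0:latest",
    service_account="training-sa@my-project.iam.gserviceaccount.com",
    experiment="my-experiment",
    # experiment_run omitted: a run is auto-created
)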

View autologged parameters and metrics

Use the Vertex AI SDK for Python to compare runs and retrieve run data. The Google Cloud console also provides an easy way to compare these runs.
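
For example, a minimal sketch for pulling logged data back with the SDK might look like the following; the project, experiment, and run names are the placeholders used earlier, not values from this guide. get_experiment_df returns one row per run.

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# One row per experiment run, with logged params and metrics as columns.
runs_df = aiplatform.get_experiment_df(experiment="my-experiment")
print(runs_df)

# Or inspect a single run directly.
run = aiplatform.ExperimentRun(run_name="run-1", experiment="my-experiment")
print(run.get_params())
print(run.get_metrics())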

What's next

Relevant notebook sample
