Visualizing jobs with Vertex AI TensorBoard

If you're interested in Vertex AI Managed Training, contact your sales representative for access.

With Managed Training, you can visualize your training logs in near real-time using Vertex AI TensorBoard. Simply configure your workload to save logs to a Cloud Storage bucket, and they will be automatically streamed to the TensorBoard interface for analysis.

Prerequisites

Before you begin, ensure you have the following:

  • A running Managed Training cluster.
  • A Cloud Storage bucket to store your TensorBoard logs. This bucket must be in the same region as your TensorBoard instance. For setup instructions, see Create a Cloud Storage bucket.
  • A Vertex AI TensorBoard instance. For creation instructions, see Create a Vertex AI TensorBoard instance.
  • The correct IAM permissions. To allow Cloud Storage FUSE to read from and write to the storage bucket, the service account used by your cluster's VMs requires the Storage Object User (roles/storage.objectUser) role.

Enabling TensorBoard upload

To configure the TensorBoard integration for your job, pass the following arguments using the --extra flag in your Slurm job submission:

  • tensorboard_base_output_dir: Specifies the Cloud Storage path to upload logs to. For example, gs://my-bucket/my-logs.

  • tensorboard_url: Specifies the Vertex AI TensorBoard instance, experiment, or run URL. If only an instance is provided, a new experiment and run are created. If omitted, the default TensorBoard instance for the project is used. For example, projects/123/locations/us-central1/tensorboards/456.

Example

  # Use a specific TensorBoard instance
  sbatch --extra="tensorboard_base_output_dir=<your-cloud-storage-dir>,tensorboard_url=projects/<project-id>/locations/<location>/tensorboards/<tensorboard-instance-id>" your_script.sbatch

Writing logs from your training job

Within your training script, access the AIP_TENSORBOARD_LOG_DIR environment variable. This variable provides the unique Cloud Storage path where your script should write its TensorBoard logs.

The path follows this structure:

 gs://<your-cloud-storage-path>/<cluster-id>-<cluster-uuid>/tensorboard/job-<job-id>/ 
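Reading the variable is a one-liner. The sketch below is a minimal, hypothetical illustration; the /tmp fallback path is an arbitrary choice for local testing outside the cluster, not part of the Managed Training contract:

```python
import os

# Read the log directory injected by Managed Training; fall back to a
# local directory when running outside the cluster (e.g., for local testing).
log_dir = os.environ.get("AIP_TENSORBOARD_LOG_DIR", "/tmp/tb-logs")
print(f"TensorBoard logs will be written to: {log_dir}")
```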

The following example shows a complete workflow with two key components: the Slurm submission script that configures the job, and the Python training script that reads the environment variable to write its logs.

Slurm Job Script (simple_job.sbatch):

  #!/bin/bash
  #SBATCH --job-name=tensorboard-simple-test
  #SBATCH --output=tensorboard-simple-test-%j.out

  # Activate your Python virtual environment if needed
  # source /path/to/your/venv/bin/activate

  python3 simple_logger.py

Python Script (simple_logger.py):

  import os

  import tensorflow as tf

  # Get the log directory from the environment variable
  log_dir = os.environ.get("AIP_TENSORBOARD_LOG_DIR")
  print(f"Writing TensorBoard logs to: {log_dir}")

  writer = tf.summary.create_file_writer(log_dir)
  with writer.as_default():
      for step in range(10):
          # Simulate some metrics
          loss = 1.0 - (step * 0.1)
          accuracy = 0.6 + (step * 0.04)

          # Log the metrics
          tf.summary.scalar('loss', loss, step=step)
          tf.summary.scalar('accuracy', accuracy, step=step)
          writer.flush()
          print(f"Step {step}: loss={loss:.4f}, accuracy={accuracy:.4f}")

  writer.close()
  print(f"--- Finished writing metrics to {log_dir} ---")

Real-time Log Synchronization

To visualize metrics from a running job, you must periodically close and recreate the summary writer in your training code. This is necessary because gcsfuse only syncs log files to Cloud Storage once they are closed. This "flushing" technique ensures that intermediate results are visible in the TensorBoard console before the job completes.
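The close-and-recreate pattern described above can be sketched as follows. This is a minimal illustration, not the library's prescribed API usage; flush_every is a hypothetical interval you would tune for your own job:

```python
import os

import tensorflow as tf

log_dir = os.environ.get("AIP_TENSORBOARD_LOG_DIR", "/tmp/tb-logs")
flush_every = 100  # steps between writer recreations; tune for your job

writer = tf.summary.create_file_writer(log_dir)
for step in range(1000):
    loss = 1.0 / (step + 1)  # placeholder metric
    with writer.as_default():
        tf.summary.scalar('loss', loss, step=step)
    # gcsfuse only uploads a log file once it is closed, so periodically
    # close the writer and open a fresh one to make intermediate results
    # visible in TensorBoard before the job finishes.
    if (step + 1) % flush_every == 0:
        writer.close()
        writer = tf.summary.create_file_writer(log_dir)
writer.close()
```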

Viewing Vertex AI TensorBoard

Once your job is submitted, you can monitor its progress on the Vertex AI Experiments page in the Google Cloud console.
