Run custom training jobs on a persistent resource

This page shows you how to run a custom training job on a persistent resource by using the Google Cloud CLI, the Vertex AI SDK for Python, or the REST API.

Normally, when you create a custom training job, you specify the compute resources that the job creates and runs on. After you create a persistent resource, you can instead configure the custom training job to run on one or more resource pools of that persistent resource. Running a custom training job on a persistent resource significantly reduces the job startup time that's otherwise needed for compute resource creation.

Required roles

To get the permission that you need to run custom training jobs on a persistent resource, ask your administrator to grant you the Vertex AI User (roles/aiplatform.user) IAM role on your project. For more information about granting roles, see Manage access to projects, folders, and organizations.

This predefined role contains the aiplatform.customJobs.create permission, which is required to run custom training jobs on a persistent resource.

You might also be able to get this permission with custom roles or other predefined roles.
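As a sketch, the role grant described above can be made with the gcloud CLI; PROJECT_ID and PRINCIPAL below are placeholders for your own values:

```shell
# Grant the Vertex AI User role on the project.
# PROJECT_ID and PRINCIPAL (for example, user:your-name@example.com)
# are placeholders; replace them with your own values.
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="PRINCIPAL" \
  --role="roles/aiplatform.user"
```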

Create a training job that runs on a persistent resource

To create a custom training job that runs on a persistent resource, make the following modifications to the standard instructions for creating a custom training job:

gcloud

  • Specify the --persistent-resource-id flag and set the value to the ID of the persistent resource ( PERSISTENT_RESOURCE_ID ) that you want to use.
  • Specify the --worker-pool-spec flag such that the values for machine-type and disk-type match exactly with a corresponding resource pool from the persistent resource. Specify one --worker-pool-spec for single-node training and multiple for distributed training.
  • Specify a replica-count less than or equal to the replica-count or max-replica-count of the corresponding resource pool.
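Putting the flags above together, a job creation command might look like the following sketch; the capitalized values and the machine type are placeholders, and the machine type must exactly match a resource pool of the persistent resource:

```shell
# Sketch of creating a custom job on a persistent resource.
# PROJECT_ID, LOCATION, DISPLAY_NAME, PERSISTENT_RESOURCE_ID, and
# CONTAINER_URI are placeholders; replace them with your own values.
gcloud ai custom-jobs create \
  --project=PROJECT_ID \
  --region=LOCATION \
  --display-name=DISPLAY_NAME \
  --persistent-resource-id=PERSISTENT_RESOURCE_ID \
  --worker-pool-spec=replica-count=1,machine-type=n1-standard-4,container-image-uri=CONTAINER_URI
```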

Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.

from typing import Optional

from google.cloud import aiplatform


def create_custom_job_on_persistent_resource_sample(
    project: str,
    location: str,
    staging_bucket: str,
    display_name: str,
    container_uri: str,
    persistent_resource_id: str,
    service_account: Optional[str] = None,
) -> None:
    aiplatform.init(
        project=project, location=location, staging_bucket=staging_bucket
    )

    worker_pool_specs = [{
        "machine_spec": {
            "machine_type": "n1-standard-4",
            "accelerator_type": "NVIDIA_TESLA_K80",
            "accelerator_count": 1,
        },
        "replica_count": 1,
        "container_spec": {
            "image_uri": container_uri,
            "command": [],
            "args": [],
        },
    }]

    custom_job = aiplatform.CustomJob(
        display_name=display_name,
        worker_pool_specs=worker_pool_specs,
        persistent_resource_id=persistent_resource_id,
    )

    custom_job.run(service_account=service_account)

REST

  • Specify the persistent_resource_id parameter and set the value to the ID of the persistent resource ( PERSISTENT_RESOURCE_ID ) that you want to use.
  • Specify the worker_pool_specs parameter such that the values of machine_spec and disk_spec for each resource pool match exactly with a corresponding resource pool from the persistent resource. Specify one machine_spec for single-node training and multiple for distributed training.
  • Specify a replica_count less than or equal to the replica_count or max_replica_count of the corresponding resource pool, excluding the replica count of any other jobs running on that resource pool.
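The matching rules above can be sketched as a small validation helper. This is a hypothetical illustration, not part of the Vertex AI SDK or API; it only shows the machine-type and replica-count checks described in the bullets:

```python
# Hypothetical helper illustrating the matching rules above; it is NOT
# part of the Vertex AI SDK. A worker pool spec is compatible with a
# resource pool when the machine type matches exactly and the requested
# replicas fit within the pool's replica_count (or max_replica_count,
# if the pool autoscales).
def fits_resource_pool(worker_pool_spec: dict, resource_pool: dict) -> bool:
    spec_machine = worker_pool_spec["machine_spec"]["machine_type"]
    pool_machine = resource_pool["machine_spec"]["machine_type"]
    if spec_machine != pool_machine:
        return False
    capacity = resource_pool.get("max_replica_count",
                                 resource_pool["replica_count"])
    return worker_pool_spec["replica_count"] <= capacity


pool = {
    "machine_spec": {"machine_type": "n1-standard-4"},
    "replica_count": 2,
    "max_replica_count": 4,
}
spec = {"machine_spec": {"machine_type": "n1-standard-4"}, "replica_count": 3}
print(fits_resource_pool(spec, pool))  # prints True: 3 <= max_replica_count 4
```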
