This page shows you how to run a custom training job on a persistent resource by using the Google Cloud CLI, Vertex AI SDK for Python, and the REST API.
Normally, when you create a custom training job, you need to specify compute resources that the job creates and runs on. After you create a persistent resource, you can instead configure the custom training job to run on one or more resource pools of that persistent resource. Running a custom training job on a persistent resource significantly reduces the job startup time that's otherwise needed for compute resource creation.
Required roles
To get the permission that you need to run custom training jobs on a persistent resource, ask your administrator to grant you the Vertex AI User (roles/aiplatform.user) IAM role on your project. For more information about granting roles, see Manage access to projects, folders, and organizations.
This predefined role contains the aiplatform.customJobs.create permission, which is required to run custom training jobs on a persistent resource.
You might also be able to get this permission with custom roles or other predefined roles.
Create a training job that runs on a persistent resource
To create a custom training job that runs on a persistent resource, make the following modifications to the standard instructions for creating a custom training job:
gcloud
- Specify the --persistent-resource-id flag and set the value to the ID of the persistent resource (PERSISTENT_RESOURCE_ID) that you want to use.
- Specify the --worker-pool-spec flag such that the values for machine-type and disk-type match exactly with a corresponding resource pool from the persistent resource. Specify one --worker-pool-spec for single node training and multiple for distributed training.
- Specify a replica-count less than or equal to the replica-count or max-replica-count of the corresponding resource pool.
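
The flags above can be assembled into a single command. The following is a minimal sketch in Python that only builds and prints the argument list; it does not call gcloud. The region, display name, machine type, disk type, and container image URI are placeholder assumptions, and the worker pool keys are taken from the list above, so check them against the gcloud reference for your installed version.

```python
import shlex


def build_gcloud_command(region, display_name, persistent_resource_id, worker_pools):
    """Return the argv list for `gcloud ai custom-jobs create`.

    Each entry in `worker_pools` is a dict of key/value pairs that becomes
    one --worker-pool-spec flag (one entry for single node training,
    several for distributed training).
    """
    argv = [
        "gcloud", "ai", "custom-jobs", "create",
        f"--region={region}",
        f"--display-name={display_name}",
        f"--persistent-resource-id={persistent_resource_id}",
    ]
    for pool in worker_pools:
        spec = ",".join(f"{key}={value}" for key, value in pool.items())
        argv.append(f"--worker-pool-spec={spec}")
    return argv


# Placeholder values (assumptions): they must match a resource pool that
# actually exists on your persistent resource.
command = build_gcloud_command(
    region="us-central1",
    display_name="example-job",
    persistent_resource_id="PERSISTENT_RESOURCE_ID",
    worker_pools=[{
        "machine-type": "n1-standard-4",
        "disk-type": "pd-ssd",
        "replica-count": 1,
        "container-image-uri": "CUSTOM_CONTAINER_IMAGE_URI",
    }],
)
print(shlex.join(command))
```

Printing the command before running it makes it easier to compare the machine-type and disk-type values against the resource pool definition.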
Python
To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.
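
As a rough sketch of what the SDK call might look like, the following assumes a version of the Vertex AI SDK for Python in which aiplatform.CustomJob accepts a persistent_resource_id argument. The helper names, the machine type, and the replica count are illustrative assumptions; the worker pool values must match a resource pool on your persistent resource.

```python
def make_worker_pool_specs(machine_type, replica_count, container_uri):
    """Build a single worker pool spec (hypothetical helper).

    The machine type must match a resource pool of the persistent resource.
    """
    return [{
        "machine_spec": {"machine_type": machine_type},
        "replica_count": replica_count,
        "container_spec": {"image_uri": container_uri},
    }]


def run_custom_job_on_persistent_resource(
    project, location, staging_bucket, display_name,
    container_uri, persistent_resource_id,
):
    """Create a custom job and run it on an existing persistent resource."""
    # Imported here so the helper above can be used without the SDK installed.
    from google.cloud import aiplatform

    aiplatform.init(
        project=project, location=location, staging_bucket=staging_bucket
    )
    job = aiplatform.CustomJob(
        display_name=display_name,
        # Assumed values: n1-standard-4 with one replica.
        worker_pool_specs=make_worker_pool_specs(
            "n1-standard-4", 1, container_uri
        ),
        # Assumption: this argument is available in your SDK version.
        persistent_resource_id=persistent_resource_id,
    )
    job.run()
```

Calling run_custom_job_on_persistent_resource requires valid credentials and an existing persistent resource; make_worker_pool_specs can be inspected on its own.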
REST
- Specify the persistent_resource_id parameter and set the value to the ID of the persistent resource (PERSISTENT_RESOURCE_ID) that you want to use.
- Specify the worker_pool_specs parameter such that the values of machine_spec and disk_spec for each resource pool match exactly with a corresponding resource pool from the persistent resource. Specify one machine_spec for single node training and multiple for distributed training.
- Specify a replica_count less than or equal to the replica_count or max_replica_count of the corresponding resource pool, excluding the replica count of any other jobs running on that resource pool.
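
The replica count rule and the request body can be sketched as follows. This is an illustration only: the function names are hypothetical, the nesting of the fields under job_spec is an assumption about the request layout, and the pool size, disk settings, and image URI are placeholder values. The sketch builds and prints the JSON body without sending a request.

```python
import json


def check_replica_count(requested, pool_max_replicas, replicas_used_by_other_jobs):
    """Raise ValueError if the job asks for more replicas than are free.

    Free capacity is the pool's replica limit minus the replicas already
    taken by other jobs running on the same resource pool.
    """
    free = pool_max_replicas - replicas_used_by_other_jobs
    if requested > free:
        raise ValueError(
            f"requested {requested} replicas, but only {free} are available"
        )


def build_request_body(display_name, persistent_resource_id, worker_pool_specs):
    """Assemble the JSON body for a custom job that targets a persistent resource."""
    return {
        "display_name": display_name,
        "job_spec": {
            "persistent_resource_id": persistent_resource_id,
            "worker_pool_specs": worker_pool_specs,
        },
    }


# Placeholder values (assumptions): a pool limited to 4 replicas, 1 of which
# is used by another job, so a request for 2 replicas fits.
check_replica_count(requested=2, pool_max_replicas=4, replicas_used_by_other_jobs=1)

body = build_request_body(
    display_name="example-job",
    persistent_resource_id="PERSISTENT_RESOURCE_ID",
    worker_pool_specs=[{
        "machine_spec": {"machine_type": "n1-standard-4"},
        "disk_spec": {"boot_disk_type": "pd-ssd", "boot_disk_size_gb": 100},
        "replica_count": 2,
        "container_spec": {"image_uri": "CUSTOM_CONTAINER_IMAGE_URI"},
    }],
)
print(json.dumps(body, indent=2))
```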

