Schedule training jobs based on resource availability
For custom training jobs that request GPU resources, Dynamic Workload Scheduler lets you
schedule the jobs based on when the requested GPU resources become available.
This page shows you how to schedule custom training jobs by using Dynamic Workload Scheduler,
and how to customize the scheduling behavior on Vertex AI.
Recommended use cases
We recommend using Dynamic Workload Scheduler to schedule custom training jobs in the
following situations:
- The custom training job requests L4, A100, H100, H200, or B200 GPUs and you want to run the job as soon as the requested resources become available. For example, when Vertex AI allocates the GPU resources outside of peak hours.
- Your workload requires multiple nodes and can't start running until all GPU nodes are provisioned and ready at the same time. For example, you're creating a distributed training job.
Requirements
To use Dynamic Workload Scheduler, your custom training job must meet the following
requirements:
- Your custom training job requests L4, A100, H100, H200, or B200 GPUs.
- Your custom training job has a maximum `timeout` of 7 days or less.
- Your custom training job uses the same machine configuration for all worker pools.
Supported job types
All custom training job types are supported, including `CustomJob`, `HyperparameterTuningJob`, and `TrainingPipeline`.
Enable Dynamic Workload Scheduler in your custom training job
To enable Dynamic Workload Scheduler in your custom training job, set the `scheduling.strategy` API field to `FLEX_START` when you create the job.
For details on how to create a custom training job, see the following links:

- [Create a `CustomJob`](/vertex-ai/docs/training/create-custom-job)
- [Create a `HyperparameterTuningJob`](/vertex-ai/docs/training/hyperparameter-tuning-overview)
- [Create a `TrainingPipeline`](/vertex-ai/docs/training/create-training-pipeline)
Configure the duration to wait for resource availability
You can configure how long your job can wait for resources in the `scheduling.maxWaitDuration` field. A value of `0` means that the job waits indefinitely until the requested resources become available. The default value is **1 day**.
Examples
The following examples show you how to enable Dynamic Workload Scheduler for a `CustomJob`.
Select the tab for the interface that you want to use.
gcloud
When submitting a job using the Google Cloud CLI, add the `scheduling.strategy` field in the [`config.yaml`](/sdk/gcloud/reference/ai/custom-jobs/create#--config) file.
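Example YAML configuration file (the `scheduling` block enables Dynamic Workload Scheduler with a two-hour wait):

```yaml
workerPoolSpecs:
  machineSpec:
    machineType: a2-highgpu-1g
    acceleratorType: NVIDIA_TESLA_A100
    acceleratorCount: 1
  replicaCount: 1
  containerSpec:
    imageUri: gcr.io/ucaip-test/ucaip-training-test
    args:
    - port=8500
    command:
    - start
scheduling:
  strategy: FLEX_START
  maxWaitDuration: 7200s
```

You can then submit the job with `gcloud ai custom-jobs create --region=us-central1 --config=config.yaml` (the region shown is illustrative).

Python

When submitting a job using the Vertex AI SDK for Python, set the `scheduling_strategy` field in the relevant `CustomJob` creation method. The sample below includes the imports it needs; it assumes you have valid Google Cloud credentials and an existing staging bucket.

```python
from typing import Optional

from google.cloud import aiplatform
from google.cloud.aiplatform_v1.types import custom_job as gca_custom_job_compat


def create_custom_job_with_dws_sample(
    project: str,
    location: str,
    staging_bucket: str,
    display_name: str,
    script_path: str,
    container_uri: str,
    service_account: str,
    experiment: str,
    experiment_run: Optional[str] = None,
) -> None:
    aiplatform.init(
        project=project,
        location=location,
        staging_bucket=staging_bucket,
        experiment=experiment,
    )

    job = aiplatform.CustomJob.from_local_script(
        display_name=display_name,
        script_path=script_path,
        container_uri=container_uri,
        enable_autolog=True,
        machine_type="a2-highgpu-1g",
        accelerator_type="NVIDIA_TESLA_A100",
        accelerator_count=1,
    )

    job.run(
        service_account=service_account,
        experiment=experiment,
        experiment_run=experiment_run,
        # Wait up to 1800 seconds (30 minutes) for the requested GPUs.
        max_wait_duration=1800,
        # Enable Dynamic Workload Scheduler.
        scheduling_strategy=gca_custom_job_compat.Scheduling.Strategy.FLEX_START,
    )
```

REST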
When submitting a job using the Vertex AI REST API, set the fields `scheduling.strategy` and `scheduling.maxWaitDuration` when creating your custom training job.
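Example request JSON body:

```json
{
  "displayName": "MyDwsJob",
  "jobSpec": {
    "workerPoolSpecs": [
      {
        "machineSpec": {
          "machineType": "a2-highgpu-1g",
          "acceleratorType": "NVIDIA_TESLA_A100",
          "acceleratorCount": 1
        },
        "replicaCount": 1,
        "diskSpec": {
          "bootDiskType": "pd-ssd",
          "bootDiskSizeGb": 100
        },
        "containerSpec": {
          "imageUri": "python:3.10",
          "command": ["sleep"],
          "args": ["100"]
        }
      }
    ],
    "scheduling": {
      "maxWaitDuration": "1800s",
      "strategy": "FLEX_START"
    }
  }
}
```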
Quota
When you submit a job using Dynamic Workload Scheduler, instead of consuming on-demand Vertex AI quota, Vertex AI consumes *preemptible* quota. For example, for Nvidia H100 GPUs, instead of consuming
`aiplatform.googleapis.com/custom_model_training_nvidia_h100_gpus`,
Vertex AI consumes
`aiplatform.googleapis.com/custom_model_training_preemptible_nvidia_h100_gpus`.
However, *preemptible* quota is used only in name. Your resources aren't preemptible and behave like standard resources.
Before submitting a job using Dynamic Workload Scheduler, ensure that your preemptible quotas have been increased to a sufficient amount. For details on Vertex AI quotas and instructions for making quota increase requests, see [Vertex AI quotas and limits](/vertex-ai/docs/quotas).
Billing
You're charged only for the duration that the job is running and not for the
time that the job is waiting for resources to become available. For details,
see [Pricing](/vertex-ai/pricing#custom-trained_models).
What's Next
- Learn more about [configuring compute resources](/vertex-ai/docs/training/configure-compute) for custom training jobs.
- Learn more about [using distributed training](/vertex-ai/docs/training/distributed-training) for custom training jobs.
- Learn more about [other scheduling options](/vertex-ai/docs/reference/rest/v1/CustomJobSpec#scheduling).