This page describes how the state of a training cluster changes throughout the lifecycle of a training job, and how Vertex AI handles training errors. You can use this information to adapt your training code accordingly.
Lifecycle of a training job
This section explains how Vertex AI handles worker VMs through the lifecycle of a training job.
Queue a new job
When you create a CustomJob or HyperparameterTuningJob, the job might remain in the JOB_STATE_QUEUED state for some time before Vertex AI runs it. This period is usually brief, but if your Google Cloud project does not have sufficient remaining custom training quotas for your job, then Vertex AI keeps the job queued until you have sufficient quotas.
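For example, the following minimal sketch, which assumes the google-cloud-aiplatform Python SDK and uses placeholder project, region, bucket, and container image values, submits a CustomJob and polls its state while it waits in the queue:

```python
# Sketch: submit a CustomJob and watch it leave JOB_STATE_QUEUED.
# Project, region, bucket, and image values are placeholders.
import time

from google.cloud import aiplatform

aiplatform.init(
    project="my-project",             # placeholder
    location="us-central1",           # placeholder
    staging_bucket="gs://my-bucket",  # placeholder
)

job = aiplatform.CustomJob(
    display_name="example-training-job",
    worker_pool_specs=[{
        "machine_spec": {"machine_type": "n1-standard-4"},
        "replica_count": 1,
        "container_spec": {"image_uri": "us-central1-docker.pkg.dev/my-project/trainer/trainer:latest"},  # placeholder
    }],
)

job.submit()  # non-blocking; the job may sit in JOB_STATE_QUEUED for a while

while job.state.name == "JOB_STATE_QUEUED":
    print("Still queued; waiting for quota or capacity")
    time.sleep(60)

print(f"Job left the queue; current state: {job.state.name}")
```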
Start workers in parallel
When a training job starts, Vertex AI schedules as many workers as possible in a short amount of time. As a result, workers may start up in parallel instead of sequentially. In order to reduce startup latency, Vertex AI starts running your code on each worker as soon as it becomes available. When all the workers are available, Vertex AI sets the job state to JOB_STATE_RUNNING.
In most cases, your machine learning framework automatically handles the workers starting in parallel. If you're using a distribution strategy in your training code, you may need to adjust it manually to handle workers starting in parallel. Learn more about distribution strategies in TensorFlow and in PyTorch.
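For example, here is a minimal TensorFlow sketch, assuming Vertex AI has set the TF_CONFIG environment variable for a multi-worker job; the model is a throwaway placeholder:

```python
# Sketch: create the distribution strategy before building the model so that
# each worker joins the cluster as soon as its VM starts.
import json
import os

import tensorflow as tf

# MultiWorkerMirroredStrategy reads TF_CONFIG and coordinates with the other
# workers, so a worker that starts early waits for its peers instead of failing.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
task = tf_config.get("task", {})
print(f"Running as {task.get('type', 'chief')} #{task.get('index', 0)}")

with strategy.scope():
    # Placeholder model; build your real model inside the strategy scope.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(10,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
```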
Restart workers during the training job
During a training job, Vertex AI can restart your workers from any worker pool with the same hostname. This can occur for the following reasons:
- VM maintenance: When the VM running a worker is subjected to VM maintenance, Vertex AI restarts the worker on another VM. Learn more about live migration for VM maintenance.
- Non-zero exits: If any worker exits with a non-zero exit code, Vertex AI restarts that worker immediately in the same VM.
  - If a worker fails due to a common error, it is treated as a permanent error, and Vertex AI shuts down the entire job. If any containers restart before Vertex AI shuts down the entire job, these containers may produce logs in Cloud Logging.
  - If a worker fails due to a non-permanent error (any error not listed in the common errors), Vertex AI allows the restarted worker to continue running, with up to five restarts per worker. After five restarts, if a worker fails again, Vertex AI retries the entire job up to three times before failing the entire job.
To handle worker restarts in your training code, save checkpoints regularly during training so that you can restore from checkpoints when a worker restarts. If you expect training to take more than four hours, we recommend that you save a checkpoint at least once every four hours. Learn how to use training checkpoints in TensorFlow and in PyTorch.
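For example, here is a minimal PyTorch sketch; it assumes the job sets a base output directory (so Vertex AI populates the AIP_CHECKPOINT_DIR environment variable) and that the Cloud Storage FUSE mount under /gcs/ is available, so adapt the path handling to your own setup:

```python
# Sketch: periodic checkpointing so a restarted worker can resume training.
import os

import torch

# AIP_CHECKPOINT_DIR is a gs:// URI; rewrite it to the /gcs/ FUSE mount so
# plain file I/O works. Both assumptions are described in the lead-in above.
checkpoint_dir = os.environ.get("AIP_CHECKPOINT_DIR", "/tmp/checkpoints")
checkpoint_dir = checkpoint_dir.replace("gs://", "/gcs/")
os.makedirs(checkpoint_dir, exist_ok=True)
checkpoint_path = os.path.join(checkpoint_dir, "latest.pt")

model = torch.nn.Linear(10, 1)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
start_epoch = 0

# Resume from the latest checkpoint if a previous attempt saved one.
if os.path.exists(checkpoint_path):
    state = torch.load(checkpoint_path)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_epoch = state["epoch"] + 1

for epoch in range(start_epoch, 100):
    # ... run one epoch of training here ...
    torch.save(
        {"model": model.state_dict(),
         "optimizer": optimizer.state_dict(),
         "epoch": epoch},
        checkpoint_path,
    )
```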
Successfully completing a job
A training job completes successfully when its primary replica exits with exit code 0. At that point, Vertex AI shuts down all the other running workers.
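For example, here is a minimal sketch of a Python entry point whose exit code reflects the training outcome (the main function is a placeholder):

```python
# Sketch: exit with 0 on success so Vertex AI marks the job as succeeded,
# and with a non-zero code on failure so the attempt is treated as failed.
import sys
import traceback


def main():
    # ... training logic goes here ...
    pass


if __name__ == "__main__":
    try:
        main()
    except Exception:
        traceback.print_exc()
        sys.exit(1)  # non-zero exit code signals failure
    sys.exit(0)      # exit code 0 signals success
```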
How Vertex AI handles training job errors
This section explains how Vertex AI handles common training job errors and internal errors.
About one minute after a job ends, Vertex AI sets the error code on the training job object, based on the exit code.
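For example, the following minimal sketch, which assumes the google-cloud-aiplatform SDK and uses a placeholder region and job resource name, reads the final state and error of a custom job:

```python
# Sketch: inspect the state and error code of a finished custom job.
from google.cloud import aiplatform

client = aiplatform.gapic.JobServiceClient(
    client_options={"api_endpoint": "us-central1-aiplatform.googleapis.com"}  # placeholder region
)
job = client.get_custom_job(
    name="projects/my-project/locations/us-central1/customJobs/1234567890"  # placeholder
)

print(job.state)         # for example, JobState.JOB_STATE_FAILED
if job.error.code != 0:  # google.rpc.Status; code 0 means no error recorded
    print(job.error.code, job.error.message)
```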
Handle common errors
Vertex AI shuts down all workers if it encounters any of the following issues:
- SIGABRT
  - ExitCode 6
  - ExitCode 134 (custom containers)
- SIGSEGV
  - ExitCode 11
  - ExitCode 139 (custom containers)

On VMs with less memory (for example, n1-standard-4), Vertex AI system agents can take up to 40% of total memory. For larger VMs, the overhead is relatively small. Compare allocatable memory for n1-standard machine types.

For jobs running on A2 and A3 VMs, Dynamic Workload Scheduler lets you schedule jobs that run when the requested GPU resources become available, rather than failing with a stockout error. For more information, see Schedule training jobs based on resource availability.
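The custom-container exit codes above follow the common convention of reporting a death by signal as 128 plus the signal number. The following sketch, which aborts a throwaway child process, illustrates the mapping:

```python
# Sketch: SIGABRT is signal 6 and SIGSEGV is signal 11; shells and container
# runtimes commonly report a signal death as 128 + the signal number, which
# is where exit codes 134 and 139 come from.
import signal
import subprocess
import sys

# Kill a throwaway child Python process with SIGABRT.
child = subprocess.run(
    [sys.executable, "-c", "import os, signal; os.kill(os.getpid(), signal.SIGABRT)"]
)

# subprocess reports a signal death as a negative return code.
print(child.returncode)      # -6 on POSIX systems
print(128 + signal.SIGABRT)  # 134
print(128 + signal.SIGSEGV)  # 139
```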
Handle internal errors
If Vertex AI has an internal error, it attempts to restart the job twice (three attempts in total). If the restart attempts also fail, Vertex AI returns an internal error with the message: Internal error occurred for the current attempt.