Optimize GKE AI/ML workload prioritization

This document describes tools and best practices for maximizing resource utilization and minimizing downtime of heterogeneous AI/ML workloads in Google Kubernetes Engine (GKE), especially when no capacity is available from reservations or on-demand resources. Heterogeneous workloads are different types of AI/ML workloads that run simultaneously in the same GKE cluster. For example, you might run a latency-sensitive online inference service alongside a series of interruptible batch training jobs.

This guide provides recommendations for Platform admins and operators, and for Data and AI specialists.

Benefits of AI/ML workload prioritization

Heterogeneous workloads have different priorities and share limited capacity and resources. The best practices on this page describe how to configure GKE and open source tools to achieve the following benefits:

  • Minimize downtime for high-priority workloads.
  • Quickly execute high-priority workloads.
  • Optimize resource consumption.

Background

GKE supports the following open source tools for optimizing resource utilization.

  • Kueue: a Kubernetes-native workload queueing system designed for batch, AI, and high-performance computing workloads. Kueue can be extended to manage other workload types, such as those defined by Custom Resource Definitions like leaderworkerset. Kueue manages quotas and how workloads consume them in a Kubernetes cluster. Kueue decides when a workload waits, when a workload starts (for example, by creating its Pods), and when a Pod that belongs to a workload gets preempted.

    For more information about Kueue, see the Kueue concepts documentation.

  • Hotswap: a technique that reduces mean time to recovery (MTTR). Hotswap enables preemption based on workload priority when cluster resources are fully utilized and no additional capacity is available, either from on-demand instances or existing reservations.

    • When a node that hosts a workload becomes unhealthy, the workload is rescheduled on eligible spare nodes. If no spare nodes are available, Hotswap can preempt a lower-priority workload to make room for the workload being recovered.
    • If you configure your Pods with a PriorityClass, a workload with higher priority evicts a running low-priority workload to acquire its resources. This eviction process is known as preemption, as illustrated in the sketch after this list.
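
For illustration, the following minimal sketch shows this mechanism; the critical-inference class name, the Pod, and its image are hypothetical:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-inference  # hypothetical name for illustration
value: 2000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority  # the default; lets Pods in this class evict lower-priority Pods
description: "Hypothetical class for latency-sensitive inference Pods."
---
apiVersion: v1
kind: Pod
metadata:
  name: inference-pod  # hypothetical Pod
spec:
  priorityClassName: critical-inference  # attaches the priority to this Pod
  containers:
  - name: server
    image: us-docker.pkg.dev/example/inference:latest  # placeholder image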

Use cases

The following use cases describe the recommended best practice for each scenario:

Use case: Multiple workloads with different priorities

Best practice: Use Kueue to define queues and assign priorities to workloads based on their importance. Kueue can manage quota so that specific teams or projects have access to a set amount of resources.

Kueue lets you apply the following configurations:

  • Prioritize high-priority Jobs by assigning a higher Kueue WorkloadPriority to them, as shown in the sketch after these use cases.
  • Enable Kueue's fair-share queueing so that all workloads eventually receive resources, even low-priority ones.

To test the best practice configuration, see the Kueue example in this document.

Use case: You need to reduce the current MTTR

Best practice: Use Hotswap to reschedule workloads on healthy resources when an interruption occurs, and to preempt low-priority workloads in favor of high-priority workloads.

Hotswap lets you apply the following configurations:

  • Configure PriorityClasses to define priority levels for your workloads.
  • Assign higher PriorityClasses to critical workloads.
  • Automatically reschedule workloads on healthy nodes when interruptions occur.

To test the best practice configuration, see the Hotswap example in this document.

Use case: Multiple AI workloads competing for limited resources

Best practice: Combine Kueue and Hotswap. This combination provides a robust system that prioritizes critical workloads both during initial scheduling and during runtime.

Kueue and Hotswap let you apply the following configurations:

  • Use Kueue to manage the initial scheduling and admission of workloads based on priority.
  • Use Hotswap to handle workload interruptions and enable rapid recovery. Hotswap helps to reduce the time to recovery of a high-priority workload when an interruption occurs.

To test the best practice configuration, see the Kueue and Hotswap example in this document.
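
A Kueue WorkloadPriority is defined with a WorkloadPriorityClass object. The following minimal sketch uses the high-priority-job name that the combined example in this document references through the kueue.x-k8s.io/priority-class label:

apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: high-priority-job
value: 2000000  # Kueue orders queued workloads and chooses preemption targets based on this value
description: "Workload priority for critical AI/ML jobs."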

Examples of best practice implementations

The following examples demonstrate how to implement Kueue and Hotswap, and how to combine them for the best practices described in the preceding section.

Kueue

The following example manifest shows a Kueue configuration:

   
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: tpu-v6e-slice
spec:
  nodeLabels:
    cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: tpu-training-cq
spec:
  queueingStrategy: BestEffortFIFO
  resourceGroups:
  - coveredResources: ["google.com/tpu"]
    flavors:
    - name: tpu-v6e-slice
      resources:
      - name: google.com/tpu
        nominalQuota: 32
  preemption:
    reclaimWithinCohort: Never
    withinClusterQueue: LowerPriority
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: default-queue
  namespace: default
spec:
  clusterQueue: tpu-training-cq

This manifest does the following:

  • Defines a ResourceFlavor named tpu-v6e-slice that specifies the node labels for TPU v6e slices.
  • Defines a ClusterQueue named tpu-training-cq that manages the quota for TPU resources.
  • Defines a LocalQueue named default-queue that allows workloads in the default namespace to use the tpu-training-cq cluster queue.
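
To submit a workload through these queues, add the kueue.x-k8s.io/queue-name label pointing at the LocalQueue. The following is a minimal sketch, assuming a hypothetical Job name and a placeholder image:

apiVersion: batch/v1
kind: Job
metadata:
  name: sample-tpu-training  # hypothetical name
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: default-queue  # targets the LocalQueue defined above
spec:
  suspend: true  # Kueue unsuspends the Job when it admits the workload
  parallelism: 4
  completions: 4
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
        cloud.google.com/gke-tpu-topology: 4x4
      containers:
      - name: trainer
        image: us-docker.pkg.dev/example/trainer:latest  # placeholder image
        resources:
          limits:
            google.com/tpu: 4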

Hotswap

The following example shows a Hotswap configuration that defines two PriorityClasses, low-priority-job and high-priority-job. The configuration also creates a high-priority JobSet workload that uses MaxText.

   
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority-job
value: 1000000
globalDefault: false
description: "This priority class should be used for low priority pods only."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-job
value: 2000000
globalDefault: false
description: "This priority class should be used for critical pods only."
---
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: high-jax-trillium
  annotations:
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  failurePolicy:
    maxRestarts: 10
    restartStrategy: BlockingRecreate
  replicatedJobs:
  - name: slice
    replicas: 2
    template:
      spec:
        backoffLimit: 0
        completions: 4
        parallelism: 4
        template:
          spec:
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
              cloud.google.com/gke-tpu-topology: 4x4
            hostNetwork: true
            dnsPolicy: ClusterFirstWithHostNet
            priorityClassName: high-priority-job
            containers:
            - name: jax-program
              image: <IMAGE LOCATION>
              command:
              - python3
              - MaxText/train.py
              - MaxText/configs/base.yml
              - model_name=llama2-7b
              - run_name=<UNIQUE RUN NAME>
              - steps=300
              - base_output_directory=gs://<OUTPUT BUCKET>
              - dataset_path=gs://max-datasets-rogue
              - max_target_length=4096
              - dataset_type=synthetic
              - enable_checkpointing=False
              resources:
                limits:
                  google.com/tpu: 4
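
The preceding manifest only defines the high-priority workload. The low-priority workload that Hotswap preempts would be a similar JobSet running at low-priority-job priority; the following is a minimal sketch, modeled on the low-jax-trillium JobSet from the combined example later in this document (without the Kueue annotations and labels):

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: low-jax-trillium  # hypothetical counterpart to high-jax-trillium
  annotations:
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  failurePolicy:
    maxRestarts: 10
    restartStrategy: BlockingRecreate
  replicatedJobs:
  - name: slice
    replicas: 2
    template:
      spec:
        backoffLimit: 0
        completions: 4
        parallelism: 4
        template:
          spec:
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
              cloud.google.com/gke-tpu-topology: 4x4
            hostNetwork: true
            dnsPolicy: ClusterFirstWithHostNet
            priorityClassName: low-priority-job  # the substantive difference from the high-priority JobSet
            containers:
            - name: jax-program
              image: <IMAGE LOCATION>  # placeholder, as in the example above
              command:
              - python3
              - MaxText/train.py
              - MaxText/configs/base.yml
              - model_name=llama2-7b
              - run_name=low-priority-run
              - steps=30000
              - base_output_directory=gs://<OUTPUT BUCKET>
              - dataset_path=gs://max-datasets-rogue
              - max_target_length=4096
              - dataset_type=synthetic
              - enable_checkpointing=False
              resources:
                limits:
                  google.com/tpu: 4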
 

Based on this configuration, Hotswap performs the following actions:

  • If an infrastructure failure interrupts the high-priority workload, the JobSet restarts it. Instead of waiting for the failed infrastructure to recover, Hotswap preempts the low-priority workload and reschedules the high-priority workload in its place. The low-priority workload remains in a failed status. This process significantly reduces workload idle time.
  • When the infrastructure recovers, Hotswap reschedules the low-priority workload in the node pool that recovered.

Kueue and Hotswap

Combine Kueue and Hotswap when you operate in a complex environment with limited resources. This combination provides a robust system that prioritizes critical workloads during initial scheduling and during runtime.

The following example shows a combined Kueue and Hotswap configuration. This example uses MaxText:

   
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority-job
value: 1000000
globalDefault: false
description: "This priority class should be used for low priority pods only."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-job
value: 2000000
globalDefault: false
description: "This priority class should be used for critical pods only."
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: tpu-v6e-slice
spec:
  nodeLabels:
    cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: tpu-training-cq
spec:
  queueingStrategy: BestEffortFIFO
  resourceGroups:
  - coveredResources: ["google.com/tpu"]
    flavors:
    - name: tpu-v6e-slice
      resources:
      - name: google.com/tpu
        nominalQuota: 32
  preemption:
    reclaimWithinCohort: Never
    withinClusterQueue: LowerPriority
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: default-queue
  namespace: default
spec:
  clusterQueue: tpu-training-cq
---
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: low-jax-trillium
  annotations:
    kueue.x-k8s.io/queue-name: default-queue
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  failurePolicy:
    maxRestarts: 10
    restartStrategy: BlockingRecreate
  replicatedJobs:
  - name: slice
    replicas: 2
    template:
      spec:
        backoffLimit: 0
        completions: 4
        parallelism: 4
        template:
          metadata:
            labels:
              kueue.x-k8s.io/managed-by: kueue
              kueue.x-k8s.io/priority-class: low-priority-job
          spec:
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
              cloud.google.com/gke-tpu-topology: 4x4
            hostNetwork: true
            dnsPolicy: ClusterFirstWithHostNet
            priorityClassName: low-priority-job
            containers:
            - name: jax-program
              image: <IMAGE LOCATION>
              command:
              - python3
              - MaxText/train.py
              - MaxText/configs/base.yml
              - model_name=llama2-7b
              - run_name=low-priority-run
              - steps=30000
              - base_output_directory=gs://<OUTPUT BUCKET>
              - dataset_path=gs://max-datasets-rogue
              - max_target_length=4096
              - dataset_type=synthetic
              - enable_checkpointing=False
              resources:
                limits:
                  google.com/tpu: 4
---
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: high-jax-trillium
  annotations:
    kueue.x-k8s.io/queue-name: default-queue
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  failurePolicy:
    maxRestarts: 10
    restartStrategy: BlockingRecreate
  replicatedJobs:
  - name: slice
    replicas: 2
    template:
      spec:
        backoffLimit: 0
        completions: 4
        parallelism: 4
        template:
          metadata:
            labels:
              kueue.x-k8s.io/managed-by: kueue
              kueue.x-k8s.io/priority-class: high-priority-job
          spec:
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
              cloud.google.com/gke-tpu-topology: 4x4
            hostNetwork: true
            dnsPolicy: ClusterFirstWithHostNet
            priorityClassName: high-priority-job
            containers:
            - name: jax-program
              image: <IMAGE LOCATION>
              command:
              - python3
              - MaxText/train.py
              - MaxText/configs/base.yml
              - model_name=llama2-7b
              - run_name=high-priority-run
              - steps=300
              - base_output_directory=gs://<OUTPUT BUCKET>
              - dataset_path=gs://max-datasets-rogue
              - max_target_length=4096
              - dataset_type=synthetic
              - enable_checkpointing=False
              resources:
                limits:
                  google.com/tpu: 4
 

Based on this configuration, Kueue and Hotswap perform the following actions:

  • Kueue manages the admission of both low-jax-trillium and high-jax-trillium JobSets into the cluster queue based on their defined priorities and available resources.
  • If the high-jax-trillium JobSet is interrupted by an infrastructure failure, Hotswap preempts the low-jax-trillium JobSet to reschedule the high-priority JobSet.
  • Hotswap ensures the high-priority JobSet restarts quickly, minimizing its idle time.
  • When the infrastructure recovers, Hotswap reschedules the low-priority JobSet in the recovered node pool.
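
To observe this behavior, one approach is to inspect the Kueue objects and Pods directly. The following commands are a sketch, using the names from the example above:

# List the Kueue Workload objects and their admission status
kubectl get workloads -n default

# Inspect how the ClusterQueue admits and queues the two JobSets
kubectl describe clusterqueue tpu-training-cq

# Watch the Pods of the high-priority JobSet change state during preemption and recovery
kubectl get pods -l jobset.sigs.k8s.io/jobset-name=high-jax-trillium -w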
