This page describes GPUs in Google Kubernetes Engine (GKE) to help you to select
the optimal GPU configuration for your workloads. If you want to deploy GPU
workloads that use Slurm, see Create an AI-optimized Slurm cluster instead.
You can use GPUs to accelerate resource-intensive tasks, such as machine
learning and data processing. The information on this page can help you to do
the following:
Ensure GPU availability when needed.
Decide whether to use GPUs in GKE Autopilot mode or
GKE Standard mode clusters.
Choose GPU-related features to efficiently use your GPU capacity.
Monitor GPU node metrics.
Improve GPU workload reliability by handling disruptions more effectively.
In GKE, the way you request GPU hardware depends on whether you
are using Autopilot or Standard mode. In Autopilot,
you request GPU hardware by specifying GPU resources in your workloads. In
GKE Standard mode, you can attach GPU hardware to nodes
in your clusters, and then allocate GPU resources to containerized workloads
running on those nodes. For detailed instructions on how to attach and use GPUs
in your workloads, refer to Deploy GPU workloads on Autopilot or Run GPUs on Standard node pools.
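For example, in Autopilot mode the GPU request is part of the Pod specification itself. The following manifest is a minimal sketch of that pattern; the Pod name, container image, and GPU type (nvidia-l4) are illustrative placeholders rather than values taken from this page:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example            # hypothetical name
spec:
  nodeSelector:
    # GPU type to request; nvidia-l4 is only an example value.
    cloud.google.com/gke-accelerator: nvidia-l4
  containers:
  - name: cuda-container
    # Hypothetical image; use any image that contains the CUDA tooling you need.
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1      # number of GPUs allocated to this container
```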
GKE offers some GPU-specific features to improve the efficiency of GPU resource utilization for workloads running on your nodes, including time-sharing, multi-instance GPUs, and multi-instance GPUs with NVIDIA MPS.
This page also helps you to consider choices for requesting GPUs in GKE, such as which GPU models are available, how much GPU quota your project needs, and whether to use Autopilot or Standard mode.
The GPU hardware that's available for use in GKE is a subset of the GPU models available on Compute Engine. The specific hardware that's available depends on the Compute Engine region or zone of your cluster. For more information about specific availability, see GPU regions and zones.
Your GPU quota is the maximum number of GPUs that can run in your
Google Cloud project. To use GPUs in your GKE clusters,
your project must have enough GPU quota. Check the Quotas page to ensure that you have enough GPUs available in your project.
Your GPU quota should be at least equal to the total number of GPUs you intend
to run in your cluster. If you enable cluster autoscaling, you
should request GPU quota at least equivalent to your cluster's maximum number of
nodes multiplied by the number of GPUs per node.
For example, if you expect to run three nodes with two GPUs each, your project needs a GPU quota of at least six.
To request additional GPU quota, follow the instructions to request a quota adjustment, using gpus as the metric.
Choose GPU support using Autopilot or Standard
GPUs are available in Autopilot and Standard
clusters.
Best practice:
Use Autopilot clusters for a fully managed Kubernetes
experience. In Autopilot, GKE manages driver installation, node
scaling, Pod isolation, and node provisioning.
The following comparison provides an overview of the differences between Autopilot and Standard GPU support:

Requesting GPU hardware
Autopilot: Specify GPU resources in your workloads.
Standard: Attach GPU hardware to nodes in your clusters, and then allocate GPU resources to containerized workloads running on those nodes.

GPU hardware availability
Autopilot: NVIDIA GB200, NVIDIA B200, NVIDIA H200 141GB, NVIDIA H100 80GB, NVIDIA A100 80GB, NVIDIA A100 40GB, NVIDIA RTX PRO 6000, NVIDIA L4, and NVIDIA T4.
Standard: All GPU types that are supported by Compute Engine.

Selecting a GPU
Autopilot: You request a GPU quantity and type in your workload specification. By default, Autopilot installs the default driver for that GKE version and manages your nodes. To select a specific driver version in Autopilot, see NVIDIA drivers selection for Autopilot GPU Pods.
Standard: You choose a GPU quantity and type when you create the node pool, and you must install the NVIDIA drivers on the nodes.
Manage the GPU stack through GKE or the NVIDIA GPU Operator on GKE
By default, GKE manages the entire lifecycle of the GPU nodes, including
automatic GPU driver installation, monitoring GPU workloads on GKE with NVIDIA Data Center GPU Manager (DCGM),
and GPU sharing strategies.
Best practice:
Use GKE to manage your GPU nodes, since
GKE fully manages the GPU node lifecycle.
To get started with GKE-managed GPU nodes, see Deploy GPU workloads on Autopilot or Run GPUs on Standard node pools.
You can use the NVIDIA GPU Operator as an alternative to fully managed GPU support on GKE on both Container-Optimized OS (COS) and Ubuntu node images. Select this option if you want a consistent experience across multiple cloud service providers, you already use the NVIDIA GPU Operator, or you use software that depends on the NVIDIA GPU Operator. To learn more, see Manage the GPU stack with the NVIDIA GPU Operator.
To select the best option for your use case, refer to the following comparison of the two methods of managing GPU nodes on GKE.

GPU node lifecycle management (installation, upgrade)
Use GKE to manage GPU nodes: Fully managed by GKE.
Use the NVIDIA GPU Operator on GKE: Managed by you.

GPU metrics in Cloud Monitoring
Use GKE to manage GPU nodes: With system metrics enabled, the following GPU metrics are available in Cloud Monitoring: duty cycle, memory usage, and memory capacity.
Use the NVIDIA GPU Operator on GKE: Self-managed DCGM provided by the GPU Operator. Even when GKE GPU system metrics are enabled, GPU-related system metrics are not collected, including duty cycle, memory usage, and memory capacity.
Optimize resource usage using GPU features in GKE
By default, Kubernetes supports assigning GPUs only as whole units to containers, but GKE provides additional features that you can use to optimize the resource usage of your GPU workloads.
The following features are available in GKE to reduce the amount
of underutilized GPU resources:
Multi-instance GPUs: Split a single GPU into up to seven hardware-separated instances that can be assigned as individual GPUs to containers on a node. Each assigned container gets the resources available to that instance.
GPU time-sharing: Present a single GPU as multiple units to multiple containers on a node. The GPU driver context-switches and allocates the full GPU resources to each assigned container as needed over time. For an illustration of how a workload requests a shared GPU, see the sketch after this list.
NVIDIA MPS: Share a single physical NVIDIA GPU across multiple containers. NVIDIA MPS is an alternative, binary-compatible implementation of the CUDA API designed to transparently enable co-operative multi-process CUDA applications to run concurrently on a single GPU device.
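As a sketch of how these sharing strategies surface to workloads, the following Pod requests one unit of a time-shared GPU. It assumes you already created a Standard node pool with GPU time-sharing enabled; the node selector labels, GPU type, and image are assumptions about that configuration, not values defined on this page:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: timeshared-gpu-example   # hypothetical name
spec:
  nodeSelector:
    # Assumes a node pool created with time-sharing enabled and up to 4 shared clients per GPU.
    cloud.google.com/gke-accelerator: nvidia-tesla-t4
    cloud.google.com/gke-gpu-sharing-strategy: time-sharing
    cloud.google.com/gke-max-shared-clients-per-gpu: "4"
  containers:
  - name: cuda-app
    image: nvidia/cuda:12.2.0-base-ubuntu22.04   # hypothetical image
    command: ["sleep", "infinity"]
    resources:
      limits:
        # With time-sharing, this requests one shared unit of the physical GPU.
        nvidia.com/gpu: 1
```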
Access the NVIDIA CUDA-X libraries for CUDA applications
CUDA is NVIDIA's parallel computing platform and programming model for GPUs. To use CUDA applications, the image that you use must have the libraries. To add the NVIDIA CUDA-X libraries, you can build and use your own image by including the following values in the LD_LIBRARY_PATH environment variable in your container specification:
/usr/local/nvidia/lib64: the location of the NVIDIA device drivers.
/usr/local/cuda-CUDA_VERSION/lib64: the location of the NVIDIA CUDA-X libraries on the node.
Replace CUDA_VERSION with the CUDA-X image version that you used. Some versions also contain debug utilities in /usr/local/nvidia/bin. For details, see the NVIDIA CUDA image on DockerHub.
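A minimal container specification that sets LD_LIBRARY_PATH as described above might look like the following sketch. The Pod name, image, and CUDA version 12.2 are placeholders that you would replace with your own values:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-x-example   # hypothetical name
spec:
  containers:
  - name: cuda-app
    # Hypothetical image built on an NVIDIA CUDA base image.
    image: us-docker.pkg.dev/my-project/my-repo/cuda-app:latest
    env:
    - name: LD_LIBRARY_PATH
      # NVIDIA device drivers first, then the CUDA-X libraries; replace 12.2 with your CUDA-X version.
      value: /usr/local/nvidia/lib64:/usr/local/cuda-12.2/lib64
    resources:
      limits:
        nvidia.com/gpu: 1
```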
If your GKE cluster has system metrics enabled, then the following metrics are available in Cloud Monitoring to monitor your GPU workload performance:
Duty Cycle (container/accelerator/duty_cycle): Percentage of time over the past sample period (10 seconds) during which the accelerator was actively processing. Between 1 and 100.
Memory Usage (container/accelerator/memory_used): Amount of accelerator memory allocated in bytes.
Memory Capacity (container/accelerator/memory_total): Total accelerator memory in bytes.
These metrics apply at the container level (container/accelerator) and are not
collected for containers scheduled on a GPU that uses GPU time-sharing or NVIDIA MPS.
You can use predefined dashboards to monitor your clusters with GPU nodes.
For more information, see View observability metrics. For general information about monitoring your clusters and their resources, refer to Observability for GKE.
View usage metrics for workloads
You can view your workload GPU usage metrics from the Workloads dashboard in the Google Cloud console.
To view your workload GPU usage, perform the following steps:
Go to the Workloads page in the Google Cloud console.
The Workloads dashboard displays charts for GPU memory usage and capacity,
and GPU duty cycle.
View NVIDIA Data Center GPU Manager (DCGM) metrics
You can collect and visualize NVIDIA DCGM metrics by using Google Cloud Managed Service for Prometheus.
For
Autopilot clusters, GKE installs the drivers. For Standard clusters, you must install the NVIDIA drivers.
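If you run your own DCGM exporter and collect its metrics with Managed Service for Prometheus, a PodMonitoring resource along the lines of the following sketch tells the managed collector what to scrape. The namespace, label selector, and port name are assumptions about how your exporter is deployed, not values from this page:

```yaml
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: dcgm-exporter          # hypothetical name
  namespace: gpu-monitoring    # hypothetical namespace
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter   # assumed label on your DCGM exporter Pods
  endpoints:
  - port: metrics              # assumed name of the exporter's metrics port
    interval: 30s
```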
The GKE nodes that host the GPUs are subject to maintenance
events or other disruptions that might cause node shutdown. In GKE clusters with the control
plane running version 1.29.1-gke.1425000 and later, you can reduce
disruption to workloads by configuring GKE to terminate your workloads gracefully.
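One way to soften such disruptions is to give GPU Pods enough time to checkpoint and exit after they receive a termination notice. The following sketch relies only on the standard Kubernetes terminationGracePeriodSeconds field; the Pod name, image, and grace period value are illustrative assumptions:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-example   # hypothetical name
spec:
  # Time the workload gets to checkpoint and shut down after it receives SIGTERM.
  terminationGracePeriodSeconds: 600
  containers:
  - name: trainer
    image: us-docker.pkg.dev/my-project/my-repo/trainer:latest   # hypothetical image
    resources:
      limits:
        nvidia.com/gpu: 1
```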