This page describes GPUs in Google Kubernetes Engine (GKE) to help you to select the optimal GPU configuration for your workloads. If you want to deploy GPU workloads that use Slurm, see Create an AI-optimized Slurm cluster instead.
You can use GPUs to accelerate resource-intensive tasks, such as machine learning and data processing. The information on this page can help you to do the following:
- Ensure GPU availability when needed.
- Decide whether to use GPUs in GKE Autopilot mode or GKE Standard mode clusters.
- Choose GPU-related features to efficiently use your GPU capacity.
- Monitor GPU node metrics.
- Improve GPU workload reliability by handling disruptions more effectively.
This page is intended for Platform admins and operators and Machine learning (ML) engineers who want to ensure that accelerator infrastructure is optimized for their workloads.
Before reading this page, ensure that you're familiar with the following:
GPU selection in GKE
In GKE, the way that you request GPU hardware depends on whether you're using Autopilot or Standard mode. In Autopilot, you request GPU hardware by specifying GPU resources in your workloads. In Standard mode, you attach GPU hardware to nodes in your clusters, and then allocate GPU resources to containerized workloads running on those nodes. For detailed instructions on how to attach and use GPUs in your workloads, refer to Deploy GPU workloads on Autopilot or Run GPUs on Standard node pools.
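For example, an Autopilot workload requests GPU hardware directly in its Pod specification, using a node selector for the accelerator type and a GPU resource limit. This is a minimal sketch: the Pod name, container image, and the `nvidia-tesla-t4` accelerator type are illustrative choices, not requirements.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example  # illustrative name
spec:
  nodeSelector:
    # Ask GKE for nodes with the chosen accelerator (T4 used here as an example).
    cloud.google.com/gke-accelerator: nvidia-tesla-t4
  containers:
  - name: cuda-container
    image: nvidia/cuda:12.2.0-base-ubuntu22.04  # illustrative CUDA base image
    command: ["sleep", "infinity"]
    resources:
      limits:
        # Request one GPU; GKE provisions and schedules onto a matching GPU node.
        nvidia.com/gpu: 1
```

In Standard mode, a similar Pod spec allocates GPUs from node pools that you created with attached GPU hardware.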
GKE offers GPU-specific features that improve how efficiently workloads running on your nodes use GPU resources, including time-sharing, multi-instance GPUs, and multi-instance GPUs with NVIDIA MPS.
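As an illustration of one such feature, a Standard node pool can be created with GPU time-sharing enabled so that multiple containers share a single physical GPU. This is a sketch, not a definitive command: the cluster name, pool name, machine type, and client count are placeholder values you would replace for your environment.

```
# Sketch: create a node pool whose T4 GPUs are time-shared by up to 4 clients.
gcloud container node-pools create example-gpu-pool \
    --cluster=example-cluster \
    --machine-type=n1-standard-4 \
    --accelerator=type=nvidia-tesla-t4,count=1,gpu-sharing-strategy=time-sharing,max-shared-clients-per-gpu=4
```

With time-sharing enabled, Pods that each request one shared GPU can be scheduled onto the same physical device, which helps reduce underutilization for bursty or low-intensity GPU workloads.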
This page helps you to consider choices for requesting GPUs in GKE, including the following:
- Choosing your GPU quota, the maximum number of GPUs that can run in your project
- Deciding between Autopilot and Standard modes
- Managing the GPU stack through GKE or the NVIDIA GPU Operator on GKE
- Choosing features to reduce the amount of underutilized GPU resources
- Accessing NVIDIA CUDA-X libraries for CUDA applications
- Monitoring GPU node metrics
- Handling disruption due to node maintenance
- Using GKE Sandbox to secure GPU workloads
Available GPU models
The GPU hardware that's available for use in GKE is a subset of the GPU models available on Compute Engine. The specific hardware that's available depends on the Compute Engine region or zone of your cluster. For more information about specific availability, see GPU regions and zones.
For information about GPU pricing, see the Google Cloud SKUs and the GPU pricing page.
Plan GPU quota
Your GPU quota is the maximum number of GPUs that can run in your Google Cloud project. To use GPUs in your GKE clusters, your project must have enough GPU quota. Check the Quotas page to ensure that you have enough GPUs available in your project.
Your GPU quota should be at least equal to the total number of GPUs you intend to run in your cluster. If you enable cluster autoscaling , you should request GPU quota at least equivalent to your cluster's maximum number of nodes multiplied by the number of GPUs per node.
For example, if you expect to use three nodes with two GPUs each, then your project requires a GPU quota of six.
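The calculation above can be sketched as a quick shell computation, using the example values from this page (three nodes, two GPUs per node):

```shell
# Required GPU quota = maximum number of nodes x GPUs per node.
max_nodes=3
gpus_per_node=2
required_quota=$(( max_nodes * gpus_per_node ))
echo "$required_quota"   # prints 6
```

With cluster autoscaling enabled, substitute the node pool's configured maximum node count for `max_nodes`, since the autoscaler can scale up to that limit.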
To request additional GPU quota, follow the instructions to request a quota adjustment, using gpus as the metric.
Choose GPU support using Autopilot or Standard
GPUs are available in Autopilot and Standard clusters.
Use Autopilot clusters for a fully managed Kubernetes experience. In Autopilot, GKE manages driver installation, node scaling, Pod isolation, and node provisioning.