GPU support for services

This page describes GPU configuration for your Cloud Run services. GPUs work well for AI inference workloads, such as large language models (LLMs), and for compute-intensive non-AI use cases such as video transcoding and 3D rendering. Google provides NVIDIA L4 GPUs with 24 GB of GPU memory (VRAM), which is separate from the instance memory.

GPU on Cloud Run is fully managed, with no extra drivers or libraries needed. The GPU feature offers on-demand availability with no reservations needed, similar to the way on-demand CPU and on-demand memory work in Cloud Run. Instances of a Cloud Run service that has been configured to use GPU can scale down to zero for cost savings when not in use.

Cloud Run instances with an attached L4 GPU with drivers pre-installed start in approximately 5 seconds, at which point the processes running in your container can start to use the GPU.

You can configure one GPU per Cloud Run instance. If you use sidecar containers, note that the GPU can only be attached to one container.
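
To confirm that the GPU is visible from inside a running instance, you can run a quick diagnostic at startup. The following is a minimal sketch, assuming your container image bundles the nvidia-smi utility (Cloud Run mounts the driver libraries, but diagnostic tooling comes from your own image):

    # Minimal sanity check; assumes the image bundles nvidia-smi.
    if command -v nvidia-smi >/dev/null 2>&1; then
      nvidia-smi --query-gpu=name,memory.total --format=csv
    else
      echo "nvidia-smi not found in this image"
    fi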

Supported regions

  • asia-southeast1 (Singapore)
  • asia-south1 (Mumbai). This region is available by invitation only. Contact your Google Account team if you are interested in this region.
  • europe-west1 (Belgium) Low CO2
  • europe-west4 (Netherlands) Low CO2
  • us-central1 (Iowa) Low CO2
  • us-east4 (Northern Virginia)

Supported GPU types

You can use one L4 GPU per Cloud Run instance. An L4 GPU has the following pre-installed drivers:

  • The current NVIDIA driver version: 535.216.03 (CUDA 12.2)

Pricing impact

See Cloud Run pricing for GPU pricing details. Note the following requirements and considerations:

  • There are no per-request fees. You must use instance-based billing to use the GPU feature; minimum instances are charged at the full rate even when idle.
  • There is a difference in cost between GPU zonal redundancy and non-zonal redundancy. See Cloud Run pricing for GPU pricing details.
  • When you deploy services or functions from source code with GPU configuration using gcloud beta, Cloud Run uses the e2-highcpu-8 machine type instead of the default e2-standard-2 machine type (Preview). The larger machine type provides more CPU and higher network bandwidth, which results in faster build times.
  • You must use a minimum of 4 CPU and 16 GiB of memory (see the sketch after this list).
  • GPU is billed for the entire duration of the instance lifecycle.
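
As a concrete reference for these requirements, here is a minimal sketch of a deploy command that satisfies the resource floor and uses instance-based billing; SERVICE and IMAGE_URL are placeholders:

    gcloud run deploy SERVICE \
      --image IMAGE_URL \
      --cpu 4 \
      --memory 16Gi \
      --no-cpu-throttling \
      --gpu 1 \
      --gpu-type nvidia-l4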

GPU zonal redundancy options

By default, Cloud Run deploys your service across multiple zones within a region. This architecture provides inherent resilience: if a zone experiences an outage, Cloud Run automatically routes traffic away from the affected zone to healthy zones within the same region.

When working with GPU resources, keep in mind that GPUs have specific capacity constraints. During a zonal outage, the standard failover mechanism for GPU workloads relies on sufficient unused GPU capacity being available in the remaining healthy zones. Due to the constrained nature of GPUs, this capacity might not always be available.

To increase the availability of your GPU-accelerated services during zonal outages, you can configure zonal redundancy specifically for GPUs:

  • Zonal redundancy turned on (default): Cloud Run reserves GPU capacity for your service across multiple zones. This significantly increases the probability that your service can successfully handle traffic rerouted from an affected zone, offering higher reliability during zonal failures at an additional cost per GPU second.

  • Zonal redundancy turned off: Cloud Run attempts failover for GPU workloads on a best-effort basis. Traffic is routed to other zones only if sufficient GPU capacity is available at that moment. This option does not guarantee reserved capacity for failover, but it results in a lower cost per GPU second.
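
Both options correspond to deploy-time flags, described later on this page in the gcloud section. A minimal sketch of switching an existing service between them:

    # Turn zonal redundancy off (lower cost, best-effort failover).
    gcloud run services update SERVICE --no-gpu-zonal-redundancy

    # Turn zonal redundancy on (reserved failover capacity, higher cost).
    gcloud run services update SERVICE --gpu-zonal-redundancy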

SLA

The SLA for Cloud Run GPU depends on whether the service uses the zonal redundancy or non-zonal redundancy option. Refer to the SLA page for details.

Request a quota increase

Projects using Cloud Run nvidia-l4 GPUs in a region for the first time are automatically granted a quota of 3 GPUs (zonal redundancy off) when the first deployment is created. If you need additional Cloud Run GPUs, you must request a quota increase for your Cloud Run service. Use the links in the following table to request the quota you need.

Quota needed                                         Quota link
GPU with zonal redundancy turned off (lower price)   Request GPU quota without zonal redundancy
GPU with zonal redundancy turned on (higher price)   Request GPU quota with zonal redundancy

For more information on requesting quota increases, see How to increase quota.

Before you begin

Before using GPUs in Cloud Run, complete the following steps:

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Verify that billing is enabled for your Google Cloud project.

  4. Enable the Cloud Run API.

    Enable the API

  5. Request the required quota.
  6. Consult Best practices: AI inference on Cloud Run with GPUs for recommendations on building your container image and loading large models.
  7. Make sure your Cloud Run service is configured with at least 4 CPU and 16 GiB of memory, as described in Pricing impact.

Required roles

To get the permissions that you need to configure and deploy Cloud Run services, ask your administrator to grant you the following IAM roles on services:

  • Cloud Run Developer (roles/run.developer) on the Cloud Run service
  • Service Account User (roles/iam.serviceAccountUser) on the service identity

If you are deploying a service or function from source code, you must also have additional roles granted to you on your project and Cloud Build service account.

For a list of IAM roles and permissions that are associated with Cloud Run, see Cloud Run IAM roles and Cloud Run IAM permissions. If your Cloud Run service interfaces with Google Cloud APIs, such as Cloud Client Libraries, see the service identity configuration guide. For more information about granting roles, see deployment permissions and manage access.
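
For reference, both roles can be granted with the gcloud CLI. The following is a sketch; USER_EMAIL, SERVICE, REGION, and SERVICE_ACCOUNT_EMAIL are placeholders for your own values:

    # Grant Cloud Run Developer on the Cloud Run service.
    gcloud run services add-iam-policy-binding SERVICE \
      --region REGION \
      --member "user:USER_EMAIL" \
      --role "roles/run.developer"

    # Grant Service Account User on the service identity.
    gcloud iam service-accounts add-iam-policy-binding SERVICE_ACCOUNT_EMAIL \
      --member "user:USER_EMAIL" \
      --role "roles/iam.serviceAccountUser"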

Configure a Cloud Run service with GPU

Any configuration change leads to the creation of a new revision. Subsequent revisions will also automatically get this configuration setting unless you make explicit updates to change it.

You can use the Google Cloud console, the Google Cloud CLI, or YAML to configure GPU.

Console

  1. In the Google Cloud console, go to Cloud Run:

    Go to Cloud Run

  2. Select Services from the menu, and click Deploy container to configure a new service. If you are configuring an existing service, click the service, then click Edit and deploy new revision.

  3. If you are configuring a new service, fill out the initial service settings page, then click Container(s), Volumes, Networking, Security to expand the service configuration page.

  4. Click the Container tab.


    • Configure CPU, memory, concurrency, execution environment, and startup probe following the recommendations in Before you begin.
    • Check the GPU checkbox, then select the GPU type from the GPU type menu, and the number of GPUs from the Number of GPUs menu.
    • By default for new services, zonal redundancy is turned on. To change the current setting, select the GPU checkbox to show the GPU redundancy options.
      • Select No zonal redundancy to turn off zonal redundancy.
      • Select Zonal redundancy to turn on zonal redundancy.
  5. Click Create or Deploy.

gcloud

To create a service with GPU enabled, use the gcloud run deploy command:

  • To deploy a container:

      
    gcloud run deploy SERVICE \
      --image IMAGE_URL \
      --gpu 1

    Replace the following:

    • SERVICE : the name of your Cloud Run service.
    • IMAGE_URL : a reference to the container image, for example, us-docker.pkg.dev/cloudrun/container/hello:latest. If you use Artifact Registry, the repository REPO_NAME must already be created. The URL follows the format LOCATION-docker.pkg.dev/PROJECT_ID/REPO_NAME/PATH:TAG.
  • To deploy source code and let Cloud Run default to the e2-highcpu-8 machine type for Cloud Build, use the gcloud beta run deploy command:

      
    gcloud beta run deploy SERVICE \
      --source . \
      --gpu 1

To update the GPU configuration for a service, use the gcloud run services update command:

  
gcloud run services update SERVICE \
  --image IMAGE_URL \
  --cpu CPU \
  --memory MEMORY \
  --no-cpu-throttling \
  --gpu GPU_NUMBER \
  --gpu-type GPU_TYPE \
  --max-instances MAX_INSTANCE \
  --GPU_ZONAL_REDUNDANCY

Replace the following:

  • SERVICE : the name of your Cloud Run service.
  • IMAGE_URL : a reference to the container image, for example, us-docker.pkg.dev/cloudrun/container/hello:latest. If you use Artifact Registry, the repository REPO_NAME must already be created. The URL follows the format LOCATION-docker.pkg.dev/PROJECT_ID/REPO_NAME/PATH:TAG.
  • CPU : the number of CPUs. You must specify at least 4 CPU.
  • MEMORY : the amount of memory. You must specify at least 16Gi (16 GiB).
  • GPU_NUMBER : the value 1 (one). If this is unspecified but a GPU_TYPE is present, the default is 1.
  • GPU_TYPE : the GPU type. If this is unspecified but a GPU_NUMBER is present, the default is nvidia-l4 (lowercase L in l4, not the numeral fourteen).
  • MAX_INSTANCE : the maximum number of instances. This number can't exceed the GPU quota allocated for your project.
  • GPU_ZONAL_REDUNDANCY : no-gpu-zonal-redundancy to turn off zonal redundancy, or gpu-zonal-redundancy to turn on zonal redundancy.
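
For example, a filled-in invocation might look like the following sketch; the service name and values are illustrative only:

    gcloud run services update my-service \
      --image us-docker.pkg.dev/cloudrun/container/hello:latest \
      --cpu 4 \
      --memory 16Gi \
      --no-cpu-throttling \
      --gpu 1 \
      --gpu-type nvidia-l4 \
      --max-instances 3 \
      --no-gpu-zonal-redundancy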

YAML

  1. If you are creating a new service, skip this step. If you are updating an existing service, download its YAML configuration:

    gcloud run services describe SERVICE --format export > service.yaml
  2. Update the nvidia.com/gpu: attribute and the nodeSelector: run.googleapis.com/accelerator: attribute:

    apiVersion: serving.knative.dev/v1
    kind: Service
    metadata:
      name: SERVICE
    spec:
      template:
        metadata:
          annotations:
            autoscaling.knative.dev/maxScale: 'MAX_INSTANCE'
            run.googleapis.com/cpu-throttling: 'false'
            run.googleapis.com/gpu-zonal-redundancy-disabled: 'GPU_ZONAL_REDUNDANCY'
        spec:
          containers:
          - image: IMAGE_URL
            ports:
            - containerPort: CONTAINER_PORT
              name: http1
            resources:
              limits:
                cpu: 'CPU'
                memory: 'MEMORY'
                nvidia.com/gpu: '1'
            # Optional: use a longer startup probe to allow long-starting containers
            startupProbe:
              failureThreshold: 1800
              periodSeconds: 1
              tcpSocket:
                port: CONTAINER_PORT
              timeoutSeconds: 1
          nodeSelector:
            run.googleapis.com/accelerator: GPU_TYPE

    Replace the following:

    • SERVICE : the name of your Cloud Run service.
    • IMAGE_URL : a reference to the container image, for example, us-docker.pkg.dev/cloudrun/container/hello:latest. If you use Artifact Registry, the repository REPO_NAME must already be created. The URL follows the format LOCATION-docker.pkg.dev/PROJECT_ID/REPO_NAME/PATH:TAG.
    • CONTAINER_PORT : the container port set for your service.
    • CPU : the number of CPU. You must specify at least 4 CPU.
    • MEMORY : the amount of memory. You must specify at least 16Gi (16 GiB).
    • GPU_TYPE : the value nvidia-l4 (lowercase L in l4, not the numeral fourteen).
    • MAX_INSTANCE : the maximum number of instances. This number can't exceed the GPU quota allocated for your project.
    • GPU_ZONAL_REDUNDANCY : false to turn on GPU zonal redundancy, or true to turn it off.
  3. Create or update the service using the following command:

    gcloud run services replace service.yaml

Terraform

To learn how to apply or remove a Terraform configuration, see Basic Terraform commands .

Add the following to a google_cloud_run_v2_service resource in your Terraform configuration:
resource "google_cloud_run_v2_service" "default" {
  provider = google-beta
  name     = "SERVICE"
  location = "europe-west1"

  template {
    gpu_zonal_redundancy_disabled = "GPU_ZONAL_REDUNDANCY"
    containers {
      image = "IMAGE_URL"
      resources {
        limits = {
          "cpu"            = "CPU"
          "memory"         = "MEMORY"
          "nvidia.com/gpu" = "1"
        }
      }
    }
    node_selector {
      accelerator = "GPU_TYPE"
    }
  }
}

Replace the following:

  • SERVICE : the name of your Cloud Run service.
  • GPU_ZONAL_REDUNDANCY : false to turn on GPU zonal redundancy, or true to turn it off.
  • IMAGE_URL : a reference to the container image, for example, us-docker.pkg.dev/cloudrun/container/hello:latest. If you use Artifact Registry, the repository REPO_NAME must already be created. The URL follows the format LOCATION-docker.pkg.dev/PROJECT_ID/REPO_NAME/PATH:TAG.
  • CPU : the number of CPU. You must specify at least 4 CPU.
  • MEMORY : the amount of memory. You must specify at least 16Gi (16 GiB).
  • GPU_TYPE : the value nvidia-l4 (lowercase L in l4, not the numeral fourteen).

View GPU settings

To view the current GPU settings for your Cloud Run service:

Console

  1. In the Google Cloud console, go to Cloud Run:

    Go to Cloud Run

  2. Click the service you are interested in to open the Service details page.

  3. Click the Revisions tab.

  4. In the details panel at the right, the GPU setting is listed under the Container tab.

gcloud

  1. Use the following command:

    gcloud run services describe SERVICE
  2. Locate the GPU setting in the returned configuration.
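
To narrow the output to just the GPU-related fields, you can filter the exported YAML. A quick sketch, assuming a Unix-like shell:

    # Show the accelerator node selector and the GPU resource limit.
    gcloud run services describe SERVICE --format export \
      | grep -E "accelerator|nvidia.com/gpu"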

Removing GPU

You can remove GPU using the Google Cloud console, the Google Cloud CLI, or YAML.

Console

  1. In the Google Cloud console, go to Cloud Run:

    Go to Cloud Run

  2. Select Services from the menu, and click Deploy container to configure a new service. If you are configuring an existing service, click the service, then click Edit and deploy new revision.

  3. If you are configuring a new service, fill out the initial service settings page, then click Container(s), Volumes, Networking, Security to expand the service configuration page.

  4. Click the Container tab.


    • Uncheck the GPU checkbox.
  5. Click Create or Deploy.

gcloud

To remove GPU, set the number of GPUs to 0 using the gcloud run services update command:

  
gcloud run services update SERVICE --gpu 0

Replace SERVICE with the name of your Cloud Run service.

YAML

  1. If you are creating a new service, skip this step. If you are updating an existing service, download its YAML configuration:

    gcloud run services describe SERVICE --format export > service.yaml
  2. Delete the nvidia.com/gpu: and the nodeSelector: run.googleapis.com/accelerator: nvidia-l4 lines.

  3. Create or update the service using the following command:

    gcloud run services replace service.yaml

Libraries

By default, all of the NVIDIA L4 driver libraries are mounted under /usr/local/nvidia/lib64. Cloud Run automatically appends this path to the LD_LIBRARY_PATH environment variable (that is, ${LD_LIBRARY_PATH}:/usr/local/nvidia/lib64) of the container with the GPU. This lets the dynamic linker find the NVIDIA driver libraries. The linker searches and resolves paths in the order you list them in the LD_LIBRARY_PATH environment variable. Any values you specify in this variable take precedence over the default Cloud Run driver libraries path /usr/local/nvidia/lib64.
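
To see how the search path is composed at runtime, you can inspect it from inside a running instance. A minimal sketch (run from your container's entrypoint or a startup script):

    # Print the library search path the dynamic linker will use.
    echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH}"

    # Confirm the mounted NVIDIA driver libraries are present.
    ls /usr/local/nvidia/lib64 | head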

If you want to use a CUDA version greater than 12.2, the easiest way is to depend on a newer NVIDIA base image with forward compatibility packages already installed. Another option is to manually install the NVIDIA forward compatibility packages and add them to LD_LIBRARY_PATH . Consult NVIDIA's compatibility matrix to determine which CUDA versions are forward compatible with the provided NVIDIA driver version (535.216.03).
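
As an illustration of the manual option, the sketch below prepends a forward-compatibility directory to the linker path so it takes precedence over the mounted driver libraries. The /usr/local/cuda/compat location is an assumption; verify where your chosen compatibility package actually installs its libraries:

    # Hypothetical path: assumes the NVIDIA forward-compatibility package
    # installed its libraries under /usr/local/cuda/compat in your image.
    export LD_LIBRARY_PATH=/usr/local/cuda/compat:${LD_LIBRARY_PATH}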

About GPUs and maximum instances

The number of instances with GPUs is limited in two ways:

  • The Maximum instances setting limits the number of instances per service. This can't be set higher than the per-project, per-region GPU quota.
  • The GPU quota per project per region limits the number of instances across all services in the same region.
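
For example, with the default quota of 3 GPUs (zonal redundancy off), you would cap the service at that number. A sketch:

    # Keep maximum instances within the project's regional GPU quota.
    gcloud run services update SERVICE --max-instances 3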