Best practices: Cloud Run jobs with GPUs
This page provides best practices for optimizing performance when using a
Cloud Run job with GPU for AI workloads such as
training large language models (LLMs) using your preferred frameworks, fine-tuning,
and performing batch or offline inference on LLMs.
To create a Cloud Run job that can perform compute-intensive
tasks or batch processing in real time, you should:
Use models that load fast and require minimal transformation into GPU-ready
structures, and optimize how they are loaded.
Use configurations that allow for maximally efficient concurrent execution, to
reduce the number of GPUs needed to reach a target requests-per-second rate while
keeping costs down.
Recommended ways to load large ML models on Cloud Run
Google recommends either storing ML models inside container images or optimizing how you load them from Cloud Storage.
Storing and loading ML models trade-offs
Here is a comparison of the options:
Store models in Cloud Storage, loaded with the Google Cloud CLI or the Cloud Storage API:
- Ease of setup: slightly more difficult to set up, because you'll need to either install the Google Cloud CLI in the image or update your code to use the Cloud Storage API.
- Loading performance: fast when you use network optimizations. The Google Cloud CLI downloads the model file in parallel, making it faster than a FUSE mount.
Load models from the internet:
- Ease of setup: typically simpler, because many frameworks download models from central repositories.
- Loading performance: typically poor and unpredictable. Frameworks may apply model transformations during initialization (do this at build time instead), and the model host and the libraries used to download the model may not be efficient.
- Reliability: there is a reliability risk associated with downloading from the internet. Your job could fail to start if the download target is down, and the underlying model could change, which decreases quality. We recommend hosting the model in your own Cloud Storage bucket.
- Overall performance and reliability depend on the model hosting provider.
Store models in container images
When you store the ML model in the container image, model loading benefits from Cloud Run's optimized container streaming infrastructure.
However, building container images that include ML models is a resource-intensive process,
especially when working with large models. In particular, the build process can become
bottlenecked on network throughput. When using Cloud Build, we recommend
using a more powerful build machine with increased compute and networking
performance. To do this, build an image using a build config file that has the following steps:
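```yaml
steps:
- name: 'gcr.io/cloud-builders/docker'
  args: ['build', '-t', 'IMAGE', '.']
- name: 'gcr.io/cloud-builders/docker'
  args: ['push', 'IMAGE']
images:
- IMAGE
options:
  machineType: 'E2_HIGHCPU_32'
  diskSizeGb: '500'
```
Replace IMAGE with the URI of your container image.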
If the layer containing the model is distinct between images (a different hash), each image stores its own copy of the model. This can add Artifact Registry cost, because one copy of the model is stored for each image whose model layer is unique.
Store models in Cloud Storage
To optimize ML model loading when loading ML models from Cloud Storage,
either using Cloud Storage volume mounts or directly using the Cloud Storage API or command line, you must use Direct VPC with
the egress setting value set to all-traffic, along with Private Google Access.
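The following is a minimal, hedged sketch of a concurrent download using the Cloud Storage Python client's transfer_manager helper. The bucket name, object name, and local path are placeholders, and Direct VPC with all-traffic egress plus Private Google Access are assumed to be configured; verify the helper's availability in your google-cloud-storage version.

```python
# Hedged sketch: download a large model object from Cloud Storage in parallel chunks.
# Bucket, object, and local paths are placeholders; adjust for your job.
import os

from google.cloud import storage
from google.cloud.storage import transfer_manager

BUCKET_NAME = "my-models-bucket"     # assumption: your model bucket
BLOB_NAME = "llm/model-q4.gguf"      # assumption: a single-file checkpoint
LOCAL_PATH = "/models/model-q4.gguf" # assumes a writable directory in the container

os.makedirs(os.path.dirname(LOCAL_PATH), exist_ok=True)

client = storage.Client()
blob = client.bucket(BUCKET_NAME).blob(BLOB_NAME)
blob.reload()  # fetch object metadata (such as size) used to plan the chunked download

# Download the object in parallel chunks, similar in spirit to the CLI's parallel download.
transfer_manager.download_chunks_concurrently(blob, LOCAL_PATH, max_workers=8)
```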
Load models from the internet
To optimize ML model loading from the internet, route all traffic through
the VPC network with the egress setting
value set to all-traffic, and set up Cloud NAT to reach the public internet at high bandwidth.
Build, deployment, runtime, and system design considerations
The following sections describe considerations for build, deploy, runtime and system design.
At build time
The following list shows considerations you need to take into account when you
are planning your build:
Choose a good base image. You should start with an image from the Deep Learning Containers or the NVIDIA container registry for the ML framework you're using. These images have the latest performance-related
packages installed. We don't recommend creating a custom image.
Choose 4-bit quantized models to maximize concurrency unless you can prove
they affect result quality. Quantization produces smaller and faster models,
reducing the amount of GPU memory needed to serve the model, and can increase
parallelism at run time. Ideally, the models
should be trained at the target bit depth rather than quantized down to it.
Pick a model format with fast load times to minimize container startup time,
such as GGUF. These formats more accurately reflect the target quantization type
and require fewer transformations when loaded onto the GPU. For security reasons,
don't use pickle-format checkpoints.
Create and warm LLM caches at build time. Start the LLM on the build machine
while building the docker image. Enable prompt caching and feed common or example
prompts to help warm the cache for real-world use. Save the outputs it generates
to be loaded at runtime.
Save your own inference model that you generate during build time. This saves
significant time compared to loading less efficiently stored models and applying
transforms like quantization at container startup.
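The following is a minimal, hedged sketch of this build step using the Hugging Face transformers library as one example (an assumption; this page doesn't prescribe a framework). It downloads a model once at build time and saves a ready-to-load safetensors artifact inside the image; the model ID and output path are placeholders.

```python
# Hedged sketch: run this during the container build (for example, from a RUN step)
# so the image ships a ready-to-load model. Model ID and paths are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "my-org/my-llm"        # assumption: a Hugging Face model ID
OUTPUT_DIR = "/models/prepared"   # baked into the image layer

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# Save in safetensors format, which loads quickly and avoids pickle checkpoints.
model.save_pretrained(OUTPUT_DIR, safe_serialization=True)
tokenizer.save_pretrained(OUTPUT_DIR)
```

At container startup, the job then loads from /models/prepared instead of downloading and transforming the model again.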
At deployment
The following list shows considerations you need to take into account when you
are planning your deployment:
Set a task timeout of one hour or less for job executions.
If you are running parallel tasks in a job execution, determine and set parallelism to less than the lowest value of the applicable quota limits you allocated for your project. By default, the GPU job instance quota is set to 5 for tasks that run in parallel. To request a quota increase, see How to increase quota. GPU tasks start as quickly as possible,
and go up to a maximum that varies depending on how much GPU quota you allocated
for the project and the selected region. Deployments fail if you set parallelism
to more than the GPU quota limit.
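As a hedged illustration only (the gcloud CLI or the console is the typical path), the following sketch uses the google-cloud-run Python client to cap a job's parallelism below the GPU quota. The job resource name is a placeholder; verify field and method names against your installed client library version.

```python
# Hedged sketch using the google-cloud-run client library; the resource name is a placeholder.
from google.cloud import run_v2

client = run_v2.JobsClient()
job_name = "projects/PROJECT_ID/locations/REGION/jobs/JOB_NAME"

job = client.get_job(name=job_name)
job.template.parallelism = 4  # keep at or below your GPU job instance quota (default 5)

operation = client.update_job(job=job)
operation.result()  # block until the update completes
```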
At run time
Actively manage your supported context length. The smaller the context window
you support, the more queries you can support running in parallel. The details
of how to do this depend on the framework.
Use the LLM caches you generated at build time. Supply the same flags you
used at build time when you generated the prompt and prefix cache.
Load from the saved model you wrote at build time. See Storing and loading ML models trade-offs earlier on this page for a comparison of the loading options.
Consider using a quantized key-value cache if your framework supports it.
This can reduce per-query memory requirements and allows for configuration of
more parallelism. However, it can also impact quality.
Tune the amount of GPU memory to reserve for model weights, activations, and
key-value caches. Set it as high as you can without getting an out-of-memory error. A combined sketch of these runtime settings follows this list.
Check to see whether your framework has any
options for improving container startup performance (for example, using model loading parallelization).
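The following sketch ties several of these settings together using vLLM's offline inference API as one example framework (an assumption; this page doesn't prescribe a framework). The model path and values are illustrative, and the parameter names are vLLM-specific, so map them to your framework's equivalents.

```python
# Hedged sketch: offline batch inference with vLLM as an example framework.
# Values are illustrative; parameter names are vLLM-specific.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/prepared",        # model baked into the image or loaded at startup
    max_model_len=4096,              # smaller context window -> more parallel queries
    gpu_memory_utilization=0.95,     # as high as possible without out-of-memory errors
    kv_cache_dtype="fp8",            # quantized key-value cache, if supported; may affect quality
    enable_prefix_caching=True,      # reuse cached prefixes for shared preambles
)

prompts = ["Summarize the following report: ...", "Classify this ticket: ..."]
outputs = llm.generate(prompts, SamplingParams(max_tokens=256))
for output in outputs:
    print(output.outputs[0].text)
```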
At the system design level
Add semantic caches where appropriate. In some cases, caching whole queries
and responses can be a great way of limiting the cost of common queries; a minimal caching sketch follows this list.
Control variance in your preambles. Prompt caches are effectively prefix caches:
they are only useful when the cached content appears at the start of the prompt,
in the same sequence. Insertions or edits earlier in the sequence mean that later
content is either not cached or only partially cached.
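As a minimal illustration of the whole-query caching idea above (not a full semantic cache, which would compare query embeddings rather than normalized hashes), the following self-contained sketch caches responses for repeated queries; the generate callable stands in for your inference call.

```python
# Minimal sketch of a whole-query response cache keyed on a normalized hash of the query.
import hashlib

_cache: dict[str, str] = {}

def _key(query: str) -> str:
    # Normalize whitespace and case so trivially different phrasings hit the same entry.
    return hashlib.sha256(" ".join(query.lower().split()).encode()).hexdigest()

def cached_generate(query: str, generate) -> str:
    """Return a cached response for a repeated query, otherwise call `generate` and cache it."""
    key = _key(query)
    if key not in _cache:
        _cache[key] = generate(query)
    return _cache[key]

# Example usage with a placeholder generate function standing in for model inference.
print(cached_generate("What are your opening hours?", lambda q: f"(model output for: {q})"))
print(cached_generate("what are your  opening hours?", lambda q: f"(model output for: {q})"))  # cache hit
```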
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-09-04 UTC."],[],[],null,["# Best practices: Cloud Run jobs with GPUs\n\n| **Preview\n| --- GPU support for Cloud Run jobs**\n|\n|\n| This feature is subject to the \"Pre-GA Offerings Terms\" in the General Service Terms section\n| of the [Service Specific Terms](/terms/service-terms#1).\n|\n| Pre-GA features are available \"as is\" and might have limited support.\n|\n| For more information, see the\n| [launch stage descriptions](/products#product-launch-stages).\nThis page provides best practices for optimizing performance when using a Cloud Run job with GPU for AI workloads such as, training large language models (LLMs) using your preferred frameworks, fine-tuning, and performing batch or offline inference on LLMs. To create a Cloud Run job that can perform compute intensive tasks or batch processing in real time, you should:\n\n- Use models that load fast and require minimal transformation into GPU-ready structures, and optimize how they are loaded.\n- Use configurations that allow for maximum, efficient, concurrent execution to reduce the number of GPUs needed to serve a target request per second while keeping costs down.\n\nRecommended ways to load large ML models on Cloud Run\n-----------------------------------------------------\n\nGoogle recommends either storing ML models [inside container images](#model-container)\nor [optimize loading them from Cloud Storage](#model-storage).\n\n### Storing and loading ML models trade-offs\n\nHere is a comparison of the options:\n\n### Store models in container images\n\nBy storing the ML model in the container image, model loading will benefit from Cloud Run's optimized container streaming infrastructure.\nHowever, building container images that include ML models is a resource-intensive process,\nespecially when working with large models. In particular, the build process can become\nbottlenecked on network throughput. When using Cloud Build, we recommend\nusing a more powerful build machine with increased compute and networking\nperformance. To do this, build an image using a\n[build config file](/build/docs/build-push-docker-image#build_an_image_using_a_build_config_file)\nthat has the following steps: \n\n```yaml\nsteps:\n- name: 'gcr.io/cloud-builders/docker'\n args: ['build', '-t', '\u003cvar translate=\"no\"\u003eIMAGE\u003c/var\u003e', '.']\n- name: 'gcr.io/cloud-builders/docker'\n args: ['push', '\u003cvar translate=\"no\"\u003eIMAGE\u003c/var\u003e']\nimages:\n- IMAGE\noptions:\n machineType: 'E2_HIGHCPU_32'\n diskSizeGb: '500'\n \n```\n\nYou can create one model copy per image if the layer containing the model is\ndistinct between images (different hash). 
There could be additional Artifact Registry\ncost because there could be one copy of the model per image if your model layer\nis unique across each image.\n\n### Store models in Cloud Storage\n\nTo optimize ML model loading when loading ML models from Cloud Storage,\neither using [Cloud Storage volume mounts](/run/docs/configuring/jobs/cloud-storage-volume-mounts)\nor directly using the Cloud Storage API or command line, you must use [Direct VPC](/run/docs/configuring/vpc-direct-vpc) with\nthe egress setting value set to `all-traffic`, along with [Private Google Access](/vpc/docs/configure-private-google-access).\n\n### Load models from the internet\n\nTo optimize ML model loading from the internet, [route all traffic through\nthe vpc network](/run/docs/configuring/vpc-direct-vpc) with the egress setting\nvalue set to `all-traffic` and set up [Cloud NAT](/nat/docs/overview) to reach the public internet at high bandwidth.\n\nBuild, deployment, runtime, and system design considerations\n------------------------------------------------------------\n\nThe following sections describe considerations for build, deploy, runtime and system design.\n\n### At build time\n\nThe following list shows considerations you need to take into account when you\nare planning your build:\n\n- Choose a good base image. You should start with an image from the [Deep Learning Containers](/deep-learning-containers/docs/choosing-container) or the [NVIDIA container registry](https://catalog.ngc.nvidia.com/containers) for the ML framework you're using. These images have the latest performance-related packages installed. We don't recommend creating a custom image.\n- Choose 4-bit quantized models to maximize concurrency unless you can prove they affect result quality. Quantization produces smaller and faster models, reducing the amount of GPU memory needed to serve the model, and can increase parallelism at run time. Ideally, the models should be trained at the target bit depth rather than quantized down to it.\n- Pick a model format with fast load times to minimize container startup time, such as GGUF. These formats more accurately reflect the target quantization type and require less transformations when loaded onto the GPU. For security reasons, don't use pickle-format checkpoints.\n- Create and warm LLM caches at build time. Start the LLM on the build machine while building the docker image. Enable prompt caching and feed common or example prompts to help warm the cache for real-world use. Save the outputs it generates to be loaded at runtime.\n- Save your own inference model that you generate during build time. This saves significant time compared to loading less efficiently stored models and applying transforms like quantization at container startup.\n\n### At deployment\n\nThe following list shows considerations you need to take into account when you\nare planning your deployment:\n\n- Set a [task timeout of one hour\n or lesser](/run/docs/configuring/task-timeout) for job executions.\n- If you are running parallel tasks in a job execution, determine and set [parallelism](/run/docs/configuring/parallelism) to less than the lowest value of the [applicable quota limits](/run/docs/configuring/jobs/gpu#request-quota) you allocated for your project. By default, the GPU job instance quota is set to `5` for tasks that run parallelly. To request for a quota increase, see [How to increase quota](/run/quotas#increase). 
GPU tasks start as quickly as possible, and go up to a maximum that varies depending on how much GPU quota you allocated for the project and the region selected. Deployments fail if you set parallelism to more than the GPU quota limit.\n\n### At run time\n\n- Actively manage your supported context length. The smaller the context window you support, the more queries you can support running in parallel. The details of how to do this depend on the framework.\n- Use the LLM caches you generated at build time. Supply the same flags you used during build time when you generated the prompt and prefix cache.\n- Load from the saved model you just wrote. See [Storing and loading models trade-offs](#loading-storing-models-tradeoff) for a comparison on how to load the model.\n- Consider using a quantized key-value cache if your framework supports it. This can reduce per-query memory requirements and allows for configuration of more parallelism. However, it can also impact quality.\n- Tune the amount of GPU memory to reserve for model weights, activations and key-value caches. Set it as high as you can without getting an out-of-memory error.\n- Check to see whether your framework has any options for improving container startup performance (for example, using model loading parallelization).\n\n### At the system design level\n\n- Add semantic caches where appropriate. In some cases, caching whole queries and responses can be a great way of limiting the cost of common queries.\n- Control variance in your preambles. Prompt caches are only useful when they contain the prompts in sequence. Caches are effectively prefix-cached. Insertions or edits in the sequence mean that they're either not cached or only partially present."]]