Best practices: Cloud Run jobs with GPUs
This page provides best practices for optimizing performance when using a
Cloud Run job with GPU for AI workloads such as
training large language models (LLMs) using your preferred frameworks, fine-tuning,
and performing batch or offline inference on LLMs.
To create a Cloud Run job that can perform compute-intensive
tasks or batch processing in real time, you should:
Use models that load fast and require minimal transformation into GPU-ready
structures, and optimize how they are loaded.
Use configurations that allow for maximally efficient concurrent execution, to
reduce the number of GPUs needed to reach a target requests-per-second rate while
keeping costs down.
Recommended ways to load large ML models on Cloud Run
Google recommends either storing ML models inside container images or optimizing how you load them from Cloud Storage.
Storing and loading ML models trade-offs
Here is a comparison of the options:
Store models in Cloud Storage, loaded with the Google Cloud CLI or the Cloud Storage API:
- Ease of setup: slightly more difficult to set up, because you'll need to either install the Google Cloud CLI in the image or update your code to use the Cloud Storage API.
- Loading performance: fast when you use network optimizations. The Google Cloud CLI downloads the model file in parallel, making it faster than a FUSE mount.
Load models from the internet:
- Ease of setup: typically simpler, because many frameworks download models from central repositories.
- Loading performance: typically poor and unpredictable. Frameworks may apply model transformations during initialization (do this at build time instead), and the model host and the libraries used to download the model may not be efficient.
- Reliability: there is a reliability risk associated with downloading from the internet. Your job could fail to start if the download target is down, and the underlying model could change, which decreases quality. We recommend hosting the model in your own Cloud Storage bucket.
- Overall performance and reliability depend on the model hosting provider.
Store models in container images
When you store the ML model in the container image, model loading benefits from Cloud Run's optimized container streaming infrastructure.
However, building container images that include ML models is a resource-intensive process,
especially when working with large models. In particular, the build process can become
bottlenecked on network throughput. When using Cloud Build, we recommend
using a more powerful build machine with increased compute and networking
performance. To do this, build an image using a build config file that has the following steps:
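```yaml
steps:
- name: 'gcr.io/cloud-builders/docker'
  args: ['build', '-t', 'IMAGE', '.']
- name: 'gcr.io/cloud-builders/docker'
  args: ['push', 'IMAGE']
images:
- IMAGE
options:
  machineType: 'E2_HIGHCPU_32'
  diskSizeGb: '500'
```
Replace IMAGE with the URI of your container image.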
If the layer containing the model is distinct between images (a different hash), each image stores its own copy of the model. This can add Artifact Registry cost, because one copy of the model is stored for each image whose model layer is unique.
Store models in Cloud Storage
To optimize ML model loading when loading ML models from Cloud Storage,
either using Cloud Storage volume mounts or directly using the Cloud Storage API or command line, you must use Direct VPC with
the egress setting value set to all-traffic, along with Private Google Access.
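The following is a minimal, hedged sketch of a concurrent download using the Cloud Storage Python client's transfer_manager helper. The bucket name, object name, and local path are placeholders, and Direct VPC with all-traffic egress plus Private Google Access are assumed to be configured; verify the helper's availability in your google-cloud-storage version.

```python
# Hedged sketch: download a large model object from Cloud Storage in parallel chunks.
# Bucket, object, and local paths are placeholders; adjust for your job.
import os

from google.cloud import storage
from google.cloud.storage import transfer_manager

BUCKET_NAME = "my-models-bucket"     # assumption: your model bucket
BLOB_NAME = "llm/model-q4.gguf"      # assumption: a single-file checkpoint
LOCAL_PATH = "/models/model-q4.gguf" # assumes a writable directory in the container

os.makedirs(os.path.dirname(LOCAL_PATH), exist_ok=True)

client = storage.Client()
blob = client.bucket(BUCKET_NAME).blob(BLOB_NAME)
blob.reload()  # fetch object metadata (such as size) used to plan the chunked download

# Download the object in parallel chunks, similar in spirit to the CLI's parallel download.
transfer_manager.download_chunks_concurrently(blob, LOCAL_PATH, max_workers=8)
```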
Load models from the internet
To optimize ML model loading from the internet, route all traffic through
the VPC network with the egress setting
value set to all-traffic, and set up Cloud NAT to reach the public internet at high bandwidth.
Build, deployment, runtime, and system design considerations
The following sections describe considerations for build, deploy, runtime and system design.
At build time
The following list shows considerations you need to take into account when you
are planning your build:
Choose a good base image. You should start with an image from the Deep Learning Containers or the NVIDIA container registry for the ML framework you're using. These images have the latest performance-related
packages installed. We don't recommend creating a custom image.
Choose 4-bit quantized models to maximize concurrency unless you can prove
they affect result quality. Quantization produces smaller and faster models,
reducing the amount of GPU memory needed to serve the model, and can increase
parallelism at run time. Ideally, the models
should be trained at the target bit depth rather than quantized down to it.
Pick a model format with fast load times to minimize container startup time,
such as GGUF. These formats more accurately reflect the target quantization type
and require fewer transformations when loaded onto the GPU. For security reasons,
don't use pickle-format checkpoints.
Create and warm LLM caches at build time. Start the LLM on the build machine
while building the docker image. Enable prompt caching and feed common or example
prompts to help warm the cache for real-world use. Save the outputs it generates
to be loaded at runtime.
Save your own inference model that you generate during build time. This saves
significant time compared to loading less efficiently stored models and applying
transforms like quantization at container startup.
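The following is a minimal, hedged sketch of this build step using the Hugging Face transformers library as one example (an assumption; this page doesn't prescribe a framework). It downloads a model once at build time and saves a ready-to-load safetensors artifact inside the image; the model ID and output path are placeholders.

```python
# Hedged sketch: run this during the container build (for example, from a RUN step)
# so the image ships a ready-to-load model. Model ID and paths are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "my-org/my-llm"        # assumption: a Hugging Face model ID
OUTPUT_DIR = "/models/prepared"   # baked into the image layer

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# Save in safetensors format, which loads quickly and avoids pickle checkpoints.
model.save_pretrained(OUTPUT_DIR, safe_serialization=True)
tokenizer.save_pretrained(OUTPUT_DIR)
```

At container startup, the job then loads from /models/prepared instead of downloading and transforming the model again.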
At deployment
The following list shows considerations you need to take into account when you
are planning your deployment:
Set a task timeout of one hour or less for job executions.
If you are running parallel tasks in a job execution, determine and set parallelism to less than the lowest value of the applicable quota limits you allocated for your project. By default, the GPU job instance quota is set to 5 for tasks that run in parallel. To request a quota increase, see How to increase quota. GPU tasks start as quickly as possible,
and go up to a maximum that varies depending on how much GPU quota you allocated
for the project and the selected region. Deployments fail if you set parallelism
to more than the GPU quota limit.
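As a hedged illustration only (the gcloud CLI or the console is the typical path), the following sketch uses the google-cloud-run Python client to cap a job's parallelism below the GPU quota. The job resource name is a placeholder; verify field and method names against your installed client library version.

```python
# Hedged sketch using the google-cloud-run client library; the resource name is a placeholder.
from google.cloud import run_v2

client = run_v2.JobsClient()
job_name = "projects/PROJECT_ID/locations/REGION/jobs/JOB_NAME"

job = client.get_job(name=job_name)
job.template.parallelism = 4  # keep at or below your GPU job instance quota (default 5)

operation = client.update_job(job=job)
operation.result()  # block until the update completes
```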
At run time
Actively manage your supported context length. The smaller the context window
you support, the more queries you can support running in parallel. The details
of how to do this depend on the framework.
Use the LLM caches you generated at build time. Supply the same flags you
used at build time when you generated the prompt and prefix cache.
Load from the saved model you wrote at build time. See Storing and loading ML models trade-offs earlier on this page for a comparison of the loading options.
Consider using a quantized key-value cache if your framework supports it.
This can reduce per-query memory requirements and allows for configuration of
more parallelism. However, it can also impact quality.
Tune the amount of GPU memory to reserve for model weights, activations, and
key-value caches. Set it as high as you can without getting an out-of-memory error. A combined sketch of these runtime settings follows this list.
Check to see whether your framework has any
options for improving container startup performance (for example, using model loading parallelization).
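The following sketch ties several of these settings together using vLLM's offline inference API as one example framework (an assumption; this page doesn't prescribe a framework). The model path and values are illustrative, and the parameter names are vLLM-specific, so map them to your framework's equivalents.

```python
# Hedged sketch: offline batch inference with vLLM as an example framework.
# Values are illustrative; parameter names are vLLM-specific.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/prepared",        # model baked into the image or loaded at startup
    max_model_len=4096,              # smaller context window -> more parallel queries
    gpu_memory_utilization=0.95,     # as high as possible without out-of-memory errors
    kv_cache_dtype="fp8",            # quantized key-value cache, if supported; may affect quality
    enable_prefix_caching=True,      # reuse cached prefixes for shared preambles
)

prompts = ["Summarize the following report: ...", "Classify this ticket: ..."]
outputs = llm.generate(prompts, SamplingParams(max_tokens=256))
for output in outputs:
    print(output.outputs[0].text)
```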
At the system design level
Add semantic caches where appropriate. In some cases, caching whole queries
and responses can be a great way of limiting the cost of common queries; a minimal caching sketch follows this list.
Control variance in your preambles. Prompt caches are effectively prefix caches:
they are only useful when the cached content appears at the start of the prompt,
in the same sequence. Insertions or edits earlier in the sequence mean that later
content is either not cached or only partially cached.
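As a minimal illustration of the whole-query caching idea above (not a full semantic cache, which would compare query embeddings rather than normalized hashes), the following self-contained sketch caches responses for repeated queries; the generate callable stands in for your inference call.

```python
# Minimal sketch of a whole-query response cache keyed on a normalized hash of the query.
import hashlib

_cache: dict[str, str] = {}

def _key(query: str) -> str:
    # Normalize whitespace and case so trivially different phrasings hit the same entry.
    return hashlib.sha256(" ".join(query.lower().split()).encode()).hexdigest()

def cached_generate(query: str, generate) -> str:
    """Return a cached response for a repeated query, otherwise call `generate` and cache it."""
    key = _key(query)
    if key not in _cache:
        _cache[key] = generate(query)
    return _cache[key]

# Example usage with a placeholder generate function standing in for model inference.
print(cached_generate("What are your opening hours?", lambda q: f"(model output for: {q})"))
print(cached_generate("what are your  opening hours?", lambda q: f"(model output for: {q})"))  # cache hit
```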
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-09-04 UTC."],[],[],null,["# Best practices: Cloud Run jobs with GPUs\n\n| **Preview\n| --- GPU support for Cloud Run jobs**\n|\n|\n| This feature is subject to the \"Pre-GA Offerings Terms\" in the General Service Terms section\n| of the [Service Specific Terms](/terms/service-terms#1).\n|\n| Pre-GA features are available \"as is\" and might have limited support.\n|\n| For more information, see the\n| [launch stage descriptions](/products#product-launch-stages).\nThis page provides best practices for optimizing performance when using a Cloud Run job with GPU for AI workloads such as, training large language models (LLMs) using your preferred frameworks, fine-tuning, and performing batch or offline inference on LLMs. To create a Cloud Run job that can perform compute intensive tasks or batch processing in real time, you should:\n\n- Use models that load fast and require minimal transformation into GPU-ready structures, and optimize how they are loaded.\n- Use configurations that allow for maximum, efficient, concurrent execution to reduce the number of GPUs needed to serve a target request per second while keeping costs down.\n\nRecommended ways to load large ML models on Cloud Run\n-----------------------------------------------------\n\nGoogle recommends either storing ML models [inside container images](#model-container)\nor [optimize loading them from Cloud Storage](#model-storage).\n\n### Storing and loading ML models trade-offs\n\nHere is a comparison of the options:\n\n### Store models in container images\n\nBy storing the ML model in the container image, model loading will benefit from Cloud Run's optimized container streaming infrastructure.\nHowever, building container images that include ML models is a resource-intensive process,\nespecially when working with large models. In particular, the build process can become\nbottlenecked on network throughput. When using Cloud Build, we recommend\nusing a more powerful build machine with increased compute and networking\nperformance. To do this, build an image using a\n[build config file](/build/docs/build-push-docker-image#build_an_image_using_a_build_config_file)\nthat has the following steps: \n\n```yaml\nsteps:\n- name: 'gcr.io/cloud-builders/docker'\n args: ['build', '-t', '\u003cvar translate=\"no\"\u003eIMAGE\u003c/var\u003e', '.']\n- name: 'gcr.io/cloud-builders/docker'\n args: ['push', '\u003cvar translate=\"no\"\u003eIMAGE\u003c/var\u003e']\nimages:\n- IMAGE\noptions:\n machineType: 'E2_HIGHCPU_32'\n diskSizeGb: '500'\n \n```\n\nYou can create one model copy per image if the layer containing the model is\ndistinct between images (different hash). 
There could be additional Artifact Registry\ncost because there could be one copy of the model per image if your model layer\nis unique across each image.\n\n### Store models in Cloud Storage\n\nTo optimize ML model loading when loading ML models from Cloud Storage,\neither using [Cloud Storage volume mounts](/run/docs/configuring/jobs/cloud-storage-volume-mounts)\nor directly using the Cloud Storage API or command line, you must use [Direct VPC](/run/docs/configuring/vpc-direct-vpc) with\nthe egress setting value set to `all-traffic`, along with [Private Google Access](/vpc/docs/configure-private-google-access).\n\n### Load models from the internet\n\nTo optimize ML model loading from the internet, [route all traffic through\nthe vpc network](/run/docs/configuring/vpc-direct-vpc) with the egress setting\nvalue set to `all-traffic` and set up [Cloud NAT](/nat/docs/overview) to reach the public internet at high bandwidth.\n\nBuild, deployment, runtime, and system design considerations\n------------------------------------------------------------\n\nThe following sections describe considerations for build, deploy, runtime and system design.\n\n### At build time\n\nThe following list shows considerations you need to take into account when you\nare planning your build:\n\n- Choose a good base image. You should start with an image from the [Deep Learning Containers](/deep-learning-containers/docs/choosing-container) or the [NVIDIA container registry](https://catalog.ngc.nvidia.com/containers) for the ML framework you're using. These images have the latest performance-related packages installed. We don't recommend creating a custom image.\n- Choose 4-bit quantized models to maximize concurrency unless you can prove they affect result quality. Quantization produces smaller and faster models, reducing the amount of GPU memory needed to serve the model, and can increase parallelism at run time. Ideally, the models should be trained at the target bit depth rather than quantized down to it.\n- Pick a model format with fast load times to minimize container startup time, such as GGUF. These formats more accurately reflect the target quantization type and require less transformations when loaded onto the GPU. For security reasons, don't use pickle-format checkpoints.\n- Create and warm LLM caches at build time. Start the LLM on the build machine while building the docker image. Enable prompt caching and feed common or example prompts to help warm the cache for real-world use. Save the outputs it generates to be loaded at runtime.\n- Save your own inference model that you generate during build time. This saves significant time compared to loading less efficiently stored models and applying transforms like quantization at container startup.\n\n### At deployment\n\nThe following list shows considerations you need to take into account when you\nare planning your deployment:\n\n- Set a [task timeout of one hour\n or lesser](/run/docs/configuring/task-timeout) for job executions.\n- If you are running parallel tasks in a job execution, determine and set [parallelism](/run/docs/configuring/parallelism) to less than the lowest value of the [applicable quota limits](/run/docs/configuring/jobs/gpu#request-quota) you allocated for your project. By default, the GPU job instance quota is set to `5` for tasks that run parallelly. To request for a quota increase, see [How to increase quota](/run/quotas#increase). 
GPU tasks start as quickly as possible, and go up to a maximum that varies depending on how much GPU quota you allocated for the project and the region selected. Deployments fail if you set parallelism to more than the GPU quota limit.\n\n### At run time\n\n- Actively manage your supported context length. The smaller the context window you support, the more queries you can support running in parallel. The details of how to do this depend on the framework.\n- Use the LLM caches you generated at build time. Supply the same flags you used during build time when you generated the prompt and prefix cache.\n- Load from the saved model you just wrote. See [Storing and loading models trade-offs](#loading-storing-models-tradeoff) for a comparison on how to load the model.\n- Consider using a quantized key-value cache if your framework supports it. This can reduce per-query memory requirements and allows for configuration of more parallelism. However, it can also impact quality.\n- Tune the amount of GPU memory to reserve for model weights, activations and key-value caches. Set it as high as you can without getting an out-of-memory error.\n- Check to see whether your framework has any options for improving container startup performance (for example, using model loading parallelization).\n\n### At the system design level\n\n- Add semantic caches where appropriate. In some cases, caching whole queries and responses can be a great way of limiting the cost of common queries.\n- Control variance in your preambles. Prompt caches are only useful when they contain the prompts in sequence. Caches are effectively prefix-cached. Insertions or edits in the sequence mean that they're either not cached or only partially present."]]