Run LLM inference on Cloud Run GPUs with vLLM

The following codelab shows how to run a backend service on Cloud Run GPUs using vLLM, an inference engine for production systems, serving Google's Gemma 2, a 2 billion parameter instruction-tuned model.
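Once the service is deployed, clients can talk to it over HTTP because vLLM exposes an OpenAI-compatible API. The sketch below is a minimal, hedged example of querying such a service from Python; the Cloud Run service URL is a placeholder and the model identifier `google/gemma-2-2b-it` is an assumption, not taken from the codelab. Authentication (e.g. an identity token) may also be required depending on how the service is configured.

```python
# Minimal sketch: query a vLLM server through its OpenAI-compatible API.
# Assumptions (not from the codelab): the Cloud Run URL below is a placeholder,
# and the model is served under the Hugging Face id "google/gemma-2-2b-it".
import json
import urllib.request

SERVICE_URL = "https://vllm-gemma-xxxxxxxx-uc.a.run.app"  # hypothetical Cloud Run URL
MODEL_ID = "google/gemma-2-2b-it"                          # assumed model identifier

payload = {
    "model": MODEL_ID,
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
    "max_tokens": 128,
}

request = urllib.request.Request(
    f"{SERVICE_URL}/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(request) as response:
    body = json.load(response)
    # The OpenAI-compatible schema returns the reply in choices[].message.content.
    print(body["choices"][0]["message"]["content"])
```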

See the entire codelab at Run LLM inference on Cloud Run GPUs with vLLM.
