Deploy and serve Gemma 3 27B inference with vLLM on GKE


This tutorial shows you how to deploy and serve a Gemma 3 27B large language model (LLM) with the vLLM serving framework. You deploy Gemma 3 on a single A4 virtual machine (VM) instance on Google Kubernetes Engine (GKE).

This tutorial is intended for machine learning (ML) engineers, platform administrators and operators, and data and AI specialists who are interested in using Kubernetes container orchestration capabilities to serve inference workloads.

Objectives

  1. Access Gemma 3 by using Hugging Face.

  2. Prepare your environment.

  3. Create a GKE cluster in Autopilot mode.

  4. Create a Kubernetes secret for Hugging Face credentials.

  5. Deploy a vLLM container to your GKE cluster.

  6. Interact with Gemma 3 by using curl.

  7. Clean up.

Costs

This tutorial uses billable components of Google Cloud, including GKE and the GPU-attached A4 VM instances that the cluster provisions.

To generate a cost estimate based on your projected usage, use the Pricing Calculator.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

  2. Install the Google Cloud CLI.

  3. If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.

  4. To initialize the gcloud CLI, run the following command:

      gcloud init

  5. Create or select a Google Cloud project.

    • Create a Google Cloud project:

      gcloud projects create PROJECT_ID

      Replace PROJECT_ID with a name for the Google Cloud project you are creating.

    • Select the Google Cloud project that you created:

      gcloud config set project PROJECT_ID

      Replace PROJECT_ID with your Google Cloud project name.

  6. Verify that billing is enabled for your Google Cloud project.

  7. Enable the required API:

      gcloud services enable container.googleapis.com
  8. Grant roles to your user account. Run the following command once for each of the following IAM roles: roles/container.admin

      gcloud projects add-iam-policy-binding PROJECT_ID --member="user:USER_IDENTIFIER" --role=ROLE

    Replace the following:

    • PROJECT_ID : your project ID.
    • USER_IDENTIFIER : the identifier for your user account, for example, myemail@example.com.
    • ROLE : the IAM role that you grant to your user account.
  9. Sign in to or create a Hugging Face account.

Access Gemma 3 by using Hugging Face

To use Hugging Face to access Gemma 3, do the following:

  1. Sign in to Hugging Face.
  2. Create a Hugging Face read access token: click Your Profile > Settings > Access tokens > +Create new token.
  3. Copy and save the read access token value. You use it later in this tutorial.
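
Optionally, you can confirm that the token is valid before you use it in the cluster. The following is a minimal sketch that calls the Hugging Face whoami endpoint; it assumes that you have exported the token as HUGGING_FACE_TOKEN, which you do in the next section.

    # Optional check: verify the read access token against the Hugging Face API.
    # Assumes the token is exported as HUGGING_FACE_TOKEN (set in the next section).
    curl -s -H "Authorization: Bearer ${HUGGING_FACE_TOKEN}" \
        https://huggingface.co/api/whoami-v2
    # A successful response is JSON that includes your account name;
    # an invalid token returns an authorization error.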

Prepare your environment

To prepare your environment, set the default environment variables:

    gcloud config set project PROJECT_ID
    gcloud config set billing/quota_project PROJECT_ID
    export PROJECT_ID=$(gcloud config get project)
    export RESERVATION_URL=RESERVATION_URL
    export REGION=REGION
    export CLUSTER_NAME=CLUSTER_NAME
    export HUGGING_FACE_TOKEN=HUGGING_FACE_TOKEN
    export NETWORK=NETWORK_NAME
    export SUBNETWORK=SUBNETWORK_NAME
 

Replace the following:

  • PROJECT_ID : the ID of the Google Cloud project where you want to create the GKE cluster.

  • RESERVATION_URL : the URL of the reservation that you want to use to create your GKE cluster. Based on the project in which the reservation exists, specify one of the following values:

    • The reservation exists in your project: RESERVATION_NAME

    • The reservation exists in a different project, and your project can use the reservation: projects/RESERVATION_PROJECT_ID/reservations/RESERVATION_NAME

  • REGION : the region where you want to create your GKE cluster. You can only create the cluster in the region where your reservation exists.

  • CLUSTER_NAME : the name of the GKE cluster to create.

  • HUGGING_FACE_TOKEN : the Hugging Face access token that you created in the previous section.

  • NETWORK_NAME : the network that the GKE cluster uses. Specify one of the following values:

    • If you created a custom network, then specify the name of your network.

    • Otherwise, specify default.

  • SUBNETWORK_NAME : the subnetwork that the GKE cluster uses. Specify one of the following values:

    • If you created a custom subnetwork, then specify the name of your subnetwork. You can only specify a subnetwork that exists in the same region as the reservation.

    • Otherwise, specify default.
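
For example, with hypothetical values (a reservation named my-a4-reservation that exists in your own project, a cluster in us-central1, and the default network and subnetwork), the commands might look like the following. These values are placeholders for illustration only; substitute your own:

    gcloud config set project my-gemma-project
    gcloud config set billing/quota_project my-gemma-project
    export PROJECT_ID=$(gcloud config get project)
    export RESERVATION_URL=my-a4-reservation       # reservation in the same project
    export REGION=us-central1
    export CLUSTER_NAME=gemma3-vllm-cluster
    export HUGGING_FACE_TOKEN=hf_xxxxxxxxxxxxxxxx  # the read access token you saved earlier
    export NETWORK=default
    export SUBNETWORK=default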

Create a GKE cluster in Autopilot mode

To create a GKE cluster in Autopilot mode, run the following command:

    gcloud container clusters create-auto $CLUSTER_NAME \
        --project=$PROJECT_ID \
        --region=$REGION \
        --release-channel=rapid \
        --network=$NETWORK \
        --subnetwork=$SUBNETWORK
 

Creating the GKE cluster might take some time to complete. To verify that Google Cloud has finished creating your cluster, go to the Kubernetes clusters page in the Google Cloud console.
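
Alternatively, the following sketch checks the cluster status from the gcloud CLI by using the variables that you set earlier; the cluster is ready when the reported status is RUNNING:

    # Print the cluster status; expect RUNNING when creation has finished.
    gcloud container clusters describe $CLUSTER_NAME \
        --region=$REGION \
        --format="value(status)"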

Create a Kubernetes secret for Hugging Face credentials

To create a Kubernetes secret for Hugging Face credentials, follow these steps:

  1. Configure kubectl to communicate with your GKE cluster:

      gcloud container clusters get-credentials $CLUSTER_NAME \
          --location=$REGION
     
    
  2. Create a Kubernetes secret to store your Hugging Face token:

      kubectl create secret generic hf-secret \
          --from-literal=hf_api_token=${HUGGING_FACE_TOKEN} \
          --dry-run=client -o yaml | kubectl apply -f -
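
    Optionally, confirm that the secret was created. The following check shows only the secret's metadata, not the token value:

      # Expect output that lists hf-secret with TYPE Opaque and DATA 1.
      kubectl get secret hf-secret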
    

Deploy a vLLM container to your GKE cluster

To deploy the vLLM container to serve the Gemma 3 27B model by using Kubernetes Deployments, follow these steps:

  1. Create a vllm-3-27b-it.yaml file with your chosen vLLM deployment:

      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: vllm-gemma-deployment
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: gemma-server
        template:
          metadata:
            labels:
              app: gemma-server
              ai.gke.io/model: gemma-3-27b-it
              ai.gke.io/inference-server: vllm
              examples.ai.gke.io/source: user-guide
          spec:
            containers:
            - name: inference-server
              image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250801_0916_RC01
              resources:
                requests:
                  cpu: "10"
                  memory: "128Gi"
                  ephemeral-storage: "120Gi"
                  nvidia.com/gpu: "8"
                limits:
                  cpu: "10"
                  memory: "128Gi"
                  ephemeral-storage: "120Gi"
                  nvidia.com/gpu: "8"
              command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
              args:
              - --model=$(MODEL_ID)
              - --tensor-parallel-size=8
              - --host=0.0.0.0
              - --port=8000
              - --max-model-len=4096
              - --max-num-seqs=4
              env:
              - name: MODEL_ID
                value: google/gemma-3-27b-it
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
              volumeMounts:
              - mountPath: /dev/shm
                name: dshm
              livenessProbe:
                httpGet:
                  path: /health
                  port: 8000
                initialDelaySeconds: 600
                periodSeconds: 10
              readinessProbe:
                httpGet:
                  path: /health
                  port: 8000
                initialDelaySeconds: 600
                periodSeconds: 5
            volumes:
            - name: dshm
              emptyDir:
                medium: Memory
            nodeSelector:
              cloud.google.com/gke-accelerator: nvidia-b200
              cloud.google.com/reservation-name: RESERVATION_URL
              cloud.google.com/reservation-affinity: "specific"
              cloud.google.com/gke-gpu-driver-version: latest
      ---
      apiVersion: v1
      kind: Service
      metadata:
        name: llm-service
      spec:
        selector:
          app: gemma-server
        type: ClusterIP
        ports:
        - protocol: TCP
          port: 8000
          targetPort: 8000
     
    
  2. Apply the vllm-3-27b-it.yaml file to your GKE cluster:

     kubectl apply -f vllm-3-27b-it.yaml 
    

    During the deployment process, the container must download Gemma 3 from Hugging Face. For this reason, deployment of the container might take up to 30 minutes to complete. To follow the download and startup progress, see the sketch after these steps.

  3. Wait for the deployment to complete:

     kubectl wait \
        --for=condition=Available \
        --timeout=1800s deployment/vllm-gemma-deployment 
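
While you wait, you can watch the Pod and stream the server logs to follow the model download and vLLM startup. This is an optional sketch; the app=gemma-server label matches the Deployment manifest that you created in step 1:

    # Watch the Pod until it reports STATUS Running and READY 1/1.
    kubectl get pods -l app=gemma-server -w

    # In another terminal, stream the vLLM server logs to follow the model download and startup.
    kubectl logs -f -l app=gemma-server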
    

Interact with Gemma 3 by using curl

To verify your deployed Gemma 3 27B instruction-tuned model, follow these steps:

  1. Set up port forwarding to Gemma 3:

     kubectl port-forward service/llm-service 8000:8000 
    
  2. Open a new terminal window. You can then chat with your model by using curl (more example requests follow these steps):

      curl http://127.0.0.1:8000/v1/chat/completions \
          -X POST \
          -H "Content-Type: application/json" \
          -d '{
              "model": "google/gemma-3-27b-it",
              "messages": [
                  {
                      "role": "user",
                      "content": "Why is the sky blue?"
                  }
              ]
          }'
     
    

    The output is similar to the following:

      {
        "id": "chatcmpl-e4a2e624bea849d9b09f838a571c4d9e",
        "object": "chat.completion",
        "created": 1741763029,
        "model": "google/gemma-3-27b-it",
        "choices": [
          {
            "index": 0,
            "message": {
              "role": "assistant",
              "reasoning_content": null,
              "content": "Okay, let's break down why the sky appears blue! It's a fascinating phenomenon rooted in physics, specifically something called **Rayleigh scattering**. Here's the explanation: ...",
              "tool_calls": []
            },
            "logprobs": null,
            "finish_reason": "stop",
            "stop_reason": 106
          }
        ],
        "usage": {
          "prompt_tokens": 15,
          "total_tokens": 668,
          "completion_tokens": 653,
          "prompt_tokens_details": null
        },
        "prompt_logprobs": null
      }
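
    Because vLLM serves an OpenAI-compatible API, you can also list the served model or request a streamed response over the same port-forward. The following sketch assumes that the port-forward from step 1 is still active:

      # List the model that the server is serving.
      curl http://127.0.0.1:8000/v1/models

      # Request a streamed chat completion; tokens arrive incrementally as server-sent events.
      curl http://127.0.0.1:8000/v1/chat/completions \
          -X POST \
          -H "Content-Type: application/json" \
          -d '{
              "model": "google/gemma-3-27b-it",
              "messages": [{"role": "user", "content": "Write a haiku about Kubernetes."}],
              "stream": true
          }'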
     
    

If you want to observe your model's performance, you can use the vLLM dashboard integration in Cloud Monitoring. This dashboard helps you view critical performance metrics for your model, such as token throughput, network latency, and error rates. For more information, see vLLM in the Monitoring documentation.
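
For a quick local look at the raw metrics that back that dashboard, you can also scrape the server's Prometheus endpoint through the same port-forward. This is an optional sketch; vLLM exposes its metrics at the /metrics path of the API server:

    # Print a few vLLM Prometheus metrics (requires the port-forward from the previous section).
    curl -s http://127.0.0.1:8000/metrics | grep "^vllm" | head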

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete your project

Delete a Google Cloud project:

gcloud projects delete PROJECT_ID 

Delete your GKE cluster

To delete your GKE cluster, run the following command:

    gcloud container clusters delete $CLUSTER_NAME \
        --region=$REGION
 

Delete the YAML file and Kubernetes secret

To delete the vllm-3-27b-it.yaml file and the Kubernetes secret from the GKE cluster, run the following commands:

    kubectl delete -f vllm-3-27b-it.yaml
    kubectl delete secret hf-secret
