Use vLLM on GKE to run inference with Llama 4

This tutorial shows you how to deploy Llama 4 Scout (17Bx16E), a 17B large language model (LLM), and serve it by using the vLLM framework. You deploy this model on a single A4 virtual machine (VM) instance on Google Kubernetes Engine (GKE).

This tutorial is intended for machine learning (ML) engineers, platform administrators and operators, and for data and AI specialists who are interested in using Kubernetes container orchestration capabilities to handle inference workloads.

Objectives

Access Llama 4 by using Hugging Face.
Prepare your environment.
Create a GKE cluster in Autopilot mode.
Create a Kubernetes secret for Hugging Face credentials.
Deploy a vLLM container to your GKE cluster.
Interact with Llama 4 by using curl.
Clean up.

Costs

This tutorial uses billable components of Google Cloud.

To generate a cost estimate based on your projected usage, use the Pricing Calculator.

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

Install the Google Cloud CLI.

If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.

To initialize the gcloud CLI, run the following command:

gcloud init

Create or select a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role. You can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Create a Google Cloud project:
```
gcloud projects create PROJECT_ID 
```
Replace PROJECT_ID with a name for the Google Cloud project you are creating.
Select the Google Cloud project that you created:
```
gcloud config set project PROJECT_ID 
```
Replace PROJECT_ID with your Google Cloud project name.

Verify that billing is enabled for your Google Cloud project.

Enable the required API:

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

gcloud services enable container.googleapis.com

Grant roles to your user account. Run the following command once for each of the following IAM roles: roles/container.admin
```
gcloud projects add-iam-policy-binding PROJECT_ID --member="user:USER_IDENTIFIER" --role=ROLE
```
Replace the following:
- PROJECT_ID: Your project ID.
- USER_IDENTIFIER: The identifier for your user account. For example, myemail@example.com.
- ROLE: The IAM role that you grant to your user account.
Sign in to or create a Hugging Face account.

Access Llama 4 by using Hugging Face

To use Hugging Face to access Llama 4, do the following:

Sign the consent agreement to use Llama 4.
Create a Hugging Face read access token.
Copy and save the read access token value. You use it later in this tutorial.

Prepare your environment

To prepare your environment, set the following variables:

gcloud config set project PROJECT_ID
gcloud config set billing/quota_project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
export RESERVATION_URL=RESERVATION_URL
export REGION=REGION
export CLUSTER_NAME=CLUSTER_NAME
export HUGGING_FACE_TOKEN=HUGGING_FACE_TOKEN
export NETWORK=NETWORK_NAME
export SUBNETWORK=SUBNETWORK_NAME

Replace the following:

PROJECT_ID: the ID of the Google Cloud project where you want to create the GKE cluster.
RESERVATION_URL: the URL of the reservation that you want to use to create your GKE cluster. Based on the project in which the reservation exists, specify one of the following values:
- The reservation exists in your project: RESERVATION_NAME
- The reservation exists in a different project, and your project can use the reservation: projects/RESERVATION_PROJECT_ID/reservations/RESERVATION_NAME
REGION: the region where you want to create your GKE cluster. You can only create the cluster in the region where your reservation exists.
CLUSTER_NAME: the name of the GKE cluster to create.
HUGGING_FACE_TOKEN: the Hugging Face access token that you created in the previous section.
NETWORK_NAME: the network that the GKE cluster uses. Specify one of the following values:
- If you created a custom network, then specify the name of your network.
- Otherwise, specify default.
SUBNETWORK_NAME: the subnetwork that the GKE cluster uses. Specify one of the following values:
- If you created a custom subnetwork, then specify the name of your subnetwork. You can only specify a subnetwork that exists in the same region as the reservation.
- Otherwise, specify default.
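
Before you continue, it can help to confirm that every variable is set and that the reservation value has the expected shape. The following Python sketch is illustrative only: the helper names `reservation_value` and `missing_variables` are hypothetical, and the variable names mirror the exports in this section.

```python
import os

# Variable names exported in this section.
REQUIRED_VARS = [
    "PROJECT_ID",
    "RESERVATION_URL",
    "REGION",
    "CLUSTER_NAME",
    "HUGGING_FACE_TOKEN",
    "NETWORK",
    "SUBNETWORK",
]


def reservation_value(reservation_name, reservation_project=None):
    """Return the RESERVATION_URL form described in this section.

    Same project: just the reservation name.
    Different project: projects/<project>/reservations/<name>.
    """
    if reservation_project is None:
        return reservation_name
    return f"projects/{reservation_project}/reservations/{reservation_name}"


def missing_variables(env=None):
    """Return the required variables that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_variables()` with no arguments inspects the current process environment and returns an empty list when everything is set.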

Create and configure Google Cloud resources

Follow the instructions in this section to create the required resources.

Create a GKE cluster in Autopilot mode

To create a GKE cluster in Autopilot mode, run the following command:

 gcloud container clusters create-auto $CLUSTER_NAME \
    --project=$PROJECT_ID \
    --region=$REGION \
    --release-channel=rapid \
    --network=$NETWORK \
    --subnetwork=$SUBNETWORK

The creation of the GKE cluster might take some time to complete. To verify that Google Cloud has finished creating your cluster, go to the Kubernetes clusters page in the Google Cloud console.

Create a Kubernetes secret to store your Hugging Face credentials

To create a Kubernetes secret to store your Hugging Face credentials, do the following:

Configure kubectl to communicate with your GKE cluster:

 gcloud container clusters get-credentials $CLUSTER_NAME \
    --location=$REGION

Create a Kubernetes secret that contains the Hugging Face read access token that you created in an earlier step:

 kubectl create secret generic hf-secret \
    --from-literal=hf_api_token=${HUGGING_FACE_TOKEN} \
    --dry-run=client -o yaml | kubectl apply -f -
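
The `--dry-run=client -o yaml` portion of this command renders the Secret manifest locally, and `kubectl apply` then creates or updates it, so the command can be rerun safely. As a rough illustration of what that manifest holds, the following Python sketch builds a similar structure. It assumes the standard Kubernetes behavior of storing Secret values base64-encoded under `data`; the function names are hypothetical, and the token shown is a placeholder.

```python
import base64
import json


def secret_manifest(token, name="hf-secret", key="hf_api_token"):
    """Build a Secret manifest similar to the one kubectl renders for --from-literal."""
    encoded = base64.b64encode(token.encode("utf-8")).decode("ascii")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "data": {key: encoded},
    }


def decode_secret_value(manifest, key="hf_api_token"):
    """Recover the original token, showing that base64 is an encoding only."""
    return base64.b64decode(manifest["data"][key]).decode("utf-8")


if __name__ == "__main__":
    # "hf_example_token" is a placeholder, not a real credential.
    print(json.dumps(secret_manifest("hf_example_token"), indent=2))
```

Because base64 is reversible, anyone who can read the Secret can recover the token, so access to the Secret should be restricted accordingly.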

Deploy a vLLM container to your GKE cluster

To deploy the vLLM container to serve the Llama-4-Scout-17B-16E-Instruct model, do the following:

Create a vllm-l4-17b.yaml file with your chosen vLLM deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama4-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama4-server
  template:
    metadata:
      labels:
        app: llama4-server
        ai.gke.io/model: llama-4-scout-17b
        ai.gke.io/inference-server: vllm
        examples.ai.gke.io/source: user-guide
    spec:
      containers:
      - name: inference-server
        image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250722_0916_RC01
        resources:
          requests:
            cpu: "10"
            memory: "128Gi"
            ephemeral-storage: "240Gi"
            nvidia.com/gpu: "8"
          limits:
            cpu: "10"
            memory: "128Gi"
            ephemeral-storage: "240Gi"
            nvidia.com/gpu: "8"
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - --model=$(MODEL_ID)
        - --tensor-parallel-size=8
        - --host=0.0.0.0
        - --port=8000
        - --max-model-len=4096
        - --max-num-seqs=4
        env:
        - name: MODEL_ID
          value: meta-llama/Llama-4-Scout-17B-16E-Instruct
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 1800
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 1800
          periodSeconds: 5
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-b200
        cloud.google.com/reservation-name: RESERVATION_URL
        cloud.google.com/reservation-affinity: "specific"
        cloud.google.com/gke-gpu-driver-version: latest
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: llama4-server
  type: ClusterIP
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
---
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: vllm-llama4-monitoring
spec:
  selector:
    matchLabels:
      app: llama4-server
  endpoints:
  - port: 8000
    path: /metrics
    interval: 30s
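
In this manifest, the container requests eight GPUs and passes `--tensor-parallel-size=8` to vLLM, so the model is sharded across every GPU that the Pod receives. If you adapt the manifest, those two values need to stay in step, and the GPU request and limit should match each other. The following Python sketch shows one way to check that. It works on a hand-written dictionary that mirrors a few container fields from the manifest, and the function name `check_gpu_settings` is hypothetical.

```python
def check_gpu_settings(container):
    """Return a list of problems found in a container spec dictionary."""
    problems = []
    resources = container.get("resources", {})
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})

    gpu_requests = requests.get("nvidia.com/gpu")
    gpu_limits = limits.get("nvidia.com/gpu")
    if gpu_requests != gpu_limits:
        problems.append("GPU requests and limits differ")

    # Find the tensor parallel size among the vLLM arguments.
    tp_size = None
    for arg in container.get("args", []):
        if arg.startswith("--tensor-parallel-size="):
            tp_size = arg.split("=", 1)[1]

    if tp_size is None:
        problems.append("--tensor-parallel-size is not set")
    elif gpu_limits is None or int(tp_size) != int(gpu_limits):
        problems.append("--tensor-parallel-size does not match the GPU count")
    return problems


# Fields copied from the Deployment in this tutorial.
container = {
    "resources": {
        "requests": {"nvidia.com/gpu": "8"},
        "limits": {"nvidia.com/gpu": "8"},
    },
    "args": ["--model=$(MODEL_ID)", "--tensor-parallel-size=8"],
}
print(check_gpu_settings(container))  # prints []
```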

Apply the vllm-l4-17b.yaml file to your GKE cluster:
```
 kubectl apply -f vllm-l4-17b.yaml 
```
During the deployment process, the container must download the Llama-4-Scout-17B-16E-Instruct model from Hugging Face. For this reason, deployment of the container might take up to 30 minutes to complete.
To see the completion status, run the following command:
```
 kubectl wait \
          --for=condition=Available \
          --timeout=1800s deployment/vllm-llama4-deployment 
```
The --timeout=1800s flag allows the command to monitor the deployment for up to 30 minutes.
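
Conceptually, a wait with a timeout checks a condition repeatedly until it holds or a deadline passes. The Python sketch below illustrates that general pattern only. It is not how `kubectl wait` is implemented, and `wait_until` and its parameters are hypothetical.

```python
import time


def wait_until(check, timeout_s, interval_s=5.0, clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns True or timeout_s seconds elapse.

    Returns True if the condition was met, False if the deadline passed.
    clock and sleep are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

For example, `wait_until(is_available, timeout_s=1800)` mirrors the 30-minute limit used in this step, where `is_available` is a function that you supply.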

Interact with Llama 4 by using curl

To verify the Llama 4 Scout model that you deployed, do the following:

Set up port forwarding to Llama 4 Scout:

 kubectl port-forward service/llm-service 8000:8000

Open a new terminal window. You can then chat with your model by using curl:

 curl http://127.0.0.1:8000/v1/chat/completions \
     -X POST \
     -H "Content-Type: application/json" \
     -d '{
       "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
       "messages": [
         {
           "role": "user",
           "content": "Describe a sailboat in one short sentence?"
         }
       ]
     }'
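
If you prefer to call the endpoint from code, the same request can be sent with Python's standard library. This is a minimal sketch that assumes the port forwarding from the previous step is still running; the function names `build_chat_request` and `chat` are hypothetical.

```python
import json
import urllib.request

URL = "http://127.0.0.1:8000/v1/chat/completions"
MODEL = "meta-llama/Llama-4-Scout-17B-16E-Instruct"


def build_chat_request(prompt, url=URL, model=MODEL):
    """Build the same POST request that the curl command sends."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, timeout_s=60):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout_s) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `chat("Describe a sailboat in one short sentence?")` returns the response as a Python dictionary.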

The output that you see is similar to the following:

{
  "id": "chatcmpl-ec0ad6310c494a889b17600881c06e3d",
  "object": "chat.completion",
  "created": 1754073279,
  "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "A sailboat is a type of watercraft that uses the wind for propulsion, typically featuring a hull, mast, and one or more sails.",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 19,
    "total_tokens": 49,
    "completion_tokens": 30,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "kv_transfer_params": null
}
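
In this output, the reply text is under `choices[0].message.content`, and the `usage` object reports token counts, where `total_tokens` is the sum of the prompt and completion tokens (19 + 30 = 49). The following Python sketch pulls those fields out of a trimmed copy of the sample output; the function name `summarize` is hypothetical.

```python
import json

# A trimmed copy of the sample output from this tutorial.
SAMPLE = """
{
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "A sailboat is a type of watercraft that uses the wind for propulsion, typically featuring a hull, mast, and one or more sails."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 19, "total_tokens": 49, "completion_tokens": 30}
}
"""


def summarize(response):
    """Return the reply text, finish reason, and token counts from a response."""
    choice = response["choices"][0]
    usage = response["usage"]
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "prompt_tokens": usage["prompt_tokens"],
        "completion_tokens": usage["completion_tokens"],
        "total_tokens": usage["total_tokens"],
    }


summary = summarize(json.loads(SAMPLE))
print(summary["finish_reason"], summary["total_tokens"])  # prints: stop 49
```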

Observe model performance

To observe the performance of your model, you can use the vLLM dashboard integration in Cloud Monitoring. You can use this dashboard to view critical performance metrics like token throughput, request latency, and error rates.

For information about using Google Cloud Managed Service for Prometheus to collect metrics from your model, see the vLLM observability guidance in the Cloud Monitoring documentation.
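
The PodMonitoring resource in the manifest scrapes the container's `/metrics` path on port 8000 every 30 seconds. That path serves the Prometheus text format, so you can also inspect it by hand while port forwarding is active. The sketch below parses a few lines of that format. The sample text is illustrative rather than captured output, the metric names are assumptions about what your vLLM version exposes, and the parser ignores details such as optional timestamps and escaped or comma-containing label values.

```python
def parse_metrics(text):
    """Parse simple Prometheus text-format lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # Skip blank lines and HELP/TYPE comments.
        series, _, value = line.rpartition(" ")
        labels = {}
        name = series
        if "{" in series:
            name, _, rest = series.partition("{")
            for pair in rest.rstrip("}").split(","):
                if pair:
                    key, _, val = pair.partition("=")
                    labels[key.strip()] = val.strip().strip('"')
        samples.append((name, labels, float(value)))
    return samples


# Illustrative sample only; metric names and values are assumptions.
SAMPLE = """
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="meta-llama/Llama-4-Scout-17B-16E-Instruct"} 1.0
vllm:generation_tokens_total{model_name="meta-llama/Llama-4-Scout-17B-16E-Instruct"} 30.0
"""

for name, labels, value in parse_metrics(SAMPLE):
    print(name, value)
```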

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete your project

Delete a Google Cloud project:

gcloud projects delete PROJECT_ID

Delete the resources

To delete the Deployment, Service, and PodMonitoring resources defined in the vllm-l4-17b.yaml file, and the Kubernetes secret, from the GKE cluster, run the following commands:
```
 kubectl delete -f vllm-l4-17b.yaml
kubectl delete secret hf-secret 
```

To delete your GKE cluster, run the following command:

 gcloud container clusters delete $CLUSTER_NAME \
        --region=$REGION

What's next

Learn how to manage AI-optimized GKE clusters