Serve a model with a single GPU in GKE


This tutorial shows you how to deploy and serve a large language model (LLM) using GPUs on Google Kubernetes Engine (GKE) with NVIDIA Triton Inference Server and TensorFlow Serving. It provides a foundation for understanding and exploring practical LLM deployment for inference in a managed Kubernetes environment. You deploy a pre-built container to a GKE cluster with a single L4 Tensor Core GPU, and you prepare the GKE infrastructure to do online inference.

This tutorial is intended for machine learning (ML) engineers, platform admins and operators, and data and AI specialists who want to host a pre-trained ML model on a GKE cluster. To learn more about common roles and example tasks referenced in Google Cloud content, see Common GKE user roles and tasks.

Before reading this page, ensure that you're familiar with Kubernetes concepts such as Deployments and Services, and with GPUs in GKE.

Objectives

  1. Create a GKE Autopilot or Standard cluster.
  2. Configure a Cloud Storage bucket where the pre-trained model lives.
  3. Deploy the online inference framework you select.
  4. Make a test request to the deployed service.

Costs

This tutorial uses the following billable components of Google Cloud:
  • GKE
  • Cloud Storage
  • L4 GPU accelerators
  • Egress traffic

Use the Pricing Calculator to generate a cost estimate based on your projected usage.

When you finish this tutorial, you can avoid continued billing by deleting the resources you created. For more information, see Clean up .

Before you begin

Set up your project

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, click Create project to begin creating a new Google Cloud project.

    Go to project selector

  3. Verify that billing is enabled for your Google Cloud project .

  4. Enable the GKE API.

    Enable the API


Set defaults for the Google Cloud CLI

  1. In the Google Cloud console, start a Cloud Shell instance: Open Cloud Shell

  2. Download the source code for this sample app:

      git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples
      cd kubernetes-engine-samples/ai-ml/gke-online-serving-single-gpu
  3. Set the default project and compute region for the Google Cloud CLI:

      gcloud config set project PROJECT_ID
      gcloud config set compute/region COMPUTE_REGION

    Replace the following values:

    • PROJECT_ID: your Google Cloud project ID.
    • COMPUTE_REGION: the Compute Engine region that supports the accelerator type you want to use, for example, us-central1 for the L4 GPU.
  4. In Cloud Shell, create the following environment variables:

      export PROJECT_ID=$(gcloud config get project)
      export REGION=$(gcloud config get compute/region)
      export K8S_SA_NAME=gpu-k8s-sa
      export GSBUCKET=$PROJECT_ID-gke-bucket
      export MODEL_NAME=mnist
      export CLUSTER_NAME=online-serving-cluster

Create a GKE cluster

You can serve models on a single GPU in a GKE Autopilot or Standard cluster. We recommend that you use an Autopilot cluster for a fully managed Kubernetes experience. With GKE Autopilot, resources scale automatically based on the model requests.

To choose the GKE mode of operation that's the best fit for your workloads, see Choose a GKE mode of operation .

Autopilot

Run the following command to create a GKE Autopilot cluster:

   
    gcloud container clusters create-auto ${CLUSTER_NAME} \
        --location=${REGION} \
        --project=${PROJECT_ID} \
        --release-channel=rapid

GKE creates an Autopilot cluster with CPU and GPU nodes as requested by the deployed workloads.
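Before you can run kubectl commands against the new cluster, fetch its credentials. The following is a minimal sketch, assuming you run it in the same Cloud Shell session where you set the environment variables earlier:

    # Configure kubectl to use the new Autopilot cluster.
    gcloud container clusters get-credentials ${CLUSTER_NAME} \
        --location=${REGION} \
        --project=${PROJECT_ID}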

Standard

  1. Run the following command to create a GKE Standard cluster:

       
      gcloud container clusters create ${CLUSTER_NAME} \
          --project=${PROJECT_ID} \
          --location=${REGION} \
          --workload-pool=${PROJECT_ID}.svc.id.goog \
          --addons GcsFuseCsiDriver \
          --release-channel=rapid \
          --num-nodes=1

    The cluster creation might take several minutes.

  2. Run the following command to create the node pool:

       
      gcloud container node-pools create gpupool \
          --accelerator type=nvidia-l4,count=1,gpu-driver-version=latest \
          --project=${PROJECT_ID} \
          --location=${REGION} \
          --node-locations=${REGION}-a \
          --cluster=${CLUSTER_NAME} \
          --machine-type=g2-standard-8 \
          --num-nodes=1

    GKE creates a single node pool containing one L4 GPU for each node.
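Optionally, confirm that the GPU nodes registered with the cluster. The following is a sketch; it assumes you fetch the cluster credentials first, and relies on the cloud.google.com/gke-accelerator label that GKE sets on GPU node pools:

    # Configure kubectl for the Standard cluster, then list the L4 GPU nodes.
    gcloud container clusters get-credentials ${CLUSTER_NAME} --location=${REGION}
    kubectl get nodes --selector=cloud.google.com/gke-accelerator=nvidia-l4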

Create a Cloud Storage bucket

Create a Cloud Storage bucket to store the pre-trained model that will be served.

In Cloud Shell, run the following:

    gcloud storage buckets create gs://$GSBUCKET
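
Optionally, verify that the bucket exists before you upload the model; a quick check, assuming the GSBUCKET variable from the earlier step is still set:

    # Print the bucket name to confirm that it was created and is reachable.
    gcloud storage buckets describe gs://$GSBUCKET --format="value(name)"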
 

Configure your cluster to access the bucket using Workload Identity Federation for GKE

To let your cluster access the Cloud Storage bucket, you do the following:

  1. Create a Google Cloud service account.
  2. Create a Kubernetes ServiceAccount in your cluster.
  3. Bind the Kubernetes ServiceAccount to the Google Cloud service account.

Create a Google Cloud service account

  1. In the Google Cloud console, go to the Create service account page:

    Go to Create service account

  2. In the Service account ID field, enter gke-ai-sa.

  3. Click Create and continue.

  4. In the Role list, select the Cloud Storage > Storage Insights Collector Service role.

  5. Click Add another role.

  6. In the Select a role list, select the Cloud Storage > Storage Object Admin role.

  7. Click Continue, and then click Done.
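
If you prefer the gcloud CLI to the console, the following sketch performs roughly the same steps. It assumes that the two roles above correspond to roles/storage.insightsCollectorService and roles/storage.objectAdmin:

    # Create the service account.
    gcloud iam service-accounts create gke-ai-sa --project=${PROJECT_ID}

    # Grant the same two Cloud Storage roles that the console steps grant.
    gcloud projects add-iam-policy-binding ${PROJECT_ID} \
        --member="serviceAccount:gke-ai-sa@${PROJECT_ID}.iam.gserviceaccount.com" \
        --role="roles/storage.insightsCollectorService"
    gcloud projects add-iam-policy-binding ${PROJECT_ID} \
        --member="serviceAccount:gke-ai-sa@${PROJECT_ID}.iam.gserviceaccount.com" \
        --role="roles/storage.objectAdmin"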

Create a Kubernetes ServiceAccount in your cluster

In Cloud Shell, do the following:

  1. Create a Kubernetes namespace:

      kubectl create namespace gke-ai-namespace
    
  2. Create a Kubernetes ServiceAccount in the namespace:

      kubectl create serviceaccount gpu-k8s-sa --namespace=gke-ai-namespace
    

Bind the Kubernetes ServiceAccount to the Google Cloud service account

In Cloud Shell, run the following commands:

  1. Add an IAM binding to the Google Cloud service account:

      gcloud iam service-accounts add-iam-policy-binding gke-ai-sa@PROJECT_ID.iam.gserviceaccount.com \
          --role roles/iam.workloadIdentityUser \
          --member "serviceAccount:PROJECT_ID.svc.id.goog[gke-ai-namespace/gpu-k8s-sa]"
     
    

    The --member flag provides the full identity of the Kubernetes ServiceAccount in Google Cloud.

  2. Annotate the Kubernetes ServiceAccount:

      kubectl annotate serviceaccount gpu-k8s-sa \
          --namespace gke-ai-namespace \
          iam.gke.io/gcp-service-account=gke-ai-sa@PROJECT_ID.iam.gserviceaccount.com
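
Optionally, confirm that the annotation is in place before you deploy workloads:

    # The output should show the iam.gke.io/gcp-service-account annotation
    # pointing at gke-ai-sa@PROJECT_ID.iam.gserviceaccount.com.
    kubectl describe serviceaccount gpu-k8s-sa --namespace gke-ai-namespace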
    

Deploy the online inference server

Each online inference framework expects to find the pre-trained ML model in a specific format. The following sections show how to deploy the inference server, depending on the framework that you want to use:

Triton

  1. In Cloud Shell, copy the pre-trained ML model into the Cloud Storage bucket:

      gcloud storage cp src/triton-model-repository gs://$GSBUCKET --recursive
    
  2. Deploy the framework by using a Deployment. A Deployment is a Kubernetes API object that lets you run multiple replicas of Pods that are distributed among the nodes in a cluster:

      envsubst < src/gke-config/deployment-triton.yaml | kubectl --namespace=gke-ai-namespace apply -f -
    
  3. Validate that GKE deployed the framework:

      kubectl get deployments --namespace=gke-ai-namespace
    

    When the framework is ready, the output is similar to the following:

     NAME                 READY   UP-TO-DATE   AVAILABLE   AGE
    triton-deployment    1/1     1            1           5m29s 
    
  4. Deploy the Services to access the Deployment:

      kubectl apply --namespace=gke-ai-namespace -f src/gke-config/service-triton.yaml
    
  5. Check that an external IP address is assigned:

      kubectl get services --namespace=gke-ai-namespace
    

    The output is similar to the following:

     NAME            TYPE           CLUSTER-IP       EXTERNAL-IP     PORT(S)                                        AGE
    kubernetes      ClusterIP      34.118.224.1     <none>          443/TCP                                        60m
    triton-server   LoadBalancer   34.118.227.176   35.239.54.228   8000:30866/TCP,8001:31035/TCP,8002:30516/TCP   5m14s 
    

    Take note of the IP address for the triton-server in the EXTERNAL-IP column.

  6. Check that the service and the deployment are working correctly:

      curl -v EXTERNAL_IP:8000/v2/health/ready

    Replace EXTERNAL_IP with your external IP address.
    

    The output is similar to the following:

     ...
    < HTTP/1.1 200 OK
    < Content-Length: 0
    < Content-Type: text/plain
    ... 
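
Beyond the readiness check, you can also request the model's metadata over Triton's HTTP/REST API, which follows the KServe v2 inference protocol; a sketch, assuming the same external IP address:

    # A 200 response listing the model's inputs and outputs indicates that
    # Triton loaded the mnist model from the Cloud Storage bucket.
    curl EXTERNAL_IP:8000/v2/models/mnist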
    

TF Serving

  1. In Cloud Shell, copy the pre-trained ML model into the Cloud Storage bucket:

      gcloud storage cp src/tfserve-model-repository gs://$GSBUCKET --recursive
    
  2. Deploy the framework by using a Deployment. A Deployment is a Kubernetes API object that lets you run multiple replicas of Pods that are distributed among the nodes in a cluster:

      envsubst < src/gke-config/deployment-tfserve.yaml | kubectl --namespace=gke-ai-namespace apply -f -
    
  3. Validate that GKE deployed the framework:

      kubectl get deployments --namespace=gke-ai-namespace
    

    When the framework is ready, the output is similar to the following:

     NAME                 READY   UP-TO-DATE   AVAILABLE   AGE
    tfserve-deployment   1/1     1            1           5m29s 
    
  4. Deploy the Services to access the Deployment:

      kubectl apply --namespace=gke-ai-namespace -f src/gke-config/service-tfserve.yaml
    
  5. Check that an external IP address is assigned:

      kubectl get services --namespace=gke-ai-namespace
    

    The output is similar to the following:

     NAME            TYPE           CLUSTER-IP       EXTERNAL-IP     PORT(S)                                        AGE
    kubernetes      ClusterIP      34.118.224.1     <none>          443/TCP                                        60m
    tfserve-server  LoadBalancer   34.118.227.176   35.239.54.228   8500:30003/TCP,8000:32194/TCP                  5m14s 
    

    Take note of the IP address for the tfserve-server in the EXTERNAL-IP column.

  6. Check that the Service and the Deployment are working correctly:

      curl -v EXTERNAL_IP:8000/v1/models/mnist
    

    Replace the EXTERNAL_IP with your external IP address.

    The output is similar to the following:

     ...
    < HTTP/1.1 200 OK
    < Content-Type: application/json
    < Date: Thu, 12 Oct 2023 19:01:19 GMT
    < Content-Length: 154
    <
    {
      "model_version_status": [
            {
            "version": "1",
            "state": "AVAILABLE",
            "status": {
              "error_code": "OK",
              "error_message": ""
            }
          }
        ]
    } 
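
You can also request the model's signature metadata through the TensorFlow Serving REST API; a sketch, assuming the same external IP address:

    # Returns the SignatureDef (input and output tensors) for the mnist model.
    curl EXTERNAL_IP:8000/v1/models/mnist/metadata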
    

Serve the model

Triton

  1. Create a Python virtual environment in Cloud Shell.

      python -m venv ./mnist_client
      source ./mnist_client/bin/activate
    
  2. Install the required Python packages.

      pip install -r src/client/triton-requirements.txt
    
  3. Test the Triton inference server by loading an image:

      cd src/client
      python triton_mnist_client.py -i EXTERNAL_IP -m mnist -p ./images/TEST_IMAGE.png
    

    Replace the following:

    • EXTERNAL_IP : Your external IP address.
    • TEST_IMAGE : The name of the file that corresponds to the image you want to test. You can use the images stored in src/client/images .

    Depending on which image you use, the output is similar to the following:

     Calling Triton HTTP Service      ->      Prediction result: 7 
    

TF Serving

  1. Create a Python virtual environment in Cloud Shell.

      python -m venv ./mnist_client
      source ./mnist_client/bin/activate
    
  2. Install the required Python packages.

      pip install -r src/client/tfserve-requirements.txt
    
  3. Test TensorFlow Serving with a few images.

      cd src/client
      python tfserve_mnist_client.py -i EXTERNAL_IP -m mnist -p ./images/TEST_IMAGE.png
    

Replace the following:

  • EXTERNAL_IP : Your external IP address.
  • TEST_IMAGE : A value from 0 to 9 . You can use the images stored in src/client/images .

Depending on which image you use, the output is similar to the following:

 Calling TensorFlow Serve HTTP Service    ->      Prediction result: 5 

Observe model performance

Triton

To observe the model performance, you can use the Triton dashboard integration in Cloud Monitoring . With this dashboard, you can view critical performance metrics like token throughput, request latency, and error rates.

To use the Triton dashboard, you must enable Google Cloud Managed Service for Prometheus , which collects the metrics from Triton, in your GKE cluster. Triton exposes metrics in Prometheus format by default; you do not need to install an additional exporter.

You can then view the metrics by using the Triton dashboard. For information about using Google Cloud Managed Service for Prometheus to collect metrics from your model, see the Triton observability guidance in the Cloud Monitoring documentation.
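
Managed collection can be enabled when you create the cluster or afterwards; on Autopilot clusters it is enabled by default. The following is a sketch of checking and enabling it with the gcloud CLI on a Standard cluster, assuming your environment variables are still set:

    # Check whether managed collection for Prometheus is already enabled.
    gcloud container clusters describe ${CLUSTER_NAME} --location=${REGION} \
        --format="value(monitoringConfig.managedPrometheusConfig.enabled)"

    # Enable it if the previous command does not print True.
    gcloud container clusters update ${CLUSTER_NAME} --location=${REGION} \
        --enable-managed-prometheus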

TF Serving

To observe the model performance, you can use the TF Serving dashboard integration in Cloud Monitoring . With this dashboard, you can view critical performance metrics like token throughput, request latency, and error rates.

To use the TF Serving dashboard, you must enable Google Cloud Managed Service for Prometheus , which collects the metrics from TF Serving, in your GKE cluster.

You can then view the metrics by using the TF Serving dashboard. For information about using Google Cloud Managed Service for Prometheus to collect metrics from your model, see the TF Serving observability guidance in the Cloud Monitoring documentation.

Clean up

To avoid incurring charges to your Google Cloud account for the resources that you created in this guide, do one of the following:

  • Keep the GKE cluster: Delete the Kubernetes resources in the cluster and the Google Cloud resources
  • Keep the Google Cloud project: Delete the GKE cluster and the Google Cloud resources
  • Delete the project

Delete the Kubernetes resources in the cluster and the Google Cloud resources

  1. Delete the Kubernetes namespace and the workloads that you deployed:

Triton

    kubectl -n gke-ai-namespace delete -f src/gke-config/service-triton.yaml
    kubectl -n gke-ai-namespace delete -f src/gke-config/deployment-triton.yaml
    kubectl delete namespace gke-ai-namespace

TF Serving

    kubectl -n gke-ai-namespace delete -f src/gke-config/service-tfserve.yaml
    kubectl -n gke-ai-namespace delete -f src/gke-config/deployment-tfserve.yaml
    kubectl delete namespace gke-ai-namespace
  2. Delete the Cloud Storage bucket:

    1. Go to the Buckets page:

      Go to Buckets

    2. Select the checkbox for PROJECT_ID-gke-bucket.

    3. Click Delete.

    4. To confirm deletion, type DELETE and click Delete.

  3. Delete the Google Cloud service account:

    1. Go to the Service accounts page:

      Go to Service accounts

    2. Select your project.

    3. Select the checkbox for gke-ai-sa@PROJECT_ID.iam.gserviceaccount.com.

    4. Click Delete.

    5. To confirm deletion, click Delete.
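
If you prefer the gcloud CLI for the bucket and the service account, the following sketch deletes the same resources; it assumes the environment variables from earlier are still set:

    # Delete the bucket and everything in it.
    gcloud storage rm --recursive gs://$GSBUCKET

    # Delete the Google Cloud service account.
    gcloud iam service-accounts delete gke-ai-sa@${PROJECT_ID}.iam.gserviceaccount.com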

Delete the GKE cluster and the Google Cloud resources

  1. Delete the GKE cluster:

    1. Go to the Clusters page:

      Go to Clusters

    2. Select the checkbox for online-serving-cluster .

    3. Click Delete.

    4. To confirm deletion, type online-serving-cluster and click Delete.

  2. Delete the Cloud Storage bucket:

    1. Go to the Buckets page:

      Go to Buckets

    2. Select the checkbox for PROJECT_ID-gke-bucket.

    3. Click Delete.

    4. To confirm deletion, type DELETE and click Delete.

  3. Delete the Google Cloud service account:

    1. Go to the Service accounts page:

      Go to Service accounts

    2. Select your project.

    3. Select the checkbox for gke-ai-sa@PROJECT_ID.iam.gserviceaccount.com.

    4. Click Delete.

    5. To confirm deletion, click Delete.
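
Equivalently, you can delete the cluster from the gcloud CLI and then remove the bucket and service account with the commands shown at the end of the previous section; a sketch, assuming your environment variables are still set:

    # Delete the GKE cluster and all of its node pools.
    gcloud container clusters delete ${CLUSTER_NAME} --location=${REGION}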

Delete the project

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.
