Serve open source models using TPUs on GKE with Optimum TPU


This tutorial shows you how to serve open source large language models (LLMs) using Tensor Processing Units (TPUs) on Google Kubernetes Engine (GKE) with the Optimum TPU serving framework from Hugging Face. In this tutorial, you download open source models from Hugging Face and deploy them on a GKE Standard cluster using a container that runs Optimum TPU.

This guide provides a starting point if you need the granular control, scalability, resilience, portability, and cost-effectiveness of managed Kubernetes when deploying and serving your AI/ML workloads.

This tutorial is intended for Generative AI customers in the Hugging Face ecosystem, new or existing users of GKE, ML Engineers, MLOps (DevOps) engineers, or platform administrators who are interested in using Kubernetes container orchestration capabilities for serving LLMs.

As a reminder, you have multiple options for LLM inference on Google Cloud—which span offerings like Vertex AI, GKE, and Google Compute Engine—where you can incorporate serving libraries like JetStream, vLLM, and other partner offerings. For example, you can use JetStream to get the latest optimizations from the project. If you prefer Hugging Face options, you can use Optimum TPU.

Optimum TPU supports the following features:

  • Continuous batching
  • Token streaming
  • Greedy search and multinomial sampling using transformers (see the example request after this list).
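
For example, after you deploy the server later in this tutorial and forward port 8080 to it, you can switch between greedy search and multinomial sampling per request. The following calls are an illustrative sketch using the standard Text Generation Inference (TGI) generate parameters (do_sample, temperature, and top_p); the prompt and values are placeholders.

 # Greedy search (the default): no sampling parameters are set.
 curl 127.0.0.1:8080/generate -X POST \
     -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":40}}' \
     -H 'Content-Type: application/json'

 # Multinomial sampling: enable do_sample and set illustrative temperature and top_p values.
 curl 127.0.0.1:8080/generate -X POST \
     -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":40,"do_sample":true,"temperature":0.7,"top_p":0.9}}' \
     -H 'Content-Type: application/json'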

Objectives

  1. Prepare a GKE Standard cluster with the recommended TPU topology based on the model characteristics.
  2. Deploy Optimum TPU on GKE.
  3. Use Optimum TPU to serve the supported models through curl.

Before you begin

  • Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  • In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  • Verify that billing is enabled for your Google Cloud project.

  • Enable the required API.

    Enable the API


  • Make sure that you have the following role or roles on the project: roles/container.admin, roles/iam.serviceAccountAdmin, roles/artifactregistry.admin

    Check for the roles

    1. In the Google Cloud console, go to the IAM page.

      Go to IAM
    2. Select the project.
    3. In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.

    4. For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.

    Grant the roles

    1. In the Google Cloud console, go to the IAM page.

      Go to IAM
    2. Select the project.
    3. Click Grant access.
    4. In the New principals field, enter your user identifier. This is typically the email address for a Google Account.

    5. In the Select a role list, select a role.
    6. To grant additional roles, click Add another role and add each additional role.
    7. Click Save.

Prepare the environment

In this tutorial, you use Cloud Shell to manage resources hosted on Google Cloud. Cloud Shell comes preinstalled with the software you need for this tutorial, including kubectl and the gcloud CLI.

To set up your environment with Cloud Shell, follow these steps:

  1. In the Google Cloud console, launch a Cloud Shell session by clicking Activate Cloud Shell. This launches a session in the bottom pane of the Google Cloud console.

  2. Set the default environment variables:

     gcloud config set project PROJECT_ID
     export PROJECT_ID=$(gcloud config get project)
     export CLUSTER_NAME=CLUSTER_NAME
     export REGION=REGION_NAME
     export ZONE=ZONE
     export HF_TOKEN=HF_TOKEN

    Replace the following values:

    • PROJECT_ID: your Google Cloud project ID.
    • CLUSTER_NAME: the name of your GKE cluster.
    • REGION_NAME: the region where your GKE cluster, Cloud Storage bucket, and TPU nodes are located. The region contains zones where TPU v5e machine types are available (for example, us-west1, us-west4, us-central1, us-east1, us-east5, or europe-west4).
    • (Standard cluster only) ZONE: the zone where the TPU resources are available (for example, us-west4-a). For Autopilot clusters, you don't need to specify the zone, only the region.
    • HF_TOKEN: your Hugging Face token.
  3. Clone the Optimum TPU repository:

     git clone https://github.com/huggingface/optimum-tpu.git

Get access to the model

You can use the Gemma 2B or Llama3 8B models. This tutorial focuses on these two models, but Optimum TPU supports more models.

Gemma 2B

To get access to the Gemma models for deployment to GKE, you must first sign the license consent agreement then generate a Hugging Face access token.

You must sign the consent agreement to use Gemma. Follow these instructions:

  1. Access the model consent page.
  2. Verify consent using your Hugging Face account.
  3. Accept the model terms.

Generate an access token

Generate a new Hugging Face token if you don't already have one:

  1. Click Your Profile > Settings > Access Tokens.
  2. Click New Token.
  3. Specify a Name of your choice and a Role of at least Read.
  4. Click Generate a token.
  5. Copy the generated token to your clipboard.
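
Optionally, check that the token works before continuing. This sketch calls the Hugging Face Hub whoami-v2 API, which at the time of writing returns your account details for a valid token; set HF_TOKEN to the token you just copied:

 # A valid token returns a JSON document that includes your Hugging Face username.
 curl -s -H "Authorization: Bearer ${HF_TOKEN}" https://huggingface.co/api/whoami-v2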

Llama3 8B

You must sign the consent agreement to use Llama3 8B in the Hugging Face repository.

Generate an access token

Generate a new Hugging Face token if you don't already have one:

  1. Click Your Profile > Settings > Access Tokens.
  2. Select New Token.
  3. Specify a Name of your choice and a Role of at least Read.
  4. Select Generate a token.
  5. Copy the generated token to your clipboard.

Create a GKE cluster

Create a GKE Standard cluster with 1 CPU node:

 gcloud container clusters create CLUSTER_NAME \
     --project=PROJECT_ID \
     --num-nodes=1 \
     --location=ZONE

Create a TPU node pool

Create a v5e TPU node pool with 1 node and 8 chips:

 gcloud container node-pools create tpunodepool \
     --location=ZONE \
     --num-nodes=1 \
     --machine-type=ct5lp-hightpu-8t \
     --cluster=CLUSTER_NAME

If TPU resources are available, GKE provisions the node pool. If TPU resources are temporarily unavailable, the output shows a GCE_STOCKOUT error message. To troubleshoot TPU stockout errors, refer to Insufficient TPU resources to satisfy the TPU request.

Configure kubectl to communicate with your cluster:

 gcloud container clusters get-credentials ${CLUSTER_NAME} --location=${ZONE}
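
Optionally, verify that the TPU node pool has registered with the cluster. The node label in this selector is the same one used by the nodeSelector in the Deployment manifests later in this tutorial:

 # You should see one node from tpunodepool with the TPU v5e accelerator label.
 kubectl get nodes -l cloud.google.com/gke-tpu-accelerator=tpu-v5-lite-podslice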

Build the container

Run the make command to build the image:

 cd optimum-tpu && make tpu-tgi

Push the image to Artifact Registry:

 gcloud artifacts repositories create optimum-tpu --repository-format=docker --location=REGION_NAME && \
 gcloud auth configure-docker REGION_NAME-docker.pkg.dev && \
 docker image tag huggingface/optimum-tpu REGION_NAME-docker.pkg.dev/PROJECT_ID/optimum-tpu/tgi-tpu:latest && \
 docker push REGION_NAME-docker.pkg.dev/PROJECT_ID/optimum-tpu/tgi-tpu:latest
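
Optionally, confirm that the push succeeded by listing the contents of the repository. Replace REGION_NAME and PROJECT_ID as in the previous command:

 # The tgi-tpu image should appear in the optimum-tpu repository.
 gcloud artifacts docker images list REGION_NAME-docker.pkg.dev/PROJECT_ID/optimum-tpu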

Create a Kubernetes Secret for Hugging Face credentials

Create a Kubernetes Secret that contains the Hugging Face token:

 kubectl create secret generic hf-secret \
     --from-literal=hf_api_token=${HF_TOKEN} \
     --dry-run=client -o yaml | kubectl apply -f -
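
Optionally, confirm that the Secret exists. kubectl describe shows the key names without printing the token value:

 # The Secret should contain a single data key named hf_api_token.
 kubectl describe secret hf-secret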

Deploy Optimum TPU

To deploy Optimum TPU, this tutorial uses a Kubernetes Deployment. A Deployment is a Kubernetes API object that lets you run multiple replicas of Pods that are distributed among the nodes in a cluster.

Gemma 2B

  1. Save the following Deployment manifest as optimum-tpu-gemma-2b-2x4.yaml:

      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: tgi-tpu
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: tgi-tpu
        template:
          metadata:
            labels:
              app: tgi-tpu
          spec:
            nodeSelector:
              cloud.google.com/gke-tpu-topology: 2x4
              cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
            containers:
            - name: tgi-tpu
              image: REGION_NAME-docker.pkg.dev/PROJECT_ID/optimum-tpu/tgi-tpu:latest
              args:
              - --model-id=google/gemma-2b
              - --max-concurrent-requests=4
              - --max-input-length=8191
              - --max-total-tokens=8192
              - --max-batch-prefill-tokens=32768
              - --max-batch-size=16
              securityContext:
                privileged: true
              env:
              - name: HF_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
              ports:
              - containerPort: 80
              resources:
                limits:
                  google.com/tpu: 8
              livenessProbe:
                httpGet:
                  path: /health
                  port: 80
                initialDelaySeconds: 300
                periodSeconds: 120
      ---
      apiVersion: v1
      kind: Service
      metadata:
        name: service
      spec:
        selector:
          app: tgi-tpu
        ports:
        - name: http
          protocol: TCP
          port: 8080
          targetPort: 80

    This manifest describes an Optimum TPU Deployment and a cluster-internal Service that exposes the model server on TCP port 8080.

  2. Apply the manifest:

     kubectl apply -f optimum-tpu-gemma-2b-2x4.yaml

Llama3 8B

  1. Save the following manifest as optimum-tpu-llama3-8b-2x4.yaml:

      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: tgi-tpu
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: tgi-tpu
        template:
          metadata:
            labels:
              app: tgi-tpu
          spec:
            nodeSelector:
              cloud.google.com/gke-tpu-topology: 2x4
              cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
            containers:
            - name: tgi-tpu
              image: REGION_NAME-docker.pkg.dev/PROJECT_ID/optimum-tpu/tgi-tpu:latest
              args:
              - --model-id=meta-llama/Meta-Llama-3-8B
              - --max-concurrent-requests=4
              - --max-input-length=8191
              - --max-total-tokens=8192
              - --max-batch-prefill-tokens=32768
              - --max-batch-size=16
              env:
              - name: HF_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
              ports:
              - containerPort: 80
              resources:
                limits:
                  google.com/tpu: 8
              livenessProbe:
                httpGet:
                  path: /health
                  port: 80
                initialDelaySeconds: 300
                periodSeconds: 120
      ---
      apiVersion: v1
      kind: Service
      metadata:
        name: service
      spec:
        selector:
          app: tgi-tpu
        ports:
        - name: http
          protocol: TCP
          port: 8080
          targetPort: 80

    This manifest describes an Optimum TPU Deployment and a cluster-internal Service that exposes the model server on TCP port 8080.

  2. Apply the manifest:

     kubectl apply -f optimum-tpu-llama3-8b-2x4.yaml

View the logs from the running Deployment:

 kubectl logs -f -l app=tgi-tpu

The output should be similar to the following:

 2024-07-09T22:39:34.365472Z  WARN text_generation_router: router/src/main.rs:295: no pipeline tag found for model google/gemma-2b
2024-07-09T22:40:47.851405Z  INFO text_generation_router: router/src/main.rs:314: Warming up model
2024-07-09T22:40:54.559269Z  INFO text_generation_router: router/src/main.rs:351: Setting max batch total tokens to 64
2024-07-09T22:40:54.559291Z  INFO text_generation_router: router/src/main.rs:352: Connected
2024-07-09T22:40:54.559295Z  WARN text_generation_router: router/src/main.rs:366: Invalid hostname, defaulting to 0.0.0.0 

Make sure the model is fully downloaded before proceeding to the next section.
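
In addition to watching the logs, you can wait for the Deployment to report ready. This is an optional check; the Deployment name tgi-tpu and the app=tgi-tpu label come from the manifest you applied:

 # Block until all replicas of the tgi-tpu Deployment are available. This can take several
 # minutes while the model weights download and the server warms up.
 kubectl rollout status deployment/tgi-tpu

 # The Pod should show READY 1/1 and STATUS Running.
 kubectl get pods -l app=tgi-tpu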

Serve the model

Set up port forwarding to the model:

 kubectl port-forward svc/service 8080:8080

Interact with the model server using curl

To verify your deployed model, open a new terminal session and use curl to chat with the model:

 curl 127.0.0.1:8080/generate -X POST \
     -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":40}}' \
     -H 'Content-Type: application/json'

The output should be similar to the following:

 {"generated_text":"\n\nDeep learning is a subset of machine learning that uses artificial neural networks to learn from data.\n\nArtificial neural networks are inspired by the way the human brain works. They are made up of multiple layers"} 

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the deployed resources

To avoid incurring charges to your Google Cloud account for the resources that you created in this guide, run the following command:

 gcloud container clusters delete CLUSTER_NAME \
     --location=ZONE

What's next
