vLLM inference on v6e TPUs

This tutorial shows you how to run vLLM inference on v6e TPUs. It also shows you how to run the benchmark script for the Meta Llama-3.1-8B model.

To get started with vLLM on v6e TPUs, see the vLLM quickstart.

If you are using GKE, also see the GKE tutorial.

Before you begin

You must sign the consent agreement to use the Llama 3 family of models in the Hugging Face repo. Go to meta-llama/Llama-3.1-8B, fill out the consent agreement, and wait until you are approved.

Prepare to provision a TPU v6e with 4 chips:

  1. Follow the Set up the Cloud TPU environment guide to set up a Google Cloud project, configure the Google Cloud CLI, enable the Cloud TPU API, and ensure that you have access to use Cloud TPUs.

  2. Authenticate with Google Cloud and configure the default project and zone for the Google Cloud CLI:

    gcloud auth login
    gcloud config set project PROJECT_ID
    gcloud config set compute/zone ZONE
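
To confirm that the project and zone are set, you can optionally print the active gcloud configuration:

    gcloud config list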
    

Secure capacity

When you are ready to secure TPU capacity, see Cloud TPU quotas to learn more about the quotas available to you. If you have additional questions about securing capacity, contact your Cloud TPU sales or account team.

Provision the Cloud TPU environment

You can provision TPU VMs with GKE, with GKE and XPK, or as queued resources. This tutorial provisions a TPU using a queued resource request.

Prerequisites

  • Verify that your project has enough TPUS_PER_TPU_FAMILY quota, which specifies the maximum number of chips you can access within your Google Cloud project.
  • Verify that your project has enough quota for:
    • TPU VMs
    • IP addresses
    • Hyperdisk Balanced
  • Verify that you have the required user project permissions.

Provision a TPU v6e

  
gcloud alpha compute tpus queued-resources create QUEUED_RESOURCE_ID \
  --node-id TPU_NAME \
  --project PROJECT_ID \
  --zone ZONE \
  --accelerator-type v6e-4 \
  --runtime-version v2-alpha-tpuv6e \
  --service-account SERVICE_ACCOUNT
Command flag descriptions

TPU_NAME
  The user-assigned ID of the TPU that is created when the queued resource request is allocated.
PROJECT_ID
  The Google Cloud project name. Use an existing project or create a new one.
ZONE
  See the TPU regions and zones document for the supported zones.
ACCELERATOR_TYPE
  See the Accelerator Types documentation for the supported accelerator types.
RUNTIME_VERSION
  The TPU software version, in this case v2-alpha-tpuv6e.
SERVICE_ACCOUNT
  The email address for your service account, which you can find in the Google Cloud console under IAM > Service Accounts.
  For example: tpu-service-account@<your_project_ID>.iam.gserviceaccount.com

Use the list or describe commands to query the status of your queued resource.

gcloud alpha compute tpus queued-resources describe QUEUED_RESOURCE_ID \
  --project PROJECT_ID \
  --zone ZONE

For a complete list of queued resource request statuses, see the Queued resources documentation.
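
While the request is pending, you can poll just the provisioning state by filtering the describe output; for example (an optional convenience, and the exact output layout can vary by gcloud version):

gcloud alpha compute tpus queued-resources describe QUEUED_RESOURCE_ID \
  --project PROJECT_ID --zone ZONE | grep state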

Connect to the TPU using SSH

  
gcloud compute tpus tpu-vm ssh TPU_NAME
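
If your active gcloud configuration doesn't point at the project and zone where you created the TPU, you can pass them explicitly, and you can run a one-off command instead of opening an interactive shell. For example (an optional variant of the same command):

gcloud compute tpus tpu-vm ssh TPU_NAME \
  --project PROJECT_ID --zone ZONE \
  --command "hostname"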

Install dependencies

  1. Create a directory for Miniconda:

    mkdir -p ~/miniconda3
  2. Download the Miniconda installer script:

    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
  3. Install Miniconda:

    bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
  4. Remove the Miniconda installer script:

    rm -rf ~/miniconda3/miniconda.sh
  5. Add Miniconda to your PATH variable:

    export PATH="$HOME/miniconda3/bin:$PATH"
  6. Reload ~/.bashrc to apply the changes to the PATH variable:

    source ~/.bashrc
  7. Create a Conda environment:

    conda create -n vllm python=3.12 -y
    conda activate vllm
  8. Clone the vLLM repository and navigate to the vllm directory:

    git clone https://github.com/vllm-project/vllm.git && cd vllm
  9. Clean up the existing torch and torch-xla packages:

    pip uninstall torch torch-xla -y
  10. Install the remaining build dependencies, then build and install vLLM for TPU:

    pip install -r requirements/tpu.txt
    VLLM_TARGET_DEVICE="tpu" python -m pip install --editable .
    sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev
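
To sanity-check the installation, you can optionally verify that the packages import and report their versions (a quick check that is not part of the original steps):

    python -c "import torch, torch_xla, vllm; print(torch.__version__, torch_xla.__version__, vllm.__version__)"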
    

Get access to the model

Generate a new Hugging Face token if you don't already have one:

  1. Go to Your Profile > Settings > Access Tokens.

  2. Select Create new token.

  3. Specify a Name of your choice and a Role with at least Read permissions.

  4. Select Generate a token.

  5. Copy the generated token to your clipboard, set it as an environment variable, and authenticate with the huggingface-cli:

    export TOKEN=YOUR_TOKEN
    git config --global credential.helper store
    huggingface-cli login --token $TOKEN
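
You can optionally confirm that authentication succeeded before downloading any model weights:

    huggingface-cli whoami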
    

Launch the vLLM server

The following command downloads the model weights from the Hugging Face Model Hub to the TPU VM's /tmp directory, pre-compiles a range of input shapes, and writes the compilation cache to ~/.cache/vllm/xla_cache.

For more details, refer to the vLLM docs.

cd ~/vllm
vllm serve "meta-llama/Llama-3.1-8B" \
  --download_dir /tmp \
  --swap-space 16 \
  --disable-log-requests \
  --tensor_parallel_size=4 \
  --max-model-len=2048 &> serve.log &
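
After the server finishes compiling and starts listening (watch serve.log for the startup message), you can send a test request. This sketch assumes vLLM's default OpenAI-compatible server on port 8000; adjust the port if you changed it:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B", "prompt": "San Francisco is a", "max_tokens": 32}'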

Run vLLM benchmarks

Run the vLLM benchmarking script:

export MODEL="meta-llama/Llama-3.1-8B"
pip install pandas
pip install datasets
python benchmarks/benchmark_serving.py \
  --backend vllm \
  --model $MODEL \
  --dataset-name random \
  --random-input-len 1820 \
  --random-output-len 128 \
  --random-prefix-len 0
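
When you finish benchmarking, you can optionally review the server log and stop the background vLLM process before deleting the TPU (pkill matches the serve command started earlier; adjust the pattern if you launched the server differently):

tail serve.log
pkill -f "vllm serve"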

Clean up

Delete the TPU:

gcloud compute tpus queued-resources delete QUEUED_RESOURCE_ID \
  --project PROJECT_ID \
  --zone ZONE \
  --force \
  --async
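
Because the --async flag returns immediately, you can optionally confirm the deletion later by listing the queued resources in the zone:

gcloud alpha compute tpus queued-resources list --project PROJECT_ID --zone ZONE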