Provision Managed Lustre on GKE using XPK

This document explains how to integrate Managed Lustre with GKE to create an optimized environment for demanding, data-intensive workloads like artificial intelligence (AI), machine learning (ML), and high performance computing (HPC).

In this document you provision a GKE cluster with XPK, create a Managed Lustre instance, and attach it to the cluster. To test this configuration, you run a workload on nodes that flex-start provisions.

This document is intended for Machine learning (ML) engineers and Data and AI specialists who are interested in exploring Kubernetes container orchestration capabilities backed by Managed Lustre instances. To learn more about common roles and example tasks referenced in Google Cloud content, see Common GKE user roles and tasks.

Background

This section describes the key technologies used in this document:

XPK

XPK is a tool that simplifies the provisioning and management of GKE clusters and workloads, especially for AI/ML tasks. XPK helps generate preconfigured, training-optimized infrastructure, which makes it a good option for proofs-of-concept and testing environments.

You can create a cluster that uses TPUs by using either the Google Cloud CLI or the Accelerated Processing Kit (XPK).

  • Use the gcloud CLI to manually create your GKE cluster for precise customization or to expand existing production GKE environments.
  • Use XPK to quickly create GKE clusters and run workloads for proofs-of-concept and testing. For more information, see the XPK README.

This document uses XPK exclusively for provisioning and managing resources.

For more information, see the Accelerated Processing Kit (XPK) documentation.

Flex-start

Flex-start lets you optimize TPU provisioning by paying only for the resources that you need. Flex-start is recommended if your workload needs resources that are dynamically provisioned as needed, for up to seven days, with cost-effective access.

This document uses flex-start as an example consumption option, but you can also use other options, such as reservations or Spot VMs. For more information, see About accelerator consumption options for AI/ML workloads in GKE.

Managed Lustre

Managed Lustre is a high-performance, parallel file system service designed for demanding workloads. The Managed Lustre CSI driver lets you integrate Managed Lustre instances with GKE, using standard Kubernetes Persistent Volume Claims (PVCs) and Persistent Volumes (PVs). This driver is particularly beneficial for AI, ML, and HPC workloads requiring persistent, scalable, and high-throughput storage.

For more information, see About the Managed Lustre CSI driver.

Before you begin

Before you start, make sure that you have performed the following tasks:

  • Enable the Google Cloud Managed Lustre API and the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the gcloud components update command. Earlier gcloud CLI versions might not support running the commands in this document.
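
For example, you can enable both of the required APIs with the gcloud CLI. The following command is a sketch; it assumes that the Managed Lustre API uses the service name lustre.googleapis.com:

# Enable the Managed Lustre API and the Google Kubernetes Engine API.
# The lustre.googleapis.com service name is an assumption; verify it with
# `gcloud services list --available` if the command fails.
gcloud services enable lustre.googleapis.com container.googleapis.com \
    --project=PROJECT_ID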

Prepare your environment

This section shows you how to prepare your cluster environment.

  1. In a new terminal window, create and activate a Python virtual environment:

     VENV_DIR=~/venvp4; python3 -m venv $VENV_DIR; source $VENV_DIR/bin/activate
    
  2. Install XPK by following the steps in the XPK installation file. Use pip install instead of cloning from source.
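
    For example, a minimal installation into the active virtual environment might look like the following sketch. It assumes that the package is published on PyPI as xpk; follow the installation file if your environment differs:

     # Install XPK into the virtual environment and confirm that the CLI runs.
     pip install xpk
     xpk --help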

  3. Set the default environment variables:

     gcloud config set project PROJECT_ID
     gcloud config set billing/quota_project PROJECT_ID
     export PROJECT_ID=$(gcloud config get project)
     export LOCATION=LOCATION
     export CLUSTER_NAME=CLUSTER_NAME
     export GKE_VERSION=VERSION
     export NETWORK_NAME=NETWORK_NAME
     export IP_RANGE_NAME=IP_RANGE_NAME
     export FIREWALL_RULE_NAME=FIREWALL_RULE_NAME
     export ACCELERATOR_TYPE=v6e-16
     export NUM_SLICES=1
     
    

    Replace the following values:

    • PROJECT_ID: your Google Cloud project ID.
    • LOCATION: the zone of your GKE cluster. Select a zone that supports both flex-start and Managed Lustre instances, for example, us-west4-a. For more information about zone availability, see About GPU and TPU provisioning with flex-start provisioning mode.
    • CLUSTER_NAME: the name of your GKE cluster.
    • VERSION: the GKE version. Ensure that this version is at least the minimum version that supports Managed Lustre, for example, 1.33.2-gke.1111000.
    • NETWORK_NAME: the name of the network that you create.
    • IP_RANGE_NAME: the name of the IP address range.
    • FIREWALL_RULE_NAME: the name of the firewall rule.

    The preceding commands configure a v6e-16 accelerator type. This configuration includes the following variables:

    • ACCELERATOR_TYPE=v6e-16: corresponds to TPU Trillium with a 4x4 topology. This TPU version instructs GKE to provision a multi-host slice node pool. The v6e-16 accelerator type maps to the ct6e-standard-4t machine type in GKE.
    • NUM_SLICES=1: the number of TPU slice node pools that XPK creates for the ACCELERATOR_TYPE that you select.

    If you want to customize the ACCELERATOR_TYPE and NUM_SLICES variables, refer to the following documents to find the available combinations:

    • To identify the TPU version, machine type for GKE, topology, and available zone that you want to use, see Plan TPUs in GKE.
    • To map the GKE machine type to the accelerator type in the Cloud TPU API, see the TPU Trillium (v6e) documentation.
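
    For example, before you change these values, you can check which TPU accelerator types the Cloud TPU API reports for your zone. This is a sketch; it assumes that the accelerator-types command group is available in your gcloud CLI version:

     # List the TPU accelerator types (such as v6e-16) available in the zone.
     gcloud compute tpus accelerator-types list \
         --zone=${LOCATION} \
         --project=${PROJECT_ID}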

Prepare a VPC network

Prepare a Virtual Private Cloud network for your Managed Lustre instance and GKE cluster.

  1. Enable the Service Networking API:

     gcloud services enable servicenetworking.googleapis.com \
         --project=${PROJECT_ID}
     
    
  2. Create a VPC network:

     gcloud compute networks create ${NETWORK_NAME} \
         --subnet-mode=auto \
         --project=${PROJECT_ID} \
         --mtu=8896
     
    
  3. Create an IP address range for VPC peering:

     gcloud compute addresses create ${IP_RANGE_NAME} \
         --global \
         --purpose=VPC_PEERING \
         --prefix-length=20 \
         --description="Managed Lustre VPC Peering" \
         --network=${NETWORK_NAME} \
         --project=${PROJECT_ID}
     
    
  4. Get the CIDR range of the IP address range:

     CIDR_RANGE=$(gcloud compute addresses describe ${IP_RANGE_NAME} \
         --global \
         --format="value[separator=/](address, prefixLength)" \
         --project=${PROJECT_ID})
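
    Optionally, confirm that the variable captured a valid CIDR range before you use it in the next step:

     # Prints a value such as 10.x.y.0/20 if the describe command succeeded.
     echo "Reserved peering range: ${CIDR_RANGE}"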
     
    
  5. Create a firewall rule to allow TCP traffic from the IP address range:

     gcloud compute firewall-rules create ${FIREWALL_RULE_NAME} \
         --allow=tcp:988,tcp:6988 \
         --network=${NETWORK_NAME} \
         --source-ranges=${CIDR_RANGE} \
         --project=${PROJECT_ID}
     
    
  6. Connect the VPC peering:

     gcloud services vpc-peerings connect \
         --network=${NETWORK_NAME} \
         --project=${PROJECT_ID} \
         --ranges=${IP_RANGE_NAME} \
         --service=servicenetworking.googleapis.com
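
    To confirm that the peering connection to servicenetworking.googleapis.com is established on your network, you can list the VPC peerings, for example:

     gcloud services vpc-peerings list \
         --network=${NETWORK_NAME} \
         --project=${PROJECT_ID}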
    

Create a Managed Lustre storage instance

This section shows you how to create a Managed Lustre storage instance.

  1. Set storage instance variables:

     export STORAGE_NAME=STORAGE_NAME
     export STORAGE_THROUGHPUT=STORAGE_THROUGHPUT
     export STORAGE_CAPACITY=STORAGE_CAPACITY_GIB
     export STORAGE_FS=lfs
    

    Replace the following values:

    • STORAGE_NAME: the name of your Managed Lustre instance.
    • STORAGE_THROUGHPUT: the throughput of the Managed Lustre instance, in MiB/s per TiB. For valid throughput values, see Calculate your new capacity.
    • STORAGE_CAPACITY_GIB: the capacity of the Managed Lustre instance, in GiB. For valid capacity values, see Allowed capacity and throughput values.
  2. Create the Managed Lustre instance:

     gcloud lustre instances create ${STORAGE_NAME} \
         --per-unit-storage-throughput=${STORAGE_THROUGHPUT} \
         --capacity-gib=${STORAGE_CAPACITY} \
         --filesystem=${STORAGE_FS} \
         --location=${LOCATION} \
         --network=projects/${PROJECT_ID}/global/networks/${NETWORK_NAME} \
         --project=${PROJECT_ID} \
         --async # Creates the instance asynchronously
     
    

    The --async flag creates the instance asynchronously and provides an operation ID to track its status.

  3. Check the operation's status:

     gcloud lustre operations describe OPERATION_ID \
         --location=${LOCATION} \
         --project=${PROJECT_ID}
     
    

    Replace OPERATION_ID with the ID from the output of the previous asynchronous command. If you don't have the ID, you can list all operations:

     gcloud lustre operations list \
         --location=${LOCATION} \
         --project=${PROJECT_ID}
     
    

    The instance is ready when the command output shows done: true.
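
    After the instance is ready, you can describe it to look up details that you need later, such as the instance's IP address for the attach manifest. This is a sketch; the exact output fields can vary by gcloud CLI version:

     gcloud lustre instances describe ${STORAGE_NAME} \
         --location=${LOCATION} \
         --project=${PROJECT_ID}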

Use XPK to create a GKE cluster

Use XPK to create a GKE cluster with a node pool.

Create a GKE cluster:

xpk cluster create --cluster ${CLUSTER_NAME} \
    --num-slices=${NUM_SLICES} \
    --tpu-type=${ACCELERATOR_TYPE} \
    --zone=${LOCATION} \
    --project=${PROJECT_ID} \
    --gke-version=${GKE_VERSION} \
    --custom-cluster-arguments="--network=${NETWORK_NAME}" \
    --enable-lustre-csi-driver \
    --flex

This command creates a GKE cluster by using XPK. The cluster is configured to use flex-start for node provisioning and has the Managed Lustre CSI driver enabled.
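
Optionally, you can confirm that the Managed Lustre CSI driver is registered in the new cluster. The following commands are a sketch; they assume that XPK configured credentials for the cluster and that the CSI driver name contains the string lustre:

# Point kubectl at the cluster if XPK didn't already configure it.
gcloud container clusters get-credentials ${CLUSTER_NAME} \
    --zone=${LOCATION} \
    --project=${PROJECT_ID}

# Check for a registered CSI driver whose name contains "lustre".
kubectl get csidrivers | grep -i lustre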

Attach the storage instance to the cluster

To configure the PersistentVolume (PV) and PersistentVolumeClaim (PVC), this section uses the xpk storage attach command with a manifest file, based on an example manifest from the XPK source code.

Attach the Managed Lustre storage instance to your GKE cluster by completing these steps:

  1. Download the example manifest file to your current working directory and save it as lustre-manifest-attach.yaml .

  2. Update the manifest file with your Managed Lustre instance's information:

    1. In the PersistentVolume section, replace the following values:

      • STORAGE_SIZE: the size of the Managed Lustre instance, in GiB.
      • PROJECT_ID/ZONE/INSTANCE_NAME: the full resource path of your Managed Lustre instance.
      • IP_ADDRESS: the IP address of the Managed Lustre instance.
      • FILE_SYSTEM: the file system type, which is lfs.
    2. In the PersistentVolumeClaim section, replace the following values:

      • STORAGE_SIZE: the size of the PersistentVolumeClaim, in GiB.
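
    If you prefer to script these replacements, you can use a sed sketch like the following, which fills in the placeholders from the environment variables that you set earlier. The placeholder strings, the unit handling for STORAGE_SIZE, and the LUSTRE_IP variable are assumptions based on the preceding lists; adjust them to match your copy of the manifest:

     # Set the instance IP address manually, for example from the output of
     # `gcloud lustre instances describe`.
     export LUSTRE_IP=IP_ADDRESS

     # Replace the placeholders in place (on macOS, use `sed -i ''`). Check
     # whether your manifest expects a unit suffix such as Gi on the size.
     sed -i \
         -e "s|STORAGE_SIZE|${STORAGE_CAPACITY}|g" \
         -e "s|PROJECT_ID/ZONE/INSTANCE_NAME|${PROJECT_ID}/${LOCATION}/${STORAGE_NAME}|g" \
         -e "s|IP_ADDRESS|${LUSTRE_IP}|g" \
         -e "s|FILE_SYSTEM|${STORAGE_FS}|g" \
         lustre-manifest-attach.yaml
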
  3. Attach the storage instance to the cluster:

     xpk storage attach ${STORAGE_NAME} \
         --cluster=${CLUSTER_NAME} \
         --project=${PROJECT_ID} \
         --zone=${LOCATION} \
         --type=lustre \
         --mount-point='/lustre-data' \
         --readonly=false \
         --auto-mount=true \
         --manifest='./lustre-manifest-attach.yaml'
     
    
  4. Verify that the storage is attached to the cluster:

     xpk storage list \
         --cluster=${CLUSTER_NAME} \
         --project=${PROJECT_ID} \
         --zone=${LOCATION}
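
    You can also inspect the Kubernetes objects that back the attached storage, for example:

     # Show the PersistentVolume and PersistentVolumeClaim that the manifest defined.
     kubectl get pv
     kubectl get pvc --all-namespaces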
     
    

Run a workload

Run a workload with the attached Managed Lustre instance. The following example command lists the mounted file systems, writes a hello.txt file to the Managed Lustre mount directory, and then reads the file back.

Create and run the workload:

xpk workload create --workload test-lustre \
    --cluster=${CLUSTER_NAME} \
    --project=${PROJECT_ID} \
    --zone=${LOCATION} \
    --command="df -h && echo 'hello' > /lustre-data/hello.txt && cat /lustre-data/hello.txt" \
    --tpu-type=${ACCELERATOR_TYPE} \
    --num-slices=1 \
    --flex
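
To follow the workload's progress, you can list the workloads in the cluster. This assumes that your XPK version provides the workload list command:

xpk workload list \
    --cluster=${CLUSTER_NAME} \
    --project=${PROJECT_ID} \
    --zone=${LOCATION}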

Clean up

To avoid incurring unwanted charges to your account after you complete the steps in this document, delete the cluster:

   
xpk cluster delete --cluster ${CLUSTER_NAME} \
    --zone ${LOCATION} \
    --project ${PROJECT_ID}
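
Deleting the cluster doesn't delete the Managed Lustre instance, which continues to incur charges. If you no longer need the instance, you can delete it as well. This is a sketch; it assumes that the gcloud lustre instances delete command is available in your gcloud CLI version:

gcloud lustre instances delete ${STORAGE_NAME} \
    --location=${LOCATION} \
    --project=${PROJECT_ID}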
 
