Overview of storage services for AI and ML workloads in AI Hypercomputer

Storage services provide the data architecture that enables high-performance model training, inference, and fine-tuning in the AI Hypercomputer ecosystem. Google Cloud offers multiple storage services, and the most suitable choice depends on the I/O, throughput, scale, and latency requirements of your use cases within the artificial intelligence (AI) and machine learning (ML) lifecycle.

This document introduces and compares storage services in Google Cloud that can best help you optimize GPU or TPU performance. It also provides recommendations on the ideal service for specific AI and ML use cases.

Introduction to storage services

Google Cloud offers multiple storage solutions that are optimized for AI and ML use cases:

Cloud Storage is an object storage system that's designed for processing and storing massive datasets, like those required for training or bulk inference. Cloud Storage offers several capabilities to help you optimize your data storage for AI and ML tasks.
Google Cloud Managed Lustre is a fully managed, POSIX-compliant parallel file system that's designed for the low latency and high-concurrency metadata performance that training and inference workloads require.

The following sections provide more information about each storage service.

Cloud Storage

Cloud Storage is a foundational object store that's designed to offer global scalability, durability, and cost efficiency. When you use Cloud Storage, you store data as objects in containers called buckets. Cloud Storage offers multiple capabilities for your buckets that help optimize AI and ML workload performance:

Products in the Cloud Storage Rapid family are designed to clear data bottlenecks for your AI and ML workloads by bringing your data closer to your compute resources. These products let you colocate your data in the same zones as your compute workloads and enable high-performance, cost-efficient storage scaling for your GPU or TPU clusters. Cloud Storage Rapid products include the following:
- Rapid Bucket provides the fastest read and write performance in Cloud Storage for zonal buckets. Objects in zonal buckets are stored in the Rapid storage class, a high-performance storage class that's optimized for I/O-intensive workloads. In addition to lower latency, Rapid Bucket delivers significantly higher throughput (up to 15 TB/s) than other products and bucket locations in Cloud Storage.
- Rapid Cache accelerates data reads from existing buckets without requiring code changes. Rapid Cache is an SSD-backed zonal read cache for Cloud Storage buckets that serves data for read requests. The product offers higher throughput (up to 2.5 TB/s) and lower latency than buckets without a cache.
  
  Rapid Cache is often set up for multi-region buckets, where accelerator capacity is fragmented across Google Cloud regions. Data read from the cache incurs lower data transfer fees than data read directly from a multi-region bucket.
Cloud Storage FUSE is an open source FUSE adapter that lets you mount buckets as local file systems, enabling applications to interact with object storage by using standard file system semantics. This capability lets you leverage the global scalability, durability, and cost efficiency of Cloud Storage with local file access. Cloud Storage FUSE is actively maintained and supported by Google.

Cloud Storage FUSE offers multiple client-side caching and tuning parameters, such as parallel downloads. These capabilities can abstract development complexities and help achieve peak performance by sharding or parallelizing streams.
Hierarchical namespace enables a true file system structure in buckets and provides efficient data management capabilities, including atomic folder renames and faster file lookups when the bucket is mounted with Cloud Storage FUSE. Hierarchical namespace offers 8 times higher queries per second (QPS) for object reads and writes than buckets without hierarchical namespace. For more information about the benefits of using hierarchical namespace, see performance and management benefits.

Enabling hierarchical namespace is highly recommended for workloads that require high-throughput data loading and frequent model checkpointing. Hierarchical namespace must be enabled when you create zonal buckets with Rapid Bucket.
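
As a rough illustration of the file-style access and folder renames described above, the following Python sketch assumes that a bucket is already mounted with Cloud Storage FUSE at a hypothetical path such as /mnt/training-bucket. The helper names (read_shards, publish_checkpoint) and the folder layout are assumptions made for this example and are not part of any Google Cloud API. The sketch uses only standard file operations: it reads dataset shards concurrently, and it writes a checkpoint into a staging folder before renaming the folder into place. Whether that rename is atomic on a mounted bucket depends on hierarchical namespace being enabled. The demo runs against a local temporary directory as a stand-in for the mount.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def read_shards(dataset_dir: Path, max_workers: int = 8) -> dict[str, bytes]:
    """Read every file directly under dataset_dir concurrently with plain file I/O."""
    shard_paths = sorted(p for p in dataset_dir.iterdir() if p.is_file())
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        contents = pool.map(Path.read_bytes, shard_paths)
        return {p.name: data for p, data in zip(shard_paths, contents)}


def publish_checkpoint(root: Path, step: int, files: dict[str, bytes]) -> Path:
    """Write a checkpoint into a staging folder, then rename it into place."""
    final_dir = root / f"step-{step:08d}"
    staging_dir = root / f".staging-step-{step:08d}"
    staging_dir.mkdir(parents=True, exist_ok=False)
    for name, payload in files.items():
        (staging_dir / name).write_bytes(payload)
    # One folder rename makes the whole checkpoint visible at once, so readers
    # never see a partially written checkpoint folder.
    os.rename(staging_dir, final_dir)
    return final_dir


if __name__ == "__main__":
    # Hypothetical: replace with the real mount point, e.g. /mnt/training-bucket.
    mount_point = Path(tempfile.mkdtemp())
    dataset_dir = mount_point / "dataset"
    dataset_dir.mkdir()
    (dataset_dir / "shard-0000.bin").write_bytes(b"example shard")
    print(sorted(read_shards(dataset_dir)))
    print(publish_checkpoint(mount_point / "checkpoints", 100, {"weights.bin": b"w"}))
```

Staging and renaming keeps the training loop's write path simple while ensuring that a job that restarts only ever sees complete checkpoint folders.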

Managed Lustre

Google Cloud Managed Lustre is a high-performance, POSIX-compliant, fully managed parallel file system that's optimized for AI and ML applications. The Managed Lustre architecture is suited to AI and ML workloads that need high throughput, low latency, and high metadata concurrency, such as checkpointing, high-speed weight propagation in reinforcement learning, and key-value (KV) caching.

For more information about common use cases for Managed Lustre, see Business cases.

Comparison of storage services

The following comparison summarizes Cloud Storage and Managed Lustre across key characteristics.

Architecture

Cloud Storage: Object store
- Data is stored in flat buckets by default. All bucket types (zonal, regional, dual-region, and multi-region) offer geo-redundancy options that can be accelerated with Cloud Storage Rapid capabilities.
- You can optionally enable hierarchical namespace to create buckets that support storing data in a file system structure.
- You can optionally enable Cloud Storage FUSE to mount buckets as local file systems.

Managed Lustre: Parallel file system
- Data is stored as files in Managed Lustre instances and mounted as local file systems across your accelerator clusters without any additional tuning needs.

Storage capacity

Cloud Storage: Scales up to EBs of capacity.

Managed Lustre: Scales up to 80 PB of capacity, depending on the instance's performance tier.

Performance

Cloud Storage: Supports the following:
- Sub-millisecond latency for open files with Rapid Bucket
- Tens of millions of IOPS/TiB with Rapid Bucket
- Up to 2.5 TB/s of bandwidth with Rapid Cache
- Up to 15 TB/s of bandwidth with Rapid Bucket
- Bandwidth increase requests

Managed Lustre: Supports the following:
- Sub-millisecond latency
- Tens of millions of IOPS/TiB
- Up to 10 TB/s of bandwidth

Pricing

Cloud Storage: For details, see Cloud Storage pricing.

Managed Lustre: For details, see Managed Lustre pricing.

Recommendations by requirements

Cloud Storage: Recommended for applications that need a scalable object store and general cost efficiency for training datasets, asynchronous multi-tier checkpointing, and model weight storage. In particular, Cloud Storage Rapid is recommended for high-performance and cost-efficient data scaling.

Managed Lustre: Recommended for applications that need a fully POSIX-compliant parallel file system or home directories. Also recommended for latency-sensitive or high-metadata-concurrency workloads, such as KV caching offloads, synchronous checkpointing, and high-speed weight propagation for reinforcement learning.
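
To put the bandwidth figures in this comparison into perspective, the following Python sketch computes a best-case lower bound on the time needed to move a given amount of data at each service's stated peak bandwidth. The function name and the 30 TB example size are assumptions made for illustration. The bandwidth values are the "up to" figures from the comparison, so real transfers will generally take longer, depending on instance size, client count, and access pattern.

```python
TB = 10**12  # decimal terabyte, matching the TB/s figures in the comparison

# Peak ("up to") aggregate bandwidth from the comparison, in TB/s.
PEAK_BANDWIDTH_TB_PER_S = {
    "Cloud Storage Rapid Cache": 2.5,
    "Managed Lustre": 10.0,
    "Cloud Storage Rapid Bucket": 15.0,
}


def min_transfer_seconds(size_bytes: float, bandwidth_tb_per_s: float) -> float:
    """Return the best-case transfer time: size divided by peak bandwidth."""
    if bandwidth_tb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return size_bytes / (bandwidth_tb_per_s * TB)


if __name__ == "__main__":
    checkpoint_bytes = 30 * TB  # hypothetical 30 TB checkpoint
    for service, bandwidth in PEAK_BANDWIDTH_TB_PER_S.items():
        seconds = min_transfer_seconds(checkpoint_bytes, bandwidth)
        print(f"{service}: at least {seconds:.1f} s")
```

For the hypothetical 30 TB checkpoint, the lower bounds work out to 12 seconds at 2.5 TB/s, 3 seconds at 10 TB/s, and 2 seconds at 15 TB/s.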

Storage service recommendations by use case

KV cache offloading	Primary recommendation: Managed Lustre	Managed Lustre provides sub-millisecond latency and parallel data access, which lets different nodes "pull" the KV cache and resume chats without reprocessing the entire chat history.
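
The "pull" pattern for KV cache offloading can be sketched in Python as a small file-backed store on a shared file system. This is a minimal sketch under stated assumptions, not how any particular inference server implements offloading: the class name FileKVCacheStore, the on-disk layout, and the use of opaque bytes for the serialized KV cache are all hypothetical. It assumes that every node mounts the same POSIX file system (for example, a Managed Lustre instance) at the same path, so a session saved by one node can be loaded by another.

```python
import hashlib
import os
from pathlib import Path
from typing import Optional


class FileKVCacheStore:
    """Hypothetical store that keeps one serialized KV cache file per session."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def _path_for(self, session_id: str) -> Path:
        # Hash the session ID and fan out by prefix so that no single
        # directory accumulates every file.
        digest = hashlib.sha256(session_id.encode("utf-8")).hexdigest()
        return self.root / digest[:2] / f"{digest}.kv"

    def save(self, session_id: str, kv_bytes: bytes) -> None:
        """Write the cache to a temporary file, then move it into place."""
        path = self._path_for(session_id)
        path.parent.mkdir(parents=True, exist_ok=True)
        tmp_path = path.with_suffix(".tmp")
        tmp_path.write_bytes(kv_bytes)
        # os.replace swaps the file in one step, so readers see either the
        # old cache or the new one, never a partial write.
        os.replace(tmp_path, path)

    def load(self, session_id: str) -> Optional[bytes]:
        """Return the saved cache, or None if the session was never offloaded."""
        try:
            return self._path_for(session_id).read_bytes()
        except FileNotFoundError:
            return None
```

A node that receives a follow-up message for a session would call load first and fall back to recomputing the prefill only when it returns None. This sketch assumes a single writer per session. Concurrent writers for the same session would need unique temporary file names.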

What's next

Learn more about Cloud Storage Rapid, a family of products in Cloud Storage that are designed for AI, ML, and data-intensive analytics.
Learn how to optimize performance when using Cloud Storage FUSE or the Cloud Storage FUSE CSI driver to download datasets.
Learn how to accelerate model loading on Google Kubernetes Engine.
