AI zones

This document provides an overview of AI zones for Cloud Storage. AI zones are specialized Google Cloud zones that are designed to offer computing capacity for artificial intelligence (AI) and machine learning (ML) workloads. They provide significant ML accelerator (GPU and TPU) capacity.

AI zones are optimized for AI and ML workloads like the following:

  • Large-scale training
  • Small-scale training, fine-tuning, bulk inference, and retraining
  • Real-time ML inference

For background information about AI zones, see AI zones in the Compute Engine documentation.

Within a region, AI zones might be geographically located away from standard (non-AI) zones.

AI zones are compatible with other Cloud Storage and Google Cloud features.

Storage architecture recommendations

We recommend that you use a tiered storage architecture to balance cost, durability, and performance:

  • Cold storage layer: use regional Cloud Storage buckets in standard zones for persistent, highly durable storage (the "source of truth") of your training datasets and model checkpoints.

  • Performance layer: use specialized zonal storage services to act as a high-speed cache or temporary scratch space. This approach eliminates inter-zonal latency and maximizes throughput during active jobs.

The following storage solutions are recommended for optimizing AI and ML system performance with AI zones:

Rapid Cache product of Cloud Storage

Description: A fully managed, SSD-backed zonal read cache that brings frequently read data from a bucket into the AI zone.

Use cases: Create a Rapid Cache instance in an AI zone for the regional source bucket that contains the training datasets or models that you want to serve. When your training job reads a file, the file is pulled into the fast, in-zone cache. Subsequent reads are served directly from the cache, bypassing the regional network. This is ideal for the repetitive data access patterns in model training and for low-latency model serving.

Recommended for:

  • Read-heavy workloads
  • Low-latency model training and serving

Rapid Bucket product of Cloud Storage

Description: A high-performance capability that lets you locate buckets in zones and use the Rapid storage class. It is optimized for I/O-intensive workloads and designed for AI and ML applications that must be colocated with the data they access.

Use cases: To optimize your data storage for workloads that require low latency and high throughput, create a zonal bucket that uses Rapid storage and then mount the bucket as a directory on your virtual machine by using Cloud Storage FUSE. You can then configure your application to store or access data in the zonal bucket through the bucket's mount point. For example, to store model checkpoints in your zonal bucket, set the checkpointing directory in your training code to the mount point.

For regional resilience, you can back up your checkpoints to regional, multi-regional, or dual-regional buckets. Use Storage Transfer Service to move data from a zonal bucket to a bucket in a different location. Use Object Lifecycle Management to expire and delete data after a set time period.

Recommended for:

  • High-throughput read and write workloads
  • High-speed checkpointing
  • Low-latency model training and serving
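
The checkpointing pattern described for Rapid Bucket can be sketched in Python as follows. This is a minimal sketch, assuming that the zonal bucket is already mounted with Cloud Storage FUSE and that the mount point is passed in as an ordinary directory path. The helper names (`save_checkpoint`, `latest_checkpoint`, `load_checkpoint`) and the `step-NNNNNNNN.ckpt` naming scheme are hypothetical and not part of any Google Cloud API. `pickle` stands in for whatever serialization your training framework provides.

```python
import pickle
from pathlib import Path
from typing import Optional


def save_checkpoint(state: dict, step: int, checkpoint_dir: str) -> Path:
    """Serialize `state` into checkpoint_dir and return the file path.

    checkpoint_dir is assumed to be the Cloud Storage FUSE mount point of
    the zonal bucket, so a plain file write lands in the bucket.
    """
    directory = Path(checkpoint_dir)
    directory.mkdir(parents=True, exist_ok=True)
    # Zero-padded step numbers keep lexicographic and numeric order equal.
    target = directory / f"step-{step:08d}.ckpt"
    with open(target, "wb") as f:
        pickle.dump(state, f)
    return target


def latest_checkpoint(checkpoint_dir: str) -> Optional[Path]:
    """Return the checkpoint with the highest step, or None if there is none."""
    directory = Path(checkpoint_dir)
    if not directory.is_dir():
        return None
    candidates = sorted(directory.glob("step-*.ckpt"))
    return candidates[-1] if candidates else None


def load_checkpoint(path: Path) -> dict:
    """Read back a checkpoint written by save_checkpoint."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

In a training loop, you might call `save_checkpoint(state, step, "/mnt/checkpoints")` every N steps, where `/mnt/checkpoints` is a placeholder for your own mount point, and call `latest_checkpoint` at startup to resume after a restart.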

Best practices

Follow these best practices for storage when using AI zones:

  • Provision your performance layer in the same AI zone as your compute resources. Colocating compute and storage helps to ensure that GPUs and TPUs remain fully saturated, maximizing "goodput" (useful throughput).

  • For Rapid Cache, before you start the primary training epoch, perform a pre-read of your dataset to populate, or warm, the SSD-backed cache.
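
The pre-read step can be sketched as a pass that reads every dataset file once and discards the bytes. This is a minimal sketch, assuming that the dataset is reachable as files under a directory (for example, through a Cloud Storage FUSE mount of the source bucket) and that it runs on a machine in the same AI zone as the cache. The function names, chunk size, and worker count are hypothetical choices, not recommended values.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

CHUNK_SIZE = 8 * 1024 * 1024  # read in 8 MiB chunks to bound memory use


def read_and_discard(path: Path) -> int:
    """Read one file from start to end and return the number of bytes read."""
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            total += len(chunk)
    return total


def warm_cache(dataset_dir: str, max_workers: int = 16) -> int:
    """Read every file under dataset_dir once and return the total bytes read.

    The reads themselves are the point: each one gives the cache a chance
    to ingest the file before the first training epoch requests it.
    """
    files = [p for p in Path(dataset_dir).rglob("*") if p.is_file()]
    # Threads are sufficient here because the work is I/O-bound.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return sum(pool.map(read_and_discard, files))
```

Run `warm_cache` on your dataset directory before launching training. Comparing the returned byte count with the expected dataset size is a quick check that the pass covered every file.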

Available AI zones

The following table shows the AI zones and their parent Google Cloud regions.

Geographic area    Parent region    AI zone
Europe             europe-west4     europe-west4-ai1a
United States      us-central1      us-central1-ai1a
United States      us-south1        us-south1-ai1b
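
When you script deployments, it can help to check that a regional source bucket and an AI zone belong to the same parent region. The following is a minimal sketch that hard-codes the table above. The `AI_ZONES` mapping and the helper names are hypothetical, and the list of zones might change over time, so treat the mapping as a snapshot rather than an authoritative source.

```python
# Snapshot of the AI zone table in this document: AI zone -> parent region.
AI_ZONES = {
    "europe-west4-ai1a": "europe-west4",
    "us-central1-ai1a": "us-central1",
    "us-south1-ai1b": "us-south1",
}


def parent_region(ai_zone: str) -> str:
    """Return the parent region of an AI zone listed in AI_ZONES."""
    try:
        return AI_ZONES[ai_zone]
    except KeyError:
        raise ValueError(f"unknown AI zone: {ai_zone!r}") from None


def ai_zones_in_region(region: str) -> list:
    """Return the AI zones in AI_ZONES whose parent region is `region`."""
    return sorted(zone for zone, parent in AI_ZONES.items() if parent == region)
```

For example, `parent_region("us-south1-ai1b")` returns `"us-south1"`, which you can compare against the location of your source bucket before you create a cache or zonal bucket.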

Considerations

  • You can access Google Cloud products in a Google Cloud region from the region's AI zone. However, accessing services in a Google Cloud region from an AI zone can add network latency, because the location of the AI zone might be physically separate from the locations of the region's standard zones.

  • We recommend that you run non-ML workloads in standard zones, not AI zones, because AI zones don't offer all Google Cloud services locally.

What's next
