Cloud TPU performance guide

Your first step when troubleshooting TPU performance is to profile your model. For more information on capturing a performance profile, see Profiling your model on Cloud TPU.

TPU model performance

This section describes general issues that can reduce model performance and how you can address them.

  1. Model is input bound

    TPUs perform calculations very fast. To keep the TPU from sitting idle, make sure a steady stream of data is loaded onto it. How you do this depends on how you load and preprocess your dataset. For example, you can read data files in parallel using tf.data.TFRecordDataset() and the num_parallel_reads parameter.

  2. Batch size is too small because of sharding (splitting batches across cores)

    The TPU runtime splits a batch across all 8 cores of a TPU device (for example v2-8 or v3-8). If you specify a global batch size of 128, each core receives a batch size of 16 (128 / 8).

    For optimum memory usage, use the largest batch size that fits into TPU memory. Each TPU core uses two-dimensional 8 x 128 vector registers for processing matrix multiplications. In general, your batch size should be evenly divisible by 8 or 128.

  3. Memory management tuning

    You can use the TPU_PREMAPPED_BUFFER_SIZE environment variable to fine-tune low-level runtime behavior.

  • Description: TPU_PREMAPPED_BUFFER_SIZE sets the size of the host memory buffer (in bytes) that is pre-mapped and pinned for use by the TPU runtime for data transfers (for example, DMA). The default value is 4294967296 bytes (2^32, or 4 GiB). The value must be a multiple of 2^12 bytes (4KB = 4 * 1024 bytes = 4096 bytes = 2^12 bytes).

    The following examples are valid TPU_PREMAPPED_BUFFER_SIZE values.

       
        17179869184 = 2^34 = 2^22 * 2^12 (2^22 4KB pages will be premapped).
        40000000000 = 5^10 * 2^12 (5^10 4KB pages will be premapped).
    
  • Impact: Increasing this size can potentially improve data transfer performance between the host and TPU device, especially for workloads with large tensors or frequent host-device communication. However, it also increases the amount of pinned host memory, reducing memory available for other processes.

    Buffer size

    If the pre-mapped buffer region isn't large enough to allocate memory during program runtime, the workload fails and returns a RESOURCE_EXHAUSTED error similar to:

    "Allocating buffer from premmaped region failed with: RESOURCE_EXHAUSTED: Attempting to allocate allocation_size. That was not possible. There are available_size free."

    If the buffer is excessively large, TPU initialization can take much longer (potentially more than 15 seconds), making it seem as if the TPU is stuck.

    To diagnose this, inspect the TPU runtime logs. These logs detail the operations being performed, including the pre-mapping of buffers. You can find the logs at /tmp/tpu_logs/tpu_driver.INFO or print them directly to the console by setting the environment variable TPU_STDERR_LOG_LEVEL=0. This setting will generate output similar to:

       
        I0604 12:45:24.926233 62136 tpu_hal.cc:214] Starting premapped memory manager initialization...
        I0604 12:45:29.411218 62136 system.cc:1059] tpu::System initialized, current host id: 0, logical device ids: 0
        I0604 12:45:29.411244 61600 tfrt_tpu_system_state.cc:216] CreateTpuSystemState: TPU initialization is successful and it took 5.583190661s
        I0604 12:45:29.411267 61600 tfrt_tpu_system_state.cc:220] CreateTpuSystemState: using TPU host premapped buffer of size: 4294967296

    This output tells you how long it took to initialize the TPU and the size of the premapped buffer.
    
  • Usage: If the premapped buffer is too small or too large, you can manually set the buffer size using the following environment variables.

        TPU_PREMAPPED_BUFFER_SIZE: Sets the total size (in bytes) of the pre-mapped buffer region.
        TPU_PREMAPPED_BUFFER_TRANSFER_THRESHOLD_BYTES: Sets the maximum size of a single buffer that can be allocated from the pre-mapped region.
    

    For example, you can run:

        export TPU_PREMAPPED_BUFFER_SIZE=4294967296

    to set the buffer size and:

        export TPU_PREMAPPED_BUFFER_TRANSFER_THRESHOLD_BYTES

    to enable it. This export sets the size to the default.
    
  • Guidance: Adjust the value of TPU_PREMAPPED_BUFFER_SIZE if you suspect host-device data transfer is a bottleneck. Monitor host memory usage and model performance to find an optimal balance. The default value is typically sufficient for most use cases.
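
The multiple-of-4096 rule for TPU_PREMAPPED_BUFFER_SIZE can be checked before you export a value. The following is a minimal sketch in plain Python; premapped_page_count is a hypothetical helper name, not part of the TPU runtime, and it only checks the arithmetic described in this guide.

```python
PAGE_SIZE_BYTES = 4096  # 2^12 bytes (4KB), the granularity required by the guide


def premapped_page_count(size_bytes: int) -> int:
    """Return how many 4KB pages a candidate buffer size corresponds to.

    Hypothetical helper for illustration only. Raises ValueError when the
    size is not a positive multiple of 4096 bytes.
    """
    if size_bytes <= 0 or size_bytes % PAGE_SIZE_BYTES != 0:
        raise ValueError(
            f"{size_bytes} is not a positive multiple of {PAGE_SIZE_BYTES}"
        )
    return size_bytes // PAGE_SIZE_BYTES


# The default value and the two valid examples from this guide.
for candidate in (4294967296, 17179869184, 40000000000):
    print(candidate, "->", premapped_page_count(candidate), "pages")
```

A value that fails this check, such as 5000, raises ValueError in the sketch. How the TPU runtime itself reacts to an invalid value is not covered here.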

XLA compiler optimizations

XLA is a compiler for machine learning that can produce binaries for TPUs, CPUs, GPUs, and other platforms. While XLA is part of the standard TensorFlow code base, it can also be used on PyTorch and JAX models. Models for Cloud TPU are translated to an XLA graph, which XLA then compiles to a TPU executable. For more information about XLA, see XLA: Optimizing Compiler for Machine Learning.

Padding

To use TPU memory efficiently, structure your data so that it can be tiled into 128 x 8 chunks. When the data for a matrix computation does not fill an entire 128 x 8 chunk, the XLA compiler pads tensors. There are two drawbacks to padding:

  1. Padded tensors under-utilize the TPU core.
  2. Padding increases the amount of on-chip memory storage required for a tensor and can lead to an out-of-memory error.

While padding is automatically performed by the XLA compiler when necessary, you can determine the amount of padding performed using the memory viewer tool. You can avoid padding by picking tensor dimensions that are well suited for TPU.
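
To get a feel for how much padding can cost, the sketch below estimates the padded size of a two-dimensional tensor. It assumes that the second-to-last dimension is rounded up to a multiple of 8 and the last dimension to a multiple of 128. This is a simplification: the layout XLA actually chooses can differ, and padded_shape and padding_overhead are hypothetical helper names. Use the memory viewer tool for real numbers.

```python
def round_up(value: int, multiple: int) -> int:
    """Round value up to the nearest multiple using integer arithmetic."""
    return (value + multiple - 1) // multiple * multiple


def padded_shape(rows: int, cols: int, tile=(8, 128)):
    """Shape after padding each dimension to the assumed tile size."""
    return (round_up(rows, tile[0]), round_up(cols, tile[1]))


def padding_overhead(rows: int, cols: int, tile=(8, 128)) -> float:
    """Fraction of the padded tensor that is padding (0.0 means none)."""
    p_rows, p_cols = padded_shape(rows, cols, tile)
    return 1 - (rows * cols) / (p_rows * p_cols)


# A 10 x 100 tensor pads to 16 x 128 under this assumption,
# while an 8 x 128 tensor needs no padding at all.
print(padded_shape(10, 100), padding_overhead(10, 100))
print(padded_shape(8, 128), padding_overhead(8, 128))
```

Under these assumptions, slightly more than half of the padded 10 x 100 tensor is padding, which illustrates both drawbacks listed above.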

Tensor dimensions

To achieve peak FLOPs, dimensions of matrix multiplication should be larger than the MXU size for the TPU version you are using. MXU size is 256 x 256 for v6e and 128 x 128 for versions prior to v6e. For more information, see Cloud TPU system architecture.
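
The MXU sizes above can be turned into a quick sanity check on matrix multiplication dimensions. This is a rough sketch: fills_mxu and the version table are hypothetical names written for this example, the list of versions prior to v6e is an assumption, and treating "at least as large as the MXU" as the threshold is an interpretation of the guidance.

```python
# MXU edge length per TPU version, following the sizes stated in this guide.
# The set of pre-v6e versions listed here is an assumption for illustration.
MXU_SIZE_BY_VERSION = {
    "v2": 128,
    "v3": 128,
    "v4": 128,
    "v5e": 128,
    "v5p": 128,
    "v6e": 256,
}


def fills_mxu(m: int, k: int, n: int, tpu_version: str) -> bool:
    """Check an (m x k) @ (k x n) matmul against the MXU edge length.

    Returns True when every dimension is at least as large as the MXU size.
    Raises KeyError for versions that are not in the table.
    """
    size = MXU_SIZE_BY_VERSION[tpu_version]
    return min(m, k, n) >= size


print(fills_mxu(512, 512, 512, "v6e"))  # all dimensions exceed 256
print(fills_mxu(128, 128, 128, "v6e"))  # too small for a 256 x 256 MXU
```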

Batch size

The XLA compiler rounds up the sizes of tensors stored in TPU HBM memory to perform computations more efficiently. This padding happens transparently at the hardware level and does not affect results. However, in certain cases the padding can result in significantly increased memory use and execution time.

The TPU runtime lays out tensors in memory to maximize computational efficiency and minimize padding. To minimize memory overhead, one of the following should be true:

  1. The total batch size is a multiple of 64 (8 per TPU core), and feature dimension sizes are a multiple of 128.

  2. The total batch size is a multiple of 1024 (128 per TPU core), and feature dimension sizes are a multiple of 8.

Using a batch size of 1024 and feature dimensions that are a multiple of 128 results in the best efficiency, although this may not be possible for all models.
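
The two rules above can be expressed as a small check. The sketch below assumes a device with 8 cores (as in the v2-8 and v3-8 examples) and a single feature dimension. per_core_batch and satisfies_layout_rule are hypothetical helper names, not TPU runtime APIs.

```python
def per_core_batch(total_batch: int, num_cores: int = 8) -> int:
    """Batch size each core receives when the batch is split evenly."""
    if total_batch % num_cores != 0:
        raise ValueError(
            f"batch size {total_batch} does not split evenly over {num_cores} cores"
        )
    return total_batch // num_cores


def satisfies_layout_rule(total_batch: int, feature_dim: int, num_cores: int = 8) -> bool:
    """Return True when either batch size rule from this guide holds."""
    if total_batch % num_cores != 0:
        return False
    per_core = total_batch // num_cores
    # Rule 1: 8 per core and features in multiples of 128.
    rule_1 = per_core % 8 == 0 and feature_dim % 128 == 0
    # Rule 2: 128 per core and features in multiples of 8.
    rule_2 = per_core % 128 == 0 and feature_dim % 8 == 0
    return rule_1 or rule_2


print(per_core_batch(128))              # 16, matching the 128 / 8 example
print(satisfies_layout_rule(1024, 128)) # satisfies both rules
print(satisfies_layout_rule(64, 8))     # satisfies neither rule
```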

Fusion

Fusion is a general technique the XLA compiler uses to optimize programs. A fused operation combines multiple constituent operations so that they execute together.

For example, consider the following series of operations:

    tmp = tf.add(x, y)
    result = tf.multiply(tmp, z)

This code is roughly equivalent to the following pseudocode:

   
    for (i = 0; i < element_count; i++) {
      tmp[i] = x[i] + y[i];
    }

    for (i = 0; i < element_count; i++) {
      result[i] = tmp[i] * z[i];
    }
 

With fusion, the array accesses happen at the same time:

   
    for (i = 0; i < element_count; i++) {
      result[i] = (x[i] + y[i]) * z[i];
    }
 

In this example, the number of memory round trips is reduced and XLA does not need to allocate any space for 'tmp'.
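
The same contrast can be written as runnable Python. This sketch only mirrors the loop structure of the pseudocode to show that both forms produce the same result; XLA performs fusion on the compiled graph, not on Python loops, and the function names unfused and fused are made up for this example.

```python
def unfused(x, y, z):
    """Two passes over the data with an intermediate buffer."""
    tmp = [xi + yi for xi, yi in zip(x, y)]     # first pass writes tmp
    return [ti * zi for ti, zi in zip(tmp, z)]  # second pass reads tmp back


def fused(x, y, z):
    """One pass over the data with no intermediate buffer."""
    return [(xi + yi) * zi for xi, yi, zi in zip(x, y, z)]


x, y, z = [1, 2, 3], [4, 5, 6], [7, 8, 9]
print(unfused(x, y, z))
print(fused(x, y, z))
```

Both calls print the same list. The fused version never materializes tmp, which is the saving described above.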

Fusion is a critical optimization and benefits the Cloud TPU in several ways:

  • It reduces memory transfers by removing the need to store intermediate results in main memory, which is slow.
  • It allows greater utilization of hardware units that would otherwise sit idle.
  • It can reduce the memory utilization of a model as fewer buffers need to be live at the same time.

Broadcasting

Broadcasting implicitly occurs when two tensors with different, but compatible, shapes are combined.

For example, tf.add(vector, matrix) requires the vector to be broadcast to the shape of the matrix. The result of the operation has the same shape as the matrix. For more details, see the guide to broadcasting arrays.

While broadcasts can often be fused with their consumers, forcing a broadcast may result in poor performance and increased memory usage.

In the following example, the broadcast implicit in the addition of a vector and matrix cannot be fused with the argmax, resulting in a materialized broadcast:

    tf.argmax(tf.add(vector, zero_matrix), axis=0)
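
The shape behavior of this expression can be illustrated with NumPy, which follows the same broadcasting rules. This is only a sketch of the shapes involved, with made-up values for vector and zero_matrix. It says nothing about how XLA decides whether to fuse or materialize the broadcast.

```python
import numpy as np

vector = np.array([1.0, 5.0, 3.0])  # shape (3,)
zero_matrix = np.zeros((4, 3))      # shape (4, 3)

# The vector is implicitly broadcast across the 4 rows of the matrix,
# so the sum has the same shape as the matrix.
summed = np.add(vector, zero_matrix)
print(summed.shape)

# The explicit equivalent of the implicit broadcast.
explicit = np.broadcast_to(vector, zero_matrix.shape)
print(np.array_equal(summed, explicit))

# Reducing over axis 0 consumes the broadcast dimension.
print(np.argmax(summed, axis=0))
```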
 