Monitor TPUs

This guide explains how to use Cloud Monitoring to monitor your TPU VMs. Cloud Monitoring automatically collects metrics and logs from your TPU and its host VM. You can use this data to monitor the health of your TPU and its Compute Engine host VM.

Metrics track a numerical quantity over time, for example, CPU utilization, network usage, or TensorCore idle duration. Logs capture events at a specific point in time. Log entries are written by your own code, Google Cloud services, third-party applications, and the Google Cloud infrastructure. You can generate metrics from the data in a log entry by creating a log-based metric, and you can set alert policies based on metric values or log entries.

To monitor TPUs, you can also use Capacity Planner (Preview). With Capacity Planner, you can view TPU usage and forecast data for your project, folder, or organization. This data updates every 24 hours, and you can use it to analyze usage trends and plan for future capacity needs. For more information, see Capacity Planner overview.

Access TPU metrics

Compute Engine generates two types of TPU metrics: TPU runtime metrics and TPU VM infrastructure metrics. You can get the metrics in two ways:

TPU Monitoring Library: Get TPU runtime metrics from the LibTPU SDK using the TPU Monitoring Library. This enables your applications to get real-time telemetry from inside the guest environment. For more information, see TPU Monitoring Library.
AI Telemetry Collector: Get runtime metrics and VM infrastructure metrics through the AI Telemetry Collector. The AI Telemetry Collector runs inside the TPU VM and lets you access metrics through Cloud Monitoring or through your own Prometheus-based monitoring pipeline. For more information, see AI Telemetry Collector.

TPU metrics

Google Cloud metrics for Cloud TPU are automatically generated by Compute Engine VMs and the Cloud TPU runtime. The metrics in the following table are generated by Compute Engine VMs.

The "metric type" strings in this table must be prefixed with compute.googleapis.com/. That prefix has been omitted from the entries in the table. When querying a label, use the metric.labels. prefix; for example, metric.labels.LABEL="VALUE".
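
As a minimal sketch of the note above, the following Python builds a Cloud Monitoring filter string by restoring the compute.googleapis.com/ prefix and adding metric.labels clauses. The helper names (full_metric_type, build_filter, list_points), the 20-minute lookback, and the accelerator_id value "0" are illustrative assumptions, not part of this guide. list_points additionally assumes that the google-cloud-monitoring client library is installed and that application default credentials are configured.

```python
import time
from typing import Optional

# Prefix that the table omits from every metric type.
METRIC_PREFIX = "compute.googleapis.com/"


def full_metric_type(short_type: str) -> str:
    """Return the fully qualified metric type for a table entry."""
    if short_type.startswith(METRIC_PREFIX):
        return short_type
    return METRIC_PREFIX + short_type.lstrip("/")


def build_filter(short_type: str, labels: Optional[dict] = None) -> str:
    """Build a filter matching one metric type and optional label values."""
    clauses = [f'metric.type="{full_metric_type(short_type)}"']
    for name, value in (labels or {}).items():
        # Labels are addressed with the metric.labels. prefix.
        clauses.append(f'metric.labels.{name}="{value}"')
    return " AND ".join(clauses)


def list_points(project_id: str, monitoring_filter: str, lookback_s: int = 1200):
    """Fetch recent time series (assumes google-cloud-monitoring is installed)."""
    # Imported lazily so the filter helpers work without the client library.
    from google.cloud import monitoring_v3

    now = int(time.time())
    interval = monitoring_v3.TimeInterval(
        {"start_time": {"seconds": now - lookback_s}, "end_time": {"seconds": now}}
    )
    client = monitoring_v3.MetricServiceClient()
    return client.list_time_series(
        request={
            "name": f"projects/{project_id}",
            "filter": monitoring_filter,
            "interval": interval,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )


if __name__ == "__main__":
    # Hypothetical label value; real accelerator_id values depend on your VM.
    print(build_filter("instance/tpu/accelerator/duty_cycle", {"accelerator_id": "0"}))
```

The filter helpers are kept separate from the API call so that the string construction can be checked without network access or credentials.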

Metric type ^{Launch stage} (Resource hierarchy levels)
Display name

Kind, Type, Unit
Monitored resources

Description
Labels

instance/tpu/accelerator/duty_cycle ^BETA (project)
Accelerator Duty Cycle

GAUGE , DOUBLE , %
gce_instance

Percentage of time over the sample period during which the accelerator was actively processing. Values are in the range of [0,100].
accelerator_id : Device Id of Accelerator.

instance/tpu/accelerator/memory_bandwidth_utilization ^BETA (project)
Accelerator Memory Bandwidth Utilization

GAUGE , DOUBLE , %
gce_instance

Current percentage of the accelerator memory bandwidth that is being used. Computed by dividing the memory bandwidth used over a sample period by the maximum supported bandwidth over the same sample period.
accelerator_id : Device Id of Accelerator.

instance/tpu/accelerator/memory_total ^BETA (project)
Accelerator Memory Total

GAUGE , INT64 , By
gce_instance

Total accelerator memory currently allocated in bytes.
accelerator_id : Device Id of Accelerator.

instance/tpu/accelerator/memory_used ^BETA (project)
Accelerator Memory Used

GAUGE , INT64 , By
gce_instance

Total accelerator memory currently used in bytes.
accelerator_id : Device Id of Accelerator.

instance/tpu/accelerator/tensorcore_utilization ^BETA (project)
Accelerator TensorCore Utilization

GAUGE , DOUBLE , %
gce_instance

Current percentage of the TensorCore that is utilized. Computed by dividing the TensorCore operations that were performed over a sample period by the supported number of TensorCore operations over the same sample period.
accelerator_id : Device Id of Accelerator.

instance/tpu/active_chips ^BETA (project)
Active TPU Chips Count

GAUGE , INT64 , 1
gce_instance

The current count of chips that are actively being utilized (that is, not idle).
accelerator_type : Accelerator type and generation.
reservation_id : The ID of the physical machine reservation.
provisioning_model : The associated provisioning model.
protection_tier : The associated protection model.
block_id : The ID of the block within the cluster hosting the VM.
subblock_id : The ID of the sub-block hosting the VM.
is_exr : (BOOL) Indicates if the chip is part of an extended reservation.

instance/tpu/chip_state ^BETA (project)
TPU Chip State Count

GAUGE , INT64 , 1
gce_instance

The count of TPU chips in each state, such as Healthy, Unhealthy, and Unknown.
state : The state of the chip.
accelerator_type : Accelerator type and generation.
block_id : The ID of the block within the cluster hosting the VM.
subblock_id : The ID of the sub-block hosting the VM.
reservation_id : The ID of the physical machine reservation.
is_exr : (BOOL) Indicates if the chip is part of an extended reservation.

instance/tpu/infra_health ^BETA (project)
TPU Instance Health

GAUGE , INT64 , 1
gce_instance

Indicates the overall health status of a TPU instance. The metric labels help identify the specific health status and reasons for issues on degraded or unhealthy TPU instances, primarily focusing on TPU hardware and system health. Health status changes may take several minutes to be reflected in this metric. Sampled every 60 seconds. After sampling, data is not visible for up to 420 seconds.
health_status : The overall health state of the TPU instance. Possible values: HEALTHY (operating as expected), UNHEALTHY (critical issue detected), DEGRADED (performance impacting issue), UNKNOWN (status cannot be determined).
unhealthy_category : Explanation for the unhealthy VM status. This label is populated only when the value of the metric is Unhealthy.
machine_type : The machine type of the instance (e.g., ct6e-standard-4t-tpu).
machine_id : The ID of the physical machine hosting the VM.
block_id : The ID of the block within the cluster hosting the VM.
cluster_id : The ID of the cluster hosting the VM.
reservation_id : The ID of the physical machine reservation.
subblock_id : The ID of the sub-block hosting the VM.

instance/tpu/runtime/uptime ^BETA (project)
Runtime Uptime

GAUGE , INT64 , s
gce_instance

Uptime of the ML Runtime since the initialization of the runtime library (libtpu.so) by the ML job. During this period the runtime library blocks the TPU devices for use by the ML job.
ml_framework_name : Name of the ML framework.
ml_framework_version : Version of the ML framework.

instance/tpu/scheduled_chips ^BETA (project)
Scheduled TPU Chips Count

GAUGE , INT64 , 1
gce_instance

The current count of chips that are allocated to a VM that is HEALTHY and is NOT DISABLED for maintenance.
accelerator_type : Accelerator type and generation.
reservation_id : The ID of the physical machine reservation.
provisioning_model : The associated provisioning model.
protection_tier : The associated protection model.
block_id : The ID of the block within the cluster hosting the VM.
subblock_id : The ID of the sub-block hosting the VM.
is_exr : (BOOL) Indicates if the chip is part of an extended reservation.

instance/tpu/utilized_chips ^BETA (project)
Utilized TPU Chips

GAUGE , DOUBLE , 1
gce_instance

The current aggregate utilized capacity expressed as an effective number of active chips. It is equivalent to the sum of the fractional utilization (0.0 to 1.0) of all active chips.
accelerator_type : Accelerator type and generation.
reservation_id : The ID of the physical machine reservation.
provisioning_model : The associated provisioning model.
protection_tier : The associated protection model.
block_id : The ID of the block within the cluster hosting the VM.
subblock_id : The ID of the sub-block hosting the VM.
is_exr : (BOOL) Indicates if the chip is part of an extended reservation.

quota/tpus_per_tpu_family/exceeded ^ALPHA (project)
TPU count per TPU family. quota exceeded error

DELTA , INT64 , 1
compute.googleapis.com/Location

Number of attempts to exceed the limit on quota metric compute.googleapis.com/tpus_per_tpu_family. After sampling, data is not visible for up to 150 seconds.
limit_name : The limit name.
tpu_family : TPU family custom dimension.

quota/tpus_per_tpu_family/limit ^ALPHA (project)
TPU count per TPU family. quota limit

GAUGE , INT64 , 1
compute.googleapis.com/Location

Current limit on quota metric compute.googleapis.com/tpus_per_tpu_family. Sampled every 60 seconds. After sampling, data is not visible for up to 150 seconds.
limit_name : The limit name.
tpu_family : TPU family custom dimension.

quota/tpus_per_tpu_family/usage ^ALPHA (project)
TPU count per TPU family. quota usage

GAUGE , INT64 , 1
compute.googleapis.com/Location

Current usage on quota metric compute.googleapis.com/tpus_per_tpu_family. After sampling, data is not visible for up to 150 seconds.
limit_name : The limit name.
tpu_family : TPU family custom dimension.

tpu/multislice/accelerator/device_to_host_transfer_latencies ^BETA (project)
Device to Host Transfer Latencies