Monitor TPUs
This guide explains how to use Cloud Monitoring to monitor your TPU VMs. Cloud Monitoring automatically collects metrics and logs from your TPU and its host VM. You can use this data to monitor the health of your TPU and Compute Engine resources.
Metrics track a numerical quantity over time, for example, CPU utilization, network usage, or TensorCore idle duration. Logs capture events at a specific point in time. Log entries are written by your own code, Google Cloud services, third-party applications, and the Google Cloud infrastructure. You can generate metrics from the data in a log entry by creating a log-based metric, and you can set alert policies based on metric values or log entries.
To monitor TPUs, you can also use Capacity Planner (Preview). With Capacity Planner, you can view TPU usage and forecast data for your project, folder, or organization. This data updates every 24 hours, and you can use it to analyze usage trends and plan for future capacity needs. For more information, see Capacity Planner overview.
Access TPU metrics
Compute Engine generates two types of TPU metrics: TPU runtime metrics and TPU VM infrastructure metrics. You can get the metrics in two ways:
- TPU Monitoring Library: Get TPU runtime metrics from the LibTPU SDK using the TPU Monitoring Library. This enables your applications to get real-time telemetry from inside the guest environment. For more information, see TPU Monitoring Library.
- AI Telemetry Collector: Get runtime metrics and VM infrastructure metrics through the AI Telemetry Collector. The AI Telemetry Collector runs inside the TPU VM and lets you access metrics through Cloud Monitoring or through your own Prometheus-based monitoring pipeline. For more information, see AI Telemetry Collector.
TPU metrics
Google Cloud metrics for Cloud TPU are automatically generated by Compute Engine VMs and the Cloud TPU runtime. The metrics in the following table are generated by Compute Engine VMs.
The "metric type" strings in this table must be prefixed with compute.googleapis.com/. That prefix has been omitted from the entries in the table. When querying a label, use the metric.labels. prefix; for example, metric.labels.LABEL="VALUE".
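The prefix and label rules above can be sketched as a small helper that assembles a Cloud Monitoring filter string. This is a minimal sketch, not part of any Google client library: the function name build_filter and the constant METRIC_PREFIX are hypothetical, and the example label value comes from the health_status label described in the table.

```python
from typing import Dict, Optional

# The prefix that the table omits from every metric type.
METRIC_PREFIX = "compute.googleapis.com/"


def build_filter(metric_type: str, labels: Optional[Dict[str, str]] = None) -> str:
    """Return a filter string for one metric, given its short type from the table.

    Each label is addressed with the metric.labels. prefix, and all
    conditions are joined with AND.
    """
    parts = [f'metric.type="{METRIC_PREFIX}{metric_type}"']
    for name, value in sorted((labels or {}).items()):
        parts.append(f'metric.labels.{name}="{value}"')
    return " AND ".join(parts)


print(build_filter("instance/tpu/infra_health", {"health_status": "UNHEALTHY"}))
```

The resulting string can be pasted into a tool that accepts Monitoring filters; sorting the labels only keeps the output deterministic.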
Each entry lists the metric type, its launch stage and resource hierarchy level, its display name, and its labels.

instance/tpu/accelerator/duty_cycle
Launch stage: BETA (project)
Display name: Accelerator Duty Cycle
Labels:
- accelerator_id: Device Id of Accelerator.

instance/tpu/accelerator/memory_bandwidth_utilization
Launch stage: BETA (project)
Display name: Accelerator Memory Bandwidth Utilization
Labels:
- accelerator_id: Device Id of Accelerator.

instance/tpu/accelerator/memory_total
Launch stage: BETA (project)
Display name: Accelerator Memory Total
Labels:
- accelerator_id: Device Id of Accelerator.

instance/tpu/accelerator/memory_used
Launch stage: BETA (project)
Display name: Accelerator Memory Used
Labels:
- accelerator_id: Device Id of Accelerator.

instance/tpu/accelerator/tensorcore_utilization
Launch stage: BETA (project)
Display name: Accelerator TensorCore Utilization
Labels:
- accelerator_id: Device Id of Accelerator.

instance/tpu/active_chips
Launch stage: BETA (project)
Display name: Active TPU Chips Count
Labels:
- accelerator_type: Accelerator type and generation.
- reservation_id: The ID of the physical machine reservation.
- provisioning_model: The associated provisioning model.
- protection_tier: The associated protection model.
- block_id: The ID of the block within the cluster hosting the VM.
- subblock_id: The ID of the sub-block hosting the VM.
- is_exr: (BOOL) Indicates if the chip is part of an extended reservation.

instance/tpu/chip_state
Launch stage: BETA (project)
Display name: TPU Chip State Count
Labels:
- state: The state of the chip.
- accelerator_type: Accelerator type and generation.
- block_id: The ID of the block within the cluster hosting the VM.
- subblock_id: The ID of the sub-block hosting the VM.
- reservation_id: The ID of the physical machine reservation.
- is_exr: (BOOL) Indicates if the chip is part of an extended reservation.

instance/tpu/infra_health
Launch stage: BETA (project)
Display name: TPU Instance Health
Labels:
- health_status: The overall health state of the TPU instance. Possible values: HEALTHY (operating as expected), UNHEALTHY (critical issue detected), DEGRADED (performance impacting issue), UNKNOWN (status cannot be determined).
- unhealthy_category: Explanation for the unhealthy VM status. This label is populated only when the value of the metric is Unhealthy.
- machine_type: The machine type of the instance (e.g., ct6e-standard-4t-tpu).
- machine_id: The ID of the physical machine hosting the VM.
- block_id: The ID of the block within the cluster hosting the VM.
- cluster_id: The ID of the cluster hosting the VM.
- reservation_id: The ID of the physical machine reservation.
- subblock_id: The ID of the sub-block hosting the VM.

instance/tpu/runtime/uptime
Launch stage: BETA (project)
Display name: Runtime Uptime
Labels:
- ml_framework_name: Name of the ML framework.
- ml_framework_version: Version of the ML framework.

instance/tpu/scheduled_chips
Launch stage: BETA (project)
Display name: Scheduled TPU Chips Count
Labels:
- accelerator_type: Accelerator type and generation.
- reservation_id: The ID of the physical machine reservation.
- provisioning_model: The associated provisioning model.
- protection_tier: The associated protection model.
- block_id: The ID of the block within the cluster hosting the VM.
- subblock_id: The ID of the sub-block hosting the VM.
- is_exr: (BOOL) Indicates if the chip is part of an extended reservation.

instance/tpu/utilized_chips
Launch stage: BETA (project)
Display name: Utilized TPU Chips
Labels:
- accelerator_type: Accelerator type and generation.
- reservation_id: The ID of the physical machine reservation.
- provisioning_model: The associated provisioning model.
- protection_tier: The associated protection model.
- block_id: The ID of the block within the cluster hosting the VM.
- subblock_id: The ID of the sub-block hosting the VM.
- is_exr: (BOOL) Indicates if the chip is part of an extended reservation.

quota/tpus_per_tpu_family/exceeded
Launch stage: ALPHA (project)
Display name: TPU count per TPU family. quota exceeded error
Labels:
- limit_name: The limit name.
- tpu_family: TPU family custom dimension.

quota/tpus_per_tpu_family/limit
Launch stage: ALPHA (project)
Display name: TPU count per TPU family. quota limit
Labels:
- limit_name: The limit name.
- tpu_family: TPU family custom dimension.

quota/tpus_per_tpu_family/usage
Launch stage: ALPHA (project)
Display name: TPU count per TPU family. quota usage
Labels:
- limit_name: The limit name.
- tpu_family: TPU family custom dimension.

tpu/multislice/accelerator/device_to_host_transfer_latencies
Launch stage: BETA (project)
Display name: Device to Host Transfer Latencies
Labels:
- buffer_size: Buffer size.

tpu/multislice/accelerator/host_to_device_transfer_latencies
Launch stage: BETA (project)
Display name: Host to Device Transfer Latencies
Labels:
- buffer_size: Buffer size.

tpu/multislice/network/collective_end_to_end_latencies
Launch stage: BETA (project)
Display name: Collective End-to-End Latencies
Labels:
- input_size: Input size of the collective operation.
- collective_type: Type of the collective operation.

tpu/multislice/network/dcn_transfer_latencies
Launch stage: BETA (project)
Display name: DCN Transfer Latencies
Labels:
- buffer_size: Buffer size.
- type: Type.

tpu/multislice/network/grpc_client_call_latencies
Launch stage: BETA (project)
Display name: gRPC Client Call Latencies
Labels:
- buffer_size: Buffer size.

tpu/multislice/network/grpc_server_call_latencies
Launch stage: BETA (project)
Display name: gRPC Server Call Latencies
Labels:
- buffer_size: Buffer size.

tpu/multislice/network/grpc_tcp_delivery_rates
Launch stage: BETA (project)
Display name: gRPC TCP Delivery Rates

tpu/multislice/network/grpc_tcp_min_round_trip_times
Launch stage: BETA (project)
Display name: gRPC TCP Min Round Trip Times

tpu/multislice/network/grpc_tcp_packets_retransmitted_count
Launch stage: BETA (project)
Display name: gRPC TCP Packets Retransmitted Count

tpu/multislice/network/grpc_tcp_packets_sent_count
Launch stage: BETA (project)
Display name: gRPC TCP Packets Sent Count

tpu/slice/capacity/available_chips
Launch stage: BETA (project)
Display name: Available TPU Chips Count
Labels:
- accelerator_type: Accelerator type and generation.
- reservation_id: The ID of the physical machine reservation.
- block_id: The block ID associated with the slice.
- subblock_id: The subblock ID associated with the slice.
- provisioning_model: The associated provisioning model.
- protection_tier: The associated protection model.

tpu/slice/capacity/committed_chips
Launch stage: BETA (project)
Display name: Purchased TPU Chips Count
Labels:
- accelerator_type: Accelerator type and generation.
- reservation_id: The ID of the physical machine reservation.
- block_id: The block ID associated with the slice.
- subblock_id: The subblock ID associated with the slice.
- provisioning_model: The associated provisioning model.
- protection_tier: The associated protection model.

For a complete list of metrics generated by Compute Engine, see Compute Engine metrics.
AI Telemetry Collector
The AI Telemetry Collector collects and publishes TPU metrics under the compute.googleapis.com namespace for TPUs created using the Compute Engine API. These metrics are built-in system metrics, which provide visibility into health and performance.
The AI Telemetry Collector architecture is designed as a lightweight, specialized OpenTelemetry (OTEL) Collector. It uses two primary receivers to capture data:
- TPU Runtime Receiver: Scrapes runtime and workload metrics (such as duty cycle and memory usage) directly from the TPU runtime when a machine learning workload is active.
- TPU Host Receiver: Captures hardware utilization metrics, such as TensorCore Utilization and Memory Bandwidth Utilization, directly from the device regardless of whether a workload is running.
The AI Telemetry Collector then uses processors to automatically apply necessary resource tags (such as project_id, instance_id, and zone) and securely exports the telemetry directly to Cloud Monitoring.
The AI Telemetry Collector comes pre-installed in Google's TPU-optimized Ubuntu LTS images, and runs automatically when the VM boots. To use this setup, specify the official Google accelerator image project and family when creating a TPU VM instance or instance template. Once the VM starts, the AI Telemetry Collector automatically sends metrics to Cloud Monitoring dashboards.
If you're building custom operating system images, you can use the AI Telemetry Collector after installing and running the ai-telemetry-collector Docker image. For more information, see Use a custom OS image.
Configuration
The AI Telemetry Collector automatically sends metrics to Cloud Monitoring dashboards, and does not require any additional configuration steps. However, you can configure the Snap package or Docker image to add external export destinations, alter metric collection intervals, and include debugging options.
You can either replace the default configuration with a new config file, or append an additional configuration file to the existing default configuration. When adding configurations, keys that don't already exist are added and keys that already exist are overwritten. However, arrays and lists are not additive, so new lists must include both existing and new values.
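The merge rules described above (new keys added, existing keys overwritten, lists replaced rather than extended) can be illustrated with a short sketch. This is an assumption-laden illustration, not the collector's actual implementation: the function merge_config is hypothetical, and the contents of the default dictionary are a made-up stand-in for the real default configuration.

```python
from copy import deepcopy


def merge_config(base: dict, extra: dict) -> dict:
    """Merge extra into a copy of base.

    Mappings are merged key by key. Scalars and lists in extra
    replace the value in base; lists are never concatenated.
    """
    merged = deepcopy(base)
    for key, value in extra.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical default configuration, for illustration only.
default = {
    "exporters": {"googlecloud": {}},
    "service": {"pipelines": {"metrics": {"exporters": ["googlecloud"]}}},
}
# The additional file must repeat "googlecloud" because the list is replaced.
extra = {
    "exporters": {"prometheus": {"endpoint": "0.0.0.0:8889"}},
    "service": {"pipelines": {"metrics": {"exporters": ["prometheus", "googlecloud"]}}},
}

merged = merge_config(default, extra)
print(sorted(merged["exporters"]))
print(merged["service"]["pipelines"]["metrics"]["exporters"])
```

If the additional file listed only prometheus in the pipeline, the merged pipeline would no longer export to googlecloud, which is why new lists must carry the existing values too.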
The following YAML file configures the AI Telemetry Collector to send metrics to Prometheus, an open-source systems monitoring and alerting toolkit. It also enables the debug exporter, which prints metrics to the console.

    exporters:
      prometheus:
        endpoint: 0.0.0.0:8889
    service:
      pipelines:
        metrics:
          exporters:
            - prometheus   # For more: https://prometheus.io/docs/introduction/overview/
            - googlecloud  # If you do not include this, you'll lose Google Cloud Monitoring
            - debug        # print metrics within the console
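With the Prometheus exporter listening on port 8889 as configured above, the exposed metrics can be read by any client that understands the Prometheus text format. The following is a minimal sketch of such a reader. It assumes the exporter serves the conventional /metrics path on localhost, which the source does not state, and the metric names in the example text are invented for illustration.

```python
import re
from urllib.request import urlopen

# name, optional {labels}, then the sample value.
SAMPLE_LINE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_PAIR = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_exposition(text):
    """Parse Prometheus text format into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue  # ignore lines this simple parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_PAIR.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def scrape(url="http://localhost:8889/metrics"):
    """Fetch and parse the endpoint. The URL path is an assumption."""
    with urlopen(url, timeout=5) as response:
        return parse_exposition(response.read().decode("utf-8"))


# Invented sample text; real metric names will differ.
example = """# TYPE example_duty_cycle gauge
example_duty_cycle{accelerator_id="0"} 87.5
example_uptime_seconds 1200
"""
for name, labels, value in parse_exposition(example):
    print(name, labels, value)
```

In a real pipeline a Prometheus server would scrape the endpoint directly; the parser here is only meant for quick checks from inside the VM.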
Default OS
If you are using Google's TPU-optimized Ubuntu LTS images, run the following Snap command to add the new config file to the existing configuration:

    sudo snap set \
      ai-telemetry-collector \
      extra-flags="--config /home/username/additional-config.yaml"
If you want to overwrite and replace the existing configuration, use the config-path flag instead of extra-flags:

    sudo snap set \
      ai-telemetry-collector \
      config-path="/home/username/new-config.yaml"
The snap set command should trigger an automatic restart of the AI Telemetry Collector. To verify that the collector has restarted and successfully applied your configurations, use the following command to view the logs:

    sudo snap logs -f ai-telemetry-collector
Custom OS
If you are using a custom OS, run the following Docker command to add the new config file to the existing configuration:

    # First apply the default configs via `--config=/etc/ai-telemetry-collector/config.yaml`
    # Then apply your additional config by volume mount.
    docker run --privileged --net=host \
      -v <path>/additional-config.yaml:/etc/ai-telemetry-collector/additional-config.yaml \
      ai-telemetry-collector:latest \
      --config=/etc/ai-telemetry-collector/config.yaml \
      --config=/etc/ai-telemetry-collector/additional-config.yaml
If you want to overwrite and replace the existing configuration, use the following Docker command:

    # Mount a volume (your config file) to `/etc/ai-telemetry-collector/config.yaml`
    # The binary automatically picks up this file.
    docker run --privileged --net=host \
      -v <path>/my-config.yaml:/etc/ai-telemetry-collector/config.yaml \
      ai-telemetry-collector:latest
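The two Docker invocations in this section differ only in where the config file is mounted and whether extra --config flags are passed. As a rough sketch for launching the collector from a script, the helper below builds either argument list. The function docker_command and the host path in the example are hypothetical; the image name, container paths, and flags are copied from the commands in this section.

```python
import shlex

IMAGE = "ai-telemetry-collector:latest"
CONFIG_DIR = "/etc/ai-telemetry-collector"


def docker_command(host_config: str, replace_default: bool) -> list:
    """Build the docker run argument list for either configuration mode."""
    base = ["docker", "run", "--privileged", "--net=host"]
    if replace_default:
        # Mount over the default path; the collector picks it up automatically.
        return base + ["-v", f"{host_config}:{CONFIG_DIR}/config.yaml", IMAGE]
    # Keep the default config and layer the additional file on top of it.
    return base + [
        "-v", f"{host_config}:{CONFIG_DIR}/additional-config.yaml",
        IMAGE,
        f"--config={CONFIG_DIR}/config.yaml",
        f"--config={CONFIG_DIR}/additional-config.yaml",
    ]


# Hypothetical host path, for illustration only.
print(shlex.join(docker_command("/opt/tpu/additional-config.yaml", replace_default=False)))
```

Returning a list rather than a single string lets the result be passed to subprocess.run without shell quoting concerns; shlex.join is used here only to display it.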
Audit logs
Google Cloud services generate audit logs that record administrative and access activities within your Google Cloud resources. For more information, see Compute Engine audit logging.

