Monitor Cloud TPU health
Cloud TPU health monitoring provides real-time information on the health status of TPU VMs and slices. Using either the Cloud Monitoring console or the Google Cloud CLI, you can identify hardware failures as they occur and reallocate resources, avoiding job failures and minimizing performance degradation.
TPU health monitoring assigns either a HEALTHY or UNHEALTHY status to each TPU instance in use. TPUs functioning as expected are assigned a HEALTHY status, and those that are no longer functional or are performing at a severely reduced level are assigned an UNHEALTHY status.
Unhealthy TPU detection
When TPU health monitoring assigns an UNHEALTHY status to a TPU, it provides identifying information about the failing TPU through the compute.googleapis.com/instance/tpu/infra_health service endpoint.
TPU health monitoring provides the following metric labels to identify the specific unhealthy TPU:
| Label name | Type | Description |
|---|---|---|
| health_status | STRING | The overall health state of the TPU instance. |
| unhealthy_category | STRING | The health status cause. This label is populated only when the value of the metric is UNHEALTHY. |
| machine_type | STRING | The Compute Engine machine type of the instance. |
| reservation_id | STRING | The ID of the physical machine reservation. |
TPU health monitoring also provides the following resource labels:
| Label name | Type | Description |
|---|---|---|
| project | STRING | The project number. |
| service | STRING | The API service (compute.googleapis.com). |
| resource_type | STRING | The VM instance. |
| location | STRING | The zone of the instance. |
| resource_id | STRING | The Compute Engine instance ID. |
TPU health monitoring provides the following additional metric labels for All Capacity mode reservations through TPU Cluster Director:
| Label name | Type | Description |
|---|---|---|
| machine_id | STRING | The ID of the physical machine hosting the VM. |
| block_id | STRING | The ID of the block within the cluster hosting the VM. |
| cluster_id | STRING | The ID of the cluster hosting the VM. |
| subblock_id | STRING | The ID of the sub-block hosting the VM. |
Monitoring with the console
The Monitoring Dashboard in the Google Cloud console provides real-time visualizations of machine health status, historical trends, and total counts.
To view a prebuilt dashboard in Cloud Monitoring:
- In the Google Cloud console, go to the Cloud Monitoring page.
- In the navigation pane, click Dashboards.
- In the Filter search field, enter "TPU Bad Node".
The dashboard displays the following metrics by default:
- Overall TPU instance health: The percentage of TPU instances that are categorized as HEALTHY.
- Number of instances: The total number of TPU instances in use.
- Number of healthy TPU instances: The total number of TPU instances categorized as HEALTHY.
- Number of unhealthy TPU instances: The total number of TPU instances categorized as UNHEALTHY.
- VM infra health distribution: The number of HEALTHY instances over time.
To add a custom query, click Add Query and write a PromQL query. For example, the following query counts all unhealthy TPU instances:
    count({__name__="compute.googleapis.com/instance/tpu/infra_health", monitored_resource="gce_instance", health_status="UNHEALTHY"})
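The metric labels listed earlier can also be used to break this count down. The following is a minimal sketch, not an official tool: build_health_query is a hypothetical helper name, and the sketch assumes that metric labels such as unhealthy_category can be used as PromQL grouping labels in your project.

```shell
# Hypothetical helper (not part of the gcloud CLI): print a PromQL query that
# counts TPU instances with the given health status, optionally grouped by
# one label from the tables above.
build_health_query() {
  local status="$1"        # HEALTHY or UNHEALTHY
  local group_by="${2:-}"  # optional label, for example unhealthy_category
  local selector
  selector="{__name__=\"compute.googleapis.com/instance/tpu/infra_health\", monitored_resource=\"gce_instance\", health_status=\"${status}\"}"
  if [ -n "${group_by}" ]; then
    printf 'count by (%s) (%s)\n' "${group_by}" "${selector}"
  else
    printf 'count(%s)\n' "${selector}"
  fi
}

# Count unhealthy instances per failure cause.
PROMQL_QUERY="$(build_health_query UNHEALTHY unhealthy_category)"
echo "${PROMQL_QUERY}"
```

Called with only a status, the helper prints the same query as the example above. Other labels, such as machine_type, can be passed as the second argument; the Cluster Director labels are likely to be populated only for All Capacity mode reservations.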
For more information on monitoring with the Google Cloud console, see Monitor Cloud TPU VMs.
Monitoring with the CLI
The Google Cloud CLI provides real-time TPU health data. The following example request counts all UNHEALTHY TPU instances:
    # Assumes PROJECT_ID is already set to your Google Cloud project ID.
    export TOKEN=$(gcloud auth application-default print-access-token)
    export PROMQL_QUERY='count({__name__="compute.googleapis.com/instance/tpu/infra_health", monitored_resource="gce_instance", health_status="UNHEALTHY"})'
    export LOCATION="global"

    # Send the PromQL query to the Cloud Monitoring Prometheus query API.
    curl -G \
      --header "Authorization: Bearer ${TOKEN}" \
      --header "Content-Type: application/json" \
      --header "X-Goog-User-Project: ${PROJECT_ID}" \
      --data-urlencode "query=${PROMQL_QUERY}" \
      "https://monitoring.googleapis.com/v1/projects/${PROJECT_ID}/location/${LOCATION}/prometheus/api/v1/query"
You should expect to see a response similar to the following:
    {"status":"success","data":{"resultType":"vector","result":[65]}}
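To use the count in a script, the value can be pulled out of the response. The following is a minimal sketch that assumes jq is installed; extract_count is a hypothetical helper name. It accepts a bare number in the result array, as in the sample response, and also the standard Prometheus instant-vector shape, where each result is an object with a [timestamp, value] pair. An empty result is treated as 0.

```shell
# Hypothetical helper: read a query response on stdin and print the first
# result value. Requires jq.
extract_count() {
  jq -r '.data.result[0] // 0 | if type == "object" then .value[1] else . end'
}

# Sample response copied from the example output.
RESPONSE='{"status":"success","data":{"resultType":"vector","result":[65]}}'
UNHEALTHY_COUNT="$(printf '%s' "${RESPONSE}" | extract_count)"
echo "Unhealthy TPU instances: ${UNHEALTHY_COUNT}"
```

In practice, the output of the curl request can be piped directly into extract_count instead of using a stored sample.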
For more information on the gcloud CLI, see gcloud CLI overview.

