Debugging Cloud TPU VMs
This document describes how to use the cloud-tpu-diagnostics PyPI package to generate stack traces for processes running in TPU VMs. The package dumps Python stack traces when a fault occurs, such as a segmentation fault, a floating-point exception, or an illegal operation exception. It also collects stack traces periodically to help you debug cases where the program is unresponsive.
To use the cloud-tpu-diagnostics PyPI package, you must install it by running pip install cloud-tpu-diagnostics on all TPU VMs. You can do this with one gcloud compute tpus tpu-vm ssh command. For example:
gcloud compute tpus tpu-vm ssh your-tpu-name \
  --zone=your-zone \
  --project=your-project-name \
  --worker=all \
  --command="pip install cloud-tpu-diagnostics"
You must also add the following code to your scripts running on all TPU VMs.
from cloud_tpu_diagnostics import diagnostic
from cloud_tpu_diagnostics.configuration import debug_configuration
from cloud_tpu_diagnostics.configuration import diagnostic_configuration
from cloud_tpu_diagnostics.configuration import stack_trace_configuration

stack_trace_config = stack_trace_configuration.StackTraceConfig(
    collect_stack_trace=True,
    stack_trace_to_cloud=True)
debug_config = debug_configuration.DebugConfig(
    stack_trace_config=stack_trace_config)
diagnostic_config = diagnostic_configuration.DiagnosticConfig(
    debug_config=debug_config)
By default, stack traces are collected every 10 minutes. To change the interval between two collection events, set stack_trace_interval_seconds. For example, to collect every 5 minutes:
stack_trace_config = stack_trace_configuration.StackTraceConfig(
    collect_stack_trace=True,
    stack_trace_to_cloud=True,
    stack_trace_interval_seconds=300)
Wrap your main method with diagnose() to periodically collect the stack traces:
with diagnostic.diagnose(diagnostic_config):
    run_main()
This configuration starts collecting stack traces inside the /tmp/debugging directory on each TPU VM. An agent running on all TPU VMs uploads the traces from this temporary directory to Cloud Logging.
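While debugging, it can be useful to check on a VM whether any trace files have been written locally. The sketch below lists the files in the trace directory, newest first. The /tmp/debugging path comes from the text above; the helper name list_trace_files is hypothetical, and nothing is assumed about the file names or format that the package produces, or about how long files remain after upload.

```python
from pathlib import Path


def list_trace_files(directory="/tmp/debugging"):
    """Return regular files in `directory`, newest first (empty if missing)."""
    root = Path(directory)
    if not root.is_dir():
        # The directory may not exist until the first trace is written.
        return []
    files = [p for p in root.iterdir() if p.is_file()]
    # Sort by modification time so the most recent trace comes first.
    return sorted(files, key=lambda p: p.stat().st_mtime, reverse=True)
```

Calling list_trace_files() with no arguments inspects /tmp/debugging; passing another path makes the helper easy to try against a scratch directory.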

