Debugging Cloud TPU VMs

This document describes how to use the cloud-tpu-diagnostics PyPI package to generate stack traces for processes running in TPU VMs. This package dumps the Python traces when a fault occurs, for example segmentation faults, floating-point exceptions, or illegal operation exceptions. Additionally, it also periodically collects stack traces to help you debug situations when the program is unresponsive.

To use the cloud-tpu-diagnostics PyPI package, you must install it by running pip install cloud-tpu-diagnostics on all TPU VMs. You can do this with one gcloud compute tpus tpu-vm ssh command. For example:

  
gcloud  
compute  
tpus  
tpu-vm  
ssh  
 you-tpu-name 
  
 \ 
  
--zone = 
 your-zone 
  
 \ 
  
--project = 
 your-project-name 
  
 \ 
  
--worker = 
all  
 \ 
  
--command = 
 "pip install cloud-tpu-diagnostics"

You must also add the following code to your scripts running on all TPU VMs.

  from 
  
 cloud_tpu_diagnostics 
  
 import 
 diagnostic 
 from 
  
 cloud_tpu_diagnostics.configuration 
  
 import 
 debug_configuration 
 from 
  
 cloud_tpu_diagnostics.configuration 
  
 import 
 diagnostic_configuration 
 from 
  
 cloud_tpu_diagnostics.configuration 
  
 import 
 stack_trace_configuration 
 stack_trace_config 
 = 
 stack_trace_configuration 
 . 
 StackTraceConfig 
 ( 
 collect_stack_trace 
 = 
 True 
 , 
 stack_trace_to_cloud 
 = 
 True 
 ) 
 debug_config 
 = 
 debug_configuration 
 . 
 DebugConfig 
 ( 
 stack_trace_config 
 = 
 stack_trace_config 
 ) 
 diagnostic_config 
 = 
 diagnostic_configuration 
 . 
 DiagnosticConfig 
 ( 
 debug_config 
 = 
 debug_config 
 )

By default, stack traces are collected every 10 minutes. You can change the duration between two stack trace collection events to 5 minutes, for example:

  stack_trace_config 
 = 
 stack_trace_configuration 
 . 
 StackTraceConfig 
 ( 
 collect_stack_trace 
 = 
 True 
 , 
 stack_trace_to_cloud 
 = 
 True 
 , 
 stack_trace_interval_seconds 
 = 
 300 
 )

Wrap your main method with diagnose() to periodically collect the stack traces:

  with 
 diagnostic 
 . 
 diagnose 
 ( 
 diagnostic_config 
 ): 
 run_main 
 ()

This configuration starts collecting stack traces inside the /tmp/debugging directory on each TPU VM. There is an agent running on all TPU VMs that uploads the traces from a temporary directory to Cloud Logging.