Troubleshooting JAX - TPU
This guide provides pointers to JAX troubleshooting information to help you identify and resolve problems you might encounter while training JAX models on Cloud TPU.
For a more general guide to getting started with Cloud TPU, see the JAX quickstart.
General JAX issues
If you run into issues while developing your training model or training with JAX, see the JAX FAQ.
For more general programming errors you might encounter when writing a training application with JAX, see JAX Errors.
Profile JAX performance
To see how your TPU resources are being used, use the tools described in Profiling JAX performance.
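As a rough illustration of what capturing a profile can look like, the following sketch wraps a small jitted function in jax.profiler.trace. The step function and the temporary output directory are placeholders invented for this example, not part of any real training program.

```python
import tempfile

import jax
import jax.numpy as jnp

# Hypothetical output directory for this example; any writable path works.
trace_dir = tempfile.mkdtemp(prefix="jax-trace-")


@jax.jit
def step(x):
    # Stand-in for a real training step.
    return jnp.tanh(x @ x.T).sum()


x = jnp.ones((256, 256))

# Run once before tracing so that compilation is not part of the profile.
step(x).block_until_ready()

with jax.profiler.trace(trace_dir):
    result = step(x)
    # JAX dispatches work asynchronously, so wait for the device to finish
    # before the trace context closes.
    result.block_until_ready()

print("Trace written under", trace_dir)
```

The call to block_until_ready() matters here: without it, the trace can end before the device has finished the work you wanted to measure.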
Troubleshoot memory issues
You can monitor how memory is used with the JAX device memory profiler, but you cannot directly manage how it is used.
The JAX device memory profiler can be used to:
- Figure out which arrays and executables are in TPU memory at a given time, or
- Track down memory leaks.
You cannot specify how TPU memory is allocated for specific operations. For more information on JAX-specific TPU performance issues, see Performance Notes for using TPUs with JAX.
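A minimal sketch of taking a device memory snapshot with jax.profiler.save_device_memory_profile is shown below. The array name and the output path are assumptions made for this example.

```python
import os
import tempfile

import jax
import jax.numpy as jnp

# Keep a reference to an array so that it is live when the snapshot is taken.
weights = jnp.ones((1024, 1024))
weights.block_until_ready()

# Hypothetical output path for this example.
profile_path = os.path.join(tempfile.mkdtemp(), "memory.prof")

# Write a snapshot of current device memory usage in pprof format.
jax.profiler.save_device_memory_profile(profile_path)
print("Wrote", profile_path)
```

The snapshot is written in pprof format, so you can inspect it with the pprof tool. Taking snapshots at two points in a program and comparing them is one way to look for a leak.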
Troubleshoot TPU issues
The following sections describe how to resolve some common issues you might encounter when you run a JAX program on a TPU.
How can I verify that the TPU is running?
JAX runs everything on the TPU unless it prints "No GPU/TPU found, falling back to CPU."
To verify that the TPU is active, either inspect jax.devices(), which should list several TPU devices, or run assert jax.devices()[0].platform == 'tpu' to check programmatically.
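The checks above can be combined into a short script. This is a sketch only; the helper name running_on_tpu is made up for this example.

```python
import jax


def running_on_tpu() -> bool:
    """Return True if the first JAX device is a TPU (hypothetical helper)."""
    return jax.devices()[0].platform == "tpu"


# On a TPU VM this should list several TPU devices.
print("Devices:", jax.devices())
print("Default backend:", jax.default_backend())

if running_on_tpu():
    print("JAX is using the TPU.")
else:
    print("JAX is NOT using the TPU; computations will run on",
          jax.default_backend())
```

Returning a boolean instead of asserting lets a launcher script decide whether to stop or continue when no TPU is found.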
RuntimeError: Unable to initialize backend 'tpu': UNAVAILABLE: No TPU Platform available.
This runtime error, or the following message in /tmp/tpu_logs/tpu_driver.WARNING on the TPU VM, can indicate that you are running the wrong TPU VM version:
W1118 17:40:20.985243 23901 tpu_version_flag.cc:57] No hardware is found. Using default TPU version:xxxxxx
Verify that you are running the current JAX runtime version and retry.
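Before you retry, it can help to record which JAX and jaxlib versions are installed and whether the TPU backend initializes at all. The following is a sketch under that assumption; it does not check the TPU VM software version itself, and the helper name tpu_devices_or_none is invented for this example.

```python
import jax
import jaxlib

print("jax version:   ", jax.__version__)
print("jaxlib version:", jaxlib.__version__)


def tpu_devices_or_none():
    """Return the TPU device list, or None if the TPU backend cannot start."""
    try:
        return jax.devices("tpu")
    except RuntimeError as err:
        # This is the failure mode described in this section.
        print("TPU backend unavailable:", err)
        return None


tpus = tpu_devices_or_none()
if tpus is not None:
    print("Found", len(tpus), "TPU device(s).")
```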
Troubleshoot TPU and GKE issues
To help with troubleshooting, enable verbose logging by setting the following environment variables in your GKE workload manifest, and then provide the logs to GKE support:
TPU_MIN_LOG_LEVEL=0 TF_CPP_MIN_LOG_LEVEL=0 TPU_STDERR_LOG_LEVEL=0
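In a Kubernetes container spec, environment variables are listed as name and value pairs, and each value must be a string. As an illustration, the following sketch builds that list for the three variables above and prints it as JSON. The helper name container_env is hypothetical, and where the env list belongs in your manifest depends on your workload.

```python
import json

# The verbose-logging variables from this section. The values are strings
# because Kubernetes requires env values to be strings.
VERBOSE_TPU_LOGGING = {
    "TPU_MIN_LOG_LEVEL": "0",
    "TF_CPP_MIN_LOG_LEVEL": "0",
    "TPU_STDERR_LOG_LEVEL": "0",
}


def container_env(variables):
    """Convert a dict into the name/value list used by a container spec."""
    return [{"name": name, "value": str(value)}
            for name, value in variables.items()]


env_entries = container_env(VERBOSE_TPU_LOGGING)
print(json.dumps({"env": env_entries}, indent=2))
```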
The following sections describe error messages related to TPU and GKE setups and how to resolve them.
no endpoints available for service 'jobset-webhook-service'
This error means that JobSet wasn't installed properly. Check whether the Pods of the jobset-controller-manager deployment are running. For more information, see the JobSet troubleshooting documentation.
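One way to run that check from a script is sketched below. It assumes that JobSet was installed into the jobset-system namespace, which is an assumption of this example rather than something this guide states, and that kubectl is already configured for your cluster.

```python
import shutil
import subprocess


def jobset_pods_command(namespace="jobset-system"):
    """Build the kubectl command that lists Pods in the JobSet namespace.

    The default namespace is an assumption; change it if JobSet was
    installed elsewhere.
    """
    return ["kubectl", "get", "pods", "--namespace", namespace]


command = jobset_pods_command()
print("Command:", " ".join(command))

if shutil.which("kubectl"):
    try:
        completed = subprocess.run(
            command, capture_output=True, text=True, check=False, timeout=30
        )
        # Look for jobset-controller-manager Pods in the Running state.
        print(completed.stdout or completed.stderr)
    except subprocess.TimeoutExpired:
        print("kubectl did not respond within 30 seconds.")
else:
    print("kubectl was not found on PATH; run the command above manually.")
```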
TPU initialization failed: Failed to connect
Make sure your GKE node version is 1.30.4-gke.1348000 or later (GKE 1.31 is not supported).
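If you check node versions in a script, the rule above can be approximated as follows. This sketch assumes that version strings have the form MAJOR.MINOR.PATCH-gke.BUILD and can be compared field by field. The function names are hypothetical.

```python
import re

# Minimum node version named in this section: 1.30.4-gke.1348000.
MINIMUM_VERSION = (1, 30, 4, 1348000)


def parse_gke_version(version):
    """Parse 'MAJOR.MINOR.PATCH-gke.BUILD' into a tuple of four integers."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)-gke\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized GKE version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def node_version_ok(version):
    """Return True if the version meets the minimum and is not GKE 1.31."""
    parsed = parse_gke_version(version)
    if parsed[:2] == (1, 31):
        return False  # GKE 1.31 is not supported.
    return parsed >= MINIMUM_VERSION


for candidate in ("1.30.4-gke.1348000", "1.30.3-gke.2000000", "1.31.0-gke.1000000"):
    print(candidate, "->", "ok" if node_version_ok(candidate) else "not supported")
```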