Improve your model's performance with bfloat16
By default, TPUs perform matrix multiplication operations with `bfloat16` values and accumulations with IEEE `float32` values. Using reduced-precision floating point numbers decreases time to convergence without losing accuracy.

The dynamic ranges of `bfloat16` and `float32` are equivalent, but `bfloat16` uses half the memory. For more information about `bfloat16` performance, see [A Study of BFLOAT16 for Deep Learning Training](https://arxiv.org/abs/1905.12322).
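The following minimal sketch illustrates this trade-off, assuming JAX and its bundled `bfloat16` dtype; it runs the same on CPU, GPU, or TPU, and the values are illustrative only:

```python
import jax.numpy as jnp

# Same dynamic range: a value near float32's maximum stays finite in bfloat16.
x = jnp.array(3.0e38, dtype=jnp.float32)
print(x.astype(jnp.bfloat16))        # ~3.0e+38, not inf

# Coarser precision: only 8 significant bits, so small increments round away.
print(jnp.array(1.0, jnp.bfloat16) + jnp.array(0.001, jnp.bfloat16))  # 1.0

# Half the memory per element: 2 bytes instead of 4.
print(jnp.zeros(1024, jnp.bfloat16).nbytes, jnp.zeros(1024, jnp.float32).nbytes)
```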
Use bfloat16 explicitly
While automatic format conversion in TPUs lets you avoid thinking about numerical precision, you can achieve performance improvements by explicitly casting values to `bfloat16`. There are two reasons to explicitly cast values to `bfloat16`:

1. Storing values in `bfloat16` format saves on-chip memory, enabling Cloud TPUs to train larger models or use larger batch sizes.

2. Some operations are memory-bandwidth-bound, which means the amount of time it takes to load data from memory can slow down the overall time spent performing the computation. Storing operands and outputs of those operations in `bfloat16` format reduces the amount of data that must be transferred, improving overall speed, as shown in the sketch after this list.
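Here is a minimal JAX sketch of explicit casting; the shapes and the `params` and `batch` names are illustrative, not taken from any particular model:

```python
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
params = jax.random.normal(key, (1024, 1024), dtype=jnp.float32)  # e.g. a weight matrix
batch = jax.random.normal(key, (256, 1024), dtype=jnp.float32)    # e.g. activations

# Storing the operands in bfloat16 halves both the memory they occupy and the
# bytes a bandwidth-bound operation has to move.
params_bf16 = params.astype(jnp.bfloat16)
batch_bf16 = batch.astype(jnp.bfloat16)

@jax.jit
def forward(x, w):
    # Request a float32 result so the matmul's float32 accumulation is kept
    # for downstream operations.
    return jnp.dot(x, w, preferred_element_type=jnp.float32)

out = forward(batch_bf16, params_bf16)
print(out.dtype, out.shape)  # float32 (256, 1024)
```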
To get started, we recommend getting some experience with one of the [Cloud TPU reference models](/tpu/docs/tutorials/supported-models). After that, the [profiling tools guide](/tpu/docs/profile-tpu-vm) and [troubleshooting guide](/tpu/docs/troubleshooting/troubleshooting) provide in-depth technical information to help you create and optimize machine learning models on your own.

Format conversion details
The format conversion from `float32` to `bfloat16` is automatically inserted by the XLA compiler. On TPU, the rounding scheme in the conversion is round to nearest even, and values that overflow become `inf`. Also, `bfloat16` on Cloud TPU does not support subnormals, so all subnormals are flushed to zero during the conversion. Special values, such as `NaN` and `inf`, are preserved in the conversion.

The format conversion from `bfloat16` to `float32` is also automatically inserted by the XLA compiler. Since `float32` can represent all exact values in `bfloat16`, the conversion pads the mantissa with 16 zero bits. Special values are preserved in the conversion.
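The following sketch illustrates the rounding, overflow, and special-value behavior using the `bfloat16` dtype that JAX exposes; run on CPU it mirrors the rules described above, although the TPU-specific flushing of subnormals may not be reproduced:

```python
import jax.numpy as jnp

f32 = jnp.array([1.0000001,     # rounds to the nearest representable bfloat16 (1.0)
                 3.5e38,        # above bfloat16's max finite value (~3.39e38): becomes inf
                 float("nan"),  # special values are preserved
                 float("inf")],
                dtype=jnp.float32)
print(f32.astype(jnp.bfloat16))   # [1 inf nan inf]

# bfloat16 -> float32 is exact: the extra 16 mantissa bits are zero-filled.
bf16 = jnp.array([1.5, -2.25], dtype=jnp.bfloat16)
print(bf16.astype(jnp.float32))   # [ 1.5  -2.25]
```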
Checkpoints obtained from a model trained on Cloud TPUs can be deployed on other
hardware platforms (for example, inference or fine-tuning on CPUs or GPUs)
without extensive manual conversions.
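For example, if the target framework or hardware has no native `bfloat16` kernels, upcasting the restored parameters is a one-line tree map rather than an extensive conversion. This is a hypothetical sketch; `restore_checkpoint` and `checkpoint_dir` stand in for whatever checkpointing library and path your model uses:

```python
import jax
import jax.numpy as jnp

def to_float32(params):
    """Upcast every bfloat16 leaf of a parameter pytree to float32 (lossless)."""
    return jax.tree_util.tree_map(
        lambda x: x.astype(jnp.float32) if x.dtype == jnp.bfloat16 else x,
        params)

# params = restore_checkpoint(checkpoint_dir)  # hypothetical loader
# params_f32 = to_float32(params)              # ready for CPU or GPU fine-tuning
```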
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-09-04 UTC."],[],[],null,["# Improve your model's performance with bfloat16\n==============================================\n\nBy default, TPUs perform\nmatrix multiplication operations with [`bfloat16`](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format)\nvalues and accumulations with IEEE [`float32`](https://en.wikipedia.org/wiki/Single-precision_floating-point_format)\nvalues. Using reduced-precision floating point numbers decreases time to\nconvergence without losing accuracy.\n\nThe dynamic range of `bfloat16` and `float32` are equivalent. However, `bfloat16`\nuses half of the memory space. For more information about `bfloat16` performance,\nsee [A Study of BFLOAT16 for Deep Learning Training](https://arxiv.org/abs/1905.12322).\n\nUse bfloat16 explicitly\n-----------------------\n\nWhile automatic format conversion in TPUs lets you avoid thinking about numerical\nprecision, you can achieve performance improvements by explicitly casting values\nto `bfloat16`. There are two reasons to explicitly cast values to `bfloat16`:\n\n1. Storing values in `bfloat16` format saves on-chip memory, enabling Cloud TPUs\n to train larger models or use larger batch sizes.\n\n2. Some operations are memory-bandwidth-bound, which means the amount of time it\n takes to load data from memory can slow down the overall time spent performing\n the computation. Storing operands and outputs of those operations in `bfloat16`\n format reduces the amount of data that must be transferred, improving overall\n speed.\n\nTo get started, we recommend getting some experience with one of the\n[Cloud TPU reference models](/tpu/docs/tutorials/supported-models). After\nthat, the [profiling tools guide](/tpu/docs/profile-tpu-vm), and\n[troubleshooting guide](/tpu/docs/troubleshooting/troubleshooting) provide\nin-depth technical information to help you create and optimize machine learning\nmodels on your own.\n\n### Format conversion details\n\nThe format conversion from `float32` to `bfloat16` is automatically inserted by the\nXLA compiler. On TPU, the rounding scheme in the conversion is\n[round to nearest even](https://en.wikipedia.org/wiki/IEEE_754#Roundings_to_nearest)\nand overflow to `inf`. Also, the `bfloat16` on Cloud TPU does not support\nsubnormals, so all subnormals are flushed to zero during the conversion.\nSpecial values, such as `NaN` and `inf`, are preserved in the conversion.\n\nThe format conversion from `bfloat16` to `float32` is also automatically inserted\nby the XLA compiler. Since `float32` can represent all exact values in `bfloat16`,\nthe conversion pads 16 zeros in the mantissa bits. Special values are\npreserved in the conversion.\n\nCheckpoints obtained from a model trained on Cloud TPUs can be deployed on other\nhardware platforms (for example, inference or fine-tuning on CPUs or GPUs)\nwithout extensive manual conversions."]]