LiteRT is Google's on-device framework for high-performance ML & GenAI deployment on edge platforms, using efficient conversion, runtime, and optimization.
The latest LiteRT 2.x release introduces the CompiledModel API, a modern runtime interface designed to maximize hardware acceleration. While the Interpreter API (formerly TensorFlow Lite) remains available for backward compatibility, the CompiledModel API is the recommended choice for developers seeking state-of-the-art performance in on-device AI applications.
Key LiteRT features
Streamline development with LiteRT
Automated accelerator selection instead of explicit delegate creation, plus efficient I/O buffer handling and async execution for superior performance. See the on-device inference documentation.
Best-in-class GPU performance
Powered by ML Drift, now supporting both classical ML and Generative AI models across GPU APIs. See the GPU acceleration documentation.
Unified NPU acceleration
Accelerate your model with simplified NPU access from major chipset providers. See the NPU acceleration documentation.
Superior LLM support
LiteRT delivers high-performance deployment of Generative AI models across mobile, desktop, and web platforms. See the GenAI deployment documentation.
Bindings for multiple languages
- C++, including a prebuilt LiteRT C++ binary
- Kotlin
- Python
- Rust crate
Broad ML framework support
LiteRT supports streamlined conversion from PyTorch, TensorFlow, and JAX models to the .tflite or .litertlm format. See the model conversion documentation.
Get started with the CompiledModel API
- For classical ML models, see the following demo apps:
  - Image segmentation Kotlin App: CPU/GPU/NPU inference.
  - Image segmentation C++ App: CPU/GPU/NPU inference with async execution.
- For GenAI models, see the following demo apps:
  - EmbeddingGemma semantic similarity C++ App: CPU/GPU/NPU inference.
Development workflow
LiteRT runs inference entirely on-device on Android, iOS, Web, IoT, and desktop/laptop platforms. Regardless of device, the following is the most common workflow, with links to further instructions.
Identify the most suitable solution to the ML challenge
LiteRT offers users a high level of flexibility and customizability when it comes to solving machine learning problems, making it a good fit for users who require a specific model or a specialized implementation. Users looking for plug-and-play solutions may prefer MediaPipe Tasks, which provides ready-made solutions for common machine learning tasks like object detection, text classification, and LLM inference.

Obtain and prepare the model
A LiteRT model is represented in an efficient, portable format known as FlatBuffers, which uses the .tflite file extension.
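Because a .tflite file is a FlatBuffer, an app can cheaply sanity-check a file before handing it to the runtime. A FlatBuffer begins with a 4-byte root offset, and the TensorFlow Lite schema declares the 4-byte file identifier "TFL3" that follows it. The following is a minimal C++ sketch using only the standard library; HasTfliteIdentifier and LooksLikeTfliteFile are hypothetical helpers, not LiteRT APIs, and passing the check does not prove that the model is well-formed or supported.

```cpp
#include <array>
#include <cstddef>
#include <cstring>
#include <fstream>
#include <string>

// Hypothetical helpers (not part of the LiteRT API) that sanity-check the
// FlatBuffer header of a .tflite file.

// Returns true if `data` holds the file identifier "TFL3" at byte offset 4,
// immediately after the 4-byte root offset that starts every FlatBuffer.
bool HasTfliteIdentifier(const unsigned char* data, std::size_t size) {
  constexpr std::size_t kIdentifierOffset = 4;
  constexpr std::size_t kIdentifierLength = 4;
  if (size < kIdentifierOffset + kIdentifierLength) return false;
  return std::memcmp(data + kIdentifierOffset, "TFL3", kIdentifierLength) == 0;
}

// Reads the first 8 bytes of the file at `path` and checks the identifier.
// Returns false if the file cannot be opened or is shorter than 8 bytes.
bool LooksLikeTfliteFile(const std::string& path) {
  std::ifstream file(path, std::ios::binary);
  if (!file) return false;
  std::array<unsigned char, 8> header{};
  file.read(reinterpret_cast<char*>(header.data()),
            static_cast<std::streamsize>(header.size()));
  if (file.gcount() != static_cast<std::streamsize>(header.size())) {
    return false;
  }
  return HasTfliteIdentifier(header.data(), header.size());
}
```

A check like this only catches obviously wrong inputs, such as a truncated download or a mislabeled file; real validation still happens when the runtime loads the model.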
You can obtain a LiteRT model in the following ways:
- Obtain a pre-trained model: for popular ML workloads such as image segmentation and object detection, the simplest approach is to use a LiteRT model already in the .tflite format. These models don't require any added conversion steps.

| Model type | Pre-trained model source |
|---|---|
| Classical ML (.tflite format) | Visit Kaggle or Hugging Face, e.g. image segmentation models and sample app |
| Generative AI (.litertlm format) | LiteRT Hugging Face page, e.g. the Gemma family |

- Convert your chosen PyTorch, TensorFlow, or JAX model into a LiteRT model if you choose not to use a pre-trained model. [PRO USER]

| Model framework | Sample models | Conversion tool |
|---|---|---|
| PyTorch | Hugging Face, Torchvision | Link |
| TensorFlow | Kaggle Models, Hugging Face | Link |
| JAX | Hugging Face | Link |

- Author your LLM for further optimization using the Generative API. [PRO USER]

Our Generative API library provides built-in PyTorch building blocks for composing Transformer models such as Gemma, TinyLlama, and others using mobile-friendly abstractions, through which we can guarantee conversion and performant execution on our mobile runtime, LiteRT. See the Generative API documentation.
Optimize [PRO USER]
AI Edge Quantizer is a tool for advanced developers to quantize converted LiteRT models. It helps advanced users achieve optimal performance on resource-demanding models (e.g., GenAI models).
See the AI Edge Quantizer documentation for more details.
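Quantizers of this kind typically map float tensors to 8-bit integers. The TensorFlow Lite quantization specification uses the affine relation real_value = (int8_value - zero_point) * scale. As a minimal sketch of that arithmetic only, assuming per-tensor parameters derived from a tensor's min/max range, the C++ below chooses scale and zero_point and round-trips values. It is not AI Edge Quantizer's algorithm or API, and ChooseParams, Quantize, and Dequantize are hypothetical names.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Illustrative only: a generic per-tensor affine int8 scheme, not the
// AI Edge Quantizer implementation.
struct QuantParams {
  float scale;
  int32_t zero_point;
};

// Derives parameters so that [min_val, max_val], widened to include 0.0f so
// that zero is exactly representable, maps onto the int8 range [-128, 127].
QuantParams ChooseParams(float min_val, float max_val) {
  constexpr int32_t kQMin = -128;
  constexpr int32_t kQMax = 127;
  min_val = std::min(min_val, 0.0f);
  max_val = std::max(max_val, 0.0f);
  float scale = (max_val - min_val) / static_cast<float>(kQMax - kQMin);
  if (scale == 0.0f) scale = 1.0f;  // All-zero range: any positive scale works.
  auto zero_point =
      static_cast<int32_t>(std::lround(kQMin - min_val / scale));
  zero_point = std::clamp(zero_point, kQMin, kQMax);
  return {scale, zero_point};
}

// real -> int8: q = round(x / scale) + zero_point, saturated to [-128, 127].
int8_t Quantize(float x, QuantParams p) {
  long q = std::lround(x / p.scale) + p.zero_point;
  return static_cast<int8_t>(std::clamp<long>(q, -128, 127));
}

// int8 -> real: real_value = (q - zero_point) * scale.
float Dequantize(int8_t q, QuantParams p) {
  return static_cast<float>(static_cast<int32_t>(q) - p.zero_point) * p.scale;
}
```

Widening the range to include 0.0 keeps zero exactly representable, which matters for zero padding; for in-range values the round-trip error is at most about scale / 2.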
Integrate the model into your app on edge platforms
LiteRT lets you run ML models entirely on-device with high performance across Android, iOS, Web, Desktop, and IoT platforms.
Use the following guides to integrate a LiteRT model on your preferred platform:
| Supported devices | Supported APIs |
|---|---|
| Android mobile devices | C++/Kotlin |
| iOS mobile devices, MacBooks | C++/Swift |
| Devices with Chrome, Firefox, or Safari | JavaScript |
| Linux workstations or Linux-based IoT devices | C++/Python |
| Windows workstations or laptops | C++/Python |
| Embedded devices | C++ |
The following code snippets show a basic implementation in Kotlin and C++.
Kotlin
```kotlin
// Load model and initialize runtime
val compiledModel =
    CompiledModel.create("/path/to/mymodel.tflite", CompiledModel.Options(Accelerator.CPU))

// Preallocate input/output buffers
val inputBuffers = compiledModel.createInputBuffers()
val outputBuffers = compiledModel.createOutputBuffers()

// Fill the input buffers
inputBuffers[0].writeFloat(input0)
inputBuffers[1].writeFloat(input1)

// Invoke
compiledModel.run(inputBuffers, outputBuffers)

// Read the output
val output = outputBuffers[0].readFloat()
```
C++
```cpp
// Load model and initialize runtime.
// GetEnvironment() and GetOptions() are assumed to be helpers that create the
// LiteRT environment and compilation options. LITERT_ASSIGN_OR_RETURN returns
// early on error, so this code belongs inside a function that can propagate it.
LITERT_ASSIGN_OR_RETURN(auto env, GetEnvironment());
LITERT_ASSIGN_OR_RETURN(auto options, GetOptions());
LITERT_ASSIGN_OR_RETURN(
    auto compiled_model,
    CompiledModel::Create(env, "/path/to/mymodel.tflite", options));

// Preallocate input/output buffers for the chosen model signature.
LITERT_ASSIGN_OR_RETURN(auto input_buffers,
                        compiled_model.CreateInputBuffers(signature_index));
LITERT_ASSIGN_OR_RETURN(auto output_buffers,
                        compiled_model.CreateOutputBuffers(signature_index));

// Fill the input buffers (input0 and input1 hold your app's input data).
LITERT_ABORT_IF_ERROR(input_buffers[0].Write(input0));
LITERT_ABORT_IF_ERROR(input_buffers[1].Write(input1));

// Invoke
LITERT_ABORT_IF_ERROR(
    compiled_model.Run(signature_index, input_buffers, output_buffers));

// Read the output into output0.
LITERT_ABORT_IF_ERROR(output_buffers[0].Read(output0));
```
Choose a backend
The most straightforward way to incorporate backends in LiteRT is to rely on the runtime's built-in intelligence. With the CompiledModel API, LiteRT simplifies the setup significantly by letting you specify the target backend as an option. See the on-device inference guide for more details.
| | Android | iOS / macOS | Web | Linux | Windows | IoT |
|---|---|---|---|---|---|---|
| CPU | XNNPACK | XNNPACK | XNNPACK | XNNPACK | XNNPACK | XNNPACK |
| GPU | OpenGL, OpenCL | Metal, WebGPU | WebGPU | WebGPU, OpenCL | WebGPU, OpenCL | WebGPU |
| NPU | MediaTek, Qualcomm | - | - | Qualcomm | - | Qualcomm |
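To make the table easier to act on in code, the sketch below encodes its GPU and NPU rows as data and picks the first available backend in an assumed NPU, GPU, CPU preference order, falling back to the XNNPACK CPU path that the table lists for every platform. It is a hypothetical illustration only: Platform, Backend, DeviceCaps, and PickBackend are not LiteRT APIs, and the device-capability flags are inputs your app would have to determine itself.

```cpp
// Hypothetical sketch, not a LiteRT API: encodes the backend table above and
// picks a backend using an assumed NPU -> GPU -> CPU preference order.

enum class Platform { kAndroid, kIosMacos, kWeb, kLinux, kWindows, kIot };
enum class Backend { kNpu, kGpu, kCpu };

// Every platform column in the table lists at least one GPU backend.
bool TableListsGpu(Platform /*platform*/) { return true; }

// Only the Android, Linux, and IoT columns list an NPU backend.
bool TableListsNpu(Platform platform) {
  switch (platform) {
    case Platform::kAndroid:  // MediaTek, Qualcomm
    case Platform::kLinux:    // Qualcomm
    case Platform::kIot:      // Qualcomm
      return true;
    default:
      return false;
  }
}

// Capabilities your app would detect on the current device (assumed inputs).
struct DeviceCaps {
  bool has_supported_npu = false;
  bool has_supported_gpu = false;
};

// Returns the most preferred backend that the table lists for `platform` and
// the device reports as available; CPU (XNNPACK) is the universal fallback.
Backend PickBackend(Platform platform, const DeviceCaps& caps) {
  if (TableListsNpu(platform) && caps.has_supported_npu) return Backend::kNpu;
  if (TableListsGpu(platform) && caps.has_supported_gpu) return Backend::kGpu;
  return Backend::kCpu;
}
```

In a real app the chosen backend would be passed to the runtime as its accelerator option; whether NPU should outrank GPU for a particular model is a judgment call best settled by benchmarking.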
Additional documentation and support
- LiteRT-Samples GitHub Repo for more LiteRT sample apps.
- For existing users of TensorFlow Lite, see the migration guide.
- LiteRT Tools page for performance, profiling, error reporting, etc.

