The Python API of LiteRT-LM runs on Linux, macOS, and Windows. It supports features such as multi-modality, tool use, and GPU and NPU acceleration.
Introduction
Here is a sample terminal chat app built with the Python API:

    import litert_lm

    litert_lm.set_min_log_severity(litert_lm.LogSeverity.ERROR)  # Hide log for TUI app

    with litert_lm.Engine("path/to/model.litertlm") as engine:
        with engine.create_conversation() as conversation:
            while True:
                user_input = input("\n>>> ")
                for chunk in conversation.send_message_async(user_input):
                    print(chunk["content"][0]["text"], end="", flush=True)

Getting Started
LiteRT-LM is available as a Python library. You can install the package from PyPI:

    # Using pip
    pip install litert-lm-api

    # Using uv
    uv pip install litert-lm-api

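After installing, one quick way to confirm that the package is visible to your interpreter is to look up its module spec without importing it. This is a minimal sketch: the helper name is_installed is hypothetical, and it assumes only the import name litert_lm used in the examples on this page.

```python
import importlib.util


def is_installed(module_name: str = "litert_lm") -> bool:
    """Return True if a top-level module can be found on the import path."""
    # find_spec returns None for a missing top-level module instead of raising.
    return importlib.util.find_spec(module_name) is not None


if is_installed():
    print("litert_lm is importable")
else:
    print("litert_lm not found; try: pip install litert-lm-api")
```
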
Initialize the Engine
The Engine is the entry point to the API. It handles model loading and resource management. Using it as a context manager (in a with statement) ensures that resources are released promptly.
Note: Initializing the engine can take several seconds while the model loads.

    import litert_lm

    # Initialize with the model path and optionally specify the backend.
    # backend can be Backend.CPU() (default), Backend.GPU() or Backend.NPU().
    with litert_lm.Engine(
        "path/to/your/model.litertlm",
        backend=litert_lm.Backend.GPU(),
        # Optional: Pick a writable dir for caching compiled artifacts.
        # cache_dir="/tmp/litert-lm-cache"
    ) as engine:
        # ... Use the engine to create a conversation ...
        pass

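To illustrate why the with statement matters, the sketch below uses a small stand-in class rather than the real Engine. FakeEngine and its released flag are hypothetical names, used only to show that cleanup runs even when the body raises.

```python
class FakeEngine:
    """Hypothetical stand-in for an object that holds native resources."""

    def __init__(self, model_path: str) -> None:
        self.model_path = model_path
        self.released = False

    def __enter__(self) -> "FakeEngine":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        # Runs on normal exit and on exceptions alike.
        self.released = True
        return False  # Do not swallow exceptions.


engine = FakeEngine("path/to/your/model.litertlm")
try:
    with engine:
        raise RuntimeError("simulated failure while using the engine")
except RuntimeError:
    pass

print(engine.released)  # True
```
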
Create a Conversation
A Conversation manages the state and history of your interaction with the model.

    # Optional: Configure system instruction and initial messages
    messages = [litert_lm.Message.system("You are a helpful assistant.")]

    # Create the conversation
    with engine.create_conversation(messages=messages) as conversation:
        # ... Interact with the conversation ...
        pass

Sending Messages
You can send messages synchronously or asynchronously (streaming).
The send_message and send_message_async methods accept:
- A str (automatically wrapped as a user message).
- A litert_lm.Contents object (for multi-modal inputs).
- A litert_lm.Message object (for full message structure).
- A JSON-like dictionary object as prompt template input.
Synchronous Example:

    # Simple string input
    response = conversation.send_message("What is the capital of France?")
    print(response["content"][0]["text"])

    # Or with a Message object
    # response = conversation.send_message(litert_lm.Message.user("What is the capital of France?"))

Asynchronous (Streaming) Example:

    # send_message_async returns an iterator of response chunks
    stream = conversation.send_message_async("Tell me a long story.")
    for chunk in stream:
        # Chunks are dictionaries containing pieces of the response
        for item in chunk.get("content", []):
            if item.get("type") == "text":
                print(item["text"], end="", flush=True)
    print()

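If you need the full reply as a single string in addition to the live output, the chunks can be accumulated while streaming. The sketch below assumes only the chunk shape used above (a dictionary with a "content" list whose text items carry "type" and "text" keys). collect_text is a hypothetical helper, and fake_stream stands in for the iterator returned by send_message_async.

```python
from typing import Any, Dict, Iterable, List


def collect_text(stream: Iterable[Dict[str, Any]]) -> str:
    """Join the text pieces of all chunks, skipping non-text items."""
    parts: List[str] = []
    for chunk in stream:
        for item in chunk.get("content", []):
            if item.get("type") == "text":
                parts.append(item["text"])
    return "".join(parts)


# Hand-written chunks standing in for a real model stream.
fake_stream = [
    {"content": [{"type": "text", "text": "Once upon "}]},
    {"content": [{"type": "text", "text": "a time."}]},
    {"content": []},  # A chunk without text items is ignored.
]

full_text = collect_text(fake_stream)
print(full_text)
```

With a real conversation, the same helper could be called as collect_text(conversation.send_message_async(...)), at the cost of losing the incremental display.
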
🔴 New: Multi-Token Prediction (MTP)
Multi-Token Prediction (MTP) is a performance optimization that significantly increases decode speed. It is recommended for all tasks on GPU backends.
To use MTP, enable speculative decoding when initializing the engine.

    import litert_lm

    # Enable MTP by setting enable_speculative_decoding=True
    with litert_lm.Engine(
        "path/to/your/model.litertlm",
        backend=litert_lm.Backend.GPU(),
        enable_speculative_decoding=True,
    ) as engine:
        with engine.create_conversation() as conversation:
            response = conversation.send_message("What is the capital of France?")
            print(response["content"][0]["text"])

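To check what speculative decoding gains on your own hardware, you can time the same prompt with the option on and off. This is a minimal sketch using only the standard library: timed is a hypothetical helper, and the lambda is a stand-in workload where a call such as conversation.send_message(...) would go.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed(fn: Callable[[], T]) -> Tuple[T, float]:
    """Call fn once and return its result with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


# Stand-in workload; replace with a real send_message call to compare runs.
result, seconds = timed(lambda: sum(range(100_000)))
print(f"took {seconds:.4f}s")
```

A single run is noisy, so repeating the measurement a few times and comparing like for like (same prompt, same backend) gives a fairer picture.
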
Multi-Modality

    # Initialize with vision and/or audio backends if needed
    with litert_lm.Engine(
        "path/to/multimodal_model.litertlm",
        audio_backend=litert_lm.Backend.CPU(),
        vision_backend=litert_lm.Backend.GPU(),
    ) as engine:
        with engine.create_conversation() as conversation:
            response = conversation.send_message(
                litert_lm.Contents.of(
                    "Describe this audio.",
                    litert_lm.Content.AudioFile(absolute_path="/path/to/audio.wav"),
                )
            )
            print(response["content"][0]["text"])

Defining and Using Tools
You can define Python functions as tools that the model can call automatically.

    def add_numbers(a: float, b: float) -> float:
        """Adds two numbers.

        Args:
            a: The first number.
            b: The second number.
        """
        return a + b

    # Register the tool in the conversation
    tools = [add_numbers]
    with engine.create_conversation(tools=tools) as conversation:
        # The model will call add_numbers automatically if it needs to sum values
        response = conversation.send_message("What is 123 + 456?")
        print(response["content"][0]["text"])

LiteRT-LM uses the function's docstring and type hints to generate the tool schema for the model.
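To see what information a docstring and type hints make available, the standard library can inspect a function like add_numbers. This is only an illustrative sketch: describe_function and the dictionary layout it returns are hypothetical and are not the schema format LiteRT-LM actually produces.

```python
import inspect
from typing import Any, Callable, Dict, get_type_hints


def add_numbers(a: float, b: float) -> float:
    """Adds two numbers.

    Args:
        a: The first number.
        b: The second number.
    """
    return a + b


def type_name(tp: Any) -> str:
    """Readable name for a type hint, falling back to str() for generics."""
    return getattr(tp, "__name__", str(tp))


def describe_function(fn: Callable[..., Any]) -> Dict[str, Any]:
    """Collect the name, summary line, and annotated types of a function."""
    hints = get_type_hints(fn)
    doc = inspect.getdoc(fn) or ""
    return {
        "name": fn.__name__,
        "description": doc.splitlines()[0] if doc else "",
        "parameters": {
            name: type_name(hints[name])
            for name in inspect.signature(fn).parameters
            if name in hints
        },
        "returns": type_name(hints["return"]) if "return" in hints else None,
    }


description = describe_function(add_numbers)
print(description)
```

The practical takeaway is that a clear summary line and complete type hints give any schema generator more to work with, so they are worth writing carefully for every tool.
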

