Learn how to run models and apply basic optimizations using the LiteRT-LM CLI.
Quick Start
Run the Gemma4 E2B model:
Linux/MacOS
    litert-lm run \
      --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
      gemma-4-E2B-it.litertlm \
      --prompt="What is the capital of France?"
Windows
    litert-lm run `
      --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm `
      gemma-4-E2B-it.litertlm `
      --prompt="What is the capital of France?"
Run a local model
Linux/MacOS
    litert-lm run path/to/model.litertlm
Windows
    litert-lm run path\to\model.litertlm
GPU Acceleration
To accelerate inference using your device's GPU, use the --backend=gpu flag:
Linux/MacOS
    litert-lm run \
      --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
      gemma-4-E2B-it.litertlm \
      --backend=gpu \
      --prompt="What is the capital of France?"
Windows
    litert-lm run `
      --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm `
      gemma-4-E2B-it.litertlm `
      --backend=gpu `
      --prompt="What is the capital of France?"
Multi-Token Prediction (MTP)
Multi-Token Prediction (MTP) is a performance optimization that significantly speeds up decoding. MTP is recommended for all tasks on GPU backends.
To enable MTP in the CLI, use the --enable-speculative-decoding=true flag:
Linux/MacOS
    litert-lm run \
      --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
      gemma-4-E2B-it.litertlm \
      --backend=gpu \
      --enable-speculative-decoding=true \
      --prompt="What is the capital of France?"
Windows
    litert-lm run `
      --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm `
      gemma-4-E2B-it.litertlm `
      --backend=gpu `
      --enable-speculative-decoding=true `
      --prompt="What is the capital of France?"
Multi-Modality
The LiteRT-LM CLI supports running multimodal models with image and audio attachments.
Prerequisites
To use attachments, you must specify the appropriate backend for processing them:
- For images: Use the --vision-backend option.
- For audio: Use the --audio-backend option.
Supported backends typically include cpu and gpu.
Image Attachments
To run a model with an image attachment:
    litert-lm run <model-ref> --vision-backend=gpu --attachment=image.jpg --prompt="Describe this image."
Audio Attachments
To run a model with an audio attachment:
    litert-lm run <model-ref> --audio-backend=gpu --attachment=audio.wav --prompt="Transcribe this audio."
Multiple Attachments
You can attach multiple files by repeating the --attachment option:
    litert-lm run <model-ref> --audio-backend=cpu --vision-backend=gpu --attachment=audio.wav --attachment=image.jpg ...
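When commands like the one above are assembled from a script, the repeated --attachment options and the matching backend options are easy to get out of sync. The following is a minimal sketch of a helper that builds the argument list. The function name `build_attachment_command`, its default backends, and the extension-to-modality mapping are assumptions made for illustration and are not part of the CLI; only the flags themselves come from this guide.

```python
from pathlib import Path

# Assumption: these extension sets only cover the file types shown in this
# guide (image.jpg, audio.wav). Extend them to match what your model accepts.
IMAGE_EXTS = {".jpg"}
AUDIO_EXTS = {".wav"}


def build_attachment_command(model_ref, prompt, attachments,
                             vision_backend="gpu", audio_backend="cpu"):
    """Build an argv list for `litert-lm run` with one --attachment per file.

    Hypothetical helper: it adds --audio-backend / --vision-backend only when
    an attachment of that kind is present.
    """
    cmd = ["litert-lm", "run", model_ref]
    exts = {Path(a).suffix.lower() for a in attachments}
    if exts & AUDIO_EXTS:
        cmd.append(f"--audio-backend={audio_backend}")
    if exts & IMAGE_EXTS:
        cmd.append(f"--vision-backend={vision_backend}")
    # Repeat the option once per file, preserving the caller's order.
    cmd += [f"--attachment={a}" for a in attachments]
    cmd.append(f"--prompt={prompt}")
    return cmd


if __name__ == "__main__":
    command = build_attachment_command(
        "path/to/model.litertlm",
        "Describe the image and transcribe the audio.",
        ["audio.wav", "image.jpg"],
    )
    # Print one argument per line; pass the list to subprocess.run to execute it.
    print("\n".join(command))
```

Returning a list rather than a single string means the prompt needs no shell quoting when the list is passed to `subprocess.run`.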
Function Calling
Augment your local LLMs with Python capabilities by running tools through presets.
Using Presets for Tool Use
To run tools, create a preset.py file that defines your tools and system instructions:
    import datetime

    def get_current_time() -> str:
        """Returns the current date and time."""
        return datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")

    system_instruction = "You are a helpful assistant with access to tools."
    tools = [get_current_time]
Run the model with the preset:
    litert-lm run <model-ref> --preset=preset.py
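Before handing a preset to the CLI, it can help to confirm that the file loads cleanly and defines what you expect. The sketch below uses only the Python standard library. The `check_preset` helper is hypothetical, and its checks simply mirror the two names the example preset defines (`system_instruction` and `tools`); whether the CLI itself requires both, or validates them this way, is an assumption.

```python
import runpy
import tempfile
from pathlib import Path


def check_preset(path: str) -> list[str]:
    """Execute a preset file and return the names of the tools it defines."""
    # run_path executes the file and returns its top-level names as a dict.
    namespace = runpy.run_path(path)
    instruction = namespace.get("system_instruction")
    tools = namespace.get("tools")
    if not isinstance(instruction, str):
        raise TypeError("system_instruction should be a string")
    if not isinstance(tools, list) or not all(callable(t) for t in tools):
        raise TypeError("tools should be a list of callables")
    return [tool.__name__ for tool in tools]


# Demo: write the preset from this guide to a temporary file and check it.
PRESET_SOURCE = '''
import datetime

def get_current_time() -> str:
    """Returns the current date and time."""
    return datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")

system_instruction = "You are a helpful assistant with access to tools."
tools = [get_current_time]
'''

with tempfile.TemporaryDirectory() as tmp:
    preset_path = Path(tmp) / "preset.py"
    preset_path.write_text(PRESET_SOURCE)
    tool_names = check_preset(str(preset_path))

print(tool_names)  # ['get_current_time']
```

Note that `runpy.run_path` executes the preset, so only run it on files you trust.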
How it Works
When you ask a question that requires external information (like the current time), the model recognizes that it needs to call a tool:
- Model Emits tool_call: The model outputs a JSON request to call the get_current_time function.
- CLI Executes Tool: The LiteRT-LM CLI intercepts this call and executes the corresponding Python function defined in your preset.py.
- CLI Sends tool_response: The CLI sends the result back to the model.
- Model Generates Final Answer: The model uses the tool response to compute and generate the final answer for the user.
Sample interactive session:
> what will the time be in two hours?
[tool_call] {"arguments": {}, "name": "get_current_time"}
[tool_response] {"name": "get_current_time", "response": "2026-03-25 21:54:07"}
The current time is 2026-03-25 21:54:07.
In two hours, it will be **2026-03-25 23:54:07**.
This "Function Calling" loop happens automatically within the CLI, allowing you to augment local LLMs with Python capabilities without writing any complex orchestration code.
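To make the middle two steps of this loop concrete, here is a minimal sketch of a dispatcher that takes a tool_call and produces a tool_response, using the JSON shapes from the sample session. It is an illustration only, not the CLI's actual implementation: the `handle_tool_call` function and the `TOOLS` registry are hypothetical names, and the stand-in tool returns a fixed timestamp so the result is reproducible.

```python
import json


def get_current_time() -> str:
    """Stand-in tool: returns a fixed timestamp instead of the real clock."""
    return "2026-03-25 21:54:07"


# Hypothetical registry mapping tool names to the functions from a preset.
TOOLS = {fn.__name__: fn for fn in [get_current_time]}


def handle_tool_call(raw_call: str) -> str:
    """Parse a tool_call JSON string, run the tool, return a tool_response."""
    call = json.loads(raw_call)
    name = call["name"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    # Pass the model-supplied arguments to the function as keyword arguments.
    result = TOOLS[name](**call.get("arguments", {}))
    return json.dumps({"name": name, "response": result})


tool_response = handle_tool_call('{"arguments": {}, "name": "get_current_time"}')
print(tool_response)
# {"name": "get_current_time", "response": "2026-03-25 21:54:07"}
```

A real loop would feed the tool_response back to the model and repeat until the model replies with plain text rather than another tool_call.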

