Usage

Learn how to run models and apply basic optimizations using the LiteRT-LM CLI.

Quick Start

Run the Gemma 4 E2B model:

Linux/macOS

    litert-lm run \
      --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
      gemma-4-E2B-it.litertlm \
      --prompt="What is the capital of France?"

Windows

    litert-lm run `
      --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm `
      gemma-4-E2B-it.litertlm `
      --prompt="What is the capital of France?"

Run a local model

Linux/macOS

    litert-lm run path/to/model.litertlm

Windows

    litert-lm run path\to\model.litertlm

GPU Acceleration

To accelerate inference using your device's GPU, use the --backend=gpu flag:

Linux/macOS

    litert-lm run \
      --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
      gemma-4-E2B-it.litertlm \
      --backend=gpu \
      --prompt="What is the capital of France?"

Windows

    litert-lm run `
      --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm `
      gemma-4-E2B-it.litertlm `
      --backend=gpu `
      --prompt="What is the capital of France?"

Multi-Token Prediction (MTP)

Note: MTP requires the model to ship with a drafter (speculative model). Not all .litertlm models support MTP.

Multi-Token Prediction (MTP) is a performance optimization that significantly accelerates decoding. It is recommended for all tasks on GPU backends.

To enable MTP in the CLI, use the --enable-speculative-decoding=true flag:

Linux/macOS

    litert-lm run \
      --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
      gemma-4-E2B-it.litertlm \
      --backend=gpu \
      --enable-speculative-decoding=true \
      --prompt="What is the capital of France?"

Windows

    litert-lm run `
      --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm `
      gemma-4-E2B-it.litertlm `
      --backend=gpu `
      --enable-speculative-decoding=true `
      --prompt="What is the capital of France?"

Multi-Modality

The LiteRT-LM CLI supports running multimodal models with image and audio attachments.

Prerequisites

To use attachments, you must specify the appropriate backend for processing them:

For images: Use the --vision-backend option.
For audio: Use the --audio-backend option.

Supported backends typically include cpu and gpu.

Image Attachments

To run a model with an image attachment:

    litert-lm run <model-ref> --vision-backend=gpu --attachment=image.jpg --prompt="Describe this image."

Audio Attachments

To run a model with an audio attachment:

    litert-lm run <model-ref> --audio-backend=gpu --attachment=audio.wav --prompt="Transcribe this audio."

Multiple Attachments

You can attach multiple files by repeating the --attachment option:

    litert-lm run <model-ref> --audio-backend=cpu --vision-backend=gpu --attachment=audio.wav --attachment=image.jpg ...
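
When scripting the CLI, the attachment rules above can be encoded in a small helper. The following is a minimal sketch, not part of LiteRT-LM: build_run_args is a hypothetical name, the extension-to-modality mapping is an assumption, and only flags shown on this page are used. It adds --vision-backend or --audio-backend only when an attachment of that kind is present.

```python
import shlex
from pathlib import Path

# Assumed extension-to-modality mapping; adjust to the formats your model accepts.
IMAGE_EXTS = {".jpg", ".jpeg", ".png"}
AUDIO_EXTS = {".wav"}


def build_run_args(model_ref, attachments, prompt,
                   vision_backend="gpu", audio_backend="cpu"):
    """Build an argv list for `litert-lm run` with attachments (hypothetical helper)."""
    suffixes = {Path(a).suffix.lower() for a in attachments}
    unknown = suffixes - IMAGE_EXTS - AUDIO_EXTS
    if unknown:
        raise ValueError(f"unsupported attachment type(s): {sorted(unknown)}")

    args = ["litert-lm", "run", model_ref]
    # Each modality needs its own backend flag, but only when it is actually used.
    if suffixes & AUDIO_EXTS:
        args.append(f"--audio-backend={audio_backend}")
    if suffixes & IMAGE_EXTS:
        args.append(f"--vision-backend={vision_backend}")
    # Repeat --attachment once per file, preserving the caller's order.
    args += [f"--attachment={a}" for a in attachments]
    args.append(f"--prompt={prompt}")
    return args


args = build_run_args(
    "path/to/model.litertlm",
    ["audio.wav", "image.jpg"],
    "Describe both attachments.",
)
print(shlex.join(args))  # shell-quoted form, handy for logging
```

The returned list can be passed directly to subprocess.run(args, check=True), which avoids shell quoting problems with prompts that contain spaces.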

Function Calling

Augment your local LLMs with Python capabilities by running tools through presets.

Using Presets for Tool Use

Create a preset.py file to define your tools and system instructions:

    import datetime

    def get_current_time() -> str:
        """Returns the current date and time."""
        return datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")

    system_instruction = "You are a helpful assistant with access to tools."
    tools = [get_current_time]

Run the model with the preset:

    litert-lm run <model-ref> --preset=preset.py
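
Because tools are plain Python functions, they can be checked on their own before being handed to the CLI. The sketch below extends the preset with a second tool that takes arguments. It assumes, without confirmation from this page, that tools with parameters are registered the same way as the zero-argument example; add_hours is a hypothetical tool added for illustration.

```python
# preset.py (sketch): assumes parameterized tools are registered like
# the zero-argument get_current_time example.
import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def get_current_time() -> str:
    """Returns the current date and time."""
    return datetime.datetime.now().strftime(TIME_FORMAT)


def add_hours(timestamp: str, hours: int) -> str:
    """Returns a 'YYYY-MM-DD HH:MM:SS' timestamp shifted by the given number of hours."""
    parsed = datetime.datetime.strptime(timestamp, TIME_FORMAT)
    return (parsed + datetime.timedelta(hours=hours)).strftime(TIME_FORMAT)


system_instruction = "You are a helpful assistant with access to tools."
tools = [get_current_time, add_hours]

if __name__ == "__main__":
    # Quick local check of the tool, independent of any model.
    print(add_hours("2026-03-25 21:54:07", 2))
```

Keeping date arithmetic in a tool rather than leaving it to the model is a design choice: the function returns a deterministic result that can be unit tested.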

How it Works

When you ask a question that requires external information (like the current time), the model recognizes that it needs to call a tool:

Model emits tool_call: The model outputs a JSON request to call the get_current_time function.
CLI executes tool: The LiteRT-LM CLI intercepts this call and executes the corresponding Python function defined in your preset.py.
CLI sends tool_response: The CLI sends the result back to the model.
Model generates final answer: The model uses the tool response to compute and generate the final answer for the user.

Sample interactive session:

 > what will the time be in two hours?
[tool_call] {"arguments": {}, "name": "get_current_time"}
[tool_response] {"name": "get_current_time", "response": "2026-03-25 21:54:07"}
The current time is 2026-03-25 21:54:07.

In two hours, it will be **2026-03-25 23:54:07**.

This "Function Calling" loop happens automatically within the CLI, allowing you to augment local LLMs with Python capabilities without writing any complex orchestration code.
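
To make the loop concrete, here is a minimal sketch of the dispatch step: take a tool_call, run the matching Python function, and wrap the result as a tool_response. This is not the CLI's actual implementation. The TOOLS registry and handle_tool_call are hypothetical names, and the JSON shapes are copied from the sample session above.

```python
import datetime
import json


def get_current_time() -> str:
    """Returns the current date and time."""
    return datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")


# Registry keyed by function name, mirroring the `tools` list in preset.py.
TOOLS = {fn.__name__: fn for fn in [get_current_time]}


def handle_tool_call(raw_call: str) -> str:
    """Execute one tool_call JSON string and return a tool_response JSON string."""
    call = json.loads(raw_call)
    fn = TOOLS[call["name"]]  # raises KeyError for an unknown tool name
    result = fn(**call.get("arguments", {}))
    return json.dumps({"name": call["name"], "response": result})


print(handle_tool_call('{"arguments": {}, "name": "get_current_time"}'))
```

In a full loop, the tool_response string would be fed back to the model, which then produces the final answer.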
