OpenAI-Compatible Server

LiteRT-LM CLI can start a local HTTP server that is compatible with the OpenAI API. This lets you use LiteRT-LM as a drop-in replacement for OpenAI in your existing applications and workflows.

Import Models

To serve a model, it must be present in your local registry. If you haven't imported a model yet, you can import the Gemma 4 12B model with the following command:

Linux/macOS

 litert-lm import \
   --from-huggingface-repo=litert-community/gemma-4-12B-it-litert-lm \
   gemma-4-12B-it.litertlm \
   gemma4-12b

Windows

 litert-lm import `
   --from-huggingface-repo=litert-community/gemma-4-12B-it-litert-lm `
   gemma-4-12B-it.litertlm `
   gemma4-12b

For more information on importing and managing models, see the Model Management guide.

Start the Server

Use the serve command to start the server. By default, it starts an OpenAI-compatible server on port 9379.

The server dynamically loads and serves any models in your local registry.

 litert-lm serve
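
A script that launches the server may need to wait until it accepts connections before sending requests. The following is a minimal sketch, not part of LiteRT-LM: it polls GET /v1/models on the default port 9379 using only the Python standard library. The function name wait_for_server and its parameters are hypothetical, and the sketch assumes the endpoint answers with a JSON body once the server is up.

```python
import json
import time
import urllib.request


def wait_for_server(base_url="http://localhost:9379", timeout_s=30.0,
                    interval_s=0.5, request_timeout_s=5.0):
    """Poll GET /v1/models until the server answers or timeout_s elapses.

    Returns the decoded JSON body on success, or None on timeout.
    The helper name and parameters are illustrative, not a LiteRT-LM API.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models",
                                        timeout=request_timeout_s) as resp:
                if resp.status == 200:
                    return json.loads(resp.read().decode("utf-8"))
        except OSError:
            # Covers refused connections, timeouts, and HTTP error statuses
            # (urllib.error.URLError is a subclass of OSError). Retry below.
            pass
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval_s)


# Example usage (requires `litert-lm serve` to be running):
#   models = wait_for_server()
#   print("ready" if models is not None else "server did not start in time")
```

Polling a cheap read-only endpoint avoids loading a model just to check that the server is reachable.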

Configuration Options

You can customize the server using the following options:

--host: The host to listen on (default: 0.0.0.0, which accepts connections on all network interfaces).
--port: The port to listen on (default: 9379).
--verbose: Enable verbose logging.

Example with custom host and port:

 litert-lm serve --host 127.0.0.1 --port 8080

Supported Endpoints

The server emulates the following OpenAI API endpoints:

List Models (GET /v1/models): Lists the models that are available to the server.
Chat Completions (POST /v1/chat/completions): Generates text completions for a given chat conversation. Supports streaming responses.
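
To illustrate how a client might consume a streaming response, here is a hedged sketch in Python. It assumes the server follows the OpenAI streaming convention: the request sets "stream": true, each event arrives as a line of the form data: {json} whose choices[].delta.content holds a text fragment, and the stream ends with data: [DONE]. This page does not specify the wire format, so treat those details as assumptions. The helper name iter_stream_text is hypothetical.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from an iterable of server-sent-event lines.

    Accepts str or bytes lines. Assumes OpenAI-style chunks; this format is
    an assumption and is not confirmed by the LiteRT-LM documentation.
    """
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # Skip blank separator lines and any non-data fields.
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # End-of-stream sentinel in the OpenAI convention.
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Example usage (requires a running server; the request body would also
# include "stream": true):
#   with urllib.request.urlopen(request) as resp:
#       for fragment in iter_stream_text(resp):
#           print(fragment, end="", flush=True)
```

Keeping the parser separate from the HTTP call makes it usable with any source of lines, including a recorded stream.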

Choosing the Backend and Configuration

When sending requests to the server, you can dynamically choose the execution backend (CPU, GPU, or NPU) and configure the maximum number of tokens (context length) by formatting the model field in your request payload.

The model field supports the following format:

 model_id[,backend][,max_tokens]

Where model_id corresponds to any model ID in your local registry (see Model Management for details on how to list or import models).

Examples

gemma4-12b,gpu : GPU backend with default max tokens.
gemma4-12b,gpu,32768 : GPU backend with max tokens 32768.
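
When the model field is assembled in code, a small helper can keep the format consistent. The sketch below is illustrative only: build_model_field is a hypothetical name, the lowercase backend spellings follow the examples above, and requiring a backend whenever max_tokens is set is a conservative assumption, because this page only shows max_tokens together with a backend.

```python
def build_model_field(model_id, backend=None, max_tokens=None):
    """Build a string of the form model_id[,backend][,max_tokens].

    Hypothetical helper, not part of LiteRT-LM.
    """
    allowed_backends = {"cpu", "gpu", "npu"}
    parts = [model_id]
    if backend is not None:
        normalized = backend.lower()
        if normalized not in allowed_backends:
            raise ValueError(f"unknown backend: {backend!r}")
        parts.append(normalized)
    if max_tokens is not None:
        if backend is None:
            # Assumption: only emit max_tokens after an explicit backend,
            # matching the documented examples.
            raise ValueError("max_tokens requires a backend in this sketch")
        if int(max_tokens) <= 0:
            raise ValueError("max_tokens must be positive")
        parts.append(str(int(max_tokens)))
    return ",".join(parts)


# Example usage:
#   build_model_field("gemma4-12b", "gpu")         -> "gemma4-12b,gpu"
#   build_model_field("gemma4-12b", "gpu", 32768)  -> "gemma4-12b,gpu,32768"
```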

Usage Example

Once the server is running, you can interact with it by sending HTTP requests.

Sending HTTP Requests

Linux/macOS

 curl http://localhost:9379/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "gemma4-12b,gpu",
     "messages": [
       {"role": "user", "content": "Hello!"}
     ]
   }'

Windows

 Invoke-RestMethod -Uri "http://localhost:9379/v1/chat/completions" `
   -Method Post `
   -ContentType "application/json" `
   -Body '{"model": "gemma4-12b,gpu", "messages": [{"role": "user", "content": "Hello!"}]}'
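
The same request can be sent from application code. The following is a minimal sketch using only the Python standard library. The function names build_chat_request and chat are hypothetical, and reading the reply from choices[0].message.content assumes the server returns the standard OpenAI chat completion response shape, which this page does not spell out.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_text):
    """Build a POST request for /v1/chat/completions with one user message."""
    payload = {
        "model": model,  # e.g. "gemma4-12b,gpu"
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, model, user_text, timeout_s=120.0):
    """Send one chat message and return the assistant's reply text.

    Assumes an OpenAI-style response body (an assumption, see lead-in).
    """
    request = build_chat_request(base_url, model, user_text)
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["choices"][0]["message"]["content"]


# Example usage (requires a running server and an imported model):
#   print(chat("http://localhost:9379", "gemma4-12b,gpu", "Hello!"))
```

Because the server is OpenAI-compatible, existing OpenAI client libraries that accept a custom base URL may also work; the standard-library version above simply avoids extra dependencies.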
