OpenAI-Compatible Server

The LiteRT-LM CLI can start a local HTTP server that is compatible with the OpenAI API. This lets you use LiteRT-LM as a drop-in replacement for OpenAI in your existing applications and workflows.

Import Models

A model must be present in your local registry before the server can serve it. If you haven't imported a model yet, you can import the Gemma 4 12B model with the following command:

Linux/macOS

 litert-lm import \
   --from-huggingface-repo=litert-community/gemma-4-12B-it-litert-lm \
   gemma-4-12B-it.litertlm \
   gemma4-12b

Windows

 litert-lm import `
   --from-huggingface-repo=litert-community/gemma-4-12B-it-litert-lm `
   gemma-4-12B-it.litertlm `
   gemma4-12b
 

For more information on importing and managing models, see the Model Management guide.

Start the Server

Use the serve command to start the server. By default, it starts an OpenAI-compatible server on port 9379.

The server dynamically loads and serves any models in your local registry.

 litert-lm serve

Configuration Options

You can customize the server using the following options:

  • --host: The host to listen on (default: 0.0.0.0).
  • --port: The port to listen on (default: 9379).
  • --verbose: Enable verbose logging.

Example with custom host and port:

 litert-lm serve --host 127.0.0.1 --port 8080
 

Supported Endpoints

The server emulates the following OpenAI API endpoints:

  • List Models: GET /v1/models. Lists the models that are available to the server.
  • Chat Completions: POST /v1/chat/completions. Generates text completions for a given chat conversation. Supports streaming responses.
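
The streaming support mentioned for the chat completions endpoint can be consumed with a small parser. The sketch below is illustrative only and rests on assumptions this guide does not spell out: that streaming is requested with "stream": true in the request body, that each event arrives as a line of the form data: <json>, that new text sits in choices[0].delta.content, and that the stream ends with data: [DONE], as in OpenAI's own API. The function name iter_stream_text is hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from an OpenAI-style event stream.

    Assumptions (taken from OpenAI's API, not confirmed for LiteRT-LM):
    events look like `data: <json>`, the text delta lives in
    choices[0].delta.content, and `data: [DONE]` ends the stream.
    """
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything else
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        # Some chunks carry only a role or an empty delta; skip those.
        text = (choices[0].get("delta") or {}).get("content")
        if text:
            yield text
```

Any iterable of byte lines works as input, including the response object returned by urllib.request.urlopen, so the fragments can be printed as they arrive.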

Choosing the Backend and Configuration

When sending requests to the server, you can dynamically choose the execution backend (CPU, GPU, or NPU) and configure the maximum number of tokens (context length) by formatting the model field in your request payload.

The model field supports the following format:

 model_id[,backend][,max_tokens] 

Here, model_id is any model ID in your local registry (see Model Management for details on how to list or import models).

Examples

  • gemma4-12b,gpu : GPU backend with default max tokens.
  • gemma4-12b,gpu,32768 : GPU backend with max tokens 32768.
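
When the model field is assembled in application code, a small helper keeps the comma-separated format in one place. This is a minimal sketch: build_model_field is a hypothetical name, and the helper conservatively requires a backend whenever max_tokens is given, because the examples above only show max_tokens following a backend. Backend values are passed through unchanged; only the lowercase gpu spelling appears in this guide.

```python
from typing import Optional


def build_model_field(model_id: str,
                      backend: Optional[str] = None,
                      max_tokens: Optional[int] = None) -> str:
    """Compose the `model` string as model_id[,backend][,max_tokens]."""
    if "," in model_id:
        raise ValueError("model_id must not contain a comma")
    parts = [model_id]
    if backend is not None:
        parts.append(backend)
    if max_tokens is not None:
        # Assumption of this sketch: max_tokens is only sent together
        # with a backend, matching the documented examples.
        if backend is None:
            raise ValueError("this sketch requires a backend when max_tokens is set")
        if max_tokens <= 0:
            raise ValueError("max_tokens must be positive")
        parts.append(str(max_tokens))
    return ",".join(parts)
```

For instance, build_model_field("gemma4-12b", "gpu", 32768) produces the second example string above.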

Usage Example

Once the server is running, you can interact with it by sending HTTP requests.

Sending HTTP Requests

Linux/macOS

 curl http://localhost:9379/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "gemma4-12b,gpu",
     "messages": [
       {"role": "user", "content": "Hello!"}
     ]
   }'
 

Windows

 Invoke-RestMethod -Uri "http://localhost:9379/v1/chat/completions" `
   -Method Post `
   -ContentType "application/json" `
   -Body '{"model": "gemma4-12b,gpu", "messages": [{"role": "user", "content": "Hello!"}]}'
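
The same request can also be sent from application code on any platform. The following sketch uses only the Python standard library and mirrors the payload of the examples above. The helper names (build_chat_request, extract_reply, chat) are hypothetical, and reading the reply from choices[0].message.content assumes the server returns the same response shape as OpenAI's chat completions API, which this guide does not show.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9379"  # default port from this guide


def build_chat_request(model: str, user_message: str,
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request with the same payload as the examples above."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body: dict) -> str:
    """Return the assistant text from an OpenAI-style response.

    Assumption: the reply is at choices[0].message.content.
    """
    return response_body["choices"][0]["message"]["content"]


def chat(model: str, user_message: str) -> str:
    """Send one user message and return the reply (needs a running server)."""
    request = build_chat_request(model, user_message)
    with urllib.request.urlopen(request) as response:
        return extract_reply(json.load(response))
```

With the server running, calling chat("gemma4-12b,gpu", "Hello!") should return the model's reply as a string, provided the response follows the assumed shape.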
 