GPU acceleration with LiteRT

Graphics Processing Units (GPUs) are commonly used for deep learning acceleration due to their massive parallel throughput compared to CPUs. LiteRT simplifies the process of using GPU acceleration by letting you specify the hardware accelerator as a parameter when creating a Compiled Model (CompiledModel).

With LiteRT's GPU acceleration, you can create GPU-friendly input and output buffers, achieve zero-copy with your data in GPU memory, and execute tasks asynchronously to maximize parallelism.

Get Started

Add GPU dependency

Use the following steps to add the GPU dependency to your Kotlin or C++ application.

Kotlin

For Kotlin users, the GPU accelerator is built-in and does not require additional steps beyond the Get Started guide.

C++

For C++ users, you must build the dependencies of the application with LiteRT GPU acceleration. The cc_binary rule that packages the core application logic (e.g., main.cc) requires the following runtime components:

  • LiteRT C API shared library: The data attribute must include the LiteRT C API shared library (//litert/c:litert_runtime_c_api_shared_lib) and the GPU-specific components (litert_gpu_accelerator_prebuilts).
  • Attribute dependencies: The deps attribute typically includes the GLES dependencies gles_deps(), and linkopts typically includes gles_linkopts(). Both are highly relevant for GPU acceleration, since LiteRT often uses OpenGL ES on Android.
  • Model files and other assets: Included through the data attribute.

The following is an example of a cc_binary rule:

    load("//litert/build_common:special_rule.bzl", "litert_gpu_accelerator_prebuilts")

    cc_binary(
        name = "your_application",
        srcs = [
            "main.cc",
        ],
        data = [
            ...
            # litert c api shared library
            "//litert/c:litert_runtime_c_api_shared_lib",
        ] + litert_gpu_accelerator_prebuilts(),
        linkopts = select({
            "@org_tensorflow//tensorflow:android": ["-landroid"],
            "//conditions:default": [],
        }) + gles_linkopts(),  # gles link options
        deps = [
            ...
            "//litert/cc:litert_tensor_buffer",  # litert cc library
            ...
        ] + gles_deps(),  # gles dependencies
    )
 

This setup allows your compiled binary to dynamically load and use the GPU for accelerated machine learning inference.

Prebuilt GPU Accelerators

The new LiteRT GPU Accelerator is not open sourced yet, but prebuilts are available. For Kotlin users, the LiteRT Maven package already contains the GPU Accelerators. C++ SDK users need to download the prebuilts separately.

In Bazel, you can use the following rule to add the dependency to your target:

    load("//litert/build_common:special_rule.bzl", "litert_gpu_accelerator_prebuilts")

Use GPU with CompiledModel API

To get started using the GPU accelerator, pass the GPU parameter when creating the Compiled Model (CompiledModel). The following code snippet shows a basic implementation of the entire process:

C++

    // 1. Create a compiled model targeting GPU
    LITERT_ASSIGN_OR_RETURN(auto env, Environment::Create({}));
    LITERT_ASSIGN_OR_RETURN(auto compiled_model,
        CompiledModel::Create(env, "mymodel.tflite", kLiteRtHwAcceleratorGpu));

    // 2. Prepare input/output buffers
    LITERT_ASSIGN_OR_RETURN(auto input_buffers, compiled_model.CreateInputBuffers());
    LITERT_ASSIGN_OR_RETURN(auto output_buffers, compiled_model.CreateOutputBuffers());

    // 3. Fill input data (if you have CPU-based data)
    input_buffers[0].Write<float>(absl::MakeConstSpan(cpu_data, data_size));

    // 4. Execute
    compiled_model.Run(input_buffers, output_buffers);

    // 5. Access model output (read from the first output buffer)
    std::vector<float> data(output_data_size);
    output_buffers[0].Read<float>(absl::MakeSpan(data));
 

Kotlin

    // Load model and initialize runtime
    val model =
        CompiledModel.create(
            context.assets,
            "mymodel.tflite",
            CompiledModel.Options(Accelerator.GPU),
            env,
        )

    // Preallocate input/output buffers
    val inputBuffers = model.createInputBuffers()
    val outputBuffers = model.createOutputBuffers()

    // Fill the first input
    inputBuffers[0].writeFloat(FloatArray(data_size) { data_value /* your data */ })

    // Invoke
    model.run(inputBuffers, outputBuffers)

    // Read the output
    val outputFloatArray = outputBuffers[0].readFloat()
 

For more information, see the Get Started with C++ or Get Started with Kotlin guides.

Zero-copy with GPU acceleration

Using zero-copy enables a GPU to access data directly in its own memory without the need for the CPU to explicitly copy that data. By not copying data to and from CPU memory, zero-copy can significantly reduce end-to-end latency.

The following code is an example implementation of zero-copy GPU inference with OpenGL, an API for rendering 2D and 3D graphics. The code passes images in the OpenGL buffer format directly to LiteRT:

    // Suppose you have an OpenGL buffer consisting of:
    // target (GLenum), id (GLuint), size_bytes (size_t), and offset (size_t)

    // Load model and compile for GPU
    LITERT_ASSIGN_OR_RETURN(auto env, Environment::Create({}));
    LITERT_ASSIGN_OR_RETURN(auto compiled_model,
        CompiledModel::Create(env, "mymodel.tflite", kLiteRtHwAcceleratorGpu));

    // Create a TensorBuffer that wraps the OpenGL buffer.
    // `model` is assumed to be the loaded model object; it is not created in this snippet.
    LITERT_ASSIGN_OR_RETURN(auto tensor_type, model.GetInputTensorType("input_tensor_name"));
    LITERT_ASSIGN_OR_RETURN(auto gl_input_buffer,
        TensorBuffer::CreateFromGlBuffer(env, tensor_type, opengl_buffer.target,
            opengl_buffer.id, opengl_buffer.size_bytes, opengl_buffer.offset));
    std::vector<TensorBuffer> input_buffers{gl_input_buffer};
    LITERT_ASSIGN_OR_RETURN(auto output_buffers, compiled_model.CreateOutputBuffers());

    // Execute
    compiled_model.Run(input_buffers, output_buffers);

    // If your output is also GPU-backed, you can fetch an OpenCL buffer or re-wrap it as an OpenGL buffer:
    LITERT_ASSIGN_OR_RETURN(auto out_cl_buffer, output_buffers[0].GetOpenClBuffer());
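
The difference between copying data and wrapping it in place can be sketched with standard C++ alone. The following is a minimal illustration, not LiteRT code: BufferView, CopyIn, Wrap, and At are hypothetical names, and ordinary host memory stands in for a GPU buffer. A copy owns separate storage, whereas a wrapped view aliases the producer's storage, so later writes by the producer are visible through the view without any transfer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical non-owning view: a pointer and a length, no storage of its own.
struct BufferView {
  const float* data;
  std::size_t size;
};

// Copy path: allocates new storage and duplicates every element.
std::vector<float> CopyIn(const std::vector<float>& src) {
  return std::vector<float>(src.begin(), src.end());
}

// Zero-copy path: only records where the existing data lives.
// The view is valid only while `src` is alive and not reallocated.
BufferView Wrap(const std::vector<float>& src) {
  return BufferView{src.data(), src.size()};
}

// Bounds-checked element access through the view.
float At(const BufferView& view, std::size_t index) {
  assert(index < view.size);
  return view.data[index];
}
```

The trade-off is lifetime management: the wrapped view is only as valid as the buffer it points at, which is why the OpenGL buffer in the example above must outlive the TensorBuffer that wraps it.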
 

Asynchronous execution

LiteRT's asynchronous methods, like RunAsync(), let you schedule GPU inference while continuing other tasks using the CPU or the NPU. In complex pipelines, the GPU is often used asynchronously alongside the CPU or NPUs.

The following code snippet builds on the code provided in the Zero-copy with GPU acceleration example. The code uses both the CPU and the GPU asynchronously and attaches a LiteRT Event to the input buffer. A LiteRT Event is responsible for managing different types of synchronization primitives, and the following code creates a managed LiteRT Event object of type LiteRtEventTypeEglSyncFence. This Event object ensures that the input buffer is not read until the GPU is done. All of this happens without involving the CPU.

    LITERT_ASSIGN_OR_RETURN(auto env, Environment::Create({}));
    LITERT_ASSIGN_OR_RETURN(auto compiled_model,
        CompiledModel::Create(env, "mymodel.tflite", kLiteRtHwAcceleratorGpu));

    // 1. Prepare input buffer (OpenGL buffer)
    LITERT_ASSIGN_OR_RETURN(auto gl_input,
        TensorBuffer::CreateFromGlBuffer(env, tensor_type, opengl_tex));
    std::vector<TensorBuffer> inputs{gl_input};
    LITERT_ASSIGN_OR_RETURN(auto outputs, compiled_model.CreateOutputBuffers());

    // 2. If the GL buffer is in use, create and set an event object to synchronize with the GPU.
    LITERT_ASSIGN_OR_RETURN(auto input_event,
        Event::CreateManagedEvent(env, LiteRtEventTypeEglSyncFence));
    inputs[0].SetEvent(std::move(input_event));

    // 3. Kick off the GPU inference
    compiled_model.RunAsync(inputs, outputs);

    // 4. Meanwhile, do other CPU work...
    // CPU Stays busy ..

    // 5. Access model output
    std::vector<float> data(output_data_size);
    outputs[0].Read<float>(absl::MakeSpan(data));
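
The scheduling pattern above (launch, keep the CPU busy, then synchronize before reading) can be illustrated without LiteRT using std::async. This is a sketch under stated assumptions: FakeInference and RunPipeline are hypothetical stand-ins, the "inference" simply doubles its input on a worker thread, and a std::future plays a role loosely analogous to the one the LiteRT Event plays for GPU buffers.

```cpp
#include <cassert>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical stand-in for GPU inference: doubles every input value.
std::vector<float> FakeInference(std::vector<float> input) {
  for (float& v : input) v *= 2.0f;
  return input;
}

// Launches the "inference" on another thread, does unrelated CPU work in the
// meantime, then blocks on the future before touching the output.
float RunPipeline(const std::vector<float>& input) {
  // 1. Kick off the asynchronous work (the future owns a copy of the input).
  std::future<std::vector<float>> pending =
      std::async(std::launch::async, FakeInference, input);

  // 2. Meanwhile, do other CPU work (here: sum the raw input).
  float cpu_side = std::accumulate(input.begin(), input.end(), 0.0f);

  // 3. Synchronize: get() waits until the result is ready.
  std::vector<float> output = pending.get();
  assert(output.size() == input.size());
  float async_side = std::accumulate(output.begin(), output.end(), 0.0f);

  return cpu_side + async_side;
}
```

The key point carried over from the LiteRT example is step 3: the output is read only after an explicit synchronization point, never directly after the launch call returns.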
 

Supported backends

LiteRT supports the following GPU backends for each platform.

Platform    Backend
Android     OpenCL + OpenGL
Linux       WebGPU (Vulkan)
macOS       Metal
Windows     WebGPU (Direct3D)
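
Application code that wants to log or branch on the expected backend could encode the table above as a lookup. The Platform enum and the GpuBackendFor function below are hypothetical helpers written for illustration; they are not part of LiteRT, and the strings simply mirror the table.

```cpp
#include <cassert>
#include <string>

// Platforms listed in the table above.
enum class Platform { kAndroid, kLinux, kMacOS, kWindows };

// Returns the GPU backend that the table lists for the given platform.
std::string GpuBackendFor(Platform platform) {
  switch (platform) {
    case Platform::kAndroid: return "OpenCL + OpenGL";
    case Platform::kLinux:   return "WebGPU (Vulkan)";
    case Platform::kMacOS:   return "Metal";
    case Platform::kWindows: return "WebGPU (Direct3D)";
  }
  // Not reached for the four enumerators above.
  assert(false && "unhandled platform");
  return "unknown";
}
```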

Supported models

LiteRT supports GPU acceleration with the following models. Benchmark results are based on tests run on a Samsung Galaxy S24 device.

Model LiteRT GPU Acceleration LiteRT GPU (ms)
Fully delegated 19.6
Fully delegated 8.7
Fully delegated 8.0
Fully delegated 9.1
Fully delegated 9.8
Fully delegated 43.1
Fully delegated 3.7
Fully delegated 9.7
Fully delegated 12.1
Fully delegated 4.6
Fully delegated 4.8
Fully delegated 3.3
Fully delegated 5.7
Fully delegated 35.1
Fully delegated 24.5
Fully delegated 13.9
Fully delegated 3.6
Fully delegated 4.7
Fully delegated 5.0
Fully delegated 6.1
Fully delegated 7.6
Fully delegated 8.6
Fully delegated 11.2
Fully delegated 14.7
Fully delegated 19.9
Fully delegated 3.9
Fully delegated 8.6
Fully delegated 3.3
Fully delegated 2.4
Fully delegated 2.8
Fully delegated 2.8
Fully delegated 2.3
Fully delegated 15.0
Fully delegated 4.3
Fully delegated 6.9
Fully delegated 2.9
Fully delegated 2.5
Fully delegated 13.4
Fully delegated 25.0
Fully delegated 13.4
Fully delegated 98.3
Fully delegated 51.4
Partially delegated 251.9
Partially delegated 13.7
Partially delegated 17.1
Partially delegated 52.1
Partially delegated 40.9
Partially delegated 17.6
Partially delegated 16.1
Partially delegated 73.5
Partially delegated 246.7
Partially delegated 88.9
Partially delegated 21.5
Partially delegated 8.2
Partially delegated 194.0
Partially delegated 9.5
Partially delegated 164.4
Partially delegated 6832.0
Partially delegated 2617.8
Partially delegated 11.2
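
To relate the latencies in the table to throughput, a per-inference latency can be converted to an upper bound on inferences per second. The helper below is a hypothetical sketch, not a LiteRT function, and it assumes inferences run back to back with no pre- or post-processing overhead.

```cpp
#include <cassert>
#include <cmath>

// Converts a per-inference latency in milliseconds to inferences per second,
// assuming inferences run back to back with no other overhead.
double InferencesPerSecond(double latency_ms) {
  assert(latency_ms > 0.0);
  return 1000.0 / latency_ms;
}
```

For example, a row reporting 8.0 ms corresponds to at most 125 inferences per second, and a row reporting 2.5 ms to at most 400.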