GPU acceleration with LiteRT

Graphics Processing Units (GPUs) are commonly used for deep learning acceleration due to their massive parallel throughput compared to CPUs. LiteRT simplifies the process of using GPU acceleration by allowing users to specify the hardware acceleration as a parameter when creating a Compiled Model (CompiledModel).

With LiteRT's GPU acceleration, you can create GPU-friendly input and output buffers, achieve zero-copy with your data in GPU memory, and execute tasks asynchronously to maximize parallelism.

Get Started

For classical ML models, see the following demo apps:
- Image segmentation Kotlin App: CPU/GPU/NPU inference.
- Image segmentation C++ App: CPU/GPU/NPU inference with async execution.

For GenAI models, see the following demos and guide:
- EmbeddingGemma semantic similarity C++ App: CPU/GPU/NPU inference.
- Guide on running LLMs using LiteRT-LM.

Add GPU dependency

Use the following steps to add the GPU dependency to your Kotlin or C++ application.

Kotlin

For Kotlin users, the GPU accelerator is built-in and does not require additional steps beyond the Get Started guide.

C++

C++ users must build the application's dependencies with LiteRT GPU acceleration. The cc_binary rule that packages the core application logic (e.g., main.cc) requires the following runtime components:

- LiteRT C API shared library: The data attribute must include the LiteRT C API shared library (//litert/c:litert_runtime_c_api_shared_lib) and the GPU-specific components (litert_gpu_accelerator_prebuilts).
- Attribute dependencies: The deps attribute typically includes the GLES dependencies, gles_deps(), and linkopts typically includes gles_linkopts(). Both matter for GPU acceleration because LiteRT often uses OpenGL ES on Android.
- Model files and other assets: Included through the data attribute.

The following is an example of a cc_binary rule:

    load("//litert/build_common:special_rule.bzl", "litert_gpu_accelerator_prebuilts")

    cc_binary(
        name = "your_application",
        srcs = [
            "main.cc",
        ],
        data = [
            ...
            # litert c api shared library
            "//litert/c:litert_runtime_c_api_shared_lib",
        ] + litert_gpu_accelerator_prebuilts(),
        linkopts = select({
            "@org_tensorflow//tensorflow:android": ["-landroid"],
            "//conditions:default": [],
        }) + gles_linkopts(),  # gles link options
        deps = [
            ...
            "//litert/cc:litert_tensor_buffer",  # litert cc library
            ...
        ] + gles_deps(),  # gles dependencies
    )

This setup allows your compiled binary to dynamically load and use the GPU for accelerated machine learning inference.

Prebuilt GPU Accelerators

The new LiteRT GPU Accelerator isn't open sourced yet, but prebuilts are available. For Kotlin users, the LiteRT Maven package already contains the GPU accelerators. C++ SDK users need to download the prebuilts separately.

In Bazel, you can load the following rule to add the dependency to your target:

    load("//litert/build_common:special_rule.bzl", "litert_gpu_accelerator_prebuilts")

Use GPU with `CompiledModel` API

To get started using the GPU accelerator, pass the GPU parameter when creating the Compiled Model (CompiledModel). The following code snippet shows a basic implementation of the entire process:

C++

    // 1. Create a compiled model targeting GPU
    LITERT_ASSIGN_OR_RETURN(auto env, Environment::Create({}));
    LITERT_ASSIGN_OR_RETURN(auto compiled_model,
        CompiledModel::Create(env, "mymodel.tflite", kLiteRtHwAcceleratorGpu));

    // 2. Prepare input/output buffers
    LITERT_ASSIGN_OR_RETURN(auto input_buffers, compiled_model.CreateInputBuffers());
    LITERT_ASSIGN_OR_RETURN(auto output_buffers, compiled_model.CreateOutputBuffers());

    // 3. Fill input data (if you have CPU-based data)
    input_buffers[0].Write<float>(absl::MakeConstSpan(cpu_data, data_size));

    // 4. Execute
    compiled_model.Run(input_buffers, output_buffers);

    // 5. Access model output (read from the first output buffer)
    std::vector<float> data(output_data_size);
    output_buffers[0].Read<float>(absl::MakeSpan(data));
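Step 3 above passes a `data_size` and step 5 an `output_data_size` without showing where they come from. For a dense tensor, the element count is the product of its dimensions. The sketch below is plain C++ and does not call LiteRT; `ElementCount` and `Float32ByteSize` are hypothetical helper names, and the shape used in the usage note is an assumed example, not one taken from a specific model.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of LiteRT): number of elements in a dense
// tensor with the given dimensions, i.e. the product of all dimensions.
inline std::size_t ElementCount(const std::vector<std::int32_t>& dims) {
  std::size_t count = 1;
  for (std::int32_t d : dims) {
    count *= static_cast<std::size_t>(d);
  }
  return count;
}

// Hypothetical helper: size in bytes of a float32 tensor with these dimensions.
inline std::size_t Float32ByteSize(const std::vector<std::int32_t>& dims) {
  return ElementCount(dims) * sizeof(float);
}
```

For an assumed input shape of {1, 224, 224, 3}, `ElementCount` returns 150528, which would be the `data_size` passed to `Write<float>`.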

Kotlin

    // Load model and initialize runtime
    val model =
        CompiledModel.create(
            context.assets,
            "mymodel.tflite",
            CompiledModel.Options(Accelerator.GPU),
            env,
        )

    // Preallocate input/output buffers
    val inputBuffers = model.createInputBuffers()
    val outputBuffers = model.createOutputBuffers()

    // Fill the first input
    inputBuffers[0].writeFloat(FloatArray(data_size) { data_value /* your data */ })

    // Invoke
    model.run(inputBuffers, outputBuffers)

    // Read the output
    val outputFloatArray = outputBuffers[0].readFloat()

For more information, see the Get Started with C++ or Get Started with Kotlin guides.

Zero-copy with GPU acceleration

Using zero-copy enables a GPU to access data directly in its own memory without the need for the CPU to explicitly copy that data. By not copying data to and from CPU memory, zero-copy can significantly reduce end-to-end latency.

The following code is an example implementation of zero-copy GPU inference with OpenGL, an API for rendering 2D and 3D graphics. The code passes images in the OpenGL buffer format directly to LiteRT:

    // Suppose you have an OpenGL buffer consisting of:
    // target (GLenum), id (GLuint), size_bytes (size_t), and offset (size_t)

    // Load model and compile for GPU
    LITERT_ASSIGN_OR_RETURN(auto env, Environment::Create({}));
    LITERT_ASSIGN_OR_RETURN(auto compiled_model,
        CompiledModel::Create(env, "mymodel.tflite", kLiteRtHwAcceleratorGpu));

    // Create a TensorBuffer that wraps the OpenGL buffer.
    // Note: `model` refers to the loaded model; its creation is not shown in this snippet.
    LITERT_ASSIGN_OR_RETURN(auto tensor_type,
        model.GetInputTensorType("input_tensor_name"));
    LITERT_ASSIGN_OR_RETURN(auto gl_input_buffer,
        TensorBuffer::CreateFromGlBuffer(env, tensor_type, opengl_buffer.target,
            opengl_buffer.id, opengl_buffer.size_bytes, opengl_buffer.offset));
    std::vector<TensorBuffer> input_buffers{gl_input_buffer};
    LITERT_ASSIGN_OR_RETURN(auto output_buffers, compiled_model.CreateOutputBuffers());

    // Execute
    compiled_model.Run(input_buffers, output_buffers);

    // If your output is also GPU-backed, you can fetch an OpenCL buffer
    // or re-wrap it as an OpenGL buffer:
    LITERT_ASSIGN_OR_RETURN(auto out_cl_buffer, output_buffers[0].GetOpenClBuffer());
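The difference between copying data in and wrapping memory that already exists can be shown with a CPU-only analogy. The sketch below is not the LiteRT API and involves no GPU; `BufferView`, `CopyIn`, and `Wrap` are hypothetical names used only to contrast the two paths.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical non-owning view over memory that someone else allocated.
struct BufferView {
  float* data;
  std::size_t size;
};

// Copy path: allocates new storage and duplicates every element.
inline std::vector<float> CopyIn(const std::vector<float>& source) {
  return std::vector<float>(source.begin(), source.end());
}

// Zero-copy path: records only the address and length of the existing storage.
inline BufferView Wrap(std::vector<float>& source) {
  return BufferView{source.data(), source.size()};
}
```

A view returned by `Wrap` sees later writes to the source because both refer to the same memory, while the vector returned by `CopyIn` does not. Wrapping an OpenGL buffer in a TensorBuffer plays a role similar to `Wrap` here: the existing memory is reused rather than duplicated.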

Asynchronous execution

LiteRT's asynchronous methods, like RunAsync(), let you schedule GPU inference while continuing other tasks using the CPU or the NPU. In complex pipelines, the GPU is often used asynchronously alongside the CPU or NPUs.

The following code snippet builds on the code provided in the Zero-copy with GPU acceleration example. The code uses both the CPU and the GPU asynchronously and attaches a LiteRT Event to the input buffer. A LiteRT Event manages different types of synchronization primitives, and the following code creates a managed LiteRT Event object of type LiteRtEventTypeEglSyncFence. This Event object ensures that the input buffer is not read until the GPU is done with it. All of this happens without involving the CPU.

    LITERT_ASSIGN_OR_RETURN(auto env, Environment::Create({}));
    LITERT_ASSIGN_OR_RETURN(auto compiled_model,
        CompiledModel::Create(env, "mymodel.tflite", kLiteRtHwAcceleratorGpu));

    // 1. Prepare input buffer (OpenGL buffer)
    LITERT_ASSIGN_OR_RETURN(auto gl_input,
        TensorBuffer::CreateFromGlBuffer(env, tensor_type, opengl_tex));
    std::vector<TensorBuffer> inputs{gl_input};
    LITERT_ASSIGN_OR_RETURN(auto outputs, compiled_model.CreateOutputBuffers());

    // 2. If the GL buffer is in use, create and set an event object to synchronize with the GPU.
    LITERT_ASSIGN_OR_RETURN(auto input_event,
        Event::CreateManagedEvent(env, LiteRtEventTypeEglSyncFence));
    inputs[0].SetEvent(std::move(input_event));

    // 3. Kick off the GPU inference
    compiled_model.RunAsync(inputs, outputs);

    // 4. Meanwhile, do other CPU work...
    // CPU stays busy...

    // 5. Access model output
    std::vector<float> data(output_data_size);
    outputs[0].Read<float>(absl::MakeSpan(data));
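The pattern in steps 3 to 5 (start the work, keep the CPU busy, then collect the result) can be sketched with the C++ standard library alone. This is an analogy under stated assumptions, not the LiteRT API: `FakeInference` is a hypothetical stand-in for the model, and a background thread stands in for the GPU.

```cpp
#include <future>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a long-running inference call.
inline float FakeInference(const std::vector<float>& input) {
  return std::accumulate(input.begin(), input.end(), 0.0f);
}

// Starts FakeInference on another thread, does unrelated work on the calling
// thread, then waits for the result.
inline float RunOverlapped(const std::vector<float>& input, int& cpu_work_done) {
  // Kick off the background task; `input` must outlive the task, which it
  // does here because get() is called before this function returns.
  std::future<float> pending =
      std::async(std::launch::async, [&input] { return FakeInference(input); });

  // Work that does not depend on the inference result.
  for (int i = 0; i < 1000; ++i) {
    ++cpu_work_done;
  }

  // get() blocks until the background task has finished.
  return pending.get();
}
```

The call to `get()` plays the role that reading the output buffer plays in the snippet above: it is the point where the caller has to wait for the asynchronous work to complete.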

Supported backends

LiteRT supports the following GPU backends on each platform.

Platform	Backend
Android	OpenCL + OpenGL
Linux	WebGPU (Vulkan)
macOS	Metal
Windows	WebGPU (Direct3D)
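For logging or diagnostics, the table can be expressed as a small lookup. This is a minimal sketch for illustration only: the `Platform` enum and `GpuBackendFor` function are hypothetical and not LiteRT types, and the strings are copied from the table.

```cpp
#include <string>

// Hypothetical enum for illustration; not a LiteRT type.
enum class Platform { kAndroid, kLinux, kMacOS, kWindows };

// Returns the GPU backend listed in the table for the given platform.
inline std::string GpuBackendFor(Platform platform) {
  switch (platform) {
    case Platform::kAndroid:
      return "OpenCL + OpenGL";
    case Platform::kLinux:
      return "WebGPU (Vulkan)";
    case Platform::kMacOS:
      return "Metal";
    case Platform::kWindows:
      return "WebGPU (Direct3D)";
  }
  return "unknown";
}
```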

Supported models

LiteRT supports GPU acceleration with the following models. Benchmark results are based on tests run on a Samsung Galaxy S24 device.

Model	LiteRT GPU acceleration	LiteRT GPU latency (ms)
hf_mms_300m	Fully delegated	19.6
hf_mobilevit_small	Fully delegated	8.7
hf_mobilevit_small_e2e	Fully delegated	8.0
hf_wav2vec2_base_960h	Fully delegated	9.1
hf_wav2vec2_base_960h_dynamic	Fully delegated	9.8
isnet	Fully delegated	43.1
timm_efficientnet	Fully delegated	3.7
timm_nfnet	Fully delegated	9.7
timm_regnety_120	Fully delegated	12.1
torchaudio_deepspeech	Fully delegated	4.6
torchaudio_wav2letter	Fully delegated	4.8
torchvision_alexnet	Fully delegated	3.3
torchvision_deeplabv3_mobilenet_v3_large	Fully delegated	5.7
torchvision_deeplabv3_resnet101	Fully delegated	35.1
torchvision_deeplabv3_resnet50	Fully delegated	24.5
torchvision_densenet121	Fully delegated	13.9
torchvision_efficientnet_b0	Fully delegated	3.6
torchvision_efficientnet_b1	Fully delegated	4.7
torchvision_efficientnet_b2	Fully delegated	5.0
torchvision_efficientnet_b3	Fully delegated	6.1
torchvision_efficientnet_b4	Fully delegated	7.6
torchvision_efficientnet_b5	Fully delegated	8.6
torchvision_efficientnet_b6	Fully delegated	11.2
torchvision_efficientnet_b7	Fully delegated	14.7
torchvision_fcn_resnet50	Fully delegated	19.9
torchvision_googlenet	Fully delegated	3.9
torchvision_inception_v3	Fully delegated	8.6
torchvision_lraspp_mobilenet_v3_large	Fully delegated	3.3
torchvision_mnasnet0_5	Fully delegated	2.4
torchvision_mobilenet_v2	Fully delegated	2.8
torchvision_mobilenet_v3_large	Fully delegated	2.8
torchvision_mobilenet_v3_small	Fully delegated	2.3
torchvision_resnet152	Fully delegated	15.0
torchvision_resnet18	Fully delegated	4.3
torchvision_resnet50	Fully delegated	6.9
torchvision_squeezenet1_0	Fully delegated	2.9
torchvision_squeezenet1_1	Fully delegated	2.5
torchvision_vgg16	Fully delegated	13.4
torchvision_wide_resnet101_2	Fully delegated	25.0
torchvision_wide_resnet50_2	Fully delegated	13.4
u2net_full	Fully delegated	98.3
u2net_lite	Fully delegated	51.4
hf_distil_whisper_small_no_cache	Partially delegated	251.9
hf_distilbert	Partially delegated	13.7
hf_tinyroberta_squad2	Partially delegated	17.1
hf_tinyroberta_squad2_dynamic_batch	Partially delegated	52.1
snapml_StyleTransferNet	Partially delegated	40.9
timm_efficientformer_l1	Partially delegated	17.6
timm_efficientformerv2_s0	Partially delegated	16.1
timm_pvt_v2_b1	Partially delegated	73.5
timm_pvt_v2_b3	Partially delegated	246.7
timm_resnest14d	Partially delegated	88.9
torchaudio_conformer	Partially delegated	21.5
torchvision_convnext_tiny	Partially delegated	8.2
torchvision_maxvit_t	Partially delegated	194.0
torchvision_shufflenet_v2	Partially delegated	9.5
torchvision_swin_tiny	Partially delegated	164.4
torchvision_video_resnet2plus1d_18	Partially delegated	6832.0
torchvision_video_swin3d_tiny	Partially delegated	2617.8
yolox_tiny	Partially delegated	11.2
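One way to use the table is to shortlist models that are fully delegated and fit a time budget. The sketch below copies five rows from the table. The `BenchmarkRow` struct, the function names, and the budget value in the usage note are assumptions made for illustration.

```cpp
#include <string>
#include <vector>

// One row of the benchmark table (hypothetical struct for illustration).
struct BenchmarkRow {
  std::string model;
  bool fully_delegated;
  double gpu_ms;
};

// A few rows copied from the table above.
inline std::vector<BenchmarkRow> SampleRows() {
  return {
      {"torchvision_mobilenet_v3_small", true, 2.3},
      {"torchvision_resnet50", true, 6.9},
      {"isnet", true, 43.1},
      {"yolox_tiny", false, 11.2},
      {"torchvision_swin_tiny", false, 164.4},
  };
}

// Returns the names of fully delegated models at or under the given budget.
inline std::vector<std::string> FullyDelegatedWithin(
    const std::vector<BenchmarkRow>& rows, double budget_ms) {
  std::vector<std::string> names;
  for (const BenchmarkRow& row : rows) {
    if (row.fully_delegated && row.gpu_ms <= budget_ms) {
      names.push_back(row.model);
    }
  }
  return names;
}
```

With an assumed budget of 10 ms, `FullyDelegatedWithin(SampleRows(), 10.0)` keeps torchvision_mobilenet_v3_small and torchvision_resnet50. The figures apply to the Samsung Galaxy S24 used for the benchmark, so other devices may rank differently.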
