Gemma 4 models are designed to deliver frontier-level performance at each size, targeting deployment scenarios from mobile and edge devices (E2B, E4B) to consumer GPUs and workstations (26B A4B, 31B). They are well-suited for reasoning, agentic workflows, coding, and multimodal understanding.
Gemma 4 is licensed under the Apache-2.0 license. For more details, see the Gemma 4 Model Card.
🔴 What's New: Multi-Token Prediction
Multi-Token Prediction (MTP) is a new performance optimization that significantly accelerates decode speeds across CPU and GPU backends with zero quality degradation.
- Performance Gains:
  - GPU: Up to 2.2x decode speedup on mobile GPUs.
  - CPU: Up to 1.5x decode speedup on mobile CPUs, and significant acceleration on SME-enabled hardware (e.g., M4 MacBooks).
- Recommendations: MTP is recommended for all tasks on GPU backends and for the Gemma4-E4B model on CPU. For the Gemma4-E2B model on CPU, it is highly valuable for rewrite, summarize, and coding tasks, but should be enabled selectively because it may cause a slight slowdown during freeform prompting or generative tasks.
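The recommendations above amount to a small decision table. As a rough sketch only, the shell function below encodes that table. The function name `mtp_recommendation`, its argument values, and its output strings are hypothetical and are not part of LiteRT-LM; the function does not enable MTP itself, which is covered by the platform-specific guides.

```shell
#!/bin/sh
# Hypothetical helper (not a LiteRT-LM command): prints whether MTP is
# recommended for a given backend, model, and task, following the
# recommendations listed above.
#   $1 backend: gpu | cpu
#   $2 model:   E2B | E4B
#   $3 task:    rewrite | summarize | coding | freeform | ...
mtp_recommendation() {
  backend=$1
  model=$2
  task=$3
  case "$backend" in
    gpu)
      # Recommended for all tasks on GPU backends.
      echo "enable" ;;
    cpu)
      case "$model" in
        E4B)
          echo "enable" ;;
        E2B)
          case "$task" in
            rewrite|summarize|coding) echo "enable" ;;
            # Freeform or generative tasks may see a slight slowdown.
            *) echo "enable-selectively" ;;
          esac ;;
        *)
          echo "unknown" ;;
      esac ;;
    *)
      echo "unknown" ;;
  esac
}

mtp_recommendation gpu E2B freeform
mtp_recommendation cpu E2B summarize
mtp_recommendation cpu E2B freeform
```

Treat the output as a starting point and measure on the target device, since the speedups quoted above are upper bounds.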
To try MTP, see the platform-specific guides.
Get Started
Chat with Gemma4-E2B, hosted on the Hugging Face LiteRT Community.
    uv tool install litert-lm

    litert-lm run \
      --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
      gemma-4-E2B-it.litertlm \
      --prompt="What is the capital of France?"
Deploy from Safetensors
Follow these steps to deploy Gemma 4 starting from your own safetensors checkpoint (for example, after fine-tuning the model for your use case):
- Convert to the .litertlm format:

    uv tool install litert-torch-nightly

    litert-torch export_hf \
      --model=google/gemma-4-E2B-it \
      --output_dir=/tmp/gemma4_2b \
      --externalize_embedder \
      --jinja_chat_template_override=litert-community/gemma-4-E2B-it-litert-lm

- Deploy using the LiteRT-LM cross-platform APIs:

    litert-lm run \
      /tmp/gemma4_2b/model.litertlm \
      --prompt="What is the capital of France?"
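After conversion, it can be handy to smoke-test the exported model with more than one prompt. The sketch below wraps the `litert-lm run` invocation shown above in a small loop. Only `litert-lm run`, the model path argument, and `--prompt` come from the commands above; the `run_prompt` function and the `MODEL` and `DRY_RUN` variables are conveniences invented for this sketch and are not LiteRT-LM options. The script defaults to a dry run that only prints the commands, so it can be tried before a model has been exported.

```shell
#!/bin/sh
# Sketch: run several prompts against a converted .litertlm model.
# MODEL defaults to the output path used in the conversion step.
MODEL=${MODEL:-/tmp/gemma4_2b/model.litertlm}

# DRY_RUN is a convenience of this script (assumption), not a litert-lm flag.
# DRY_RUN=1 prints each command; DRY_RUN=0 executes it.
DRY_RUN=${DRY_RUN:-1}

run_prompt() {
  if [ "$DRY_RUN" = "1" ]; then
    # Print the command that would be executed.
    printf 'litert-lm run %s --prompt="%s"\n' "$MODEL" "$1"
  else
    litert-lm run "$MODEL" --prompt="$1"
  fi
}

for prompt in \
  "What is the capital of France?" \
  "Summarize the plot of Hamlet in one sentence."
do
  run_prompt "$prompt"
done
```

Setting `DRY_RUN=0` in the environment before running the script executes the commands against the model at `MODEL`.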
Performance Summary
Gemma-4-E2B
- Model Size: 2.58 GB
- Additional technical details are in the Hugging Face model card.
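| Platform (Device) | Backend | Prefill (tk/s) | Decode (tk/s) | Time to First Token (seconds) | Peak CPU Memory (MB) |
|---|---|---|---|---|---|
| Android (S26 Ultra) | CPU | 557 | 47 | 1.8 | 1733 |
| Android (S26 Ultra) | GPU | 3808 | 52 | 0.3 | 676 |
| iOS (iPhone 17 Pro) | CPU | 532 | 25 | 1.9 | 607 |
| iOS (iPhone 17 Pro) | GPU | 2878 | 56 | 0.3 | 1450 |
| Linux (Arm 2.3 & 2.8 GHz, NVIDIA GeForce RTX 4090) | CPU | 260 | 35 | 4 | 1628 |
| Linux (Arm 2.3 & 2.8 GHz, NVIDIA GeForce RTX 4090) | GPU | 11234 | 143 | 0.1 | 913 |
| macOS (MacBook Pro M4) | CPU | 901 | 42 | 1.1 | 736 |
| macOS (MacBook Pro M4) | GPU | 7835 | 160 | 0.1 | 1623 |
| Windows (Intel LunarLake) | CPU | 435 | 30 | 2.4 | 3505 |
| Windows (Intel LunarLake) | GPU | 3751 | 48 | 0.3 | 3540 |
| IoT (Raspberry Pi 5 16GB) | CPU | 133 | 8 | 7.8 | 1546 |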
Gemma-4-E4B
- Model Size: 3.65 GB
- Additional technical details are in the Hugging Face model card.
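| Platform (Device) | Backend | Prefill (tk/s) | Decode (tk/s) | Time to First Token (seconds) | Peak CPU Memory (MB) |
|---|---|---|---|---|---|
| Android (S26 Ultra) | CPU | 195 | 18 | 5.3 | 3283 |
| Android (S26 Ultra) | GPU | 1293 | 22 | 0.8 | 710 |
| iOS (iPhone 17 Pro) | CPU | 159 | 10 | 6.5 | 961 |
| iOS (iPhone 17 Pro) | GPU | 1189 | 25 | 0.9 | 3380 |
| Linux (Arm 2.3 & 2.8 GHz / RTX 4090) | CPU | 82 | 18 | 12.6 | 3139 |
| Linux (Arm 2.3 & 2.8 GHz / RTX 4090) | GPU | 7260 | 91 | 0.2 | 1119 |
| macOS (MacBook Pro M4 Max) | CPU | 277 | 27 | 3.7 | 890 |
| macOS (MacBook Pro M4 Max) | GPU | 2560 | 101 | 0.4 | 3217 |
| Windows (Intel LunarLake) | CPU | 173 | 17 | 6.0 | 9372 |
| Windows (Intel LunarLake) | GPU | 1202 | 25 | 0.9 | 7147 |
| IoT (Raspberry Pi 5 16GB) | CPU | 51 | 3 | 20.5 | 3069 |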
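The figures in the tables above can be combined into a rough per-response latency estimate. The sketch below assumes that total latency is approximately the time to first token plus the number of output tokens divided by the decode rate. This is a simplification: the reported time to first token depends on the prompt length used in the benchmark, which is not stated here. The function name `estimate_latency` and the 256-token response length are illustrative choices.

```shell
#!/bin/sh
# Rough end-to-end latency estimate, in seconds, for a single response.
# Assumption: latency ~= time_to_first_token + output_tokens / decode_rate.
#   $1 time to first token (seconds)
#   $2 decode rate (tokens per second)
#   $3 number of output tokens
estimate_latency() {
  awk -v ttft="$1" -v rate="$2" -v tokens="$3" \
    'BEGIN { printf "%.2f\n", ttft + tokens / rate }'
}

# Gemma-4-E2B, Android (S26 Ultra), GPU: 0.3 s TTFT, 52 tk/s decode.
estimate_latency 0.3 52 256   # prints 5.22

# Gemma-4-E4B, Android (S26 Ultra), GPU: 0.8 s TTFT, 22 tk/s decode.
estimate_latency 0.8 22 256   # prints 12.44
```

Under these assumptions, a 256-token response on the same phone GPU takes roughly 5 seconds with E2B and roughly 12 seconds with E4B, which illustrates the latency cost of the larger model.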

