<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>On Device on ICE-ICE-BEAR-BLOG</title><link>https://ice-ice-bear.github.io/tags/on-device/</link><description>Recent content in On Device on ICE-ICE-BEAR-BLOG</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Thu, 07 May 2026 00:00:00 +0900</lastBuildDate><atom:link href="https://ice-ice-bear.github.io/tags/on-device/index.xml" rel="self" type="application/rss+xml"/><item><title>LiteRT-LM v0.11.0 — Gemma 4 MTP Doubles Mobile GPU Decode, Windows Goes Native</title><link>https://ice-ice-bear.github.io/posts/2026-05-07-litert-lm-v0-11-0-gemma4-mtp/</link><pubDate>Thu, 07 May 2026 00:00:00 +0900</pubDate><guid>https://ice-ice-bear.github.io/posts/2026-05-07-litert-lm-v0-11-0-gemma4-mtp/</guid><description>&lt;h2 id="overview"&gt;Overview
&lt;/h2&gt;&lt;p&gt;Google&amp;rsquo;s on-device LLM runtime &lt;a class="link" href="https://ai.google.dev/edge/litert-lm" target="_blank" rel="noopener"
 &gt;LiteRT-LM&lt;/a&gt; shipped &lt;a class="link" href="https://github.com/google-ai-edge/LiteRT-LM/releases/tag/v0.11.0" target="_blank" rel="noopener"
 &gt;v0.11.0&lt;/a&gt;. Two headline items: &lt;strong&gt;Single Position Multi-token Prediction (MTP)&lt;/strong&gt; for Gemma 4 — more than 2x faster decode on mobile GPUs — and &lt;strong&gt;native Windows support&lt;/strong&gt; (CPU and GPU). Workstation-class results from the same week (DGX Spark + Qwen3.5 with MTP-2 hitting +36%) suggest MTP is hardening into &lt;strong&gt;the common decode-acceleration technique&lt;/strong&gt; spanning mobile up through workstation.&lt;/p&gt;
&lt;pre class="mermaid" style="visibility:hidden"&gt;graph TD
 Input["Input position t"] --&gt; Target["Gemma 4 target model"]
 Input --&gt; Drafter["MTP drafter &amp;lt;br/&amp;gt; (lightweight)"]
 Drafter --&gt; Draft["Draft tokens t+1, t+2, ..., t+k"]
 Draft --&gt; Verify["Target verifies in one forward pass"]
 Target --&gt; Verify
 Verify --&gt; Accept["Accept matching prefix &amp;lt;br/&amp;gt; + 1 extra token"]
 Accept --&gt; Output["Multiple tokens emitted in a single step"]&lt;/pre&gt;&lt;h2 id="1-gemma-4-multi-token-prediction-support"&gt;1. Gemma 4 Multi-token Prediction Support
&lt;/h2&gt;&lt;p&gt;The opening line of the &lt;a class="link" href="https://github.com/google-ai-edge/LiteRT-LM/releases/tag/v0.11.0" target="_blank" rel="noopener"
 &gt;release notes&lt;/a&gt;: &lt;strong&gt;&amp;quot;&amp;gt;2x faster decode on mobile GPUs with zero quality degradation.&amp;quot;&lt;/strong&gt; The mechanism is laid out in the &lt;a class="link" href="https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/" target="_blank" rel="noopener"
 &gt;Google blog post on MTP for Gemma 4&lt;/a&gt; and the &lt;a class="link" href="https://ai.google.dev/edge/litert-lm/models/gemma-4" target="_blank" rel="noopener"
 &gt;official docs&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The trick is a flavor of &lt;strong&gt;speculative decoding&lt;/strong&gt; (sketched in code after this list):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;At a single position, a lightweight &lt;strong&gt;drafter&lt;/strong&gt; predicts multiple future tokens at once&lt;/li&gt;
&lt;li&gt;The full &lt;strong&gt;target&lt;/strong&gt; model (e.g., Gemma 4 26B / 31B) verifies the entire draft sequence in one forward pass&lt;/li&gt;
&lt;li&gt;The target accepts the longest agreeing prefix of the draft; when the entire draft matches, it also emits one additional token of its own&lt;/li&gt;
&lt;/ul&gt;
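&lt;p&gt;A minimal Python sketch of that draft-verify-accept loop under greedy decoding; &lt;code&gt;target&lt;/code&gt;, &lt;code&gt;drafter&lt;/code&gt;, and their methods are hypothetical stand-ins, not LiteRT-LM&amp;rsquo;s actual API:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# Hypothetical model objects: logits[j] is the target's distribution
# over the token at position j+1. Greedy decoding throughout.
def mtp_decode_step(target, drafter, tokens, k=4):
    draft = drafter.propose(tokens, num_tokens=k)  # draft tokens t+1 .. t+k
    logits = target.forward(tokens + draft)        # ONE target forward pass
    accepted = []
    for i, d in enumerate(draft):
        best = int(logits[len(tokens) + i - 1].argmax())
        if best != d:                # first disagreement: take the target's
            accepted.append(best)    # own token and stop
            break
        accepted.append(d)           # agreement: keep the drafted token
    else:
        # Whole draft accepted: the same pass also yields one bonus token.
        accepted.append(int(logits[-1].argmax()))
    return tokens + accepted
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Under greedy decoding the emitted tokens are exactly what the target alone would have produced, which is where the &amp;ldquo;zero quality degradation&amp;rdquo; claim comes from; the win is that several of them cost only one target pass.&lt;/p&gt;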
&lt;p&gt;Standard LLM inference is &lt;strong&gt;memory-bandwidth bound&lt;/strong&gt; — most cycles are spent shuffling parameters around. MTP bends that bottleneck by extracting more tokens per memory pass.&lt;/p&gt;
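&lt;p&gt;Back-of-envelope, with illustrative numbers only: decode rate ≈ memory bandwidth ÷ bytes read per step. If the 2.58 GB &lt;code&gt;Gemma-4-E2B&lt;/code&gt; weights are re-read every step over an assumed ~60 GB/s of effective mobile bandwidth, the ceiling is about 23 tokens/s at one token per pass; accept an average of ~2.2 tokens per pass and the same memory traffic yields roughly 51 tokens/s.&lt;/p&gt;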
&lt;p&gt;&lt;strong&gt;Speedups by platform:&lt;/strong&gt;&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Platform&lt;/th&gt;
 &lt;th&gt;Backend&lt;/th&gt;
 &lt;th&gt;Speedup&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Mobile GPU (Samsung S26 Ultra, iPhone 17 Pro)&lt;/td&gt;
 &lt;td&gt;GPU&lt;/td&gt;
 &lt;td&gt;up to 2.2x decode&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Mobile CPU&lt;/td&gt;
 &lt;td&gt;CPU&lt;/td&gt;
 &lt;td&gt;up to 1.5x decode&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Apple Silicon (M4 MacBook Pro)&lt;/td&gt;
 &lt;td&gt;CPU + SME&lt;/td&gt;
 &lt;td&gt;~2.2x decode at batch sizes 4–8&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;NVIDIA RTX PRO 6000 (26B)&lt;/td&gt;
 &lt;td&gt;GPU&lt;/td&gt;
 &lt;td&gt;~50% latency reduction&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;NVIDIA RTX 4090 / Linux ARM&lt;/td&gt;
 &lt;td&gt;GPU&lt;/td&gt;
 &lt;td&gt;consistent acceleration&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Important caveat&lt;/strong&gt;: MTP is universally recommended on GPU, and on CPU for the E4B model. &lt;strong&gt;For E2B on CPU, freeform generation may run slightly slower&lt;/strong&gt;, but rewrite, summarization, and coding tasks still come out ahead: their long input prefixes make the output more predictable, so the drafter&amp;rsquo;s guesses are accepted often enough to cover its overhead.&lt;/p&gt;
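&lt;p&gt;The deciding variable is the draft acceptance rate. Per the speculative-decoding paper cited in the references (arXiv:2211.17192), with per-token acceptance rate α and draft length γ, each target pass emits (1 - α&lt;sup&gt;γ+1&lt;/sup&gt;)/(1 - α) tokens in expectation. A small sketch with illustrative numbers, not measured LiteRT-LM figures:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# Expected tokens per target forward pass (Leviathan et al., arXiv:2211.17192):
# alpha = per-token draft acceptance rate, gamma = drafted tokens per step.
def expected_tokens(alpha, gamma):
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# Illustrative values only:
print(expected_tokens(0.8, 4))  # ~3.36 -- predictable rewrite/coding output
print(expected_tokens(0.4, 4))  # ~1.65 -- barely covers the drafter's overhead
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;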
&lt;p&gt;Supported models start with &lt;a class="link" href="https://ai.google.dev/edge/litert-lm/models/gemma-4" target="_blank" rel="noopener"
 &gt;&lt;code&gt;Gemma-4-E2B&lt;/code&gt;&lt;/a&gt; (2.58 GB) and &lt;code&gt;Gemma-4-E4B&lt;/code&gt; (3.65 GB); 26B A4B and 31B are coming soon.&lt;/p&gt;
&lt;h2 id="2-native-windows-support"&gt;2. Native Windows Support
&lt;/h2&gt;&lt;p&gt;The &lt;a class="link" href="https://ai.google.dev/edge/litert-lm/cli" target="_blank" rel="noopener"
 &gt;LiteRT-LM CLI&lt;/a&gt; now runs natively on Windows with &lt;strong&gt;both CPU and GPU backends&lt;/strong&gt;. Previously Linux, macOS, and Android were the focus, so Windows developers had to go through WSL.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;litert-lm run --from-huggingface-repo&lt;span class="o"&gt;=&lt;/span&gt;litert-community/gemma-4-E2B-it-litert-lm
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The unstated intent is loud — &lt;strong&gt;bring workstation and laptop developers in directly.&lt;/strong&gt; The friction of needing an Android device just to try things is gone.&lt;/p&gt;
&lt;h2 id="3-the-litert-stack--tf-lites-successor"&gt;3. The LiteRT Stack — TF Lite&amp;rsquo;s Successor
&lt;/h2&gt;&lt;p&gt;Step back and the placement makes sense:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;TensorFlow Lite&lt;/strong&gt; (former name) → &lt;a class="link" href="https://ai.google.dev/edge/litert" target="_blank" rel="noopener"
 &gt;LiteRT&lt;/a&gt; (Lite Runtime, 2024 rebrand)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;LiteRT-LM&lt;/strong&gt; = the LLM-specialized variant of LiteRT&lt;/li&gt;
&lt;li&gt;Model family: &lt;a class="link" href="https://ai.google.dev/gemma" target="_blank" rel="noopener"
 &gt;Gemma&lt;/a&gt; — Google&amp;rsquo;s open-weight LLMs&lt;/li&gt;
&lt;li&gt;Target: &lt;strong&gt;on-device inference&lt;/strong&gt; — mobile, edge, embedded, desktop&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Apache 2.0. CPU + GPU + (on Apple Silicon) SME backends. The &lt;a class="link" href="https://huggingface.co/litert-community" target="_blank" rel="noopener"
 &gt;&lt;code&gt;litert-community&lt;/code&gt;&lt;/a&gt; repo on Hugging Face plugs in directly.&lt;/p&gt;
&lt;h2 id="4-mtp-is-becoming-the-standard"&gt;4. MTP Is Becoming the Standard
&lt;/h2&gt;&lt;p&gt;The interesting part: MTP isn&amp;rsquo;t a one-company, one-model trick.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A few days ago, the &lt;a class="link" href="#" &gt;DGX Spark + Qwen3.5 post&lt;/a&gt; reported &lt;strong&gt;MTP-2&lt;/strong&gt; giving +36% decode on workstation-class GPUs.&lt;/li&gt;
&lt;li&gt;Gemma 4 + LiteRT-LM gets &lt;strong&gt;2.2x on mobile GPUs&lt;/strong&gt; from the same idea.&lt;/li&gt;
&lt;li&gt;Both report &lt;strong&gt;zero quality degradation&lt;/strong&gt; — because the target model still does final verification.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;MTP is emerging as &lt;strong&gt;the de facto standard for inference-time acceleration.&lt;/strong&gt; Much as attention became table stakes, expect MTP-style speculation to land, in some form, in nearly every production decoder over the next year.&lt;/p&gt;
&lt;h2 id="5-cloud-and-edge-advancing-in-parallel"&gt;5. Cloud and Edge Advancing in Parallel
&lt;/h2&gt;&lt;p&gt;On the same day OpenAI shipped &lt;a class="link" href="https://openai.com/index/advancing-voice-intelligence-with-new-models-in-the-api" target="_blank" rel="noopener"
 &gt;three Realtime voice models&lt;/a&gt; and &lt;a class="link" href="https://openai.com/index/mrc-supercomputer-networking" target="_blank" rel="noopener"
 &gt;MRC supercomputer networking&lt;/a&gt;, Google shipped LiteRT-LM v0.11.0. One side: a single company anchoring a five-vendor consortium to &lt;strong&gt;set supercomputer networking standards.&lt;/strong&gt; The other: making LLMs &lt;strong&gt;production-ready inside something that fits in your hand.&lt;/strong&gt; What&amp;rsquo;s load-bearing is that both are production-ready: LLMs are no longer a &amp;ldquo;cloud or edge&amp;rdquo; choice; &lt;strong&gt;both fronts are improving simultaneously.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id="insights"&gt;Insights
&lt;/h2&gt;&lt;p&gt;LiteRT-LM v0.11.0 looks like a minor release but carries two signals at once. First, &lt;strong&gt;MTP reaching mobile GPUs&lt;/strong&gt; means speculative-decoding-family techniques are no longer a data-center luxury — they now run within the battery and thermal budget of a phone. Second, &lt;strong&gt;native Windows support&lt;/strong&gt; is not just an OS port; it repositions LiteRT-LM from a mobile demo library to &lt;strong&gt;a developer&amp;rsquo;s first screen.&lt;/strong&gt; Qwen3.5&amp;rsquo;s MTP-2 and Gemma 4&amp;rsquo;s MTP landing in the same week is no coincidence — it signals that &lt;strong&gt;decode-speed wins are about to matter as much as model-size wins&lt;/strong&gt; through late 2026. While the cloud side moves with GPT-Realtime-2 + MRC, the edge side keeps pace with Gemma 4 + LiteRT-LM, and this is the first quarter where both fronts go production-ready at the same time. For developers wanting to try it immediately, the entry path is one line on Windows: &lt;code&gt;litert-lm run --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id="references"&gt;References
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Release&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/google-ai-edge/LiteRT-LM/releases/tag/v0.11.0" target="_blank" rel="noopener"
 &gt;google-ai-edge/LiteRT-LM v0.11.0 release page&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/google-ai-edge/LiteRT-LM" target="_blank" rel="noopener"
 &gt;google-ai-edge/LiteRT-LM repository&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Model and runtime docs&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://ai.google.dev/edge/litert" target="_blank" rel="noopener"
 &gt;LiteRT homepage (ai.google.dev/edge/litert)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://ai.google.dev/edge/litert-lm" target="_blank" rel="noopener"
 &gt;LiteRT-LM official docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://ai.google.dev/edge/litert-lm/models/gemma-4" target="_blank" rel="noopener"
 &gt;Gemma 4 with LiteRT-LM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://ai.google.dev/edge/litert-lm/cli" target="_blank" rel="noopener"
 &gt;LiteRT-LM CLI docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://ai.google.dev/gemma" target="_blank" rel="noopener"
 &gt;Gemma model family&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.tensorflow.org/lite" target="_blank" rel="noopener"
 &gt;TensorFlow Lite (LiteRT predecessor)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://huggingface.co/litert-community" target="_blank" rel="noopener"
 &gt;Hugging Face — litert-community&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;MTP technique references&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/" target="_blank" rel="noopener"
 &gt;Google: Multi-token Prediction for Gemma 4&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://arxiv.org/abs/2211.17192" target="_blank" rel="noopener"
 &gt;Fast Inference from Transformers via Speculative Decoding (arXiv:2211.17192)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Workstation comparison from the same family of techniques: DGX Spark + Qwen3.5 with MTP-2 hitting +36% decode (previous post)&lt;/li&gt;
&lt;/ul&gt;</description></item></channel></rss>