<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Mtp on ICE-ICE-BEAR-BLOG</title><link>https://ice-ice-bear.github.io/tags/mtp/</link><description>Recent content in Mtp on ICE-ICE-BEAR-BLOG</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Thu, 07 May 2026 00:00:00 +0900</lastBuildDate><atom:link href="https://ice-ice-bear.github.io/tags/mtp/index.xml" rel="self" type="application/rss+xml"/><item><title>LiteRT-LM v0.11.0 — Gemma 4 MTP Doubles Mobile GPU Decode, Windows Goes Native</title><link>https://ice-ice-bear.github.io/posts/2026-05-07-litert-lm-v0-11-0-gemma4-mtp/</link><pubDate>Thu, 07 May 2026 00:00:00 +0900</pubDate><guid>https://ice-ice-bear.github.io/posts/2026-05-07-litert-lm-v0-11-0-gemma4-mtp/</guid><description>&lt;img src="https://ice-ice-bear.github.io/" alt="Featured image of post LiteRT-LM v0.11.0 — Gemma 4 MTP Doubles Mobile GPU Decode, Windows Goes Native" /&gt;&lt;h2 id="overview"&gt;Overview
&lt;/h2&gt;&lt;p&gt;Google&amp;rsquo;s on-device LLM runtime &lt;a class="link" href="https://ai.google.dev/edge/litert-lm" target="_blank" rel="noopener"
 &gt;LiteRT-LM&lt;/a&gt; shipped &lt;a class="link" href="https://github.com/google-ai-edge/LiteRT-LM/releases/tag/v0.11.0" target="_blank" rel="noopener"
 &gt;v0.11.0&lt;/a&gt;. Two headline items: &lt;strong&gt;Single Position Multi-token Prediction (MTP)&lt;/strong&gt; for Gemma 4 — more than 2x faster decode on mobile GPUs — and &lt;strong&gt;native Windows support&lt;/strong&gt; (CPU and GPU). Workstation-class results from the same week (DGX Spark + Qwen3.5 with MTP-2 hitting +36%) suggest MTP is hardening into &lt;strong&gt;the common decode-acceleration technique&lt;/strong&gt; spanning mobile up through workstation.&lt;/p&gt;
&lt;pre class="mermaid" style="visibility:hidden"&gt;graph TD
 Input["Input position t"] --&gt; Target["Gemma 4 target model"]
 Input --&gt; Drafter["MTP drafter &amp;lt;br/&amp;gt; (lightweight)"]
 Drafter --&gt; Draft["Draft tokens t+1, t+2, ..., t+k"]
 Draft --&gt; Verify["Target verifies in one forward pass"]
 Target --&gt; Verify
 Verify --&gt; Accept["Accept matching prefix &amp;lt;br/&amp;gt; + 1 extra token"]
 Accept --&gt; Output["Multiple tokens emitted in a single step"]&lt;/pre&gt;&lt;h2 id="1-gemma-4-multi-token-prediction-support"&gt;1. Gemma 4 Multi-token Prediction Support
&lt;/h2&gt;&lt;p&gt;The opening line of the &lt;a class="link" href="https://github.com/google-ai-edge/LiteRT-LM/releases/tag/v0.11.0" target="_blank" rel="noopener"
 &gt;release notes&lt;/a&gt;: &lt;strong&gt;&amp;quot;&amp;gt;2x faster decode on mobile GPUs with zero quality degradation.&amp;quot;&lt;/strong&gt; The mechanism is laid out in the &lt;a class="link" href="https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/" target="_blank" rel="noopener"
 &gt;Google blog post on MTP for Gemma 4&lt;/a&gt; and the &lt;a class="link" href="https://ai.google.dev/edge/litert-lm/models/gemma-4" target="_blank" rel="noopener"
 &gt;official docs&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The trick is a flavor of &lt;strong&gt;speculative decoding&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;At a single position, a lightweight &lt;strong&gt;drafter&lt;/strong&gt; predicts multiple future tokens at once&lt;/li&gt;
&lt;li&gt;The full &lt;strong&gt;target&lt;/strong&gt; model (e.g., Gemma 4 26B / 31B) verifies the entire draft sequence in one forward pass&lt;/li&gt;
&lt;li&gt;If the target agrees, it accepts the whole prefix and emits one additional token of its own&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Standard LLM inference is &lt;strong&gt;memory-bandwidth bound&lt;/strong&gt; — most cycles are spent shuffling parameters around. MTP bends that bottleneck by extracting more tokens per memory pass.&lt;/p&gt;
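&lt;p&gt;To make the accept rule concrete, here is a minimal, framework-free sketch of the verify-and-accept step from the list above. It illustrates the general speculative-decoding contract under a greedy-decoding assumption; the function name and inputs are invented for the example and are not LiteRT-LM&amp;rsquo;s actual API.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# Minimal sketch of speculative verify-and-accept (illustration only,
# not LiteRT-LM's implementation). Greedy decoding assumed: a draft token
# is accepted when the target model predicts the same token at that position.
def verify_and_accept(target_argmax, draft_tokens):
    """target_argmax[i] is the target model's greedy prediction for the position
    right after draft_tokens[:i]; all of them come from one forward pass."""
    accepted = []
    for i, tok in enumerate(draft_tokens):
        if target_argmax[i] == tok:
            accepted.append(tok)                # draft confirmed by the target
        else:
            accepted.append(target_argmax[i])   # correction from the target
            return accepted                     # stop at the first mismatch
    # Whole draft matched: the target's next prediction is a free extra token.
    accepted.append(target_argmax[len(draft_tokens)])
    return accepted

# Drafter proposed 3 tokens; the target agrees with the first two.
print(verify_and_accept([11, 22, 99, 44], [11, 22, 33]))  # [11, 22, 99]&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;One step can therefore emit anywhere from one token (first draft rejected) up to the full draft plus one bonus token, which is exactly the &amp;ldquo;more tokens per memory pass&amp;rdquo; lever described above.&lt;/p&gt;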
&lt;p&gt;&lt;strong&gt;Speedups by platform:&lt;/strong&gt;&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Platform&lt;/th&gt;
 &lt;th&gt;Backend&lt;/th&gt;
 &lt;th&gt;Speedup&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Mobile GPU (Samsung S26 Ultra, iPhone 17 Pro)&lt;/td&gt;
 &lt;td&gt;GPU&lt;/td&gt;
 &lt;td&gt;up to 2.2x decode&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Mobile CPU&lt;/td&gt;
 &lt;td&gt;CPU&lt;/td&gt;
 &lt;td&gt;up to 1.5x decode&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Apple Silicon (M4 MacBook Pro)&lt;/td&gt;
 &lt;td&gt;CPU + SME&lt;/td&gt;
 &lt;td&gt;substantial (~2.2x at batch 4–8)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;NVIDIA RTX PRO 6000 (26B)&lt;/td&gt;
 &lt;td&gt;GPU&lt;/td&gt;
 &lt;td&gt;~50% latency reduction&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;NVIDIA RTX 4090 / Linux ARM&lt;/td&gt;
 &lt;td&gt;GPU&lt;/td&gt;
 &lt;td&gt;consistent acceleration&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Important caveat:&lt;/strong&gt; MTP is recommended universally on the GPU backend; on CPU it is recommended for the E4B model. &lt;strong&gt;For E2B on CPU, freeform generation may run slightly slower&lt;/strong&gt;, though rewrite, summarization, and coding tasks (which have long input prefixes) still come out ahead.&lt;/p&gt;
&lt;p&gt;Supported models start with &lt;a class="link" href="https://ai.google.dev/edge/litert-lm/models/gemma-4" target="_blank" rel="noopener"
 &gt;&lt;code&gt;Gemma-4-E2B&lt;/code&gt;&lt;/a&gt; (2.58 GB) and &lt;code&gt;Gemma-4-E4B&lt;/code&gt; (3.65 GB); 26B A4B and 31B are coming soon.&lt;/p&gt;
&lt;h2 id="2-native-windows-support"&gt;2. Native Windows Support
&lt;/h2&gt;&lt;p&gt;The &lt;a class="link" href="https://ai.google.dev/edge/litert-lm/cli" target="_blank" rel="noopener"
 &gt;LiteRT-LM CLI&lt;/a&gt; now runs natively on Windows with &lt;strong&gt;both CPU and GPU backends&lt;/strong&gt;. Previously Linux, macOS, and Android were the focus, so Windows developers had to go through WSL.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;litert-lm run --from-huggingface-repo&lt;span class="o"&gt;=&lt;/span&gt;litert-community/gemma-4-E2B-it-litert-lm
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The unstated intent is loud — &lt;strong&gt;bring workstation and laptop developers in directly.&lt;/strong&gt; The friction of needing an Android device just to try things is gone.&lt;/p&gt;
&lt;h2 id="3-the-litert-stack--tf-lites-successor"&gt;3. The LiteRT Stack — TF Lite&amp;rsquo;s Successor
&lt;/h2&gt;&lt;p&gt;Step back and the placement makes sense:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;TensorFlow Lite&lt;/strong&gt; (former name) → &lt;a class="link" href="https://ai.google.dev/edge/litert" target="_blank" rel="noopener"
 &gt;LiteRT&lt;/a&gt; (Light Runtime, 2024 rebrand)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;LiteRT-LM&lt;/strong&gt; = the LLM-specialized variant of LiteRT&lt;/li&gt;
&lt;li&gt;Model family: &lt;a class="link" href="https://ai.google.dev/gemma" target="_blank" rel="noopener"
 &gt;Gemma&lt;/a&gt; — Google&amp;rsquo;s open-weight LLMs&lt;/li&gt;
&lt;li&gt;Target: &lt;strong&gt;on-device inference&lt;/strong&gt; — mobile, edge, embedded, desktop&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Apache 2.0. CPU + GPU + (on Apple Silicon) SME backends. The &lt;a class="link" href="https://huggingface.co/litert-community" target="_blank" rel="noopener"
 &gt;&lt;code&gt;litert-community&lt;/code&gt;&lt;/a&gt; repo on Hugging Face plugs in directly.&lt;/p&gt;
&lt;h2 id="4-mtp-is-becoming-the-standard"&gt;4. MTP Is Becoming the Standard
&lt;/h2&gt;&lt;p&gt;The interesting part: MTP isn&amp;rsquo;t a one-company, one-model trick.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A few days ago, the &lt;a class="link" href="https://ice-ice-bear.github.io/posts/2026-05-07-dgx-spark-qwen35-inference-tuning/"&gt;albond DGX Spark + Qwen3.5 post&lt;/a&gt; reported &lt;strong&gt;MTP-2&lt;/strong&gt; giving +36% decode on workstation-class GPUs.&lt;/li&gt;
&lt;li&gt;Gemma 4 + LiteRT-LM gets &lt;strong&gt;2.2x on mobile GPUs&lt;/strong&gt; from the same idea.&lt;/li&gt;
&lt;li&gt;Both report &lt;strong&gt;zero quality degradation&lt;/strong&gt; — because the target model still does final verification.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;MTP is emerging as &lt;strong&gt;the de facto standard for inference-time acceleration.&lt;/strong&gt; Just as attention became a default building block, expect MTP-style speculation to land, in some form, in nearly every production decoder over the next year.&lt;/p&gt;
&lt;h2 id="5-cloud-and-edge-advancing-in-parallel"&gt;5. Cloud and Edge Advancing in Parallel
&lt;/h2&gt;&lt;p&gt;Same day, OpenAI shipped &lt;a class="link" href="https://openai.com/index/advancing-voice-intelligence-with-new-models-in-the-api" target="_blank" rel="noopener"
 &gt;three Realtime voice models&lt;/a&gt; and &lt;a class="link" href="https://openai.com/index/mrc-supercomputer-networking" target="_blank" rel="noopener"
 &gt;MRC supercomputer networking&lt;/a&gt;; same day, Google shipped LiteRT-LM v0.11.0. One side: a single company anchoring a five-vendor consortium to &lt;strong&gt;set supercomputer networking standards.&lt;/strong&gt; The other: making LLMs &lt;strong&gt;production-ready inside something that fits in your hand.&lt;/strong&gt; What&amp;rsquo;s load-bearing is that both are production-ready — LLMs are no longer &amp;ldquo;cloud or edge&amp;rdquo; but &lt;strong&gt;both improving simultaneously.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id="insights"&gt;Insights
&lt;/h2&gt;&lt;p&gt;LiteRT-LM v0.11.0 looks like a small minor release but carries two signals together. First, &lt;strong&gt;MTP reaching mobile GPUs&lt;/strong&gt; means speculative-decoding-family techniques are no longer a data-center luxury — they now run within the battery and thermal budget of a phone. Second, &lt;strong&gt;native Windows support&lt;/strong&gt; is not just an OS port; it repositions LiteRT-LM from a mobile demo library to &lt;strong&gt;a developer&amp;rsquo;s first screen.&lt;/strong&gt; Qwen3.5&amp;rsquo;s MTP-2 and Gemma 4&amp;rsquo;s MTP landing in the same week is not coincidence — it signals that &lt;strong&gt;decode-speed wins are about to matter as much as model-size wins&lt;/strong&gt; through late 2026. While the cloud side moves with GPT-Realtime-2 + MRC, the edge side keeps pace with Gemma 4 + LiteRT-LM, and this is the first quarter where both fronts go production-ready at the same time. For developers wanting to try it immediately, the entry path is one line on Windows: &lt;code&gt;litert-lm run --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id="references"&gt;References
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Release&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/google-ai-edge/LiteRT-LM/releases/tag/v0.11.0" target="_blank" rel="noopener"
 &gt;google-ai-edge/LiteRT-LM v0.11.0 release page&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/google-ai-edge/LiteRT-LM" target="_blank" rel="noopener"
 &gt;google-ai-edge/LiteRT-LM repository&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Model and runtime docs&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://ai.google.dev/edge/litert" target="_blank" rel="noopener"
 &gt;LiteRT homepage (ai.google.dev/edge/litert)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://ai.google.dev/edge/litert-lm" target="_blank" rel="noopener"
 &gt;LiteRT-LM official docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://ai.google.dev/edge/litert-lm/models/gemma-4" target="_blank" rel="noopener"
 &gt;Gemma 4 with LiteRT-LM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://ai.google.dev/edge/litert-lm/cli" target="_blank" rel="noopener"
 &gt;LiteRT-LM CLI docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://ai.google.dev/gemma" target="_blank" rel="noopener"
 &gt;Gemma model family&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.tensorflow.org/lite" target="_blank" rel="noopener"
 &gt;TensorFlow Lite (LiteRT predecessor)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://huggingface.co/litert-community" target="_blank" rel="noopener"
 &gt;Hugging Face — litert-community&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;MTP technique references&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/" target="_blank" rel="noopener"
 &gt;Google: Multi-token Prediction for Gemma 4&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://arxiv.org/abs/2211.17192" target="_blank" rel="noopener"
 &gt;Speculative decoding background paper (arXiv)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Workstation comparison from the same family of techniques: DGX Spark + Qwen3.5 with MTP-2 hitting +36% decode (previous post)&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Pushing Qwen3.5-122B from 28.3 to 51 tok per second on a single DGX Spark</title><link>https://ice-ice-bear.github.io/posts/2026-05-07-dgx-spark-qwen35-inference-tuning/</link><pubDate>Thu, 07 May 2026 00:00:00 +0900</pubDate><guid>https://ice-ice-bear.github.io/posts/2026-05-07-dgx-spark-qwen35-inference-tuning/</guid><description>&lt;img src="https://ice-ice-bear.github.io/" alt="Featured image of post Pushing Qwen3.5-122B from 28.3 to 51 tok per second on a single DGX Spark" /&gt;&lt;h2 id="overview"&gt;Overview
&lt;/h2&gt;&lt;p&gt;&lt;a class="link" href="https://github.com/albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4" target="_blank" rel="noopener"
 &gt;albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4&lt;/a&gt; is a recipe that pushes &lt;a class="link" href="https://huggingface.co/Qwen/Qwen3.5-122B-A10B" target="_blank" rel="noopener"
 &gt;Qwen3.5-122B-A10B&lt;/a&gt; from 28.3 to 51 tok/s on a single &lt;a class="link" href="https://www.nvidia.com/en-us/products/workstations/dgx-spark/" target="_blank" rel="noopener"
 &gt;NVIDIA DGX Spark&lt;/a&gt;, an 80 percent gain. It stacks five orthogonal techniques on top of vLLM 0.19: AutoRound INT4 quantization, an FP8 dense-layer hybrid, MTP-2 speculative decoding, an INT8 LM head, and optional TurboQuant KV cache compression — all while preserving 256K context. Apache 2.0, 171 stars on GitHub. The interesting question it answers in the affirmative: can a single workstation actually serve a 100B-class MoE model at production speed?&lt;/p&gt;
&lt;pre class="mermaid" style="visibility:hidden"&gt;flowchart LR
 Base["Baseline &amp;lt;br/&amp;gt; 28.3 tok/s"] --&gt; S1["+ Hybrid INT4+FP8 &amp;lt;br/&amp;gt; 30.8 tok/s"]
 S1 --&gt; S2["+ MTP-2 Speculative &amp;lt;br/&amp;gt; 38.4 tok/s"]
 S2 --&gt; V2["v2: + INT8 LM Head &amp;lt;br/&amp;gt; 51 tok/s"]
 V2 --&gt; TQ["v2-tq: + TurboQuant KV &amp;lt;br/&amp;gt; 39 tok/s &amp;lt;br/&amp;gt; 1.4M KV"]&lt;/pre&gt;&lt;h2 id="results"&gt;Results
&lt;/h2&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Build&lt;/th&gt;
 &lt;th&gt;tok/s&lt;/th&gt;
 &lt;th&gt;Gain vs baseline&lt;/th&gt;
 &lt;th&gt;Image&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Baseline (vLLM 0.19 + AutoRound INT4 + FlashInfer)&lt;/td&gt;
 &lt;td&gt;28.3&lt;/td&gt;
 &lt;td&gt;—&lt;/td&gt;
 &lt;td&gt;—&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;+ Hybrid INT4+FP8 dense layers&lt;/td&gt;
 &lt;td&gt;30.8&lt;/td&gt;
 &lt;td&gt;+8.8%&lt;/td&gt;
 &lt;td&gt;step 1&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;+ MTP-2 speculative decoding&lt;/td&gt;
 &lt;td&gt;38.4&lt;/td&gt;
 &lt;td&gt;+35.7%&lt;/td&gt;
 &lt;td&gt;step 2&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;v2&lt;/strong&gt; (+ INT8 LM head v2)&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;51&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;+80%&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;Dockerfile.v2&lt;/code&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;v2-tq (+ TurboQuant KV cache)&lt;/td&gt;
 &lt;td&gt;39&lt;/td&gt;
 &lt;td&gt;+38%&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;Dockerfile.v2-tq&lt;/code&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The same stack pushes Qwen3.5-35B-A3B (the smaller sibling) to 112 tok/s.&lt;/p&gt;
&lt;h3 id="256k-context-tradeoff"&gt;256K context tradeoff
&lt;/h3&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Build&lt;/th&gt;
 &lt;th&gt;KV cache&lt;/th&gt;
 &lt;th&gt;Concurrent 256K-context users&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;v2 (standard)&lt;/td&gt;
 &lt;td&gt;355K tokens&lt;/td&gt;
 &lt;td&gt;1&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;v2-tq (TurboQuant)&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;1.4M tokens&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;5&lt;/strong&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id="the-model-in-one-paragraph"&gt;The model in one paragraph
&lt;/h2&gt;&lt;p&gt;&lt;a class="link" href="https://huggingface.co/Qwen/Qwen3.5-122B-A10B" target="_blank" rel="noopener"
 &gt;Qwen3.5-122B-A10B&lt;/a&gt; is a hybrid MoE that activates 10B of its 122B parameters per token: 256 experts with 8 routed plus 1 shared, 48 layers alternating Gated DeltaNet and Gated Attention at a 12:1 ratio, native 262K context (extensible to 1M with YaRN), Apache 2.0. The starting point for this recipe is &lt;a class="link" href="https://huggingface.co/Intel/Qwen3.5-122B-A10B-int4-AutoRound" target="_blank" rel="noopener"
 &gt;Intel/Qwen3.5-122B-A10B-int4-AutoRound&lt;/a&gt;, produced with &lt;a class="link" href="https://github.com/intel/auto-round" target="_blank" rel="noopener"
 &gt;Intel AutoRound&lt;/a&gt; at group size 128 with &lt;code&gt;shared_expert&lt;/code&gt; left out of quantization.&lt;/p&gt;
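&lt;p&gt;For intuition on what &amp;ldquo;8 routed plus 1 shared&amp;rdquo; means per token, here is a minimal top-k routing sketch. The gating details (softmax over router logits, renormalized top-k weights, an always-on shared expert) are generic MoE conventions assumed for illustration, not Qwen3.5&amp;rsquo;s exact router.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# Toy top-k MoE routing: for each token, activate 8 routed experts out of
# 256 plus one always-on shared expert. Generic convention for illustration;
# not Qwen3.5's actual router.
import math, random

NUM_EXPERTS, TOP_K = 256, 8

def route_one_token(router_logits):
    """Return (expert_id, weight) pairs chosen for a single token."""
    exps = [math.exp(x) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]                          # softmax over all experts
    top = sorted(range(NUM_EXPERTS), key=lambda i: probs[i], reverse=True)[:TOP_K]
    norm = sum(probs[i] for i in top)
    routed = [(i, probs[i] / norm) for i in top]               # renormalized top-8 weights
    return routed + [("shared", 1.0)]                          # shared expert always fires

print(route_one_token([random.gauss(0, 1) for _ in range(NUM_EXPERTS)]))&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Only those nine experts (plus the dense layers) touch memory for a given token, which is how a 122B checkpoint behaves like a roughly 10B model at decode time.&lt;/p&gt;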
&lt;h2 id="the-five-techniques"&gt;The five techniques
&lt;/h2&gt;&lt;h3 id="1-hybrid-int4--fp8-dense-layers-9"&gt;1. Hybrid INT4 + FP8 dense layers (+9%)
&lt;/h3&gt;&lt;p&gt;Replace the BF16 shared-expert weights inside the AutoRound INT4 model with FP8 weights from the official Qwen checkpoint. Net effect: experts stay INT4, dense layers run FP8. Memory and compute drop without touching accuracy.&lt;/p&gt;
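&lt;p&gt;A rough sketch of what the swap amounts to: walk the FP8 checkpoint and overwrite the matching dense/shared-expert entries in the INT4 model&amp;rsquo;s state dict. The key pattern below is a placeholder based on the &lt;code&gt;shared_expert&lt;/code&gt; naming mentioned above; the repo&amp;rsquo;s actual conversion script should be consulted for the real tensor names and formats.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# Illustrative sketch of the hybrid INT4 + FP8 merge (hypothetical key
# patterns; real checkpoints also differ in quantization metadata).
def build_hybrid_state_dict(int4_sd, fp8_sd, dense_patterns=("shared_expert",)):
    """Start from the AutoRound INT4 checkpoint and overwrite dense /
    shared-expert weights with the FP8 tensors from the official checkpoint."""
    hybrid = dict(int4_sd)                       # routed experts stay INT4
    for name, tensor in fp8_sd.items():
        if any(p in name for p in dense_patterns):
            hybrid[name] = tensor                # dense layers now run FP8
    return hybrid&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;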
&lt;h3 id="2-mtp-2-speculative-decoding-36"&gt;2. MTP-2 speculative decoding (+36%)
&lt;/h3&gt;&lt;p&gt;&lt;a class="link" href="https://arxiv.org/abs/2404.19737" target="_blank" rel="noopener"
 &gt;Multi-Token Prediction&lt;/a&gt; generates 2 tokens per step with roughly 80 percent acceptance, lifting throughput from 30.8 to 38.4 tok/s. Notably there is no separate draft model: the main model itself runs multi-head prediction, which keeps the deployment simple.&lt;/p&gt;
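&lt;p&gt;A back-of-envelope check on why roughly 80 percent acceptance does not translate into a clean 2x: under a simple independence assumption, 2 drafted tokens plus a bonus token give about 2.44 expected tokens per verification step, and the per-step cost of the MTP heads and the wider verify pass eats part of that. The model below is an assumption for illustration, not the repo&amp;rsquo;s measurement methodology.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# Expected tokens emitted per verification step with k draft tokens and
# per-token acceptance probability p (independence assumed; sketch only).
def expected_tokens_per_step(k, p):
    prefix = sum(p ** i for i in range(1, k + 1))   # expected accepted-draft prefix length
    return 1.0 + prefix                              # +1: correction or bonus token

ideal = expected_tokens_per_step(k=2, p=0.8)         # about 2.44 tokens per step
observed = 38.4 / 30.8                               # step gain from the table, about 1.25x
print(ideal, observed)   # the gap is the overhead of the MTP heads and the wider verify pass&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;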
&lt;h3 id="3-int8-lm-head-v2-triton-kernel"&gt;3. INT8 LM head v2 (Triton kernel)
&lt;/h3&gt;&lt;p&gt;Quantizes the final vocabulary projection to INT8 via a custom Triton kernel. This is the biggest jump in the v2 build (38.4 to 51 tok/s). LM heads are usually exempt from quantization, but on models with very large vocabularies the cost is high enough that revisiting the assumption pays off.&lt;/p&gt;
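&lt;p&gt;For intuition on the numerics, here is a per-channel symmetric INT8 quantization of an LM-head weight matrix in plain PyTorch. It only shows the quantize/dequantize math on toy sizes; the repo&amp;rsquo;s gain comes from a fused Triton kernel, which this sketch does not reproduce.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# Per-row (per-vocab-channel) symmetric INT8 quantization of an LM head.
# Numerics only, on toy sizes; not the repo's Triton kernel.
import torch

def quantize_lm_head(weight):
    """weight: [vocab, hidden]. Returns int8 weights and per-row scales."""
    scales = weight.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp(torch.round(weight / scales), -127, 127).to(torch.int8)
    return q, scales

def lm_head_logits(hidden_states, q, scales):
    """Dequantize-on-the-fly matmul: logits = hidden @ (q * scales).T"""
    return hidden_states @ (q.float() * scales).t()

vocab, hidden = 32_000, 1_024                 # toy sizes; real vocabularies are larger
w = torch.randn(vocab, hidden)
q, s = quantize_lm_head(w)
x = torch.randn(2, hidden)
print((lm_head_logits(x, q, s) - x @ w.t()).abs().max())   # small quantization error&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;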
&lt;h3 id="4-turboquant-kv-cache-optional"&gt;4. TurboQuant KV cache (optional)
&lt;/h3&gt;&lt;p&gt;&lt;a class="link" href="https://github.com/microsoft/turbo-quant" target="_blank" rel="noopener"
 &gt;TurboQuant&lt;/a&gt; compresses the KV cache 4x. Absolute throughput drops slightly versus v2, but concurrent 256K-context users go from 1 to 5 — a meaningful tradeoff for long-context multi-tenant workloads.&lt;/p&gt;
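&lt;p&gt;The concurrency numbers in the table above are just KV-budget arithmetic: a 4x smaller cache entry turns the same memory into roughly four times the token budget, and dividing by the 256K window gives the concurrent-user count.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# KV-cache budget arithmetic using the figures reported above.
KV_BUDGET_V2 = 355_000        # tokens that fit in the v2 build's KV cache
COMPRESSION = 4               # TurboQuant's reported compression ratio
CONTEXT = 262_144             # the 256K window

kv_budget_tq = KV_BUDGET_V2 * COMPRESSION
print(kv_budget_tq)                                        # about 1.4M tokens
print(KV_BUDGET_V2 // CONTEXT, kv_budget_tq // CONTEXT)    # 1 vs 5 concurrent 256K users&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;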
&lt;h2 id="environment"&gt;Environment
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;vLLM 0.19.1, CUDA 13.0, Docker-based&lt;/li&gt;
&lt;li&gt;Inference stack: &lt;a class="link" href="https://github.com/vllm-project/vllm" target="_blank" rel="noopener"
 &gt;vLLM 0.19&lt;/a&gt; + &lt;a class="link" href="https://github.com/flashinfer-ai/flashinfer" target="_blank" rel="noopener"
 &gt;FlashInfer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Model: &lt;code&gt;Intel/Qwen3.5-122B-A10B-int4-AutoRound&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;One-shot &lt;code&gt;./install.sh&lt;/code&gt; runs steps 0 through 4, idempotent&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="insights"&gt;Insights
&lt;/h2&gt;&lt;p&gt;51 tok/s on a 100B-class model from a single workstation lands close to the 60 tok/s zone that feels native in a chat UI, which is the real news here. For a 171-star repo the engineering is unusually tight — bench tables, step-wise Dockerfiles, install.sh, vLLM/CUDA version notes — and you can run it as written. The deeper lesson is that the five techniques are orthogonal: hybrid quant attacks memory and accuracy, MTP attacks decoding parallelism, INT8 LM head attacks compute, and TurboQuant attacks KV memory. The 80 percent number is not one big trick but a sequence of bottleneck migrations. The v2 versus v2-tq split also shows that throughput and concurrency are different axes — pick the build that matches your workload, not the highest single-stream number. Expect this hybrid-quant plus speculative plus custom-kernel stack to land as a default in vLLM and SGLang within a quarter or two, at which point &amp;ldquo;100B in one box&amp;rdquo; stops being a demo.&lt;/p&gt;
&lt;h2 id="references"&gt;References
&lt;/h2&gt;&lt;h3 id="repo-and-model-cards"&gt;Repo and model cards
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4" target="_blank" rel="noopener"
 &gt;albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4&lt;/a&gt; — 171 stars, Apache 2.0&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://huggingface.co/Qwen/Qwen3.5-122B-A10B" target="_blank" rel="noopener"
 &gt;Qwen/Qwen3.5-122B-A10B&lt;/a&gt; — 122B/10B hybrid MoE, 262K context&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://huggingface.co/Intel/Qwen3.5-122B-A10B-int4-AutoRound" target="_blank" rel="noopener"
 &gt;Intel/Qwen3.5-122B-A10B-int4-AutoRound&lt;/a&gt; — INT4 group128&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.nvidia.com/en-us/products/workstations/dgx-spark/" target="_blank" rel="noopener"
 &gt;NVIDIA DGX Spark&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="inference-frameworks"&gt;Inference frameworks
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/vllm-project/vllm" target="_blank" rel="noopener"
 &gt;vLLM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/flashinfer-ai/flashinfer" target="_blank" rel="noopener"
 &gt;FlashInfer&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="optimization-techniques"&gt;Optimization techniques
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/intel/auto-round" target="_blank" rel="noopener"
 &gt;Intel AutoRound (arXiv:2309.05516)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://arxiv.org/abs/2404.19737" target="_blank" rel="noopener"
 &gt;Multi-Token Prediction (arXiv:2404.19737)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/microsoft/turbo-quant" target="_blank" rel="noopener"
 &gt;TurboQuant&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item></channel></rss>