<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>On Device on ICE-ICE-BEAR-BLOG</title><link>https://ice-ice-bear.github.io/tags/on-device/</link><description>Recent content in On Device on ICE-ICE-BEAR-BLOG</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Thu, 07 May 2026 00:00:00 +0900</lastBuildDate><atom:link href="https://ice-ice-bear.github.io/tags/on-device/index.xml" rel="self" type="application/rss+xml"/><item><title>LiteRT-LM v0.11.0 — Gemma 4 MTP Doubles Mobile GPU Decode, Windows Goes Native</title><link>https://ice-ice-bear.github.io/posts/2026-05-07-litert-lm-v0-11-0-gemma4-mtp/</link><pubDate>Thu, 07 May 2026 00:00:00 +0900</pubDate><guid>https://ice-ice-bear.github.io/posts/2026-05-07-litert-lm-v0-11-0-gemma4-mtp/</guid><description>&lt;h2 id="overview"&gt;Overview
&lt;/h2&gt;&lt;p&gt;Google&amp;rsquo;s on-device LLM runtime &lt;a class="link" href="https://ai.google.dev/edge/litert-lm" target="_blank" rel="noopener"
 &gt;LiteRT-LM&lt;/a&gt; shipped &lt;a class="link" href="https://github.com/google-ai-edge/LiteRT-LM/releases/tag/v0.11.0" target="_blank" rel="noopener"
 &gt;v0.11.0&lt;/a&gt;. Two headline items: &lt;strong&gt;Single Position Multi-token Prediction (MTP)&lt;/strong&gt; for Gemma 4 — more than 2x faster decode on mobile GPUs — and &lt;strong&gt;native Windows support&lt;/strong&gt; (CPU and GPU). Workstation-class results from the same week (DGX Spark + Qwen3.5 with MTP-2 hitting +36%) suggest MTP is hardening into &lt;strong&gt;the common decode-acceleration technique&lt;/strong&gt; spanning mobile up through workstation.&lt;/p&gt;
&lt;pre class="mermaid" style="visibility:hidden"&gt;graph TD
 Input["Input position t"] --&gt; Target["Gemma 4 target model"]
 Input --&gt; Drafter["MTP drafter &amp;lt;br/&amp;gt; (lightweight)"]
 Drafter --&gt; Draft["Draft tokens t+1, t+2, ..., t+k"]
 Draft --&gt; Verify["Target verifies in one forward pass"]
 Target --&gt; Verify
 Verify --&gt; Accept["Accept matching prefix &amp;lt;br/&amp;gt; + 1 extra token"]
 Accept --&gt; Output["Multiple tokens emitted in a single step"]&lt;/pre&gt;&lt;h2 id="1-gemma-4-multi-token-prediction-support"&gt;1. Gemma 4 Multi-token Prediction Support
&lt;/h2&gt;&lt;p&gt;The opening line of the &lt;a class="link" href="https://github.com/google-ai-edge/LiteRT-LM/releases/tag/v0.11.0" target="_blank" rel="noopener"
 &gt;release notes&lt;/a&gt;: &lt;strong&gt;&amp;quot;&amp;gt;2x faster decode on mobile GPUs with zero quality degradation.&amp;quot;&lt;/strong&gt; The mechanism is laid out in the &lt;a class="link" href="https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/" target="_blank" rel="noopener"
 &gt;Google blog post on MTP for Gemma 4&lt;/a&gt; and the &lt;a class="link" href="https://ai.google.dev/edge/litert-lm/models/gemma-4" target="_blank" rel="noopener"
 &gt;official docs&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The trick is a flavor of &lt;strong&gt;speculative decoding&lt;/strong&gt; (sketched in code after this list):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;At a single position, a lightweight &lt;strong&gt;drafter&lt;/strong&gt; predicts multiple future tokens at once&lt;/li&gt;
&lt;li&gt;The full &lt;strong&gt;target&lt;/strong&gt; model (e.g., Gemma 4 26B / 31B) verifies the entire draft sequence in one forward pass&lt;/li&gt;
&lt;li&gt;The target accepts the longest agreeing prefix of the draft; when the entire draft matches, it also emits one additional token of its own&lt;/li&gt;
&lt;/ul&gt;
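&lt;p&gt;A minimal Python sketch of that draft-verify-accept loop under greedy decoding; &lt;code&gt;target&lt;/code&gt;, &lt;code&gt;drafter&lt;/code&gt;, and their methods are hypothetical stand-ins, not LiteRT-LM&amp;rsquo;s actual API:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# Hypothetical model objects: logits[j] is the target's distribution
# over the token at position j+1. Greedy decoding throughout.
def mtp_decode_step(target, drafter, tokens, k=4):
    draft = drafter.propose(tokens, num_tokens=k)  # draft tokens t+1 .. t+k
    logits = target.forward(tokens + draft)        # ONE target forward pass
    accepted = []
    for i, d in enumerate(draft):
        best = int(logits[len(tokens) + i - 1].argmax())
        if best != d:                # first disagreement: take the target's
            accepted.append(best)    # own token and stop
            break
        accepted.append(d)           # agreement: keep the drafted token
    else:
        # Whole draft accepted: the same pass also yields one bonus token.
        accepted.append(int(logits[-1].argmax()))
    return tokens + accepted
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Under greedy decoding the emitted tokens are exactly what the target alone would have produced, which is where the &amp;ldquo;zero quality degradation&amp;rdquo; claim comes from; the win is that several of them cost only one target pass.&lt;/p&gt;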
&lt;p&gt;Standard LLM inference is &lt;strong&gt;memory-bandwidth bound&lt;/strong&gt; — most cycles are spent shuffling parameters around. MTP bends that bottleneck by extracting more tokens per memory pass.&lt;/p&gt;
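&lt;p&gt;Back-of-envelope, with illustrative numbers only: decode rate ≈ memory bandwidth ÷ bytes read per step. If the 2.58 GB &lt;code&gt;Gemma-4-E2B&lt;/code&gt; weights are re-read every step over an assumed ~60 GB/s of effective mobile bandwidth, the ceiling is about 23 tokens/s at one token per pass; accept an average of ~2.2 tokens per pass and the same memory traffic yields roughly 51 tokens/s.&lt;/p&gt;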
&lt;p&gt;&lt;strong&gt;Speedups by platform:&lt;/strong&gt;&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Platform&lt;/th&gt;
 &lt;th&gt;Backend&lt;/th&gt;
 &lt;th&gt;Speedup&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Mobile GPU (Samsung S26 Ultra, iPhone 17 Pro)&lt;/td&gt;
 &lt;td&gt;GPU&lt;/td&gt;
 &lt;td&gt;up to 2.2x decode&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Mobile CPU&lt;/td&gt;
 &lt;td&gt;CPU&lt;/td&gt;
 &lt;td&gt;up to 1.5x decode&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Apple Silicon (M4 MacBook Pro)&lt;/td&gt;
 &lt;td&gt;CPU + SME&lt;/td&gt;
 &lt;td&gt;~2.2x decode at batch sizes 4–8&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;NVIDIA RTX PRO 6000 (26B)&lt;/td&gt;
 &lt;td&gt;GPU&lt;/td&gt;
 &lt;td&gt;~50% latency reduction&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;NVIDIA RTX 4090 / Linux ARM&lt;/td&gt;
 &lt;td&gt;GPU&lt;/td&gt;
 &lt;td&gt;consistent acceleration&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Important caveat&lt;/strong&gt;: MTP is universally recommended on GPU, and on CPU for the E4B model. &lt;strong&gt;For E2B on CPU, freeform generation may run slightly slower&lt;/strong&gt;, but rewrite, summarization, and coding tasks still come out ahead: their long input prefixes make the output more predictable, so the drafter&amp;rsquo;s guesses are accepted often enough to cover its overhead.&lt;/p&gt;
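&lt;p&gt;The deciding variable is the draft acceptance rate. Per the speculative-decoding paper cited in the references (arXiv:2211.17192), with per-token acceptance rate α and draft length γ, each target pass emits (1 - α&lt;sup&gt;γ+1&lt;/sup&gt;)/(1 - α) tokens in expectation. A small sketch with illustrative numbers, not measured LiteRT-LM figures:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# Expected tokens per target forward pass (Leviathan et al., arXiv:2211.17192):
# alpha = per-token draft acceptance rate, gamma = drafted tokens per step.
def expected_tokens(alpha, gamma):
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# Illustrative values only:
print(expected_tokens(0.8, 4))  # ~3.36 -- predictable rewrite/coding output
print(expected_tokens(0.4, 4))  # ~1.65 -- barely covers the drafter's overhead
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;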
&lt;p&gt;Supported models start with &lt;a class="link" href="https://ai.google.dev/edge/litert-lm/models/gemma-4" target="_blank" rel="noopener"
 &gt;&lt;code&gt;Gemma-4-E2B&lt;/code&gt;&lt;/a&gt; (2.58 GB) and &lt;code&gt;Gemma-4-E4B&lt;/code&gt; (3.65 GB); 26B A4B and 31B are coming soon.&lt;/p&gt;
&lt;h2 id="2-native-windows-support"&gt;2. Native Windows Support
&lt;/h2&gt;&lt;p&gt;The &lt;a class="link" href="https://ai.google.dev/edge/litert-lm/cli" target="_blank" rel="noopener"
 &gt;LiteRT-LM CLI&lt;/a&gt; now runs natively on Windows with &lt;strong&gt;both CPU and GPU backends&lt;/strong&gt;. Previously Linux, macOS, and Android were the focus, so Windows developers had to go through WSL.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;litert-lm run --from-huggingface-repo&lt;span class="o"&gt;=&lt;/span&gt;litert-community/gemma-4-E2B-it-litert-lm
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The unstated intent is loud — &lt;strong&gt;bring workstation and laptop developers in directly.&lt;/strong&gt; The friction of needing an Android device just to try things is gone.&lt;/p&gt;
&lt;h2 id="3-the-litert-stack--tf-lites-successor"&gt;3. The LiteRT Stack — TF Lite&amp;rsquo;s Successor
&lt;/h2&gt;&lt;p&gt;Step back and the placement makes sense:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;TensorFlow Lite&lt;/strong&gt; (former name) → &lt;a class="link" href="https://ai.google.dev/edge/litert" target="_blank" rel="noopener"
 &gt;LiteRT&lt;/a&gt; (Lite Runtime, 2024 rebrand)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;LiteRT-LM&lt;/strong&gt; = the LLM-specialized variant of LiteRT&lt;/li&gt;
&lt;li&gt;Model family: &lt;a class="link" href="https://ai.google.dev/gemma" target="_blank" rel="noopener"
 &gt;Gemma&lt;/a&gt; — Google&amp;rsquo;s open-weight LLMs&lt;/li&gt;
&lt;li&gt;Target: &lt;strong&gt;on-device inference&lt;/strong&gt; — mobile, edge, embedded, desktop&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Apache 2.0. CPU + GPU + (on Apple Silicon) SME backends. The &lt;a class="link" href="https://huggingface.co/litert-community" target="_blank" rel="noopener"
 &gt;&lt;code&gt;litert-community&lt;/code&gt;&lt;/a&gt; repo on Hugging Face plugs in directly.&lt;/p&gt;
&lt;h2 id="4-mtp-is-becoming-the-standard"&gt;4. MTP Is Becoming the Standard
&lt;/h2&gt;&lt;p&gt;The interesting part: MTP isn&amp;rsquo;t a one-company, one-model trick.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A few days ago, the &lt;a class="link" href="#" &gt;DGX Spark + Qwen3.5 post&lt;/a&gt; reported &lt;strong&gt;MTP-2&lt;/strong&gt; giving +36% decode on workstation-class GPUs.&lt;/li&gt;
&lt;li&gt;Gemma 4 + LiteRT-LM gets &lt;strong&gt;2.2x on mobile GPUs&lt;/strong&gt; from the same idea.&lt;/li&gt;
&lt;li&gt;Both report &lt;strong&gt;zero quality degradation&lt;/strong&gt; — because the target model still does final verification.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;MTP is emerging as &lt;strong&gt;the de facto standard for inference-time acceleration.&lt;/strong&gt; Much as attention became table stakes, expect MTP-style speculation to land, in some form, in nearly every production decoder over the next year.&lt;/p&gt;
&lt;h2 id="5-cloud-and-edge-advancing-in-parallel"&gt;5. Cloud and Edge Advancing in Parallel
&lt;/h2&gt;&lt;p&gt;On the same day OpenAI shipped &lt;a class="link" href="https://openai.com/index/advancing-voice-intelligence-with-new-models-in-the-api" target="_blank" rel="noopener"
 &gt;three Realtime voice models&lt;/a&gt; and &lt;a class="link" href="https://openai.com/index/mrc-supercomputer-networking" target="_blank" rel="noopener"
 &gt;MRC supercomputer networking&lt;/a&gt;, Google shipped LiteRT-LM v0.11.0. One side: a single company anchoring a five-vendor consortium to &lt;strong&gt;set supercomputer networking standards.&lt;/strong&gt; The other: making LLMs &lt;strong&gt;production-ready inside something that fits in your hand.&lt;/strong&gt; What&amp;rsquo;s load-bearing is that both are production-ready: LLMs are no longer a &amp;ldquo;cloud or edge&amp;rdquo; choice; &lt;strong&gt;both fronts are improving simultaneously.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id="insights"&gt;Insights
&lt;/h2&gt;&lt;p&gt;LiteRT-LM v0.11.0 looks like a minor release but carries two signals at once. First, &lt;strong&gt;MTP reaching mobile GPUs&lt;/strong&gt; means speculative-decoding-family techniques are no longer a data-center luxury — they now run within the battery and thermal budget of a phone. Second, &lt;strong&gt;native Windows support&lt;/strong&gt; is not just an OS port; it repositions LiteRT-LM from a mobile demo library to &lt;strong&gt;a developer&amp;rsquo;s first screen.&lt;/strong&gt; Qwen3.5&amp;rsquo;s MTP-2 and Gemma 4&amp;rsquo;s MTP landing in the same week is no coincidence — it signals that &lt;strong&gt;decode-speed wins are about to matter as much as model-size wins&lt;/strong&gt; through late 2026. While the cloud side moves with GPT-Realtime-2 + MRC, the edge side keeps pace with Gemma 4 + LiteRT-LM, and this is the first quarter where both fronts go production-ready at the same time. For developers wanting to try it immediately, the entry path is one line on Windows: &lt;code&gt;litert-lm run --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id="references"&gt;References
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Release&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/google-ai-edge/LiteRT-LM/releases/tag/v0.11.0" target="_blank" rel="noopener"
 &gt;google-ai-edge/LiteRT-LM v0.11.0 release page&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/google-ai-edge/LiteRT-LM" target="_blank" rel="noopener"
 &gt;google-ai-edge/LiteRT-LM repository&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Model and runtime docs&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://ai.google.dev/edge/litert" target="_blank" rel="noopener"
 &gt;LiteRT homepage (ai.google.dev/edge/litert)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://ai.google.dev/edge/litert-lm" target="_blank" rel="noopener"
 &gt;LiteRT-LM official docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://ai.google.dev/edge/litert-lm/models/gemma-4" target="_blank" rel="noopener"
 &gt;Gemma 4 with LiteRT-LM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://ai.google.dev/edge/litert-lm/cli" target="_blank" rel="noopener"
 &gt;LiteRT-LM CLI docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://ai.google.dev/gemma" target="_blank" rel="noopener"
 &gt;Gemma model family&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.tensorflow.org/lite" target="_blank" rel="noopener"
 &gt;TensorFlow Lite (LiteRT predecessor)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://huggingface.co/litert-community" target="_blank" rel="noopener"
 &gt;Hugging Face — litert-community&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;MTP technique references&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/" target="_blank" rel="noopener"
 &gt;Google: Multi-token Prediction for Gemma 4&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://arxiv.org/abs/2211.17192" target="_blank" rel="noopener"
 &gt;Fast Inference from Transformers via Speculative Decoding (arXiv:2211.17192)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Workstation comparison from the same family of techniques: DGX Spark + Qwen3.5 with MTP-2 hitting +36% decode (previous post)&lt;/li&gt;
&lt;/ul&gt;</description></item></channel></rss>