<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Unity on ICE-ICE-BEAR-BLOG</title><link>https://ice-ice-bear.github.io/tags/unity/</link><description>Recent content in Unity on ICE-ICE-BEAR-BLOG</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Thu, 07 May 2026 00:00:00 +0900</lastBuildDate><atom:link href="https://ice-ice-bear.github.io/tags/unity/index.xml" rel="self" type="application/rss+xml"/><item><title>LiteRT-LM v0.11.0 — Gemma 4 MTP Doubles Mobile GPU Decode, Windows Goes Native</title><link>https://ice-ice-bear.github.io/posts/2026-05-07-litert-lm-v0-11-0-gemma4-mtp/</link><pubDate>Thu, 07 May 2026 00:00:00 +0900</pubDate><guid>https://ice-ice-bear.github.io/posts/2026-05-07-litert-lm-v0-11-0-gemma4-mtp/</guid><description>&lt;img src="https://ice-ice-bear.github.io/" alt="Featured image of post LiteRT-LM v0.11.0 — Gemma 4 MTP Doubles Mobile GPU Decode, Windows Goes Native" /&gt;&lt;h2 id="overview"&gt;Overview
&lt;/h2&gt;&lt;p&gt;Google&amp;rsquo;s on-device LLM runtime &lt;a class="link" href="https://ai.google.dev/edge/litert-lm" target="_blank" rel="noopener"
 &gt;LiteRT-LM&lt;/a&gt; shipped &lt;a class="link" href="https://github.com/google-ai-edge/LiteRT-LM/releases/tag/v0.11.0" target="_blank" rel="noopener"
 &gt;v0.11.0&lt;/a&gt;. Two headline items: &lt;strong&gt;Single Position Multi-token Prediction (MTP)&lt;/strong&gt; for Gemma 4 — more than 2x faster decode on mobile GPUs — and &lt;strong&gt;native Windows support&lt;/strong&gt; (CPU and GPU). Workstation-class results from the same week (DGX Spark + Qwen3.5 with MTP-2 hitting +36%) suggest MTP is hardening into &lt;strong&gt;the common decode-acceleration technique&lt;/strong&gt; spanning mobile up through workstation.&lt;/p&gt;
&lt;p&gt;Update 2026-05-11: A community Unity wrapper has surfaced — covered in a new section below.&lt;/p&gt;
&lt;pre class="mermaid" style="visibility:hidden"&gt;graph TD
 Input["Input position t"] --&gt; Target["Gemma 4 target model"]
 Input --&gt; Drafter["MTP drafter &amp;lt;br/&amp;gt; (lightweight)"]
 Drafter --&gt; Draft["Draft tokens t+1, t+2, ..., t+k"]
 Draft --&gt; Verify["Target verifies in one forward pass"]
 Target --&gt; Verify
 Verify --&gt; Accept["Accept matching prefix &amp;lt;br/&amp;gt; + 1 extra token"]
 Accept --&gt; Output["Multiple tokens emitted in a single step"]&lt;/pre&gt;&lt;h2 id="1-gemma-4-multi-token-prediction-support"&gt;1. Gemma 4 Multi-token Prediction Support
&lt;/h2&gt;&lt;p&gt;The opening line of the &lt;a class="link" href="https://github.com/google-ai-edge/LiteRT-LM/releases/tag/v0.11.0" target="_blank" rel="noopener"
 &gt;release notes&lt;/a&gt;: &lt;strong&gt;&amp;quot;&amp;gt;2x faster decode on mobile GPUs with zero quality degradation.&amp;quot;&lt;/strong&gt; The mechanism is laid out in the &lt;a class="link" href="https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/" target="_blank" rel="noopener"
 &gt;Google blog post on MTP for Gemma 4&lt;/a&gt; and the &lt;a class="link" href="https://ai.google.dev/edge/litert-lm/models/gemma-4" target="_blank" rel="noopener"
 &gt;official docs&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The trick is a flavor of &lt;strong&gt;speculative decoding&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;At a single position, a lightweight &lt;strong&gt;drafter&lt;/strong&gt; predicts multiple future tokens at once&lt;/li&gt;
&lt;li&gt;The full &lt;strong&gt;target&lt;/strong&gt; model (e.g., Gemma 4 26B / 31B) verifies the entire draft sequence in one forward pass&lt;/li&gt;
&lt;li&gt;If the target agrees, it accepts the whole prefix and emits one additional token of its own&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Standard LLM inference is &lt;strong&gt;memory-bandwidth bound&lt;/strong&gt; — most cycles are spent shuffling parameters around. MTP bends that bottleneck by extracting more tokens per memory pass.&lt;/p&gt;
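&lt;p&gt;To make the accept/reject mechanics concrete, here is a minimal C# sketch of the single-position draft-then-verify loop. The two model calls are hypothetical stand-ins, not LiteRT-LM&amp;rsquo;s actual API; this is a sketch of the technique, not the runtime&amp;rsquo;s implementation.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-csharp" data-lang="csharp"&gt;using System;
using System.Collections.Generic;

static class MtpSketch
{
    // Hypothetical stand-ins: the drafter predicts k future tokens from one
    // position in a single call; the target verifies the whole draft at once.
    static List&amp;lt;int&amp;gt; DraftTokens(List&amp;lt;int&amp;gt; context, int k) =&amp;gt; throw new NotImplementedException();
    static List&amp;lt;int&amp;gt; TargetVerify(List&amp;lt;int&amp;gt; context, List&amp;lt;int&amp;gt; draft) =&amp;gt; throw new NotImplementedException();

    // One speculative step: draft k tokens from a single position, verify them
    // with the target in one forward pass, emit the agreeing prefix plus one token.
    public static List&amp;lt;int&amp;gt; SpeculativeStep(List&amp;lt;int&amp;gt; context, int k)
    {
        // 1. Lightweight drafter proposes tokens t+1 ... t+k from the current position.
        List&amp;lt;int&amp;gt; draft = DraftTokens(context, k);

        // 2. Target scores all draft positions in one forward pass and returns its
        //    own choice at each position plus one extra token (k + 1 in total).
        List&amp;lt;int&amp;gt; targetTokens = TargetVerify(context, draft);

        // 3. Accept the longest prefix where drafter and target agree, then append
        //    the next target token, so a single step emits up to k + 1 tokens.
        var accepted = new List&amp;lt;int&amp;gt;();
        int n = 0;
        while (n &amp;lt; k &amp;amp;&amp;amp; draft[n] == targetTokens[n]) { accepted.Add(draft[n]); n++; }
        accepted.Add(targetTokens[n]);
        return accepted;
    }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Because the target still decides every emitted token, output quality matches plain decoding; the win is that the memory-bound weight reads are amortized over several tokens per step.&lt;/p&gt;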
&lt;p&gt;&lt;strong&gt;Speedups by platform:&lt;/strong&gt;&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Platform&lt;/th&gt;
 &lt;th&gt;Backend&lt;/th&gt;
 &lt;th&gt;Speedup&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Mobile GPU (Samsung S26 Ultra, iPhone 17 Pro)&lt;/td&gt;
 &lt;td&gt;GPU&lt;/td&gt;
 &lt;td&gt;up to 2.2x decode&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Mobile CPU&lt;/td&gt;
 &lt;td&gt;CPU&lt;/td&gt;
 &lt;td&gt;up to 1.5x decode&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Apple Silicon (M4 MacBook Pro)&lt;/td&gt;
 &lt;td&gt;CPU + SME&lt;/td&gt;
 &lt;td&gt;substantial (~2.2x at batch 4–8)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;NVIDIA RTX PRO 6000 (26B)&lt;/td&gt;
 &lt;td&gt;GPU&lt;/td&gt;
 &lt;td&gt;~50% latency reduction&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;NVIDIA RTX 4090 / Linux ARM&lt;/td&gt;
 &lt;td&gt;GPU&lt;/td&gt;
 &lt;td&gt;consistent acceleration&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Important caveat&lt;/strong&gt;: MTP is recommended across the board on GPU, and on CPU for the E4B model. &lt;strong&gt;For E2B on CPU, freeform generation may run slightly slower&lt;/strong&gt;, though rewrite, summarization, and coding tasks (which have long input prefixes) still come out ahead.&lt;/p&gt;
&lt;p&gt;Supported models start with &lt;a class="link" href="https://ai.google.dev/edge/litert-lm/models/gemma-4" target="_blank" rel="noopener"
 &gt;&lt;code&gt;Gemma-4-E2B&lt;/code&gt;&lt;/a&gt; (2.58 GB) and &lt;code&gt;Gemma-4-E4B&lt;/code&gt; (3.65 GB); 26B A4B and 31B are coming soon.&lt;/p&gt;
&lt;h2 id="2-native-windows-support"&gt;2. Native Windows Support
&lt;/h2&gt;&lt;p&gt;The &lt;a class="link" href="https://ai.google.dev/edge/litert-lm/cli" target="_blank" rel="noopener"
 &gt;LiteRT-LM CLI&lt;/a&gt; now runs natively on Windows with &lt;strong&gt;both CPU and GPU backends&lt;/strong&gt;. Previously Linux, macOS, and Android were the focus, so Windows developers had to go through WSL.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;litert-lm run --from-huggingface-repo&lt;span class="o"&gt;=&lt;/span&gt;litert-community/gemma-4-E2B-it-litert-lm
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The unstated intent is loud — &lt;strong&gt;bring workstation and laptop developers in directly.&lt;/strong&gt; The friction of needing an Android device just to try things is gone.&lt;/p&gt;
&lt;h2 id="3-the-litert-stack--tf-lites-successor"&gt;3. The LiteRT Stack — TF Lite&amp;rsquo;s Successor
&lt;/h2&gt;&lt;p&gt;Step back and the placement makes sense:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;TensorFlow Lite&lt;/strong&gt; (former name) → &lt;a class="link" href="https://ai.google.dev/edge/litert" target="_blank" rel="noopener"
 &gt;LiteRT&lt;/a&gt; (Light Runtime, 2024 rebrand)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;LiteRT-LM&lt;/strong&gt; = the LLM-specialized variant of LiteRT&lt;/li&gt;
&lt;li&gt;Model family: &lt;a class="link" href="https://ai.google.dev/gemma" target="_blank" rel="noopener"
 &gt;Gemma&lt;/a&gt; — Google&amp;rsquo;s open-weight LLMs&lt;/li&gt;
&lt;li&gt;Target: &lt;strong&gt;on-device inference&lt;/strong&gt; — mobile, edge, embedded, desktop&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Apache 2.0. CPU + GPU + (on Apple Silicon) SME backends. The &lt;a class="link" href="https://huggingface.co/litert-community" target="_blank" rel="noopener"
 &gt;&lt;code&gt;litert-community&lt;/code&gt;&lt;/a&gt; organization on Hugging Face plugs in directly.&lt;/p&gt;
&lt;h2 id="4-mtp-is-becoming-the-standard"&gt;4. MTP Is Becoming the Standard
&lt;/h2&gt;&lt;p&gt;The interesting part: MTP isn&amp;rsquo;t a one-company, one-model trick.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A few days ago, the &lt;a class="link" href="#" &gt;DGX Spark + Qwen3.5 post&lt;/a&gt; reported &lt;strong&gt;MTP-2&lt;/strong&gt; giving +36% decode on workstation-class GPUs.&lt;/li&gt;
&lt;li&gt;Gemma 4 + LiteRT-LM gets &lt;strong&gt;2.2x on mobile GPUs&lt;/strong&gt; from the same idea.&lt;/li&gt;
&lt;li&gt;Both report &lt;strong&gt;zero quality degradation&lt;/strong&gt; — because the target model still does final verification.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;MTP is emerging as &lt;strong&gt;the de facto standard for inference-time acceleration.&lt;/strong&gt; Much as attention became the default building block, expect MTP-style speculation to land, in some form, in nearly every production decoder over the next year.&lt;/p&gt;
&lt;h2 id="5-cloud-and-edge-advancing-in-parallel"&gt;5. Cloud and Edge Advancing in Parallel
&lt;/h2&gt;&lt;p&gt;On the same day that OpenAI shipped &lt;a class="link" href="https://openai.com/index/advancing-voice-intelligence-with-new-models-in-the-api" target="_blank" rel="noopener"
 &gt;three Realtime voice models&lt;/a&gt; and &lt;a class="link" href="https://openai.com/index/mrc-supercomputer-networking" target="_blank" rel="noopener"
 &gt;MRC supercomputer networking&lt;/a&gt;, Google shipped LiteRT-LM v0.11.0. One side: a single company anchoring a five-vendor consortium to &lt;strong&gt;set supercomputer networking standards.&lt;/strong&gt; The other: making LLMs &lt;strong&gt;production-ready inside something that fits in your hand.&lt;/strong&gt; The point is that both are production-ready: LLMs are no longer &amp;ldquo;cloud or edge&amp;rdquo; but &lt;strong&gt;both, improving simultaneously.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id="6-unity-ports"&gt;6. Unity ports
&lt;/h2&gt;&lt;p&gt;Days after the runtime cut, a community Unity integration sample dropped: &lt;a class="link" href="https://github.com/Leuconoe/LiteRT-LM-Unity" target="_blank" rel="noopener"
 &gt;Leuconoe/LiteRT-LM-Unity&lt;/a&gt;. Built on Unity &lt;code&gt;6000.4.6f1&lt;/code&gt;, it wires LiteRT-LM into the engine two ways: a &lt;strong&gt;Windows Editor&lt;/strong&gt; path that drives &lt;code&gt;litert_lm_main.windows_x86_64.exe&lt;/code&gt; through a stable PowerShell wrapper, and an &lt;strong&gt;Android&lt;/strong&gt; path through a custom &lt;code&gt;litertlm-unity-bridge.aar&lt;/code&gt; built with Bazel for OpenCL GPU inference. Critically, the patch is pinned to LiteRT-LM commit &lt;code&gt;c8718952&lt;/code&gt; — the &lt;a class="link" href="https://github.com/google-ai-edge/LiteRT-LM/releases/tag/v0.11.0" target="_blank" rel="noopener"
 &gt;v0.11.0 tag&lt;/a&gt; — so the MTP acceleration that just shipped flows straight into the Unity build; the Gemma 4 rows in the device table are explicitly built with speculative decoding turned on.&lt;/p&gt;
&lt;p&gt;On a Qualcomm SM8250 device with 7.52 GiB RAM, &lt;code&gt;gemma-4-E2B-it.litertlm&lt;/code&gt; passes on GPU at 396 tok/s prefill and 9.98 tok/s decode, with first and follow-up chat turns at 1.561 s and 0.582 s; &lt;code&gt;Qwen2.5-0.5B-Instruct-q8.litertlm&lt;/code&gt; hits 26.55 tok/s decode on CPU. The C# layer uses IMGUI with IME-aware input, so Korean and other non-ASCII prompts work out of the box.&lt;/p&gt;
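&lt;p&gt;For a sense of what the Windows Editor path involves, below is a minimal, simplified sketch of launching &lt;code&gt;litert_lm_main.windows_x86_64.exe&lt;/code&gt; from C# and streaming its stdout back into the Editor console. The actual wrapper drives the exe through a PowerShell layer, and the command-line flags shown here are assumptions for illustration, not the repo&amp;rsquo;s documented interface.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-csharp" data-lang="csharp"&gt;using System.Diagnostics;
using UnityEngine;
using Debug = UnityEngine.Debug; // avoid clashing with System.Diagnostics.Debug

// Simplified direct-launch sketch (the real wrapper goes through PowerShell).
// Executable location and flags are illustrative assumptions.
public class LiteRtLmProcess
{
    Process proc;

    public void StartGeneration(string exePath, string modelPath, string prompt)
    {
        proc = new Process();
        proc.StartInfo.FileName = exePath; // e.g. litert_lm_main.windows_x86_64.exe
        proc.StartInfo.Arguments =
            $"--model_path=\"{modelPath}\" --input_prompt=\"{prompt}\""; // assumed flags
        proc.StartInfo.UseShellExecute = false;
        proc.StartInfo.RedirectStandardOutput = true;
        proc.StartInfo.CreateNoWindow = true;

        // Surface each stdout line in Unity as it streams out of the runtime.
        proc.OutputDataReceived += (_, e) =&amp;gt;
        {
            if (!string.IsNullOrEmpty(e.Data)) Debug.Log("[LiteRT-LM] " + e.Data);
        };

        proc.Start();
        proc.BeginOutputReadLine();
    }

    public void StopGeneration()
    {
        if (proc != null &amp;amp;&amp;amp; !proc.HasExited) proc.Kill();
    }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;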
&lt;p&gt;Why does an on-device LLM running inside a game engine matter? Routing NPC dialogue through cloud inference — &lt;a class="link" href="https://docs.mistral.ai/guides/use_cases/npc/" target="_blank" rel="noopener"
 &gt;Mistral&amp;rsquo;s NPC dialogue guide&lt;/a&gt;, &lt;a class="link" href="https://developer.nvidia.com/ace" target="_blank" rel="noopener"
 &gt;NVIDIA ACE&lt;/a&gt; — means latency, per-call cost, and no offline play. Streaming tokens directly on-device flips that: function calls can fire in-engine commands (display, volume, date queries) without a round trip, which is exactly what the 20-prompt Unity command benchmark in &lt;a class="link" href="https://github.com/Leuconoe/LiteRT-LM-Unity" target="_blank" rel="noopener"
 &gt;Leuconoe/LiteRT-LM-Unity&lt;/a&gt; measures. A 2x mobile-GPU decode speedup is not an abstract number — it lands as &lt;strong&gt;an NPC that answers half a beat sooner&lt;/strong&gt;.&lt;/p&gt;
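&lt;p&gt;As a flavor of what such a command benchmark exercises, here is an illustrative dispatcher that maps a parsed function call onto engine-side actions. The command names and the (name, argument) shape are assumptions for the example, not the schema the repo actually uses.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-csharp" data-lang="csharp"&gt;using System;
using UnityEngine;

// Illustrative in-engine dispatch for model-issued function calls.
// Command names and argument shape are assumptions, not the repo's schema.
public static class NpcCommandDispatcher
{
    public static void Dispatch(string name, string arg)
    {
        switch (name)
        {
            case "set_volume":
                // Apply a model-requested volume change locally, no network round trip.
                AudioListener.volume = Mathf.Clamp01(float.Parse(arg));
                break;
            case "show_message":
                // Push model text to an in-game display (here just the console).
                Debug.Log("[NPC] " + arg);
                break;
            case "get_date":
                // Answer a date query from local device state.
                Debug.Log(DateTime.Now.ToString("yyyy-MM-dd"));
                break;
            default:
                Debug.LogWarning("Unknown command: " + name);
                break;
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;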
&lt;p&gt;For context, prior Unity-meets-LLM efforts mostly wrapped existing local runtimes: &lt;a class="link" href="https://github.com/ggml-org/llama.cpp" target="_blank" rel="noopener"
 &gt;llama.cpp&lt;/a&gt; via bindings like &lt;a class="link" href="https://github.com/eublefar/llama.cpp-Unity" target="_blank" rel="noopener"
 &gt;llama.cpp-Unity&lt;/a&gt; or &lt;a class="link" href="https://github.com/undreamai/LLMUnity" target="_blank" rel="noopener"
 &gt;LLMUnity&lt;/a&gt;, or &lt;a class="link" href="https://github.com/mlc-ai/mlc-llm" target="_blank" rel="noopener"
 &gt;MLC LLM&lt;/a&gt; through its TVM backend. Those approaches fit a general-purpose LLM runtime onto the engine — which means vendor-side wins like mobile GPU acceleration, MTP, and Gemma 4 land with a lag. &lt;a class="link" href="https://github.com/Leuconoe/LiteRT-LM-Unity" target="_blank" rel="noopener"
 &gt;Leuconoe/LiteRT-LM-Unity&lt;/a&gt; flips the direction: &lt;strong&gt;pull Google&amp;rsquo;s first-party runtime straight into Unity&lt;/strong&gt;. License is unspecified and stargazer count is still zero — early days — but the patch is exactly aligned with v0.11.0, which means it&amp;rsquo;s likely to track future LiteRT-LM releases tightly.&lt;/p&gt;
&lt;h2 id="insights"&gt;Insights
&lt;/h2&gt;&lt;p&gt;LiteRT-LM v0.11.0 looks like a small minor release but carries two signals together. First, &lt;strong&gt;MTP reaching mobile GPUs&lt;/strong&gt; means speculative-decoding-family techniques are no longer a data-center luxury — they now run within the battery and thermal budget of a phone. Second, &lt;strong&gt;native Windows support&lt;/strong&gt; is not just an OS port; it repositions LiteRT-LM from a mobile demo library to &lt;strong&gt;a developer&amp;rsquo;s first screen.&lt;/strong&gt; Qwen3.5&amp;rsquo;s MTP-2 and Gemma 4&amp;rsquo;s MTP landing in the same week is not coincidence — it signals that &lt;strong&gt;decode-speed wins are about to matter as much as model-size wins&lt;/strong&gt; through late 2026. While the cloud side moves with GPT-Realtime-2 + MRC, the edge side keeps pace with Gemma 4 + LiteRT-LM, and this is the first quarter where both fronts go production-ready at the same time. And the Unity wrapper appearing pinned to v0.11.0 this week is another signal — &lt;strong&gt;secondary application surfaces&lt;/strong&gt; like game engines, XR, and robotics are starting to follow runtime releases within days, not quarters. For developers wanting to try it immediately, the entry path is one line on Windows: &lt;code&gt;litert-lm run --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id="references"&gt;References
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Release&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/google-ai-edge/LiteRT-LM/releases/tag/v0.11.0" target="_blank" rel="noopener"
 &gt;google-ai-edge/LiteRT-LM v0.11.0 release page&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/google-ai-edge/LiteRT-LM" target="_blank" rel="noopener"
 &gt;google-ai-edge/LiteRT-LM repository&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Source repos&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/Leuconoe/LiteRT-LM-Unity" target="_blank" rel="noopener"
 &gt;Leuconoe/LiteRT-LM-Unity — community Unity integration (pinned to v0.11.0)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/undreamai/LLMUnity" target="_blank" rel="noopener"
 &gt;undreamai/LLMUnity — llama.cpp-based Unity bindings&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/mlc-ai/mlc-llm" target="_blank" rel="noopener"
 &gt;mlc-ai/mlc-llm — TVM-backed multi-backend LLM runtime&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/ggml-org/llama.cpp" target="_blank" rel="noopener"
 &gt;ggml-org/llama.cpp — general-purpose local LLM runtime for comparison&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Model and runtime docs&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://ai.google.dev/edge/litert" target="_blank" rel="noopener"
 &gt;LiteRT homepage (ai.google.dev/edge/litert)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://ai.google.dev/edge/litert-lm" target="_blank" rel="noopener"
 &gt;LiteRT-LM official docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://ai.google.dev/edge/litert-lm/models/gemma-4" target="_blank" rel="noopener"
 &gt;Gemma 4 with LiteRT-LM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://ai.google.dev/edge/litert-lm/cli" target="_blank" rel="noopener"
 &gt;LiteRT-LM CLI docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://ai.google.dev/gemma" target="_blank" rel="noopener"
 &gt;Gemma model family&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.tensorflow.org/lite" target="_blank" rel="noopener"
 &gt;TensorFlow Lite (LiteRT predecessor)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://huggingface.co/litert-community" target="_blank" rel="noopener"
 &gt;Hugging Face — litert-community&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;MTP technique references&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/" target="_blank" rel="noopener"
 &gt;Google: Multi-token Prediction for Gemma 4&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://arxiv.org/abs/2211.17192" target="_blank" rel="noopener"
 &gt;Speculative decoding background paper (arXiv)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Workstation comparison from the same family of techniques: DGX Spark + Qwen3.5 with MTP-2 hitting +36% decode (previous post)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Game engine x LLM background&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://docs.mistral.ai/guides/use_cases/npc/" target="_blank" rel="noopener"
 &gt;Mistral — NPC dialogue guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://developer.nvidia.com/ace" target="_blank" rel="noopener"
 &gt;NVIDIA ACE — cloud-side NPC AI&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item></channel></rss>