<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Mtp on ICE-ICE-BEAR-BLOG</title><link>https://ice-ice-bear.github.io/tags/mtp/</link><description>Recent content in Mtp on ICE-ICE-BEAR-BLOG</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Thu, 07 May 2026 00:00:00 +0900</lastBuildDate><atom:link href="https://ice-ice-bear.github.io/tags/mtp/index.xml" rel="self" type="application/rss+xml"/><item><title>LiteRT-LM v0.11.0 — Gemma 4 MTP Doubles Mobile GPU Decode, Windows Goes Native</title><link>https://ice-ice-bear.github.io/posts/2026-05-07-litert-lm-v0-11-0-gemma4-mtp/</link><pubDate>Thu, 07 May 2026 00:00:00 +0900</pubDate><guid>https://ice-ice-bear.github.io/posts/2026-05-07-litert-lm-v0-11-0-gemma4-mtp/</guid><description>&lt;img src="https://ice-ice-bear.github.io/" alt="Featured image of post LiteRT-LM v0.11.0 — Gemma 4 MTP Doubles Mobile GPU Decode, Windows Goes Native" /&gt;&lt;h2 id="overview"&gt;Overview
&lt;/h2&gt;&lt;p&gt;Google&amp;rsquo;s on-device LLM runtime &lt;a class="link" href="https://ai.google.dev/edge/litert-lm" target="_blank" rel="noopener"
 &gt;LiteRT-LM&lt;/a&gt; shipped &lt;a class="link" href="https://github.com/google-ai-edge/LiteRT-LM/releases/tag/v0.11.0" target="_blank" rel="noopener"
 &gt;v0.11.0&lt;/a&gt;. Two headline items: &lt;strong&gt;Single Position Multi-token Prediction (MTP)&lt;/strong&gt; for Gemma 4 — more than 2x faster decode on mobile GPUs — and &lt;strong&gt;native Windows support&lt;/strong&gt; (CPU and GPU). Workstation-class results from the same week (DGX Spark + Qwen3.5 with MTP-2 hitting +36%) suggest MTP is hardening into &lt;strong&gt;the common decode-acceleration technique&lt;/strong&gt; spanning mobile up through workstation.&lt;/p&gt;
&lt;pre class="mermaid" style="visibility:hidden"&gt;graph TD
 Input["Input position t"] --&gt; Target["Gemma 4 target model"]
 Input --&gt; Drafter["MTP drafter &amp;lt;br/&amp;gt; (lightweight)"]
 Drafter --&gt; Draft["Draft tokens t+1, t+2, ..., t+k"]
 Draft --&gt; Verify["Target verifies in one forward pass"]
 Target --&gt; Verify
 Verify --&gt; Accept["Accept matching prefix &amp;lt;br/&amp;gt; + 1 extra token"]
 Accept --&gt; Output["Multiple tokens emitted in a single step"]&lt;/pre&gt;&lt;h2 id="1-gemma-4-multi-token-prediction-support"&gt;1. Gemma 4 Multi-token Prediction Support
&lt;/h2&gt;&lt;p&gt;The opening line of the &lt;a class="link" href="https://github.com/google-ai-edge/LiteRT-LM/releases/tag/v0.11.0" target="_blank" rel="noopener"
 &gt;release notes&lt;/a&gt;: &lt;strong&gt;&amp;quot;&amp;gt;2x faster decode on mobile GPUs with zero quality degradation.&amp;quot;&lt;/strong&gt; The mechanism is laid out in the &lt;a class="link" href="https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/" target="_blank" rel="noopener"
 &gt;Google blog post on MTP for Gemma 4&lt;/a&gt; and the &lt;a class="link" href="https://ai.google.dev/edge/litert-lm/models/gemma-4" target="_blank" rel="noopener"
 &gt;official docs&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The trick is a flavor of &lt;strong&gt;speculative decoding&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;At a single position, a lightweight &lt;strong&gt;drafter&lt;/strong&gt; predicts multiple future tokens at once&lt;/li&gt;
&lt;li&gt;The full &lt;strong&gt;target&lt;/strong&gt; model (e.g., Gemma 4 26B / 31B) verifies the entire draft sequence in one forward pass&lt;/li&gt;
&lt;li&gt;If the target agrees, it accepts the whole prefix and emits one additional token of its own&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Standard LLM inference is &lt;strong&gt;memory-bandwidth bound&lt;/strong&gt; — most cycles are spent shuffling parameters around. MTP bends that bottleneck by extracting more tokens per memory pass.&lt;/p&gt;
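&lt;p&gt;To make the accept rule concrete, here is a minimal, framework-free sketch of the verify-and-accept step from the list above. It illustrates the general speculative-decoding contract under a greedy-decoding assumption; the function name and inputs are invented for the example and are not LiteRT-LM&amp;rsquo;s actual API.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# Minimal sketch of speculative verify-and-accept (illustration only,
# not LiteRT-LM's implementation). Greedy decoding assumed: a draft token
# is accepted when the target model predicts the same token at that position.
def verify_and_accept(target_argmax, draft_tokens):
    """target_argmax[i] is the target model's greedy prediction for the position
    right after draft_tokens[:i]; all of them come from one forward pass."""
    accepted = []
    for i, tok in enumerate(draft_tokens):
        if target_argmax[i] == tok:
            accepted.append(tok)                # draft confirmed by the target
        else:
            accepted.append(target_argmax[i])   # correction from the target
            return accepted                     # stop at the first mismatch
    # Whole draft matched: the target's next prediction is a free extra token.
    accepted.append(target_argmax[len(draft_tokens)])
    return accepted

# Drafter proposed 3 tokens; the target agrees with the first two.
print(verify_and_accept([11, 22, 99, 44], [11, 22, 33]))  # [11, 22, 99]&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;One step can therefore emit anywhere from one token (first draft rejected) up to the full draft plus one bonus token, which is exactly the &amp;ldquo;more tokens per memory pass&amp;rdquo; lever described above.&lt;/p&gt;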
&lt;p&gt;&lt;strong&gt;Speedups by platform:&lt;/strong&gt;&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Platform&lt;/th&gt;
 &lt;th&gt;Backend&lt;/th&gt;
 &lt;th&gt;Speedup&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Mobile GPU (Samsung S26 Ultra, iPhone 17 Pro)&lt;/td&gt;
 &lt;td&gt;GPU&lt;/td&gt;
 &lt;td&gt;up to 2.2x decode&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Mobile CPU&lt;/td&gt;
 &lt;td&gt;CPU&lt;/td&gt;
 &lt;td&gt;up to 1.5x decode&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Apple Silicon (M4 MacBook Pro)&lt;/td&gt;
 &lt;td&gt;CPU + SME&lt;/td&gt;
 &lt;td&gt;substantial (~2.2x at batch 4–8)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;NVIDIA RTX PRO 6000 (26B)&lt;/td&gt;
 &lt;td&gt;GPU&lt;/td&gt;
 &lt;td&gt;~50% latency reduction&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;NVIDIA RTX 4090 / Linux ARM&lt;/td&gt;
 &lt;td&gt;GPU&lt;/td&gt;
 &lt;td&gt;consistent acceleration&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Important caveat:&lt;/strong&gt; MTP is recommended universally on the GPU backend; on CPU it is recommended for the E4B model. &lt;strong&gt;For E2B on CPU, freeform generation may run slightly slower&lt;/strong&gt;, though rewrite, summarization, and coding tasks (which have long input prefixes) still come out ahead.&lt;/p&gt;
&lt;p&gt;Supported models start with &lt;a class="link" href="https://ai.google.dev/edge/litert-lm/models/gemma-4" target="_blank" rel="noopener"
 &gt;&lt;code&gt;Gemma-4-E2B&lt;/code&gt;&lt;/a&gt; (2.58 GB) and &lt;code&gt;Gemma-4-E4B&lt;/code&gt; (3.65 GB); 26B A4B and 31B are coming soon.&lt;/p&gt;
&lt;h2 id="2-native-windows-support"&gt;2. Native Windows Support
&lt;/h2&gt;&lt;p&gt;The &lt;a class="link" href="https://ai.google.dev/edge/litert-lm/cli" target="_blank" rel="noopener"
 &gt;LiteRT-LM CLI&lt;/a&gt; now runs natively on Windows with &lt;strong&gt;both CPU and GPU backends&lt;/strong&gt;. Previously Linux, macOS, and Android were the focus, so Windows developers had to go through WSL.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;litert-lm run --from-huggingface-repo&lt;span class="o"&gt;=&lt;/span&gt;litert-community/gemma-4-E2B-it-litert-lm
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The unstated intent is loud — &lt;strong&gt;bring workstation and laptop developers in directly.&lt;/strong&gt; The friction of needing an Android device just to try things is gone.&lt;/p&gt;
&lt;h2 id="3-the-litert-stack--tf-lites-successor"&gt;3. The LiteRT Stack — TF Lite&amp;rsquo;s Successor
&lt;/h2&gt;&lt;p&gt;Step back and the placement makes sense:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;TensorFlow Lite&lt;/strong&gt; (former name) → &lt;a class="link" href="https://ai.google.dev/edge/litert" target="_blank" rel="noopener"
 &gt;LiteRT&lt;/a&gt; (Light Runtime, 2024 rebrand)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;LiteRT-LM&lt;/strong&gt; = the LLM-specialized variant of LiteRT&lt;/li&gt;
&lt;li&gt;Model family: &lt;a class="link" href="https://ai.google.dev/gemma" target="_blank" rel="noopener"
 &gt;Gemma&lt;/a&gt; — Google&amp;rsquo;s open-weight LLMs&lt;/li&gt;
&lt;li&gt;Target: &lt;strong&gt;on-device inference&lt;/strong&gt; — mobile, edge, embedded, desktop&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Apache 2.0. CPU + GPU + (on Apple Silicon) SME backends. The &lt;a class="link" href="https://huggingface.co/litert-community" target="_blank" rel="noopener"
 &gt;&lt;code&gt;litert-community&lt;/code&gt;&lt;/a&gt; repo on Hugging Face plugs in directly.&lt;/p&gt;
&lt;h2 id="4-mtp-is-becoming-the-standard"&gt;4. MTP Is Becoming the Standard
&lt;/h2&gt;&lt;p&gt;The interesting part: MTP isn&amp;rsquo;t a one-company, one-model trick.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A few days ago, the &lt;a class="link" href="https://ice-ice-bear.github.io/posts/2026-05-07-dgx-spark-qwen35-inference-tuning/"&gt;albond DGX Spark + Qwen3.5 post&lt;/a&gt; reported &lt;strong&gt;MTP-2&lt;/strong&gt; giving +36% decode on workstation-class GPUs.&lt;/li&gt;
&lt;li&gt;Gemma 4 + LiteRT-LM gets &lt;strong&gt;2.2x on mobile GPUs&lt;/strong&gt; from the same idea.&lt;/li&gt;
&lt;li&gt;Both report &lt;strong&gt;zero quality degradation&lt;/strong&gt; — because the target model still does final verification.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;MTP is emerging as &lt;strong&gt;the de facto standard for inference-time acceleration.&lt;/strong&gt; Just as attention became a default building block, expect MTP-style speculation to land, in some form, in nearly every production decoder over the next year.&lt;/p&gt;
&lt;h2 id="5-cloud-and-edge-advancing-in-parallel"&gt;5. Cloud and Edge Advancing in Parallel
&lt;/h2&gt;&lt;p&gt;Same day, OpenAI shipped &lt;a class="link" href="https://openai.com/index/advancing-voice-intelligence-with-new-models-in-the-api" target="_blank" rel="noopener"
 &gt;three Realtime voice models&lt;/a&gt; and &lt;a class="link" href="https://openai.com/index/mrc-supercomputer-networking" target="_blank" rel="noopener"
 &gt;MRC supercomputer networking&lt;/a&gt;; same day, Google shipped LiteRT-LM v0.11.0. One side: a single company anchoring a five-vendor consortium to &lt;strong&gt;set supercomputer networking standards.&lt;/strong&gt; The other: making LLMs &lt;strong&gt;production-ready inside something that fits in your hand.&lt;/strong&gt; What&amp;rsquo;s load-bearing is that both are production-ready — LLMs are no longer &amp;ldquo;cloud or edge&amp;rdquo; but &lt;strong&gt;both improving simultaneously.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id="insights"&gt;Insights
&lt;/h2&gt;&lt;p&gt;LiteRT-LM v0.11.0 looks like a small minor release but carries two signals together. First, &lt;strong&gt;MTP reaching mobile GPUs&lt;/strong&gt; means speculative-decoding-family techniques are no longer a data-center luxury — they now run within the battery and thermal budget of a phone. Second, &lt;strong&gt;native Windows support&lt;/strong&gt; is not just an OS port; it repositions LiteRT-LM from a mobile demo library to &lt;strong&gt;a developer&amp;rsquo;s first screen.&lt;/strong&gt; Qwen3.5&amp;rsquo;s MTP-2 and Gemma 4&amp;rsquo;s MTP landing in the same week is not coincidence — it signals that &lt;strong&gt;decode-speed wins are about to matter as much as model-size wins&lt;/strong&gt; through late 2026. While the cloud side moves with GPT-Realtime-2 + MRC, the edge side keeps pace with Gemma 4 + LiteRT-LM, and this is the first quarter where both fronts go production-ready at the same time. For developers wanting to try it immediately, the entry path is one line on Windows: &lt;code&gt;litert-lm run --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id="references"&gt;References
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Release&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/google-ai-edge/LiteRT-LM/releases/tag/v0.11.0" target="_blank" rel="noopener"
 &gt;google-ai-edge/LiteRT-LM v0.11.0 release page&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/google-ai-edge/LiteRT-LM" target="_blank" rel="noopener"
 &gt;google-ai-edge/LiteRT-LM repository&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Model and runtime docs&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://ai.google.dev/edge/litert" target="_blank" rel="noopener"
 &gt;LiteRT homepage (ai.google.dev/edge/litert)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://ai.google.dev/edge/litert-lm" target="_blank" rel="noopener"
 &gt;LiteRT-LM official docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://ai.google.dev/edge/litert-lm/models/gemma-4" target="_blank" rel="noopener"
 &gt;Gemma 4 with LiteRT-LM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://ai.google.dev/edge/litert-lm/cli" target="_blank" rel="noopener"
 &gt;LiteRT-LM CLI docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://ai.google.dev/gemma" target="_blank" rel="noopener"
 &gt;Gemma model family&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.tensorflow.org/lite" target="_blank" rel="noopener"
 &gt;TensorFlow Lite (LiteRT predecessor)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://huggingface.co/litert-community" target="_blank" rel="noopener"
 &gt;Hugging Face — litert-community&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;MTP technique references&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/" target="_blank" rel="noopener"
 &gt;Google: Multi-token Prediction for Gemma 4&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://arxiv.org/abs/2211.17192" target="_blank" rel="noopener"
 &gt;Speculative decoding background paper (arXiv)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Workstation comparison from the same family of techniques: DGX Spark + Qwen3.5 with MTP-2 hitting +36% decode (previous post)&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Pushing Qwen3.5-122B from 28.3 to 51 tok per second on a single DGX Spark</title><link>https://ice-ice-bear.github.io/posts/2026-05-07-dgx-spark-qwen35-inference-tuning/</link><pubDate>Thu, 07 May 2026 00:00:00 +0900</pubDate><guid>https://ice-ice-bear.github.io/posts/2026-05-07-dgx-spark-qwen35-inference-tuning/</guid><description>&lt;img src="https://ice-ice-bear.github.io/" alt="Featured image of post Pushing Qwen3.5-122B from 28.3 to 51 tok per second on a single DGX Spark" /&gt;&lt;h2 id="overview"&gt;Overview
&lt;/h2&gt;&lt;p&gt;&lt;a class="link" href="https://github.com/albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4" target="_blank" rel="noopener"
 &gt;albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4&lt;/a&gt; is a recipe that pushes &lt;a class="link" href="https://huggingface.co/Qwen/Qwen3.5-122B-A10B" target="_blank" rel="noopener"
 &gt;Qwen3.5-122B-A10B&lt;/a&gt; from 28.3 to 51 tok/s on a single &lt;a class="link" href="https://www.nvidia.com/en-us/products/workstations/dgx-spark/" target="_blank" rel="noopener"
 &gt;NVIDIA DGX Spark&lt;/a&gt;, an 80 percent gain. It stacks five orthogonal techniques on top of vLLM 0.19: AutoRound INT4 quantization, an FP8 dense-layer hybrid, MTP-2 speculative decoding, an INT8 LM head, and optional TurboQuant KV cache compression — all while preserving 256K context. Apache 2.0, 171 stars on GitHub. The interesting question it answers in the affirmative: can a single workstation actually serve a 100B-class MoE model at production speed?&lt;/p&gt;
&lt;pre class="mermaid" style="visibility:hidden"&gt;flowchart LR
 Base["Baseline &amp;lt;br/&amp;gt; 28.3 tok/s"] --&gt; S1["+ Hybrid INT4+FP8 &amp;lt;br/&amp;gt; 30.8 tok/s"]
 S1 --&gt; S2["+ MTP-2 Speculative &amp;lt;br/&amp;gt; 38.4 tok/s"]
 S2 --&gt; V2["v2: + INT8 LM Head &amp;lt;br/&amp;gt; 51 tok/s"]
 V2 --&gt; TQ["v2-tq: + TurboQuant KV &amp;lt;br/&amp;gt; 39 tok/s &amp;lt;br/&amp;gt; 1.4M KV"]&lt;/pre&gt;&lt;h2 id="results"&gt;Results
&lt;/h2&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Build&lt;/th&gt;
 &lt;th&gt;tok/s&lt;/th&gt;
 &lt;th&gt;Gain vs baseline&lt;/th&gt;
 &lt;th&gt;Image&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Baseline (vLLM 0.19 + AutoRound INT4 + FlashInfer)&lt;/td&gt;
 &lt;td&gt;28.3&lt;/td&gt;
 &lt;td&gt;—&lt;/td&gt;
 &lt;td&gt;—&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;+ Hybrid INT4+FP8 dense layers&lt;/td&gt;
 &lt;td&gt;30.8&lt;/td&gt;
 &lt;td&gt;+8.8%&lt;/td&gt;
 &lt;td&gt;step 1&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;+ MTP-2 speculative decoding&lt;/td&gt;
 &lt;td&gt;38.4&lt;/td&gt;
 &lt;td&gt;+35.7%&lt;/td&gt;
 &lt;td&gt;step 2&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;v2&lt;/strong&gt; (+ INT8 LM head v2)&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;51&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;+80%&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;Dockerfile.v2&lt;/code&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;v2-tq (+ TurboQuant KV cache)&lt;/td&gt;
 &lt;td&gt;39&lt;/td&gt;
 &lt;td&gt;+38%&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;Dockerfile.v2-tq&lt;/code&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The same stack pushes Qwen3.5-35B-A3B (the smaller sibling) to 112 tok/s.&lt;/p&gt;
&lt;h3 id="256k-context-tradeoff"&gt;256K context tradeoff
&lt;/h3&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Build&lt;/th&gt;
 &lt;th&gt;KV cache&lt;/th&gt;
 &lt;th&gt;Concurrent 256K-context users&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;v2 (standard)&lt;/td&gt;
 &lt;td&gt;355K tokens&lt;/td&gt;
 &lt;td&gt;1&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;v2-tq (TurboQuant)&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;1.4M tokens&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;5&lt;/strong&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id="the-model-in-one-paragraph"&gt;The model in one paragraph
&lt;/h2&gt;&lt;p&gt;&lt;a class="link" href="https://huggingface.co/Qwen/Qwen3.5-122B-A10B" target="_blank" rel="noopener"
 &gt;Qwen3.5-122B-A10B&lt;/a&gt; is a hybrid MoE that activates 10B of its 122B parameters per token: 256 experts with 8 routed plus 1 shared, 48 layers alternating Gated DeltaNet and Gated Attention at a 12:1 ratio, native 262K context (extensible to 1M with YaRN), Apache 2.0. The starting point for this recipe is &lt;a class="link" href="https://huggingface.co/Intel/Qwen3.5-122B-A10B-int4-AutoRound" target="_blank" rel="noopener"
 &gt;Intel/Qwen3.5-122B-A10B-int4-AutoRound&lt;/a&gt;, produced with &lt;a class="link" href="https://github.com/intel/auto-round" target="_blank" rel="noopener"
 &gt;Intel AutoRound&lt;/a&gt; at group size 128 with &lt;code&gt;shared_expert&lt;/code&gt; left out of quantization.&lt;/p&gt;
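&lt;p&gt;For intuition on what &amp;ldquo;8 routed plus 1 shared&amp;rdquo; means per token, here is a minimal top-k routing sketch. The gating details (softmax over router logits, renormalized top-k weights, an always-on shared expert) are generic MoE conventions assumed for illustration, not Qwen3.5&amp;rsquo;s exact router.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# Toy top-k MoE routing: for each token, activate 8 routed experts out of
# 256 plus one always-on shared expert. Generic convention for illustration;
# not Qwen3.5's actual router.
import math, random

NUM_EXPERTS, TOP_K = 256, 8

def route_one_token(router_logits):
    """Return (expert_id, weight) pairs chosen for a single token."""
    exps = [math.exp(x) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]                          # softmax over all experts
    top = sorted(range(NUM_EXPERTS), key=lambda i: probs[i], reverse=True)[:TOP_K]
    norm = sum(probs[i] for i in top)
    routed = [(i, probs[i] / norm) for i in top]               # renormalized top-8 weights
    return routed + [("shared", 1.0)]                          # shared expert always fires

print(route_one_token([random.gauss(0, 1) for _ in range(NUM_EXPERTS)]))&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Only those nine experts (plus the dense layers) touch memory for a given token, which is how a 122B checkpoint behaves like a roughly 10B model at decode time.&lt;/p&gt;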
&lt;h2 id="the-five-techniques"&gt;The five techniques
&lt;/h2&gt;&lt;h3 id="1-hybrid-int4--fp8-dense-layers-9"&gt;1. Hybrid INT4 + FP8 dense layers (+9%)
&lt;/h3&gt;&lt;p&gt;Replace the BF16 shared-expert weights inside the AutoRound INT4 model with FP8 weights from the official Qwen checkpoint. Net effect: experts stay INT4, dense layers run FP8. Memory and compute drop without touching accuracy.&lt;/p&gt;
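&lt;p&gt;A rough sketch of what the swap amounts to: walk the FP8 checkpoint and overwrite the matching dense/shared-expert entries in the INT4 model&amp;rsquo;s state dict. The key pattern below is a placeholder based on the &lt;code&gt;shared_expert&lt;/code&gt; naming mentioned above; the repo&amp;rsquo;s actual conversion script should be consulted for the real tensor names and formats.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# Illustrative sketch of the hybrid INT4 + FP8 merge (hypothetical key
# patterns; real checkpoints also differ in quantization metadata).
def build_hybrid_state_dict(int4_sd, fp8_sd, dense_patterns=("shared_expert",)):
    """Start from the AutoRound INT4 checkpoint and overwrite dense /
    shared-expert weights with the FP8 tensors from the official checkpoint."""
    hybrid = dict(int4_sd)                       # routed experts stay INT4
    for name, tensor in fp8_sd.items():
        if any(p in name for p in dense_patterns):
            hybrid[name] = tensor                # dense layers now run FP8
    return hybrid&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;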
&lt;h3 id="2-mtp-2-speculative-decoding-36"&gt;2. MTP-2 speculative decoding (+36%)
&lt;/h3&gt;&lt;p&gt;&lt;a class="link" href="https://arxiv.org/abs/2404.19737" target="_blank" rel="noopener"
 &gt;Multi-Token Prediction&lt;/a&gt; generates 2 tokens per step with roughly 80 percent acceptance, lifting throughput from 30.8 to 38.4 tok/s. Notably there is no separate draft model: the main model itself runs multi-head prediction, which keeps the deployment simple.&lt;/p&gt;
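&lt;p&gt;A back-of-envelope check on why roughly 80 percent acceptance does not translate into a clean 2x: under a simple independence assumption, 2 drafted tokens plus a bonus token give about 2.44 expected tokens per verification step, and the per-step cost of the MTP heads and the wider verify pass eats part of that. The model below is an assumption for illustration, not the repo&amp;rsquo;s measurement methodology.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# Expected tokens emitted per verification step with k draft tokens and
# per-token acceptance probability p (independence assumed; sketch only).
def expected_tokens_per_step(k, p):
    prefix = sum(p ** i for i in range(1, k + 1))   # expected accepted-draft prefix length
    return 1.0 + prefix                              # +1: correction or bonus token

ideal = expected_tokens_per_step(k=2, p=0.8)         # about 2.44 tokens per step
observed = 38.4 / 30.8                               # step gain from the table, about 1.25x
print(ideal, observed)   # the gap is the overhead of the MTP heads and the wider verify pass&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;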
&lt;h3 id="3-int8-lm-head-v2-triton-kernel"&gt;3. INT8 LM head v2 (Triton kernel)
&lt;/h3&gt;&lt;p&gt;Quantizes the final vocabulary projection to INT8 via a custom Triton kernel. This is the biggest jump in the v2 build (38.4 to 51 tok/s). LM heads are usually exempt from quantization, but on models with very large vocabularies the cost is high enough that revisiting the assumption pays off.&lt;/p&gt;
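&lt;p&gt;For intuition on the numerics, here is a per-channel symmetric INT8 quantization of an LM-head weight matrix in plain PyTorch. It only shows the quantize/dequantize math on toy sizes; the repo&amp;rsquo;s gain comes from a fused Triton kernel, which this sketch does not reproduce.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# Per-row (per-vocab-channel) symmetric INT8 quantization of an LM head.
# Numerics only, on toy sizes; not the repo's Triton kernel.
import torch

def quantize_lm_head(weight):
    """weight: [vocab, hidden]. Returns int8 weights and per-row scales."""
    scales = weight.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp(torch.round(weight / scales), -127, 127).to(torch.int8)
    return q, scales

def lm_head_logits(hidden_states, q, scales):
    """Dequantize-on-the-fly matmul: logits = hidden @ (q * scales).T"""
    return hidden_states @ (q.float() * scales).t()

vocab, hidden = 32_000, 1_024                 # toy sizes; real vocabularies are larger
w = torch.randn(vocab, hidden)
q, s = quantize_lm_head(w)
x = torch.randn(2, hidden)
print((lm_head_logits(x, q, s) - x @ w.t()).abs().max())   # small quantization error&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;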
&lt;h3 id="4-turboquant-kv-cache-optional"&gt;4. TurboQuant KV cache (optional)
&lt;/h3&gt;&lt;p&gt;&lt;a class="link" href="https://github.com/microsoft/turbo-quant" target="_blank" rel="noopener"
 &gt;TurboQuant&lt;/a&gt; compresses the KV cache 4x. Absolute throughput drops slightly versus v2, but concurrent 256K-context users go from 1 to 5 — a meaningful tradeoff for long-context multi-tenant workloads.&lt;/p&gt;
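&lt;p&gt;The concurrency numbers in the table above are just KV-budget arithmetic: a 4x smaller cache entry turns the same memory into roughly four times the token budget, and dividing by the 256K window gives the concurrent-user count.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# KV-cache budget arithmetic using the figures reported above.
KV_BUDGET_V2 = 355_000        # tokens that fit in the v2 build's KV cache
COMPRESSION = 4               # TurboQuant's reported compression ratio
CONTEXT = 262_144             # the 256K window

kv_budget_tq = KV_BUDGET_V2 * COMPRESSION
print(kv_budget_tq)                                        # about 1.4M tokens
print(KV_BUDGET_V2 // CONTEXT, kv_budget_tq // CONTEXT)    # 1 vs 5 concurrent 256K users&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;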
&lt;h2 id="environment"&gt;Environment
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;vLLM 0.19.1, CUDA 13.0, Docker-based&lt;/li&gt;
&lt;li&gt;Inference stack: &lt;a class="link" href="https://github.com/vllm-project/vllm" target="_blank" rel="noopener"
 &gt;vLLM 0.19&lt;/a&gt; + &lt;a class="link" href="https://github.com/flashinfer-ai/flashinfer" target="_blank" rel="noopener"
 &gt;FlashInfer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Model: &lt;code&gt;Intel/Qwen3.5-122B-A10B-int4-AutoRound&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;One-shot &lt;code&gt;./install.sh&lt;/code&gt; runs steps 0 through 4, idempotent&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="insights"&gt;Insights
&lt;/h2&gt;&lt;p&gt;51 tok/s on a 100B-class model from a single workstation lands close to the 60 tok/s zone that feels native in a chat UI, which is the real news here. For a 171-star repo the engineering is unusually tight — bench tables, step-wise Dockerfiles, install.sh, vLLM/CUDA version notes — and you can run it as written. The deeper lesson is that the five techniques are orthogonal: hybrid quant attacks memory and accuracy, MTP attacks decoding parallelism, INT8 LM head attacks compute, and TurboQuant attacks KV memory. The 80 percent number is not one big trick but a sequence of bottleneck migrations. The v2 versus v2-tq split also shows that throughput and concurrency are different axes — pick the build that matches your workload, not the highest single-stream number. Expect this hybrid-quant plus speculative plus custom-kernel stack to land as a default in vLLM and SGLang within a quarter or two, at which point &amp;ldquo;100B in one box&amp;rdquo; stops being a demo.&lt;/p&gt;
&lt;h2 id="references"&gt;References
&lt;/h2&gt;&lt;h3 id="repo-and-model-cards"&gt;Repo and model cards
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4" target="_blank" rel="noopener"
 &gt;albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4&lt;/a&gt; — 171 stars, Apache 2.0&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://huggingface.co/Qwen/Qwen3.5-122B-A10B" target="_blank" rel="noopener"
 &gt;Qwen/Qwen3.5-122B-A10B&lt;/a&gt; — 122B/10B hybrid MoE, 262K context&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://huggingface.co/Intel/Qwen3.5-122B-A10B-int4-AutoRound" target="_blank" rel="noopener"
 &gt;Intel/Qwen3.5-122B-A10B-int4-AutoRound&lt;/a&gt; — INT4 group128&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.nvidia.com/en-us/products/workstations/dgx-spark/" target="_blank" rel="noopener"
 &gt;NVIDIA DGX Spark&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="inference-frameworks"&gt;Inference frameworks
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/vllm-project/vllm" target="_blank" rel="noopener"
 &gt;vLLM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/flashinfer-ai/flashinfer" target="_blank" rel="noopener"
 &gt;FlashInfer&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="optimization-techniques"&gt;Optimization techniques
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/intel/auto-round" target="_blank" rel="noopener"
 &gt;Intel AutoRound (arXiv:2309.05516)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://arxiv.org/abs/2404.19737" target="_blank" rel="noopener"
 &gt;Multi-Token Prediction (arXiv:2404.19737)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/microsoft/turbo-quant" target="_blank" rel="noopener"
 &gt;TurboQuant&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item></channel></rss>