<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Dgx Spark on ICE-ICE-BEAR-BLOG</title><link>https://ice-ice-bear.github.io/tags/dgx-spark/</link><description>Recent content in Dgx Spark on ICE-ICE-BEAR-BLOG</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Thu, 07 May 2026 00:00:00 +0900</lastBuildDate><atom:link href="https://ice-ice-bear.github.io/tags/dgx-spark/index.xml" rel="self" type="application/rss+xml"/><item><title>Pushing Qwen3.5-122B from 28.3 to 51 tok per second on a single DGX Spark</title><link>https://ice-ice-bear.github.io/posts/2026-05-07-dgx-spark-qwen35-inference-tuning/</link><pubDate>Thu, 07 May 2026 00:00:00 +0900</pubDate><guid>https://ice-ice-bear.github.io/posts/2026-05-07-dgx-spark-qwen35-inference-tuning/</guid><description>&lt;img src="https://ice-ice-bear.github.io/" alt="Featured image of post Pushing Qwen3.5-122B from 28.3 to 51 tok per second on a single DGX Spark" /&gt;&lt;h2 id="overview"&gt;Overview
&lt;/h2&gt;&lt;p&gt;&lt;a class="link" href="https://github.com/albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4" target="_blank" rel="noopener"
 &gt;albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4&lt;/a&gt; is a recipe that pushes &lt;a class="link" href="https://huggingface.co/Qwen/Qwen3.5-122B-A10B" target="_blank" rel="noopener"
 &gt;Qwen3.5-122B-A10B&lt;/a&gt; from 28.3 to 51 tok/s on a single &lt;a class="link" href="https://www.nvidia.com/en-us/products/workstations/dgx-spark/" target="_blank" rel="noopener"
 &gt;NVIDIA DGX Spark&lt;/a&gt;, an 80 percent gain. It stacks five orthogonal techniques on top of vLLM 0.19: AutoRound INT4 quantization, an FP8 dense-layer hybrid, MTP-2 speculative decoding, an INT8 LM head, and optional TurboQuant KV cache compression — all while preserving 256K context. Apache 2.0, 171 stars on GitHub. The interesting question it answers in the affirmative: can a single workstation actually serve a 100B-class MoE model at production speed?&lt;/p&gt;
&lt;pre class="mermaid" style="visibility:hidden"&gt;flowchart LR
 Base["Baseline &amp;lt;br/&amp;gt; 28.3 tok/s"] --&gt; S1["+ Hybrid INT4+FP8 &amp;lt;br/&amp;gt; 30.8 tok/s"]
 S1 --&gt; S2["+ MTP-2 Speculative &amp;lt;br/&amp;gt; 38.4 tok/s"]
 S2 --&gt; V2["v2: + INT8 LM Head &amp;lt;br/&amp;gt; 51 tok/s"]
 V2 --&gt; TQ["v2-tq: + TurboQuant KV &amp;lt;br/&amp;gt; 39 tok/s &amp;lt;br/&amp;gt; 1.4M KV"]&lt;/pre&gt;&lt;h2 id="results"&gt;Results
&lt;/h2&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Build&lt;/th&gt;
 &lt;th&gt;tok/s&lt;/th&gt;
 &lt;th&gt;Gain&lt;/th&gt;
 &lt;th&gt;Image&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Baseline (vLLM 0.19 + AutoRound INT4 + FlashInfer)&lt;/td&gt;
 &lt;td&gt;28.3&lt;/td&gt;
 &lt;td&gt;—&lt;/td&gt;
 &lt;td&gt;—&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;+ Hybrid INT4+FP8 dense layers&lt;/td&gt;
 &lt;td&gt;30.8&lt;/td&gt;
 &lt;td&gt;+8.8%&lt;/td&gt;
 &lt;td&gt;step 1&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;+ MTP-2 speculative decoding&lt;/td&gt;
 &lt;td&gt;38.4&lt;/td&gt;
 &lt;td&gt;+35.7%&lt;/td&gt;
 &lt;td&gt;step 2&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;v2&lt;/strong&gt; (+ INT8 LM head v2)&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;51&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;+80%&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;Dockerfile.v2&lt;/code&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;v2-tq (+ TurboQuant KV cache)&lt;/td&gt;
 &lt;td&gt;39&lt;/td&gt;
 &lt;td&gt;+38%&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;Dockerfile.v2-tq&lt;/code&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The same stack pushes Qwen3.5-35B-A3B (the smaller sibling) to 112 tok/s.&lt;/p&gt;
&lt;h3 id="256k-context-tradeoff"&gt;256K context tradeoff
&lt;/h3&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Build&lt;/th&gt;
 &lt;th&gt;KV cache&lt;/th&gt;
 &lt;th&gt;Concurrent users at 256K context&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;v2 (standard)&lt;/td&gt;
 &lt;td&gt;355K tokens&lt;/td&gt;
 &lt;td&gt;1&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;v2-tq (TurboQuant)&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;1.4M tokens&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;5&lt;/strong&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
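&lt;p&gt;The concurrency column follows directly from KV-cache capacity divided by context length. A quick sanity check in Python, assuming the 256K figure means 262,144 tokens:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# Concurrent 256K-context users = KV-cache capacity // context length.
# Capacities are from the table above; 256K is assumed to mean 262,144 tokens.
CONTEXT_TOKENS = 256 * 1024              # 262,144

kv_capacity = {
    "v2 (standard)": 355_000,            # ~355K tokens of KV cache
    "v2-tq (TurboQuant)": 1_400_000,     # ~1.4M tokens after 4x KV compression
}

for build, tokens in kv_capacity.items():
    print(build, tokens // CONTEXT_TOKENS)
# v2 (standard) 1
# v2-tq (TurboQuant) 5
&lt;/code&gt;&lt;/pre&gt;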
&lt;h2 id="the-model-in-one-paragraph"&gt;The model in one paragraph
&lt;/h2&gt;&lt;p&gt;&lt;a class="link" href="https://huggingface.co/Qwen/Qwen3.5-122B-A10B" target="_blank" rel="noopener"
 &gt;Qwen3.5-122B-A10B&lt;/a&gt; is a hybrid MoE that activates 10B of its 122B parameters per token: 256 experts with 8 routed per token plus 1 shared, 48 layers interleaving Gated DeltaNet and Gated Attention at a 12:1 ratio, native 262K context (extensible to 1M with YaRN), Apache 2.0. The starting point for this recipe is &lt;a class="link" href="https://huggingface.co/Intel/Qwen3.5-122B-A10B-int4-AutoRound" target="_blank" rel="noopener"
 &gt;Intel/Qwen3.5-122B-A10B-int4-AutoRound&lt;/a&gt;, produced with &lt;a class="link" href="https://github.com/intel/auto-round" target="_blank" rel="noopener"
 &gt;Intel AutoRound&lt;/a&gt; at group size 128 with &lt;code&gt;shared_expert&lt;/code&gt; left out of quantization.&lt;/p&gt;
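&lt;p&gt;For intuition on how only 10B of the 122B parameters are active per token, here is a minimal sketch of the routing pattern described above: top-8 of 256 routed experts plus one always-on shared expert. The sizes, gating, and expert shapes are deliberately tiny stand-ins, not the actual Qwen3.5 internals.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import torch
import torch.nn.functional as F

# Toy top-8-of-256 MoE routing with one always-active shared expert.
# Sizes are deliberately tiny stand-ins; this is not the Qwen3.5 implementation.
NUM_EXPERTS, TOP_K, HIDDEN = 256, 8, 64

router = torch.nn.Linear(HIDDEN, NUM_EXPERTS, bias=False)
experts = torch.nn.ModuleList([torch.nn.Linear(HIDDEN, HIDDEN) for _ in range(NUM_EXPERTS)])
shared_expert = torch.nn.Linear(HIDDEN, HIDDEN)    # left unquantized in the AutoRound checkpoint

def moe_forward(x):                                # x: (HIDDEN,) for a single token
    probs = F.softmax(router(x), dim=-1)
    weights, idx = torch.topk(probs, TOP_K)
    weights = weights / weights.sum()              # renormalize over the 8 selected experts
    routed = sum(w * experts[i](x) for w, i in zip(weights.tolist(), idx.tolist()))
    return routed + shared_expert(x)               # shared expert fires on every token

print(moe_forward(torch.randn(HIDDEN)).shape)      # torch.Size([64])
&lt;/code&gt;&lt;/pre&gt;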
&lt;h2 id="the-five-techniques"&gt;The five techniques
&lt;/h2&gt;&lt;h3 id="1-hybrid-int4--fp8-dense-layers-9"&gt;1. Hybrid INT4 + FP8 dense layers (+9%)
&lt;/h3&gt;&lt;p&gt;Replace the BF16 shared-expert weights inside the AutoRound INT4 model with FP8 weights from the official Qwen checkpoint. Net effect: the routed experts stay INT4 while the shared-expert (dense) layers run FP8 instead of BF16. Memory and compute drop without touching accuracy.&lt;/p&gt;
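&lt;p&gt;A minimal sketch of what that swap looks like, assuming single-file checkpoints and a &lt;code&gt;shared_expert&lt;/code&gt; substring in the tensor names; the real checkpoints are sharded and the repo's own script differs in the details:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;from safetensors.torch import load_file, save_file

# Copy the FP8 shared-expert / dense tensors from the official FP8 checkpoint
# over their BF16 counterparts in the AutoRound INT4 checkpoint.
# File names and the "shared_expert" name pattern are illustrative assumptions.
int4_state = load_file("Qwen3.5-122B-A10B-int4-AutoRound/model.safetensors")
fp8_state  = load_file("Qwen3.5-122B-A10B-FP8/model.safetensors")

replaced = 0
for name, tensor in fp8_state.items():
    if "shared_expert" in name and name in int4_state:
        int4_state[name] = tensor        # FP8 tensor replaces the BF16 copy
        replaced += 1

save_file(int4_state, "Qwen3.5-122B-A10B-hybrid/model.safetensors")
print(replaced, "tensors swapped to FP8")
&lt;/code&gt;&lt;/pre&gt;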
&lt;h3 id="2-mtp-2-speculative-decoding-36"&gt;2. MTP-2 speculative decoding (+36%)
&lt;/h3&gt;&lt;p&gt;&lt;a class="link" href="https://arxiv.org/abs/2404.19737" target="_blank" rel="noopener"
 &gt;Multi-Token Prediction&lt;/a&gt; generates 2 draft tokens per step with roughly 80 percent acceptance, good for a 25 percent step on its own (30.8 to 38.4 tok/s). Notably there is no separate draft model — the main model itself runs multi-head prediction, which keeps the deployment simple.&lt;/p&gt;
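&lt;p&gt;As a back-of-the-envelope check, the standard speculative-decoding analysis gives the expected tokens per verification step; with 2 draft tokens and per-token acceptance p this is 1 + p + p^2, treating acceptances as independent (a simplification):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# Expected tokens emitted per verification step with 2 speculative (MTP) tokens
# and per-token acceptance probability p, assuming independent acceptances.
p = 0.80
expected_tokens_per_step = 1 + p + p**2   # 1 guaranteed token + expected accepted drafts
print(expected_tokens_per_step)           # 2.44

# The observed step gain here was 30.8 to 38.4 tok/s (about 1.25x), well below the
# 2.44x ceiling, because the MTP heads and verification add cost to every step.
print(round(38.4 / 30.8, 2))              # 1.25
&lt;/code&gt;&lt;/pre&gt;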
&lt;h3 id="3-int8-lm-head-v2-triton-kernel"&gt;3. INT8 LM head v2 (Triton kernel)
&lt;/h3&gt;&lt;p&gt;Quantizes the final vocabulary projection to INT8 via a custom Triton kernel. This is the biggest jump in the v2 build (38.4 to 51 tok/s). LM heads are usually exempt from quantization, but on models with very large vocabularies the cost is high enough that revisiting the assumption pays off.&lt;/p&gt;
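&lt;p&gt;The repo does this with a custom Triton kernel; the numerics are easier to see in plain PyTorch. A sketch of per-row symmetric INT8 quantization of an LM head, with scaled-down stand-in sizes (the real vocabulary is far larger, which is exactly why this layer is worth quantizing):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import torch

# Per-output-row symmetric INT8 quantization of an LM head, sketched in plain
# PyTorch. The recipe uses a custom Triton kernel; this version only shows the
# numerics (quantize once, dequantize on the fly) with stand-in sizes.
VOCAB, HIDDEN = 32_768, 2_048

w = torch.randn(VOCAB, HIDDEN)                               # stand-in lm_head weight
scale = w.abs().amax(dim=1, keepdim=True) / 127.0            # one scale per vocab row
w_int8 = (w / scale).round().clamp(-127, 127).to(torch.int8)

def lm_head_int8(hidden_states):                             # hidden_states: (batch, HIDDEN)
    return hidden_states @ (w_int8.float() * scale).t()      # dequantize, then matmul

logits = lm_head_int8(torch.randn(2, HIDDEN))
print(logits.shape)                                          # torch.Size([2, 32768])
&lt;/code&gt;&lt;/pre&gt;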
&lt;h3 id="4-turboquant-kv-cache-optional"&gt;4. TurboQuant KV cache (optional)
&lt;/h3&gt;&lt;p&gt;&lt;a class="link" href="https://github.com/microsoft/turbo-quant" target="_blank" rel="noopener"
 &gt;TurboQuant&lt;/a&gt; compresses the KV cache 4x. Absolute throughput drops from 51 to 39 tok/s versus v2, but concurrent 256K-context users go from 1 to 5 — a meaningful tradeoff for long-context multi-tenant workloads.&lt;/p&gt;
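&lt;p&gt;The 4x figure is simply 16-bit KV entries going to 4 bits. A generic round-to-nearest 4-bit KV quantizer is enough to see where the memory saving comes from; to be clear, this is not TurboQuant's actual algorithm, just the simplest baseline:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import torch

# Generic per-group 4-bit quantization of a KV-cache tensor (round-to-nearest,
# asymmetric). Illustrates the 16-bit to 4-bit memory saving only; this is NOT
# the TurboQuant algorithm.
GROUP = 64

def quantize_kv(x, group=GROUP):
    xg = x.float().reshape(-1, group)
    lo = xg.amin(dim=1, keepdim=True)
    hi = xg.amax(dim=1, keepdim=True)
    scale = (hi - lo).clamp(min=1e-6) / 15.0                 # 4-bit range 0..15
    q = ((xg - lo) / scale).round().clamp(0, 15).to(torch.uint8)
    return q, scale, lo                                      # two 4-bit codes per byte once packed

def dequantize_kv(q, scale, lo, shape):
    return (q.float() * scale + lo).reshape(shape)

k = torch.randn(1, 8, 128, 128, dtype=torch.bfloat16)        # (batch, heads, seq, head_dim)
q, scale, lo = quantize_kv(k)
err = (dequantize_kv(q, scale, lo, k.shape) - k.float()).abs().mean()
print("mean abs reconstruction error:", round(err.item(), 4))
&lt;/code&gt;&lt;/pre&gt;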
&lt;h2 id="environment"&gt;Environment
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;vLLM 0.19.1, CUDA 13.0, Docker-based&lt;/li&gt;
&lt;li&gt;Inference stack: &lt;a class="link" href="https://github.com/vllm-project/vllm" target="_blank" rel="noopener"
 &gt;vLLM 0.19&lt;/a&gt; + &lt;a class="link" href="https://github.com/flashinfer-ai/flashinfer" target="_blank" rel="noopener"
 &gt;FlashInfer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Model: &lt;code&gt;Intel/Qwen3.5-122B-A10B-int4-AutoRound&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;One-shot &lt;code&gt;./install.sh&lt;/code&gt; runs steps 0 through 4 and is idempotent (safe to re-run)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="insights"&gt;Insights
&lt;/h2&gt;&lt;p&gt;51 tok/s on a 100B-class model from a single workstation lands close to the 60 tok/s zone that feels native in a chat UI, which is the real news here. For a 171-star repo the engineering is unusually tight — bench tables, step-wise Dockerfiles, install.sh, vLLM/CUDA version notes — and you can run it as written. The deeper lesson is that the five techniques are orthogonal: hybrid quant attacks memory and accuracy, MTP attacks decoding parallelism, INT8 LM head attacks compute, and TurboQuant attacks KV memory. The 80 percent number is not one big trick but a sequence of bottleneck migrations. The v2 versus v2-tq split also shows that throughput and concurrency are different axes — pick the build that matches your workload, not the highest single-stream number. Expect this hybrid-quant plus speculative plus custom-kernel stack to land as a default in vLLM and SGLang within a quarter or two, at which point &amp;ldquo;100B in one box&amp;rdquo; stops being a demo.&lt;/p&gt;
&lt;h2 id="references"&gt;References
&lt;/h2&gt;&lt;h3 id="repo-and-model-cards"&gt;Repo and model cards
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4" target="_blank" rel="noopener"
 &gt;albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4&lt;/a&gt; — 171 stars, Apache 2.0&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://huggingface.co/Qwen/Qwen3.5-122B-A10B" target="_blank" rel="noopener"
 &gt;Qwen/Qwen3.5-122B-A10B&lt;/a&gt; — 122B/10B hybrid MoE, 262K context&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://huggingface.co/Intel/Qwen3.5-122B-A10B-int4-AutoRound" target="_blank" rel="noopener"
 &gt;Intel/Qwen3.5-122B-A10B-int4-AutoRound&lt;/a&gt; — INT4 group128&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.nvidia.com/en-us/products/workstations/dgx-spark/" target="_blank" rel="noopener"
 &gt;NVIDIA DGX Spark&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="inference-frameworks"&gt;Inference frameworks
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/vllm-project/vllm" target="_blank" rel="noopener"
 &gt;vLLM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/flashinfer-ai/flashinfer" target="_blank" rel="noopener"
 &gt;FlashInfer&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="optimization-techniques"&gt;Optimization techniques
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/intel/auto-round" target="_blank" rel="noopener"
 &gt;Intel AutoRound (arXiv:2309.05516)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://arxiv.org/abs/2404.19737" target="_blank" rel="noopener"
 &gt;Multi-Token Prediction (arXiv:2404.19737)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/microsoft/turbo-quant" target="_blank" rel="noopener"
 &gt;TurboQuant&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item></channel></rss>