<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Inference on ICE-ICE-BEAR-BLOG</title><link>https://ice-ice-bear.github.io/tags/inference/</link><description>Recent content in Inference on ICE-ICE-BEAR-BLOG</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Thu, 07 May 2026 00:00:00 +0900</lastBuildDate><atom:link href="https://ice-ice-bear.github.io/tags/inference/index.xml" rel="self" type="application/rss+xml"/><item><title>Pushing Qwen3.5-122B from 28.3 to 51 tok per second on a single DGX Spark</title><link>https://ice-ice-bear.github.io/posts/2026-05-07-dgx-spark-qwen35-inference-tuning/</link><pubDate>Thu, 07 May 2026 00:00:00 +0900</pubDate><guid>https://ice-ice-bear.github.io/posts/2026-05-07-dgx-spark-qwen35-inference-tuning/</guid><description>&lt;h2 id="overview"&gt;Overview
&lt;/h2&gt;&lt;p&gt;&lt;a class="link" href="https://github.com/albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4" target="_blank" rel="noopener"
 &gt;albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4&lt;/a&gt; is a recipe that pushes &lt;a class="link" href="https://huggingface.co/Qwen/Qwen3.5-122B-A10B" target="_blank" rel="noopener"
 &gt;Qwen3.5-122B-A10B&lt;/a&gt; from 28.3 to 51 tok/s on a single &lt;a class="link" href="https://www.nvidia.com/en-us/products/workstations/dgx-spark/" target="_blank" rel="noopener"
 &gt;NVIDIA DGX Spark&lt;/a&gt;, an 80 percent gain. It stacks five orthogonal techniques on top of vLLM 0.19: AutoRound INT4 quantization, an FP8 dense-layer hybrid, MTP-2 speculative decoding, an INT8 LM head, and optional TurboQuant KV cache compression — all while preserving 256K context. Apache 2.0, 171 stars on GitHub. The interesting question it answers in the affirmative: can a single workstation actually serve a 100B-class MoE model at production speed?&lt;/p&gt;
&lt;pre class="mermaid" style="visibility:hidden"&gt;flowchart LR
 Base["Baseline &amp;lt;br/&amp;gt; 28.3 tok/s"] --&gt; S1["+ Hybrid INT4+FP8 &amp;lt;br/&amp;gt; 30.8 tok/s"]
 S1 --&gt; S2["+ MTP-2 Speculative &amp;lt;br/&amp;gt; 38.4 tok/s"]
 S2 --&gt; V2["v2: + INT8 LM Head &amp;lt;br/&amp;gt; 51 tok/s"]
 V2 --&gt; TQ["v2-tq: + TurboQuant KV &amp;lt;br/&amp;gt; 39 tok/s &amp;lt;br/&amp;gt; 1.4M KV"]&lt;/pre&gt;&lt;h2 id="results"&gt;Results
&lt;/h2&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Build&lt;/th&gt;
 &lt;th&gt;tok/s&lt;/th&gt;
 &lt;th&gt;Gain&lt;/th&gt;
 &lt;th&gt;Image&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Baseline (vLLM 0.19 + AutoRound INT4 + FlashInfer)&lt;/td&gt;
 &lt;td&gt;28.3&lt;/td&gt;
 &lt;td&gt;—&lt;/td&gt;
 &lt;td&gt;—&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;+ Hybrid INT4+FP8 dense layers&lt;/td&gt;
 &lt;td&gt;30.8&lt;/td&gt;
 &lt;td&gt;+8.8%&lt;/td&gt;
 &lt;td&gt;step 1&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;+ MTP-2 speculative decoding&lt;/td&gt;
 &lt;td&gt;38.4&lt;/td&gt;
 &lt;td&gt;+35.7%&lt;/td&gt;
 &lt;td&gt;step 2&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;v2&lt;/strong&gt; (+ INT8 LM head v2)&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;51&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;+80%&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;Dockerfile.v2&lt;/code&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;v2-tq (+ TurboQuant KV cache)&lt;/td&gt;
 &lt;td&gt;39&lt;/td&gt;
 &lt;td&gt;+38%&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;Dockerfile.v2-tq&lt;/code&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The same stack pushes Qwen3.5-35B-A3B (the smaller sibling) to 112 tok/s.&lt;/p&gt;
&lt;h3 id="256k-context-tradeoff"&gt;256K context tradeoff
&lt;/h3&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Build&lt;/th&gt;
 &lt;th&gt;KV cache&lt;/th&gt;
 &lt;th&gt;Concurrent users at 256K context&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;v2 (standard)&lt;/td&gt;
 &lt;td&gt;355K tokens&lt;/td&gt;
 &lt;td&gt;1&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;v2-tq (TurboQuant)&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;1.4M tokens&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;5&lt;/strong&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id="the-model-in-one-paragraph"&gt;The model in one paragraph
&lt;/h2&gt;&lt;p&gt;&lt;a class="link" href="https://huggingface.co/Qwen/Qwen3.5-122B-A10B" target="_blank" rel="noopener"
 &gt;Qwen3.5-122B-A10B&lt;/a&gt; is a hybrid MoE that activates 10B of its 122B parameters per token: 256 experts with 8 routed plus 1 shared, 48 layers alternating Gated DeltaNet and Gated Attention at a 12:1 ratio, native 262K context (extensible to 1M with YaRN), Apache 2.0. The starting point for this recipe is &lt;a class="link" href="https://huggingface.co/Intel/Qwen3.5-122B-A10B-int4-AutoRound" target="_blank" rel="noopener"
 &gt;Intel/Qwen3.5-122B-A10B-int4-AutoRound&lt;/a&gt;, produced with &lt;a class="link" href="https://github.com/intel/auto-round" target="_blank" rel="noopener"
 &gt;Intel AutoRound&lt;/a&gt; at group size 128 with &lt;code&gt;shared_expert&lt;/code&gt; left out of quantization.&lt;/p&gt;
&lt;h2 id="the-five-techniques"&gt;The five techniques
&lt;/h2&gt;&lt;h3 id="1-hybrid-int4--fp8-dense-layers-9"&gt;1. Hybrid INT4 + FP8 dense layers (+9%)
&lt;/h3&gt;&lt;p&gt;Replace the BF16 shared-expert weights inside the AutoRound INT4 model with the FP8 weights from the official Qwen FP8 checkpoint. Net effect: routed experts stay INT4, dense layers run FP8. Memory and compute drop without touching accuracy.&lt;/p&gt;
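&lt;p&gt;A minimal sketch of the swap, assuming safetensors checkpoints on disk and illustrative tensor names; the actual &lt;code&gt;shared_expert&lt;/code&gt; key patterns (and any FP8 scale tensors) should be checked against both checkpoints:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# Sketch: graft FP8 shared-expert tensors from the official FP8 checkpoint
# into the AutoRound INT4 checkpoint. Paths and key patterns are
# illustrative, not the repo's actual script.
from safetensors.torch import load_file, save_file

int4_state = load_file("qwen35-int4/model.safetensors")
fp8_state = load_file("qwen35-fp8/model.safetensors")

for name, tensor in fp8_state.items():
    # Routed experts stay INT4; only shared-expert weights move to FP8
    # (AutoRound left these in BF16, so this is a straight replacement).
    if "shared_expert" in name:
        int4_state[name] = tensor

save_file(int4_state, "qwen35-hybrid/model.safetensors")&lt;/code&gt;&lt;/pre&gt;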
&lt;h3 id="2-mtp-2-speculative-decoding-36"&gt;2. MTP-2 speculative decoding (+36%)
&lt;/h3&gt;&lt;p&gt;&lt;a class="link" href="https://arxiv.org/abs/2404.19737" target="_blank" rel="noopener"
 &gt;Multi-Token Prediction&lt;/a&gt; generates 2 tokens per step with roughly 80 percent acceptance, lifting throughput from 30.8 to 38.4 tok/s. Notably there is no separate draft model — the main model itself runs multi-head prediction, which keeps the deployment simple.&lt;/p&gt;
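&lt;p&gt;In vLLM this amounts to a &lt;code&gt;speculative_config&lt;/code&gt; on the main model, with no draft-model path. A sketch, assuming the MTP method name vLLM uses for the Qwen3-Next family carries over to Qwen3.5; the repo&amp;rsquo;s Dockerfiles are authoritative for the exact arguments:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# Sketch: MTP-2 speculative decoding in vLLM. The method name
# "qwen3_next_mtp" is an assumption borrowed from Qwen3-Next support.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Intel/Qwen3.5-122B-A10B-int4-AutoRound",
    speculative_config={
        "method": "qwen3_next_mtp",   # MTP heads live in the main model
        "num_speculative_tokens": 2,  # the "2" in MTP-2
    },
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)&lt;/code&gt;&lt;/pre&gt;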
&lt;h3 id="3-int8-lm-head-v2-triton-kernel"&gt;3. INT8 LM head v2 (Triton kernel)
&lt;/h3&gt;&lt;p&gt;Quantizes the final vocabulary projection to INT8 via a custom Triton kernel. This is the biggest jump in the v2 build (38.4 to 51 tok/s). LM heads are usually exempt from quantization, but on models with very large vocabularies the cost is high enough that revisiting the assumption pays off.&lt;/p&gt;
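&lt;p&gt;The payoff comes from vocabulary size: the LM head runs once per generated token, so halving its bytes (BF16 to INT8) directly cuts decode-time memory traffic. A plain-PyTorch sketch of the quantization scheme, with illustrative dimensions (the repo fuses this into a Triton kernel):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# Sketch: per-row symmetric INT8 quantization of the LM head weight.
# Dimensions are illustrative; real Qwen vocabularies run ~150K rows,
# which is why this layer is worth quantizing at all.
import torch

hidden, vocab = 2048, 32_000
w = torch.randn(vocab, hidden)

scale = w.abs().amax(dim=1, keepdim=True) / 127.0            # one scale per vocab row
w_int8 = (w / scale).round().clamp(-127, 127).to(torch.int8)

def lm_head_int8(x):
    # Dequantize-then-matmul for clarity; a real kernel keeps INT8 math
    # in registers and applies the scale once per row.
    return x @ (w_int8.float() * scale).t()

logits = lm_head_int8(torch.randn(1, hidden))
print(logits.shape)  # torch.Size([1, 32000])&lt;/code&gt;&lt;/pre&gt;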
&lt;h3 id="4-turboquant-kv-cache-optional"&gt;4. TurboQuant KV cache (optional)
&lt;/h3&gt;&lt;p&gt;&lt;a class="link" href="https://github.com/microsoft/turbo-quant" target="_blank" rel="noopener"
 &gt;TurboQuant&lt;/a&gt; compresses the KV cache 4x. Absolute throughput drops slightly versus v2, but concurrent 256K-context users go from 1 to 5 — a meaningful tradeoff for long-context multi-tenant workloads.&lt;/p&gt;
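&lt;p&gt;The concurrency math follows directly from the compression factor. A toy calculation reproducing the tradeoff table, with the 4x factor and the v2 KV budget as the only inputs:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# Toy arithmetic behind the v2 vs v2-tq rows: same KV memory budget,
# 4x smaller per-token footprint under TurboQuant.
BASE_KV_TOKENS = 355_000   # tokens the v2 build can cache (repo's table)
COMPRESSION = 4            # TurboQuant's claimed KV compression
CONTEXT = 262_144          # the model's native ~256K window

tq_tokens = BASE_KV_TOKENS * COMPRESSION
print(tq_tokens)                  # 1_420_000, the "1.4M tokens" row
print(BASE_KV_TOKENS // CONTEXT)  # 1 concurrent 256K user on v2
print(tq_tokens // CONTEXT)       # 5 concurrent 256K users on v2-tq&lt;/code&gt;&lt;/pre&gt;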
&lt;h2 id="environment"&gt;Environment
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;vLLM 0.19.1, CUDA 13.0, Docker-based&lt;/li&gt;
&lt;li&gt;Inference stack: &lt;a class="link" href="https://github.com/vllm-project/vllm" target="_blank" rel="noopener"
 &gt;vLLM 0.19&lt;/a&gt; + &lt;a class="link" href="https://github.com/flashinfer-ai/flashinfer" target="_blank" rel="noopener"
 &gt;FlashInfer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Model: &lt;code&gt;Intel/Qwen3.5-122B-A10B-int4-AutoRound&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;One-shot &lt;code&gt;./install.sh&lt;/code&gt; runs steps 0 through 4 and is idempotent&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="insights"&gt;Insights
&lt;/h2&gt;&lt;p&gt;51 tok/s on a 100B-class model from a single workstation lands close to the 60 tok/s zone that feels native in a chat UI, which is the real news here. For a 171-star repo the engineering is unusually tight — bench tables, step-wise Dockerfiles, install.sh, vLLM/CUDA version notes — and you can run it as written. The deeper lesson is that the five techniques are orthogonal: hybrid quant attacks memory and accuracy, MTP attacks decoding parallelism, INT8 LM head attacks compute, and TurboQuant attacks KV memory. The 80 percent number is not one big trick but a sequence of bottleneck migrations. The v2 versus v2-tq split also shows that throughput and concurrency are different axes — pick the build that matches your workload, not the highest single-stream number. Expect this hybrid-quant plus speculative plus custom-kernel stack to land as a default in vLLM and SGLang within a quarter or two, at which point &amp;ldquo;100B in one box&amp;rdquo; stops being a demo.&lt;/p&gt;
&lt;h2 id="references"&gt;References
&lt;/h2&gt;&lt;h3 id="repo-and-model-cards"&gt;Repo and model cards
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4" target="_blank" rel="noopener"
 &gt;albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4&lt;/a&gt; — 171 stars, Apache 2.0&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://huggingface.co/Qwen/Qwen3.5-122B-A10B" target="_blank" rel="noopener"
 &gt;Qwen/Qwen3.5-122B-A10B&lt;/a&gt; — 122B/10B hybrid MoE, 262K context&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://huggingface.co/Intel/Qwen3.5-122B-A10B-int4-AutoRound" target="_blank" rel="noopener"
 &gt;Intel/Qwen3.5-122B-A10B-int4-AutoRound&lt;/a&gt; — INT4 group128&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.nvidia.com/en-us/products/workstations/dgx-spark/" target="_blank" rel="noopener"
 &gt;NVIDIA DGX Spark&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="inference-frameworks"&gt;Inference frameworks
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/vllm-project/vllm" target="_blank" rel="noopener"
 &gt;vLLM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/flashinfer-ai/flashinfer" target="_blank" rel="noopener"
 &gt;FlashInfer&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="optimization-techniques"&gt;Optimization techniques
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/intel/auto-round" target="_blank" rel="noopener"
 &gt;Intel AutoRound (arXiv:2309.05516)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://arxiv.org/abs/2404.19737" target="_blank" rel="noopener"
 &gt;Multi-Token Prediction (arXiv:2404.19737)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/microsoft/turbo-quant" target="_blank" rel="noopener"
 &gt;TurboQuant&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>The LLMLingua Series — Microsoft's Underrated Prompt Compression Stack</title><link>https://ice-ice-bear.github.io/posts/2026-05-06-llmlingua-series/</link><pubDate>Wed, 06 May 2026 00:00:00 +0900</pubDate><guid>https://ice-ice-bear.github.io/posts/2026-05-06-llmlingua-series/</guid><description>&lt;h2 id="overview"&gt;Overview
&lt;/h2&gt;&lt;p&gt;Someone dropped &lt;a class="link" href="https://github.com/microsoft/LLMLingua" target="_blank" rel="noopener"
 &gt;LLMLingua&lt;/a&gt; in a chat, and another member replied &lt;em&gt;&amp;ldquo;yes, very underrated.&amp;rdquo;&lt;/em&gt; The repo has 6,156 stars, an MIT license, and six papers in the series stretching from EMNLP 2023 through CoLM 2025 — and yet production case studies are surprisingly thin on the ground. Compression of up to 20x with minimal performance loss should be a no-brainer; why isn&amp;rsquo;t adoption faster? Unpack the word &amp;ldquo;underrated&amp;rdquo; from that chat and you find the &lt;strong&gt;research-to-production gap&lt;/strong&gt; in plain sight.&lt;/p&gt;
&lt;pre class="mermaid" style="visibility:hidden"&gt;graph TD
 Origin["LLMLingua &amp;lt;br/&amp;gt; EMNLP 2023"] --&gt; Long["LongLLMLingua &amp;lt;br/&amp;gt; ACL 2024"]
 Origin --&gt; V2["LLMLingua-2 &amp;lt;br/&amp;gt; ACL 2024 Findings"]
 Long --&gt; MInf["MInference &amp;lt;br/&amp;gt; 2024"]
 V2 --&gt; MInf
 MInf --&gt; SCB["SCBench &amp;lt;br/&amp;gt; 2024"]
 SCB --&gt; Sec["SecurityLingua &amp;lt;br/&amp;gt; CoLM 2025"]

 Origin -.-&gt;|small LLM token pruning| Theme1["20x compression"]
 Long -.-&gt;|"lost-in-middle fix"| Theme2["RAG +21.4%"]
 V2 -.-&gt;|GPT-4 distill BERT| Theme3["3-6x faster"]
 MInf -.-&gt;|long-context prefill| Theme4["1M token 10x"]&lt;/pre&gt;&lt;h2 id="six-papers-one-table"&gt;Six Papers, One Table
&lt;/h2&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Paper&lt;/th&gt;
 &lt;th&gt;Year&lt;/th&gt;
 &lt;th&gt;Headline result&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a class="link" href="https://aclanthology.org/2023.emnlp-main.825" target="_blank" rel="noopener"
 &gt;&lt;strong&gt;LLMLingua&lt;/strong&gt;&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;EMNLP 2023&lt;/td&gt;
 &lt;td&gt;Use a small LLM (GPT2-small, LLaMA-7B) to drop low-value tokens — &lt;strong&gt;20x compression&lt;/strong&gt; with minimal quality loss&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a class="link" href="https://aclanthology.org/2024.acl-long.91" target="_blank" rel="noopener"
 &gt;&lt;strong&gt;LongLLMLingua&lt;/strong&gt;&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;ACL 2024&lt;/td&gt;
 &lt;td&gt;Mitigates &amp;ldquo;lost in the middle.&amp;rdquo; RAG accuracy &lt;strong&gt;+21.4%&lt;/strong&gt; at 1/4 the tokens&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a class="link" href="https://aclanthology.org/2024.findings-acl.57" target="_blank" rel="noopener"
 &gt;&lt;strong&gt;LLMLingua-2&lt;/strong&gt;&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;ACL 2024 Findings&lt;/td&gt;
 &lt;td&gt;BERT-class encoder distilled from GPT-4 — &lt;strong&gt;3-6x faster&lt;/strong&gt; and stronger out-of-domain&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a class="link" href="https://arxiv.org/abs/2407.02490" target="_blank" rel="noopener"
 &gt;&lt;strong&gt;MInference&lt;/strong&gt;&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;2024&lt;/td&gt;
 &lt;td&gt;Long-context inference acceleration. &lt;strong&gt;10x prefill on 1M tokens&lt;/strong&gt; on A100&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;SCBench&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;2024&lt;/td&gt;
 &lt;td&gt;A benchmark suite for KV-cache-centric long-context methods&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;SecurityLingua&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;CoLM 2025&lt;/td&gt;
 &lt;td&gt;Compression-based jailbreak defense — SOTA guardrail performance using &lt;strong&gt;100x fewer tokens&lt;/strong&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The full paper list, demos, and blog posts are aggregated on the project page at &lt;a class="link" href="https://llmlingua.com/" target="_blank" rel="noopener"
 &gt;llmlingua.com&lt;/a&gt;.&lt;/p&gt;
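&lt;p&gt;Switching compressors within the series is a constructor argument. For instance, LLMLingua-2&amp;rsquo;s distilled encoder (per the repo&amp;rsquo;s README) is the one to reach for when compressor latency matters:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# LLMLingua-2: BERT-class token classifier distilled from GPT-4 annotations.
from llmlingua import PromptCompressor

llm_lingua = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)
prompt = "Speaker 1: ... (a long transcript or context to compress) ..."
result = llm_lingua.compress_prompt(prompt, rate=0.33, force_tokens=["\n", "?"])
print(result["compressed_prompt"])&lt;/code&gt;&lt;/pre&gt;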
&lt;h2 id="what-you-actually-get"&gt;What You Actually Get
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Cost savings&lt;/strong&gt; — shorter prompt and shorter generation in one move; the only overhead is one small-LLM call&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Extended context&lt;/strong&gt; — sits on top of long-context models, mitigates &amp;ldquo;lost in the middle&amp;rdquo; so the same token budget carries more useful signal&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No retraining&lt;/strong&gt; — the underlying LLM is untouched, only a compressor sits in front of it (true plug-in)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Knowledge preservation&lt;/strong&gt; — designed to keep ICL examples and reasoning chains intact&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;KV-Cache compression&lt;/strong&gt; — drops both inference memory and latency&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Recoverable&lt;/strong&gt; — the authors show GPT-4 can recover the key information from a compressed prompt&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="example-llmlingua-1"&gt;Example (LLMLingua 1)
&lt;/h2&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;llmlingua&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PromptCompressor&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;llm_lingua&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PromptCompressor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm_lingua&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compress_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# {&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# &amp;#39;compressed_prompt&amp;#39;: &amp;#39;...&amp;#39;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# &amp;#39;origin_tokens&amp;#39;: 2365,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# &amp;#39;compressed_tokens&amp;#39;: 211,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# &amp;#39;ratio&amp;#39;: &amp;#39;11.2x&amp;#39;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# &amp;#39;saving&amp;#39;: &amp;#39;, Saving $0.1 in GPT-4.&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# }&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Quantized backends are supported too: &lt;code&gt;TheBloke/Llama-2-7b-Chat-GPTQ&lt;/code&gt; runs the compressor in &lt;strong&gt;under 8GB of GPU memory&lt;/strong&gt;.&lt;/p&gt;
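&lt;p&gt;Per the README, the quantized backend is selected when the compressor is constructed:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# GPTQ-quantized 7B model as the compressor backend (under 8GB of VRAM).
from llmlingua import PromptCompressor

llm_lingua = PromptCompressor(
    "TheBloke/Llama-2-7b-Chat-GPTQ",
    model_config={"revision": "main"},
)&lt;/code&gt;&lt;/pre&gt;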
&lt;h2 id="example-longllmlingua-rag-mode"&gt;Example (LongLLMLingua RAG mode)
&lt;/h2&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;compressed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm_lingua&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compress_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;prompt_list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.55&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;condition_in_question&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;after_condition&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;reorder_context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;sort&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;dynamic_context_compression_ratio&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;condition_compare&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;context_budget&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;+100&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Retrieved chunks are sorted under the question condition and the compression rate is varied dynamically by position — that combination is what drives the RAG accuracy gain.&lt;/p&gt;
&lt;h2 id="integrations"&gt;Integrations
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://python.langchain.com/docs/integrations/document_transformers/llmlingua" target="_blank" rel="noopener"
 &gt;LangChain retriever integration&lt;/a&gt; — drop &lt;code&gt;LLMLinguaCompressor&lt;/code&gt; into a &lt;code&gt;ContextualCompressionRetriever&lt;/code&gt; and you&amp;rsquo;re done (minimal sketch after this list)&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://docs.llamaindex.ai/en/stable/examples/node_postprocessor/LongLLMLingua/" target="_blank" rel="noopener"
 &gt;LlamaIndex node postprocessor&lt;/a&gt; — bolts onto the tail of any query engine pipeline&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://microsoft.github.io/promptflow/" target="_blank" rel="noopener"
 &gt;Microsoft Prompt flow integration&lt;/a&gt; — works as a standard node inside Azure environments&lt;/li&gt;
&lt;/ul&gt;
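&lt;p&gt;A minimal LangChain sketch, assuming a &lt;code&gt;retriever&lt;/code&gt; already exists; class names follow the integration docs linked above:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# Sketch: LLMLingua as a document compressor in front of any retriever.
# `retriever` is assumed to exist (e.g. a vector-store retriever).
from langchain.retrievers import ContextualCompressionRetriever
from langchain_community.document_compressors import LLMLinguaCompressor

compressor = LLMLinguaCompressor(model_name="openai-community/gpt2", device_map="cpu")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)
docs = compression_retriever.invoke("your question here")&lt;/code&gt;&lt;/pre&gt;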
&lt;h2 id="insights"&gt;Insights
&lt;/h2&gt;&lt;p&gt;The chat&amp;rsquo;s one-word verdict — &lt;em&gt;&amp;ldquo;underrated&amp;rdquo;&lt;/em&gt; — is exactly right. &lt;strong&gt;Six papers stacked, integrations across LangChain, LlamaIndex, and Prompt flow, and a 3x to 10x cost cut the moment you wire it in — yet production case studies remain rare.&lt;/strong&gt; A few likely reasons. First, compressed prompts are hard to debug — humans struggle to trace &amp;ldquo;why was that token dropped?&amp;rdquo;, which makes regression testing painful. Second, the compressor itself is another small-LLM call, so latency-tight realtime systems can&amp;rsquo;t easily afford it. Third, the ROI has only become obvious now that GPT-5 and Claude 4.x have made per-token cost a real budget line, and ops teams&amp;rsquo; awareness hasn&amp;rsquo;t caught up yet. Tellingly, OpenAI&amp;rsquo;s Privacy Filter (reversible tokenization) surfaced right alongside this — compression, pseudonymization, recovery, and KV-cache management are all clearly consolidating into a production tooling layer. &lt;strong&gt;agentmemory + agent-skills + LLMLingua = the agent context-management stack&lt;/strong&gt; that&amp;rsquo;s quietly assembling itself. Net read: when a high-performance tool stays underused, the bottleneck is usually the integration layer&amp;rsquo;s maturity, not the tool.&lt;/p&gt;
&lt;h2 id="references"&gt;References
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Repo and demos&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/microsoft/LLMLingua" target="_blank" rel="noopener"
 &gt;microsoft/LLMLingua&lt;/a&gt; — main GitHub repo (6,156 stars, MIT)&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://llmlingua.com/" target="_blank" rel="noopener"
 &gt;llmlingua.com&lt;/a&gt; — project hub (papers, demos, posts)&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://huggingface.co/spaces/microsoft/LLMLingua" target="_blank" rel="noopener"
 &gt;HuggingFace LLMLingua demo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://huggingface.co/spaces/microsoft/LLMLingua-2" target="_blank" rel="noopener"
 &gt;HuggingFace LLMLingua-2 demo&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Papers&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://aclanthology.org/2023.emnlp-main.825" target="_blank" rel="noopener"
 &gt;LLMLingua (EMNLP 2023)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://aclanthology.org/2024.acl-long.91" target="_blank" rel="noopener"
 &gt;LongLLMLingua (ACL 2024)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://aclanthology.org/2024.findings-acl.57" target="_blank" rel="noopener"
 &gt;LLMLingua-2 (ACL 2024 Findings)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://arxiv.org/abs/2407.02490" target="_blank" rel="noopener"
 &gt;MInference (arXiv 2407.02490)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Integrations&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://python.langchain.com/docs/integrations/document_transformers/llmlingua" target="_blank" rel="noopener"
 &gt;LangChain LLMLinguaCompressor&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://docs.llamaindex.ai/en/stable/examples/node_postprocessor/LongLLMLingua/" target="_blank" rel="noopener"
 &gt;LlamaIndex LongLLMLingua postprocessor&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://microsoft.github.io/promptflow/" target="_blank" rel="noopener"
 &gt;Microsoft Prompt flow&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item></channel></rss>