<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Inference on ICE-ICE-BEAR-BLOG</title><link>https://ice-ice-bear.github.io/tags/inference/</link><description>Recent content in Inference on ICE-ICE-BEAR-BLOG</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Thu, 07 May 2026 00:00:00 +0900</lastBuildDate><atom:link href="https://ice-ice-bear.github.io/tags/inference/index.xml" rel="self" type="application/rss+xml"/><item><title>Pushing Qwen3.5-122B from 28.3 to 51 tok per second on a single DGX Spark</title><link>https://ice-ice-bear.github.io/posts/2026-05-07-dgx-spark-qwen35-inference-tuning/</link><pubDate>Thu, 07 May 2026 00:00:00 +0900</pubDate><guid>https://ice-ice-bear.github.io/posts/2026-05-07-dgx-spark-qwen35-inference-tuning/</guid><description>&lt;h2 id="overview"&gt;Overview
&lt;/h2&gt;&lt;p&gt;&lt;a class="link" href="https://github.com/albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4" target="_blank" rel="noopener"
 &gt;albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4&lt;/a&gt; is a recipe that pushes &lt;a class="link" href="https://huggingface.co/Qwen/Qwen3.5-122B-A10B" target="_blank" rel="noopener"
 &gt;Qwen3.5-122B-A10B&lt;/a&gt; from 28.3 to 51 tok/s on a single &lt;a class="link" href="https://www.nvidia.com/en-us/products/workstations/dgx-spark/" target="_blank" rel="noopener"
 &gt;NVIDIA DGX Spark&lt;/a&gt;, an 80 percent gain. It stacks five orthogonal techniques on top of vLLM 0.19: AutoRound INT4 quantization, an FP8 dense-layer hybrid, MTP-2 speculative decoding, an INT8 LM head, and optional TurboQuant KV cache compression — all while preserving 256K context. Apache 2.0, 171 stars on GitHub. The interesting question it answers in the affirmative: can a single workstation actually serve a 100B-class MoE model at production speed?&lt;/p&gt;
&lt;pre class="mermaid" style="visibility:hidden"&gt;flowchart LR
 Base["Baseline &amp;lt;br/&amp;gt; 28.3 tok/s"] --&gt; S1["+ Hybrid INT4+FP8 &amp;lt;br/&amp;gt; 30.8 tok/s"]
 S1 --&gt; S2["+ MTP-2 Speculative &amp;lt;br/&amp;gt; 38.4 tok/s"]
 S2 --&gt; V2["v2: + INT8 LM Head &amp;lt;br/&amp;gt; 51 tok/s"]
 V2 --&gt; TQ["v2-tq: + TurboQuant KV &amp;lt;br/&amp;gt; 39 tok/s &amp;lt;br/&amp;gt; 1.4M KV"]&lt;/pre&gt;&lt;h2 id="results"&gt;Results
&lt;/h2&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Build&lt;/th&gt;
 &lt;th&gt;tok/s&lt;/th&gt;
 &lt;th&gt;Gain&lt;/th&gt;
 &lt;th&gt;Image&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Baseline (vLLM 0.19 + AutoRound INT4 + FlashInfer)&lt;/td&gt;
 &lt;td&gt;28.3&lt;/td&gt;
 &lt;td&gt;—&lt;/td&gt;
 &lt;td&gt;—&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;+ Hybrid INT4+FP8 dense layers&lt;/td&gt;
 &lt;td&gt;30.8&lt;/td&gt;
 &lt;td&gt;+8.8%&lt;/td&gt;
 &lt;td&gt;step 1&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;+ MTP-2 speculative decoding&lt;/td&gt;
 &lt;td&gt;38.4&lt;/td&gt;
 &lt;td&gt;+35.7%&lt;/td&gt;
 &lt;td&gt;step 2&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;v2&lt;/strong&gt; (+ INT8 LM head v2)&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;51&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;+80%&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;Dockerfile.v2&lt;/code&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;v2-tq (+ TurboQuant KV cache)&lt;/td&gt;
 &lt;td&gt;39&lt;/td&gt;
 &lt;td&gt;+38%&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;Dockerfile.v2-tq&lt;/code&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The same stack pushes Qwen3.5-35B-A3B (the smaller sibling) to 112 tok/s.&lt;/p&gt;
&lt;h3 id="256k-context-tradeoff"&gt;256K context tradeoff
&lt;/h3&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Build&lt;/th&gt;
 &lt;th&gt;KV cache&lt;/th&gt;
 &lt;th&gt;Concurrent users at 256K context&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;v2 (standard)&lt;/td&gt;
 &lt;td&gt;355K tokens&lt;/td&gt;
 &lt;td&gt;1&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;v2-tq (TurboQuant)&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;1.4M tokens&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;5&lt;/strong&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id="the-model-in-one-paragraph"&gt;The model in one paragraph
&lt;/h2&gt;&lt;p&gt;&lt;a class="link" href="https://huggingface.co/Qwen/Qwen3.5-122B-A10B" target="_blank" rel="noopener"
 &gt;Qwen3.5-122B-A10B&lt;/a&gt; is a hybrid MoE that activates 10B of its 122B parameters per token: 256 experts with 8 routed plus 1 shared, 48 layers alternating Gated DeltaNet and Gated Attention at a 12:1 ratio, native 262K context (extensible to 1M with YaRN), Apache 2.0. The starting point for this recipe is &lt;a class="link" href="https://huggingface.co/Intel/Qwen3.5-122B-A10B-int4-AutoRound" target="_blank" rel="noopener"
 &gt;Intel/Qwen3.5-122B-A10B-int4-AutoRound&lt;/a&gt;, produced with &lt;a class="link" href="https://github.com/intel/auto-round" target="_blank" rel="noopener"
 &gt;Intel AutoRound&lt;/a&gt; at group size 128 with &lt;code&gt;shared_expert&lt;/code&gt; left out of quantization.&lt;/p&gt;
&lt;h2 id="the-five-techniques"&gt;The five techniques
&lt;/h2&gt;&lt;h3 id="1-hybrid-int4--fp8-dense-layers-9"&gt;1. Hybrid INT4 + FP8 dense layers (+9%)
&lt;/h3&gt;&lt;p&gt;Replace the BF16 shared-expert weights inside the AutoRound INT4 model with the FP8 weights from the official Qwen FP8 checkpoint. Net effect: routed experts stay INT4, dense layers run FP8. Memory and compute drop without touching accuracy.&lt;/p&gt;
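&lt;p&gt;A minimal sketch of the swap, assuming safetensors checkpoints on disk and illustrative tensor names; the actual &lt;code&gt;shared_expert&lt;/code&gt; key patterns (and any FP8 scale tensors) should be checked against both checkpoints:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# Sketch: graft FP8 shared-expert tensors from the official FP8 checkpoint
# into the AutoRound INT4 checkpoint. Paths and key patterns are
# illustrative, not the repo's actual script.
from safetensors.torch import load_file, save_file

int4_state = load_file("qwen35-int4/model.safetensors")
fp8_state = load_file("qwen35-fp8/model.safetensors")

for name, tensor in fp8_state.items():
    # Routed experts stay INT4; only shared-expert weights move to FP8
    # (AutoRound left these in BF16, so this is a straight replacement).
    if "shared_expert" in name:
        int4_state[name] = tensor

save_file(int4_state, "qwen35-hybrid/model.safetensors")&lt;/code&gt;&lt;/pre&gt;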
&lt;h3 id="2-mtp-2-speculative-decoding-36"&gt;2. MTP-2 speculative decoding (+36%)
&lt;/h3&gt;&lt;p&gt;&lt;a class="link" href="https://arxiv.org/abs/2404.19737" target="_blank" rel="noopener"
 &gt;Multi-Token Prediction&lt;/a&gt; generates 2 tokens per step with roughly 80 percent acceptance, lifting throughput from 30.8 to 38.4 tok/s. Notably there is no separate draft model — the main model itself runs multi-head prediction, which keeps the deployment simple.&lt;/p&gt;
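&lt;p&gt;In vLLM this amounts to a &lt;code&gt;speculative_config&lt;/code&gt; on the main model, with no draft-model path. A sketch, assuming the MTP method name vLLM uses for the Qwen3-Next family carries over to Qwen3.5; the repo&amp;rsquo;s Dockerfiles are authoritative for the exact arguments:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# Sketch: MTP-2 speculative decoding in vLLM. The method name
# "qwen3_next_mtp" is an assumption borrowed from Qwen3-Next support.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Intel/Qwen3.5-122B-A10B-int4-AutoRound",
    speculative_config={
        "method": "qwen3_next_mtp",   # MTP heads live in the main model
        "num_speculative_tokens": 2,  # the "2" in MTP-2
    },
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)&lt;/code&gt;&lt;/pre&gt;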
&lt;h3 id="3-int8-lm-head-v2-triton-kernel"&gt;3. INT8 LM head v2 (Triton kernel)
&lt;/h3&gt;&lt;p&gt;Quantizes the final vocabulary projection to INT8 via a custom Triton kernel. This is the biggest jump in the v2 build (38.4 to 51 tok/s). LM heads are usually exempt from quantization, but on models with very large vocabularies the cost is high enough that revisiting the assumption pays off.&lt;/p&gt;
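&lt;p&gt;The payoff comes from vocabulary size: the LM head runs once per generated token, so halving its bytes (BF16 to INT8) directly cuts decode-time memory traffic. A plain-PyTorch sketch of the quantization scheme, with illustrative dimensions (the repo fuses this into a Triton kernel):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# Sketch: per-row symmetric INT8 quantization of the LM head weight.
# Dimensions are illustrative; real Qwen vocabularies run ~150K rows,
# which is why this layer is worth quantizing at all.
import torch

hidden, vocab = 2048, 32_000
w = torch.randn(vocab, hidden)

scale = w.abs().amax(dim=1, keepdim=True) / 127.0            # one scale per vocab row
w_int8 = (w / scale).round().clamp(-127, 127).to(torch.int8)

def lm_head_int8(x):
    # Dequantize-then-matmul for clarity; a real kernel keeps INT8 math
    # in registers and applies the scale once per row.
    return x @ (w_int8.float() * scale).t()

logits = lm_head_int8(torch.randn(1, hidden))
print(logits.shape)  # torch.Size([1, 32000])&lt;/code&gt;&lt;/pre&gt;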
&lt;h3 id="4-turboquant-kv-cache-optional"&gt;4. TurboQuant KV cache (optional)
&lt;/h3&gt;&lt;p&gt;&lt;a class="link" href="https://github.com/microsoft/turbo-quant" target="_blank" rel="noopener"
 &gt;TurboQuant&lt;/a&gt; compresses the KV cache 4x. Absolute throughput drops slightly versus v2, but concurrent 256K-context users go from 1 to 5 — a meaningful tradeoff for long-context multi-tenant workloads.&lt;/p&gt;
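&lt;p&gt;The concurrency math follows directly from the compression factor. A toy calculation reproducing the tradeoff table, with the 4x factor and the v2 KV budget as the only inputs:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# Toy arithmetic behind the v2 vs v2-tq rows: same KV memory budget,
# 4x smaller per-token footprint under TurboQuant.
BASE_KV_TOKENS = 355_000   # tokens the v2 build can cache (repo's table)
COMPRESSION = 4            # TurboQuant's claimed KV compression
CONTEXT = 262_144          # the model's native ~256K window

tq_tokens = BASE_KV_TOKENS * COMPRESSION
print(tq_tokens)                  # 1_420_000, the "1.4M tokens" row
print(BASE_KV_TOKENS // CONTEXT)  # 1 concurrent 256K user on v2
print(tq_tokens // CONTEXT)       # 5 concurrent 256K users on v2-tq&lt;/code&gt;&lt;/pre&gt;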
&lt;h2 id="environment"&gt;Environment
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;vLLM 0.19.1, CUDA 13.0, Docker-based&lt;/li&gt;
&lt;li&gt;Inference stack: &lt;a class="link" href="https://github.com/vllm-project/vllm" target="_blank" rel="noopener"
 &gt;vLLM 0.19&lt;/a&gt; + &lt;a class="link" href="https://github.com/flashinfer-ai/flashinfer" target="_blank" rel="noopener"
 &gt;FlashInfer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Model: &lt;code&gt;Intel/Qwen3.5-122B-A10B-int4-AutoRound&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;One-shot &lt;code&gt;./install.sh&lt;/code&gt; runs steps 0 through 4 and is idempotent&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="insights"&gt;Insights
&lt;/h2&gt;&lt;p&gt;51 tok/s on a 100B-class model from a single workstation lands close to the 60 tok/s zone that feels native in a chat UI, which is the real news here. For a 171-star repo the engineering is unusually tight — bench tables, step-wise Dockerfiles, install.sh, vLLM/CUDA version notes — and you can run it as written. The deeper lesson is that the five techniques are orthogonal: hybrid quant attacks memory and accuracy, MTP attacks decoding parallelism, INT8 LM head attacks compute, and TurboQuant attacks KV memory. The 80 percent number is not one big trick but a sequence of bottleneck migrations. The v2 versus v2-tq split also shows that throughput and concurrency are different axes — pick the build that matches your workload, not the highest single-stream number. Expect this hybrid-quant plus speculative plus custom-kernel stack to land as a default in vLLM and SGLang within a quarter or two, at which point &amp;ldquo;100B in one box&amp;rdquo; stops being a demo.&lt;/p&gt;
&lt;h2 id="references"&gt;References
&lt;/h2&gt;&lt;h3 id="repo-and-model-cards"&gt;Repo and model cards
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4" target="_blank" rel="noopener"
 &gt;albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4&lt;/a&gt; — 171 stars, Apache 2.0&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://huggingface.co/Qwen/Qwen3.5-122B-A10B" target="_blank" rel="noopener"
 &gt;Qwen/Qwen3.5-122B-A10B&lt;/a&gt; — 122B/10B hybrid MoE, 262K context&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://huggingface.co/Intel/Qwen3.5-122B-A10B-int4-AutoRound" target="_blank" rel="noopener"
 &gt;Intel/Qwen3.5-122B-A10B-int4-AutoRound&lt;/a&gt; — INT4 group128&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.nvidia.com/en-us/products/workstations/dgx-spark/" target="_blank" rel="noopener"
 &gt;NVIDIA DGX Spark&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="inference-frameworks"&gt;Inference frameworks
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/vllm-project/vllm" target="_blank" rel="noopener"
 &gt;vLLM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/flashinfer-ai/flashinfer" target="_blank" rel="noopener"
 &gt;FlashInfer&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="optimization-techniques"&gt;Optimization techniques
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/intel/auto-round" target="_blank" rel="noopener"
 &gt;Intel AutoRound (arXiv:2309.05516)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://arxiv.org/abs/2404.19737" target="_blank" rel="noopener"
 &gt;Multi-Token Prediction (arXiv:2404.19737)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/microsoft/turbo-quant" target="_blank" rel="noopener"
 &gt;TurboQuant&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>The LLMLingua Series — Microsoft's Underrated Prompt Compression Stack</title><link>https://ice-ice-bear.github.io/posts/2026-05-06-llmlingua-series/</link><pubDate>Wed, 06 May 2026 00:00:00 +0900</pubDate><guid>https://ice-ice-bear.github.io/posts/2026-05-06-llmlingua-series/</guid><description>&lt;h2 id="overview"&gt;Overview
&lt;/h2&gt;&lt;p&gt;Someone dropped &lt;a class="link" href="https://github.com/microsoft/LLMLingua" target="_blank" rel="noopener"
 &gt;LLMLingua&lt;/a&gt; in a chat, and another member replied &lt;em&gt;&amp;ldquo;yes, very underrated.&amp;rdquo;&lt;/em&gt; The repo has 6,156 stars, an MIT license, and six papers in the series stretching from EMNLP 2023 through CoLM 2025 — and yet production case studies are surprisingly thin on the ground. Compression of up to 20x with minimal performance loss should be a no-brainer; why isn&amp;rsquo;t adoption faster? Unpack the word &amp;ldquo;underrated&amp;rdquo; from that chat and you find the &lt;strong&gt;research-to-production gap&lt;/strong&gt; in plain sight.&lt;/p&gt;
&lt;pre class="mermaid" style="visibility:hidden"&gt;graph TD
 Origin["LLMLingua &amp;lt;br/&amp;gt; EMNLP 2023"] --&gt; Long["LongLLMLingua &amp;lt;br/&amp;gt; ACL 2024"]
 Origin --&gt; V2["LLMLingua-2 &amp;lt;br/&amp;gt; ACL 2024 Findings"]
 Long --&gt; MInf["MInference &amp;lt;br/&amp;gt; 2024"]
 V2 --&gt; MInf
 MInf --&gt; SCB["SCBench &amp;lt;br/&amp;gt; 2024"]
 SCB --&gt; Sec["SecurityLingua &amp;lt;br/&amp;gt; CoLM 2025"]

 Origin -.-&gt;|small LLM token pruning| Theme1["20x compression"]
 Long -.-&gt;|"lost-in-middle fix"| Theme2["RAG +21.4%"]
 V2 -.-&gt;|GPT-4 distill BERT| Theme3["3-6x faster"]
 MInf -.-&gt;|long-context prefill| Theme4["1M token 10x"]&lt;/pre&gt;&lt;h2 id="six-papers-one-table"&gt;Six Papers, One Table
&lt;/h2&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Paper&lt;/th&gt;
 &lt;th&gt;Year&lt;/th&gt;
 &lt;th&gt;Headline result&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a class="link" href="https://aclanthology.org/2023.emnlp-main.825" target="_blank" rel="noopener"
 &gt;&lt;strong&gt;LLMLingua&lt;/strong&gt;&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;EMNLP 2023&lt;/td&gt;
 &lt;td&gt;Use a small LLM (GPT2-small, LLaMA-7B) to drop low-value tokens — &lt;strong&gt;20x compression&lt;/strong&gt; with minimal quality loss&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a class="link" href="https://aclanthology.org/2024.acl-long.91" target="_blank" rel="noopener"
 &gt;&lt;strong&gt;LongLLMLingua&lt;/strong&gt;&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;ACL 2024&lt;/td&gt;
 &lt;td&gt;Mitigates &amp;ldquo;lost in the middle.&amp;rdquo; RAG accuracy &lt;strong&gt;+21.4%&lt;/strong&gt; at 1/4 the tokens&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a class="link" href="https://aclanthology.org/2024.findings-acl.57" target="_blank" rel="noopener"
 &gt;&lt;strong&gt;LLMLingua-2&lt;/strong&gt;&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;ACL 2024 Findings&lt;/td&gt;
 &lt;td&gt;BERT-class encoder distilled from GPT-4 — &lt;strong&gt;3-6x faster&lt;/strong&gt; and stronger out-of-domain&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a class="link" href="https://arxiv.org/abs/2407.02490" target="_blank" rel="noopener"
 &gt;&lt;strong&gt;MInference&lt;/strong&gt;&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;2024&lt;/td&gt;
 &lt;td&gt;Long-context inference acceleration. &lt;strong&gt;10x prefill on 1M tokens&lt;/strong&gt; on A100&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;SCBench&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;2024&lt;/td&gt;
 &lt;td&gt;A benchmark suite for KV-cache-centric long-context methods&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;SecurityLingua&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;CoLM 2025&lt;/td&gt;
 &lt;td&gt;Compression-based jailbreak defense — SOTA guardrail performance using &lt;strong&gt;100x fewer tokens&lt;/strong&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The full paper list, demos, and blog posts are aggregated on the project page at &lt;a class="link" href="https://llmlingua.com/" target="_blank" rel="noopener"
 &gt;llmlingua.com&lt;/a&gt;.&lt;/p&gt;
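&lt;p&gt;Switching compressors within the series is a constructor argument. For instance, LLMLingua-2&amp;rsquo;s distilled encoder (per the repo&amp;rsquo;s README) is the one to reach for when compressor latency matters:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# LLMLingua-2: BERT-class token classifier distilled from GPT-4 annotations.
from llmlingua import PromptCompressor

llm_lingua = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)
prompt = "Speaker 1: ... (a long transcript or context to compress) ..."
result = llm_lingua.compress_prompt(prompt, rate=0.33, force_tokens=["\n", "?"])
print(result["compressed_prompt"])&lt;/code&gt;&lt;/pre&gt;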
&lt;h2 id="what-you-actually-get"&gt;What You Actually Get
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Cost savings&lt;/strong&gt; — shorter prompt and shorter generation in one move; the only overhead is one small-LLM call&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Extended context&lt;/strong&gt; — sits on top of long-context models, mitigates &amp;ldquo;lost in the middle&amp;rdquo; so the same token budget carries more useful signal&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No retraining&lt;/strong&gt; — the underlying LLM is untouched, only a compressor sits in front of it (true plug-in)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Knowledge preservation&lt;/strong&gt; — designed to keep ICL examples and reasoning chains intact&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;KV-Cache compression&lt;/strong&gt; — drops both inference memory and latency&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Recoverable&lt;/strong&gt; — the authors show GPT-4 can recover the key information from a compressed prompt&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="example-llmlingua-1"&gt;Example (LLMLingua 1)
&lt;/h2&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;llmlingua&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PromptCompressor&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;llm_lingua&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PromptCompressor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm_lingua&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compress_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# {&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# &amp;#39;compressed_prompt&amp;#39;: &amp;#39;...&amp;#39;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# &amp;#39;origin_tokens&amp;#39;: 2365,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# &amp;#39;compressed_tokens&amp;#39;: 211,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# &amp;#39;ratio&amp;#39;: &amp;#39;11.2x&amp;#39;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# &amp;#39;saving&amp;#39;: &amp;#39;, Saving $0.1 in GPT-4.&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# }&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Quantized backends are supported too: &lt;code&gt;TheBloke/Llama-2-7b-Chat-GPTQ&lt;/code&gt; runs the compressor in &lt;strong&gt;under 8GB of GPU memory&lt;/strong&gt;.&lt;/p&gt;
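&lt;p&gt;Per the README, the quantized backend is selected when the compressor is constructed:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# GPTQ-quantized 7B model as the compressor backend (under 8GB of VRAM).
from llmlingua import PromptCompressor

llm_lingua = PromptCompressor(
    "TheBloke/Llama-2-7b-Chat-GPTQ",
    model_config={"revision": "main"},
)&lt;/code&gt;&lt;/pre&gt;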
&lt;h2 id="example-longllmlingua-rag-mode"&gt;Example (LongLLMLingua RAG mode)
&lt;/h2&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;compressed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm_lingua&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compress_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;prompt_list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.55&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;condition_in_question&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;after_condition&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;reorder_context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;sort&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;dynamic_context_compression_ratio&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;condition_compare&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;context_budget&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;+100&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Retrieved chunks are sorted under the question condition and the compression rate is varied dynamically by position — that combination is what drives the RAG accuracy gain.&lt;/p&gt;
&lt;h2 id="integrations"&gt;Integrations
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://python.langchain.com/docs/integrations/document_transformers/llmlingua" target="_blank" rel="noopener"
 &gt;LangChain retriever integration&lt;/a&gt; — drop &lt;code&gt;LLMLinguaCompressor&lt;/code&gt; into a &lt;code&gt;ContextualCompressionRetriever&lt;/code&gt; and you&amp;rsquo;re done (minimal sketch after this list)&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://docs.llamaindex.ai/en/stable/examples/node_postprocessor/LongLLMLingua/" target="_blank" rel="noopener"
 &gt;LlamaIndex node postprocessor&lt;/a&gt; — bolts onto the tail of any query engine pipeline&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://microsoft.github.io/promptflow/" target="_blank" rel="noopener"
 &gt;Microsoft Prompt flow integration&lt;/a&gt; — works as a standard node inside Azure environments&lt;/li&gt;
&lt;/ul&gt;
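&lt;p&gt;A minimal LangChain sketch, assuming a &lt;code&gt;retriever&lt;/code&gt; already exists; class names follow the integration docs linked above:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# Sketch: LLMLingua as a document compressor in front of any retriever.
# `retriever` is assumed to exist (e.g. a vector-store retriever).
from langchain.retrievers import ContextualCompressionRetriever
from langchain_community.document_compressors import LLMLinguaCompressor

compressor = LLMLinguaCompressor(model_name="openai-community/gpt2", device_map="cpu")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)
docs = compression_retriever.invoke("your question here")&lt;/code&gt;&lt;/pre&gt;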
&lt;h2 id="insights"&gt;Insights
&lt;/h2&gt;&lt;p&gt;The chat&amp;rsquo;s one-word verdict — &lt;em&gt;&amp;ldquo;underrated&amp;rdquo;&lt;/em&gt; — is exactly right. &lt;strong&gt;Six papers stacked, integrations across LangChain, LlamaIndex, and Prompt flow, and a 3x to 10x cost cut the moment you wire it in — yet production case studies remain rare.&lt;/strong&gt; A few likely reasons. First, compressed prompts are hard to debug — humans struggle to trace &amp;ldquo;why was that token dropped?&amp;rdquo;, which makes regression testing painful. Second, the compressor itself is another small-LLM call, so latency-tight realtime systems can&amp;rsquo;t easily afford it. Third, the ROI has only become obvious now that GPT-5 and Claude 4.x have made per-token cost a real budget line, and ops teams&amp;rsquo; awareness hasn&amp;rsquo;t caught up yet. Tellingly, OpenAI&amp;rsquo;s Privacy Filter (reversible tokenization) surfaced right alongside this — compression, pseudonymization, recovery, and KV-cache management are all clearly consolidating into a production tooling layer. &lt;strong&gt;agentmemory + agent-skills + LLMLingua = the agent context-management stack&lt;/strong&gt; that&amp;rsquo;s quietly assembling itself. Net read: when a high-performance tool stays underused, the bottleneck is usually the integration layer&amp;rsquo;s maturity, not the tool.&lt;/p&gt;
&lt;h2 id="references"&gt;References
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Repo and demos&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/microsoft/LLMLingua" target="_blank" rel="noopener"
 &gt;microsoft/LLMLingua&lt;/a&gt; — main GitHub repo (6,156 stars, MIT)&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://llmlingua.com/" target="_blank" rel="noopener"
 &gt;llmlingua.com&lt;/a&gt; — project hub (papers, demos, posts)&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://huggingface.co/spaces/microsoft/LLMLingua" target="_blank" rel="noopener"
 &gt;HuggingFace LLMLingua demo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://huggingface.co/spaces/microsoft/LLMLingua-2" target="_blank" rel="noopener"
 &gt;HuggingFace LLMLingua-2 demo&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Papers&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://aclanthology.org/2023.emnlp-main.825" target="_blank" rel="noopener"
 &gt;LLMLingua (EMNLP 2023)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://aclanthology.org/2024.acl-long.91" target="_blank" rel="noopener"
 &gt;LongLLMLingua (ACL 2024)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://aclanthology.org/2024.findings-acl.57" target="_blank" rel="noopener"
 &gt;LLMLingua-2 (ACL 2024 Findings)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://arxiv.org/abs/2407.02490" target="_blank" rel="noopener"
 &gt;MInference (arXiv 2407.02490)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Integrations&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://python.langchain.com/docs/integrations/document_transformers/llmlingua" target="_blank" rel="noopener"
 &gt;LangChain LLMLinguaCompressor&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://docs.llamaindex.ai/en/stable/examples/node_postprocessor/LongLLMLingua/" target="_blank" rel="noopener"
 &gt;LlamaIndex LongLLMLingua postprocessor&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://microsoft.github.io/promptflow/" target="_blank" rel="noopener"
 &gt;Microsoft Prompt flow&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item></channel></rss>