<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Microsoft on ICE-ICE-BEAR-BLOG</title><link>https://ice-ice-bear.github.io/tags/microsoft/</link><description>Recent content in Microsoft on ICE-ICE-BEAR-BLOG</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Wed, 06 May 2026 00:00:00 +0900</lastBuildDate><atom:link href="https://ice-ice-bear.github.io/tags/microsoft/index.xml" rel="self" type="application/rss+xml"/><item><title>The LLMLingua Series — Microsoft's Underrated Prompt Compression Stack</title><link>https://ice-ice-bear.github.io/posts/2026-05-06-llmlingua-series/</link><pubDate>Wed, 06 May 2026 00:00:00 +0900</pubDate><guid>https://ice-ice-bear.github.io/posts/2026-05-06-llmlingua-series/</guid><description>&lt;h2 id="overview"&gt;Overview
&lt;/h2&gt;&lt;p&gt;Someone dropped &lt;a class="link" href="https://github.com/microsoft/LLMLingua" target="_blank" rel="noopener"
 &gt;LLMLingua&lt;/a&gt; in a chat, and another member replied &lt;em&gt;&amp;ldquo;yes, very underrated.&amp;rdquo;&lt;/em&gt; The repo has 6,156 stars, an MIT license, and six papers in the series stretching from EMNLP 2023 through CoLM 2025, and yet production case studies are surprisingly thin on the ground. Compression of up to 20x with minimal performance loss should be a no-brainer, so why isn&amp;rsquo;t adoption faster? Unpack the word &amp;ldquo;underrated&amp;rdquo; from that chat and you find the &lt;strong&gt;research-to-production gap&lt;/strong&gt; in plain sight.&lt;/p&gt;
&lt;pre class="mermaid" style="visibility:hidden"&gt;graph TD
 Origin["LLMLingua &amp;lt;br/&amp;gt; EMNLP 2023"] --&gt; Long["LongLLMLingua &amp;lt;br/&amp;gt; ACL 2024"]
 Origin --&gt; V2["LLMLingua-2 &amp;lt;br/&amp;gt; ACL 2024 Findings"]
 Long --&gt; MInf["MInference &amp;lt;br/&amp;gt; 2024"]
 V2 --&gt; MInf
 MInf --&gt; SCB["SCBench &amp;lt;br/&amp;gt; 2024"]
 SCB --&gt; Sec["SecurityLingua &amp;lt;br/&amp;gt; CoLM 2025"]

 Origin -.-&gt;|small LLM token pruning| Theme1["20x compression"]
 Long -.-&gt;|"lost-in-middle fix"| Theme2["RAG +21.4%"]
 V2 -.-&gt;|GPT-4 distill BERT| Theme3["3-6x faster"]
 MInf -.-&gt;|long-context prefill| Theme4["1M token 10x"]&lt;/pre&gt;&lt;h2 id="six-papers-one-table"&gt;Six Papers, One Table
&lt;/h2&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Paper&lt;/th&gt;
 &lt;th&gt;Year&lt;/th&gt;
 &lt;th&gt;Headline result&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a class="link" href="https://aclanthology.org/2023.emnlp-main.825" target="_blank" rel="noopener"
 &gt;&lt;strong&gt;LLMLingua&lt;/strong&gt;&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;EMNLP 2023&lt;/td&gt;
 &lt;td&gt;Use a small LLM (GPT2-small, LLaMA-7B) to drop low-value tokens — &lt;strong&gt;20x compression&lt;/strong&gt; with minimal quality loss&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a class="link" href="https://aclanthology.org/2024.acl-long.91" target="_blank" rel="noopener"
 &gt;&lt;strong&gt;LongLLMLingua&lt;/strong&gt;&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;ACL 2024&lt;/td&gt;
 &lt;td&gt;Mitigates &amp;ldquo;lost in the middle.&amp;rdquo; RAG accuracy &lt;strong&gt;+21.4%&lt;/strong&gt; at 1/4 the tokens&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a class="link" href="https://aclanthology.org/2024.findings-acl.57" target="_blank" rel="noopener"
 &gt;&lt;strong&gt;LLMLingua-2&lt;/strong&gt;&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;ACL 2024 Findings&lt;/td&gt;
 &lt;td&gt;BERT-class encoder distilled from GPT-4 — &lt;strong&gt;3-6x faster&lt;/strong&gt; and stronger out-of-domain&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a class="link" href="https://arxiv.org/abs/2407.02490" target="_blank" rel="noopener"
 &gt;&lt;strong&gt;MInference&lt;/strong&gt;&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;2024&lt;/td&gt;
 &lt;td&gt;Long-context inference acceleration: &lt;strong&gt;10x faster prefill at 1M tokens&lt;/strong&gt; on an A100&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;SCBench&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;2024&lt;/td&gt;
 &lt;td&gt;A benchmark suite for KV-cache-centric long-context methods&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;SecurityLingua&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;CoLM 2025&lt;/td&gt;
 &lt;td&gt;Compression-based jailbreak defense — SOTA guardrail performance using &lt;strong&gt;100x fewer tokens&lt;/strong&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The full paper list, demos, and blog posts are aggregated on the project page at &lt;a class="link" href="https://llmlingua.com/" target="_blank" rel="noopener"
 &gt;llmlingua.com&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="what-you-actually-get"&gt;What You Actually Get
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Cost savings&lt;/strong&gt; — shorter prompt and shorter generation in one move; the only overhead is one small-LLM call&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Extended context&lt;/strong&gt; — sits on top of long-context models, mitigates &amp;ldquo;lost in the middle&amp;rdquo; so the same token budget carries more useful signal&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No retraining&lt;/strong&gt; — the underlying LLM is untouched, only a compressor sits in front of it (true plug-in)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Knowledge preservation&lt;/strong&gt; — designed to keep ICL examples and reasoning chains intact&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;KV-Cache compression&lt;/strong&gt; — drops both inference memory and latency&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Recoverable&lt;/strong&gt; — the authors show GPT-4 can recover the key information from a compressed prompt&lt;/li&gt;
&lt;/ul&gt;
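&lt;p&gt;The cost-savings bullet is just arithmetic. A back-of-the-envelope sketch, with placeholder prices that are purely illustrative (none of these rates come from the papers):&lt;/p&gt;

```python
# Back-of-the-envelope: compression shrinks the big-model input bill,
# at the cost of one extra small-LLM pass over the original prompt.
# Prices below are illustrative placeholders, not real per-token rates.
PRICE_PER_1K_BIG = 0.01      # hypothetical large-model input price (USD / 1K tokens)
PRICE_PER_1K_SMALL = 0.0005  # hypothetical compressor-model price (USD / 1K tokens)

def net_saving(origin_tokens: int, ratio: float) -> float:
    """USD saved on the large-model input bill, minus the compressor call."""
    compressed = origin_tokens / ratio
    before = origin_tokens / 1000 * PRICE_PER_1K_BIG
    after = compressed / 1000 * PRICE_PER_1K_BIG
    overhead = origin_tokens / 1000 * PRICE_PER_1K_SMALL  # one small-LLM pass
    return before - after - overhead

# 2365 tokens compressed 11.2x, the figures from the example below
print(round(net_saving(2365, 11.2), 4))  # → 0.0204
```

&lt;p&gt;The point of the sketch: as long as the compressor&amp;rsquo;s per-token price is a small fraction of the big model&amp;rsquo;s, the overhead term stays negligible at any meaningful compression ratio.&lt;/p&gt;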
&lt;h2 id="example-llmlingua-1"&gt;Example (LLMLingua 1)
&lt;/h2&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;llmlingua&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PromptCompressor&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;llm_lingua&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PromptCompressor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm_lingua&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compress_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# {&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# &amp;#39;compressed_prompt&amp;#39;: &amp;#39;...&amp;#39;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# &amp;#39;origin_tokens&amp;#39;: 2365,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# &amp;#39;compressed_tokens&amp;#39;: 211,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# &amp;#39;ratio&amp;#39;: &amp;#39;11.2x&amp;#39;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# &amp;#39;saving&amp;#39;: &amp;#39;, Saving $0.1 in GPT-4.&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# }&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Quantized backends are supported too: &lt;code&gt;TheBloke/Llama-2-7b-Chat-GPTQ&lt;/code&gt; runs the compressor in &lt;strong&gt;under 8GB of GPU memory&lt;/strong&gt;.&lt;/p&gt;
&lt;h2 id="example-longllmlingua-rag-mode"&gt;Example (LongLLMLingua RAG mode)
&lt;/h2&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;compressed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm_lingua&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compress_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;prompt_list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.55&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;condition_in_question&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;after_condition&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;reorder_context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;sort&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;dynamic_context_compression_ratio&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;condition_compare&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;context_budget&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;+100&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Retrieved chunks are sorted under the question condition and the compression rate is varied dynamically by position — that combination is what drives the RAG accuracy gain.&lt;/p&gt;
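&lt;p&gt;The positional idea can be illustrated without the library. A toy sketch (my own illustration, not LongLLMLingua&amp;rsquo;s actual algorithm) of what &lt;code&gt;dynamic_context_compression_ratio&lt;/code&gt; gestures at: chunks sorted most-relevant-first receive a larger slice of the token budget:&lt;/p&gt;

```python
# Toy illustration of position-aware budgeting (my own sketch, not
# LongLLMLingua's actual algorithm): chunks sorted most-relevant-first
# get a linearly larger slice of the total token budget.
def positional_budgets(n_chunks: int, total_budget: int, tilt: float = 0.3):
    # weights run from 1 + tilt (rank 0) down to 1 - tilt (last rank)
    weights = [1 + tilt - 2 * tilt * i / max(n_chunks - 1, 1) for i in range(n_chunks)]
    scale = total_budget / sum(weights)
    return [round(w * scale) for w in weights]

print(positional_budgets(4, 400))  # → [130, 110, 90, 70]
```

&lt;p&gt;Front-loaded chunks keep more of their tokens, which is exactly how sorting plus position-dependent compression pushes the most relevant evidence out of the lossy middle of the context window.&lt;/p&gt;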
&lt;h2 id="integrations"&gt;Integrations
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://python.langchain.com/docs/integrations/document_transformers/llmlingua" target="_blank" rel="noopener"
 &gt;LangChain retriever integration&lt;/a&gt; — drop &lt;code&gt;LLMLinguaCompressor&lt;/code&gt; into a &lt;code&gt;ContextualCompressionRetriever&lt;/code&gt; and you&amp;rsquo;re done&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://docs.llamaindex.ai/en/stable/examples/node_postprocessor/LongLLMLingua/" target="_blank" rel="noopener"
 &gt;LlamaIndex node postprocessor&lt;/a&gt; — bolts onto the tail of any query engine pipeline&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://microsoft.github.io/promptflow/" target="_blank" rel="noopener"
 &gt;Microsoft Prompt flow integration&lt;/a&gt; — works as a standard node inside Azure environments&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="insights"&gt;Insights
&lt;/h2&gt;&lt;p&gt;The chat&amp;rsquo;s one-word verdict, &lt;em&gt;&amp;ldquo;underrated,&amp;rdquo;&lt;/em&gt; is exactly right. &lt;strong&gt;Six papers stacked, integrations across LangChain, LlamaIndex, and Prompt flow, and a 3x to 10x cost cut the moment you wire it in; yet production case studies remain rare.&lt;/strong&gt; A few likely reasons. First, compressed prompts are hard to debug: humans struggle to trace why a given token was dropped, which makes regression testing painful. Second, the compressor is itself another small-LLM call, so latency-tight realtime systems can&amp;rsquo;t easily afford it. Third, the ROI has only become obvious now that GPT-5 and Claude 4.x have made per-token cost a real budget line, and ops teams haven&amp;rsquo;t yet caught up. Tellingly, OpenAI&amp;rsquo;s Privacy Filter (reversible tokenization) surfaced right alongside this: compression, pseudonymization, recovery, and KV-cache management are all bifurcating into a production tooling layer. &lt;strong&gt;agentmemory + agent-skills + LLMLingua = the agent context-management stack&lt;/strong&gt; that&amp;rsquo;s quietly assembling itself. Net read: when a high-performance tool stays underused, the bottleneck is usually the maturity of the integration layer, not the tool.&lt;/p&gt;
&lt;h2 id="references"&gt;References
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Repo and demos&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/microsoft/LLMLingua" target="_blank" rel="noopener"
 &gt;microsoft/LLMLingua&lt;/a&gt; — main GitHub repo (6,156 stars, MIT)&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://llmlingua.com/" target="_blank" rel="noopener"
 &gt;llmlingua.com&lt;/a&gt; — project hub (papers, demos, posts)&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://huggingface.co/spaces/microsoft/LLMLingua" target="_blank" rel="noopener"
 &gt;HuggingFace LLMLingua demo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://huggingface.co/spaces/microsoft/LLMLingua-2" target="_blank" rel="noopener"
 &gt;HuggingFace LLMLingua-2 demo&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Papers&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://aclanthology.org/2023.emnlp-main.825" target="_blank" rel="noopener"
 &gt;LLMLingua (EMNLP 2023)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://aclanthology.org/2024.acl-long.91" target="_blank" rel="noopener"
 &gt;LongLLMLingua (ACL 2024)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://aclanthology.org/2024.findings-acl.57" target="_blank" rel="noopener"
 &gt;LLMLingua-2 (ACL 2024 Findings)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://arxiv.org/abs/2407.02490" target="_blank" rel="noopener"
 &gt;MInference (arXiv 2407.02490)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Integrations&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://python.langchain.com/docs/integrations/document_transformers/llmlingua" target="_blank" rel="noopener"
 &gt;LangChain LLMLinguaCompressor&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://docs.llamaindex.ai/en/stable/examples/node_postprocessor/LongLLMLingua/" target="_blank" rel="noopener"
 &gt;LlamaIndex LongLLMLingua postprocessor&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://microsoft.github.io/promptflow/" target="_blank" rel="noopener"
 &gt;Microsoft Prompt flow&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item></channel></rss>