<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Inference on ICE-ICE-BEAR-BLOG</title><link>https://ice-ice-bear.github.io/ko/tags/inference/</link><description>Recent content in Inference on ICE-ICE-BEAR-BLOG</description><generator>Hugo -- gohugo.io</generator><language>ko</language><lastBuildDate>Thu, 07 May 2026 00:00:00 +0900</lastBuildDate><atom:link href="https://ice-ice-bear.github.io/ko/tags/inference/index.xml" rel="self" type="application/rss+xml"/><item><title>DGX Spark에서 Qwen3.5-122B를 28.3에서 51 tok/s로 끌어올린 추론 최적화 레시피</title><link>https://ice-ice-bear.github.io/ko/posts/2026-05-07-dgx-spark-qwen35-inference-tuning/</link><pubDate>Thu, 07 May 2026 00:00:00 +0900</pubDate><guid>https://ice-ice-bear.github.io/ko/posts/2026-05-07-dgx-spark-qwen35-inference-tuning/</guid><description>&lt;img src="https://ice-ice-bear.github.io/" alt="Featured image of post DGX Spark에서 Qwen3.5-122B를 28.3에서 51 tok/s로 끌어올린 추론 최적화 레시피" /&gt;&lt;h2 id="개요"&gt;개요
&lt;/h2&gt;&lt;p&gt;&lt;a class="link" href="https://github.com/albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4" target="_blank" rel="noopener"
 &gt;albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4&lt;/a&gt; is a recipe that, on a single &lt;a class="link" href="https://www.nvidia.com/en-us/products/workstations/dgx-spark/" target="_blank" rel="noopener"
 &gt;NVIDIA DGX Spark&lt;/a&gt; box, lifts &lt;a class="link" href="https://huggingface.co/Qwen/Qwen3.5-122B-A10B" target="_blank" rel="noopener"
 &gt;Qwen3.5-122B-A10B&lt;/a&gt; from 28.3 to 51 tok/s, an 80 percent gain. Five techniques are stacked in order: INT4 quantization, an FP8 dense-layer hybrid, MTP-2 speculative decoding, an INT8 LM head, and a TurboQuant KV cache, and the 256K context is preserved. Apache 2.0, 171 GitHub stars. It is a strong yes to the question &amp;ldquo;can a 100B-class MoE model run at production level on a single workstation?&amp;rdquo;&lt;/p&gt;
&lt;pre class="mermaid" style="visibility:hidden"&gt;flowchart LR
 Base["Baseline &amp;lt;br/&amp;gt; 28.3 tok/s"] --&gt; S1["+ Hybrid INT4+FP8 &amp;lt;br/&amp;gt; 30.8 tok/s"]
 S1 --&gt; S2["+ MTP-2 Speculative &amp;lt;br/&amp;gt; 38.4 tok/s"]
 S2 --&gt; V2["v2: + INT8 LM Head &amp;lt;br/&amp;gt; 51 tok/s"]
 V2 --&gt; TQ["v2-tq: + TurboQuant KV &amp;lt;br/&amp;gt; 39 tok/s &amp;lt;br/&amp;gt; 1.4M KV"]&lt;/pre&gt;&lt;h2 id="결과-표"&gt;Results
&lt;/h2&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Configuration&lt;/th&gt;
 &lt;th&gt;tok/s&lt;/th&gt;
 &lt;th&gt;Gain&lt;/th&gt;
 &lt;th&gt;Build&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Baseline (vLLM 0.19 + AutoRound INT4 + FlashInfer)&lt;/td&gt;
 &lt;td&gt;28.3&lt;/td&gt;
 &lt;td&gt;—&lt;/td&gt;
 &lt;td&gt;—&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;+ Hybrid INT4+FP8 dense layers&lt;/td&gt;
 &lt;td&gt;30.8&lt;/td&gt;
 &lt;td&gt;+8.8%&lt;/td&gt;
 &lt;td&gt;step 1&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;+ MTP-2 Speculative Decoding&lt;/td&gt;
 &lt;td&gt;38.4&lt;/td&gt;
 &lt;td&gt;+35.7%&lt;/td&gt;
 &lt;td&gt;step 2&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;v2&lt;/strong&gt; (+ INT8 LM Head v2)&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;51&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;+80%&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;Dockerfile.v2&lt;/code&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;v2-tq (+ TurboQuant KV Cache)&lt;/td&gt;
 &lt;td&gt;39&lt;/td&gt;
 &lt;td&gt;+38%&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;Dockerfile.v2-tq&lt;/code&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;With the same optimizations, Qwen3.5-35B-A3B (the smaller sibling) climbs to 112 tok/s.&lt;/p&gt;
&lt;h3 id="256k-context"&gt;256K Context
&lt;/h3&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Setting&lt;/th&gt;
 &lt;th&gt;KV Cache&lt;/th&gt;
 &lt;th&gt;Concurrent 256K users&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;v2 (standard)&lt;/td&gt;
 &lt;td&gt;355K tokens&lt;/td&gt;
 &lt;td&gt;1&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;v2-tq (TurboQuant)&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;1.4M tokens&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;5&lt;/strong&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id="모델-한-줄"&gt;모델 한 줄
&lt;/h2&gt;&lt;p&gt;&lt;a class="link" href="https://huggingface.co/Qwen/Qwen3.5-122B-A10B" target="_blank" rel="noopener"
 &gt;Qwen3.5-122B-A10B&lt;/a&gt; is a hybrid MoE that activates only 10B of its 122B total parameters. It routes 8 of 256 experts plus 1 shared expert, interleaves Gated DeltaNet and Gated Attention at a 12:1 ratio across 48 layers, and supports a native 262K context (1M with YaRN extension). Apache 2.0. The starting point is this model quantized to INT4 with &lt;a class="link" href="https://github.com/intel/auto-round" target="_blank" rel="noopener"
 &gt;Intel AutoRound&lt;/a&gt;: &lt;a class="link" href="https://huggingface.co/Intel/Qwen3.5-122B-A10B-int4-AutoRound" target="_blank" rel="noopener"
 &gt;Intel/Qwen3.5-122B-A10B-int4-AutoRound&lt;/a&gt; (group size 128, shared_expert ignored).&lt;/p&gt;
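&lt;p&gt;For orientation, here is a minimal sketch of an AutoRound INT4 run with those settings. Argument names follow the intel/auto-round README, but the calibration setup and the memory handling needed for a model this large are omitted, so treat it as illustrative rather than the actual export script behind the Intel checkpoint.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# Hedged sketch: AutoRound INT4 quantization at group size 128.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen3.5-122B-A10B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# bits/group_size mirror the Intel checkpoint; sym=True is an assumption
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
autoround.quantize()
autoround.save_quantized("./Qwen3.5-122B-A10B-int4-AutoRound", format="auto_round")
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;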
&lt;h2 id="핵심-기법"&gt;핵심 기법
&lt;/h2&gt;&lt;h3 id="1-hybrid-int4--fp8-dense-layers-9"&gt;1. Hybrid INT4 + FP8 Dense Layers (+9%)
&lt;/h3&gt;&lt;p&gt;Replace the BF16 shared-expert weights in the AutoRound INT4 model with the FP8 weights from the official Qwen checkpoint: only the expert layers stay INT4, while the dense layers run in FP8. This cuts memory and compute at the same time while preserving accuracy.&lt;/p&gt;
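&lt;p&gt;A minimal sketch of the weight swap is below. The checkpoint paths, the single-file layout, and the &lt;code&gt;"experts."&lt;/code&gt; name filter are assumptions for illustration; the repo's own merge script may differ, and carrying FP8 tensors through safetensors requires a recent library version.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# Hedged sketch: graft FP8 dense/shared-expert weights from the official
# checkpoint into the AutoRound INT4 export, leaving routed experts in INT4.
from safetensors.torch import load_file, save_file

int4_ckpt = load_file("int4-autoround/model.safetensors")    # AutoRound INT4 export
fp8_ckpt = load_file("qwen-official-fp8/model.safetensors")  # official FP8 weights

for name, tensor in fp8_ckpt.items():
    # overwrite only non-expert projections (dense / shared_expert) present in both
    if "experts." not in name and name in int4_ckpt:
        int4_ckpt[name] = tensor                             # FP8 replaces the BF16 copy

save_file(int4_ckpt, "hybrid-int4-fp8/model.safetensors")
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;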
&lt;h3 id="2-mtp-2-speculative-decoding-36"&gt;2. MTP-2 Speculative Decoding (+36%)
&lt;/h3&gt;&lt;p&gt;&lt;a class="link" href="https://arxiv.org/abs/2404.19737" target="_blank" rel="noopener"
 &gt;Multi-Token Prediction&lt;/a&gt; is used to predict 2 tokens per step. With an accept rate of roughly 80 percent, this is the step where decode throughput jumps the most. Notably, no separate small draft model is run; the main model itself makes the multi-head predictions.&lt;/p&gt;
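&lt;p&gt;To make the mechanism concrete, here is a greedy-acceptance sketch of one MTP-2 step. The &lt;code&gt;propose_mtp&lt;/code&gt; call stands in for the model's MTP heads and is a made-up API; real speculative decoding (e.g. in vLLM) verifies against sampling distributions rather than a plain argmax match.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# Hedged sketch of one MTP-style speculative decoding step (greedy acceptance).
import torch

def speculative_step(model, input_ids, k=2):
    # 1) draft: the model's MTP heads propose the next k tokens (hypothetical API)
    draft = model.propose_mtp(input_ids, num_tokens=k)           # shape [k]
    # 2) verify: one forward pass over prompt + draft tokens
    logits = model(torch.cat([input_ids, draft]).unsqueeze(0)).logits[0]
    verify = logits[-(k + 1):-1].argmax(dim=-1)                  # model's own choices
    # 3) accept the longest matching prefix of the draft
    n_accept = 0
    for j in range(k):
        if draft[j] != verify[j]:
            break
        n_accept = j + 1
    # 4) append one corrected/bonus token from the verification pass
    bonus = logits[input_ids.shape[0] + n_accept - 1].argmax().unsqueeze(0)
    return torch.cat([input_ids, draft[:n_accept], bonus])
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;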
&lt;h3 id="3-int8-lm-head-v2-triton-커널"&gt;3. INT8 LM Head v2 (Triton 커널)
&lt;/h3&gt;&lt;p&gt;The LM head, the final token-vocabulary projection layer, is quantized to INT8. It is implemented as a custom Triton kernel and delivers the biggest jump in the v2 build (38.4 → 51 tok/s). The LM head is usually left out of quantization, but this confirms again that the larger the vocabulary, the more it matters.&lt;/p&gt;
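&lt;p&gt;The quantization scheme itself is simple; the speedup comes from the fused kernel. Below is a PyTorch-only sketch of per-row INT8 quantization of the LM head, purely illustrative and not the repo's Triton implementation.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# Hedged sketch: symmetric per-vocab-row INT8 quantization of the LM head.
import torch

def quantize_lm_head(weight):                       # weight: [vocab, hidden]
    scale = (weight.abs().amax(dim=1, keepdim=True) / 127.0).clamp_min(1e-8)
    w_int8 = torch.clamp((weight / scale).round(), -127, 127).to(torch.int8)
    return w_int8, scale

def lm_head_int8(hidden, w_int8, scale):
    # dequantize-on-the-fly matmul; a real Triton kernel fuses this instead
    return hidden @ (w_int8.to(hidden.dtype) * scale).t()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;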
&lt;h3 id="4-turboquant-kv-cache-선택"&gt;4. TurboQuant KV Cache (선택)
&lt;/h3&gt;&lt;p&gt;&lt;a class="link" href="https://github.com/microsoft/turbo-quant" target="_blank" rel="noopener"
 &gt;TurboQuant&lt;/a&gt; compresses the KV cache 4x. Absolute throughput drops slightly versus v2, but concurrent 256K-context users go from 1 to 5. A meaningful trade-off for long-context, multi-tenant scenarios.&lt;/p&gt;
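&lt;p&gt;The 4x figure comes from storing K/V in roughly 4 bits per value instead of 16. A per-token, per-head sketch of that idea is below; TurboQuant's actual transform is more sophisticated, and the 2-values-per-byte packing that realizes the memory saving is omitted for clarity.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# Hedged sketch of 4-bit KV-cache quantization (per-token, per-head scales).
import torch

def quantize_kv(kv):                              # kv: [tokens, heads, head_dim]
    scale = (kv.abs().amax(dim=-1, keepdim=True) / 7.0).clamp_min(1e-8)
    q = torch.clamp((kv / scale).round(), -8, 7)  # 16 levels, i.e. 4 bits of information
    return q.to(torch.int8), scale                # packing two values per byte omitted

def dequantize_kv(q, scale, dtype=torch.bfloat16):
    return q.to(dtype) * scale
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;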
&lt;h2 id="환경"&gt;환경
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;vLLM 0.19.1, CUDA 13.0, Docker-based&lt;/li&gt;
&lt;li&gt;Inference engine: &lt;a class="link" href="https://github.com/vllm-project/vllm" target="_blank" rel="noopener"
 &gt;vLLM 0.19&lt;/a&gt; + &lt;a class="link" href="https://github.com/flashinfer-ai/flashinfer" target="_blank" rel="noopener"
 &gt;FlashInfer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;모델: &lt;code&gt;Intel/Qwen3.5-122B-A10B-int4-AutoRound&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;One &lt;code&gt;./install.sh&lt;/code&gt; run automates Steps 0~4 (idempotent)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="인사이트"&gt;인사이트
&lt;/h2&gt;&lt;p&gt;Running a 100B-class model at 51 tok/s on a single workstation means it is nearly at production response speed (around 60 tok/s). For a 171-star recipe it is remarkably well put together: benchmark tables, per-step Dockerfiles, install.sh, and vLLM/CUDA version compatibility are all there, and you can reproduce it as-is. The interesting part is that the five techniques are orthogonal: hybrid quantization attacks memory and accuracy, MTP attacks decoding parallelism, the INT8 LM head attacks compute, and TurboQuant attacks KV memory. The 80 percent gain is not one spot squeezed hard but the sum of moving the bottleneck one step at a time. And as v2-tq shows, throughput and concurrent users are different axes, so the right build depends on the workload. Within the next quarter or so, this hybrid-quant + speculative + custom-kernel stack is likely to land as standard in vLLM/SGLang, and &amp;ldquo;a 100B model in one box&amp;rdquo; keeps shifting from demo to default.&lt;/p&gt;
&lt;h2 id="참고"&gt;참고
&lt;/h2&gt;&lt;h3 id="repo-and-model-cards"&gt;Repo and model cards
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4" target="_blank" rel="noopener"
 &gt;albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4&lt;/a&gt; — 171 stars, Apache 2.0&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://huggingface.co/Qwen/Qwen3.5-122B-A10B" target="_blank" rel="noopener"
 &gt;Qwen/Qwen3.5-122B-A10B&lt;/a&gt; — 122B/10B hybrid MoE, 262K context&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://huggingface.co/Intel/Qwen3.5-122B-A10B-int4-AutoRound" target="_blank" rel="noopener"
 &gt;Intel/Qwen3.5-122B-A10B-int4-AutoRound&lt;/a&gt; — INT4 group128 quantized&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.nvidia.com/en-us/products/workstations/dgx-spark/" target="_blank" rel="noopener"
 &gt;NVIDIA DGX Spark&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="inference-frameworks"&gt;Inference frameworks
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/vllm-project/vllm" target="_blank" rel="noopener"
 &gt;vLLM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/flashinfer-ai/flashinfer" target="_blank" rel="noopener"
 &gt;FlashInfer&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="optimization-techniques"&gt;Optimization techniques
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/intel/auto-round" target="_blank" rel="noopener"
 &gt;Intel AutoRound (arXiv:2309.05516)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://arxiv.org/abs/2404.19737" target="_blank" rel="noopener"
 &gt;Multi-Token Prediction (arXiv:2404.19737)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/microsoft/turbo-quant" target="_blank" rel="noopener"
 &gt;TurboQuant&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>The LLMLingua series: Microsoft's underrated tool that compresses prompts up to 20x</title><link>https://ice-ice-bear.github.io/ko/posts/2026-05-06-llmlingua-series/</link><pubDate>Wed, 06 May 2026 00:00:00 +0900</pubDate><guid>https://ice-ice-bear.github.io/ko/posts/2026-05-06-llmlingua-series/</guid><description>&lt;img src="https://ice-ice-bear.github.io/" alt="Featured image of post The LLMLingua series: Microsoft's underrated tool that compresses prompts up to 20x" /&gt;&lt;h2 id="개요"&gt;Overview
&lt;/h2&gt;&lt;p&gt;In a discussion someone mentioned &lt;a class="link" href="https://github.com/microsoft/LLMLingua" target="_blank" rel="noopener"
 &gt;LLMLingua&lt;/a&gt;, and someone else agreed: &lt;em&gt;&amp;ldquo;Yes, I think it is seriously underrated.&amp;rdquo;&lt;/em&gt; With 6,156 stars, an MIT license, and a series of six papers running from EMNLP'23 to CoLM 2025, it is still a tool for which production usage reports are hard to find. The results are strong, up to 20x compression with almost no loss, so why is production adoption so slow? Unpack the single word &amp;ldquo;underrated&amp;rdquo; and the &lt;strong&gt;gap between research and production&lt;/strong&gt; is laid bare.&lt;/p&gt;
&lt;pre class="mermaid" style="visibility:hidden"&gt;graph TD
 Origin["LLMLingua &amp;lt;br/&amp;gt; EMNLP 2023"] --&gt; Long["LongLLMLingua &amp;lt;br/&amp;gt; ACL 2024"]
 Origin --&gt; V2["LLMLingua-2 &amp;lt;br/&amp;gt; ACL 2024 Findings"]
 Long --&gt; MInf["MInference &amp;lt;br/&amp;gt; 2024"]
 V2 --&gt; MInf
 MInf --&gt; SCB["SCBench &amp;lt;br/&amp;gt; 2024"]
 SCB --&gt; Sec["SecurityLingua &amp;lt;br/&amp;gt; CoLM 2025"]

 Origin -.-&gt;|token removal with a small LLM| Theme1["20x compression"]
 Long -.-&gt;|"mitigates lost-in-the-middle"| Theme2["RAG +21.4%"]
 V2 -.-&gt;|GPT-4 distilled into BERT| Theme3["3-6x faster"]
 MInf -.-&gt;|long-context prefill| Theme4["1M token 10x"]&lt;/pre&gt;&lt;h2 id="시리즈-6편-한-표로"&gt;The six papers in one table
&lt;/h2&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Paper&lt;/th&gt;
 &lt;th&gt;Year&lt;/th&gt;
 &lt;th&gt;Key result&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a class="link" href="https://aclanthology.org/2023.emnlp-main.825" target="_blank" rel="noopener"
 &gt;&lt;strong&gt;LLMLingua&lt;/strong&gt;&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;EMNLP 2023&lt;/td&gt;
 &lt;td&gt;Removes non-essential tokens with a small LLM (GPT2-small, LLaMA-7B, etc.) → &lt;strong&gt;20x compression&lt;/strong&gt; with minimal performance loss&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a class="link" href="https://aclanthology.org/2024.acl-long.91" target="_blank" rel="noopener"
 &gt;&lt;strong&gt;LongLLMLingua&lt;/strong&gt;&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;ACL 2024&lt;/td&gt;
 &lt;td&gt;Mitigates &amp;ldquo;lost in the middle&amp;rdquo;. RAG performance &lt;strong&gt;+21.4%&lt;/strong&gt; with 1/4 the tokens&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a class="link" href="https://aclanthology.org/2024.findings-acl.57" target="_blank" rel="noopener"
 &gt;&lt;strong&gt;LLMLingua-2&lt;/strong&gt;&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;ACL 2024 Findings&lt;/td&gt;
 &lt;td&gt;BERT-level encoder built via GPT-4 distillation. &lt;strong&gt;3-6x faster&lt;/strong&gt; and robust out of domain&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a class="link" href="https://arxiv.org/abs/2407.02490" target="_blank" rel="noopener"
 &gt;&lt;strong&gt;MInference&lt;/strong&gt;&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;2024&lt;/td&gt;
 &lt;td&gt;Accelerates long-context inference. &lt;strong&gt;10x faster 1M-token prefill on an A100&lt;/strong&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;SCBench&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;2024&lt;/td&gt;
 &lt;td&gt;A KV-cache-centric benchmark for evaluating long-context methods&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;SecurityLingua&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;CoLM 2025&lt;/td&gt;
 &lt;td&gt;Jailbreak defense. Compression-based protection using &lt;strong&gt;100x fewer tokens&lt;/strong&gt; than SOTA guardrails&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The original papers and demos are all collected on the project page, &lt;a class="link" href="https://llmlingua.com/" target="_blank" rel="noopener"
 &gt;llmlingua.com&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="핵심-효과-6가지"&gt;핵심 효과 6가지
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Cost reduction&lt;/strong&gt; — shortens both the prompt and the generation; the compression overhead is roughly one call to a small LLM&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Extended context&lt;/strong&gt; — layered on top of long-context models to mitigate &amp;ldquo;lost in the middle&amp;rdquo;; more information within the same token budget&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No extra training&lt;/strong&gt; — the main LLM stays untouched; a compressor simply plugs in up front&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Knowledge retention&lt;/strong&gt; — designed to keep key information such as ICL (in-context learning) examples and reasoning chains&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;KV-cache compression&lt;/strong&gt; — reduces inference memory and latency at the same time&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Recoverable&lt;/strong&gt; — experiments show GPT-4 can recover the key information from a compressed prompt&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="사용-예시-llmlingua-1"&gt;사용 예시 (LLMLingua 1)
&lt;/h2&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;llmlingua&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PromptCompressor&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;llm_lingua&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PromptCompressor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm_lingua&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compress_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# {&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# &amp;#39;compressed_prompt&amp;#39;: &amp;#39;...&amp;#39;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# &amp;#39;origin_tokens&amp;#39;: 2365,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# &amp;#39;compressed_tokens&amp;#39;: 211,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# &amp;#39;ratio&amp;#39;: &amp;#39;11.2x&amp;#39;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# &amp;#39;saving&amp;#39;: &amp;#39;, Saving $0.1 in GPT-4.&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# }&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Quantized models are also supported: with &lt;code&gt;TheBloke/Llama-2-7b-Chat-GPTQ&lt;/code&gt;, the compressor runs in &lt;strong&gt;under 8GB of GPU memory&lt;/strong&gt;.&lt;/p&gt;
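&lt;p&gt;A minimal sketch of pointing the compressor at the GPTQ model; the keyword names follow the LLMLingua README, but treat the exact &lt;code&gt;model_config&lt;/code&gt; contents as assumptions.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# Hedged sketch: run the compressor on a 4-bit GPTQ backbone to stay under ~8GB VRAM.
from llmlingua import PromptCompressor

llm_lingua = PromptCompressor(
    model_name="TheBloke/Llama-2-7b-Chat-GPTQ",
    model_config={"revision": "main"},
)
prompt = "(your long prompt here)"
result = llm_lingua.compress_prompt(prompt, instruction="", question="", target_token=200)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;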
&lt;h2 id="사용-예시-longllmlingua-rag-모드"&gt;사용 예시 (LongLLMLingua RAG 모드)
&lt;/h2&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;compressed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm_lingua&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compress_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;prompt_list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.55&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;condition_in_question&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;after_condition&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;reorder_context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;sort&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;dynamic_context_compression_ratio&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;condition_compare&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;context_budget&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;+100&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Reordering retrieved chunks conditioned on the question and adjusting the compression ratio dynamically by position are the options that lift accuracy in RAG.&lt;/p&gt;
&lt;h2 id="통합"&gt;통합
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://python.langchain.com/docs/integrations/document_transformers/llmlingua" target="_blank" rel="noopener"
 &gt;LangChain retrievers integration&lt;/a&gt; — just plug &lt;code&gt;LLMLinguaCompressor&lt;/code&gt; into a &lt;code&gt;ContextualCompressionRetriever&lt;/code&gt; (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://docs.llamaindex.ai/en/stable/examples/node_postprocessor/LongLLMLingua/" target="_blank" rel="noopener"
 &gt;LlamaIndex node postprocessor integration&lt;/a&gt; — added as the final stage of a query engine pipeline&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://microsoft.github.io/promptflow/" target="_blank" rel="noopener"
 &gt;Microsoft Prompt flow integration&lt;/a&gt; — usable as a standard node in Azure environments&lt;/li&gt;
&lt;/ul&gt;
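&lt;p&gt;The LangChain path referenced above looks roughly like this. The import paths follow the linked docs, and &lt;code&gt;retriever&lt;/code&gt; stands for whatever base retriever you already have, so verify against your installed versions.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# Hedged sketch: wrap an existing retriever with LLMLingua compression in LangChain.
from langchain_community.document_compressors import LLMLinguaCompressor
from langchain.retrievers import ContextualCompressionRetriever

compressor = LLMLinguaCompressor(model_name="openai-community/gpt2", device_map="cpu")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever,   # any existing LangChain retriever (assumed defined)
)
docs = compression_retriever.invoke("What does the report say about Q3 revenue?")
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;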
&lt;h2 id="인사이트"&gt;인사이트
&lt;/h2&gt;&lt;p&gt;&lt;em&gt;&amp;ldquo;Underrated&amp;rdquo;&lt;/em&gt; is exactly the right word. &lt;strong&gt;The research has piled up across five or six papers, the integrations are all there (LangChain, LlamaIndex, Prompt flow), and applying it immediately cuts cost to between 1/3 and 1/10, yet production cases are surprisingly rare.&lt;/strong&gt; My guesses as to why: first, compressed prompts are hard to debug; it is difficult for a human to trace &amp;ldquo;why did this token get dropped&amp;rdquo;, which makes regression testing awkward. Second, the compressor means one extra pass through a small LLM, which is a hard sell for real-time systems with tight latency budgets. Third, the ROI is clearest right now, when models with high per-token prices like GPT-5 and Claude 4.x are widely deployed, yet this is exactly when awareness among operations teams is lowest. It is telling that mid-pipeline LLM layers such as the OpenAI Privacy Filter (reversible tokenization) are being discussed in the same period: compression, pseudonymization, restoration, and KV cache management are splitting off into production tooling, and a trend toward &lt;strong&gt;agentmemory + agent-skills + LLMLingua = &amp;ldquo;the agent's context-management stack&amp;rdquo;&lt;/strong&gt; is taking shape. In short, a tool that &amp;ldquo;performs well but is rarely used&amp;rdquo; is most likely not a problem with the tool but with an immature integration layer.&lt;/p&gt;
&lt;h2 id="참고"&gt;참고
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Repo and demos&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/microsoft/LLMLingua" target="_blank" rel="noopener"
 &gt;microsoft/LLMLingua&lt;/a&gt; — main GitHub repository (6,156 stars, MIT)&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://llmlingua.com/" target="_blank" rel="noopener"
 &gt;llmlingua.com&lt;/a&gt; — project page (papers, demos, blog posts)&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://huggingface.co/spaces/microsoft/LLMLingua" target="_blank" rel="noopener"
 &gt;HuggingFace LLMLingua demo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://huggingface.co/spaces/microsoft/LLMLingua-2" target="_blank" rel="noopener"
 &gt;HuggingFace LLMLingua-2 demo&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Papers&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://aclanthology.org/2023.emnlp-main.825" target="_blank" rel="noopener"
 &gt;LLMLingua (EMNLP 2023)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://aclanthology.org/2024.acl-long.91" target="_blank" rel="noopener"
 &gt;LongLLMLingua (ACL 2024)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://aclanthology.org/2024.findings-acl.57" target="_blank" rel="noopener"
 &gt;LLMLingua-2 (ACL 2024 Findings)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://arxiv.org/abs/2407.02490" target="_blank" rel="noopener"
 &gt;MInference (arXiv 2407.02490)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Integrations&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://python.langchain.com/docs/integrations/document_transformers/llmlingua" target="_blank" rel="noopener"
 &gt;LangChain LLMLinguaCompressor&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://docs.llamaindex.ai/en/stable/examples/node_postprocessor/LongLLMLingua/" target="_blank" rel="noopener"
 &gt;LlamaIndex LongLLMLingua postprocessor&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://microsoft.github.io/promptflow/" target="_blank" rel="noopener"
 &gt;Microsoft Prompt flow&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item></channel></rss>