<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Gguf on ICE-ICE-BEAR-BLOG</title><link>https://ice-ice-bear.github.io/ko/tags/gguf/</link><description>Recent content in Gguf on ICE-ICE-BEAR-BLOG</description><generator>Hugo -- gohugo.io</generator><language>ko</language><lastBuildDate>Sun, 10 May 2026 00:00:00 +0900</lastBuildDate><atom:link href="https://ice-ice-bear.github.io/ko/tags/gguf/index.xml" rel="self" type="application/rss+xml"/><item><title>Open-Weight Models, First Week of May: Zyphra ZAYA1, Gemma 4 26B A4B, Qwen 3.6 35B A3B</title><link>https://ice-ice-bear.github.io/ko/posts/2026-05-10-open-weight-models-digest/</link><pubDate>Sun, 10 May 2026 00:00:00 +0900</pubDate><guid>https://ice-ice-bear.github.io/ko/posts/2026-05-10-open-weight-models-digest/</guid><description>&lt;h2 id="개요"&gt;Overview
&lt;/h2&gt;&lt;p&gt;The first week of May 2026 turned out to be a surprisingly big week for the open-weight camp. &lt;a class="link" href="https://www.zyphra.com/" target="_blank" rel="noopener"&gt;Zyphra&lt;/a&gt; squeezed 8B-class reasoning out of just 760M active parameters with &lt;a class="link" href="https://huggingface.co/Zyphra/ZAYA1-8B" target="_blank" rel="noopener"&gt;ZAYA1-8B&lt;/a&gt;, &lt;a class="link" href="https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/" target="_blank" rel="noopener"&gt;Google&lt;/a&gt; released &lt;a class="link" href="https://huggingface.co/google/gemma-4-26B-A4B-it" target="_blank" rel="noopener"&gt;Gemma 4 26B-A4B-it&lt;/a&gt;, a 25.2B-total / 3.8B-active MoE multimodal model, and in the same window &lt;a class="link" href="https://huggingface.co/Qwen" target="_blank" rel="noopener"&gt;Qwen 3.6 35B-A3B&lt;/a&gt; arrived at 35B total / 3B active. On top of that, &lt;a class="link" href="https://unsloth.ai/" target="_blank" rel="noopener"&gt;Unsloth&lt;/a&gt; layered &lt;a class="link" href="https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF" target="_blank" rel="noopener"&gt;Gemma 4 GGUF&lt;/a&gt; and &lt;a class="link" href="https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF" target="_blank" rel="noopener"&gt;Qwen 3.6 GGUF&lt;/a&gt; within days, leaving both models runnable out of the box on &lt;a class="link" href="https://github.com/ggerganov/llama.cpp" target="_blank" rel="noopener"&gt;llama.cpp&lt;/a&gt; and &lt;a class="link" href="https://ollama.com/" target="_blank" rel="noopener"&gt;Ollama&lt;/a&gt;. Taken together, the week cements a new standard: &lt;strong&gt;&amp;ldquo;8B–35B class = MoE, 1–4B active, quantization shipped at launch&amp;rdquo;&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class="mermaid" style="visibility:hidden"&gt;graph TD
 Week["2026-05 첫째 주 오픈 가중치"] --&gt; Vendors["벤더 3사"]
 Week --&gt; Quants["양자화 레이어"]

 Vendors --&gt; Zyphra["Zyphra &amp;lt;br/&amp;gt; ZAYA1-8B (8.4B / 0.76B active)"]
 Vendors --&gt; Google["Google &amp;lt;br/&amp;gt; Gemma 4 26B-A4B-it (25.2B / 3.8B active)"]
 Vendors --&gt; Qwen["Alibaba &amp;lt;br/&amp;gt; Qwen3.6-35B-A3B (35B / 3B active)"]

 Quants --&gt; Unsloth["Unsloth Dynamic 2.0 GGUF"]
 Unsloth --&gt; Gemma4GGUF["gemma-4-26B-A4B-it-GGUF"]
 Unsloth --&gt; Qwen36GGUF["Qwen3.6-35B-A3B-GGUF"]

 Gemma4GGUF --&gt; Runtimes["llama.cpp / Ollama / LM Studio"]
 Qwen36GGUF --&gt; Runtimes&lt;/pre&gt;&lt;h2 id="1-zyphra-zaya1-8b--활성-760m-amd-네이티브-스택의-첫-결과물"&gt;1. Zyphra ZAYA1-8B — 활성 760M, AMD-네이티브 스택의 첫 결과물
&lt;/h2&gt;&lt;p&gt;&lt;a class="link" href="https://www.zyphra.com/" target="_blank" rel="noopener"
 &gt;Zyphra&lt;/a&gt;는 &lt;a class="link" href="https://www.marktechpost.com/2024/04/17/meet-zamba-7b-zyphras-novel-ai-model-thats-small-in-size-and-big-on-performance/" target="_blank" rel="noopener"
 &gt;Zamba-7B&lt;/a&gt;·&lt;a class="link" href="https://github.com/Zyphra/BlackMamba" target="_blank" rel="noopener"
 &gt;BlackMamba&lt;/a&gt; 계보를 거쳐, 2024년부터 SSM-attention 하이브리드를 밀어온 회사다. 2025년 6월 &lt;a class="link" href="https://www.zyphra.com/" target="_blank" rel="noopener"
 &gt;$110M Series A&lt;/a&gt;로 유니콘 라인업에 진입했고, 2026-05-06에 &lt;a class="link" href="https://huggingface.co/Zyphra/ZAYA1-8B" target="_blank" rel="noopener"
 &gt;ZAYA1-8B&lt;/a&gt;를 풀었다. 베이스 모델은 &lt;a class="link" href="https://huggingface.co/Zyphra/ZAYA1-reasoning-base" target="_blank" rel="noopener"
 &gt;ZAYA1-reasoning-base&lt;/a&gt;에 별도 공개돼 있다.&lt;/p&gt;
&lt;p&gt;핵심 숫자:&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Item&lt;/th&gt;
 &lt;th&gt;Value&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Total parameters&lt;/td&gt;
 &lt;td&gt;8.4B&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Active parameters&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;760M&lt;/strong&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;License&lt;/td&gt;
 &lt;td&gt;&lt;a class="link" href="https://www.apache.org/licenses/LICENSE-2.0" target="_blank" rel="noopener"
 &gt;Apache 2.0&lt;/a&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Training infrastructure&lt;/td&gt;
 &lt;td&gt;&lt;a class="link" href="https://www.amd.com/en/products/accelerators/instinct/mi300/mi300x.html" target="_blank" rel="noopener"
 &gt;AMD Instinct MI300X&lt;/a&gt; × 1,024 + &lt;a class="link" href="https://www.amd.com/en/products/networking.html" target="_blank" rel="noopener"
 &gt;AMD Pensando Pollara&lt;/a&gt; networking, on &lt;a class="link" href="https://www.ibm.com/cloud" target="_blank" rel="noopener"
 &gt;IBM Cloud&lt;/a&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Technical report&lt;/td&gt;
 &lt;td&gt;&lt;a class="link" href="https://arxiv.org/abs/2605.05365" target="_blank" rel="noopener"
 &gt;arXiv:2605.05365&lt;/a&gt; / &lt;a class="link" href="https://www.zyphra.com/post/zaya1-8b" target="_blank" rel="noopener"
 &gt;Zyphra blog&lt;/a&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;ZAYA1-8B scores 71.6 on &lt;a class="link" href="https://www.hmmt.org/" target="_blank" rel="noopener"&gt;HMMT Feb 2026&lt;/a&gt; and 89.1 on &lt;a class="link" href="https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions" target="_blank" rel="noopener"&gt;AIME 2026&lt;/a&gt;. On the same chart, &lt;a class="link" href="https://huggingface.co/Qwen/Qwen3-4B" target="_blank" rel="noopener"&gt;Qwen3-4B&lt;/a&gt; sits at 77.5 and &lt;a class="link" href="https://huggingface.co/google/gemma-4-E4B-it" target="_blank" rel="noopener"&gt;Gemma-4-E4B&lt;/a&gt; at 50.3. &lt;strong&gt;ZAYA1's claim is that a model with under 1B active parameters can beat the 4B class, and what makes that possible is the combination of post-training for reasoning and the SSM-MoE hybrid&lt;/strong&gt;. Deployment is packaged down to a single line via the &lt;a class="link" href="https://github.com/Zyphra/vllm" target="_blank" rel="noopener"&gt;Zyphra fork of vLLM&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;pip install &lt;span class="s2"&gt;&amp;#34;vllm @ git+https://github.com/Zyphra/vllm.git@zaya1-pr&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;vllm serve Zyphra/ZAYA1-8B --port &lt;span class="m"&gt;8010&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --mamba-cache-dtype float32 --dtype bfloat16 &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser zaya_xml
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The biggest industrial significance is that the AMD camp has, for the first time, produced &amp;ldquo;a reasoning-SOTA-class model trained end-to-end without a single &lt;a class="link" href="https://www.nvidia.com/en-us/data-center/h100/" target="_blank" rel="noopener"&gt;NVIDIA H100&lt;/a&gt;.&amp;rdquo; Both the &lt;a class="link" href="https://venturebeat.com/technology/meet-zaya1-8b-a-super-efficient-open-reasoning-model-trained-on-amd-instinct-mi300-gpus/" target="_blank" rel="noopener"&gt;VentureBeat coverage&lt;/a&gt; and the &lt;a class="link" href="https://www.hpcwire.com/aiwire/2026/05/07/zyphra-releases-zaya1-8b-reasoning-model/" target="_blank" rel="noopener"&gt;HPCWire article&lt;/a&gt; stress this point.&lt;/p&gt;
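Once the vLLM server from the snippet above is running, it exposes the standard OpenAI-compatible chat endpoint. A minimal smoke-test request can be sketched like this; the prompt, sampling values, and the `/tmp` path are illustrative, not from the model card:

```shell
# Build a minimal OpenAI-style chat request for the local ZAYA1 server.
cat > /tmp/zaya1-request.json <<'EOF'
{
  "model": "Zyphra/ZAYA1-8B",
  "messages": [
    {"role": "user", "content": "Prove that the square root of 2 is irrational."}
  ],
  "max_tokens": 1024,
  "temperature": 0.6
}
EOF

# Sanity-check the payload locally before sending it.
python3 -m json.tool /tmp/zaya1-request.json

# With the server from the snippet above listening on port 8010:
# curl -s http://localhost:8010/v1/chat/completions \
#   -H "Content-Type: application/json" -d @/tmp/zaya1-request.json
```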
&lt;h2 id="2-gemma-4-26b-a4b-it--google의-moe-멀티모달"&gt;2. Gemma 4 26B-A4B-it — Google의 MoE 멀티모달
&lt;/h2&gt;&lt;p&gt;&lt;a class="link" href="https://deepmind.google/" target="_blank" rel="noopener"
 &gt;Google DeepMind&lt;/a&gt;의 &lt;a class="link" href="https://ai.google.dev/gemma" target="_blank" rel="noopener"
 &gt;Gemma&lt;/a&gt; 시리즈는 &lt;a class="link" href="https://blog.google/technology/developers/gemma-open-models/" target="_blank" rel="noopener"
 &gt;Gemma 1&lt;/a&gt; (2024-02) → &lt;a class="link" href="https://blog.google/technology/developers/google-gemma-2/" target="_blank" rel="noopener"
 &gt;Gemma 2&lt;/a&gt; → &lt;a class="link" href="https://blog.google/technology/developers/gemma-3/" target="_blank" rel="noopener"
 &gt;Gemma 3&lt;/a&gt; → &lt;a class="link" href="https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/" target="_blank" rel="noopener"
 &gt;Gemma 4&lt;/a&gt;로 빠르게 세대를 갈아왔다. &lt;a class="link" href="https://huggingface.co/google/gemma-4-26B-A4B-it" target="_blank" rel="noopener"
 &gt;Gemma 4 26B-A4B-it&lt;/a&gt;는 이번 세대에서 &lt;strong&gt;첫 공식 MoE 라인업&lt;/strong&gt;이다.&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Item&lt;/th&gt;
 &lt;th&gt;Value&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Total parameters&lt;/td&gt;
 &lt;td&gt;25.2B&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Active parameters&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;3.8B&lt;/strong&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Experts&lt;/td&gt;
 &lt;td&gt;8 of 128 active + 1 shared&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Layers&lt;/td&gt;
 &lt;td&gt;30&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Context&lt;/td&gt;
 &lt;td&gt;256K tokens&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Vocabulary&lt;/td&gt;
 &lt;td&gt;262K&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Modalities&lt;/td&gt;
 &lt;td&gt;Text + images (variable resolution)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Training data cutoff&lt;/td&gt;
 &lt;td&gt;2025-01&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Multilingual&lt;/td&gt;
 &lt;td&gt;Trained on 140+, 35+ supported&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;License&lt;/td&gt;
 &lt;td&gt;&lt;a class="link" href="https://www.apache.org/licenses/LICENSE-2.0" target="_blank" rel="noopener"
 &gt;Apache 2.0&lt;/a&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The architecture details are interesting: &lt;strong&gt;local sliding-window attention (1024) plus global attention on the last layer&lt;/strong&gt;, a unified KV cache on the global layers, and a &lt;a class="link" href="https://arxiv.org/abs/2306.15595" target="_blank" rel="noopener"&gt;p-RoPE&lt;/a&gt; variant that stretches the context to 256K. The multimodal encoder is about 550M parameters, and the vision token budget can be chosen from 70/140/280/560/1120, exposing the latency-quality trade-off directly.&lt;/p&gt;
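A quick way to see what the sliding-window layout buys is to count KV-cache positions per layer. This is a back-of-envelope estimate that assumes 29 local layers and 1 global layer out of the 30 (the post only states that the last layer is global) and the same per-position KV width on every layer:

```shell
ctx=262144       # 256K context
window=1024      # local sliding-window size
local_layers=29  # assumption: 30 layers total, only the last one global
global_layers=1

# Local layers cap their KV cache at the window; global layers grow with context.
hybrid=$((local_layers * window + global_layers * ctx))
all_global=$(((local_layers + global_layers) * ctx))

echo "KV positions, sliding+global: ${hybrid}"       # 291840
echo "KV positions, all-global:     ${all_global}"   # 7864320
echo "reduction: $((all_global / hybrid))x"          # 26x (integer division)
```

Under these assumptions the layout cuts the position count at full context by roughly 26×, which is what makes a 256K window affordable on a 26B-class model.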
&lt;p&gt;Benchmarks (instruct):&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Benchmark&lt;/th&gt;
 &lt;th&gt;Score&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a class="link" href="https://github.com/TIGER-AI-Lab/MMLU-Pro" target="_blank" rel="noopener"
 &gt;MMLU Pro&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;82.6&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a class="link" href="https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions" target="_blank" rel="noopener"
 &gt;AIME 2026&lt;/a&gt; (no tools)&lt;/td&gt;
 &lt;td&gt;88.3&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a class="link" href="https://livecodebench.github.io/" target="_blank" rel="noopener"
 &gt;LiveCodeBench v6&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;77.1&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a class="link" href="https://github.com/idavidrein/gpqa" target="_blank" rel="noopener"
 &gt;GPQA Diamond&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;82.3&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a class="link" href="https://mmmu-benchmark.github.io/" target="_blank" rel="noopener"
 &gt;MMMU Pro&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;73.8&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a class="link" href="https://codeforces.com/" target="_blank" rel="noopener"
 &gt;Codeforces ELO&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;1718&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;a class="link" href="https://ai.google.dev/gemma/docs/core" target="_blank" rel="noopener"
 &gt;Gemma 4 도큐먼트&lt;/a&gt;는 &lt;code&gt;enable_thinking=True&lt;/code&gt; 옵션과 multi-turn에서 thinking 블록 제외 권장을 명시한다. 같은 주에 풀린 &lt;a class="link" href="https://github.com/google-ai-edge/LiteRT-LM/releases/tag/v0.11.0" target="_blank" rel="noopener"
 &gt;LiteRT-LM v0.11.0&lt;/a&gt;이 Gemma 4용 &lt;a class="link" href="https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/" target="_blank" rel="noopener"
 &gt;MTP(Multi-token Prediction)&lt;/a&gt;를 모바일 GPU에서 2× 가속한다는 점까지 묶어서 보면, Google은 &lt;strong&gt;클라우드 가중치 + 엣지 런타임 + 디코드 가속&lt;/strong&gt;을 한 분기에 다 챙긴 그림이다.&lt;/p&gt;
&lt;h2 id="3-qwen-36-35b-a3b--256개-전문가-1m-컨텍스트"&gt;3. Qwen 3.6 35B-A3B — 256개 전문가, 1M 컨텍스트
&lt;/h2&gt;&lt;p&gt;&lt;a class="link" href="https://huggingface.co/Qwen" target="_blank" rel="noopener"
 &gt;Alibaba Qwen 팀&lt;/a&gt;은 &lt;a class="link" href="https://qwenlm.github.io/blog/qwen2/" target="_blank" rel="noopener"
 &gt;Qwen2&lt;/a&gt; → &lt;a class="link" href="https://qwenlm.github.io/blog/qwen2.5/" target="_blank" rel="noopener"
 &gt;Qwen2.5&lt;/a&gt; → &lt;a class="link" href="https://qwenlm.github.io/blog/qwen3/" target="_blank" rel="noopener"
 &gt;Qwen3&lt;/a&gt; → Qwen3.5 → Qwen3.6으로 6개월 단위 릴리스를 유지하는 중이다. &lt;a class="link" href="https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF" target="_blank" rel="noopener"
 &gt;Qwen 3.6 35B-A3B&lt;/a&gt; 카드를 보면 MoE 설계가 이번 세대에서 가장 공격적이다.&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Item&lt;/th&gt;
 &lt;th&gt;Value&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Total parameters&lt;/td&gt;
 &lt;td&gt;35B&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Active parameters&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;3B&lt;/strong&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Number of experts&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;256&lt;/strong&gt; (8 routed + 1 shared)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Layers&lt;/td&gt;
 &lt;td&gt;40&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Hidden dimension&lt;/td&gt;
 &lt;td&gt;2048&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Context&lt;/td&gt;
 &lt;td&gt;262K native / &lt;strong&gt;1,010K with &lt;a class="link" href="https://arxiv.org/abs/2309.00071" target="_blank" rel="noopener"&gt;YaRN&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The attention layout is unusual: the stack is &lt;code&gt;10 × (3 × (Gated DeltaNet → MoE) → 1 × (Gated Attention → MoE))&lt;/code&gt;. &lt;a class="link" href="https://arxiv.org/abs/2412.06464" target="_blank" rel="noopener"&gt;Gated DeltaNet&lt;/a&gt; runs 32 V-heads / 16 QK-heads / 128 head-dim, and the gated attention runs 16 Q-heads / 2 KV-heads / 256 head-dim. It is a &lt;strong&gt;hybrid that interleaves a Mamba/DeltaNet-style linear-time mixer with attention at a 3:1 ratio&lt;/strong&gt;, a design whose cost advantage grows with context length.&lt;/p&gt;
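The KV cost of that 3:1 layout can be sketched from the head numbers. The sketch assumes bf16 KV (2 bytes per value) and that only the 10 gated-attention layers keep a per-token KV cache, since the DeltaNet layers carry a fixed-size recurrent state instead:

```shell
# Per-token KV bytes for one gated-attention layer:
# K and V (2) x kv_heads (2) x head_dim (256) x bf16 (2 bytes)
per_layer=$((2 * 2 * 256 * 2))

hybrid=$((per_layer * 10))  # only 10 of the 40 layers are gated attention
full=$((per_layer * 40))    # hypothetical all-attention stack, same geometry

echo "KV per token, hybrid:        ${hybrid} bytes"   # 20480
echo "KV per token, all-attention: ${full} bytes"     # 81920
```

Under these assumptions the hybrid carries a quarter of the KV growth of an equivalent all-attention stack, on top of the already small 2-KV-head geometry.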
&lt;p&gt;Benchmarks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://www.swebench.com/" target="_blank" rel="noopener"
 &gt;SWE-bench Verified&lt;/a&gt; 73.4&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/TIGER-AI-Lab/MMLU-Pro" target="_blank" rel="noopener"
 &gt;MMLU-Pro&lt;/a&gt; 85.2&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://livecodebench.github.io/" target="_blank" rel="noopener"
 &gt;LiveCodeBench v6&lt;/a&gt; 80.4&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://mmmu-benchmark.github.io/" target="_blank" rel="noopener"
 &gt;MMMU&lt;/a&gt; 81.7 (vision)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The recommended inference engines are &lt;a class="link" href="https://github.com/sgl-project/sglang" target="_blank" rel="noopener"&gt;SGLang ≥0.5.10&lt;/a&gt; / &lt;a class="link" href="https://github.com/vllm-project/vllm" target="_blank" rel="noopener"&gt;vLLM ≥0.19.0&lt;/a&gt; / &lt;a class="link" href="https://github.com/kvcache-ai/ktransformers" target="_blank" rel="noopener"&gt;KTransformers&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="4-같은-클래스-묶어-보기"&gt;4. 같은 클래스 묶어 보기
&lt;/h2&gt;&lt;p&gt;세 모델을 같은 표에 놓으면 &amp;ldquo;8B–35B 클래스 = MoE&amp;rdquo; 가 더 또렷해진다.&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Model&lt;/th&gt;
 &lt;th&gt;Total / active&lt;/th&gt;
 &lt;th&gt;Experts&lt;/th&gt;
 &lt;th&gt;Context&lt;/th&gt;
 &lt;th&gt;Multimodal&lt;/th&gt;
 &lt;th&gt;Training infra&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a class="link" href="https://huggingface.co/Zyphra/ZAYA1-8B" target="_blank" rel="noopener"
 &gt;ZAYA1-8B&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;8.4B / 0.76B&lt;/td&gt;
 &lt;td&gt;— (SSM-MoE)&lt;/td&gt;
 &lt;td&gt;Undisclosed&lt;/td&gt;
 &lt;td&gt;Text&lt;/td&gt;
 &lt;td&gt;AMD MI300X × 1,024&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a class="link" href="https://huggingface.co/google/gemma-4-26B-A4B-it" target="_blank" rel="noopener"
 &gt;Gemma 4 26B-A4B-it&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;25.2B / 3.8B&lt;/td&gt;
 &lt;td&gt;128 (8+1)&lt;/td&gt;
 &lt;td&gt;256K&lt;/td&gt;
 &lt;td&gt;Text + images&lt;/td&gt;
 &lt;td&gt;&lt;a class="link" href="https://cloud.google.com/tpu" target="_blank" rel="noopener"
 &gt;TPU&lt;/a&gt; (Google internal)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a class="link" href="https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF" target="_blank" rel="noopener"
 &gt;Qwen 3.6 35B-A3B&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;35B / 3B&lt;/td&gt;
 &lt;td&gt;256 (8+1)&lt;/td&gt;
 &lt;td&gt;262K → 1M&lt;/td&gt;
 &lt;td&gt;Text + images&lt;/td&gt;
 &lt;td&gt;Alibaba internal&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The key point is that the active parameters are all compressed into &lt;strong&gt;0.76B / 3B / 3.8B&lt;/strong&gt;. Both inference-time memory bandwidth and compute are sized for the 4B class, so &lt;strong&gt;running 35B-class weights at 4-bit on a single 24GB-VRAM card becomes an ordinary workflow&lt;/strong&gt;.&lt;/p&gt;
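That "4-bit on one 24GB card" claim is easy to sanity-check: at roughly 0.5 bytes per parameter, the raw 4-bit weight footprints work out as below. Real GGUF files run somewhat larger because quantization scales, embeddings, and some sensitive layers stay at higher precision:

```shell
# Rough 4-bit footprint: params (in millions) x 0.5 bytes -> MB.
for entry in "ZAYA1-8B:8400" "Gemma 4 26B-A4B:25200" "Qwen 3.6 35B-A3B:35000"; do
  name=${entry%%:*}
  mparams=${entry##*:}
  mb=$((mparams / 2))
  printf '%-18s ~%d.%d GB at 4-bit\n' "$name" $((mb / 1000)) $(((mb % 1000) / 100))
done
```

Even the 35B model lands around 17.5 GB of raw weights, leaving headroom on a 24GB card for activations and a modest KV cache.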
&lt;h2 id="5-unsloth의-양자화-동시-출시"&gt;5. Unsloth의 양자화 동시 출시
&lt;/h2&gt;&lt;p&gt;&lt;a class="link" href="https://unsloth.ai/" target="_blank" rel="noopener"
 &gt;Unsloth&lt;/a&gt;가 &lt;a class="link" href="https://unsloth.ai/blog/dynamic-v2" target="_blank" rel="noopener"
 &gt;Dynamic 2.0 GGUF&lt;/a&gt; 방식으로 베이스 모델 공개 며칠 안에 양자화를 푼다. 핵심 아이디어는 &lt;strong&gt;레이어마다 다른 양자화 타입을 동적으로 선택&lt;/strong&gt;해서, 같은 파일 크기(Q4_K_M)에서 Q5_K_M에 더 가까운 정확도를 뽑아내는 것. KL Divergence가 &lt;a class="link" href="https://github.com/ggml-org/llama.cpp/pull/4861" target="_blank" rel="noopener"
 &gt;imatrix&lt;/a&gt;·&lt;a class="link" href="https://arxiv.org/abs/1712.05877" target="_blank" rel="noopener"
 &gt;QAT&lt;/a&gt; 대비 낮다는 게 &lt;a class="link" href="https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs" target="_blank" rel="noopener"
 &gt;Unsloth 벤치마크&lt;/a&gt;의 주장이다.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF" target="_blank" rel="noopener"
 &gt;gemma-4-26B-A4B-it-GGUF&lt;/a&gt;의 양자화 라인업:&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Target VRAM&lt;/th&gt;
 &lt;th&gt;Recommended quant&lt;/th&gt;
 &lt;th&gt;File size&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;12GB class&lt;/td&gt;
 &lt;td&gt;UD-IQ2_M / UD-Q2_K_XL&lt;/td&gt;
 &lt;td&gt;10.0–10.5 GB&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;16GB class&lt;/td&gt;
 &lt;td&gt;UD-IQ3_XXS / UD-Q3_K_M&lt;/td&gt;
 &lt;td&gt;11.4–12.7 GB&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;24GB class&lt;/td&gt;
 &lt;td&gt;UD-Q4_K_M / MXFP4_MOE&lt;/td&gt;
 &lt;td&gt;16.6–16.9 GB&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;32GB class&lt;/td&gt;
 &lt;td&gt;UD-Q5_K_M&lt;/td&gt;
 &lt;td&gt;21.2 GB&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;48GB+ workstation&lt;/td&gt;
 &lt;td&gt;UD-Q8_K_XL / BF16&lt;/td&gt;
 &lt;td&gt;27.6–50.5 GB&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;a class="link" href="https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF" target="_blank" rel="noopener"
 &gt;Qwen3.6-35B-A3B-GGUF&lt;/a&gt;도 동일한 사다리를 따라간다 — 1-bit &lt;code&gt;UD-IQ1_M&lt;/code&gt;(10 GB)부터 BF16(69.4 GB)까지. &lt;strong&gt;35B 모델이 10 GB에 들어간다&lt;/strong&gt;는 게 인상적이다.&lt;/p&gt;
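Effective bits per weight make the ladder's endpoints concrete. This takes the listed file sizes at face value, reads GB as 10^9 bytes, and divides by the 35B parameter count:

```shell
# Effective bits/weight = file size in bits / parameter count.
awk 'BEGIN {
  params = 35e9
  printf "UD-IQ1_M (10 GB):  %.2f bits/weight\n", 10e9   * 8 / params
  printf "BF16 (69.4 GB):    %.2f bits/weight\n", 69.4e9 * 8 / params
}'
```

The nominally 1-bit file works out to roughly 2.3 effective bits per weight, which is consistent with the Dynamic approach keeping sensitive layers at higher precision rather than quantizing everything to 1 bit.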
&lt;p&gt;The runtime matrix:&lt;/p&gt;
&lt;pre class="mermaid" style="visibility:hidden"&gt;flowchart LR
 GGUF["Unsloth Dynamic 2.0 GGUF"] --&gt; Llama["llama.cpp / llama-server"]
 GGUF --&gt; Ollama["Ollama"]
 GGUF --&gt; LM["LM Studio"]
 GGUF --&gt; Jan["Jan"]
 GGUF --&gt; vLLM["vLLM"]
 GGUF --&gt; Py["llama-cpp-python"]
 GGUF --&gt; Studio["Unsloth Studio"]&lt;/pre&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# llama.cpp&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;brew install llama.cpp
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;llama-server -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_M
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Ollama&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ollama run hf.co/unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_M
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="6-앱-개발자-관점에서--fp16-레퍼런스가-아니라-양자화-티어를-타깃하라"&gt;6. For App Developers: Target the Quantization Tier, Not the FP16 Reference
&lt;/h2&gt;&lt;p&gt;The real takeaway of the week is not the model specs but the &lt;strong&gt;deployment path&lt;/strong&gt;.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;MoE is no longer optional.&lt;/strong&gt; New models in the 8B–35B class are now effectively all MoE. If your inference stack lacks MoE-aware kernels (sparse expert dispatch, batched MoE GEMM), you lose the active-parameter advantage. &lt;a class="link" href="https://github.com/vllm-project/vllm" target="_blank" rel="noopener"&gt;vLLM&lt;/a&gt;, &lt;a class="link" href="https://github.com/sgl-project/sglang" target="_blank" rel="noopener"&gt;SGLang&lt;/a&gt;, and &lt;a class="link" href="https://github.com/ggerganov/llama.cpp" target="_blank" rel="noopener"&gt;llama.cpp&lt;/a&gt; all have MoE paths already, so if you run hand-rolled inference code, it is time to switch.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Don't target the FP16/BF16 reference.&lt;/strong&gt; In 90% of real deployments the weights are &lt;a class="link" href="https://huggingface.co/docs/hub/gguf" target="_blank" rel="noopener"&gt;Q4_K_M&lt;/a&gt; or &lt;a class="link" href="https://www.microsoft.com/en-us/research/blog/mxfp4-bringing-fp4-precision-to-deep-learning/" target="_blank" rel="noopener"&gt;MXFP4&lt;/a&gt;. Evaluations only mean something if you rerun them on the post-quantization weights. Selective quantization like &lt;a class="link" href="https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs" target="_blank" rel="noopener"&gt;Unsloth Dynamic 2.0&lt;/a&gt; has shrunk the quantization loss, but it is not zero.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;256K–1M context is now the default.&lt;/strong&gt; Even with extensions like &lt;a class="link" href="https://arxiv.org/abs/2309.00071" target="_blank" rel="noopener"&gt;YaRN&lt;/a&gt;, KV-cache memory explodes: run Qwen 3.6 35B-A3B at 1M context on a 24GB card and the KV cache outweighs the weights. &lt;a class="link" href="https://blog.vllm.ai/2023/06/20/vllm.html" target="_blank" rel="noopener"&gt;Paged attention&lt;/a&gt;, &lt;a class="link" href="https://docs.vllm.ai/en/latest/automatic_prefix_caching/apc.html" target="_blank" rel="noopener"&gt;prefix caching&lt;/a&gt;, and context pruning need to be on by default.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Vendor lock-in is dissolving.&lt;/strong&gt; ZAYA1 trained on AMD MI300X, Gemma 4 on Google TPUs, Qwen 3.6 on Alibaba's internal cluster, and all three land in the same HF card format. Training infrastructure keeps diversifying while inference infrastructure (llama.cpp + Ollama + vLLM) converges to a one-liner.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
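The KV-versus-weights claim in point 3 can be checked against the numbers quoted earlier: 2 KV heads × 256 head-dim, bf16 cache, and (an assumption about the hybrid) only the 10 gated-attention layers caching KV while the DeltaNet layers hold constant-size state:

```shell
# KV bytes/token: K and V (2) x 10 attention layers x 2 KV heads x 256 dim x bf16 (2)
kv_per_token=$((2 * 10 * 2 * 256 * 2))         # 20480 bytes

ctx=1010000                                     # YaRN-extended context
kv_total=$((kv_per_token * ctx))

echo "KV cache at ${ctx} tokens: ~$((kv_total / 1000000000)) GB"        # ~20 GB
echo "4-bit weights (35B):       ~$((35000000000 / 2 / 1000000000)) GB" # ~17 GB
```

Already at 1M tokens a single request's cache edges past the 4-bit weights under these assumptions, and with concurrent requests it dominates, which is what makes paged attention and prefix caching non-negotiable.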
&lt;h2 id="인사이트"&gt;인사이트
&lt;/h2&gt;&lt;p&gt;2026년 5월 첫째 주는 작은 분기점이다. &lt;strong&gt;활성 파라미터 1B–4B / 총 8B–35B / MoE / 양자화 동시 출시&lt;/strong&gt; 라는 네 항목이 동시에 표준으로 굳어졌다. ZAYA1-8B는 AMD-네이티브 스택이 NVIDIA 없이도 reasoning SOTA를 만들 수 있음을, Gemma 4 26B-A4B-it는 멀티모달 + 256K 컨텍스트가 26B급 MoE로 내려왔음을, Qwen 3.6 35B-A3B는 256개 전문가 + DeltaNet 하이브리드 + 1M 컨텍스트가 가능함을 보였다. Unsloth가 며칠 안에 GGUF를 올린 덕에 한국 개발자도 24GB VRAM 한 장 또는 32GB 통합 메모리 노트북 한 대로 세 모델을 모두 굴려볼 수 있다. &lt;strong&gt;앱 개발자 입장에서 진짜 행동 항목은 단순하다 — 양자화 티어(UD-Q4_K_M)를 평가 기준으로 박고, 추론 스택은 MoE-aware로 맞추고, 컨텍스트 예산은 256K가 아니라 KV cache로 다시 계산하라.&lt;/strong&gt; 6월에 새 모델이 또 나와도 같은 형판이 계속 굴러갈 가능성이 높다.&lt;/p&gt;
&lt;h2 id="참고"&gt;참고
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;모델 카드&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://huggingface.co/Zyphra/ZAYA1-8B" target="_blank" rel="noopener"
 &gt;Zyphra/ZAYA1-8B&lt;/a&gt; · &lt;a class="link" href="https://huggingface.co/Zyphra/ZAYA1-reasoning-base" target="_blank" rel="noopener"
 &gt;ZAYA1-reasoning-base&lt;/a&gt; · &lt;a class="link" href="https://huggingface.co/Zyphra" target="_blank" rel="noopener"
 &gt;Zyphra collection&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://huggingface.co/google/gemma-4-26B-A4B-it" target="_blank" rel="noopener"
 &gt;google/gemma-4-26B-A4B-it&lt;/a&gt; · &lt;a class="link" href="https://ai.google.dev/gemma/docs/core" target="_blank" rel="noopener"
 &gt;Gemma 4 docs&lt;/a&gt; · &lt;a class="link" href="https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/" target="_blank" rel="noopener"
 &gt;Gemma 4 launch blog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF" target="_blank" rel="noopener"
 &gt;unsloth/gemma-4-26B-A4B-it-GGUF&lt;/a&gt; · &lt;a class="link" href="https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF" target="_blank" rel="noopener"
 &gt;unsloth/Qwen3.6-35B-A3B-GGUF&lt;/a&gt; · &lt;a class="link" href="https://huggingface.co/collections/unsloth/unsloth-dynamic-20-quants" target="_blank" rel="noopener"
 &gt;Unsloth Dynamic 2.0 Quants collection&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Technical reports / blogs&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://www.zyphra.com/post/zaya1-8b" target="_blank" rel="noopener"
 &gt;Zyphra: ZAYA1-8B blog&lt;/a&gt; · &lt;a class="link" href="https://arxiv.org/abs/2605.05365" target="_blank" rel="noopener"
 &gt;ZAYA1 arXiv&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/" target="_blank" rel="noopener"
 &gt;Google: Multi-token Prediction for Gemma 4&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://unsloth.ai/blog/dynamic-v2" target="_blank" rel="noopener"
 &gt;Unsloth: Dynamic v2.0 GGUFs&lt;/a&gt; · &lt;a class="link" href="https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs" target="_blank" rel="noopener"
 &gt;Dynamic 2.0 docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://venturebeat.com/technology/meet-zaya1-8b-a-super-efficient-open-reasoning-model-trained-on-amd-instinct-mi300-gpus/" target="_blank" rel="noopener"
 &gt;VentureBeat: ZAYA1-8B on MI300X&lt;/a&gt; · &lt;a class="link" href="https://www.hpcwire.com/aiwire/2026/05/07/zyphra-releases-zaya1-8b-reasoning-model/" target="_blank" rel="noopener"
 &gt;HPCWire: Zyphra Releases ZAYA1-8B&lt;/a&gt; · &lt;a class="link" href="https://hothardware.com/news/amd-zyphra-gpu-cluster-gives-birth-zaya-1-moe-ai-model" target="_blank" rel="noopener"
 &gt;HotHardware: AMD Zyphra GPU Cluster&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Runtimes / inference stacks&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/ggerganov/llama.cpp" target="_blank" rel="noopener"
 &gt;llama.cpp&lt;/a&gt; · &lt;a class="link" href="https://ollama.com/" target="_blank" rel="noopener"
 &gt;Ollama&lt;/a&gt; · &lt;a class="link" href="https://lmstudio.ai/" target="_blank" rel="noopener"
 &gt;LM Studio&lt;/a&gt; · &lt;a class="link" href="https://jan.ai/" target="_blank" rel="noopener"
 &gt;Jan&lt;/a&gt; · &lt;a class="link" href="https://unsloth.ai/" target="_blank" rel="noopener"
 &gt;Unsloth Studio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/vllm-project/vllm" target="_blank" rel="noopener"
 &gt;vLLM&lt;/a&gt; · &lt;a class="link" href="https://github.com/sgl-project/sglang" target="_blank" rel="noopener"
 &gt;SGLang&lt;/a&gt; · &lt;a class="link" href="https://github.com/kvcache-ai/ktransformers" target="_blank" rel="noopener"
 &gt;KTransformers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/Zyphra/vllm" target="_blank" rel="noopener"
 &gt;Zyphra vLLM fork&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Related background&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://arxiv.org/abs/2309.00071" target="_blank" rel="noopener"
 &gt;YaRN paper&lt;/a&gt; · &lt;a class="link" href="https://arxiv.org/abs/2412.06464" target="_blank" rel="noopener"
 &gt;Gated DeltaNet paper&lt;/a&gt; · &lt;a class="link" href="https://arxiv.org/abs/2211.17192" target="_blank" rel="noopener"
 &gt;Speculative decoding&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.marktechpost.com/2024/04/17/meet-zamba-7b-zyphras-novel-ai-model-thats-small-in-size-and-big-on-performance/" target="_blank" rel="noopener"
 &gt;Zamba-7B (earlier Zyphra model)&lt;/a&gt; · &lt;a class="link" href="https://github.com/Zyphra/BlackMamba" target="_blank" rel="noopener"
 &gt;BlackMamba&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item></channel></rss>