<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Zyphra on ICE-ICE-BEAR-BLOG</title><link>https://ice-ice-bear.github.io/tags/zyphra/</link><description>Recent content in Zyphra on ICE-ICE-BEAR-BLOG</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Sun, 10 May 2026 00:00:00 +0900</lastBuildDate><atom:link href="https://ice-ice-bear.github.io/tags/zyphra/index.xml" rel="self" type="application/rss+xml"/><item><title>Open-Weight Models, First Week of May 2026 — Zyphra ZAYA1, Gemma 4 26B A4B, Qwen 3.6 35B A3B</title><link>https://ice-ice-bear.github.io/posts/2026-05-10-open-weight-models-digest/</link><pubDate>Sun, 10 May 2026 00:00:00 +0900</pubDate><guid>https://ice-ice-bear.github.io/posts/2026-05-10-open-weight-models-digest/</guid><description>&lt;img src="https://ice-ice-bear.github.io/" alt="Featured image of post Open-Weight Models, First Week of May 2026 — Zyphra ZAYA1, Gemma 4 26B A4B, Qwen 3.6 35B A3B" /&gt;&lt;h2 id="overview"&gt;Overview
&lt;/h2&gt;&lt;p&gt;The first week of May 2026 was a quietly heavy week for open weights. &lt;a class="link" href="https://www.zyphra.com/" target="_blank" rel="noopener"
 &gt;Zyphra&lt;/a&gt; shipped &lt;a class="link" href="https://huggingface.co/Zyphra/ZAYA1-8B" target="_blank" rel="noopener"
 &gt;ZAYA1-8B&lt;/a&gt; — 8B-class reasoning with only 760M active parameters. &lt;a class="link" href="https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/" target="_blank" rel="noopener"
 &gt;Google&lt;/a&gt; released &lt;a class="link" href="https://huggingface.co/google/gemma-4-26B-A4B-it" target="_blank" rel="noopener"
 &gt;Gemma 4 26B-A4B-it&lt;/a&gt;, a 25.2B / 3.8B-active multimodal MoE. &lt;a class="link" href="https://huggingface.co/Qwen" target="_blank" rel="noopener"
 &gt;Alibaba&amp;rsquo;s Qwen team&lt;/a&gt; followed with &lt;a class="link" href="https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF" target="_blank" rel="noopener"
 &gt;Qwen 3.6 35B-A3B&lt;/a&gt;, 35B total / 3B active. And on top of that, &lt;a class="link" href="https://unsloth.ai/" target="_blank" rel="noopener"
 &gt;Unsloth&lt;/a&gt; had &lt;a class="link" href="https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF" target="_blank" rel="noopener"
 &gt;Gemma 4 GGUF&lt;/a&gt; and &lt;a class="link" href="https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF" target="_blank" rel="noopener"
 &gt;Qwen 3.6 GGUF&lt;/a&gt; builds running on &lt;a class="link" href="https://github.com/ggerganov/llama.cpp" target="_blank" rel="noopener"
 &gt;llama.cpp&lt;/a&gt; and &lt;a class="link" href="https://ollama.com/" target="_blank" rel="noopener"
 &gt;Ollama&lt;/a&gt; within days. Zoom out and the pattern is clear: &lt;strong&gt;the 8B–35B class is now MoE with 1–4B active, and quantized builds ship at the same time as the reference weights.&lt;/strong&gt;&lt;/p&gt;
&lt;pre class="mermaid" style="visibility:hidden"&gt;graph TD
 Week["First week of May 2026"] --&gt; Vendors["3 vendors"]
 Week --&gt; Quants["Quantization layer"]

 Vendors --&gt; Zyphra["Zyphra &amp;lt;br/&amp;gt; ZAYA1-8B (8.4B / 0.76B active)"]
 Vendors --&gt; Google["Google &amp;lt;br/&amp;gt; Gemma 4 26B-A4B-it (25.2B / 3.8B active)"]
 Vendors --&gt; Qwen["Alibaba &amp;lt;br/&amp;gt; Qwen3.6-35B-A3B (35B / 3B active)"]

 Quants --&gt; Unsloth["Unsloth Dynamic 2.0 GGUF"]
 Unsloth --&gt; Gemma4GGUF["gemma-4-26B-A4B-it-GGUF"]
 Unsloth --&gt; Qwen36GGUF["Qwen3.6-35B-A3B-GGUF"]

 Gemma4GGUF --&gt; Runtimes["llama.cpp / Ollama / LM Studio"]
 Qwen36GGUF --&gt; Runtimes&lt;/pre&gt;&lt;h2 id="1-zyphra-zaya1-8b--760m-active-the-first-amd-native-end-to-end-result"&gt;1. Zyphra ZAYA1-8B — 760M active, the first AMD-native end-to-end result
&lt;/h2&gt;&lt;p&gt;&lt;a class="link" href="https://www.zyphra.com/" target="_blank" rel="noopener"
 &gt;Zyphra&lt;/a&gt; has been on the SSM-attention hybrid track since &lt;a class="link" href="https://www.marktechpost.com/2024/04/17/meet-zamba-7b-zyphras-novel-ai-model-thats-small-in-size-and-big-on-performance/" target="_blank" rel="noopener"
 &gt;Zamba-7B&lt;/a&gt; and &lt;a class="link" href="https://github.com/Zyphra/BlackMamba" target="_blank" rel="noopener"
 &gt;BlackMamba&lt;/a&gt; in 2024, hit unicorn status with a &lt;a class="link" href="https://www.zyphra.com/" target="_blank" rel="noopener"
 &gt;$110M Series A&lt;/a&gt; in June 2025, and shipped &lt;a class="link" href="https://huggingface.co/Zyphra/ZAYA1-8B" target="_blank" rel="noopener"
 &gt;ZAYA1-8B&lt;/a&gt; on 2026-05-06. The base is published separately as &lt;a class="link" href="https://huggingface.co/Zyphra/ZAYA1-reasoning-base" target="_blank" rel="noopener"
 &gt;ZAYA1-reasoning-base&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The numbers:&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Field&lt;/th&gt;
 &lt;th&gt;Value&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Total params&lt;/td&gt;
 &lt;td&gt;8.4B&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Active params&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;760M&lt;/strong&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;License&lt;/td&gt;
 &lt;td&gt;&lt;a class="link" href="https://www.apache.org/licenses/LICENSE-2.0" target="_blank" rel="noopener"
 &gt;Apache 2.0&lt;/a&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Training infra&lt;/td&gt;
 &lt;td&gt;&lt;a class="link" href="https://www.amd.com/en/products/accelerators/instinct/mi300/mi300x.html" target="_blank" rel="noopener"
 &gt;AMD Instinct MI300X&lt;/a&gt; × 1,024 + &lt;a class="link" href="https://www.amd.com/en/products/networking.html" target="_blank" rel="noopener"
 &gt;AMD Pensando Pollara&lt;/a&gt; networking, &lt;a class="link" href="https://www.ibm.com/cloud" target="_blank" rel="noopener"
 &gt;IBM Cloud&lt;/a&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Tech report&lt;/td&gt;
 &lt;td&gt;&lt;a class="link" href="https://arxiv.org/abs/2605.05365" target="_blank" rel="noopener"
 &gt;arXiv:2605.05365&lt;/a&gt; · &lt;a class="link" href="https://www.zyphra.com/post/zaya1-8b" target="_blank" rel="noopener"
 &gt;Zyphra blog&lt;/a&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;ZAYA1-8B posts 71.6 on &lt;a class="link" href="https://www.hmmt.org/" target="_blank" rel="noopener"
 &gt;HMMT Feb 2026&lt;/a&gt; and 89.1 on &lt;a class="link" href="https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions" target="_blank" rel="noopener"
 &gt;AIME 2026&lt;/a&gt;. For comparison on the same chart: &lt;a class="link" href="https://huggingface.co/Qwen/Qwen3-4B" target="_blank" rel="noopener"
 &gt;Qwen3-4B&lt;/a&gt; lands at 77.5, &lt;a class="link" href="https://huggingface.co/google/gemma-4-E4B-it" target="_blank" rel="noopener"
 &gt;Gemma-4-E4B&lt;/a&gt; at 50.3. The claim is &lt;strong&gt;a sub-1B-active model beating 4B-class peers&lt;/strong&gt;, made possible by post-training reasoning plus an SSM-MoE hybrid backbone. Serving takes two commands via the &lt;a class="link" href="https://github.com/Zyphra/vllm" target="_blank" rel="noopener"
 &gt;Zyphra vLLM fork&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;pip install &lt;span class="s2"&gt;&amp;#34;vllm @ git+https://github.com/Zyphra/vllm.git@zaya1-pr&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;vllm serve Zyphra/ZAYA1-8B --port &lt;span class="m"&gt;8010&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --mamba-cache-dtype float32 --dtype bfloat16 &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser zaya_xml
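&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# sanity check: vLLM exposes its standard OpenAI-compatible API on the chosen port&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;curl -s http://localhost:8010/v1/models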
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The industrial significance: this is the first reasoning-SOTA-class open model &lt;strong&gt;trained end-to-end without &lt;a class="link" href="https://www.nvidia.com/en-us/data-center/h100/" target="_blank" rel="noopener"
 &gt;NVIDIA H100&lt;/a&gt;&lt;/strong&gt; — both &lt;a class="link" href="https://venturebeat.com/technology/meet-zaya1-8b-a-super-efficient-open-reasoning-model-trained-on-amd-instinct-mi300-gpus/" target="_blank" rel="noopener"
 &gt;VentureBeat&lt;/a&gt; and &lt;a class="link" href="https://www.hpcwire.com/aiwire/2026/05/07/zyphra-releases-zaya1-8b-reasoning-model/" target="_blank" rel="noopener"
 &gt;HPCWire&lt;/a&gt; lead with that angle.&lt;/p&gt;
&lt;h2 id="2-gemma-4-26b-a4b-it--googles-moe-multimodal"&gt;2. Gemma 4 26B-A4B-it — Google&amp;rsquo;s MoE multimodal
&lt;/h2&gt;&lt;p&gt;&lt;a class="link" href="https://deepmind.google/" target="_blank" rel="noopener"
 &gt;Google DeepMind&lt;/a&gt;&amp;rsquo;s &lt;a class="link" href="https://ai.google.dev/gemma" target="_blank" rel="noopener"
 &gt;Gemma&lt;/a&gt; line moved fast: &lt;a class="link" href="https://blog.google/technology/developers/gemma-open-models/" target="_blank" rel="noopener"
 &gt;Gemma 1&lt;/a&gt; (Feb 2024) → &lt;a class="link" href="https://blog.google/technology/developers/google-gemma-2/" target="_blank" rel="noopener"
 &gt;Gemma 2&lt;/a&gt; → &lt;a class="link" href="https://blog.google/technology/developers/gemma-3/" target="_blank" rel="noopener"
 &gt;Gemma 3&lt;/a&gt; → &lt;a class="link" href="https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/" target="_blank" rel="noopener"
 &gt;Gemma 4&lt;/a&gt;. &lt;a class="link" href="https://huggingface.co/google/gemma-4-26B-A4B-it" target="_blank" rel="noopener"
 &gt;Gemma 4 26B-A4B-it&lt;/a&gt; is &lt;strong&gt;the first official MoE entry&lt;/strong&gt; in the family.&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Field&lt;/th&gt;
 &lt;th&gt;Value&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Total params&lt;/td&gt;
 &lt;td&gt;25.2B&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Active params&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;3.8B&lt;/strong&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Experts&lt;/td&gt;
 &lt;td&gt;8 active of 128 + 1 shared&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Layers&lt;/td&gt;
 &lt;td&gt;30&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Context&lt;/td&gt;
 &lt;td&gt;256K tokens&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Vocab&lt;/td&gt;
 &lt;td&gt;262K&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Modalities&lt;/td&gt;
 &lt;td&gt;Text + Image (variable resolution)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Training cutoff&lt;/td&gt;
 &lt;td&gt;2025-01&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Languages&lt;/td&gt;
 &lt;td&gt;140+ trained, 35+ supported&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;License&lt;/td&gt;
 &lt;td&gt;&lt;a class="link" href="https://www.apache.org/licenses/LICENSE-2.0" target="_blank" rel="noopener"
 &gt;Apache 2.0&lt;/a&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The architecture is interesting: &lt;strong&gt;local sliding-window attention (window 1024) plus a final global-attention layer&lt;/strong&gt;, with unified KVs in the global layer and &lt;a class="link" href="https://arxiv.org/abs/2306.15595" target="_blank" rel="noopener"
 &gt;proportional RoPE (p-RoPE)&lt;/a&gt; to make the 256K window work. The vision encoder is ~550M parameters, and the per-image token budget is configurable across 70/140/280/560/1120 tokens, exposing the latency-quality trade-off directly to the caller.&lt;/p&gt;
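&lt;p&gt;To see why the local/global split matters at 256K, here is a rough KV-cache comparison in Python. Only the layer count (30), the window (1024), and the context length (256K) are published; the KV-head and head-dim numbers below are illustrative assumptions, not Gemma 4 config.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# Back-of-envelope KV-cache math for a local/global layer split.
# kv_heads and head_dim are ASSUMED values, not Gemma 4's published config.

def kv_bytes(layers, tokens, kv_heads=8, head_dim=128, bytes_per=2):
    """K + V cache size in GiB for `layers` attention layers."""
    return 2 * layers * tokens * kv_heads * head_dim * bytes_per / 2**30

CONTEXT, WINDOW = 256_000, 1024

all_global = kv_bytes(30, CONTEXT)
# 29 sliding-window layers (cache capped at the window) + 1 global layer.
local_global = kv_bytes(29, WINDOW) + kv_bytes(1, CONTEXT)

print(f"all-global:   {all_global:5.1f} GiB")    # ~29.3 GiB
print(f"local+global: {local_global:5.1f} GiB")  # ~1.1 GiB
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Under these assumptions the cache shrinks by more than an order of magnitude, which is the point of the layout.&lt;/p&gt;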
&lt;p&gt;Benchmarks (instruct):&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Benchmark&lt;/th&gt;
 &lt;th&gt;Score&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a class="link" href="https://github.com/TIGER-AI-Lab/MMLU-Pro" target="_blank" rel="noopener"
 &gt;MMLU Pro&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;82.6&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a class="link" href="https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions" target="_blank" rel="noopener"
 &gt;AIME 2026&lt;/a&gt; (no tools)&lt;/td&gt;
 &lt;td&gt;88.3&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a class="link" href="https://livecodebench.github.io/" target="_blank" rel="noopener"
 &gt;LiveCodeBench v6&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;77.1&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a class="link" href="https://github.com/idavidrein/gpqa" target="_blank" rel="noopener"
 &gt;GPQA Diamond&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;82.3&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a class="link" href="https://mmmu-benchmark.github.io/" target="_blank" rel="noopener"
 &gt;MMMU Pro&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;73.8&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a class="link" href="https://codeforces.com/" target="_blank" rel="noopener"
 &gt;Codeforces ELO&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;1718&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;a class="link" href="https://ai.google.dev/gemma/docs/core" target="_blank" rel="noopener"
 &gt;Gemma 4 docs&lt;/a&gt; spell out the &lt;code&gt;enable_thinking=True&lt;/code&gt; flag and the recommendation to drop thinking blocks from multi-turn history. Combine this with &lt;a class="link" href="https://github.com/google-ai-edge/LiteRT-LM/releases/tag/v0.11.0" target="_blank" rel="noopener"
 &gt;LiteRT-LM v0.11.0&lt;/a&gt; shipping in the same week with Gemma-4 &lt;a class="link" href="https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/" target="_blank" rel="noopener"
 &gt;Multi-token Prediction&lt;/a&gt; for 2× mobile-GPU decode, and Google has cloud weights + edge runtime + decode acceleration all aligned in a single quarter.&lt;/p&gt;
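&lt;p&gt;The history-pruning recommendation is easy to wire in. A minimal sketch, assuming the thinking block is delimited by &lt;code&gt;&amp;lt;think&amp;gt;...&amp;lt;/think&amp;gt;&lt;/code&gt; tags (the actual delimiter is whatever Gemma 4&amp;rsquo;s chat template emits, so check the model card):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import re

# ASSUMED tag format; substitute whatever the chat template actually emits.
THINK_RE = re.compile(r"&amp;lt;think&amp;gt;.*?&amp;lt;/think&amp;gt;\s*", re.DOTALL)

def strip_thinking(history):
    """Drop thinking blocks from prior assistant turns before re-prompting."""
    return [
        {**m, "content": THINK_RE.sub("", m["content"])}
        if m["role"] == "assistant" else m
        for m in history
    ]

history = [
    {"role": "user", "content": "What is 17 * 24?"},
    {"role": "assistant", "content": "&amp;lt;think&amp;gt;17*24 = 408&amp;lt;/think&amp;gt;408"},
]
print(strip_thinking(history))  # thinking block removed, final answer kept
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;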
&lt;h2 id="3-qwen-36-35b-a3b--256-experts-1m-context"&gt;3. Qwen 3.6 35B-A3B — 256 experts, 1M context
&lt;/h2&gt;&lt;p&gt;The &lt;a class="link" href="https://huggingface.co/Qwen" target="_blank" rel="noopener"
 &gt;Alibaba Qwen team&lt;/a&gt; keeps a roughly six-month release tempo: &lt;a class="link" href="https://qwenlm.github.io/blog/qwen2/" target="_blank" rel="noopener"
 &gt;Qwen2&lt;/a&gt; → &lt;a class="link" href="https://qwenlm.github.io/blog/qwen2.5/" target="_blank" rel="noopener"
 &gt;Qwen2.5&lt;/a&gt; → &lt;a class="link" href="https://qwenlm.github.io/blog/qwen3/" target="_blank" rel="noopener"
 &gt;Qwen3&lt;/a&gt; → Qwen3.5 → Qwen3.6. The &lt;a class="link" href="https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF" target="_blank" rel="noopener"
 &gt;Qwen 3.6 35B-A3B&lt;/a&gt; card shows the most aggressive MoE design of the generation.&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Field&lt;/th&gt;
 &lt;th&gt;Value&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Total params&lt;/td&gt;
 &lt;td&gt;35B&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Active params&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;3B&lt;/strong&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Experts&lt;/td&gt;
 &lt;td&gt;&lt;strong&gt;256&lt;/strong&gt; (8 routed + 1 shared)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Layers&lt;/td&gt;
 &lt;td&gt;40&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Hidden dim&lt;/td&gt;
 &lt;td&gt;2048&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Context&lt;/td&gt;
 &lt;td&gt;262K native / &lt;strong&gt;&lt;a class="link" href="https://arxiv.org/abs/2309.00071" target="_blank" rel="noopener"
 &gt;YaRN&lt;/a&gt; extension to 1,010K&lt;/strong&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The attention layout reads as &lt;code&gt;10 × (3 × (Gated DeltaNet → MoE) → 1 × (Gated Attention → MoE))&lt;/code&gt;. &lt;a class="link" href="https://arxiv.org/abs/2412.06464" target="_blank" rel="noopener"
 &gt;Gated DeltaNet&lt;/a&gt; uses 32 V-heads / 16 QK-heads / 128 head-dim; gated attention uses 16 Q-heads / 2 KV-heads / 256 head-dim. &lt;strong&gt;A 3:1 mix of linear-time Mamba/DeltaNet-style mixers and full attention&lt;/strong&gt; — the cost advantage grows with context.&lt;/p&gt;
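&lt;p&gt;A quick sanity check on that layout arithmetic: expanding the pattern programmatically gives exactly the 40 published layers at a 3:1 mixer ratio.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# Expand the published block pattern:
# 10 repeats of (3 Gated DeltaNet blocks, then 1 gated-attention block),
# each followed by an MoE FFN.
schedule = (["gated_deltanet"] * 3 + ["gated_attention"]) * 10

print(len(schedule))                      # 40 layers, matching the card
print(schedule.count("gated_deltanet"))   # 30 linear-time mixer layers
print(schedule.count("gated_attention"))  # 10 full-attention layers
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Only those 10 attention layers keep a growing KV cache; the DeltaNet layers carry constant-size state, which is where the long-context cost advantage comes from.&lt;/p&gt;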
&lt;p&gt;Benchmarks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://www.swebench.com/" target="_blank" rel="noopener"
 &gt;SWE-bench Verified&lt;/a&gt; 73.4&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/TIGER-AI-Lab/MMLU-Pro" target="_blank" rel="noopener"
 &gt;MMLU-Pro&lt;/a&gt; 85.2&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://livecodebench.github.io/" target="_blank" rel="noopener"
 &gt;LiveCodeBench v6&lt;/a&gt; 80.4&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://mmmu-benchmark.github.io/" target="_blank" rel="noopener"
 &gt;MMMU&lt;/a&gt; 81.7 (vision)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Recommended inference engines: &lt;a class="link" href="https://github.com/sgl-project/sglang" target="_blank" rel="noopener"
 &gt;SGLang ≥0.5.10&lt;/a&gt;, &lt;a class="link" href="https://github.com/vllm-project/vllm" target="_blank" rel="noopener"
 &gt;vLLM ≥0.19.0&lt;/a&gt;, &lt;a class="link" href="https://github.com/kvcache-ai/ktransformers" target="_blank" rel="noopener"
 &gt;KTransformers&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="4-side-by-side--the-8b35b-class-is-now-moe"&gt;4. Side-by-side — the 8B–35B class is now MoE
&lt;/h2&gt;&lt;p&gt;Drop the three into one table and the pattern sharpens.&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Model&lt;/th&gt;
 &lt;th&gt;Total / Active&lt;/th&gt;
 &lt;th&gt;Experts&lt;/th&gt;
 &lt;th&gt;Context&lt;/th&gt;
 &lt;th&gt;Multimodal&lt;/th&gt;
 &lt;th&gt;Training infra&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a class="link" href="https://huggingface.co/Zyphra/ZAYA1-8B" target="_blank" rel="noopener"
 &gt;ZAYA1-8B&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;8.4B / 0.76B&lt;/td&gt;
 &lt;td&gt;— (SSM-MoE)&lt;/td&gt;
 &lt;td&gt;n/a&lt;/td&gt;
 &lt;td&gt;Text&lt;/td&gt;
 &lt;td&gt;AMD MI300X × 1,024&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a class="link" href="https://huggingface.co/google/gemma-4-26B-A4B-it" target="_blank" rel="noopener"
 &gt;Gemma 4 26B-A4B-it&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;25.2B / 3.8B&lt;/td&gt;
 &lt;td&gt;128 (8+1)&lt;/td&gt;
 &lt;td&gt;256K&lt;/td&gt;
 &lt;td&gt;Text+Image&lt;/td&gt;
 &lt;td&gt;&lt;a class="link" href="https://cloud.google.com/tpu" target="_blank" rel="noopener"
 &gt;TPU&lt;/a&gt; (internal)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a class="link" href="https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF" target="_blank" rel="noopener"
 &gt;Qwen 3.6 35B-A3B&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;35B / 3B&lt;/td&gt;
 &lt;td&gt;256 (8+1)&lt;/td&gt;
 &lt;td&gt;262K → 1M&lt;/td&gt;
 &lt;td&gt;Text+Image&lt;/td&gt;
 &lt;td&gt;Alibaba internal&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Active params cluster tightly at &lt;strong&gt;0.76B / 3B / 3.8B&lt;/strong&gt;. Both memory bandwidth and compute at inference time are sized for the 4B class — meaning &lt;strong&gt;running 35B-class weights at 4-bit on a single 24GB card is the normal flow now&lt;/strong&gt;, not the edge case.&lt;/p&gt;
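&lt;p&gt;The 24GB claim is simple arithmetic. A hedged back-of-envelope, ignoring KV cache and runtime buffers (4.5 bits/weight is a typical Q4_K_M average, not an exact figure):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# Weight memory for a 35B-total / 3B-active MoE at roughly Q4_K_M precision.
total_params, active_params = 35e9, 3e9
bits_per_weight = 4.5  # typical Q4_K_M average; varies per build

weights_gb = total_params * bits_per_weight / 8 / 1e9
print(f"weights in VRAM: ~{weights_gb:.1f} GB")  # ~19.7 GB, fits a 24 GB card

# Decode only reads the active experts, so per-token bandwidth is sized
# for the 3B-active class rather than the full 35B.
per_token_gb = active_params * bits_per_weight / 8 / 1e9
print(f"weights read per token: ~{per_token_gb:.2f} GB")  # ~1.69 GB
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;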
&lt;h2 id="5-unsloths-same-week-quantization-drop"&gt;5. Unsloth&amp;rsquo;s same-week quantization drop
&lt;/h2&gt;&lt;p&gt;&lt;a class="link" href="https://unsloth.ai/" target="_blank" rel="noopener"
 &gt;Unsloth&lt;/a&gt; ships &lt;a class="link" href="https://unsloth.ai/blog/dynamic-v2" target="_blank" rel="noopener"
 &gt;Dynamic 2.0 GGUF&lt;/a&gt; builds within days of any base release. The core idea: &lt;strong&gt;pick a different quantization type per layer&lt;/strong&gt;, dynamically. The result is closer to Q5_K_M accuracy at Q4_K_M file size, with lower KL divergence than &lt;a class="link" href="https://github.com/ggml-org/llama.cpp/pull/4861" target="_blank" rel="noopener"
 &gt;imatrix&lt;/a&gt; or &lt;a class="link" href="https://arxiv.org/abs/1712.05877" target="_blank" rel="noopener"
 &gt;QAT&lt;/a&gt; baselines on the &lt;a class="link" href="https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs" target="_blank" rel="noopener"
 &gt;Unsloth benchmarks&lt;/a&gt;.&lt;/p&gt;
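&lt;p&gt;The per-layer idea is easy to picture. The sketch below is &lt;em&gt;not&lt;/em&gt; Unsloth&amp;rsquo;s actual algorithm (theirs is driven by calibration-data error measurements); it only illustrates the shape of the selection loop, with made-up error numbers:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# Illustrative only: pick the cheapest quant type per layer whose measured
# calibration error stays under a budget. NOT Unsloth's real method.
QUANT_BPW = {"Q2_K": 2.6, "Q3_K_M": 3.9, "Q4_K_M": 4.8, "Q6_K": 6.6, "Q8_0": 8.5}
ERROR_BUDGET = 0.02  # made-up threshold

def pick_quant(layer_name, error_for):
    # Walk quant types from cheapest to most precise.
    for qtype in sorted(QUANT_BPW, key=QUANT_BPW.get):
        if error_for(layer_name, qtype) &amp;lt;= ERROR_BUDGET:
            return qtype
    return "Q8_0"  # sensitive layers stay high-precision

def toy_error(name, qtype):
    # Pretend expert FFNs tolerate low precision and attention does not.
    base = 0.01 if "expert" in name else 0.08
    return base / QUANT_BPW[qtype]

for layer in ["blk.0.attn", "blk.0.expert_ffn"]:
    print(layer, pick_quant(layer, toy_error))  # attn: Q4_K_M, expert: Q2_K
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;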
&lt;p&gt;&lt;a class="link" href="https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF" target="_blank" rel="noopener"
 &gt;gemma-4-26B-A4B-it-GGUF&lt;/a&gt; quant ladder:&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Target VRAM&lt;/th&gt;
 &lt;th&gt;Recommended quant&lt;/th&gt;
 &lt;th&gt;File size&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;12GB class&lt;/td&gt;
 &lt;td&gt;UD-IQ2_M / UD-Q2_K_XL&lt;/td&gt;
 &lt;td&gt;10.0–10.5 GB&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;16GB class&lt;/td&gt;
 &lt;td&gt;UD-IQ3_XXS / UD-Q3_K_M&lt;/td&gt;
 &lt;td&gt;11.4–12.7 GB&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;24GB class&lt;/td&gt;
 &lt;td&gt;UD-Q4_K_M / MXFP4_MOE&lt;/td&gt;
 &lt;td&gt;16.6–16.9 GB&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;32GB class&lt;/td&gt;
 &lt;td&gt;UD-Q5_K_M&lt;/td&gt;
 &lt;td&gt;21.2 GB&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;48GB+ workstation&lt;/td&gt;
 &lt;td&gt;UD-Q8_K_XL / BF16&lt;/td&gt;
 &lt;td&gt;27.6–50.5 GB&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;a class="link" href="https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF" target="_blank" rel="noopener"
 &gt;Qwen3.6-35B-A3B-GGUF&lt;/a&gt; follows the same ladder — from a 1-bit &lt;code&gt;UD-IQ1_M&lt;/code&gt; at 10 GB up to BF16 at 69.4 GB. &lt;strong&gt;A 35B-class model that fits in 10 GB&lt;/strong&gt; is the striking endpoint.&lt;/p&gt;
&lt;p&gt;The runtime matrix:&lt;/p&gt;
&lt;pre class="mermaid" style="visibility:hidden"&gt;flowchart LR
 GGUF["Unsloth Dynamic 2.0 GGUF"] --&gt; Llama["llama.cpp / llama-server"]
 GGUF --&gt; Ollama["Ollama"]
 GGUF --&gt; LM["LM Studio"]
 GGUF --&gt; Jan["Jan"]
 GGUF --&gt; vLLM["vLLM"]
 GGUF --&gt; Py["llama-cpp-python"]
 GGUF --&gt; Studio["Unsloth Studio"]&lt;/pre&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# llama.cpp&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;brew install llama.cpp
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;llama-server -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_M
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Ollama&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ollama run hf.co/unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_M
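&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Qwen 3.6 follows the same pattern (UD-Q4_K_M tag assumed from the quant ladder)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ollama run hf.co/unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M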
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="6-what-this-means-for-app-developers--target-the-quant-tier-not-the-fp16-reference"&gt;6. What this means for app developers — target the quant tier, not the FP16 reference
&lt;/h2&gt;&lt;p&gt;The real takeaway for the week is about deployment, not specs.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;MoE is no longer optional.&lt;/strong&gt; Every new model in the 8B–35B class is MoE. If your inference stack doesn&amp;rsquo;t have MoE-aware kernels (sparse expert dispatch, batched MoE GEMM), you don&amp;rsquo;t get the active-param win. &lt;a class="link" href="https://github.com/vllm-project/vllm" target="_blank" rel="noopener"
 &gt;vLLM&lt;/a&gt;, &lt;a class="link" href="https://github.com/sgl-project/sglang" target="_blank" rel="noopener"
 &gt;SGLang&lt;/a&gt;, and &lt;a class="link" href="https://github.com/ggerganov/llama.cpp" target="_blank" rel="noopener"
 &gt;llama.cpp&lt;/a&gt; all have MoE paths now — if you&amp;rsquo;re on a homegrown inference layer, this is the moment to switch.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Stop benchmarking on FP16/BF16.&lt;/strong&gt; The large majority of real-world deployments run &lt;a class="link" href="https://huggingface.co/docs/hub/gguf" target="_blank" rel="noopener"
 &gt;Q4_K_M&lt;/a&gt; or &lt;a class="link" href="https://www.microsoft.com/en-us/research/blog/mxfp4-bringing-fp4-precision-to-deep-learning/" target="_blank" rel="noopener"
 &gt;MXFP4&lt;/a&gt;. Re-run evals on the quantized weights. Selective quantization like &lt;a class="link" href="https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs" target="_blank" rel="noopener"
 &gt;Unsloth Dynamic 2.0&lt;/a&gt; narrows the gap, but it isn&amp;rsquo;t zero.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;256K–1M context is the new baseline.&lt;/strong&gt; Even with &lt;a class="link" href="https://arxiv.org/abs/2309.00071" target="_blank" rel="noopener"
 &gt;YaRN&lt;/a&gt; extensions, KV cache memory explodes — on a 24GB card running Qwen 3.6 35B-A3B at 1M context, the KV cache outweighs the weights (see the sketch after this list). &lt;a class="link" href="https://blog.vllm.ai/2023/06/20/vllm.html" target="_blank" rel="noopener"
 &gt;Paged attention&lt;/a&gt;, &lt;a class="link" href="https://docs.vllm.ai/en/latest/automatic_prefix_caching/apc.html" target="_blank" rel="noopener"
 &gt;prefix caching&lt;/a&gt;, and context pruning should be defaults.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Vendor lock-in is dissolving at the training layer.&lt;/strong&gt; ZAYA1 was trained on AMD MI300X, Gemma 4 on Google TPUs, Qwen 3.6 on Alibaba&amp;rsquo;s internal cluster — all ship in the same HF card format. Training infra fragments while inference infra (llama.cpp + Ollama + vLLM) consolidates.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
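&lt;p&gt;On point 3, the KV-cache claim is worth making concrete. In the sketch below, the 10 full-attention layers and the 2 KV-heads × 256 head-dim come from the Qwen 3.6 numbers in section 3; the bf16 cache dtype is an assumption.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# KV cache for Qwen3.6-35B-A3B at 1M tokens. Only the 10 gated-attention
# layers are counted; the 30 DeltaNet layers keep constant-size state.
layers, kv_heads, head_dim = 10, 2, 256
tokens, bytes_per = 1_000_000, 2  # bf16 cache assumed

kv_gb = 2 * layers * tokens * kv_heads * head_dim * bytes_per / 1e9
print(f"KV cache at 1M context: ~{kv_gb:.1f} GB")  # ~20.5 GB

# The 4-bit weights themselves are ~19.7 GB (35e9 * 4.5 bits / 8), so the
# cache really does outweigh the weights; hence paged attention and pruning.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;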
&lt;h2 id="insights"&gt;Insights
&lt;/h2&gt;&lt;p&gt;The first week of May 2026 is a small inflection point. Four things hardened into a de facto standard at once: &lt;strong&gt;1–4B active params, 8B–35B total, MoE, and same-week quantization&lt;/strong&gt;. ZAYA1-8B proved an AMD-native stack can produce a reasoning-SOTA model without NVIDIA; Gemma 4 26B-A4B-it pulled multimodal + 256K context down into a 26B-class MoE; Qwen 3.6 35B-A3B showed that 256 experts + a DeltaNet hybrid + 1M context is buildable. Unsloth had the Gemma 4 and Qwen 3.6 builds runnable on consumer hardware within days, and ZAYA1 serves through its vLLM fork. The action items for app developers are concrete: lock your evaluation to the quant tier you actually ship (e.g. UD-Q4_K_M), make sure the inference stack is MoE-aware, and budget context in KV-cache bytes rather than token counts. When the next batch ships in June — and it will — the same template will keep working.&lt;/p&gt;
&lt;h2 id="references"&gt;References
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Model cards&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://huggingface.co/Zyphra/ZAYA1-8B" target="_blank" rel="noopener"
 &gt;Zyphra/ZAYA1-8B&lt;/a&gt; · &lt;a class="link" href="https://huggingface.co/Zyphra/ZAYA1-reasoning-base" target="_blank" rel="noopener"
 &gt;ZAYA1-reasoning-base&lt;/a&gt; · &lt;a class="link" href="https://huggingface.co/Zyphra" target="_blank" rel="noopener"
 &gt;Zyphra collection&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://huggingface.co/google/gemma-4-26B-A4B-it" target="_blank" rel="noopener"
 &gt;google/gemma-4-26B-A4B-it&lt;/a&gt; · &lt;a class="link" href="https://ai.google.dev/gemma/docs/core" target="_blank" rel="noopener"
 &gt;Gemma 4 docs&lt;/a&gt; · &lt;a class="link" href="https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/" target="_blank" rel="noopener"
 &gt;Gemma 4 launch blog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF" target="_blank" rel="noopener"
 &gt;unsloth/gemma-4-26B-A4B-it-GGUF&lt;/a&gt; · &lt;a class="link" href="https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF" target="_blank" rel="noopener"
 &gt;unsloth/Qwen3.6-35B-A3B-GGUF&lt;/a&gt; · &lt;a class="link" href="https://huggingface.co/collections/unsloth/unsloth-dynamic-20-quants" target="_blank" rel="noopener"
 &gt;Unsloth Dynamic 2.0 Quants collection&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Tech reports / blogs&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://www.zyphra.com/post/zaya1-8b" target="_blank" rel="noopener"
 &gt;Zyphra: ZAYA1-8B blog&lt;/a&gt; · &lt;a class="link" href="https://arxiv.org/abs/2605.05365" target="_blank" rel="noopener"
 &gt;ZAYA1 arXiv&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/" target="_blank" rel="noopener"
 &gt;Google: Multi-token Prediction for Gemma 4&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://unsloth.ai/blog/dynamic-v2" target="_blank" rel="noopener"
 &gt;Unsloth: Dynamic v2.0 GGUFs&lt;/a&gt; · &lt;a class="link" href="https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs" target="_blank" rel="noopener"
 &gt;Dynamic 2.0 documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://venturebeat.com/technology/meet-zaya1-8b-a-super-efficient-open-reasoning-model-trained-on-amd-instinct-mi300-gpus/" target="_blank" rel="noopener"
 &gt;VentureBeat: ZAYA1-8B on MI300X&lt;/a&gt; · &lt;a class="link" href="https://www.hpcwire.com/aiwire/2026/05/07/zyphra-releases-zaya1-8b-reasoning-model/" target="_blank" rel="noopener"
 &gt;HPCWire: Zyphra Releases ZAYA1-8B&lt;/a&gt; · &lt;a class="link" href="https://hothardware.com/news/amd-zyphra-gpu-cluster-gives-birth-zaya-1-moe-ai-model" target="_blank" rel="noopener"
 &gt;HotHardware: AMD Zyphra GPU Cluster&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Runtimes / inference stacks&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/ggerganov/llama.cpp" target="_blank" rel="noopener"
 &gt;llama.cpp&lt;/a&gt; · &lt;a class="link" href="https://ollama.com/" target="_blank" rel="noopener"
 &gt;Ollama&lt;/a&gt; · &lt;a class="link" href="https://lmstudio.ai/" target="_blank" rel="noopener"
 &gt;LM Studio&lt;/a&gt; · &lt;a class="link" href="https://jan.ai/" target="_blank" rel="noopener"
 &gt;Jan&lt;/a&gt; · &lt;a class="link" href="https://unsloth.ai/" target="_blank" rel="noopener"
 &gt;Unsloth Studio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/vllm-project/vllm" target="_blank" rel="noopener"
 &gt;vLLM&lt;/a&gt; · &lt;a class="link" href="https://github.com/sgl-project/sglang" target="_blank" rel="noopener"
 &gt;SGLang&lt;/a&gt; · &lt;a class="link" href="https://github.com/kvcache-ai/ktransformers" target="_blank" rel="noopener"
 &gt;KTransformers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/Zyphra/vllm" target="_blank" rel="noopener"
 &gt;Zyphra vLLM fork&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Background reading&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://arxiv.org/abs/2309.00071" target="_blank" rel="noopener"
 &gt;YaRN paper&lt;/a&gt; · &lt;a class="link" href="https://arxiv.org/abs/2412.06464" target="_blank" rel="noopener"
 &gt;Gated DeltaNet paper&lt;/a&gt; · &lt;a class="link" href="https://arxiv.org/abs/2211.17192" target="_blank" rel="noopener"
 &gt;Speculative decoding&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.marktechpost.com/2024/04/17/meet-zamba-7b-zyphras-novel-ai-model-thats-small-in-size-and-big-on-performance/" target="_blank" rel="noopener"
 &gt;Zamba-7B (prior Zyphra model)&lt;/a&gt; · &lt;a class="link" href="https://github.com/Zyphra/BlackMamba" target="_blank" rel="noopener"
 &gt;BlackMamba&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item></channel></rss>