
Open-Weight Models, First Week of May 2026 — Zyphra ZAYA1, Gemma 4 26B A4B, Qwen 3.6 35B A3B

A snapshot of three open-weight MoE releases from the first week of May 2026, compared on parameters, active experts, and quantization

Overview

The first week of May 2026 was a quietly heavy week for open weights. Zyphra shipped ZAYA1-8B — 8B-class reasoning with only 760M active parameters. Google released Gemma 4 26B-A4B-it, a 25.2B / 3.8B-active multimodal MoE. Alibaba’s Qwen team followed with Qwen 3.6 35B-A3B, 35B total / 3B active. And on top of that, Unsloth had Gemma 4 GGUF and Qwen 3.6 GGUF builds running on llama.cpp and Ollama within days. Zoom out and the pattern is clear: the 8B–35B class is now MoE with 1–4B active, and quantized builds ship at the same time as the reference weights.

1. Zyphra ZAYA1-8B — 760M active, the first AMD-native end-to-end result

Zyphra has been on the SSM-attention hybrid track since Zamba-7B and BlackMamba in 2024, hit unicorn status with a $110M Series A in June 2025, and shipped ZAYA1-8B on 2026-05-06. The base is published separately as ZAYA1-reasoning-base.

The numbers:

| Field | Value |
| --- | --- |
| Total params | 8.4B |
| Active params | 760M |
| License | Apache 2.0 |
| Training infra | AMD Instinct MI300X × 1,024 + AMD Pensando Pollara networking, IBM Cloud |
| Tech report | arXiv:2605.05365 · Zyphra blog |

ZAYA1-8B posts 71.6 on HMMT Feb 2026 and 89.1 on AIME 2026. For comparison on the same chart: Qwen3-4B lands at 77.5, Gemma-4-E4B at 50.3. The claim is a sub-1B-active model beating 4B-class peers, made possible by post-training reasoning plus an SSM-MoE hybrid backbone. Serving is one line via the Zyphra vLLM fork.

pip install "vllm @ git+https://github.com/Zyphra/vllm.git@zaya1-pr"
vllm serve Zyphra/ZAYA1-8B --port 8010 \
   --mamba-cache-dtype float32 --dtype bfloat16 \
   --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser zaya_xml
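
For a quick sanity check of the endpoint, the server speaks vLLM's standard OpenAI-compatible API; a minimal sketch (the /v1 path and dummy API key are vLLM defaults, not something from the ZAYA1 card):

# Query the vLLM server started above via the OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8010/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="Zyphra/ZAYA1-8B",
    messages=[{"role": "user", "content": "How many primes are below 100?"}],
    max_tokens=2048,
)
print(resp.choices[0].message.content)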

The industrial significance: this is the first reasoning-SOTA-class open model trained end-to-end without NVIDIA H100 — both VentureBeat and HPCWire lead with that angle.

2. Gemma 4 26B-A4B-it — Google’s MoE multimodal

Google DeepMind’s Gemma line moved fast: Gemma 1 (Feb 2024) → Gemma 2 → Gemma 3 → Gemma 4. Gemma 4 26B-A4B-it is the first official MoE entry in the family.

| Field | Value |
| --- | --- |
| Total params | 25.2B |
| Active params | 3.8B |
| Experts | 8 active of 128 + 1 shared |
| Layers | 30 |
| Context | 256K tokens |
| Vocab | 262K |
| Modalities | Text + Image (variable resolution) |
| Training cutoff | 2025-01 |
| Languages | 140+ trained, 35+ supported |
| License | Apache 2.0 |

The architecture is interesting: local sliding window attention (1024) + a final global-attention layer, unified KVs in the global layer, plus proportional RoPE (p-RoPE) to make the 256K window work. The vision encoder is ~550M and the token budget is configurable across 70/140/280/560/1120, exposing the latency-quality trade-off directly to the caller.
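
To see why the sliding-window split matters at 256K, here is a back-of-envelope KV-cache comparison. The card does not publish KV head count or head dim, so the GQA shape below is an assumption chosen only to show the order of magnitude, and it takes the description literally as 29 local layers plus one global layer:

# Illustrative only: assumed KV shape, fp16 cache, 256K context.
layers, window, context = 30, 1024, 256_000
kv_heads, head_dim, bytes_per = 4, 256, 2   # assumption, not from the card

def kv_gb(tokens_cached, n_layers):
    # K and V tensors, per layer, per cached token
    return 2 * n_layers * kv_heads * head_dim * bytes_per * tokens_cached / 1e9

all_global = kv_gb(context, layers)                      # if every layer were global
hybrid     = kv_gb(window, layers - 1) + kv_gb(context, 1)  # 29 sliding-window + 1 global
print(f"all-global: {all_global:.1f} GB, local+global: {hybrid:.1f} GB")

More than an order of magnitude apart under these assumptions; the real numbers depend on the actual KV shape, but the shape of the saving is the point.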

Benchmarks (instruct):

| Benchmark | Score |
| --- | --- |
| MMLU Pro | 82.6 |
| AIME 2026 (no tools) | 88.3 |
| LiveCodeBench v6 | 77.1 |
| GPQA Diamond | 82.3 |
| MMMU Pro | 73.8 |
| Codeforces ELO | 1718 |

Gemma 4 docs spell out the enable_thinking=True flag and the recommendation to drop thinking blocks from multi-turn history. Combine this with LiteRT-LM v0.11.0 shipping in the same week with Gemma-4 Multi-token Prediction for 2× mobile-GPU decode, and Google has cloud weights + edge runtime + decode acceleration all aligned in a single quarter.
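
What the thinking-history recommendation looks like in code, as a sketch: the enable_thinking kwarg is assumed to pass through apply_chat_template the way the docs describe, and the <think>…</think> delimiters plus the repo id are assumptions, not verified against the released template.

import re
from transformers import AutoTokenizer

# Assumed repo id; adjust to the actual Hub name.
tok = AutoTokenizer.from_pretrained("google/gemma-4-26b-a4b-it")

def strip_thinking(text: str) -> str:
    # Drop <think>...</think> spans so old reasoning never re-enters the context.
    return re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL)

previous_reply = "<think>try p/q in lowest terms...</think>Suppose sqrt(2) were rational..."
history = [
    {"role": "user", "content": "Prove that sqrt(2) is irrational."},
    {"role": "assistant", "content": strip_thinking(previous_reply)},
    {"role": "user", "content": "Now do sqrt(3)."},
]
prompt = tok.apply_chat_template(
    history, tokenize=False, add_generation_prompt=True,
    enable_thinking=True,   # thinking on for the new turn only
)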

3. Qwen 3.6 35B-A3B — 256 experts, 1M context

The Alibaba Qwen team keeps a roughly six-month release tempo: Qwen2 → Qwen2.5 → Qwen3 → Qwen3.5 → Qwen3.6. The Qwen 3.6 35B-A3B card shows the most aggressive MoE design of the generation.

| Field | Value |
| --- | --- |
| Total params | 35B |
| Active params | 3B |
| Experts | 256 (8 routed + 1 shared) |
| Layers | 40 |
| Hidden dim | 2048 |
| Context | 262K native / YaRN extension to 1,010K |

The attention layout reads as 10 × (3 × (Gated DeltaNet → MoE) → 1 × (Gated Attention → MoE)). Gated DeltaNet uses 32 V-heads / 16 QK-heads / 128 head-dim; gated attention uses 16 Q-heads / 2 KV-heads / 256 head-dim. A 3:1 mix of linear-time Mamba/DeltaNet-style mixers and full attention — the cost advantage grows with context.
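
Spelled out as a quick sketch, that layout resolves to a 40-layer schedule in which only a quarter of the layers carry a growing KV cache:

# 10 blocks of (3 × Gated DeltaNet → MoE, then 1 × Gated Attention → MoE)
schedule = (["gated_deltanet"] * 3 + ["gated_attention"]) * 10

assert len(schedule) == 40                        # matches the 40-layer count
assert schedule.count("gated_attention") == 10    # 3:1 linear-mixer : full attention
# The 30 DeltaNet layers keep a fixed-size recurrent state regardless of context;
# only the 10 attention layers grow a KV cache with sequence length.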

Benchmarks:

Recommended inference engines: SGLang ≥0.5.10, vLLM ≥0.19.0, KTransformers.

4. Side-by-side — the 8B–35B class is now MoE

Drop the three into one table and the pattern sharpens.

| Model | Total / Active | Experts | Context | Multimodal | Training infra |
| --- | --- | --- | --- | --- | --- |
| ZAYA1-8B | 8.4B / 0.76B | — (SSM-MoE) | n/a | Text | AMD MI300X × 1,024 |
| Gemma 4 26B-A4B-it | 25.2B / 3.8B | 128 (8+1) | 256K | Text+Image | TPU (internal) |
| Qwen 3.6 35B-A3B | 35B / 3B | 256 (8+1) | 262K → 1M | Text+Image | Alibaba internal |

Active params cluster tightly at 0.76B / 3B / 3.8B. Both memory bandwidth and compute at inference time are sized for the 4B class — meaning running 35B-class weights at 4-bit on a single 24GB card is the normal flow now, not the edge case.
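
"Sized for the 4B class" is easiest to see as a decode-bandwidth back-of-envelope; the bandwidth figure and effective bits-per-weight below are illustrative assumptions, not measurements:

# Every decoded token streams the active parameters through memory once.
active_params   = 3.8e9    # Gemma 4 26B-A4B active path
bits_per_weight = 4.5      # ballpark for a Q4_K_M-class quant (assumption)
bandwidth       = 1.0e12   # ~1 TB/s, a 24GB-class consumer GPU (assumption)

bytes_per_token = active_params * bits_per_weight / 8
print(f"~{bandwidth / bytes_per_token:.0f} tok/s ceiling (ignores KV cache and kernels)")
# Roughly 470 tok/s; a dense 26B at the same quant would be ~6-7x lower.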

5. Unsloth’s same-week quantization drop

Unsloth ships Dynamic 2.0 GGUF builds within days of any base release. The core idea: pick a different quantization type per layer, dynamically. The result is closer to Q5_K_M accuracy at Q4_K_M file size, with lower KL divergence than imatrix or QAT baselines on the Unsloth benchmarks.

gemma-4-26B-A4B-it-GGUF quant ladder:

| Target VRAM | Recommended quant | File size |
| --- | --- | --- |
| 12GB class | UD-IQ2_M / UD-Q2_K_XL | 10.0–10.5 GB |
| 16GB class | UD-IQ3_XXS / UD-Q3_K_M | 11.4–12.7 GB |
| 24GB class | UD-Q4_K_M / MXFP4_MOE | 16.6–16.9 GB |
| 32GB class | UD-Q5_K_M | 21.2 GB |
| 48GB+ workstation | UD-Q8_K_XL / BF16 | 27.6–50.5 GB |

Qwen3.6-35B-A3B-GGUF follows the same ladder — from a 1-bit UD-IQ1_M at 10 GB up to BF16 at 69.4 GB. A 35B-class model that fits in 10 GB is the striking endpoint.

The runtime matrix:

# llama.cpp
brew install llama.cpp
llama-server -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_M

# Ollama
ollama run hf.co/unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_M

6. What this means for app developers — target the quant tier, not the FP16 reference

The real takeaway for the week is about deployment, not specs.

  1. MoE is no longer optional. Every new model in the 8B–35B class is MoE. If your inference stack doesn’t have MoE-aware kernels (sparse expert dispatch, batched MoE GEMM), you don’t get the active-param win. vLLM, SGLang, and llama.cpp all have MoE paths now — if you’re on a homegrown inference layer, this is the moment to switch.

  2. Stop benchmarking on FP16/BF16. ~90% of real-world deployments run Q4_K_M or MXFP4. Re-run evals on the quantized weights. Selective quantization like Unsloth Dynamic 2.0 narrows the gap, but it isn’t zero.

  3. 256K–1M context is the new baseline. Even with YaRN extensions, KV cache memory explodes: on a 24GB card running Qwen 3.6 35B-A3B at 1M context, the KV cache outweighs the quantized weights (worked numbers after this list). Paged attention, prefix caching, and context pruning should be defaults.

  4. Vendor lock-in is dissolving at the training layer. ZAYA1 was trained on AMD MI300X, Gemma 4 on Google TPUs, Qwen 3.6 on Alibaba’s internal cluster — all ship in the same HF card format. Training infra fragments while inference infra (llama.cpp + Ollama + vLLM) consolidates.
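
A worked version of the KV-cache point from item 3, using the layer counts and head dims from the Qwen 3.6 card above; the fp16 cache dtype and the ~4.5 effective bits-per-weight are assumptions:

# Only the 10 gated-attention layers (2 KV heads × 256 head dim) grow with
# context; the 30 DeltaNet layers hold constant-size state.
attn_layers, kv_heads, head_dim = 10, 2, 256
bytes_per, tokens = 2, 1_010_000                 # fp16 cache, ~1M-token window

kv_cache_gb   = 2 * attn_layers * kv_heads * head_dim * bytes_per * tokens / 1e9
weights_q4_gb = 35e9 * 4.5 / 8 / 1e9             # ballpark 4-bit weight footprint

print(f"KV cache @ 1M ctx: {kv_cache_gb:.1f} GB")   # ~20.7 GB
print(f"Q4-ish weights   : {weights_q4_gb:.1f} GB") # ~19.7 GB
# Even with 30 of 40 layers KV-free, the cache alone rivals the quantized
# weights -- hence paged attention, prefix caching, and context pruning.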

Insights

The first week of May 2026 is a small inflection. Four things ossified into a standard simultaneously: 1–4B active params, 8B–35B total, MoE, and same-week quantization. ZAYA1-8B proved an AMD-native stack can produce a reasoning-SOTA model without NVIDIA; Gemma 4 26B-A4B-it pulled multimodal + 256K context down into a 26B-class MoE; Qwen 3.6 35B-A3B showed 256 experts + a DeltaNet hybrid + 1M context is buildable. Unsloth had all three runnable on consumer hardware within days. The action items for app developers are concrete: lock your evaluation to the quant tier (UD-Q4_K_M), make sure the inference stack is MoE-aware, and re-budget context in KV-cache memory rather than token counts. When the next batch ships in June — and it will — the same template will keep working.

References

Model cards

Tech reports / blogs

Runtimes / inference stacks

Background reading
