<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Granite on ICE-ICE-BEAR-BLOG</title><link>https://ice-ice-bear.github.io/tags/granite/</link><description>Recent content in Granite on ICE-ICE-BEAR-BLOG</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Mon, 04 May 2026 00:00:00 +0900</lastBuildDate><atom:link href="https://ice-ice-bear.github.io/tags/granite/index.xml" rel="self" type="application/rss+xml"/><item><title>Simon Willison's Granite 4.1 3B Pelican Gallery — Why All 21 Quantizations Flopped Together</title><link>https://ice-ice-bear.github.io/posts/2026-05-04-simonwillison-granite-pelican-benchmark/</link><pubDate>Mon, 04 May 2026 00:00:00 +0900</pubDate><guid>https://ice-ice-bear.github.io/posts/2026-05-04-simonwillison-granite-pelican-benchmark/</guid><description>&lt;h2 id="overview"&gt;Overview
&lt;/h2&gt;&lt;p&gt;&lt;a class="link" href="https://simonwillison.net/" target="_blank" rel="noopener"
 &gt;Simon Willison&lt;/a&gt; ran his signature prompt — &lt;em&gt;&amp;ldquo;Generate an SVG of a pelican riding a bicycle&amp;rdquo;&lt;/em&gt; — through 21 quantized variants of &lt;a class="link" href="https://huggingface.co/ibm-granite/granite-4.1-3b-instruct" target="_blank" rel="noopener"
 &gt;IBM Granite 4.1 3B&lt;/a&gt;, spanning 1.2GB to 6.34GB (51.3GB total). His verdict was one line: &lt;em&gt;&amp;ldquo;There&amp;rsquo;s no distinguishable pattern relating quality to size — they&amp;rsquo;re all pretty terrible!&amp;rdquo;&lt;/em&gt;. This post takes that gallery as a starting point to ask what informal benchmarks catch that the leaderboards miss, and where to look first if you actually want to measure the quantization-vs-quality curve.&lt;/p&gt;
&lt;pre class="mermaid" style="visibility:hidden"&gt;flowchart LR
 P["Prompt &amp;lt;br/&amp;gt; pelican on a bicycle"] --&gt; Q["Granite 4.1 3B &amp;lt;br/&amp;gt; 21 quant variants"]
 Q --&gt; S1["1.2GB ~ 6.34GB"]
 S1 --&gt; O["21 SVG outputs"]
 O --&gt; J["Simon's eyeball judgment"]
 J --&gt; R["No size-quality pattern &amp;lt;br/&amp;gt; all abstract shapes"]&lt;/pre&gt;&lt;h2 id="whats-the-svg-pelican-thing"&gt;What&amp;rsquo;s the SVG Pelican Thing
&lt;/h2&gt;&lt;p&gt;The &lt;a class="link" href="https://simonwillison.net/tags/pelican-riding-a-bicycle/" target="_blank" rel="noopener"
 &gt;pelican-riding-a-bicycle series&lt;/a&gt; is Simon&amp;rsquo;s personal informal benchmark, run against every new LLM as it lands. The prompt is one line.&lt;/p&gt;

 &lt;blockquote&gt;
 &lt;p&gt;&amp;ldquo;Generate an SVG of a pelican riding a bicycle.&amp;rdquo;&lt;/p&gt;

 &lt;/blockquote&gt;
&lt;p&gt;SVG forces a text model to emit coordinates, paths, and a viewBox directly — visual reasoning, but expressed as tokens. More importantly, the result &lt;strong&gt;renders into an image immediately&lt;/strong&gt;, so cross-model comparison is intuitive. Failure modes that don&amp;rsquo;t surface in &lt;a class="link" href="https://lmarena.ai/" target="_blank" rel="noopener"
 &gt;LMArena&lt;/a&gt; anonymous pair-voting or &lt;a class="link" href="https://paperswithcode.com/dataset/mmlu" target="_blank" rel="noopener"
 &gt;MMLU&lt;/a&gt; multiple choice — proportion, line continuity, part placement — show up plainly in a single SVG.&lt;/p&gt;
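&lt;p&gt;Because the output is structured markup, the eyeballing can also be supplemented by a cheap automated sanity check. The sketch below is not part of Simon&amp;rsquo;s setup; &lt;code&gt;svg_stats&lt;/code&gt; and the stand-in SVG are hypothetical, using only the Python standard library:&lt;/p&gt;

```python
import xml.etree.ElementTree as ET

def svg_stats(svg_text):
    """Quick structural summary of a model-emitted SVG string."""
    root = ET.fromstring(svg_text)
    # Strip any namespace prefix, since real model output usually carries xmlns.
    shapes = [el for el in root.iter()
              if el.tag.split("}")[-1] in ("path", "circle", "rect",
                                           "ellipse", "line", "polygon")]
    return {
        "has_viewbox": "viewBox" in root.attrib,
        "shape_count": len(shapes),
    }

# Tiny stand-in for a model's output; a real run would pass the raw SVG text.
root = ET.Element("svg", {"viewBox": "0 0 200 120"})
ET.SubElement(root, "circle", {"cx": "60", "cy": "90", "r": "20"})
ET.SubElement(root, "path", {"d": "M 40 50 L 90 30 L 120 55 Z"})
print(svg_stats(ET.tostring(root, encoding="unicode")))
# Prints {'has_viewbox': True, 'shape_count': 2}
```

&lt;p&gt;A check like this catches only &amp;ldquo;did it emit parseable SVG with some shapes&amp;rdquo;; whether those shapes resemble a pelican still needs human eyes.&lt;/p&gt;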
&lt;h2 id="the-experiment"&gt;The Experiment
&lt;/h2&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Item&lt;/th&gt;
 &lt;th&gt;Detail&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Target&lt;/td&gt;
 &lt;td&gt;&lt;a class="link" href="https://huggingface.co/ibm-granite/granite-4.1-3b-instruct" target="_blank" rel="noopener"
 &gt;IBM Granite 4.1 3B Instruct&lt;/a&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Variants&lt;/td&gt;
 &lt;td&gt;21 quantizations (1.2GB to 6.34GB, 51.3GB total)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Prompt&lt;/td&gt;
 &lt;td&gt;&amp;ldquo;Generate an SVG of a pelican riding a bicycle&amp;rdquo;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Output&lt;/td&gt;
 &lt;td&gt;21 SVGs laid out in one gallery page&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Judge&lt;/td&gt;
 &lt;td&gt;Simon Willison&amp;rsquo;s eyes&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The &lt;a class="link" href="https://simonwillison.net/2026/May/4/granite-41-3b-svg-pelican-gallery" target="_blank" rel="noopener"
 &gt;original gallery post&lt;/a&gt; lays all 21 out on one page.&lt;/p&gt;
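&lt;p&gt;Reproducing the setup is essentially one loop: same prompt, swap the weights, save one SVG per variant. A minimal harness sketch; the variant filenames are hypothetical stand-ins, and &lt;code&gt;generate&lt;/code&gt; is a pluggable callable that a real run would replace with an actual inference call (e.g. via llama-cpp-python) against the 21 GGUF files:&lt;/p&gt;

```python
from pathlib import Path

# Hypothetical quant filenames -- the actual 21 are listed in Simon's post.
VARIANTS = ["granite-4.1-3b.Q2_K.gguf",
            "granite-4.1-3b.Q4_K_M.gguf",
            "granite-4.1-3b.Q8_0.gguf"]
PROMPT = "Generate an SVG of a pelican riding a bicycle"

def run_gallery(generate, variants, out_dir="gallery"):
    """Run the same prompt through every quant variant, saving one SVG each.

    `generate(model_path, prompt)` is a stand-in for your inference call.
    """
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    paths = []
    for name in variants:
        svg_text = generate(name, PROMPT)
        target = out / (Path(name).stem + ".svg")
        target.write_text(svg_text)
        paths.append(target)
    return paths

# Stub generator so the harness can be exercised without any model weights.
files = run_gallery(lambda model, prompt: f"stub output for {model}", VARIANTS)
print([f.name for f in files])
```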
&lt;h2 id="the-result--simons-take"&gt;The Result — Simon&amp;rsquo;s Take
&lt;/h2&gt;
 &lt;blockquote&gt;
 &lt;p&gt;&lt;em&gt;&amp;ldquo;There&amp;rsquo;s no distinguishable pattern relating quality to size — they&amp;rsquo;re all pretty terrible!&amp;rdquo;&lt;/em&gt;&lt;/p&gt;

 &lt;/blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;No distinguishable pattern relating size to quality.&lt;/strong&gt; 1.2GB and 6.34GB land effectively on the same line.&lt;/li&gt;
&lt;li&gt;All 21 are abstract collections of shapes — neither pelican nor bicycle is clearly identifiable.&lt;/li&gt;
&lt;li&gt;Curiously, &lt;strong&gt;the smallest model produced the most recognizable bicycle&lt;/strong&gt;, and the largest produced the closest thing to a pelican — a hint that size-quality may not even be monotonic here.&lt;/li&gt;
&lt;li&gt;Simon wraps up by calling the exercise less interesting than he had hoped, and says he&amp;rsquo;ll retry it with a model that can actually draw.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="what-got-measured-and-what-didnt"&gt;What Got Measured (and What Didn&amp;rsquo;t)
&lt;/h2&gt;&lt;h3 id="1-the-quantization-curve-is-bounded-by-the-base-models-capability-ceiling"&gt;1. The quantization curve is bounded by the base model&amp;rsquo;s capability ceiling
&lt;/h3&gt;&lt;p&gt;A more than 5x memory range (1.2GB to 6.34GB) produced no meaningful difference in output quality. But the takeaway is &lt;strong&gt;not&lt;/strong&gt; &amp;ldquo;quantization is harmless.&amp;rdquo; The cleaner reading is: &lt;strong&gt;this base model is just weak at SVG pelicans&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;To measure quantization impact cleanly, the base needs to be strong enough on the task. If the base sits near the floor, no scheme — &lt;a class="link" href="https://github.com/intel/auto-round" target="_blank" rel="noopener"
 &gt;AutoRound&lt;/a&gt;, GGUF, AWQ, anything — will produce visible separation. &lt;strong&gt;Verify the capability ceiling before designing the quant benchmark.&lt;/strong&gt;&lt;/p&gt;
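&lt;p&gt;The floor effect can be made concrete with a toy model. The per-bit penalty numbers below are invented purely for illustration; the point is only that when the base score sits near zero, every quantization level clamps to the same floor:&lt;/p&gt;

```python
def observed_score(base_score, bits, penalty_per_lost_bit=0.04):
    """Hypothetical task score after quantizing from 16-bit down to `bits`."""
    degraded = base_score - (16 - bits) * penalty_per_lost_bit
    return max(0.0, round(degraded, 2))  # scores clamp at the task floor

for base in (0.05, 0.80):  # weak base vs. strong base on the task
    curve = {bits: observed_score(base, bits) for bits in (2, 4, 8, 16)}
    print(f"base={base}: {curve}")
```

&lt;p&gt;With a weak base, every point on the quant curve reads 0.0 and the benchmark measures nothing; with a strong base, the same penalties separate cleanly.&lt;/p&gt;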
&lt;h3 id="2-informal-benchmarks-complement-the-standard-leaderboards"&gt;2. Informal benchmarks complement the standard leaderboards
&lt;/h3&gt;&lt;p&gt;&lt;a class="link" href="https://lmarena.ai/" target="_blank" rel="noopener"
 &gt;LMArena&lt;/a&gt; pair-voting and &lt;a class="link" href="https://paperswithcode.com/dataset/mmlu" target="_blank" rel="noopener"
 &gt;MMLU&lt;/a&gt; measure token-level correctness or preference on text. Questions like &amp;ldquo;can this model lay out parts in 2D space&amp;rdquo; don&amp;rsquo;t surface there. The SVG pelican fills exactly that gap — &lt;strong&gt;not on any official leaderboard, but a sanity check everyone agrees on&lt;/strong&gt;.&lt;/p&gt;
&lt;h3 id="3-what-this-implies-for-the-granite-family"&gt;3. What this implies for the Granite family
&lt;/h3&gt;&lt;p&gt;&lt;a class="link" href="https://www.ibm.com/granite" target="_blank" rel="noopener"
 &gt;IBM Granite&lt;/a&gt; and the &lt;a class="link" href="https://www.ibm.com/products/watsonx-ai/foundation-models" target="_blank" rel="noopener"
 &gt;watsonx Granite lineup&lt;/a&gt; are positioned for enterprise RAG, tool calling, and coding. On that map, an SVG pelican is an out-of-distribution task — being weak there is almost expected. But placed next to mobile-first small-model lines like Google&amp;rsquo;s Gemma + LiteRT releases, it underlines the bigger pattern: &lt;strong&gt;at the 3B class, practical usefulness depends heavily on which family put its capability where&lt;/strong&gt;. Same parameter count, very different shapes of competence.&lt;/p&gt;
&lt;h2 id="insights"&gt;Insights
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;Informal benchmarks survive because they show, in a single image, the kind of failure a leaderboard score can&amp;rsquo;t render. The SVG pelican complements &lt;a class="link" href="https://paperswithcode.com/dataset/mmlu" target="_blank" rel="noopener"
 &gt;MMLU&lt;/a&gt; and &lt;a class="link" href="https://lmarena.ai/" target="_blank" rel="noopener"
 &gt;LMArena&lt;/a&gt;; it doesn&amp;rsquo;t replace them. You need both views to see a model&amp;rsquo;s strengths and weaknesses together.&lt;/li&gt;
&lt;li&gt;Quantization-vs-quality curves are bounded by the base model&amp;rsquo;s capability on the task. Before designing a quant benchmark, confirm the base sits well above the floor; otherwise &lt;a class="link" href="https://github.com/intel/auto-round" target="_blank" rel="noopener"
 &gt;AutoRound&lt;/a&gt; and friends just compress noise into smaller noise.&lt;/li&gt;
&lt;li&gt;The detail that the smallest variant drew the best bicycle is the genuinely interesting part: it questions the monotonic size-quality assumption itself, suggesting quant comparisons should be read as distributions, not point scores.&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.ibm.com/granite" target="_blank" rel="noopener"
 &gt;IBM Granite&lt;/a&gt; being weak at an out-of-distribution visual reasoning task is consistent with its enterprise targeting, which is why picking a 3B open model is really a question of &amp;ldquo;which family put its capability where.&amp;rdquo;&lt;/li&gt;
&lt;li&gt;An outside observer like Simon laying all 21 variants on one page performs a real service: it is a fast, shareable model card before any official benchmark numbers drop.&lt;/li&gt;
&lt;/ul&gt;
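&lt;p&gt;Reading quant comparisons as distributions rather than point scores would mean scoring several generations per variant and looking at spread, not a single draw. A sketch with invented eyeball scores (0&amp;ndash;10), using only the standard library:&lt;/p&gt;

```python
import statistics

# Hypothetical scores from five repeated generations per variant. The gallery's
# single draw per variant is one sample from each of these distributions.
runs = {
    "Q2_K":   [2, 5, 1, 4, 3],
    "Q4_K_M": [3, 3, 4, 2, 3],
    "Q8_0":   [4, 2, 3, 5, 1],
}

for name, scores in runs.items():
    mean = statistics.mean(scores)
    spread = statistics.stdev(scores)
    print(f"{name}: mean={mean:.1f}, stdev={spread:.1f}")
```

&lt;p&gt;Here all three variants tie on the mean (3.0) while differing in spread, so a single generation per variant can rank them in any order. That is exactly why one gallery draw can show the smallest model with the best bicycle without implying the size-quality curve runs backwards.&lt;/p&gt;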
&lt;h2 id="references"&gt;References
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Original gallery post&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://simonwillison.net/2026/May/4/granite-41-3b-svg-pelican-gallery" target="_blank" rel="noopener"
 &gt;Simon Willison: Granite 4.1 3B SVG Pelican Gallery (2026-05-04)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://simonwillison.net/tags/pelican-riding-a-bicycle/" target="_blank" rel="noopener"
 &gt;pelican-riding-a-bicycle series tag&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://simonwillison.net/" target="_blank" rel="noopener"
 &gt;Simon Willison&amp;rsquo;s Weblog&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;IBM Granite&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://huggingface.co/ibm-granite/granite-4.1-3b-instruct" target="_blank" rel="noopener"
 &gt;IBM Granite 4.1 3B Instruct (Hugging Face)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.ibm.com/granite" target="_blank" rel="noopener"
 &gt;IBM Granite official page&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.ibm.com/products/watsonx-ai/foundation-models" target="_blank" rel="noopener"
 &gt;watsonx foundation model lineup&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Related benchmark refs&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://lmarena.ai/" target="_blank" rel="noopener"
 &gt;LMArena (pairwise leaderboard)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://paperswithcode.com/dataset/mmlu" target="_blank" rel="noopener"
 &gt;MMLU (Papers with Code)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/intel/auto-round" target="_blank" rel="noopener"
 &gt;Intel AutoRound (quantization library)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item></channel></rss>