<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Llm Eval on ICE-ICE-BEAR-BLOG</title><link>https://ice-ice-bear.github.io/tags/llm-eval/</link><description>Recent content in Llm Eval on ICE-ICE-BEAR-BLOG</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Wed, 06 May 2026 00:00:00 +0900</lastBuildDate><atom:link href="https://ice-ice-bear.github.io/tags/llm-eval/index.xml" rel="self" type="application/rss+xml"/><item><title>Polaris MCFG — A License-Safe Metric-Compatible Font Generator, Plus the LLM Eval Rubric Thread Next to It</title><link>https://ice-ice-bear.github.io/posts/2026-05-06-polaris-mcfg-and-llm-eval-rubric/</link><pubDate>Wed, 06 May 2026 00:00:00 +0900</pubDate><guid>https://ice-ice-bear.github.io/posts/2026-05-06-polaris-mcfg-and-llm-eval-rubric/</guid><description>&lt;img src="https://ice-ice-bear.github.io/" alt="Featured image of post Polaris MCFG — A License-Safe Metric-Compatible Font Generator, Plus the LLM Eval Rubric Thread Next to It" /&gt;&lt;h2 id="overview"&gt;Overview
&lt;/h2&gt;&lt;p&gt;&lt;a class="link" href="https://github.com/PolarisOffice/polaris_mcfg" target="_blank" rel="noopener"
 &gt;PolarisOffice/polaris_mcfg&lt;/a&gt; appeared on 2026-04-26 — a tool that looks like it came out of the Polaris Office product team. It extracts &lt;strong&gt;only the layout metrics&lt;/strong&gt; from restricted fonts (think Hancom fonts, internal commercial fonts) and grafts them onto freely-licensed fonts like &lt;a class="link" href="https://fonts.google.com/noto/specimen/Noto&amp;#43;Sans" target="_blank" rel="noopener"
 &gt;Noto Sans&lt;/a&gt; and &lt;a class="link" href="https://github.com/orioncactus/pretendard" target="_blank" rel="noopener"
 &gt;Pretendard&lt;/a&gt; to produce a new font. The result: &lt;strong&gt;original line breaks and page boundaries preserved, license now safe&lt;/strong&gt;. What makes the chatroom timing interesting is that the conversation immediately around this share was about &lt;strong&gt;LLM evaluation rubrics&lt;/strong&gt; — two topics that look unrelated but both belong to production-grade engineering practice.&lt;/p&gt;
&lt;pre class="mermaid" style="visibility:hidden"&gt;graph TD
 Source["Source font.ttf &amp;lt;br/&amp;gt; (commercial/restricted)"] --&gt; Extract["mcfg extract"]
 Extract --&gt; Metrics["metrics.json &amp;lt;br/&amp;gt; advance/ascender/descender"]
 Free["Free font.ttf &amp;lt;br/&amp;gt; (Noto Sans/Pretendard)"] --&gt; Generate["mcfg generate"]
 Metrics --&gt; Generate
 Generate --&gt; Output["Polaris font.ttf &amp;lt;br/&amp;gt; OFL-safe"]
 Output --&gt; Validate["mcfg validate &amp;lt;br/&amp;gt; HarfBuzz render regression"]
 Validate --&gt; Pass["PASS &amp;lt;br/&amp;gt; advance widths match &amp;lt;br/&amp;gt; render within ±0.5 percent"]&lt;/pre&gt;&lt;h2 id="the-problem-it-solves"&gt;The Problem It Solves
&lt;/h2&gt;&lt;p&gt;Open a Hancom-authored .hwp or .docx in another environment and &lt;strong&gt;line breaks and page splits drift&lt;/strong&gt;. The visible glyph shapes aren&amp;rsquo;t the issue — the &lt;strong&gt;numeric metrics are&lt;/strong&gt;: advance width, ascender, descender, line gap. polaris_mcfg solves this with one clean cut: never touch the outline, only graft the numbers from one font onto another&amp;rsquo;s design.&lt;/p&gt;
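&lt;p&gt;Those numbers (advance width, ascender, descender, line gap) all live in plain OpenType tables, so the graft is concrete. The sketch below reads them with fontTools; the dict shape and file name are my own illustration, not necessarily the schema polaris_mcfg emits.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# Minimal sketch: read the metrics that drive line breaks and page splits.
# The dict layout is illustrative; polaris_mcfg's actual schema may differ.
import json
from fontTools.ttLib import TTFont

font = TTFont("SourceFont.ttf")  # hypothetical restricted source font
hhea = font["hhea"]
hmtx = font["hmtx"]

metrics = {
    "global": {
        "ascender": hhea.ascent,
        "descender": hhea.descent,
        "lineGap": hhea.lineGap,
        "unitsPerEm": font["head"].unitsPerEm,
    },
    # per-glyph (advance width, left side bearing); no outlines are read
    "advance": {name: hmtx[name] for name in font.getGlyphOrder()},
}

print(json.dumps(metrics["global"], indent=2))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;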
&lt;h2 id="the-clean-separation--license-safe-boundary"&gt;The Clean Separation — License-Safe Boundary
&lt;/h2&gt;&lt;p&gt;The data the tool handles is &lt;strong&gt;numbers only&lt;/strong&gt;. Glyph outlines are never extracted, never copied. The visible design of the output font is 100% from the free font, and so is its license. The standard there is the &lt;a class="link" href="https://openfontlicense.org/" target="_blank" rel="noopener"
 &gt;SIL Open Font License (OFL)&lt;/a&gt; 1.1 — finalized in 2007 by Victor Gaultney and Nicolas Spalinger at SIL International, untouched for nearly 20 years, the de facto free-license standard for the font industry. Both Noto Sans and Pretendard ship under OFL.&lt;/p&gt;
&lt;h2 id="cli"&gt;CLI
&lt;/h2&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Subcommand&lt;/th&gt;
 &lt;th&gt;Purpose&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;mcfg extract &amp;lt;font.ttf&amp;gt;&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;Metrics → JSON&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;mcfg compare a b&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;Diff two fonts (or two JSONs); text/json/html output&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;mcfg generate --metrics … --design …&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;Produce the synthesized font&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;mcfg validate &amp;lt;font&amp;gt; --against …&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;Verify the metrics actually match&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;mcfg extract NotoSansKR-Bold.ttf -o bold.json
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;mcfg generate &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --metrics bold.json &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --design NotoSansKR-Regular.ttf &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --output PolarisBoldMetrics-Regular.ttf &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --apply global,advance &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --license-text &lt;span class="s2"&gt;&amp;#34;SIL Open Font License 1.1&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;mcfg validate PolarisBoldMetrics-Regular.ttf &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --against NotoSansKR-Bold.ttf &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --render-default &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --render-tolerance-pct 0.5
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# → result: PASS (advance widths match, rendering within ±0.5%)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Validation runs through &lt;a class="link" href="https://harfbuzz.github.io/" target="_blank" rel="noopener"
 &gt;HarfBuzz&lt;/a&gt;, the de facto OpenType shaping engine — the only way to confirm the metric graft really worked is to render real text and compare pixels.&lt;/p&gt;
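&lt;p&gt;The core of that check is easy to sketch: shape the same string through both fonts with HarfBuzz (via the uharfbuzz Python bindings) and compare the accumulated advances. This is my own illustration of the idea, not the repo&amp;rsquo;s actual implementation; tolerance handling and true pixel comparison sit on top of it, and it assumes both fonts share the same unitsPerEm.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# Sketch of the idea behind mcfg validate: shape identical text with both
# fonts and compare total advance width. Illustration only, not the repo's code.
import uharfbuzz as hb

def total_advance(font_path, text):
    with open(font_path, "rb") as f:
        face = hb.Face(hb.Blob(f.read()))
    font = hb.Font(face)
    buf = hb.Buffer()
    buf.add_str(text)
    buf.guess_segment_properties()
    hb.shape(font, buf)
    return sum(pos.x_advance for pos in buf.glyph_positions)

sample = "한글 layout metrics test 1234"
a = total_advance("NotoSansKR-Bold.ttf", sample)
b = total_advance("PolarisBoldMetrics-Regular.ttf", sample)
diff_pct = abs(a - b) / a * 100
verdict = "PASS" if diff_pct &amp;lt;= 0.5 else "FAIL"
print(f"advance diff {diff_pct:.3f}% ({verdict})")
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;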
&lt;h2 id="milestones-and-license-responsibility"&gt;Milestones and License Responsibility
&lt;/h2&gt;&lt;p&gt;M1 (metric extractor + JSON schema) through M7 (packaging and docs) are all complete; 84 tests pass. Tool code is MIT; output fonts inherit the design font&amp;rsquo;s license (OFL or similar). One important caveat: &lt;strong&gt;whether the source font&amp;rsquo;s EULA permits metric extraction is the user&amp;rsquo;s responsibility&lt;/strong&gt; (Requirements.md §6). The tool is not an automated license-laundering machine — it&amp;rsquo;s an honest separation tool, and the README is explicit about that.&lt;/p&gt;
&lt;h2 id="the-llm-eval-rubric-thread-next-to-it"&gt;The LLM Eval Rubric Thread Next to It
&lt;/h2&gt;&lt;p&gt;Around the same time, an unexpectedly pointed take on LLM evaluation surfaced:&lt;/p&gt;

 &lt;blockquote&gt;
 &lt;p&gt;&amp;ldquo;Vector similarity and RAGAS metrics aren&amp;rsquo;t really suitable for grading. Free-form grading inevitably has to go through an LLM, and the standard practice is to write the evaluation rubric first and base everything on that.&amp;rdquo;&lt;/p&gt;

 &lt;/blockquote&gt;
&lt;p&gt;This single line compresses the production wisdom of LLM-as-Judge into three points. (1) &lt;a class="link" href="https://github.com/explodinggradients/ragas" target="_blank" rel="noopener"
 &gt;Vector similarity and RAGAS&lt;/a&gt; score semantic match but don&amp;rsquo;t constitute a grading standard. (2) Free-form grading has to go through an LLM; rule-based scoring can&amp;rsquo;t cover it. (3) Write the rubric first. &amp;ldquo;Tell me if this answer is good&amp;rdquo; doesn&amp;rsquo;t work as a prompt; you need an &lt;strong&gt;explicit grading scheme&lt;/strong&gt; before you&amp;rsquo;ll get consistency.&lt;/p&gt;
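&lt;p&gt;To make &amp;ldquo;write the rubric first&amp;rdquo; concrete, here is a minimal sketch of a rubric-driven judge. The rubric wording, scale, and judge model are placeholders of my own, not something quoted from the thread; the only real API used is the OpenAI chat completions client.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# Sketch of rubric-first LLM-as-Judge. Rubric text, scale, and model name are
# illustrative placeholders; swap in your own criteria and judge model.
import json
from openai import OpenAI

RUBRIC = """Score the ANSWER to the QUESTION on a 1-5 scale:
5 = factually correct, complete, directly answers the question
3 = partially correct or incomplete, but no fabricated claims
1 = wrong, off-topic, or contains fabricated claims
Return strict JSON: {"score": 1, "reason": "one sentence"} with the real score."""

def judge(question, answer):
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        temperature=0,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"QUESTION:\n{question}\n\nANSWER:\n{answer}"},
        ],
    )
    # Assumes the judge actually returns valid JSON; add retries in production.
    return json.loads(resp.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;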
&lt;p&gt;This matches exactly where every modern LLM eval framework — &lt;a class="link" href="https://github.com/confident-ai/deepeval" target="_blank" rel="noopener"
 &gt;DeepEval&lt;/a&gt;, &lt;a class="link" href="https://github.com/evidentlyai/evidently" target="_blank" rel="noopener"
 &gt;Evidently&lt;/a&gt;, &lt;a class="link" href="https://github.com/openai/evals" target="_blank" rel="noopener"
 &gt;OpenAI Evals&lt;/a&gt; — is heading. &lt;strong&gt;Rubric-driven judging is now the standard.&lt;/strong&gt;&lt;/p&gt;
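&lt;p&gt;As a concrete instance of a framework treating the rubric as a first-class object, DeepEval&amp;rsquo;s GEval metric takes the criteria as an explicit argument. A rough sketch follows; the criteria text and test case are my own, and the API details should be checked against DeepEval&amp;rsquo;s current docs.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# Rough sketch of a rubric as a first-class object via DeepEval's GEval metric.
# Criteria wording and test case are illustrative; verify against DeepEval docs.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

correctness = GEval(
    name="Correctness",
    criteria=(
        "Judge whether the actual output answers the input question "
        "factually and completely, penalizing any fabricated claims."
    ),
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

case = LLMTestCase(
    input="What license do Noto Sans and Pretendard ship under?",
    actual_output="Both are released under the SIL Open Font License 1.1.",
)
correctness.measure(case)
print(correctness.score, correctness.reason)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;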
&lt;h2 id="insights"&gt;Insights
&lt;/h2&gt;&lt;p&gt;That a font metric extractor and an LLM evaluation rubric thread emerge at the same moment signals something about the audience: &lt;strong&gt;these are people who are actually shipping product&lt;/strong&gt;. The two topics look unrelated but the underlying move is identical — both are about reducing intuition-dependent territory to explicit, verifiable rules. The font tool reduces &amp;ldquo;are these metrics compatible&amp;rdquo; to a HarfBuzz rendering regression. LLM-as-Judge reduces &amp;ldquo;is this answer good&amp;rdquo; to a rubric. Both topics demand an automated verification step before they&amp;rsquo;re production-ready, and that verification step ends up defining the tool&amp;rsquo;s identity. That polaris_mcfg has a &lt;code&gt;validate&lt;/code&gt; subcommand at all, and that LLM eval frameworks treat rubrics as first-class objects, are expressions of the same engineering instinct. In production, &amp;ldquo;it just works&amp;rdquo; is not the finish line: &lt;strong&gt;explicit criteria + automated verification + regression tracking&lt;/strong&gt; is the new bar, and these two topics point to the same place from very different starting points.&lt;/p&gt;
&lt;h2 id="references"&gt;References
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Tool repo and demo&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/PolarisOffice/polaris_mcfg" target="_blank" rel="noopener"
 &gt;PolarisOffice/polaris_mcfg&lt;/a&gt; — Metric-Compatible Font Generator (MIT, Python, 4 stars)&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://polarisoffice.github.io/polaris_mcfg/" target="_blank" rel="noopener"
 &gt;Demo / docs site&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Font ecosystem&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://harfbuzz.github.io/" target="_blank" rel="noopener"
 &gt;HarfBuzz&lt;/a&gt; — OpenType shaping engine&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://openfontlicense.org/" target="_blank" rel="noopener"
 &gt;SIL Open Font License&lt;/a&gt; — de facto free-license standard (OFL 1.1, 2007)&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.sil.org/" target="_blank" rel="noopener"
 &gt;SIL International&lt;/a&gt; — OFL stewards&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://fonts.google.com/noto/specimen/Noto&amp;#43;Sans" target="_blank" rel="noopener"
 &gt;Noto Sans&lt;/a&gt; and &lt;a class="link" href="https://github.com/orioncactus/pretendard" target="_blank" rel="noopener"
 &gt;Pretendard&lt;/a&gt; — OFL-licensed Hangul fonts&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;LLM evaluation methodology&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/explodinggradients/ragas" target="_blank" rel="noopener"
 &gt;RAGAS&lt;/a&gt; — RAG evaluation framework&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/confident-ai/deepeval" target="_blank" rel="noopener"
 &gt;DeepEval&lt;/a&gt; — LLM-as-Judge + rubric-based eval&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/evidentlyai/evidently" target="_blank" rel="noopener"
 &gt;Evidently&lt;/a&gt; — ML/LLM monitoring and eval&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/openai/evals" target="_blank" rel="noopener"
 &gt;OpenAI Evals&lt;/a&gt; — OpenAI&amp;rsquo;s official eval framework&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Simon Willison's Granite 4.1 3B Pelican Gallery — Why All 21 Quantizations Flopped Together</title><link>https://ice-ice-bear.github.io/posts/2026-05-04-simonwillison-granite-pelican-benchmark/</link><pubDate>Mon, 04 May 2026 00:00:00 +0900</pubDate><guid>https://ice-ice-bear.github.io/posts/2026-05-04-simonwillison-granite-pelican-benchmark/</guid><description>&lt;img src="https://ice-ice-bear.github.io/" alt="Featured image of post Simon Willison's Granite 4.1 3B Pelican Gallery — Why All 21 Quantizations Flopped Together" /&gt;&lt;h2 id="overview"&gt;Overview
&lt;/h2&gt;&lt;p&gt;&lt;a class="link" href="https://simonwillison.net/" target="_blank" rel="noopener"
 &gt;Simon Willison&lt;/a&gt; ran his signature prompt — &lt;em&gt;&amp;ldquo;Generate an SVG of a pelican riding a bicycle&amp;rdquo;&lt;/em&gt; — through 21 quantized variants of &lt;a class="link" href="https://huggingface.co/ibm-granite/granite-4.1-3b-instruct" target="_blank" rel="noopener"
 &gt;IBM Granite 4.1 3B&lt;/a&gt;, spanning 1.2GB to 6.34GB (51.3GB total). His verdict was one line: &lt;em&gt;&amp;ldquo;There&amp;rsquo;s no distinguishable pattern relating quality to size — they&amp;rsquo;re all pretty terrible!&amp;rdquo;&lt;/em&gt;. This post takes that gallery as a starting point to ask what informal benchmarks catch that the leaderboards miss, and where to look first if you actually want to measure the quantization-vs-quality curve.&lt;/p&gt;
&lt;pre class="mermaid" style="visibility:hidden"&gt;flowchart LR
 P["Prompt &amp;lt;br/&amp;gt; pelican on a bicycle"] --&gt; Q["Granite 4.1 3B &amp;lt;br/&amp;gt; 21 quant variants"]
 Q --&gt; S1["1.2GB ~ 6.34GB"]
 S1 --&gt; O["21 SVG outputs"]
 O --&gt; J["Simon's eyeball judgment"]
 J --&gt; R["No size-quality pattern &amp;lt;br/&amp;gt; all abstract shapes"]&lt;/pre&gt;&lt;h2 id="whats-the-svg-pelican-thing"&gt;What&amp;rsquo;s the SVG Pelican Thing
&lt;/h2&gt;&lt;p&gt;The &lt;a class="link" href="https://simonwillison.net/tags/pelican-riding-a-bicycle/" target="_blank" rel="noopener"
 &gt;pelican-riding-a-bicycle series&lt;/a&gt; is Simon&amp;rsquo;s personal informal benchmark, run against every new LLM as it lands. The prompt is one line.&lt;/p&gt;

 &lt;blockquote&gt;
 &lt;p&gt;&amp;ldquo;Generate an SVG of a pelican riding a bicycle.&amp;rdquo;&lt;/p&gt;

 &lt;/blockquote&gt;
&lt;p&gt;SVG forces a text model to emit coordinates, paths, and a viewBox directly — visual reasoning, but expressed as tokens. More importantly, the result &lt;strong&gt;renders into an image immediately&lt;/strong&gt;, so cross-model comparison is intuitive. Failure modes that don&amp;rsquo;t surface in &lt;a class="link" href="https://lmarena.ai/" target="_blank" rel="noopener"
 &gt;LMArena&lt;/a&gt; anonymous pair-voting or &lt;a class="link" href="https://paperswithcode.com/dataset/mmlu" target="_blank" rel="noopener"
 &gt;MMLU&lt;/a&gt; multiple choice — proportion, line continuity, part placement — show up plainly in a single SVG.&lt;/p&gt;
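&lt;p&gt;That &amp;ldquo;renders into an image immediately&amp;rdquo; step is also trivially scriptable. Below is a minimal sketch of turning a directory of model-emitted SVGs into uniformly sized PNGs for a side-by-side grid, using cairosvg; the file layout here is my own assumption, not how Simon builds his gallery.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# Sketch: render each model's SVG output to PNG for side-by-side comparison.
# The directory layout is hypothetical; Simon's gallery tooling may differ.
from pathlib import Path
import cairosvg

svg_dir = Path("pelican_svgs")   # one .svg per quantization variant
png_dir = Path("pelican_pngs")
png_dir.mkdir(exist_ok=True)

for svg_path in sorted(svg_dir.glob("*.svg")):
    out = png_dir / (svg_path.stem + ".png")
    cairosvg.svg2png(
        url=str(svg_path),
        write_to=str(out),
        output_width=400,        # uniform width makes eyeballing easier
    )
    print(f"rendered {out.name}")
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;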
&lt;h2 id="the-experiment"&gt;The Experiment
&lt;/h2&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Item&lt;/th&gt;
 &lt;th&gt;Detail&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Target&lt;/td&gt;
 &lt;td&gt;&lt;a class="link" href="https://huggingface.co/ibm-granite/granite-4.1-3b-instruct" target="_blank" rel="noopener"
 &gt;IBM Granite 4.1 3B Instruct&lt;/a&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Variants&lt;/td&gt;
 &lt;td&gt;21 quantizations (1.2GB to 6.34GB, 51.3GB total)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Prompt&lt;/td&gt;
 &lt;td&gt;&amp;ldquo;Generate an SVG of a pelican riding a bicycle&amp;rdquo;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Output&lt;/td&gt;
 &lt;td&gt;21 SVGs laid out in one gallery page&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Judge&lt;/td&gt;
 &lt;td&gt;Simon Willison&amp;rsquo;s eyes&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The &lt;a class="link" href="https://simonwillison.net/2026/May/4/granite-41-3b-svg-pelican-gallery" target="_blank" rel="noopener"
 &gt;original gallery post&lt;/a&gt; lays all 21 out on one page.&lt;/p&gt;
&lt;h2 id="the-result--simons-take"&gt;The Result — Simon&amp;rsquo;s Take
&lt;/h2&gt;
 &lt;blockquote&gt;
 &lt;p&gt;&lt;em&gt;&amp;ldquo;There&amp;rsquo;s no distinguishable pattern relating quality to size — they&amp;rsquo;re all pretty terrible!&amp;rdquo;&lt;/em&gt;&lt;/p&gt;

 &lt;/blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;No distinguishable pattern relating size to quality.&lt;/strong&gt; 1.2GB and 6.34GB land effectively on the same line.&lt;/li&gt;
&lt;li&gt;All 21 are abstract collections of shapes — neither pelican nor bicycle is clearly identifiable.&lt;/li&gt;
&lt;li&gt;Curiously, &lt;strong&gt;the smallest model produced the most recognizable bicycle&lt;/strong&gt;, and the largest produced the closest thing to a pelican — a hint that the size-quality relationship may not even be monotonic here.&lt;/li&gt;
&lt;li&gt;Simon wraps up by calling the result less interesting than he expected and saying he&amp;rsquo;ll retry with a model that can actually draw.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="what-got-measured-and-what-didnt"&gt;What Got Measured (and What Didn&amp;rsquo;t)
&lt;/h2&gt;&lt;h3 id="1-the-quantization-curve-is-bounded-by-the-base-models-capability-ceiling"&gt;1. The quantization curve is bounded by the base model&amp;rsquo;s capability ceiling
&lt;/h3&gt;&lt;p&gt;A 5x memory range (1.2GB to 6.34GB) and no meaningful difference in output quality. But the takeaway is &lt;strong&gt;not&lt;/strong&gt; &amp;ldquo;quantization is harmless.&amp;rdquo; The cleaner reading is: &lt;strong&gt;this base model is just weak at SVG pelicans&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;To measure quantization impact cleanly, the base needs to be strong enough on the task. If the base sits near the floor, no scheme — &lt;a class="link" href="https://github.com/intel/auto-round" target="_blank" rel="noopener"
 &gt;AutoRound&lt;/a&gt;, GGUF, AWQ, anything — will produce visible separation. &lt;strong&gt;Verify the capability ceiling before designing the quant benchmark.&lt;/strong&gt;&lt;/p&gt;
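&lt;p&gt;In practice that ceiling check is cheap: score the unquantized (or highest-precision) build on the task first, and only run the quant sweep if it clears a floor. Here is a sketch of that gate; the model runner and scorer are placeholders, since they depend on your setup and on whatever judge (human rubric, LLM-as-Judge, pixel diff) you trust.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# Sketch of a capability-ceiling gate before a quantization benchmark.
# generate_svg and score_svg are placeholders for your model runner and judge.
def quant_benchmark(variants, generate_svg, score_svg, floor=0.4, runs=5):
    # 1. Establish the ceiling with the full-precision base model.
    base_scores = [score_svg(generate_svg("base")) for _ in range(runs)]
    ceiling = sum(base_scores) / len(base_scores)
    if ceiling &amp;lt; floor:
        raise RuntimeError(
            f"Base model scores {ceiling:.2f}, below floor {floor}; "
            "a quant sweep would only compress noise into smaller noise."
        )
    # 2. Only now compare quantized variants, keeping per-run distributions
    #    rather than single point scores.
    return {
        v: [score_svg(generate_svg(v)) for _ in range(runs)]
        for v in variants
    }
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;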
&lt;h3 id="2-informal-benchmarks-complement-the-standard-leaderboards"&gt;2. Informal benchmarks complement the standard leaderboards
&lt;/h3&gt;&lt;p&gt;&lt;a class="link" href="https://lmarena.ai/" target="_blank" rel="noopener"
 &gt;LMArena&lt;/a&gt; pair-voting and &lt;a class="link" href="https://paperswithcode.com/dataset/mmlu" target="_blank" rel="noopener"
 &gt;MMLU&lt;/a&gt; measure token-level correctness or preference on text. Questions like &amp;ldquo;can this model lay out parts in 2D space&amp;rdquo; don&amp;rsquo;t surface there. The SVG pelican fills exactly that gap — &lt;strong&gt;not on any official leaderboard, but a sanity check everyone agrees on&lt;/strong&gt;.&lt;/p&gt;
&lt;h3 id="3-what-this-implies-for-the-granite-family"&gt;3. What this implies for the Granite family
&lt;/h3&gt;&lt;p&gt;&lt;a class="link" href="https://www.ibm.com/granite" target="_blank" rel="noopener"
 &gt;IBM Granite&lt;/a&gt; and the &lt;a class="link" href="https://www.ibm.com/products/watsonx-ai/foundation-models" target="_blank" rel="noopener"
 &gt;watsonx Granite lineup&lt;/a&gt; are positioned for enterprise RAG, tool calling, and coding. On that map, an SVG pelican is an out-of-distribution task — being weak there is almost expected. But placed next to mobile-first small-model lines like Google&amp;rsquo;s Gemma + LiteRT releases, it underlines the bigger pattern: &lt;strong&gt;at the 3B class, practical usefulness depends heavily on which family put its capability where&lt;/strong&gt;. Same parameter count, very different shapes of competence.&lt;/p&gt;
&lt;h2 id="insights"&gt;Insights
&lt;/h2&gt;&lt;p&gt;Informal benchmarks survive because they show, in a single image, the kind of failure a leaderboard score can&amp;rsquo;t render. The SVG pelican complements &lt;a class="link" href="https://paperswithcode.com/dataset/mmlu" target="_blank" rel="noopener"
 &gt;MMLU&lt;/a&gt; and &lt;a class="link" href="https://lmarena.ai/" target="_blank" rel="noopener"
 &gt;LMArena&lt;/a&gt;; it doesn&amp;rsquo;t replace them — you need both to see a model&amp;rsquo;s strengths and weaknesses together. Quantization-vs-quality curves are bounded by the base model&amp;rsquo;s capability on the task, so before designing a quant benchmark, confirm the base sits well above the floor; otherwise &lt;a class="link" href="https://github.com/intel/auto-round" target="_blank" rel="noopener"
 &gt;AutoRound&lt;/a&gt; and friends just compress noise into smaller noise. The detail that the smallest variant drew the best bicycle is what&amp;rsquo;s actually interesting — it questions the monotonic assumption itself, suggesting quant comparisons should be read as distributions, not point scores. &lt;a class="link" href="https://www.ibm.com/granite" target="_blank" rel="noopener"
 &gt;IBM Granite&lt;/a&gt; being weak at out-of-distribution visual reasoning is consistent with its enterprise targeting, which is why picking a 3B small open model is really a question of &amp;ldquo;which family put its capability where.&amp;rdquo; An external observer like Simon laying all 21 variants out on one page is doing a real service: it&amp;rsquo;s a fast, shareable model card before any official benchmark numbers drop.&lt;/p&gt;
&lt;h2 id="references"&gt;References
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Original gallery post&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://simonwillison.net/2026/May/4/granite-41-3b-svg-pelican-gallery" target="_blank" rel="noopener"
 &gt;Simon Willison: Granite 4.1 3B SVG Pelican Gallery (2026-05-04)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://simonwillison.net/tags/pelican-riding-a-bicycle/" target="_blank" rel="noopener"
 &gt;pelican-riding-a-bicycle series tag&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://simonwillison.net/" target="_blank" rel="noopener"
 &gt;Simon Willison&amp;rsquo;s Weblog&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;IBM Granite&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://huggingface.co/ibm-granite/granite-4.1-3b-instruct" target="_blank" rel="noopener"
 &gt;IBM Granite 4.1 3B Instruct (Hugging Face)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.ibm.com/granite" target="_blank" rel="noopener"
 &gt;IBM Granite official page&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.ibm.com/products/watsonx-ai/foundation-models" target="_blank" rel="noopener"
 &gt;watsonx foundation model lineup&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Related benchmark refs&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://lmarena.ai/" target="_blank" rel="noopener"
 &gt;LMArena (pairwise leaderboard)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://paperswithcode.com/dataset/mmlu" target="_blank" rel="noopener"
 &gt;MMLU (Papers with Code)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/intel/auto-round" target="_blank" rel="noopener"
 &gt;Intel AutoRound (quantization library)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item></channel></rss>