<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Llm Eval on ICE-ICE-BEAR-BLOG</title><link>https://ice-ice-bear.github.io/tags/llm-eval/</link><description>Recent content in Llm Eval on ICE-ICE-BEAR-BLOG</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Wed, 06 May 2026 00:00:00 +0900</lastBuildDate><atom:link href="https://ice-ice-bear.github.io/tags/llm-eval/index.xml" rel="self" type="application/rss+xml"/><item><title>Polaris MCFG — A License-Safe Metric-Compatible Font Generator, Plus the LLM Eval Rubric Thread Next to It</title><link>https://ice-ice-bear.github.io/posts/2026-05-06-polaris-mcfg-and-llm-eval-rubric/</link><pubDate>Wed, 06 May 2026 00:00:00 +0900</pubDate><guid>https://ice-ice-bear.github.io/posts/2026-05-06-polaris-mcfg-and-llm-eval-rubric/</guid><description>&lt;img src="https://ice-ice-bear.github.io/" alt="Featured image of post Polaris MCFG — A License-Safe Metric-Compatible Font Generator, Plus the LLM Eval Rubric Thread Next to It" /&gt;&lt;h2 id="overview"&gt;Overview
&lt;/h2&gt;&lt;p&gt;&lt;a class="link" href="https://github.com/PolarisOffice/polaris_mcfg" target="_blank" rel="noopener"
 &gt;PolarisOffice/polaris_mcfg&lt;/a&gt; appeared on 2026-04-26 — a tool that looks like it came out of the Polaris Office product team. It extracts &lt;strong&gt;only the layout metrics&lt;/strong&gt; from restricted fonts (think Hancom fonts, internal commercial fonts) and grafts them onto freely-licensed fonts like &lt;a class="link" href="https://fonts.google.com/noto/specimen/Noto&amp;#43;Sans" target="_blank" rel="noopener"
 &gt;Noto Sans&lt;/a&gt; and &lt;a class="link" href="https://github.com/orioncactus/pretendard" target="_blank" rel="noopener"
 &gt;Pretendard&lt;/a&gt; to produce a new font. The result: &lt;strong&gt;original line breaks and page boundaries preserved, license now safe&lt;/strong&gt;. What makes the chatroom timing interesting is that the conversation immediately around this share was about &lt;strong&gt;LLM evaluation rubrics&lt;/strong&gt; — two topics that look unrelated but both belong to production-grade engineering practice.&lt;/p&gt;
&lt;pre class="mermaid" style="visibility:hidden"&gt;graph TD
 Source["Source font.ttf &amp;lt;br/&amp;gt; (commercial/restricted)"] --&gt; Extract["mcfg extract"]
 Extract --&gt; Metrics["metrics.json &amp;lt;br/&amp;gt; advance/ascender/descender"]
 Free["Free font.ttf &amp;lt;br/&amp;gt; (Noto Sans/Pretendard)"] --&gt; Generate["mcfg generate"]
 Metrics --&gt; Generate
 Generate --&gt; Output["Polaris font.ttf &amp;lt;br/&amp;gt; OFL-safe"]
 Output --&gt; Validate["mcfg validate &amp;lt;br/&amp;gt; HarfBuzz render regression"]
 Validate --&gt; Pass["PASS &amp;lt;br/&amp;gt; advance widths match &amp;lt;br/&amp;gt; render within ±0.5 percent"]&lt;/pre&gt;&lt;h2 id="the-problem-it-solves"&gt;The Problem It Solves
&lt;/h2&gt;&lt;p&gt;Open a Hancom-authored .hwp or .docx in another environment and &lt;strong&gt;line breaks and page splits drift&lt;/strong&gt;. The visible glyph shapes aren&amp;rsquo;t the issue — the &lt;strong&gt;numeric metrics are&lt;/strong&gt;: advance width, ascender, descender, line gap. polaris_mcfg solves this with one clean cut: never touch the outline, only graft the numbers from one font onto another&amp;rsquo;s design.&lt;/p&gt;
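&lt;p&gt;Those numbers (advance width, ascender, descender, line gap) all live in plain OpenType tables, so the graft is concrete. The sketch below reads them with fontTools; the dict shape and file name are my own illustration, not necessarily the schema polaris_mcfg emits.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# Minimal sketch: read the metrics that drive line breaks and page splits.
# The dict layout is illustrative; polaris_mcfg's actual schema may differ.
import json
from fontTools.ttLib import TTFont

font = TTFont("SourceFont.ttf")  # hypothetical restricted source font
hhea = font["hhea"]
hmtx = font["hmtx"]

metrics = {
    "global": {
        "ascender": hhea.ascent,
        "descender": hhea.descent,
        "lineGap": hhea.lineGap,
        "unitsPerEm": font["head"].unitsPerEm,
    },
    # per-glyph (advance width, left side bearing); no outlines are read
    "advance": {name: hmtx[name] for name in font.getGlyphOrder()},
}

print(json.dumps(metrics["global"], indent=2))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;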
&lt;h2 id="the-clean-separation--license-safe-boundary"&gt;The Clean Separation — License-Safe Boundary
&lt;/h2&gt;&lt;p&gt;The data the tool handles is &lt;strong&gt;numbers only&lt;/strong&gt;. Glyph outlines are never extracted, never copied. The visible design of the output font is 100% from the free font, and so is its license. The standard there is the &lt;a class="link" href="https://openfontlicense.org/" target="_blank" rel="noopener"
 &gt;SIL Open Font License (OFL)&lt;/a&gt; 1.1 — finalized in 2007 by Victor Gaultney and Nicolas Spalinger at SIL International, untouched for nearly 20 years, the de facto free-license standard for the font industry. Both Noto Sans and Pretendard ship under OFL.&lt;/p&gt;
&lt;h2 id="cli"&gt;CLI
&lt;/h2&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Subcommand&lt;/th&gt;
 &lt;th&gt;Purpose&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;mcfg extract &amp;lt;font.ttf&amp;gt;&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;Metrics → JSON&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;mcfg compare a b&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;Diff two fonts (or two JSONs); text/json/html output&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;mcfg generate --metrics … --design …&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;Produce the synthesized font&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;mcfg validate &amp;lt;font&amp;gt; --against …&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;Verify the metrics actually match&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;mcfg extract NotoSansKR-Bold.ttf -o bold.json
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;mcfg generate &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --metrics bold.json &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --design NotoSansKR-Regular.ttf &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --output PolarisBoldMetrics-Regular.ttf &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --apply global,advance &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --license-text &lt;span class="s2"&gt;&amp;#34;SIL Open Font License 1.1&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;mcfg validate PolarisBoldMetrics-Regular.ttf &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --against NotoSansKR-Bold.ttf &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --render-default &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --render-tolerance-pct 0.5
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# → result: PASS (advance widths match, rendering within ±0.5%)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Validation runs through &lt;a class="link" href="https://harfbuzz.github.io/" target="_blank" rel="noopener"
 &gt;HarfBuzz&lt;/a&gt;, the de facto OpenType shaping engine — the only way to confirm the metric graft really worked is to render real text and compare pixels.&lt;/p&gt;
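&lt;p&gt;The core of that check is easy to sketch: shape the same string through both fonts with HarfBuzz (via the uharfbuzz Python bindings) and compare the accumulated advances. This is my own illustration of the idea, not the repo&amp;rsquo;s actual implementation; tolerance handling and true pixel comparison sit on top of it, and it assumes both fonts share the same unitsPerEm.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# Sketch of the idea behind mcfg validate: shape identical text with both
# fonts and compare total advance width. Illustration only, not the repo's code.
import uharfbuzz as hb

def total_advance(font_path, text):
    with open(font_path, "rb") as f:
        face = hb.Face(hb.Blob(f.read()))
    font = hb.Font(face)
    buf = hb.Buffer()
    buf.add_str(text)
    buf.guess_segment_properties()
    hb.shape(font, buf)
    return sum(pos.x_advance for pos in buf.glyph_positions)

sample = "한글 layout metrics test 1234"
a = total_advance("NotoSansKR-Bold.ttf", sample)
b = total_advance("PolarisBoldMetrics-Regular.ttf", sample)
diff_pct = abs(a - b) / a * 100
verdict = "PASS" if diff_pct &amp;lt;= 0.5 else "FAIL"
print(f"advance diff {diff_pct:.3f}% ({verdict})")
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;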
&lt;h2 id="milestones-and-license-responsibility"&gt;Milestones and License Responsibility
&lt;/h2&gt;&lt;p&gt;M1 (metric extractor + JSON schema) through M7 (packaging and docs) are all complete; 84 tests pass. Tool code is MIT; output fonts inherit the design font&amp;rsquo;s license (OFL or similar). One important caveat: &lt;strong&gt;whether the source font&amp;rsquo;s EULA permits metric extraction is the user&amp;rsquo;s responsibility&lt;/strong&gt; (Requirements.md §6). The tool is not an automated license-laundering machine — it&amp;rsquo;s an honest separation tool, and the README is explicit about that.&lt;/p&gt;
&lt;h2 id="the-llm-eval-rubric-thread-next-to-it"&gt;The LLM Eval Rubric Thread Next to It
&lt;/h2&gt;&lt;p&gt;Around the same time, an unexpectedly pointed take on LLM evaluation surfaced:&lt;/p&gt;

 &lt;blockquote&gt;
 &lt;p&gt;&amp;ldquo;Vector similarity and RAGAS metrics aren&amp;rsquo;t really suitable for grading. Free-form grading inevitably has to go through an LLM, and the standard practice is to write the evaluation rubric first and base everything on that.&amp;rdquo;&lt;/p&gt;

 &lt;/blockquote&gt;
&lt;p&gt;This single line compresses the production wisdom of LLM-as-Judge into three points. (1) &lt;a class="link" href="https://github.com/explodinggradients/ragas" target="_blank" rel="noopener"
 &gt;Vector similarity and RAGAS&lt;/a&gt; score semantic match but don&amp;rsquo;t constitute a grading standard. (2) Free-form grading has to go through an LLM; rule-based scoring can&amp;rsquo;t cover it. (3) Write the rubric first. &amp;ldquo;Tell me if this answer is good&amp;rdquo; doesn&amp;rsquo;t work as a prompt; you need an &lt;strong&gt;explicit grading scheme&lt;/strong&gt; before you&amp;rsquo;ll get consistency.&lt;/p&gt;
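&lt;p&gt;To make &amp;ldquo;write the rubric first&amp;rdquo; concrete, here is a minimal sketch of a rubric-driven judge. The rubric wording, scale, and judge model are placeholders of my own, not something quoted from the thread; the only real API used is the OpenAI chat completions client.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# Sketch of rubric-first LLM-as-Judge. Rubric text, scale, and model name are
# illustrative placeholders; swap in your own criteria and judge model.
import json
from openai import OpenAI

RUBRIC = """Score the ANSWER to the QUESTION on a 1-5 scale:
5 = factually correct, complete, directly answers the question
3 = partially correct or incomplete, but no fabricated claims
1 = wrong, off-topic, or contains fabricated claims
Return strict JSON: {"score": 1, "reason": "one sentence"} with the real score."""

def judge(question, answer):
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        temperature=0,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"QUESTION:\n{question}\n\nANSWER:\n{answer}"},
        ],
    )
    # Assumes the judge actually returns valid JSON; add retries in production.
    return json.loads(resp.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;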
&lt;p&gt;This matches exactly where every modern LLM eval framework — &lt;a class="link" href="https://github.com/confident-ai/deepeval" target="_blank" rel="noopener"
 &gt;DeepEval&lt;/a&gt;, &lt;a class="link" href="https://github.com/evidentlyai/evidently" target="_blank" rel="noopener"
 &gt;Evidently&lt;/a&gt;, &lt;a class="link" href="https://github.com/openai/evals" target="_blank" rel="noopener"
 &gt;OpenAI Evals&lt;/a&gt; — is heading. &lt;strong&gt;Rubric-driven judging is now the standard.&lt;/strong&gt;&lt;/p&gt;
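&lt;p&gt;As a concrete instance of a framework treating the rubric as a first-class object, DeepEval&amp;rsquo;s GEval metric takes the criteria as an explicit argument. A rough sketch follows; the criteria text and test case are my own, and the API details should be checked against DeepEval&amp;rsquo;s current docs.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# Rough sketch of a rubric as a first-class object via DeepEval's GEval metric.
# Criteria wording and test case are illustrative; verify against DeepEval docs.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

correctness = GEval(
    name="Correctness",
    criteria=(
        "Judge whether the actual output answers the input question "
        "factually and completely, penalizing any fabricated claims."
    ),
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

case = LLMTestCase(
    input="What license do Noto Sans and Pretendard ship under?",
    actual_output="Both are released under the SIL Open Font License 1.1.",
)
correctness.measure(case)
print(correctness.score, correctness.reason)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;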
&lt;h2 id="insights"&gt;Insights
&lt;/h2&gt;&lt;p&gt;That a font metric extractor and an LLM evaluation rubric thread emerge at the same moment signals something about the audience: &lt;strong&gt;these are people who are actually shipping product&lt;/strong&gt;. The two topics look unrelated but the underlying move is identical — both are about reducing intuition-dependent territory to explicit, verifiable rules. The font tool reduces &amp;ldquo;are these metrics compatible&amp;rdquo; to a HarfBuzz rendering regression. LLM-as-Judge reduces &amp;ldquo;is this answer good&amp;rdquo; to a rubric. Both topics demand an automated verification step before they&amp;rsquo;re production-ready, and that verification step ends up defining the tool&amp;rsquo;s identity. That polaris_mcfg has a &lt;code&gt;validate&lt;/code&gt; subcommand at all, and that LLM eval frameworks treat rubrics as first-class objects, are expressions of the same engineering instinct. In production, &amp;ldquo;it just works&amp;rdquo; is not the finish line: &lt;strong&gt;explicit criteria + automated verification + regression tracking&lt;/strong&gt; is the new bar, and these two topics point to the same place from very different starting points.&lt;/p&gt;
&lt;h2 id="references"&gt;References
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Tool repo and demo&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/PolarisOffice/polaris_mcfg" target="_blank" rel="noopener"
 &gt;PolarisOffice/polaris_mcfg&lt;/a&gt; — Metric-Compatible Font Generator (MIT, Python, 4 stars)&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://polarisoffice.github.io/polaris_mcfg/" target="_blank" rel="noopener"
 &gt;Demo / docs site&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Font ecosystem&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://harfbuzz.github.io/" target="_blank" rel="noopener"
 &gt;HarfBuzz&lt;/a&gt; — OpenType shaping engine&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://openfontlicense.org/" target="_blank" rel="noopener"
 &gt;SIL Open Font License&lt;/a&gt; — de facto free-license standard (OFL 1.1, 2007)&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.sil.org/" target="_blank" rel="noopener"
 &gt;SIL International&lt;/a&gt; — OFL stewards&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://fonts.google.com/noto/specimen/Noto&amp;#43;Sans" target="_blank" rel="noopener"
 &gt;Noto Sans&lt;/a&gt; and &lt;a class="link" href="https://github.com/orioncactus/pretendard" target="_blank" rel="noopener"
 &gt;Pretendard&lt;/a&gt; — OFL-licensed Hangul fonts&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;LLM evaluation methodology&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/explodinggradients/ragas" target="_blank" rel="noopener"
 &gt;RAGAS&lt;/a&gt; — RAG evaluation framework&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/confident-ai/deepeval" target="_blank" rel="noopener"
 &gt;DeepEval&lt;/a&gt; — LLM-as-Judge + rubric-based eval&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/evidentlyai/evidently" target="_blank" rel="noopener"
 &gt;Evidently&lt;/a&gt; — ML/LLM monitoring and eval&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/openai/evals" target="_blank" rel="noopener"
 &gt;OpenAI Evals&lt;/a&gt; — OpenAI&amp;rsquo;s official eval framework&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Simon Willison's Granite 4.1 3B Pelican Gallery — Why All 21 Quantizations Flopped Together</title><link>https://ice-ice-bear.github.io/posts/2026-05-04-simonwillison-granite-pelican-benchmark/</link><pubDate>Mon, 04 May 2026 00:00:00 +0900</pubDate><guid>https://ice-ice-bear.github.io/posts/2026-05-04-simonwillison-granite-pelican-benchmark/</guid><description>&lt;img src="https://ice-ice-bear.github.io/" alt="Featured image of post Simon Willison's Granite 4.1 3B Pelican Gallery — Why All 21 Quantizations Flopped Together" /&gt;&lt;h2 id="overview"&gt;Overview
&lt;/h2&gt;&lt;p&gt;&lt;a class="link" href="https://simonwillison.net/" target="_blank" rel="noopener"
 &gt;Simon Willison&lt;/a&gt; ran his signature prompt — &lt;em&gt;&amp;ldquo;Generate an SVG of a pelican riding a bicycle&amp;rdquo;&lt;/em&gt; — through 21 quantized variants of &lt;a class="link" href="https://huggingface.co/ibm-granite/granite-4.1-3b-instruct" target="_blank" rel="noopener"
 &gt;IBM Granite 4.1 3B&lt;/a&gt;, spanning 1.2GB to 6.34GB (51.3GB total). His verdict was one line: &lt;em&gt;&amp;ldquo;There&amp;rsquo;s no distinguishable pattern relating quality to size — they&amp;rsquo;re all pretty terrible!&amp;rdquo;&lt;/em&gt;. This post takes that gallery as a starting point to ask what informal benchmarks catch that the leaderboards miss, and where to look first if you actually want to measure the quantization-vs-quality curve.&lt;/p&gt;
&lt;pre class="mermaid" style="visibility:hidden"&gt;flowchart LR
 P["Prompt &amp;lt;br/&amp;gt; pelican on a bicycle"] --&gt; Q["Granite 4.1 3B &amp;lt;br/&amp;gt; 21 quant variants"]
 Q --&gt; S1["1.2GB ~ 6.34GB"]
 S1 --&gt; O["21 SVG outputs"]
 O --&gt; J["Simon's eyeball judgment"]
 J --&gt; R["No size-quality pattern &amp;lt;br/&amp;gt; all abstract shapes"]&lt;/pre&gt;&lt;h2 id="whats-the-svg-pelican-thing"&gt;What&amp;rsquo;s the SVG Pelican Thing
&lt;/h2&gt;&lt;p&gt;The &lt;a class="link" href="https://simonwillison.net/tags/pelican-riding-a-bicycle/" target="_blank" rel="noopener"
 &gt;pelican-riding-a-bicycle series&lt;/a&gt; is Simon&amp;rsquo;s personal informal benchmark, run against every new LLM as it lands. The prompt is one line.&lt;/p&gt;

 &lt;blockquote&gt;
 &lt;p&gt;&amp;ldquo;Generate an SVG of a pelican riding a bicycle.&amp;rdquo;&lt;/p&gt;

 &lt;/blockquote&gt;
&lt;p&gt;SVG forces a text model to emit coordinates, paths, and a viewBox directly — visual reasoning, but expressed as tokens. More importantly, the result &lt;strong&gt;renders into an image immediately&lt;/strong&gt;, so cross-model comparison is intuitive. Failure modes that don&amp;rsquo;t surface in &lt;a class="link" href="https://lmarena.ai/" target="_blank" rel="noopener"
 &gt;LMArena&lt;/a&gt; anonymous pair-voting or &lt;a class="link" href="https://paperswithcode.com/dataset/mmlu" target="_blank" rel="noopener"
 &gt;MMLU&lt;/a&gt; multiple choice — proportion, line continuity, part placement — show up plainly in a single SVG.&lt;/p&gt;
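&lt;p&gt;That &amp;ldquo;renders into an image immediately&amp;rdquo; step is also trivially scriptable. Below is a minimal sketch of turning a directory of model-emitted SVGs into uniformly sized PNGs for a side-by-side grid, using cairosvg; the file layout here is my own assumption, not how Simon builds his gallery.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# Sketch: render each model's SVG output to PNG for side-by-side comparison.
# The directory layout is hypothetical; Simon's gallery tooling may differ.
from pathlib import Path
import cairosvg

svg_dir = Path("pelican_svgs")   # one .svg per quantization variant
png_dir = Path("pelican_pngs")
png_dir.mkdir(exist_ok=True)

for svg_path in sorted(svg_dir.glob("*.svg")):
    out = png_dir / (svg_path.stem + ".png")
    cairosvg.svg2png(
        url=str(svg_path),
        write_to=str(out),
        output_width=400,        # uniform width makes eyeballing easier
    )
    print(f"rendered {out.name}")
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;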
&lt;h2 id="the-experiment"&gt;The Experiment
&lt;/h2&gt;&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Item&lt;/th&gt;
 &lt;th&gt;Detail&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Target&lt;/td&gt;
 &lt;td&gt;&lt;a class="link" href="https://huggingface.co/ibm-granite/granite-4.1-3b-instruct" target="_blank" rel="noopener"
 &gt;IBM Granite 4.1 3B Instruct&lt;/a&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Variants&lt;/td&gt;
 &lt;td&gt;21 quantizations (1.2GB to 6.34GB, 51.3GB total)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Prompt&lt;/td&gt;
 &lt;td&gt;&amp;ldquo;Generate an SVG of a pelican riding a bicycle&amp;rdquo;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Output&lt;/td&gt;
 &lt;td&gt;21 SVGs laid out in one gallery page&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Judge&lt;/td&gt;
 &lt;td&gt;Simon Willison&amp;rsquo;s eyes&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The &lt;a class="link" href="https://simonwillison.net/2026/May/4/granite-41-3b-svg-pelican-gallery" target="_blank" rel="noopener"
 &gt;original gallery post&lt;/a&gt; lays all 21 out on one page.&lt;/p&gt;
&lt;h2 id="the-result--simons-take"&gt;The Result — Simon&amp;rsquo;s Take
&lt;/h2&gt;
 &lt;blockquote&gt;
 &lt;p&gt;&lt;em&gt;&amp;ldquo;There&amp;rsquo;s no distinguishable pattern relating quality to size — they&amp;rsquo;re all pretty terrible!&amp;rdquo;&lt;/em&gt;&lt;/p&gt;

 &lt;/blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;No distinguishable pattern relating size to quality.&lt;/strong&gt; 1.2GB and 6.34GB land effectively on the same line.&lt;/li&gt;
&lt;li&gt;All 21 are abstract collections of shapes — neither pelican nor bicycle is clearly identifiable.&lt;/li&gt;
&lt;li&gt;Curiously, &lt;strong&gt;the smallest model produced the most recognizable bicycle&lt;/strong&gt;, and the largest produced the closest thing to a pelican — a hint that the size-quality relationship may not even be monotonic here.&lt;/li&gt;
&lt;li&gt;Simon wraps up by calling the result less interesting than he expected and saying he&amp;rsquo;ll retry with a model that can actually draw.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="what-got-measured-and-what-didnt"&gt;What Got Measured (and What Didn&amp;rsquo;t)
&lt;/h2&gt;&lt;h3 id="1-the-quantization-curve-is-bounded-by-the-base-models-capability-ceiling"&gt;1. The quantization curve is bounded by the base model&amp;rsquo;s capability ceiling
&lt;/h3&gt;&lt;p&gt;A 5x memory range (1.2GB to 6.34GB) and no meaningful difference in output quality. But the takeaway is &lt;strong&gt;not&lt;/strong&gt; &amp;ldquo;quantization is harmless.&amp;rdquo; The cleaner reading is: &lt;strong&gt;this base model is just weak at SVG pelicans&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;To measure quantization impact cleanly, the base needs to be strong enough on the task. If the base sits near the floor, no scheme — &lt;a class="link" href="https://github.com/intel/auto-round" target="_blank" rel="noopener"
 &gt;AutoRound&lt;/a&gt;, GGUF, AWQ, anything — will produce visible separation. &lt;strong&gt;Verify the capability ceiling before designing the quant benchmark.&lt;/strong&gt;&lt;/p&gt;
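&lt;p&gt;In practice that ceiling check is cheap: score the unquantized (or highest-precision) build on the task first, and only run the quant sweep if it clears a floor. Here is a sketch of that gate; the model runner and scorer are placeholders, since they depend on your setup and on whatever judge (human rubric, LLM-as-Judge, pixel diff) you trust.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# Sketch of a capability-ceiling gate before a quantization benchmark.
# generate_svg and score_svg are placeholders for your model runner and judge.
def quant_benchmark(variants, generate_svg, score_svg, floor=0.4, runs=5):
    # 1. Establish the ceiling with the full-precision base model.
    base_scores = [score_svg(generate_svg("base")) for _ in range(runs)]
    ceiling = sum(base_scores) / len(base_scores)
    if ceiling &amp;lt; floor:
        raise RuntimeError(
            f"Base model scores {ceiling:.2f}, below floor {floor}; "
            "a quant sweep would only compress noise into smaller noise."
        )
    # 2. Only now compare quantized variants, keeping per-run distributions
    #    rather than single point scores.
    return {
        v: [score_svg(generate_svg(v)) for _ in range(runs)]
        for v in variants
    }
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;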
&lt;h3 id="2-informal-benchmarks-complement-the-standard-leaderboards"&gt;2. Informal benchmarks complement the standard leaderboards
&lt;/h3&gt;&lt;p&gt;&lt;a class="link" href="https://lmarena.ai/" target="_blank" rel="noopener"
 &gt;LMArena&lt;/a&gt; pair-voting and &lt;a class="link" href="https://paperswithcode.com/dataset/mmlu" target="_blank" rel="noopener"
 &gt;MMLU&lt;/a&gt; measure token-level correctness or preference on text. Questions like &amp;ldquo;can this model lay out parts in 2D space&amp;rdquo; don&amp;rsquo;t surface there. The SVG pelican fills exactly that gap — &lt;strong&gt;not on any official leaderboard, but a sanity check everyone agrees on&lt;/strong&gt;.&lt;/p&gt;
&lt;h3 id="3-what-this-implies-for-the-granite-family"&gt;3. What this implies for the Granite family
&lt;/h3&gt;&lt;p&gt;&lt;a class="link" href="https://www.ibm.com/granite" target="_blank" rel="noopener"
 &gt;IBM Granite&lt;/a&gt; and the &lt;a class="link" href="https://www.ibm.com/products/watsonx-ai/foundation-models" target="_blank" rel="noopener"
 &gt;watsonx Granite lineup&lt;/a&gt; are positioned for enterprise RAG, tool calling, and coding. On that map, an SVG pelican is an out-of-distribution task — being weak there is almost expected. But placed next to mobile-first small-model lines like Google&amp;rsquo;s Gemma + LiteRT releases, it underlines the bigger pattern: &lt;strong&gt;at the 3B class, practical usefulness depends heavily on which family put its capability where&lt;/strong&gt;. Same parameter count, very different shapes of competence.&lt;/p&gt;
&lt;h2 id="insights"&gt;Insights
&lt;/h2&gt;&lt;p&gt;Informal benchmarks survive because they show, in a single image, the kind of failure a leaderboard score can&amp;rsquo;t render. The SVG pelican complements &lt;a class="link" href="https://paperswithcode.com/dataset/mmlu" target="_blank" rel="noopener"
 &gt;MMLU&lt;/a&gt; and &lt;a class="link" href="https://lmarena.ai/" target="_blank" rel="noopener"
 &gt;LMArena&lt;/a&gt;; it doesn&amp;rsquo;t replace them — you need both to see a model&amp;rsquo;s strengths and weaknesses together. Quantization-vs-quality curves are bounded by the base model&amp;rsquo;s capability on the task, so before designing a quant benchmark, confirm the base sits well above the floor; otherwise &lt;a class="link" href="https://github.com/intel/auto-round" target="_blank" rel="noopener"
 &gt;AutoRound&lt;/a&gt; and friends just compress noise into smaller noise. The detail that the smallest variant drew the best bicycle is what&amp;rsquo;s actually interesting — it questions the monotonic assumption itself, suggesting quant comparisons should be read as distributions, not point scores. &lt;a class="link" href="https://www.ibm.com/granite" target="_blank" rel="noopener"
 &gt;IBM Granite&lt;/a&gt; being weak at out-of-distribution visual reasoning is consistent with its enterprise targeting, which is why picking a 3B small open model is really a question of &amp;ldquo;which family put its capability where.&amp;rdquo; An external observer like Simon laying all 21 variants out on one page is doing a real service: it&amp;rsquo;s a fast, shareable model card before any official benchmark numbers drop.&lt;/p&gt;
&lt;h2 id="references"&gt;References
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Original gallery post&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://simonwillison.net/2026/May/4/granite-41-3b-svg-pelican-gallery" target="_blank" rel="noopener"
 &gt;Simon Willison: Granite 4.1 3B SVG Pelican Gallery (2026-05-04)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://simonwillison.net/tags/pelican-riding-a-bicycle/" target="_blank" rel="noopener"
 &gt;pelican-riding-a-bicycle series tag&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://simonwillison.net/" target="_blank" rel="noopener"
 &gt;Simon Willison&amp;rsquo;s Weblog&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;IBM Granite&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://huggingface.co/ibm-granite/granite-4.1-3b-instruct" target="_blank" rel="noopener"
 &gt;IBM Granite 4.1 3B Instruct (Hugging Face)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.ibm.com/granite" target="_blank" rel="noopener"
 &gt;IBM Granite official page&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.ibm.com/products/watsonx-ai/foundation-models" target="_blank" rel="noopener"
 &gt;watsonx foundation model lineup&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Related benchmark refs&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://lmarena.ai/" target="_blank" rel="noopener"
 &gt;LMArena (pairwise leaderboard)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://paperswithcode.com/dataset/mmlu" target="_blank" rel="noopener"
 &gt;MMLU (Papers with Code)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/intel/auto-round" target="_blank" rel="noopener"
 &gt;Intel AutoRound (quantization library)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item></channel></rss>