<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hallucination on ICE-ICE-BEAR-BLOG</title><link>https://ice-ice-bear.github.io/tags/hallucination/</link><description>Recent content in Hallucination on ICE-ICE-BEAR-BLOG</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Thu, 28 May 2026 00:00:00 +0900</lastBuildDate><atom:link href="https://ice-ice-bear.github.io/tags/hallucination/index.xml" rel="self" type="application/rss+xml"/><item><title>The Substrate of Trust — Latent Embeddings, Multimodal Representation, Adversarial Verification, and Fake Citations</title><link>https://ice-ice-bear.github.io/posts/2026-05-28-arxiv-papers-digest/</link><pubDate>Thu, 28 May 2026 00:00:00 +0900</pubDate><guid>https://ice-ice-bear.github.io/posts/2026-05-28-arxiv-papers-digest/</guid><description>&lt;h2 id="overview"&gt;Overview
&lt;/h2&gt;&lt;p&gt;Four papers that hit arXiv around the same time look scattered on the surface (embeddings, multimodal representation, agent engineering, a math dataset), but all four converge on one question: can we trust what the model produced? &lt;a class="link" href="https://arxiv.org/abs/2605.24938" target="_blank" rel="noopener"
 &gt;SMART&lt;/a&gt; and &lt;a class="link" href="https://arxiv.org/abs/2605.27295" target="_blank" rel="noopener"
 &gt;Gemini Embedding 2&lt;/a&gt; firm up the retrieval and representation substrate that grounds RAG; &lt;a class="link" href="https://arxiv.org/abs/2605.25665" target="_blank" rel="noopener"
 &gt;Meta-Engineering Harnesses&lt;/a&gt; verifies agent output adversarially; and &lt;a class="link" href="https://arxiv.org/abs/2605.28003" target="_blank" rel="noopener"
 &gt;ResearchMath-14k&lt;/a&gt; turns fabricated citations into a measured failure mode.&lt;/p&gt;
&lt;pre class="mermaid" style="visibility:hidden"&gt;graph TD
 Q["Can we trust &amp;lt;br/&amp;gt; machine output?"] --&gt; G["Better grounding"]
 Q --&gt; V["Stronger verification"]
 Q --&gt; H["Measured hallucination"]
 G --&gt; S["SMART &amp;lt;br/&amp;gt; frozen hidden states &amp;lt;br/&amp;gt; late interaction"]
 G --&gt; GE["Gemini Embedding 2 &amp;lt;br/&amp;gt; unified multimodal space"]
 V --&gt; M["Meta-Engineering Harnesses &amp;lt;br/&amp;gt; contract-driven adversarial verify"]
 H --&gt; R["ResearchMath-14k &amp;lt;br/&amp;gt; fake references measured"]&lt;/pre&gt;&lt;h2 id="grounding-what-the-embedding-already-knew"&gt;Grounding: What the Embedding Already Knew
&lt;/h2&gt;&lt;p&gt;The claim in &lt;a class="link" href="https://arxiv.org/abs/2605.24938" target="_blank" rel="noopener"
 &gt;SMART&lt;/a&gt; (&amp;ldquo;Your Embedding Model is SMARTer Than You Think&amp;rdquo;) is provocative: the standard single-vector embedding models we already deploy contain a dormant multi-vector capability, and waking it up requires no retraining. Instead of training the model further, SMART applies &lt;a class="link" href="https://github.com/stanford-futuredata/ColBERT" target="_blank" rel="noopener"
 &gt;ColBERT&lt;/a&gt;-style late interaction (the &lt;a class="link" href="https://arxiv.org/abs/2004.12832" target="_blank" rel="noopener"
 &gt;original paper&lt;/a&gt;) at inference time over &lt;strong&gt;frozen hidden states&lt;/strong&gt;. By pulling out the token-level representations that exist &lt;em&gt;before&lt;/em&gt; they get squashed into a single vector, it reportedly beats SOTA multi-vector approaches on &lt;a class="link" href="https://en.wikipedia.org/wiki/Information_retrieval" target="_blank" rel="noopener"
 &gt;multimodal retrieval&lt;/a&gt; without any multi-vector training — and it does so while staying efficient. Code and weights are open-sourced, so the claim is reproducible right away.&lt;/p&gt;
&lt;p&gt;What makes this interesting is that it pushes back on the assumption that better retrieval requires training a heavier model from scratch. In a RAG pipeline, retrieval quality &lt;em&gt;is&lt;/em&gt; grounding quality, and weak grounding lets anything built on top drift into hallucination. SMART says there is free signal still sitting inside embedding models that are already in production.&lt;/p&gt;
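&lt;p&gt;The late-interaction scoring described above can be sketched on toy data. This is a minimal illustration of ColBERT-style MaxSim next to a mean-pooled single-vector baseline, not SMART&amp;rsquo;s actual procedure: the token vectors are hand-written stand-ins for frozen hidden states, and the helper names (&lt;code&gt;normalize&lt;/code&gt;, &lt;code&gt;mean_pool&lt;/code&gt;, &lt;code&gt;maxsim_score&lt;/code&gt;) are hypothetical.&lt;/p&gt;

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products act as cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_pool(token_vecs):
    """Single-vector baseline: average the token vectors, then normalize."""
    dim = len(token_vecs[0])
    pooled = [sum(v[i] for v in token_vecs) / len(token_vecs) for i in range(dim)]
    return normalize(pooled)


def maxsim_score(query_tokens, doc_tokens):
    """Late interaction: each query token keeps its best-matching document token; the maxima are summed."""
    q = [normalize(v) for v in query_tokens]
    d = [normalize(v) for v in doc_tokens]
    return sum(max(dot(qv, dv) for dv in d) for qv in q)


# Toy token-level vectors (assumption: stand-ins for frozen hidden states).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # contains an exact match for each query token
doc_b = [[1.0, 1.0], [1.0, 1.0]]              # only "blended" tokens

for name, doc in (("doc_a", doc_a), ("doc_b", doc_b)):
    pooled = dot(mean_pool(query), mean_pool(doc))
    late = maxsim_score(query, doc)
    print(name, "pooled:", round(pooled, 3), "maxsim:", round(late, 3))
```

&lt;p&gt;In this toy case the pooled score cannot tell the two documents apart, while MaxSim ranks &lt;code&gt;doc_a&lt;/code&gt; higher because every query token finds an exact match. Real hidden states are far higher-dimensional, and how SMART selects or processes them is not covered by this sketch.&lt;/p&gt;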
&lt;h2 id="grounding-video-audio-image-text-in-one-space"&gt;Grounding: Video, Audio, Image, Text in One Space
&lt;/h2&gt;&lt;p&gt;&lt;a class="link" href="https://arxiv.org/abs/2605.27295" target="_blank" rel="noopener"
 &gt;Gemini Embedding 2&lt;/a&gt; reinforces the same substrate by the opposite route. Where SMART squeezes more out of existing models, Gemini Embedding 2 directly trains a multimodal embedding model that natively maps video, audio, image, and text into a single &lt;strong&gt;unified representation space&lt;/strong&gt;. Layering a multi-task, multi-stage recipe onto large-scale contrastive training, it claims SOTA with 62.9 R@1 on &lt;a class="link" href="https://cocodataset.org/" target="_blank" rel="noopener"
 &gt;MSCOCO&lt;/a&gt; image-text and 69.9 on &lt;a class="link" href="https://huggingface.co/spaces/mteb/leaderboard" target="_blank" rel="noopener"
 &gt;MTEB&lt;/a&gt; multilingual. Since the &lt;a class="link" href="https://github.com/embeddings-benchmark/mteb" target="_blank" rel="noopener"
 &gt;MTEB benchmark&lt;/a&gt; is the de facto yardstick for embedding quality, the MTEB figure can be compared directly with other models&amp;rsquo; published scores.&lt;/p&gt;
&lt;p&gt;The emphasis falls on zero-shot generalization: this &lt;a class="link" href="https://deepmind.google/models/gemini/" target="_blank" rel="noopener"
 &gt;Gemini&lt;/a&gt;-family model reportedly transfers well to tasks and languages it never saw in training. SMART&amp;rsquo;s &amp;ldquo;unlock latent capability from a frozen state&amp;rdquo; and Gemini Embedding 2&amp;rsquo;s &amp;ldquo;native multimodal training&amp;rdquo; run in opposite directions but arrive at the same destination — a firmer floor for RAG to stand on.&lt;/p&gt;
&lt;h2 id="verification-compile-the-contract-then-check-it-adversarially"&gt;Verification: Compile the Contract, Then Check It Adversarially
&lt;/h2&gt;&lt;p&gt;If grounding is trust on the input side, &lt;a class="link" href="https://arxiv.org/abs/2605.25665" target="_blank" rel="noopener"
 &gt;Meta-Engineering Harnesses&lt;/a&gt; is about trust on the output side. The paper proposes a &lt;strong&gt;contract-driven adversarial verification&lt;/strong&gt; architecture for AI-native software production. It compiles product requirements into explicit contracts, routes tasks to specialized &lt;a class="link" href="https://huggingface.co/papers" target="_blank" rel="noopener"
 &gt;agents&lt;/a&gt;, and then re-checks their output through an &lt;strong&gt;independent verification&lt;/strong&gt; stage. Generation and verification are deliberately split so the two adversarially keep each other honest.&lt;/p&gt;
&lt;p&gt;On top of that sit persistent memory and failure classification: which task failed and why is accumulated, classified, and fed back into the next routing decision. The authors frame this as &amp;ldquo;CTO-as-a-service&amp;rdquo; and report early deployment across 17 features. The core move is refusing to trust a single agent&amp;rsquo;s self-confidence and externalizing verification into a separate step — a system-design acknowledgment that an LLM&amp;rsquo;s self-evaluation is hard to rely on.&lt;/p&gt;
&lt;h2 id="measured-hallucination-fake-citations-as-a-failure-mode"&gt;Measured Hallucination: Fake Citations as a Failure Mode
&lt;/h2&gt;&lt;p&gt;&lt;a class="link" href="https://arxiv.org/abs/2605.28003" target="_blank" rel="noopener"
 &gt;ResearchMath-14k&lt;/a&gt; (&lt;a class="link" href="https://huggingface.co/papers/2605.28003" target="_blank" rel="noopener"
 &gt;HF Papers&lt;/a&gt;) quantifies the trust problem most sharply. It comprises 14,056 research-level math problems built through a multi-agent pipeline, the largest of its kind to date (&lt;a class="link" href="https://huggingface.co/datasets/amphora/ResearchMath-14k" target="_blank" rel="noopener"
 &gt;dataset&lt;/a&gt;), and is accompanied by ResearchMath-Reasoning, a set of 220K teacher trajectories.&lt;/p&gt;
&lt;p&gt;The most striking finding is about citation behavior. Newer open models produce 5.6x more references per trace — but they also generate &lt;strong&gt;5.0x more &lt;em&gt;fake&lt;/em&gt; references&lt;/strong&gt;. In other words, the model that looks smarter fabricates more, and more convincingly: hallucination is not shrinking, it is getting more polished. Fortunately, after agentic filtering removes those fake references, fine-tuning &lt;a class="link" href="https://github.com/QwenLM/Qwen3" target="_blank" rel="noopener"
 &gt;Qwen3&lt;/a&gt; 4B-30B (&lt;a class="link" href="https://huggingface.co/Qwen" target="_blank" rel="noopener"
 &gt;Qwen&lt;/a&gt;) improves the average by +9.2 points over base — direct evidence that verification and filtering lift data quality. The authors are from &lt;a class="link" href="https://en.snu.ac.kr/" target="_blank" rel="noopener"
 &gt;Seoul National University&lt;/a&gt;, &lt;a class="link" href="https://onelineai.com" target="_blank" rel="noopener"
 &gt;OneLineAI&lt;/a&gt;, and Yonsei University.&lt;/p&gt;
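&lt;p&gt;The two multipliers above can be read together with a little arithmetic. The sketch below assumes both figures are measured per trace (the digest states this only for the 5.6x figure), and the baseline counts are hypothetical numbers chosen purely for illustration, not values from the paper.&lt;/p&gt;

```python
import math

# Multipliers quoted in the digest for newer open models relative to older ones.
references_multiplier = 5.6  # references per trace
fake_multiplier = 5.0        # fake references (assumed here to be per trace as well)

# If both counts scale per trace, the fake share of all references changes by
# (5.0 * old_fake) / (5.6 * old_refs) relative to old_fake / old_refs.
fake_share_ratio = fake_multiplier / references_multiplier


def scaled_counts(old_refs, old_fake):
    """Apply the two multipliers to a hypothetical baseline trace."""
    new_refs = old_refs * references_multiplier
    new_fake = old_fake * fake_multiplier
    return new_refs, new_fake


# Hypothetical baseline: 10 references per trace, 2 of them fabricated.
old_refs, old_fake = 10, 2
new_refs, new_fake = scaled_counts(old_refs, old_fake)

print("fake share ratio:", round(fake_share_ratio, 3))
print("old fake share:", round(old_fake / old_refs, 3))
print("new fake share:", round(new_fake / new_refs, 3))
print("extra fake references per trace:", new_fake - old_fake)
```

&lt;p&gt;Under that assumption the absolute number of fabricated references per trace grows fivefold while their share of all references stays close to where it was, which is why a longer, more confident-looking reference list does not by itself signal better grounding.&lt;/p&gt;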
&lt;h2 id="insights"&gt;Insights
&lt;/h2&gt;&lt;p&gt;Set side by side, the four papers decompose one question — can we trust machine output — into three layers. First, input grounding: &lt;a class="link" href="https://arxiv.org/abs/2605.24938" target="_blank" rel="noopener"
 &gt;SMART&lt;/a&gt; extracts stronger retrieval signal from embeddings you already own without retraining, while &lt;a class="link" href="https://arxiv.org/abs/2605.27295" target="_blank" rel="noopener"
 &gt;Gemini Embedding 2&lt;/a&gt; unifies modalities into one space and widens the floor RAG stands on. Second, output verification: &lt;a class="link" href="https://arxiv.org/abs/2605.25665" target="_blank" rel="noopener"
 &gt;Meta-Engineering Harnesses&lt;/a&gt; splits generation from verification and declines to trust a single agent&amp;rsquo;s confidence. Third, quantifying failure: &lt;a class="link" href="https://arxiv.org/abs/2605.28003" target="_blank" rel="noopener"
 &gt;ResearchMath-14k&lt;/a&gt; pins hallucination to a measurable figure: &amp;ldquo;5x more fake citations.&amp;rdquo; Better grounding, stronger verification, and measured hallucination together form the week&amp;rsquo;s single answer. The lessons of SMART and ResearchMath-14k are notably complementary: the former says &amp;ldquo;there is free signal to extract,&amp;rdquo; the latter says &amp;ldquo;even on top of that signal, the output can still fabricate.&amp;rdquo; So the practical takeaway is simple: strengthen grounding, externalize verification, and track failures in numbers. One caveat: the SOTA and improvement figures for SMART, Gemini Embedding 2, and ResearchMath-14k are author-reported, so they are safest read as directional until independently reproduced.&lt;/p&gt;
&lt;h2 id="references"&gt;References
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Retrieval / Representation (Grounding)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://arxiv.org/abs/2605.24938" target="_blank" rel="noopener"
 &gt;SMART — Your Embedding Model is SMARTer Than You Think&lt;/a&gt; — late interaction over frozen hidden states unlocks multi-vector capability, no retraining&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://arxiv.org/abs/2605.27295" target="_blank" rel="noopener"
 &gt;Gemini Embedding 2&lt;/a&gt; — unified representation space for video/audio/image/text, MSCOCO 62.9 / MTEB 69.9&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/stanford-futuredata/ColBERT" target="_blank" rel="noopener"
 &gt;ColBERT (GitHub)&lt;/a&gt; · &lt;a class="link" href="https://arxiv.org/abs/2004.12832" target="_blank" rel="noopener"
 &gt;ColBERT original paper&lt;/a&gt; — foundation of late-interaction retrieval&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://huggingface.co/spaces/mteb/leaderboard" target="_blank" rel="noopener"
 &gt;MTEB leaderboard&lt;/a&gt; · &lt;a class="link" href="https://github.com/embeddings-benchmark/mteb" target="_blank" rel="noopener"
 &gt;MTEB (GitHub)&lt;/a&gt; — embedding evaluation standard&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://cocodataset.org/" target="_blank" rel="noopener"
 &gt;MSCOCO&lt;/a&gt; — image-text retrieval benchmark&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://deepmind.google/models/gemini/" target="_blank" rel="noopener"
 &gt;Google DeepMind Gemini&lt;/a&gt; — the Gemini model family&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Agent Verification&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://arxiv.org/abs/2605.25665" target="_blank" rel="noopener"
 &gt;Meta-Engineering Harnesses for AI-Native Software Production&lt;/a&gt; — contract-driven adversarial verification, specialized agent routing, persistent memory and failure classification&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Hallucination / Evaluation&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://arxiv.org/abs/2605.28003" target="_blank" rel="noopener"
 &gt;ResearchMath-14k (arXiv)&lt;/a&gt; · &lt;a class="link" href="https://huggingface.co/papers/2605.28003" target="_blank" rel="noopener"
 &gt;HF Papers&lt;/a&gt; — 14,056 research-level math problems, 5x increase in fake citations finding&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://huggingface.co/datasets/amphora/ResearchMath-14k" target="_blank" rel="noopener"
 &gt;ResearchMath-14k dataset&lt;/a&gt; — open dataset&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/QwenLM/Qwen3" target="_blank" rel="noopener"
 &gt;Qwen3 (GitHub)&lt;/a&gt; · &lt;a class="link" href="https://huggingface.co/Qwen" target="_blank" rel="noopener"
 &gt;Qwen (Hugging Face)&lt;/a&gt; — fine-tuned base models&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://en.snu.ac.kr/" target="_blank" rel="noopener"
 &gt;Seoul National University&lt;/a&gt; · &lt;a class="link" href="https://onelineai.com" target="_blank" rel="noopener"
 &gt;OneLineAI&lt;/a&gt; — author affiliations&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://en.wikipedia.org/wiki/Information_retrieval" target="_blank" rel="noopener"
 &gt;Information retrieval overview&lt;/a&gt; — background&lt;/li&gt;
&lt;/ul&gt;</description></item></channel></rss>