<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Image Segmentation on ICE-ICE-BEAR-BLOG</title><link>https://ice-ice-bear.github.io/tags/image-segmentation/</link><description>Recent content in Image Segmentation on ICE-ICE-BEAR-BLOG</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Thu, 07 May 2026 00:00:00 +0900</lastBuildDate><atom:link href="https://ice-ice-bear.github.io/tags/image-segmentation/index.xml" rel="self" type="application/rss+xml"/><item><title>ToonOut and BiRefNet — How an Anime-Tuned Matting Model Hits 99.5% Pixel Accuracy</title><link>https://ice-ice-bear.github.io/posts/2026-05-07-toonout-birefnet-anime-matting/</link><pubDate>Thu, 07 May 2026 00:00:00 +0900</pubDate><guid>https://ice-ice-bear.github.io/posts/2026-05-07-toonout-birefnet-anime-matting/</guid><description>&lt;img src="https://ice-ice-bear.github.io/" alt="Featured image of post ToonOut and BiRefNet — How an Anime-Tuned Matting Model Hits 99.5% Pixel Accuracy" /&gt;&lt;h2 id="overview"&gt;Overview
&lt;/h2&gt;&lt;p&gt;In &lt;a class="link" href="https://ice-ice-bear.github.io/posts/2026-05-07-popcon-dev11/" &gt;popcon dev #11&lt;/a&gt; I swapped the matting model to ToonOut. Reading the two GitHub repos side-by-side makes the story clear — &lt;a class="link" href="https://github.com/zhengpeng7/birefnet" target="_blank" rel="noopener"
 &gt;ZhengPeng7/BiRefNet&lt;/a&gt; (CAAI AIR'24, ★3,397, near-SOTA general matting) and &lt;a class="link" href="https://github.com/MatteoKartoon/BiRefNet" target="_blank" rel="noopener"
 &gt;MatteoKartoon/BiRefNet&lt;/a&gt; (anime-only fine-tune, ★94, arXiv:2509.06839). A clean example of the base-model + domain-fine-tune pattern.&lt;/p&gt;
&lt;pre class="mermaid" style="visibility:hidden"&gt;graph LR
 Input["Anime character RGB image"] --&gt; Compose["Composite onto #808080 gray bg &amp;lt;br/&amp;gt; (ToonOut training distribution)"]
 Compose --&gt; ToonOut["ToonOut &amp;lt;br/&amp;gt; (BiRefNet fine-tuned &amp;lt;br/&amp;gt; on 1228 anime images)"]
 ToonOut --&gt; Mask["Alpha mask &amp;lt;br/&amp;gt; (95.3% → 99.5%)"]
 Mask --&gt; Compose2["Composite onto target bg"]
 BaseRef["ZhengPeng7/BiRefNet &amp;lt;br/&amp;gt; (general matting SOTA)"] -. fine-tune .-&gt; ToonOut
 Dataset["1228 hand-annotated &amp;lt;br/&amp;gt; CC-BY 4.0"] -. train .-&gt; ToonOut&lt;/pre&gt;&lt;hr&gt;
&lt;h2 id="birefnet--bilateral-reference-dichotomous-image-segmentation"&gt;BiRefNet — Bilateral Reference, dichotomous image segmentation
&lt;/h2&gt;&lt;p&gt;The original BiRefNet is a 2024 paper in CAAI Artificial Intelligence Research. &amp;ldquo;Dichotomous image segmentation&amp;rdquo; is the task of cleanly splitting foreground (salient) from background. What sets it apart from generic matting:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;High-resolution training&lt;/strong&gt; — 1024×1024 input + denser supervision than typical matting setups (the inference sketch after this list shows the matching preprocessing).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bilateral reference&lt;/strong&gt; — the decoder consults the input image twice in the forward pass: first for a coarse segmentation, then for fine-grained refinement. This makes it strong on thin structures like hair.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Salient object + camouflaged object + DIS unified&lt;/strong&gt; — the model handles three tasks together, which boosts generalization.&lt;/li&gt;
&lt;/ul&gt;
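&lt;p&gt;To make that concrete, here is a minimal inference sketch along the lines of the repo&amp;rsquo;s README. The Hugging Face model id and the &lt;code&gt;[-1]&lt;/code&gt; convention (last of the multi-stage predictions) follow the official usage; treat the rest as a sketch, not the canonical pipeline:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import torch
from PIL import Image
from torchvision import transforms
from transformers import AutoModelForImageSegmentation

# Load BiRefNet from Hugging Face; the model code ships with the checkpoint.
birefnet = AutoModelForImageSegmentation.from_pretrained(
    "ZhengPeng7/BiRefNet", trust_remote_code=True
).eval().to("cuda")

# Training-matched preprocessing: 1024x1024 resize + ImageNet normalization.
preprocess = transforms.Compose([
    transforms.Resize((1024, 1024)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

image = Image.open("character.png").convert("RGB")
with torch.no_grad():
    # Multi-stage outputs; the last entry is the final, refined mask.
    pred = birefnet(preprocess(image).unsqueeze(0).to("cuda"))[-1].sigmoid().cpu()
mask = transforms.ToPILImage()(pred[0].squeeze()).resize(image.size)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;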
&lt;p&gt;The repo&amp;rsquo;s News timeline shows ongoing maintenance:&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Date&lt;/th&gt;
 &lt;th&gt;Change&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;2025-02-12&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;BiRefNet_HR-matting&lt;/code&gt; — trained at 2048×2048, dedicated high-res matting&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;2025-03-31&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;BiRefNet_dynamic&lt;/code&gt; — dynamic resolution training from 256×256 to 2304×2304. &lt;strong&gt;Robust at any resolution&lt;/strong&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;2025-05-15&lt;/td&gt;
 &lt;td&gt;Fine-tuning tutorial video on YouTube/Bilibili&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;2025-06-30&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;refine_foreground&lt;/code&gt; accelerated 8x — ~80ms on a 5090&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;2025-09-23&lt;/td&gt;
 &lt;td&gt;Swin transformer attention swapped for PyTorch SDPA, less memory + future flash_attn compatibility&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;code&gt;BiRefNet_dynamic&lt;/code&gt; is the one to watch. It was trained on a dynamic resolution range (256→2304), so inference is robust at arbitrary resolutions. Previously you had to resize inputs to the training resolution; the dynamic model removes that step.&lt;/p&gt;
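&lt;p&gt;In practice the difference is one line of preprocessing. A hedged sketch: the Hugging Face id &lt;code&gt;ZhengPeng7/BiRefNet_dynamic&lt;/code&gt; is real, but the multiple-of-32 rounding is my assumption about what the Swin backbone tolerates, not a documented requirement.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import torch
from PIL import Image
from torchvision import transforms
from transformers import AutoModelForImageSegmentation

model = AutoModelForImageSegmentation.from_pretrained(
    "ZhengPeng7/BiRefNet_dynamic", trust_remote_code=True
).eval().to("cuda")

def snap32(x: int) -&gt; int:
    # Assumption: keep sides divisible by 32 for the windowed attention.
    return max(32, round(x / 32) * 32)

image = Image.open("frame.png").convert("RGB")
w, h = image.size
prep = transforms.Compose([
    transforms.Resize((snap32(h), snap32(w))),  # near-native, not a fixed 1024
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
with torch.no_grad():
    mask = model(prep(image).unsqueeze(0).to("cuda"))[-1].sigmoid().cpu()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;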
&lt;p&gt;GPU sponsorship is also explicit — Freepik provided GPUs for high-resolution training. A pattern: academic models maturing into production-grade releases.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="toonout--fine-tuning-on-1228-hand-annotated-images"&gt;ToonOut — fine-tuning on 1,228 hand-annotated images
&lt;/h2&gt;&lt;p&gt;ToonOut is a fork of BiRefNet. The headline number from the README:&lt;/p&gt;

 &lt;blockquote&gt;
 &lt;p&gt;&amp;hellip;we collected and annotated a custom dataset of &lt;strong&gt;1,228 high-quality anime images&lt;/strong&gt;&amp;hellip; The resulting model, &lt;strong&gt;ToonOut&lt;/strong&gt;, shows marked improvements in background removal accuracy for anime-style images, achieving an increase in Pixel Accuracy from &lt;strong&gt;95.3% to 99.5%&lt;/strong&gt; on our test set.&lt;/p&gt;

 &lt;/blockquote&gt;
&lt;p&gt;1,228 is a small fine-tuning set, and yet it earned a 4.2-point pixel-accuracy gain. &lt;strong&gt;That means the base BiRefNet was already strong; only the domain gap needed closing.&lt;/strong&gt; When you fine-tune a model that already works well on generic matting onto an anime distribution, you&amp;rsquo;re not re-learning the entire distribution — you&amp;rsquo;re exposing the model to domain-specific edge cases (hair, transparency, anime shading), and 1,228 images was enough for that.&lt;/p&gt;
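&lt;p&gt;For reference, pixel accuracy is the bluntest of the matting metrics: the fraction of pixels whose binarized prediction matches the ground truth. A minimal sketch (the 0.5 threshold is my assumption; ToonOut&amp;rsquo;s &lt;code&gt;evaluations.py&lt;/code&gt; may binarize differently):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import numpy as np

def pixel_accuracy(pred: np.ndarray, gt: np.ndarray, thresh: float = 0.5) -&gt; float:
    """Fraction of pixels whose binarized prediction matches the ground truth.

    pred, gt: float alpha masks in [0, 1]. The 0.5 threshold is an assumption.
    """
    return float(((pred &gt;= thresh) == (gt &gt;= thresh)).mean())
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Note the scale: on a 1024×1024 image, 99.5% pixel accuracy still leaves roughly 5,000 misclassified pixels, so edge quality around hair can differ even between two models with similar scores.&lt;/p&gt;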
&lt;h3 id="dataset-structure"&gt;Dataset structure
&lt;/h3&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;toonout_dataset/
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;├── train/
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;│ ├── train_generations_20250318_emotion/
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;│ │ ├── im/ # raw RGB
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;│ │ ├── gt/ # ground-truth alpha mask
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;│ │ └── an/ # combined RGBA
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The &lt;code&gt;im/gt/an&lt;/code&gt; triple is the standard matting dataset shape. The dataset license is CC-BY 4.0 and the model weights are MIT, so production use has minimal constraints.&lt;/p&gt;
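&lt;p&gt;A minimal loader for this layout might look like the sketch below. &lt;code&gt;MattingFolder&lt;/code&gt; is a hypothetical name, and the assumption that &lt;code&gt;im/&lt;/code&gt; and &lt;code&gt;gt/&lt;/code&gt; hold identically named files is the usual matting convention, not something the repo guarantees:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset

class MattingFolder(Dataset):
    """Hypothetical loader for the im/gt pair (an/ is derivable from the two)."""

    def __init__(self, root: str, transform=None):
        self.im_paths = sorted(Path(root, "im").iterdir())
        self.transform = transform

    def __len__(self):
        return len(self.im_paths)

    def __getitem__(self, idx):
        im_path = self.im_paths[idx]
        image = Image.open(im_path).convert("RGB")
        # Assumes gt/ mirrors im/ filenames, the usual matting convention.
        mask = Image.open(Path(im_path.parents[1], "gt", im_path.name)).convert("L")
        if self.transform:
            image, mask = self.transform(image, mask)
        return image, mask
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;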
&lt;h3 id="fork-specific-changes"&gt;Fork-specific changes
&lt;/h3&gt;&lt;p&gt;What ToonOut adjusted from upstream BiRefNet:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;bfloat16 to dodge NaN gradients&lt;/strong&gt; — the original fp16 training apparently had instability issues, so &lt;code&gt;train_finetuning.sh&lt;/code&gt; standardizes on &lt;code&gt;bfloat16&lt;/code&gt; (see the sketch after this list for why that helps).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Evaluation script fix&lt;/strong&gt; — a corrected &lt;code&gt;evaluations.py&lt;/code&gt; replaces the original &lt;code&gt;eval_existingOnes.py&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Five core scripts&lt;/strong&gt; — the split/train/test/eval/visualize pipeline, tidied into bash entrypoints.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Utility scripts&lt;/strong&gt; — baseline prediction, alpha-mask extraction, and a &lt;strong&gt;Photoroom API integration&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
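&lt;p&gt;The bfloat16 point deserves a sketch. bfloat16 keeps fp32&amp;rsquo;s 8-bit exponent, so the loss and gradient overflows that produce NaNs under fp16 are far less likely, and no &lt;code&gt;GradScaler&lt;/code&gt; is needed. A generic PyTorch step under that setup, illustrative only and not ToonOut&amp;rsquo;s actual training loop:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import torch

def train_step(model, batch, optimizer, loss_fn, device="cuda"):
    # Generic bf16 autocast step (illustrative, not ToonOut's actual loop).
    images, masks = (t.to(device) for t in batch)
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        preds = model(images)[-1]   # BiRefNet-style: last stage is the final mask
        loss = loss_fn(preds, masks)
    loss.backward()                 # gradients land on the fp32 master weights
    optimizer.step()
    return loss.item()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;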
&lt;p&gt;That last one is interesting. Photoroom is a leading commercial player in background removal. Bringing it in as a baseline means the ToonOut paper evaluated on three axes — academic SOTA, a commercial API, and its own model. That is an academic paper with a production-grade evaluation perspective.&lt;/p&gt;
&lt;p&gt;The GPU disclosure is also honest — training was done on 2× RTX 4090s (24GB each). That&amp;rsquo;s roughly a week of cloud compute. Fine-tuning at this level is within reach of an individual.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="integrating-toonout-into-popcon"&gt;Integrating ToonOut into popcon
&lt;/h2&gt;&lt;p&gt;One more thing I learned during the swap: &lt;strong&gt;ToonOut assumes a #808080 gray background in its training distribution.&lt;/strong&gt; Pass it RGBA on white or any other background and the matting result wobbles.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# gpu_worker — always composite onto #808080 before ToonOut&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_swap_bg_to_gray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rgba&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34;Soft white-key compositor: alpha-blend onto #808080.&amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rgba&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;255.0&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;rgb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rgba&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;gray&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;full_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rgb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rgb&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;gray&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;uint8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This is a small case of &amp;ldquo;training distribution alignment.&amp;rdquo; Normalizing inputs to match what the model trained on changes inference accuracy noticeably. The README doesn&amp;rsquo;t state this explicitly, but the training scripts and the RGBA images in the &lt;code&gt;an/&lt;/code&gt; folder strongly suggest the training data was already pre-composited on gray.&lt;/p&gt;
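&lt;p&gt;Putting it together, the whole flow from the diagram at the top reads like the sketch below. &lt;code&gt;run_matting&lt;/code&gt; and &lt;code&gt;toonout&lt;/code&gt; are hypothetical stand-ins for a loaded ToonOut checkpoint plus the BiRefNet-style preprocessing shown earlier; only &lt;code&gt;_swap_bg_to_gray&lt;/code&gt; comes from the code above:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import numpy as np
from PIL import Image

# Hypothetical glue code: composite to gray -&gt; ToonOut -&gt; composite to target.
rgba = np.array(Image.open("character.png").convert("RGBA"))

gray_rgb = _swap_bg_to_gray(rgba)        # align with the training distribution
alpha = run_matting(toonout, gray_rgb)   # assumed: returns HxW float mask in [0, 1]

bg = np.array(Image.open("bg.png").convert("RGB").resize((rgba.shape[1], rgba.shape[0])))
fg = rgba[..., :3].astype(np.float32)
out = fg * alpha[..., None] + bg * (1 - alpha[..., None])
Image.fromarray(out.astype(np.uint8)).save("composited.png")
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;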
&lt;hr&gt;
&lt;h2 id="insights"&gt;Insights
&lt;/h2&gt;&lt;p&gt;ToonOut is a clean example of how to do domain fine-tuning. Three patterns:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Base-model selection is half the work.&lt;/strong&gt; Because BiRefNet was already near-SOTA on general matting, 1,228 anime images was enough. With a weaker base, ten thousand wouldn&amp;rsquo;t have been.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Separate licensing for dataset and weights.&lt;/strong&gt; Dataset is CC-BY, weights are MIT. Others can use the weights in production unrestricted, and the dataset is open to both academic and commercial work.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Input distribution alignment at inference.&lt;/strong&gt; A small step that normalizes inputs to the training distribution (here: composite onto gray) materially affects accuracy.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;BiRefNet&amp;rsquo;s News timeline is itself a study aid. You can watch a model evolve from academic release into production grade — dynamic resolution, an attention-backend swap, 8x foreground-refine acceleration — a year&amp;rsquo;s worth of maintenance decisions, readable line by line.&lt;/p&gt;
&lt;p&gt;Up next: the evaluation methodology in the ToonOut paper (arXiv:2509.06839), implementation details of BiRefNet_dynamic&amp;rsquo;s dynamic-resolution training, and the matting-quality A/B metric in popcon (previous model vs ToonOut).&lt;/p&gt;</description></item></channel></rss>