<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Whisper on ICE-ICE-BEAR-BLOG</title><link>https://ice-ice-bear.github.io/tags/whisper/</link><description>Recent content in Whisper on ICE-ICE-BEAR-BLOG</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Mon, 29 Jun 2026 00:00:00 +0900</lastBuildDate><atom:link href="https://ice-ice-bear.github.io/tags/whisper/index.xml" rel="self" type="application/rss+xml"/><item><title>The Local Speech Stack: Voice-to-Text, TTS Cloning, and Pronunciation Feedback Without the Cloud</title><link>https://ice-ice-bear.github.io/posts/2026-06-29-local-speech-stack/</link><pubDate>Mon, 29 Jun 2026 00:00:00 +0900</pubDate><guid>https://ice-ice-bear.github.io/posts/2026-06-29-local-speech-stack/</guid><description>&lt;img src="https://ice-ice-bear.github.io/" alt="Featured image of post The Local Speech Stack: Voice-to-Text, TTS Cloning, and Pronunciation Feedback Without the Cloud" /&gt;&lt;h2 id="overview"&gt;Overview
&lt;/h2&gt;&lt;p&gt;A cluster of speech projects I came across share one premise: run the model on your own machine, not someone else&amp;rsquo;s server. Whispree turns your voice into AI prompts on macOS; VoxCPM and Voicebox do TTS and voice cloning locally; SpeechFeedback builds a Korean pronunciation-correction system on top of a Deep Speech 2 ASR model. Together they show how far local speech has come — and where the cloud still wins.&lt;/p&gt;
&lt;pre class="mermaid" style="visibility:hidden"&gt;graph TD
 V["Your voice"] --&gt; STT["Speech-to-Text (Whispree: WhisperKit / Groq)"]
 STT --&gt; LLM["LLM post-processing (filler removal, code-switching)"]
 LLM --&gt; OUT["Text inserted at cursor"]
 T["Your text"] --&gt; TTS["Local TTS / cloning (VoxCPM, Voicebox)"]
 TTS --&gt; AUDIO["Synthesized speech"]
 A["Spoken Korean"] --&gt; ASR["Deep Speech 2 + IPA (SpeechFeedback)"]
 ASR --&gt; FB["Pronunciation feedback"]&lt;/pre&gt;&lt;hr&gt;
&lt;h2 id="whispree-voice-to-prompt-for-apple-silicon"&gt;Whispree: Voice-to-Prompt for Apple Silicon
&lt;/h2&gt;&lt;p&gt;&lt;a class="link" href="https://github.com/Arsture/whispree" target="_blank" rel="noopener"
 &gt;Whispree&lt;/a&gt; (113★, mostly Swift) is a macOS menu-bar app that can run fully on-device, positioned as an open-source SuperWhisper alternative. The pitch is &amp;ldquo;talk to AI instead of typing&amp;rdquo;: place your cursor in any prompt input (Cursor, Claude, ChatGPT), hit &lt;code&gt;Ctrl+Shift+R&lt;/code&gt;, speak, and the corrected text is pasted exactly where the cursor was. It even remembers the original focus position if you switch windows mid-recording.&lt;/p&gt;
&lt;p&gt;What makes it more than dictation is the &lt;strong&gt;LLM post-processing layer&lt;/strong&gt;. Four correction modes — Standard (fix STT errors), Filler Removal (strip &amp;ldquo;um/uh&amp;rdquo;), Structured (organize rambling speech into bullet points for a prompt), and Custom — sit between the raw transcription and the paste. The standout for Korean developers is &lt;strong&gt;code-switching optimization&lt;/strong&gt;: it correctly rewrites mixed Korean-English tech speech, e.g. &lt;code&gt;&amp;quot;밸리데이션 해야 되거든&amp;quot;&lt;/code&gt; → &lt;code&gt;&amp;quot;validation 해야 되거든&amp;quot;&lt;/code&gt;, &lt;code&gt;&amp;quot;깃허브에 PR 올려놨어&amp;quot;&lt;/code&gt; → &lt;code&gt;&amp;quot;GitHub에 PR 올려놨어&amp;quot;&lt;/code&gt;. It also auto-captures a screenshot of the focused screen and attaches it, so vision-capable models can correct formulas and technical terms from context.&lt;/p&gt;
&lt;p&gt;The provider architecture is the clever part. Cloud STT goes through Groq (free), and LLM correction borrows your existing &lt;strong&gt;Codex CLI OAuth tokens&lt;/strong&gt;, so &amp;ldquo;if you have an OpenAI account, you get high-quality STT + LLM correction with virtually no additional cost.&amp;rdquo; Local options exist too (WhisperKit on CoreML+ANE, MLX Audio, six local LLMs). There&amp;rsquo;s also a URL scheme (&lt;code&gt;whispree://toggle&lt;/code&gt;) for triggering from Raycast, Stream Deck, or AppleScript. One notable bit of release discipline is visible in the commit log: a Claude subscription provider was reverted before release and preserved on a feature branch, a reminder that &amp;ldquo;what you ship&amp;rdquo; and &amp;ldquo;what you built&amp;rdquo; aren&amp;rsquo;t the same.&lt;/p&gt;
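&lt;p&gt;The URL scheme can also be fired from a script. What follows is a minimal sketch, not code from the Whispree repository: it assumes macOS, where the system &lt;code&gt;open&lt;/code&gt; command hands a URL to whichever app registered the scheme. The helper names &lt;code&gt;toggle_command&lt;/code&gt; and &lt;code&gt;toggle_recording&lt;/code&gt; are hypothetical; only &lt;code&gt;whispree://toggle&lt;/code&gt; comes from the project.&lt;/p&gt;

```python
import subprocess
import sys

# URL scheme quoted in the post; everything else here is illustrative.
WHISPREE_TOGGLE_URL = "whispree://toggle"


def toggle_command(url=WHISPREE_TOGGLE_URL):
    """Build the argument list for the macOS open command."""
    return ["open", url]


def toggle_recording():
    """Dispatch the URL on macOS; return False on any other platform."""
    if sys.platform != "darwin":
        return False
    # check=True raises CalledProcessError if open exits with an error.
    subprocess.run(toggle_command(), check=True)
    return True
```

&lt;p&gt;Calling &lt;code&gt;toggle_recording()&lt;/code&gt; from a hotkey tool or a cron-style script would then behave like the Raycast or Stream Deck triggers, assuming the app is installed and has registered the scheme.&lt;/p&gt;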
&lt;hr&gt;
&lt;h2 id="voxcpm-and-voicebox-tts-and-cloning-entirely-local"&gt;VoxCPM and Voicebox: TTS and Cloning, Entirely Local
&lt;/h2&gt;&lt;p&gt;On the synthesis side, two projects stood out. &lt;strong&gt;VoxCPM&lt;/strong&gt; (the project behind the &amp;ldquo;open-source TTS that nails morning-drama dialogue&amp;rdquo; Short) is a multilingual speech model doing voice design and cloning under Apache 2.0, and the demo&amp;rsquo;s point was its emotional range — Korean dialogue delivered with convincing melodramatic affect, not flat robotic TTS.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Voicebox&lt;/strong&gt; is framed as &amp;ldquo;ElevenLabs downloaded onto your PC&amp;rdquo;: free, open-source, everything runs locally with no internet. Its identity is &lt;em&gt;local-first&lt;/em&gt;, and the architecture is a Swiss-army-knife of &lt;strong&gt;five swappable engines&lt;/strong&gt; — pick Qwen-TTS-style models for multilingual, instruction-following voices (&amp;ldquo;speak a little slower&amp;rdquo;), or Chatterbox Turbo for fast, emotion-tagged synthesis (write &lt;code&gt;(laughs)&lt;/code&gt; or &lt;code&gt;(sighs)&lt;/code&gt; inline). It&amp;rsquo;s not just text-to-speech but a full audio production studio: multi-character scenes, reverb effects, editing, and an API for automation.&lt;/p&gt;
&lt;p&gt;The honest trade-off, as the reviewer put it, is that local synthesis is &amp;ldquo;a manual-transmission car&amp;rdquo; — more control, but you handle setup, GPU requirements, and a learning curve. If you need a result in one minute and don&amp;rsquo;t have a fast GPU, a cloud subscription is still the saner choice. Where local &lt;em&gt;clearly&lt;/em&gt; wins: game developers generating thousands of NPC lines without per-use fees, content creators keeping scripts off external servers, and companies wiring TTS into pipelines that handle sensitive data.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="speechfeedback-asr-as-a-pronunciation-tutor"&gt;SpeechFeedback: ASR as a Pronunciation Tutor
&lt;/h2&gt;&lt;p&gt;&lt;a class="link" href="https://github.com/DevTae/SpeechFeedback" target="_blank" rel="noopener"
 &gt;SpeechFeedback&lt;/a&gt; takes ASR in a different direction — not transcription for its own sake, but &lt;strong&gt;Korean pronunciation correction&lt;/strong&gt;. It&amp;rsquo;s a Docker + FastAPI system built on the KoSpeech toolkit, implementing a Deep Speech 2 architecture (3-layer CNN + 7 bidirectional GRU layers + CTC loss, per the Baidu paper).&lt;/p&gt;
&lt;p&gt;The clever design choice is &lt;strong&gt;IPA (International Phonetic Alphabet) conversion&lt;/strong&gt;. Instead of recognizing standard orthography, the model recognizes pronunciation as actually spoken, which collapses the output vocabulary from &lt;strong&gt;2000 classes down to 44&lt;/strong&gt;, a dramatically smaller, more learnable target. That&amp;rsquo;s what lets it give feedback on &lt;em&gt;how&lt;/em&gt; a word was pronounced versus how it should have been. The project&amp;rsquo;s engineering log is a nice case study in data-bound ML: scaling labeled data from 10k to 600k examples (by porting an R-based hangul-to-IPA converter to Python) grew the per-epoch step count ~60×, in line with the 60× increase in data, and switching the source dataset from lecture audio to conversational Korean made the model generalize better to everyday speech.&lt;/p&gt;
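&lt;p&gt;The comparison step can be sketched as a sequence alignment between the phonemes the model heard and the phonemes that were expected. This is an illustration only, not SpeechFeedback&amp;rsquo;s actual scoring code: it uses Python&amp;rsquo;s standard &lt;code&gt;difflib&lt;/code&gt;, the function name &lt;code&gt;pronunciation_feedback&lt;/code&gt; is hypothetical, and the phoneme lists are hand-written placeholders rather than real model output.&lt;/p&gt;

```python
from difflib import SequenceMatcher


def pronunciation_feedback(target, spoken):
    """Return (operation, expected, heard) tuples where spoken differs from target.

    Both arguments are lists of phoneme symbols. The operation is one of
    "replace", "delete", or "insert", as reported by difflib.
    """
    matcher = SequenceMatcher(a=target, b=spoken, autojunk=False)
    feedback = []
    for op, t_start, t_end, s_start, s_end in matcher.get_opcodes():
        if op != "equal":
            feedback.append((op, target[t_start:t_end], spoken[s_start:s_end]))
    return feedback


# Placeholder phoneme sequences (assumed for illustration, not model output).
target = ["k", "a", "m", "s", "a"]
spoken = ["k", "a", "m", "s", "o"]

for op, expected, heard in pronunciation_feedback(target, spoken):
    print(op, "expected", expected, "heard", heard)
```

&lt;p&gt;With a small symbol inventory, each mismatch points at a specific sound, which is what makes per-phoneme feedback possible. A real system would presumably weight errors by phonetic similarity instead of treating every substitution as equal.&lt;/p&gt;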
&lt;pre class="mermaid" style="visibility:hidden"&gt;graph LR
 A["Korean speech input"] --&gt; B["Deep Speech 2 ASR"]
 B --&gt; C["IPA transcription (44-class output)"]
 C --&gt; D["compare vs target pronunciation"]
 D --&gt; E["pronunciation feedback"]&lt;/pre&gt;&lt;hr&gt;
&lt;h2 id="insights"&gt;Insights
&lt;/h2&gt;&lt;p&gt;The common thread is that &lt;strong&gt;speech is following the same local-first arc that image and text models took&lt;/strong&gt;: capable open models, on-device inference, and privacy as the headline feature rather than an afterthought. But the four projects also map the spectrum cleanly. Whispree is pragmatic-hybrid — local app, but happy to borrow Groq and Codex tokens because that&amp;rsquo;s where quality-per-cost is best right now. Voicebox is purist-local, trading convenience for control and zero data egress. VoxCPM shows the synthesis quality bar (emotional, multilingual) has risen enough that &amp;ldquo;local&amp;rdquo; no longer means &amp;ldquo;obviously worse.&amp;rdquo; And SpeechFeedback is a reminder that ASR isn&amp;rsquo;t only for transcription — reframed with IPA, the same model becomes a tutor. The recurring lesson for builders: the interesting work increasingly sits in the &lt;em&gt;layer around&lt;/em&gt; the speech model — LLM post-processing, provider routing, IPA reframing, multi-engine selection — not the model itself.&lt;/p&gt;</description></item></channel></rss>