Overview
Following Previous Post: #7, this is the 8th cycle. It covers two URL classifier quality fixes and one taxonomy-path bug found while running the multilingual blog. Three small fixes, but all three touch surfaces a user notices instantly — classification accuracy and build success.
URL Classifier — Perplexity Computer + Billing Page Filter
Background
Perplexity’s newly launched Computer Agent URLs were appearing in extracted history, but the classifier was dropping them into generic web_page. At the same time, RunPod billing/account pages were leaking personal account data while still being classified as a “tech URL.”
Implementation
Two changes to url_classifier.py’s dispatch table:
- Added Perplexity Computer agent pattern →
ai_chat_perplexity - Moved explicit billing patterns (
/billing,/account) into the default-filter set
Result
Billing pages no longer surface in history extracts, and Perplexity Computer results now classify correctly into the AI chat group.
URL Classifier — YouTube Shorts and GitHub Accuracy
Background
The classifier didn’t detect YouTube Shorts (youtube.com/shorts/...) and dropped them into generic web_page. Separately, GitHub URLs containing /blob/, /tree/, or /commit/ were misclassified as github_repo, causing the fetcher to try and pull a README that wasn’t there.
Implementation
# Add YouTube Shorts pattern
re.compile(r"youtube\.com/shorts/"): UrlType.YOUTUBE,
# Strengthen GitHub repo pattern — split blob/tree/commit into github_other
Result
YouTube Shorts are now valid transcript fetch targets, and GitHub /blob/ URLs are split into a separate category (github_other), eliminating fetcher failures.
Multilingual Taxonomy Path Bug
Background
After the blog moved to bilingual (Korean/English), creating a new tag wrote _index.md to content/tags/{tag}/_index.md. But in multilingual mode, Hugo expects content/{lang}/tags/{tag}/_index.md. Result: every new tag 404’d on the English homepage.
Implementation
The taxonomy _index writer in image_handler.py now accepts the --language flag and writes under language_content_dirs[lang]. When the publish command runs both languages, both _index files are created.
Result
The previously-broken tag index on the English blog now works, confirmed live after a cache bust.
Commit Log
| Message | Files |
|---|---|
| fix: classify Perplexity Computer agent URLs and filter billing pages | url_classifier.py |
| fix: improve URL classifier quality — YouTube Shorts, noise filtering, GitHub accuracy | url_classifier.py |
| fix: write taxonomy _index.md under language content root | image_handler.py |
Insights
All three fixes were of the form “what assumption did we wrongly model about another component (Chrome history DB, Hugo i18n)?” The URL classifier is essentially a set of assumptions about external site URL schemas — when a new pattern like Perplexity Computer ships, those assumptions go stale immediately. The multilingual taxonomy bug was more interesting: Hugo’s multilingual mode keeps the default content directory in place, but taxonomy indexes have to move under the language root. That asymmetry is exactly the kind of thing that breaks first at the single-language → multilingual transition.
