log-blog Dev Log — Extracting Dev Logs from Claude Code Sessions

Building the sessions command that parses Claude Code session data to auto-generate development logs, plus improvements to AI chat extraction

Overview

log-blog is a Python CLI tool that converts Chrome browsing history into Hugo blog posts. Today’s work split across two major threads. First, I improved AI chat URL classification and added Gemini share link extraction. Second, I built a new sessions command that parses Claude Code CLI session data to auto-generate development log posts. Across four sessions and roughly five hours, 13 commits landed.


AI Chat Extraction Improvements — AI_LANDING Noise Filter

Background

When extracting AI service URLs from Chrome history, actual conversation pages and landing/login pages were mixed together. Across two Chrome profiles, 96 out of 3,575 URLs were AI service URLs — and most were noise: claude.ai/oauth/*, chatgpt.com/ (landing page), gemini.google.com/app (no conversation ID).

Diagnosis:

  • Claude: Most URLs were claude.ai/code/* (Claude Code sessions); claude.ai/chat/{uuid} conversation patterns: 0
  • ChatGPT: 1 conversation URL, the rest landing pages
  • Gemini: gemini.google.com/app/{id} conversations matched, but gemini.google.com/share/{id} (share links) were missing
  • Perplexity: No URLs in history at all

Implementation

I added AI_LANDING to the UrlType enum and restructured the classifier to run the noise filter before conversation pattern matching.

class UrlType(str, Enum):
    # ... existing types ...
    AI_LANDING = "ai_landing"  # Noise: landing/OAuth/settings pages

Sample noise patterns:

_AI_NOISE_PATTERNS = [
    re.compile(r"claude\.ai/(?:oauth|chrome|code(?:/(?:onboarding|family))?)?(?:[?#]|$)"),
    re.compile(r"chatgpt\.com/?(?:[?#]|$)"),
    re.compile(r"gemini\.google\.com/(?:app)?(?:/download)?(?:[?#]|$)"),
    # ...
]
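
A minimal sketch of the resulting classification order (the helper name classify_ai_url and the _AI_CONVERSATION_PATTERNS table are illustrative, not the actual url_classifier.py API):

def classify_ai_url(url: str) -> UrlType:
    # Noise first: a landing/OAuth/settings page must never fall
    # through to conversation matching.
    for pattern in _AI_NOISE_PATTERNS:
        if pattern.search(url):
            return UrlType.AI_LANDING
    # Only then try the conversation patterns, e.g.
    # claude.ai/chat/{uuid} or gemini.google.com/app/{id}.
    for url_type, pattern in _AI_CONVERSATION_PATTERNS:
        if pattern.search(url):
            return url_type
    return UrlType.OTHER  # assumed catch-all member of UrlType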

In content_fetcher.py, AI_LANDING URLs now get an early-return skip with no fetch attempt, so Playwright slots aren't wasted on login walls.
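
Sketched roughly, with hypothetical entry and result types (the actual content_fetcher.py signatures may differ):

def fetch_content(entry) -> dict | None:
    # Early return: AI_LANDING URLs are OAuth redirects and login
    # walls, so there is nothing worth fetching.
    if entry.url_type == UrlType.AI_LANDING:
        return None
    return _fetch_with_playwright(entry.url)  # hypothetical downstream path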

I also added url_type to the extract --json output, so the skill’s Step 2 classification uses the same regex engine instead of having Claude guess the type.
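
As an illustration, an AI landing entry in the extract --json output now carries its type explicitly (the surrounding field names are assumptions):

{
  "url": "https://chatgpt.com/",
  "title": "ChatGPT",
  "url_type": "ai_landing"
}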

Result: 34 AI chat conversations correctly classified, 32 noise URLs filtered out.

I also added the gemini.google.com/share/{id} pattern to the Gemini classification regex and implemented a dedicated _extract_gemini_share() extractor in ai_chat_fetcher.py. Share links are publicly accessible, so they’re handled with standard Playwright — no CDP connection needed.
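
A rough sketch of what such an extractor can look like with sync Playwright; the main selector and return shape are assumptions, not the actual ai_chat_fetcher.py code:

def _extract_gemini_share(url: str) -> str | None:
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # Public share page: a plain headless browser works, no CDP
        # attach to a logged-in Chrome profile required.
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        text = page.inner_text("main")  # selector is an assumption
        browser.close()
        return text.strip() or None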


YouTube Fetcher Fix — Adapting to a Breaking API Change

Background

While writing a blog post, YouTube transcript fetching failed:

AttributeError: type object 'YouTubeTranscriptApi' has no attribute 'list_transcripts'

The youtube-transcript-api library shipped a v1.x update that changed class methods to instance methods.

| v0.x (old) | v1.x (new) |
|---|---|
| YouTubeTranscriptApi.list_transcripts(video_id) | YouTubeTranscriptApi().list(video_id) |
| YouTubeTranscriptApi.get_transcript(video_id) | YouTubeTranscriptApi().fetch(video_id) |

Implementation

I rewrote youtube_fetcher.py:

def _get_transcript(video_id: str):
    from youtube_transcript_api import YouTubeTranscriptApi

    api = YouTubeTranscriptApi()
    # Preferred path: fetch directly in a preferred language.
    try:
        return api.fetch(video_id, languages=["ko", "en"])
    except Exception:
        pass
    # Fallback: enumerate every available transcript and return the
    # first one that fetches successfully.
    try:
        transcript_list = api.list(video_id)
        for transcript in transcript_list:
            try:
                return transcript.fetch()
            except Exception:
                continue
    except Exception:
        pass
    return None

I also added the YouTube oEmbed API as a fallback to fetch video metadata (title, channel name, thumbnail) even when no transcript is available. Zero dependencies — just urllib.request:

_OEMBED_URL = "https://www.youtube.com/oembed?url=https://www.youtube.com/watch?v={video_id}&format=json"
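
Fetching and parsing it stays in the stdlib; a sketch with a hypothetical helper name:

def _fetch_oembed(video_id: str) -> dict | None:
    import json
    import urllib.request

    url = _OEMBED_URL.format(video_id=video_id)
    try:
        # Response includes title, author_name, and thumbnail_url.
        with urllib.request.urlopen(url, timeout=10) as resp:
            return json.load(resp)
    except Exception:
        return None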

Three-tier fallback:

  1. Transcript + oEmbed metadata (best)
  2. oEmbed metadata only (when transcript unavailable)
  3. Playwright scraping (when everything else fails)
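
Chained together, the tiers might look like this (helper names other than _get_transcript are illustrative):

def fetch_video_content(video_id: str) -> dict:
    meta = _fetch_oembed(video_id)
    transcript = _get_transcript(video_id)
    if transcript or meta:
        # Tier 1 (transcript + metadata) or tier 2 (metadata only).
        return {"meta": meta, "transcript": transcript}
    # Tier 3: fall back to scraping the watch page.
    return _scrape_with_playwright(video_id)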

Sessions Command — Extracting Dev Logs from Claude Code Sessions

Background

I run 20–40 Claude Code CLI sessions per day across multiple projects (GitHub + Bitbucket). Those sessions contain rich development narrative — debugging processes, architecture decisions, code changes — but there was no way to turn them into blog posts. The Chrome history pipeline tells me “what I looked at” but not “what I built.”

Data Flow

At a high level: ~/.claude/projects/*.jsonl → automatic project discovery → JSONL parsing and filtering → sessions --json output → Claude Code skill (Dev Log Mode) → Hugo post.

Automatic Project Discovery

Claude Code stores session files under ~/.claude/projects/, in directories whose names encode the project’s absolute path with slashes replaced by hyphens:

-Users-lsr-Documents-github-trading-agent/
  ├── f08f2420-0442-475f-a1f8-3691da54eb9d.jsonl
  ├── 30de43c5-8bc2-48d0-86df-c1a6a3f7f6ee.jsonl
  └── ...

The problem: directory names can contain hyphens. For a repo named hybrid-image-search-demo, it’s impossible to tell from the directory name alone which hyphens are path separators and which are part of directory names.

I solved this with a greedy filesystem matching algorithm:

def _reverse_map_path(dirname: str) -> Path | None:
    # Strip worktree suffix if present
    if _WORKTREE_SEPARATOR in dirname:
        dirname = dirname.split(_WORKTREE_SEPARATOR)[0]

    raw = "/" + dirname[1:]  # leading '-' → '/'
    segments = raw.split("-")

    result_parts: list[str] = []
    i = 0
    while i < len(segments):
        matched = False
        for j in range(len(segments), i, -1):
            candidate = "-".join(segments[i:j])
            test_path = "/".join(result_parts + [candidate])
            if os.path.exists(test_path):
                result_parts.append(candidate)
                i = j
                matched = True
                break
        if not matched:
            result_parts.append(segments[i])
            i += 1

    path = Path("/".join(result_parts))
    return path if path.exists() else None

By trying the longest possible match first, directories with hyphens like /Users/lsr/Documents/bitbucket/hybrid-image-search-demo are resolved correctly.
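
For example, assuming the repo exists at that path on disk:

>>> _reverse_map_path("-Users-lsr-Documents-bitbucket-hybrid-image-search-demo")
PosixPath('/Users/lsr/Documents/bitbucket/hybrid-image-search-demo')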

JSONL Parsing — Smart Filtering

Claude Code’s JSONL files contain many message types: user, assistant, system, progress, and more. Including everything produces too much noise; I need to extract what matters.

| Message type | Include? | What to extract |
|---|---|---|
| User text | Yes | Full text (narrative backbone) |
| Assistant text | Yes | Up to 1,500 chars (decisions/explanations) |
| Edit/Write tool calls | Yes | File path + diff content |
| Bash errors | Yes | Command + stderr |
| Bash success | Summary only | Command only |
| WebFetch/WebSearch | Summary only | URL/query only |
| Agent subtasks | Summary only | Delegation description + result summary |
| Read/Grep/Glob | No | Exploration noise |
| thinking blocks | No | Internal reasoning, noise |

Default exclusions: sessions under 2 minutes or with fewer than 3 messages (override with --include-short). Max 100 items per session.
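
In code the default exclusions reduce to a small predicate; the constant and attribute names here are assumptions:

_MIN_DURATION_SECONDS = 120  # sessions under 2 minutes are skipped
_MIN_MESSAGES = 3

def _should_include(session, include_short: bool = False) -> bool:
    if include_short:
        return True  # --include-short disables the filter
    return (
        session.duration_seconds >= _MIN_DURATION_SECONDS
        and session.message_count >= _MIN_MESSAGES
    )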

CLI Usage

# List available projects
uv run log-blog sessions --list

# Detailed session data for a specific project (JSON)
uv run log-blog sessions --project log-blog --all --json

# All data including short sessions
uv run log-blog sessions --all --include-short --json

The output JSON contains three key datasets — sessions, git_commits, and files_changed — which the Claude Code skill’s “Dev Log Mode” reads to write a narrative development log post.
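
Roughly, the shape of that JSON (keys inside each dataset are illustrative; the commit row is taken from today’s log):

{
  "sessions": [
    {"project": "log-blog", "started_at": "...", "items": ["..."]}
  ],
  "git_commits": [
    {
      "message": "feat: add sessions command for Claude Code dev log extraction",
      "files": ["cli.py", "config.py", "session_parser.py"]
    }
  ],
  "files_changed": ["cli.py", "config.py", "session_parser.py"]
}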


Skill Update — Adding Dev Log Mode

I added a “Dev Log Mode” section to SKILL.md. When a user says “summarize what I did today” or “write a dev log,” the skill now branches to the session-data flow instead of the Chrome history flow.

Comparing the two modes:

| Item | Chrome History Mode | Dev Log Mode |
|---|---|---|
| Data source | Chrome SQLite DB | Claude Code JSONL + git log |
| Content nature | “What I looked at” | “What I built” |
| Post style | Topic-based technical analysis | Problem → solution narrative |
| Fetching needed | Yes (Playwright/API per URL) | No (included in session data) |

Commit Log

| Message | Changed files |
|---|---|
| docs: add design spec for AI chat extraction improvement | specs |
| docs: fix stale references in AI chat extraction spec | specs |
| docs: add implementation plan for AI chat extraction improvement | plans |
| chore: add pytest dev dependency | pyproject.toml, uv.lock |
| feat: add AI_LANDING noise filter and Gemini share link support | url_classifier.py, tests |
| feat: add url_type to extract --json and filter AI_LANDING noise | cli.py, tests |
| feat: skip AI_LANDING URLs in content fetcher | content_fetcher.py |
| feat: add Gemini share link content extraction | ai_chat_fetcher.py |
| docs: update skill to use url_type from extract output | SKILL.md |
| docs: add session-to-devlog feature design spec | specs |
| docs: update session-devlog spec with review fixes | specs |
| docs: add session-devlog implementation plan | plans |
| feat: add sessions command for Claude Code dev log extraction | cli.py, config.py, session_parser.py |

Insights

Two separate threads converged on the same goal today. Improving AI chat URL classification captures “what I looked at externally” more accurately; the sessions command captures “what I built internally.” Together they move log-blog from a “browsing log tool” to a foundation for recording the full scope of development activity.

The greedy filesystem matching algorithm is simple but effective. Reverse-mapping hyphenated directory names can’t be solved with regex alone — checking the actual filesystem is the most reliable approach. The key insight is accepting that Claude Code’s project directory encoding is lossy and validating at runtime instead.

The youtube-transcript-api v1.x breaking change was a reminder of why dependency management matters. Adding oEmbed as a fallback reflects graceful degradation — “if we can’t get the transcript, at least get the metadata.” The result is a three-tier fallback (transcript + oEmbed, oEmbed only, Playwright), each level maximizing the information retrieved.

The spec → design → plan → implement workflow (brainstorm → writing-plans → subagent-driven-development) continues to prove its worth. The AI chat improvement handled 7 tasks in parallel via subagents, and three spec review loops removed unnecessary types like AI_CHAT_CLAUDE_CODE, meaningfully improving the design before any code was written.
