
PopCon Dev Log #4 — SAM 2.1 Interactive Segmentation and Cost Optimization

Replacing VEO 3 with DashScope Wan 2.2 for cheaper video generation, and building SAM 2.1 interactive background removal to replace rembg

Previous post: PopCon Dev Log #3

Overview

This is the fourth entry in the PopCon dev log series. Two major changes happened this round. First, VEO 3’s cost was unsustainable, so I switched the video generation model to Alibaba’s DashScope Wan 2.2. Second, rembg’s background removal quality wasn’t cutting it, so I built an interactive segmentation workflow using Meta’s SAM 2.1 — users click on the foreground object and SAM generates a precise mask.

Video Generation Model Swap: VEO 3 → DashScope Wan 2.2

The Cost Problem

VEO 3 produced good results, but the cost added up fast. PopCon needs to generate multiple action videos per emoji character, so per-generation cost matters a lot.

I evaluated several alternatives:

| Option | Pros | Cons |
| --- | --- | --- |
| fal.ai Wan 2.1 | Simple API | Mediocre quality-to-price ratio |
| RunPod GPU | Full control | Infrastructure overhead |
| Alibaba DashScope Wan 2.2 | Lowest cost, decent quality | China-based API |

DashScope Wan 2.2 won on price-to-quality ratio.
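To make the integration concrete, here is a minimal sketch of how a DashScope-style async video-generation request could be assembled. The endpoint path, model name, and parameter keys below are illustrative assumptions based on DashScope's task-based API conventions, not the exact Wan 2.2 contract — verify against the official DashScope documentation.

```python
# Sketch of a DashScope-style video-generation request. Endpoint path,
# model name, and payload keys are assumptions for illustration only.
import os

DASHSCOPE_BASE = "https://dashscope.aliyuncs.com/api/v1"  # assumed base URL

def build_video_request(prompt: str, image_url: str, model: str = "wan2.2-i2v") -> dict:
    """Build headers + JSON body for a hypothetical image-to-video call."""
    return {
        "url": f"{DASHSCOPE_BASE}/services/aigc/video-generation/video-synthesis",
        "headers": {
            "Authorization": f"Bearer {os.environ.get('DASHSCOPE_API_KEY', '')}",
            "Content-Type": "application/json",
            # DashScope's long-running jobs are submitted async and polled by task id
            "X-DashScope-Async": "enable",
        },
        "json": {
            "model": model,
            "input": {"prompt": prompt, "img_url": image_url},
        },
    }

# req = build_video_request("character waves happily", "https://example.com/char.png")
# resp = requests.post(req["url"], headers=req["headers"], json=req["json"])
```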

Alongside the model swap, several other changes went in:

  • Frontend action selection: Users can now pick which actions to generate instead of getting all of them
  • Backbone generation removed: No longer needed with Wan 2.2
  • End pose generation removed: Eliminated an unnecessary processing step
  • Inter-action throttles removed: No more artificial delays between action generations

Character Generation Improvements

Full-Body Enforcement

AI character generation sometimes produced only upper-body results. This caused inconsistent lower bodies across different actions. I updated the prompts to enforce full-body generation every time.
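The enforcement can live entirely at the prompt level. The sketch below shows one way to do it; the directive wording is a hypothetical example, not PopCon's actual prompt text.

```python
# Sketch of prompt-level full-body enforcement. The directive string is a
# hypothetical example, not the real PopCon prompt.
FULL_BODY_DIRECTIVE = (
    "full body shot, head to toe visible, feet included, "
    "no cropping at the waist or knees"
)

def enforce_full_body(prompt: str) -> str:
    """Append the full-body directive unless the prompt already asks for it."""
    if "full body" in prompt.lower():
        return prompt  # avoid duplicating the directive
    return f"{prompt}, {FULL_BODY_DIRECTIVE}"
```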

Reference Image Support

Users can now upload a reference image when generating characters. This is useful for creating variations of existing characters or matching a particular style.

Other Improvements

  • Broader image format support: WebP, GIF, BMP, and TIFF uploads now accepted
  • Background removal for uploads: Uploaded character images can optionally have their background removed
  • Media preview modal: Click an emoji card to see it at full size
  • Asset download links: Direct download for generated assets
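The broadened format support boils down to a whitelist check at upload time. A minimal sketch, assuming an extension-based check (the helper name is hypothetical):

```python
# Upload validation sketch for the broadened format support. The extension
# set mirrors the formats listed above; the helper name is hypothetical.
from pathlib import Path

ALLOWED_UPLOAD_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp", ".gif", ".bmp", ".tiff"}

def is_supported_upload(filename: str) -> bool:
    """True if the file's extension is in the accepted image format list."""
    return Path(filename).suffix.lower() in ALLOWED_UPLOAD_EXTENSIONS
```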

Performance Optimization

Pose generation was changed from sequential to parallel. Startup delay and inter-action throttles were removed. End pose generation was eliminated entirely. The perceived speed improvement is significant.
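The sequential-to-parallel change can be sketched with a thread pool, which suits I/O-bound calls to a remote generation backend. `generate_pose` below is a stand-in for the real per-action API call:

```python
# Parallel pose generation sketch. generate_pose stands in for the real
# network call; ThreadPoolExecutor overlaps the I/O-bound requests.
from concurrent.futures import ThreadPoolExecutor

def generate_pose(action: str) -> str:
    # placeholder for the real call to the pose/video generation backend
    return f"{action}-pose"

def generate_all_poses(actions: list[str], max_workers: int = 4) -> list[str]:
    """Run one generation per action concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(generate_pose, actions))
```

With N actions, wall-clock time drops from the sum of per-action latencies to roughly the slowest single call (up to `max_workers` at a time).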

SAM 2.1 Interactive Background Removal

Why rembg Wasn’t Enough

In the previous post, I implemented background removal with rembg. The quality issues were hard to ignore:

  • Inaccurate foreground boundaries on complex backgrounds
  • Parts of the character getting clipped, or background artifacts remaining
  • Fundamental limitation of fully automated approaches — the model can’t always tell what’s foreground

Why SAM 2.1

Meta’s SAM 2.1 (Segment Anything Model 2) segments images based on user-provided point prompts. Key advantages:

  • Interactive: Users indicate foreground/background directly, improving accuracy
  • Runs on M1 Mac: I initially considered cloud GPU options like RunPod, but confirmed SAM 2.1 runs well on M1 Mac via PyTorch’s MPS backend
  • Easy integration: Available through the ultralytics package
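Device selection is the main local-inference concern. The helper below is pure so the sketch stays dependency-free; in the real app you would feed it `torch.cuda.is_available()` and `torch.backends.mps.is_available()`. The commented ultralytics usage follows its SAM docs, but verify the model filename and call shape against your installed version:

```python
# Device selection sketch for running SAM 2.1 locally. Pure function so it
# can be tested without torch; wire in torch's availability checks at runtime.
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple's MPS backend (M1/M2 Macs), then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"  # routes through Metal Performance Shaders on Apple silicon
    return "cpu"

# With ultralytics (call shape per its SAM docs; check your version):
#   from ultralytics import SAM
#   model = SAM("sam2.1_b.pt")
#   results = model("frame.png", points=[[420, 310]], labels=[1],
#                   device=pick_device(False, True))
```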

Architecture

Workflow Changes

Previously, the pipeline was fully automatic: video generation → frame extraction → background removal. With SAM, there’s now a user interaction step in the middle:

  1. Video generation → frame extraction (worker stage 3 completes here)
  2. Status changes to awaiting_refinement
  3. User visits /refine page and clicks to remove backgrounds
  4. Final asset generation after refinement

I added the awaiting_refinement status so the frontend can show a “waiting for background removal” state and display a Refine Backgrounds link. The ProgressTracker treats this status as generation-complete.
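The status flow can be sketched as a small enum plus the ProgressTracker's completion check. Only `awaiting_refinement` is confirmed from the post; the other status names here are illustrative:

```python
# Job status flow sketch with the new awaiting_refinement state. Status
# names other than awaiting_refinement are illustrative assumptions.
from enum import Enum

class JobStatus(str, Enum):
    GENERATING = "generating"
    EXTRACTING_FRAMES = "extracting_frames"
    AWAITING_REFINEMENT = "awaiting_refinement"  # worker stage 3 stops here
    REFINING = "refining"
    COMPLETE = "complete"

# Statuses the ProgressTracker treats as "generation finished"
GENERATION_DONE = {JobStatus.AWAITING_REFINEMENT, JobStatus.REFINING, JobStatus.COMPLETE}

def is_generation_complete(status: JobStatus) -> bool:
    return status in GENERATION_DONE
```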

Implementation Details

Backend — SAMSegmenter class:

  • predict: Takes click points, returns predicted masks
  • apply_mask: Applies a predicted mask to the original image, producing an RGBA result
  • predict_and_apply_all: Batch processes all frames
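The core of `apply_mask` is combining an RGB frame with a binary mask so background pixels become fully transparent. A minimal sketch (real code would operate on numpy arrays or PIL images; plain lists keep this dependency-free):

```python
# apply_mask sketch: RGB frame + binary mask -> RGBA with transparent
# background. Plain nested lists stand in for numpy arrays.
def apply_mask(rgb_pixels, mask):
    """rgb_pixels: rows of (r, g, b) tuples; mask: rows of 0/1 ints.

    Returns rows of (r, g, b, a) tuples where a=255 for foreground
    pixels (mask == 1) and a=0 for background.
    """
    rgba = []
    for pixel_row, mask_row in zip(rgb_pixels, mask):
        rgba.append([
            (r, g, b, 255 if keep else 0)
            for (r, g, b), keep in zip(pixel_row, mask_row)
        ])
    return rgba
```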

Backend — API endpoints:

  • GET /raw-frame: Serves original frame images
  • POST /sam/predict: Point-based mask prediction, returns RGBA mask
  • POST /sam/apply: Applies mask to frame
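The `/sam/predict` request shape follows SAM's point-prompt convention: positive clicks mark foreground, negative clicks mark background. A sketch of the payload handling (field names are illustrative, not the exact PopCon schema):

```python
# /sam/predict payload sketch. Positive clicks (label=1) mark foreground,
# negative clicks (label=0) mark background. Field names are illustrative.
from dataclasses import dataclass

@dataclass
class ClickPoint:
    x: int
    y: int
    label: int  # 1 = foreground, 0 = background

def to_sam_prompts(points: list[ClickPoint]) -> tuple[list[list[int]], list[int]]:
    """Split click points into the parallel coords/labels lists SAM expects."""
    coords = [[p.x, p.y] for p in points]
    labels = [p.label for p in points]
    return coords, labels
```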

Frontend — SegmentCanvas component:

  • Renders frame image on a canvas
  • Captures click events to collect point coordinates
  • Calls SAM API for mask preview
  • Calls apply API on confirmation

Commit Log

| Message | Changes |
| --- | --- |
| feat: replace VEO 3 with DashScope Wan 2.2 and remove backbone generation | Swap video generation model, remove backbone step |
| feat: pass selected action names from frontend to backend | Frontend action selection |
| fix: clear character preview when switching between upload and generate modes | Reset preview on mode switch |
| feat: add optional reference image support for AI character generation | Reference image upload |
| feat: support WebP, GIF, BMP, and TIFF image uploads | Broader format support |
| feat: add background removal option for uploaded character images | Background removal for uploads |
| perf: remove end pose generation and inter-action throttles | Remove unnecessary steps and delays |
| feat: enforce full-body character generation and add asset download links | Full-body enforcement, download links |
| fix: add media preview modal with close button to emoji cards | Media preview modal |
| perf: parallelize pose generation and eliminate startup delay | Parallel pose generation |
| docs: add SAM2 interactive background removal design spec | SAM2 design document |
| docs: add SAM2 interactive background removal implementation plan | SAM2 implementation plan |
| feat: add ultralytics SAM 2.1 dependency and sam_model config | Add SAM 2.1 dependency |
| feat: add awaiting_refinement status to models | New awaiting_refinement status |
| refactor: simplify process_video to extract-only (no bg removal) | Simplify video processing |
| refactor: worker stage 3 extracts frames only, ends at awaiting_refinement | Worker stage 3 stops at extraction |
| feat: add SAMSegmenter class with predict, apply_mask, predict_and_apply_all | Core SAMSegmenter implementation |
| feat: add SAM2 endpoints and raw frame serving to FastAPI | SAM2 API endpoints |
| feat: add SAM embed/predict/apply API functions | Frontend SAM API functions |
| feat: add SegmentCanvas click-to-segment component | Click-to-segment canvas component |
| feat: add /refine page for interactive SAM2 background removal | /refine page implementation |
| feat: add Refine Backgrounds link and awaiting_refinement status display | Refine link and status display |
| feat: treat awaiting_refinement as generation-complete in ProgressTracker | ProgressTracker status handling |
| fix: address code review findings | Code review fixes |
| merge: integrate main refactors with SAM2 interactive bg removal | Merge main refactors |
| merge: integrate main branch changes with SAM2 implementation | Merge main changes |
| fix: return RGBA mask from SAM predict endpoint | Fix SAM predict RGBA mask |

Next Steps

  • Improve UX for applying segmentation results across all frames at once
  • Connect the final APNG/GIF asset generation pipeline
  • Optimize SAM model loading for deployment environments

This is the fourth post in the PopCon series. More to come.
