
PopCon Dev Log #4 — SAM 2.1 Interactive Segmentation and Cost Optimization

Replacing VEO 3 with DashScope Wan 2.2 for cheaper video generation, and building SAM 2.1 interactive background removal to replace rembg

Previous post: PopCon Dev Log #3

Overview

This is the fourth entry in the PopCon dev log series. Two major changes happened this round. First, VEO 3’s cost was unsustainable, so I switched the video generation model to Alibaba’s DashScope Wan 2.2. Second, rembg’s background removal quality wasn’t cutting it, so I built an interactive segmentation workflow using Meta’s SAM 2.1 — users click on the foreground object and SAM generates a precise mask.

Video Generation Model Swap: VEO 3 → DashScope Wan 2.2

The Cost Problem

VEO 3 produced good results, but the cost added up fast. PopCon needs to generate multiple action videos per emoji character, so per-generation cost matters a lot.

I evaluated several alternatives:

| Option | Pros | Cons |
| --- | --- | --- |
| fal.ai Wan 2.1 | Simple API | Mediocre quality-to-price ratio |
| RunPod GPU | Full control | Infrastructure overhead |
| Alibaba DashScope Wan 2.2 | Lowest cost, decent quality | China-based API |

DashScope Wan 2.2 won on price-to-quality ratio.
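To make the integration concrete, here is a minimal sketch of how a DashScope-style async video-generation request could be assembled. The endpoint path, model name, and parameter keys below are illustrative assumptions based on DashScope's task-based API conventions, not the exact Wan 2.2 contract — verify against the official DashScope documentation.

```python
# Sketch of a DashScope-style video-generation request. Endpoint path,
# model name, and payload keys are assumptions for illustration only.
import os

DASHSCOPE_BASE = "https://dashscope.aliyuncs.com/api/v1"  # assumed base URL

def build_video_request(prompt: str, image_url: str, model: str = "wan2.2-i2v") -> dict:
    """Build headers + JSON body for a hypothetical image-to-video call."""
    return {
        "url": f"{DASHSCOPE_BASE}/services/aigc/video-generation/video-synthesis",
        "headers": {
            "Authorization": f"Bearer {os.environ.get('DASHSCOPE_API_KEY', '')}",
            "Content-Type": "application/json",
            # DashScope's long-running jobs are submitted async and polled by task id
            "X-DashScope-Async": "enable",
        },
        "json": {
            "model": model,
            "input": {"prompt": prompt, "img_url": image_url},
        },
    }

# req = build_video_request("character waves happily", "https://example.com/char.png")
# resp = requests.post(req["url"], headers=req["headers"], json=req["json"])
```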

Alongside the model swap, several other changes went in:

  • Frontend action selection: Users can now pick which actions to generate instead of getting all of them
  • Backbone generation removed: No longer needed with Wan 2.2
  • End pose generation removed: Eliminated an unnecessary processing step
  • Inter-action throttles removed: No more artificial delays between action generations

Character Generation Improvements

Full-Body Enforcement

AI character generation sometimes produced only upper-body results. This caused inconsistent lower bodies across different actions. I updated the prompts to enforce full-body generation every time.
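The enforcement can live entirely at the prompt level. The sketch below shows one way to do it; the directive wording is a hypothetical example, not PopCon's actual prompt text.

```python
# Sketch of prompt-level full-body enforcement. The directive string is a
# hypothetical example, not the real PopCon prompt.
FULL_BODY_DIRECTIVE = (
    "full body shot, head to toe visible, feet included, "
    "no cropping at the waist or knees"
)

def enforce_full_body(prompt: str) -> str:
    """Append the full-body directive unless the prompt already asks for it."""
    if "full body" in prompt.lower():
        return prompt  # avoid duplicating the directive
    return f"{prompt}, {FULL_BODY_DIRECTIVE}"
```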

Reference Image Support

Users can now upload a reference image when generating characters. This is useful for creating variations of existing characters or matching a particular style.

Other Improvements

  • Broader image format support: WebP, GIF, BMP, and TIFF uploads now accepted
  • Background removal for uploads: Uploaded character images can optionally have their background removed
  • Media preview modal: Click an emoji card to see it at full size
  • Asset download links: Direct download for generated assets
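The broadened format support boils down to a whitelist check at upload time. A minimal sketch, assuming an extension-based check (the helper name is hypothetical):

```python
# Upload validation sketch for the broadened format support. The extension
# set mirrors the formats listed above; the helper name is hypothetical.
from pathlib import Path

ALLOWED_UPLOAD_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp", ".gif", ".bmp", ".tiff"}

def is_supported_upload(filename: str) -> bool:
    """True if the file's extension is in the accepted image format list."""
    return Path(filename).suffix.lower() in ALLOWED_UPLOAD_EXTENSIONS
```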

Performance Optimization

Pose generation was changed from sequential to parallel. Startup delay and inter-action throttles were removed. End pose generation was eliminated entirely. The perceived speed improvement is significant.
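The sequential-to-parallel change can be sketched with a thread pool, which suits I/O-bound calls to a remote generation backend. `generate_pose` below is a stand-in for the real per-action API call:

```python
# Parallel pose generation sketch. generate_pose stands in for the real
# network call; ThreadPoolExecutor overlaps the I/O-bound requests.
from concurrent.futures import ThreadPoolExecutor

def generate_pose(action: str) -> str:
    # placeholder for the real call to the pose/video generation backend
    return f"{action}-pose"

def generate_all_poses(actions: list[str], max_workers: int = 4) -> list[str]:
    """Run one generation per action concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(generate_pose, actions))
```

With N actions, wall-clock time drops from the sum of per-action latencies to roughly the slowest single call (up to `max_workers` at a time).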

SAM 2.1 Interactive Background Removal

Why rembg Wasn’t Enough

In the previous post, I implemented background removal with rembg. The quality issues were hard to ignore:

  • Inaccurate foreground boundaries on complex backgrounds
  • Parts of the character getting clipped, or background artifacts remaining
  • Fundamental limitation of fully automated approaches — the model can’t always tell what’s foreground

Why SAM 2.1

Meta’s SAM 2.1 (Segment Anything Model 2) segments images based on user-provided point prompts. Key advantages:

  • Interactive: Users indicate foreground/background directly, improving accuracy
  • Runs on M1 Mac: I initially considered cloud GPU options like RunPod, but confirmed SAM 2.1 runs well on M1 Mac via PyTorch’s MPS backend
  • Easy integration: Available through the ultralytics package
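Device selection is the main local-inference concern. The helper below is pure so the sketch stays dependency-free; in the real app you would feed it `torch.cuda.is_available()` and `torch.backends.mps.is_available()`. The commented ultralytics usage follows its SAM docs, but verify the model filename and call shape against your installed version:

```python
# Device selection sketch for running SAM 2.1 locally. Pure function so it
# can be tested without torch; wire in torch's availability checks at runtime.
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple's MPS backend (M1/M2 Macs), then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"  # routes through Metal Performance Shaders on Apple silicon
    return "cpu"

# With ultralytics (call shape per its SAM docs; check your version):
#   from ultralytics import SAM
#   model = SAM("sam2.1_b.pt")
#   results = model("frame.png", points=[[420, 310]], labels=[1],
#                   device=pick_device(False, True))
```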

Architecture

Workflow Changes

Previously, the pipeline was fully automatic: video generation → frame extraction → background removal. With SAM, there’s now a user interaction step in the middle:

  1. Video generation → frame extraction (worker stage 3 completes here)
  2. Status changes to awaiting_refinement
  3. User visits /refine page and clicks to remove backgrounds
  4. Final asset generation after refinement

I added the awaiting_refinement status so the frontend can show a “waiting for background removal” state and display a Refine Backgrounds link. The ProgressTracker treats this status as generation-complete.
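The status flow can be sketched as a small enum plus the ProgressTracker's completion check. Only `awaiting_refinement` is confirmed from the post; the other status names here are illustrative:

```python
# Job status flow sketch with the new awaiting_refinement state. Status
# names other than awaiting_refinement are illustrative assumptions.
from enum import Enum

class JobStatus(str, Enum):
    GENERATING = "generating"
    EXTRACTING_FRAMES = "extracting_frames"
    AWAITING_REFINEMENT = "awaiting_refinement"  # worker stage 3 stops here
    REFINING = "refining"
    COMPLETE = "complete"

# Statuses the ProgressTracker treats as "generation finished"
GENERATION_DONE = {JobStatus.AWAITING_REFINEMENT, JobStatus.REFINING, JobStatus.COMPLETE}

def is_generation_complete(status: JobStatus) -> bool:
    return status in GENERATION_DONE
```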

Implementation Details

Backend — SAMSegmenter class:

  • predict: Takes click points, returns predicted masks
  • apply_mask: Applies a predicted mask to the original image, producing an RGBA result
  • predict_and_apply_all: Batch processes all frames
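The core of `apply_mask` is combining an RGB frame with a binary mask so background pixels become fully transparent. A minimal sketch (real code would operate on numpy arrays or PIL images; plain lists keep this dependency-free):

```python
# apply_mask sketch: RGB frame + binary mask -> RGBA with transparent
# background. Plain nested lists stand in for numpy arrays.
def apply_mask(rgb_pixels, mask):
    """rgb_pixels: rows of (r, g, b) tuples; mask: rows of 0/1 ints.

    Returns rows of (r, g, b, a) tuples where a=255 for foreground
    pixels (mask == 1) and a=0 for background.
    """
    rgba = []
    for pixel_row, mask_row in zip(rgb_pixels, mask):
        rgba.append([
            (r, g, b, 255 if keep else 0)
            for (r, g, b), keep in zip(pixel_row, mask_row)
        ])
    return rgba
```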

Backend — API endpoints:

  • GET /raw-frame: Serves original frame images
  • POST /sam/predict: Point-based mask prediction, returns RGBA mask
  • POST /sam/apply: Applies mask to frame
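The `/sam/predict` request shape follows SAM's point-prompt convention: positive clicks mark foreground, negative clicks mark background. A sketch of the payload handling (field names are illustrative, not the exact PopCon schema):

```python
# /sam/predict payload sketch. Positive clicks (label=1) mark foreground,
# negative clicks (label=0) mark background. Field names are illustrative.
from dataclasses import dataclass

@dataclass
class ClickPoint:
    x: int
    y: int
    label: int  # 1 = foreground, 0 = background

def to_sam_prompts(points: list[ClickPoint]) -> tuple[list[list[int]], list[int]]:
    """Split click points into the parallel coords/labels lists SAM expects."""
    coords = [[p.x, p.y] for p in points]
    labels = [p.label for p in points]
    return coords, labels
```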

Frontend — SegmentCanvas component:

  • Renders frame image on a canvas
  • Captures click events to collect point coordinates
  • Calls SAM API for mask preview
  • Calls apply API on confirmation

Commit Log

| Message | Changes |
| --- | --- |
| feat: replace VEO 3 with DashScope Wan 2.2 and remove backbone generation | Swap video generation model, remove backbone step |
| feat: pass selected action names from frontend to backend | Frontend action selection |
| fix: clear character preview when switching between upload and generate modes | Reset preview on mode switch |
| feat: add optional reference image support for AI character generation | Reference image upload |
| feat: support WebP, GIF, BMP, and TIFF image uploads | Broader format support |
| feat: add background removal option for uploaded character images | Background removal for uploads |
| perf: remove end pose generation and inter-action throttles | Remove unnecessary steps and delays |
| feat: enforce full-body character generation and add asset download links | Full-body enforcement, download links |
| fix: add media preview modal with close button to emoji cards | Media preview modal |
| perf: parallelize pose generation and eliminate startup delay | Parallel pose generation |
| docs: add SAM2 interactive background removal design spec | SAM2 design document |
| docs: add SAM2 interactive background removal implementation plan | SAM2 implementation plan |
| feat: add ultralytics SAM 2.1 dependency and sam_model config | Add SAM 2.1 dependency |
| feat: add awaiting_refinement status to models | New awaiting_refinement status |
| refactor: simplify process_video to extract-only (no bg removal) | Simplify video processing |
| refactor: worker stage 3 extracts frames only, ends at awaiting_refinement | Worker stage 3 stops at extraction |
| feat: add SAMSegmenter class with predict, apply_mask, predict_and_apply_all | Core SAMSegmenter implementation |
| feat: add SAM2 endpoints and raw frame serving to FastAPI | SAM2 API endpoints |
| feat: add SAM embed/predict/apply API functions | Frontend SAM API functions |
| feat: add SegmentCanvas click-to-segment component | Click-to-segment canvas component |
| feat: add /refine page for interactive SAM2 background removal | /refine page implementation |
| feat: add Refine Backgrounds link and awaiting_refinement status display | Refine link and status display |
| feat: treat awaiting_refinement as generation-complete in ProgressTracker | ProgressTracker status handling |
| fix: address code review findings | Code review fixes |
| merge: integrate main refactors with SAM2 interactive bg removal | Merge main refactors |
| merge: integrate main branch changes with SAM2 implementation | Merge main changes |
| fix: return RGBA mask from SAM predict endpoint | Fix SAM predict RGBA mask |

Next Steps

  • Improve UX for applying segmentation results across all frames at once
  • Connect the final APNG/GIF asset generation pipeline
  • Optimize SAM model loading for deployment environments

This is the fourth post in the PopCon series. More to come.
