<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Serverless on ICE-ICE-BEAR-BLOG</title><link>https://ice-ice-bear.github.io/tags/serverless/</link><description>Recent content in Serverless on ICE-ICE-BEAR-BLOG</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Mon, 13 Apr 2026 00:00:00 +0900</lastBuildDate><atom:link href="https://ice-ice-bear.github.io/tags/serverless/index.xml" rel="self" type="application/rss+xml"/><item><title>RunPod Serverless vs Pods — When Each Wins for GPU Workloads</title><link>https://ice-ice-bear.github.io/posts/2026-04-13-runpod-serverless-vs-pods/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0900</pubDate><guid>https://ice-ice-bear.github.io/posts/2026-04-13-runpod-serverless-vs-pods/</guid><description>&lt;h2 id="overview"&gt;Overview
&lt;/h2&gt;&lt;p&gt;Wiring up &lt;a class="link" href="https://github.com/ice-ice-bear/popcon-matting-bench" target="_blank" rel="noopener"
 &gt;popcon&amp;rsquo;s&lt;/a&gt; GPU worker forced a real choice: should the inference pipeline run on RunPod Serverless or on a long-lived Pod? Both bill by the second and offer the same GPU SKUs, but their cost curves cross only at a specific utilization point. This post walks through the architecture difference and the break-even math.&lt;/p&gt;
&lt;h2 id="the-two-models"&gt;The Two Models
&lt;/h2&gt;&lt;pre class="mermaid" style="visibility:hidden"&gt;graph TD
 A["Workload type?"] --&gt; B{"Bursty &amp;lt;br/&amp;gt; (idle most of day)?"}
 B --&gt;|Yes| C[RunPod Serverless]
 B --&gt;|No, sustained| D[RunPod Pod]
 C --&gt; E[Flex worker &amp;lt;br/&amp;gt; cold-start, $0 idle]
 C --&gt; F[Active worker &amp;lt;br/&amp;gt; warm, discounted rate]
 D --&gt; G[Per-hour billing &amp;lt;br/&amp;gt; whether idle or not]
 D --&gt; H[Persistent volume]&lt;/pre&gt;&lt;h2 id="pods--long-lived-containers"&gt;Pods — Long-Lived Containers
&lt;/h2&gt;&lt;p&gt;A &lt;strong&gt;Pod&lt;/strong&gt; is a persistent container with an attached disk volume. You pay the per-hour GPU rate &lt;strong&gt;continuously&lt;/strong&gt; while the Pod is running, whether it&amp;rsquo;s processing requests or idling. Storage is &lt;code&gt;$0.10/GB/month&lt;/code&gt; for running Pods (billed per second) and &lt;strong&gt;doubles to &lt;code&gt;$0.20/GB/month&lt;/code&gt; when the Pod is stopped&lt;/strong&gt; — RunPod is incentivizing you to either keep using the volume or delete it. Volumes are deleted entirely when your account balance hits zero.&lt;/p&gt;
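&lt;p&gt;The stopped-storage penalty is easy to put in numbers. A minimal sketch using the two rates above; the helper function is illustrative, not a RunPod API:&lt;/p&gt;

```python
# Monthly volume-storage cost for a RunPod Pod, using the rates quoted above:
# $0.10/GB/month while the Pod is running, $0.20/GB/month while it is stopped.
# Illustrative helper only -- this is not part of any RunPod SDK.

RUNNING_RATE = 0.10  # $/GB/month, Pod running
STOPPED_RATE = 0.20  # $/GB/month, Pod stopped

def pod_storage_cost(gb: float, stopped: bool) -> float:
    """Monthly storage cost in dollars for a volume of `gb` gigabytes."""
    rate = STOPPED_RATE if stopped else RUNNING_RATE
    return gb * rate

# A 100 GB volume: $10/month while the Pod runs, $20/month parked stopped.
print(pod_storage_cost(100, stopped=False))
print(pod_storage_cost(100, stopped=True))
```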
&lt;p&gt;Pricing rules require &lt;strong&gt;at least one hour&amp;rsquo;s worth of credits&lt;/strong&gt; in your account to deploy, and a default &lt;code&gt;$80/hr&lt;/code&gt; spending cap protects against runaway workloads.&lt;/p&gt;
&lt;p&gt;Pods make sense when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You need a notebook environment, SSH access, or persistent state&lt;/li&gt;
&lt;li&gt;The GPU is running real work more than ~70% of the time (the break-even utilization from the math below)&lt;/li&gt;
&lt;li&gt;Cold-start latency would kill the UX (e.g., interactive video models)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="serverless--pay-per-second-handlers"&gt;Serverless — Pay-Per-Second Handlers
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Serverless workers&lt;/strong&gt; are stateless container handlers that spin up on demand, process a queue request, and tear down. Two worker classes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Flex&lt;/strong&gt; — cold-starts when traffic arrives, &lt;strong&gt;$0 idle cost&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Active&lt;/strong&gt; — kept warm at a discounted rate, no cold-start&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You write a &lt;code&gt;handler(event)&lt;/code&gt; function and ship it as a Docker image. Network volumes (&lt;code&gt;$0.07/GB/month&lt;/code&gt; under 1TB, &lt;code&gt;$0.05/GB/month&lt;/code&gt; over) provide shared storage if model weights need to be cached across workers.&lt;/p&gt;
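&lt;p&gt;The handler contract is small enough to sketch. The &lt;code&gt;rembg&lt;/code&gt; task name and the validation below are illustrative stand-ins, not popcon&amp;rsquo;s actual code; the commented-out registration call is the RunPod SDK&amp;rsquo;s documented entry point:&lt;/p&gt;

```python
# Minimal sketch of a RunPod Serverless handler. The job payload arrives as
# {"input": {...}}; whatever the function returns is serialized as the job output.
# Task name and validation are illustrative, not popcon's actual worker code.

def handler(event):
    payload = event.get("input", {})
    task = payload.get("task")
    if task != "rembg":
        return {"error": f"unsupported task: {task}"}
    # A real worker would run the matting model on payload["image"] here.
    return {"task": task, "status": "done"}

# In the shipped Docker image this is registered with the RunPod SDK:
# import runpod
# runpod.serverless.start({"handler": handler})

print(handler({"input": {"task": "rembg", "image": "frame-0.png"}}))
```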
&lt;h3 id="the-cold-start-trap"&gt;The Cold-Start Trap
&lt;/h3&gt;&lt;p&gt;Cold starts count against billed time. For a 30-second image-matting request, a 10-second cold start means you&amp;rsquo;re billed for 40 seconds. If your model is 5GB+ and lives on a network volume, that cold start can balloon. The &lt;code&gt;gpu_worker/Dockerfile&lt;/code&gt; pattern in &lt;a class="link" href="https://github.com/ice-ice-bear/popcon-matting-bench" target="_blank" rel="noopener"
 &gt;popcon&lt;/a&gt; &lt;strong&gt;bakes the model weights into the image&lt;/strong&gt; specifically to avoid this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-dockerfile" data-lang="dockerfile"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;runpod/pytorch:2.1-cuda12.1&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;COPY&lt;/span&gt; weights/birefnet.pth /app/weights/&lt;span class="err"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;COPY&lt;/span&gt; handler.py /app/&lt;span class="err"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;CMD&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;python&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;/app/handler.py&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;A 6GB image takes longer to pull but loads in seconds once cached on the worker.&lt;/p&gt;
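&lt;p&gt;The trap&amp;rsquo;s arithmetic, made explicit with the example numbers from above (the per-second rate is the A100 flex rate quoted in the break-even table):&lt;/p&gt;

```python
# Billed time on Serverless = cold start + execution, at the flex per-second rate.
# Numbers are the post's example: 30 s of work, 10 s cold start, A100 flex rate.

FLEX_RATE = 0.00076  # $/sec, A100 flex rate from the break-even table

def billed_cost(exec_sec: float, cold_start_sec: float = 0.0) -> float:
    """Dollar cost of one request, including any cold start in billed time."""
    return (exec_sec + cold_start_sec) * FLEX_RATE

warm = billed_cost(30)       # warm worker: 30 billed seconds
cold = billed_cost(30, 10)   # cold start adds 10 billed seconds (+33%)
print(round(warm, 5), round(cold, 5))
```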
&lt;h2 id="break-even-math"&gt;Break-Even Math
&lt;/h2&gt;&lt;p&gt;Rough numbers on an A100:&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Model&lt;/th&gt;
 &lt;th&gt;Rate&lt;/th&gt;
 &lt;th&gt;24h cost&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Pod&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;$1.89/hr&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;$45.36&lt;/code&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Serverless Flex (active compute)&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;$0.00076/sec&lt;/code&gt; ≈ &lt;code&gt;$2.74/hr&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;$2.74 × hours used&lt;/code&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Break-even is around 17 hours/day of utilization.&lt;/strong&gt; Below that, Serverless wins; above, Pods win. For a startup with bursty user traffic, Serverless is almost always correct. For a research lab fine-tuning continuously, Pods are.&lt;/p&gt;
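&lt;p&gt;The crossover falls straight out of the two rates in the table:&lt;/p&gt;

```python
# Daily-cost crossover between a Pod (billed around the clock) and Serverless
# flex (billed only for hours actually used), using the A100 rates above.

POD_RATE = 1.89               # $/hr, billed 24/7
FLEX_RATE = 0.00076 * 3600    # $/hr while a flex worker runs, about $2.74

pod_daily = POD_RATE * 24                # $45.36 regardless of utilization
breakeven_hours = pod_daily / FLEX_RATE  # hours/day at which the costs match

print(f"break-even: {breakeven_hours:.1f} h/day "
      f"({breakeven_hours / 24:.0%} utilization)")
```

Running this lands at ~16.6 h/day, i.e. roughly 70% utilization — the "around 17 hours" figure above.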
&lt;h2 id="concurrency-pattern"&gt;Concurrency Pattern
&lt;/h2&gt;&lt;p&gt;Where Serverless really shines is parallel inference. Fire N requests at once via &lt;code&gt;asyncio.gather&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;gpu_client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;infer&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;task&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;rembg&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;image&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;frame&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;frame&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;frames&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The bottleneck shifts from compute to RunPod&amp;rsquo;s autoscaler — when 30 requests land at once, wall-clock latency is capped at roughly &lt;em&gt;the slowest cold start plus one inference&lt;/em&gt;, not 30× the warm latency. Doing the same with a single Pod requires you to either batch the requests (extra code, harder to reason about) or spin up multiple Pods (and pay for all of them continuously).&lt;/p&gt;
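&lt;p&gt;The fan-out in miniature, with the GPU call stubbed by a sleep so the concurrency is visible (&lt;code&gt;infer&lt;/code&gt; here is a stand-in, not popcon&amp;rsquo;s actual client):&lt;/p&gt;

```python
import asyncio
import time

# Stand-in for the real Serverless call: each "inference" just sleeps.
# With workers running in parallel, 30 requests take roughly one request's
# latency rather than 30x it -- the property the paragraph above relies on.

async def infer(frame: str) -> str:
    await asyncio.sleep(0.1)  # pretend this is a 0.1 s matting request
    return f"matte-{frame}"

async def main() -> float:
    frames = [f"frame-{i}" for i in range(30)]
    start = time.perf_counter()
    results = await asyncio.gather(*[infer(f) for f in frames])
    elapsed = time.perf_counter() - start
    assert len(results) == 30
    return elapsed

elapsed = asyncio.run(main())
print(f"30 requests in {elapsed:.2f} s")  # roughly 0.1 s, not 3 s
```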
&lt;h2 id="when-not-to-use-serverless"&gt;When NOT to Use Serverless
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Long-running training jobs&lt;/strong&gt; — RunPod Serverless has a max execution time per request. Multi-hour fine-tuning belongs on a Pod.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Models with non-trivial state&lt;/strong&gt; — if your inference reads from a hot in-memory KV cache, Serverless&amp;rsquo;s stateless workers will rebuild that cache on every cold start.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency-critical interactive UX&lt;/strong&gt; — if a user is waiting in a UI for a &amp;lt;2-second response, Active workers help but still don&amp;rsquo;t match a warmed Pod.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="insights"&gt;Insights
&lt;/h2&gt;&lt;p&gt;The Serverless model is the most interesting thing happening in GPU compute right now — it makes &amp;ldquo;deploy a model as an API&amp;rdquo; feel like deploying a Lambda. For 90% of inference workloads at startup scale, Serverless is the right default; the break-even doesn&amp;rsquo;t favor Pods until you&amp;rsquo;re running close to round-the-clock. The trap to watch is cold-start cost amortization: bake weights into the image, not the network volume, and your effective Serverless cost stays close to the warm rate. RunPod&amp;rsquo;s pricing model is essentially saying &amp;ldquo;we believe most GPU work is bursty,&amp;rdquo; and for product workloads they&amp;rsquo;re probably right.&lt;/p&gt;
&lt;h2 id="quick-links"&gt;Quick Links
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://docs.runpod.io/pods/pricing" target="_blank" rel="noopener"
 &gt;RunPod Pods Pricing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://docs.runpod.io/serverless/pricing" target="_blank" rel="noopener"
 &gt;RunPod Serverless Pricing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://docs.runpod.io/accounts-billing/billing" target="_blank" rel="noopener"
 &gt;RunPod Billing Overview&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item></channel></rss>