Cloud Run scaling from zero is a feature until it isn't
Scale to zero is a good default for request-driven services, until startup delay, warm-capacity needs, or instance caps turn it into user-visible reliability behavior instead of a pricing feature.
The default
Scale to zero is a feature.
For the right workload, it’s one of the best parts of Cloud Run. Idle capacity disappears. Costs stay closer to actual use. Small internal systems can exist without carrying warm infrastructure all day just in case somebody clicks a button.
The mistake is treating that feature like a law of nature instead of a workload decision.
The real question isn’t just cold starts
People talk about scale to zero as if the whole issue were cold-start latency on the first request. That’s only part of it.
The more important question is whether the service can honestly tolerate request-driven wake-up behavior at all. Cloud Run can only scale from zero because a request arrives. If the workload needs to keep making progress while the system is otherwise idle, or if useful work depends on background activity that has no incoming request to wake it up, then the scaling model is already part of the architecture.
At that point, scale to zero isn’t just a pricing feature. It’s a runtime boundary.
Not every cheap default is a good fit
For plenty of services, scale to zero is exactly right.
Internal APIs. Admin surfaces. Event handlers. Lightweight automation entry points. Small tools with uneven usage. Those all tend to benefit from not paying for warm instances nobody needs most of the day.
But some workloads start asking for more than that model wants to give. They need predictable startup behavior. They need warm capacity because latency is visible. They need to do work even when nobody’s actively calling them. Or they need enough burst headroom that “we’ll wake up when traffic arrives” stops sounding like a serious plan.
That isn’t Cloud Run failing. It’s the workload telling you what shape it actually has.
Request-driven wake-up is part of the contract
Scale to zero works because Cloud Run is willing to let a service sit at nothing and wake it back up only when a request comes in.
It’s a very good trade when the request really is the unit of work. It’s a much worse trade when the request is only the trigger for work that needs to keep going after the caller disappears, or when the system needs standing capacity to feel responsive enough in practice.
That link to request timeouts is part of the same runtime question. Startup behavior, request lifetime, and retry semantics all belong to the workload shape. They aren’t separate tuning checklists.
Minimum instances are sometimes the honest answer
There is nothing morally superior about running at zero.
If the service can’t tolerate startup delay, or if it needs CPU even while it isn’t actively handling requests, then keeping at least one instance warm is usually the honest configuration. That isn’t waste. That’s just paying for the behavior the workload already requires.
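As a sketch of what that honest configuration looks like (service name and region are placeholders), warm capacity plus always-allocated CPU can be set with the gcloud CLI:

```shell
# Keep at least one instance warm so requests never wait on a cold start,
# and keep CPU allocated outside of request handling so background work
# can make progress while the service is otherwise idle.
gcloud run services update my-service \
  --region=europe-west1 \
  --min-instances=1 \
  --no-cpu-throttling
```

Note that `--no-cpu-throttling` changes the billing model for that instance: you pay for allocated CPU continuously, which is exactly the point. The workload already required that behavior; this just states it in the configuration.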
A lot of teams resist this longer than they should because “serverless” sounds cleaner when it implies nothing is running. Fine. The bill doesn’t care about branding. If the service wants warm capacity, pretending otherwise usually just pushes the cost into latency, retries, and user annoyance.
Maximum instances are not just a cost setting
Maximum instances are also not a harmless little cap you set and forget.
Once you cap how far the service can scale, you’re choosing what happens when traffic wants more than the service is allowed to provide. Cloud Run will queue pending requests for a short window. After that, they can fail. That isn’t an internal implementation detail. That’s the user-visible behavior you selected when you decided the service should stop scaling past a certain point.
So instance caps aren’t just about cost control or protecting a backing service. They’re also part of your queueing and rejection policy, whether you wrote that policy down anywhere or not.
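If you set a cap, set it as a deliberate policy rather than a leftover default. A minimal sketch (values illustrative, not a recommendation):

```shell
# Cap scaling to protect a backing service. Past this cap, Cloud Run
# queues pending requests briefly and can then reject the excess
# (HTTP 429), so the cap is part of the rejection policy, not just
# the cost ceiling. Concurrency and max instances together define
# the total request capacity you are willing to serve.
gcloud run services update my-service \
  --region=europe-west1 \
  --max-instances=10 \
  --concurrency=80
```

With these values the service tops out around 800 concurrent requests. Whatever arrives beyond that is the behavior you chose, so it belongs in the same document as your retry and timeout decisions.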
Idle time isn’t always really idle
Another thing that often gets overlooked: scale to zero is not the same as the platform instantly killing idle capacity.
Cloud Run may keep idle instances around for a while to soften cold starts. That helps, but it doesn’t change the underlying model. It just means the platform sometimes gives you a short grace period before dropping back down. Useful, yes. Something to build guarantees on, no.
If the workload needs warm capacity as part of normal behavior, set it explicitly. Relying on the platform maybe keeping an instance around for a bit is how people end up acting surprised by behavior the platform never promised them.
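If you deploy declaratively, the same decisions can live in the service manifest instead of CLI flags. A sketch using the Knative-style annotations Cloud Run accepts (name and values are placeholders):

```yaml
# Cloud Run service spec (Knative-style). The revision template
# annotations make scaling behavior an explicit, reviewable part
# of the deployment instead of a platform grace period you hope for.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: my-service
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/minScale: "1"       # explicit warm capacity
        autoscaling.knative.dev/maxScale: "10"      # explicit scaling ceiling
        run.googleapis.com/cpu-throttling: "false"  # CPU allocated while idle
```

The advantage of the declarative form is that the scaling contract shows up in code review, next to everything else about the service.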
This is still usually a very good trade
For a lot of SME systems, the trade is still excellent.
Pay-per-use scaling with optional warm instances covers a large share of real internal platform workloads without dragging in a heavier runtime too early. That's a big part of why Cloud Run is the default. The platform stays small until the workload proves it wants something more expensive, more opinionated, or more continuously alive.
That’s exactly how it should be.
When the workload stops matching the runtime
If the service now wants stable warm capacity, more involved topology, continuous background activity, or workload controls that no longer fit comfortably inside the Cloud Run model, the clean answer may be a different runtime shape altogether.
Kubernetes via GKE Autopilot starts to make sense here. Not because Cloud Run failed, but because the workload stopped matching the conditions that made Cloud Run such a good default in the first place.
The point
Scale to zero is an optimization, not a law of nature.
Use it when the workload can honestly tolerate request-driven wake-up, variable startup behavior, and the scaling limits you put around it. Stop pretending it’s free when the system has already told you it wants warm capacity or a different execution model.