
How we diagnose and fix a "too many connections" incident for Cloud Run + Postgres

A "too many connections" incident is rarely a one-line fix. It usually exposes a bad contract between Cloud Run scaling, app pool behavior, and database capacity.

By Ivan Richter

Last updated: Apr 4, 2026


A “too many connections” incident looks simple right up until someone tries to fix it. Postgres refuses new sessions, the application starts throwing connection errors, and the room immediately fills with teams wanting one number to blame. Sometimes there is one bad number. More often the error is just the visible edge of a broader contract failure between Cloud Run scale behavior, per-instance pool claims, transaction lifetime, and database capacity. By the time Postgres starts refusing sessions, the service has usually been negotiating with the database dishonestly for a while.

We do not start with “what is max_connections” and pretend the incident lives there. The ceiling matters, but it usually is not the part that broke first. The same alert can come from a leak, an oversized pool, a rollout that widened too quickly, long transactions that pinned backends, or slow queries that turned each session into a longer-lived claim. Guessing too early usually buys time in the wrong place and destroys the evidence that would have made the real cause obvious.

This page is for the point where a team needs order, not philosophy. The broader rule still lives in connection budgets. The job here is triage: stop the blast radius, keep enough evidence to classify the failure, and separate the moves that buy time from the work that changes the contract that allowed the incident in the first place.

The first job is to stop making the situation worse

The first few minutes are not the time to aim for elegant tuning. If Cloud Run is still widening the fleet and every new instance arrives with its own expectation of database access, then the service is adding new pressure faster than anyone can understand the old pressure. The first useful move is often to reduce how much new demand the runtime is allowed to create.

Usually that means capping max scale, sometimes lowering concurrency, and sometimes pausing the loudest caller entirely. Backfills, admin jobs, side workers, retry storms, or a freshly deployed revision can all widen the pressure field without adding much useful work. The goal is not to land on the perfect future config while the incident is live. The goal is to stop negotiating for more database capacity while Postgres is already refusing the terms.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: api
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/maxScale: '3'
    spec:
      containerConcurrency: 10
      timeoutSeconds: 30

At the same time, connection acquisition needs to become honest. If the application is willing to wait forever for a session, the service starts building a queue while pretending it is still making progress. Shorter acquire waits usually make the failure mode clearer and smaller.

DB_POOL_MAX=4
DB_POOL_ACQUIRE_TIMEOUT_MS=1000
DB_POOL_IDLE_TIMEOUT_MS=30000

first 15 minutes
- cap max instances if the fleet is still widening
- shorten connection-acquire waits if requests are piling up
- pause non-essential workers, backfills, or admin jobs
- preserve evidence before restarting everything
- confirm whether pressure is still rising or merely stuck

The distinction matters. An incident that is still widening needs containment. An incident that has stopped widening but is still sick needs classification.

Preserve evidence before the graph gets pretty again

The most common self-inflicted mistake is tidying up too early. Somebody restarts the service. Somebody bounces the proxy or pooler. Somebody kills a pile of sessions. The graph improves, everyone exhales, and the best evidence disappears. Sometimes those interventions are necessary, but they should not happen before anyone captures enough of the shape to know what they are looking at.

We want a few plain facts while the system is still telling the truth. How many Cloud Run instances were live? Did a rollout just happen? Did a job or retry wave start shortly before the error? Was the session count still rising or had it flattened while waits got worse? Were sessions mostly active, mostly idle, mostly blocked, or sitting idle in transaction? Was one service suddenly much wider than usual?

A short timestamped note during the incident is often more valuable than a polished reconstruction later. These incidents blur fast. Most of the room remembers the error string and forgets the pressure shape that produced it.

The app layer usually explains whether the service is multiplying, hoarding, or waiting

We start on the application side because Cloud Run can create a lot of database pressure without looking especially dramatic from the service itself. The first thing to inspect is scale shape. Did instance count jump because of traffic, a bad rollout, aggressive retries, or some combination of all three? If it did, compare that with the per-instance pool claim. A pool that looks moderate in a config file can become an ugly fleet-wide number very quickly.

Then we look at how the application manages sessions. How many connections can one instance open? Does the runtime create one pool per container or one pool per worker process? Does the application block for too long while trying to acquire a session, hiding pressure behind latency? Are retries creating more demand precisely when the database is already under strain?

Request shape matters just as much. We want to know whether a request is holding a database session while doing work that has no business sitting inside the database window. Slow downstream calls, filesystem work, large in-memory transforms, or response shaping done while the transaction is still open can all turn one borrowed session into a much longer claim than the service budget assumed. If requests can outlive their callers or continue working after the client has already given up, the connection incident may be sitting right next to the runtime problem described in Cloud Run request timeouts.

A recurring failure mode is a process model nobody priced correctly. The config says pool max five, which sounds restrained, but the container is running several worker processes and each worker owns its own pool. The real per-instance claim is not five. It is five multiplied by however many workers were quietly launched.

container
- worker process A -> pool max 5
- worker process B -> pool max 5
- worker process C -> pool max 5
real per-instance claim: 15

The calmer pattern is being able to answer one unglamorous question without hesitation: when one instance is fully awake, how many database sessions can it actually claim?
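That answer is simple arithmetic, and worth writing down before the incident rather than during it. The numbers below are hypothetical stand-ins for a service's real worker count, pool size, and maxScale:

```python
# Back-of-envelope fleet-wide session claim. All numbers are hypothetical
# stand-ins; substitute the service's real values.
worker_processes = 3      # e.g. worker processes launched per container
pool_max_per_worker = 5   # pool max owned by each worker process
max_instances = 20        # Cloud Run maxScale

per_instance_claim = worker_processes * pool_max_per_worker
fleet_claim = per_instance_claim * max_instances

print(per_instance_claim)  # 15, not the 5 the config file suggests
print(fleet_claim)         # 300 sessions the fleet can demand at full width
```

If `fleet_claim` exceeds the database's usable connection budget, the incident is not a surprise; it is a scheduled event waiting for enough traffic.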

Postgres tells you whether the pressure is broad, sticky, or fake

Once the application shape is clear enough, we go to Postgres and stop arguing from hunches. We want to know whether the sessions are leaking, active, blocked, idle, idle in transaction, or simply too numerous for the contract the service is trying to impose. That usually starts with a classification query against pg_stat_activity, not because it solves the incident, but because it stops teams from treating all sessions as equivalent.

select
  application_name,
  state,
  wait_event_type,
  wait_event,
  count(*) as sessions,
  max(now() - xact_start) as oldest_xact_age
from
  pg_stat_activity
where
  datname = current_database()
group by
  application_name,
  state,
  wait_event_type,
  wait_event
order by sessions desc;

If the broad view suggests a few older or stranger sessions are holding things up, then we zoom in on age, waits, and the actual queries.

select
  pid,
  application_name,
  state,
  now() - xact_start as xact_age,
  now() - query_start as query_age,
  wait_event_type,
  wait_event,
  query
from
  pg_stat_activity
where
  datname = current_database()
  and xact_start is not null
order by xact_age desc
limit 20;

Those queries are there to classify the failure, not decorate the postmortem. We want to know whether the pressure is a fleet-wide multiplication problem, a leak, a small number of long transactions, a lock problem, or slow work that is making every session live too long.

One common pattern is that Cloud Run widened faster than the database contract allowed

The familiar serverless version of the incident starts when traffic spikes, a new revision comes online, or retries start stacking. Cloud Run does what it was told to do and adds instances. Each instance either opens or reserves its expected pool. The database sees a sharp increase in demand long before CPU graphs on the application side look dramatic enough to frighten anyone.

The signature is usually a matching rise in instance count and session count without query slowness being the original trigger. Containment is mostly about reducing width. Durable repair is usually a better scale and pool contract, which is why this page naturally points back to safe scaling defaults. The database did not suddenly become fragile. The service widened beyond the contract the database could tolerate.

Another common pattern is that one instance was already too greedy

Sometimes the fleet is not especially wide. The problem is that each instance believes it deserves too much of the database. A pool of fifteen or twenty may have survived early testing because nothing else was competing for sessions yet. Framework defaults are especially good at leaving this kind of trap behind because they sound reasonable in isolation and ridiculous only after multiplication.

It usually appears as a modest instance count paired with a session count that makes no sense once the fleet size is taken into account. The repair is rarely glamorous. Shrink the pool. Shorten acquire waits. Stop assuming every request path deserves immediate database access.

- DB_POOL_MAX=15
- DB_POOL_ACQUIRE_TIMEOUT_MS=10000
+ DB_POOL_MAX=4
+ DB_POOL_ACQUIRE_TIMEOUT_MS=1000

Those are not universal settings. They show the shape of the move: smaller claims, shorter waits, less fantasy.

Sometimes the incident is a leak, not load

A real leak has a different feel. Session count keeps drifting upward without a matching increase in useful work. Restarts appear to fix it, which makes the system look random until someone notices that the restart is only tearing down leaked state and buying another quiet period before the same bug returns.

The root usually lives in the application contract, not in Postgres. Transaction wrappers, early returns, error paths, retries, or library misuse can all keep sessions from being returned properly. The database sees the result. The bug itself is usually in how the application borrows and releases a connection under non-happy paths. If the pool exposes checked-out counts, wait time, or stuck-borrow metrics, that is usually where the trail becomes clearer.
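The structural fix for most of those paths is to make release unconditional. A minimal sketch of a leak-safe borrow, assuming a pool that exposes getconn()/putconn() (psycopg2's SimpleConnectionPool uses those names); everything else is a stand-in:

```python
from contextlib import contextmanager

@contextmanager
def borrowed(pool):
    # Borrow a session and guarantee it is returned on every path:
    # normal return, early return, or exception. Assumes a pool that
    # exposes getconn()/putconn(conn), as psycopg2's
    # SimpleConnectionPool does.
    conn = pool.getconn()
    try:
        yield conn
    finally:
        pool.putconn(conn)
```

Every early return and raise inside a `with borrowed(pool) as conn:` block now travels through the finally, so non-happy paths can no longer strand a session. Pairing this with a checked-out-count metric makes any remaining leak visible instead of mysterious.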

Long transactions can make a smaller number of sessions feel like a connection incident

A system can look like it is short on connections when the deeper issue is that a smaller set of sessions is holding important resources for too long. Long transactions, especially idle-in-transaction sessions, make the database slower at finishing work. The rest of the application then backs up behind a smaller set of stuck or slow claims, and the symptom widens into broad connection pressure.

It tends to show up as old transaction ages, waiting sessions, lock contention, or a cluster of clients that are technically present but not meaningfully moving. The repair is not more connections. It is tighter transaction scope, less non-database work inside transaction windows, and a firmer distinction between “holding a session” and “doing work in general.”
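The shape of that repair can be sketched in a few lines. `Conn`, `fetch_rows`, and `render_response` are hypothetical stand-ins (psycopg 3 connections expose a `transaction()` context manager with this shape); the point is only where the window closes:

```python
from contextlib import contextmanager

class Conn:
    # Hypothetical stand-in for a driver connection; a real driver
    # would begin/commit around the yield.
    @contextmanager
    def transaction(self):
        yield

def fetch_rows(conn):
    # Hypothetical query helper: the only work that needs the session.
    return [{"id": 1}, {"id": 2}]

def render_response(rows):
    # Hypothetical response shaping: pure CPU work, no session required.
    return {"count": len(rows)}

def handler(conn):
    # The transaction window covers only database work. Response
    # shaping happens after the session's claim has been released.
    with conn.transaction():
        rows = fetch_rows(conn)
    return render_response(rows)
```

The anti-pattern is the same code with `render_response` moved inside the `with` block: identical output, but every request now holds a session for its full lifetime.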

Slow queries often arrive wearing a connection error

Slow queries do not have to trigger “too many connections” directly to cause the incident. They can get there by making every session live longer. Requests that would normally borrow and release quickly now sit active or waiting for longer. Pool checkout slows down. Request lifetime widens. Cloud Run sees slower completion and gets more chances to widen the fleet. By the time the alert lands, the incident looks like a connection problem even though the first break was query latency or a lock pattern.

Counts are never enough. Query age and wait behavior matter. If the root cause is slow work, then increasing ceilings or resizing pools without fixing the query path just changes the costume the next incident arrives in.
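The lengthening is quantifiable. By Little's law, concurrent sessions are roughly the request arrival rate times how long each request holds a session; the numbers below are hypothetical:

```python
# Little's law applied to session pressure: concurrent sessions are
# roughly (request arrival rate) x (time each request holds a session).
# All numbers here are hypothetical.
requests_per_second = 200
hold_seconds_fast = 0.05   # a healthy borrow-and-release path, 50 ms
hold_seconds_slow = 0.60   # the same path after a query regression

sessions_fast = requests_per_second * hold_seconds_fast  # ~10 concurrent
sessions_slow = requests_per_second * hold_seconds_slow  # ~120 concurrent
print(sessions_fast, sessions_slow)
```

Nothing about traffic changed; only the hold time did, and session demand multiplied by twelve. That is how a latency regression arrives wearing a connection error.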

Containment and repair are different jobs

A live incident tempts teams to blur these together. They overlap, but they are not the same work. Containment is about buying time and restoring service. Repair is about changing the contract that made the incident possible.

Containment might mean capping instance growth, pausing non-essential work, shortening acquire waits, killing obviously stuck sessions, or in some cases temporarily raising a database ceiling because it is the least bad move available. Those are tactical actions. They can be correct and still not tell you much about what should remain after the incident.

Repair is slower and usually less flattering. Smaller pools. Clearer scale limits. Tighter transaction scope. Better query behavior. Separation between request work and asynchronous work. Sometimes a pooler. Sometimes a different worker shape. The durable fix is the thing that changes why the system could use the database as a queue in the first place.

containment                    durable repair
cap instance growth            rewrite pool and scale contract
pause backfills/workers        separate noisy workloads
kill stuck sessions            fix leak or long-transaction pattern
shorten acquire waits          add honest backpressure permanently

We do not call it solved because a restart brought the graph down

These incidents get misclassified constantly because restarting the service often appears to solve it. It may even restore users quickly. The contract still is not fixed. We want evidence that the pressure shape is now understood and bounded.

We want stable session counts under representative load. We want short and observable acquire waits. We want confirmation that long transactions or bad query paths have been removed rather than hidden. We want to see that rollouts or traffic bursts no longer widen the fleet into the same database collapse.

If the repair involved smaller pools or tighter max scale, we also want to know how the application now fails under pressure. Does it queue briefly? Does it fail fast and honestly? Good. That’s usually healthier than pretending to make progress while the database is already drowning. If the pain merely moved into another opaque corner of the system, then the incident was not solved. It was relocated.

evidence before we close it
- session counts stay inside the intended budget
- acquire waits are short and observable
- no old long-lived transactions remain in normal operation
- traffic bursts no longer widen the fleet into DB collapse
- request failure mode is understandable under pressure

The incident should leave one cleaner boundary behind

Every real connection incident should produce one structural improvement, not just a calmer graph. Maybe the service finally gets an explicit connection budget. Maybe max scale stops being an unexamined default. Maybe worker services stop inheriting API pool assumptions. Maybe transaction scope gets cut down. Maybe the runbook now includes the exact pg_stat_activity queries the team actually reached for instead of the vague promise that somebody knows where to look.

If nothing structural changed, then the service mostly survived through team judgment and timing. That is not repair. That is an intermission.

A “too many connections” incident is rarely asking for one bigger number. It is usually exposing a bad contract between Cloud Run scale, pool behavior, query lifetime, and database capacity. Stop the blast radius first. Keep enough evidence to classify the failure honestly. Then fix the service shape that turned Postgres into the queue.
