Decision Library
Patterns
Reusable decision frameworks, implementation notes, and system breakdowns from production delivery work.
AlloyDB managed connection pooling: when we'd trust it over PgBouncer
AlloyDB managed pooling is attractive because it removes a moving part, but the useful decision is whether the managed path gives enough semantic confidence, observability, and migration predictability to replace PgBouncer.
Last updated: Apr 4, 2026
Cloud SQL to AlloyDB migration: what actually changes, what doesn't, and what we'd test first
A Cloud SQL to AlloyDB move is not a philosophical upgrade. It changes the operational boundary, and the useful work is re-proving the parts of the system that may no longer behave the same.
Last updated: Apr 4, 2026
Cloud SQL vs AlloyDB: the real difference is operational boundary, not benchmarks
The useful comparison between Cloud SQL and AlloyDB is not raw speed. It is how the operating boundary changes around scaling, pooling, failover, migration, and team burden.
Last updated: Apr 4, 2026
How we decide between Cloud SQL connectors, Auth Proxy, and private IP
Cloud SQL connectors, the Auth Proxy, and private IP are not interchangeable secure connection options. They change identity, routing, deployment shape, and how much network plumbing the team actually owns.
Last updated: Apr 4, 2026
How we diagnose and fix a "too many connections" incident for Cloud Run + Postgres
A "too many connections" incident is rarely a one-line fix. It usually exposes a bad contract between Cloud Run scaling, app pool behavior, and database capacity.
Last updated: Apr 4, 2026
IAM DB auth for Cloud SQL: when it simplifies security and when it complicates delivery
IAM DB auth can reduce password sprawl and make revocation cleaner, but it also turns database access into an identity operating model that depends on disciplined service-account boundaries.
Last updated: Apr 4, 2026
Managed connection pooling in Cloud SQL: when it helps and when it complicates things
Managed connection pooling in Cloud SQL can reduce bursty connection pressure, but it also changes session behavior and should be adopted like a runtime boundary, not like a harmless checkbox.
Last updated: Apr 4, 2026
Safe scaling defaults for Cloud Run + Postgres
Cloud Run autoscaling is not a database strategy. Safe defaults keep the application from scaling itself into a Postgres incident before the team understands the workload.
Last updated: Apr 4, 2026
Why Cloud Run + Postgres needs a connection budget
Cloud Run and Postgres get fragile when connection growth is left implicit. We treat connections as a finite runtime budget, not as plumbing the app can multiply without consequence.
Last updated: Apr 4, 2026
BI Engine: when it matters, when it's a trap
BI Engine can be useful, but only after you prove it is actually accelerating the workload you care about. Otherwise it turns into configuration thrashing around the wrong problem.
Last updated: Mar 29, 2026
BigQuery cost guardrails that won't break your teams
BigQuery cost control works when guardrails are designed around workload shape and blast radius, not around shaming whoever happened to run the last expensive query.
Last updated: Mar 29, 2026
Constraints without enforcement: still worth it?
Non-enforced constraints are useful when they tell the truth. They act as semantic contracts and optimizer hints, but they become actively dangerous the moment the warehouse is asked to trust a lie.
Last updated: Mar 29, 2026
On-demand vs slots: the SME decision boundary
For SMEs, the question is not which BigQuery pricing model is more sophisticated. The question is when workload classes have become distinct enough to deserve different compute lanes.
Last updated: Mar 29, 2026
Partitioning defaults for event tables that don't lie
Partitioning is not just a performance tweak. It is one of the cheapest ways to control scan blast radius, but only if the partition contract matches how the table is actually queried.
Last updated: Mar 29, 2026
Physical vs logical storage: a dataset classification rule for SMEs
Physical versus logical storage billing is not a warehouse philosophy debate. It is a dataset classification choice based on change rate, retention behavior, and how much storage churn the table creates.
Last updated: Mar 29, 2026
Precompute ladder: cache -> scheduled tables -> MVs -> extracts
Precompute is not primarily a feature choice. It is a freshness budget decision: use the cheapest mechanism that meets the reporting need, then stop paying live query cost out of habit.
Last updated: Mar 29, 2026
Reservations for workload isolation: the minimal setup
Reservation design for SMEs is usually not an enterprise org chart. It is a small blast-radius pattern that keeps BI, batch, and sandbox work from bullying each other.
Last updated: Mar 29, 2026
Streaming buffer is your hidden constraint
When BigQuery streaming pain shows up as a DML error, the real problem is usually workload shape. Streaming wants append-and-reconcile thinking, not row-by-row sync fantasies.
Last updated: Mar 29, 2026
Why your BI dashboards melt BigQuery
Dashboards do not passively read data. They generate repeated, variable workload, and that behavior is often the real source of BigQuery cost and latency pain.
Last updated: Mar 29, 2026
A dashboard is not an operating system
Dashboards are good at showing state. They are bad at routing action, assigning ownership, and closing operational loops once a metric requires intervention.
Last updated: Mar 26, 2026
How we decide which metrics deserve a dashboard and which deserve a workflow
Some metrics are for observation. Others need ownership, thresholds, timing, and structured action. We decide explicitly which system shape each metric actually deserves.
Last updated: Mar 26, 2026
Looker Studio blending limits expose your real data model problems
When a report starts depending on heroic Looker Studio blending, the issue is usually upstream structure, not dashboard craftsmanship.
Last updated: Mar 26, 2026
What makes a KPI trustworthy enough to automate around
A KPI is not ready to drive action just because it exists on a dashboard. It needs stable meaning, reliable updates, and failure behavior that will not create new chaos.
Last updated: Mar 26, 2026
When reporting logic belongs upstream instead of in the BI layer
If reporting logic affects business meaning, reuse, or trust, it usually belongs upstream where it can be reviewed, reused, and kept consistent across reports.
Last updated: Mar 26, 2026
Why freshness matters less than trust in most reporting systems
A slightly delayed metric that people trust is usually more valuable than a real-time metric nobody believes.
Last updated: Mar 26, 2026
Cloud Run request timeouts don't kill your code (so your architecture has to)
A Cloud Run request timeout ends the request, not necessarily the work. If the operation can outlive its caller, the system needs explicit job semantics instead of hope.
Last updated: Mar 25, 2026
Cloud Run scaling from zero is a feature until it isn't
Scale to zero is a good default for request-driven services, until startup delay, warm-capacity needs, or instance caps turn it into user-visible reliability behavior instead of a pricing feature.
Last updated: Mar 25, 2026
Direct VPC egress vs Serverless VPC Access for Cloud Run: our default
We default to Direct VPC egress for Cloud Run because it is the cleaner networking shape: fewer moving parts, no connector resource, and costs that scale with the service instead of beside it.
Last updated: Mar 25, 2026
GKE Autopilot as the escape hatch from Cloud Run
When Cloud Run stops fitting, the next move is usually GKE Autopilot: more Kubernetes-shaped control without immediately taking on the full burden of Standard clusters.
Last updated: Mar 25, 2026
"Internal-only" Cloud Run isn't just a checkbox
Making a Cloud Run service private is not one toggle. It is a decision about ingress, routing, caller path, and IAM working together as one access model.
Last updated: Mar 25, 2026
Why we default to Cloud Run for SME internal platforms
For SME internal platforms, Cloud Run is our default because it covers a large share of useful workload shapes without forcing teams to own cluster operations before they have earned that surface area.
Last updated: Mar 25, 2026
BigQuery cost spikes usually come from table shape, not queries
When BigQuery spend jumps, the cause is usually model shape, weak incremental design, or unnecessary reprocessing far more often than a single bad query.
Last updated: Mar 24, 2026
Dataform vs. script piles: how we keep transformations reviewable
We prefer a declarative transformation layer over ad hoc script piles once warehouse logic becomes shared, incremental, and worth reviewing as a system.
Last updated: Mar 24, 2026
How we decide whether a transformation belongs in SQLX, code, or orchestration
We keep transformations in SQLX by default, move to code when the logic truly stops being legible in SQL, and keep orchestration for sequencing rather than business meaning.
Last updated: Mar 24, 2026
How we prevent stale rows in incremental fact models
Incremental fact models stay trustworthy only when record identity, reprocessing rules, and cleanup boundaries are designed on purpose instead of patched after drift shows up.
Last updated: Mar 24, 2026
Incremental models are only safe when change detection is explicit
Incremental models are trustworthy only when they can deliberately identify which records need another pass after late or changed upstream data shows up.
Last updated: Mar 24, 2026
Reviewability is a data platform feature
Reviewability is not decoration for data work. It is part of whether a shared platform can change safely once more than one person has to reason about the same models and workflows.
Last updated: Mar 24, 2026
Unique keys are not optional in analytical incrementals
Incremental analytical models need an explicit notion of row identity. Without it, merges drift, updates go missing, and review of correctness turns into guesswork.
Last updated: Mar 24, 2026
What we keep out of orchestration in data platforms
We use orchestration to sequence work, not to become the real home of model semantics, cleanup logic, or hidden branching behavior in the data platform.
Last updated: Mar 24, 2026
When repeated Pulumi code earns abstraction and when it doesn't
We don't abstract repeated Pulumi code just because it shows up more than once. We do it when the shared shape is real, the behavior is stable enough to deserve a boundary, and the result is easier to read than the duplication it replaces.
Last updated: Mar 24, 2026
Why declarative data models scale better than script-driven pipelines
Declarative modeling scales better because it keeps business shape, dependencies, and reviewable intent visible as the platform and team both grow.
Last updated: Mar 24, 2026
Why we model around decision boundaries, not source cleanup
We shape analytical models around the business decision or entity they need to represent, not around the temporary cleanup steps needed to tame source data on the way in.
Last updated: Mar 24, 2026
How we decide between directory per environment and shared stacks in Pulumi
We do not force DRY across environments by default. We keep Pulumi environments separate until shared code, shared rules, and drift risk make consolidation cheaper than duplication.
Last updated: Mar 23, 2026
How we structure a directory per environment in Pulumi
When we keep Pulumi environments separate, we make the environment boundary obvious in the filesystem and keep shared logic outside it.
Last updated: Mar 23, 2026
What goes in Pulumi stack config and what doesn't
We use Pulumi stack config for environment-specific values, not as a hiding place for infrastructure logic.
Last updated: Mar 23, 2026
How we treat Terraform state in team environments
Terraform starts feeling fragile in teams when state is treated like a backend setting instead of a shared dependency for safe change.
Last updated: Mar 22, 2026
Why we usually choose Pulumi over Terraform
Pulumi is our default when infrastructure starts behaving like software. Existing Terraform estates can still be the better decision when the migration cost is higher than the operational gain.
Last updated: Mar 22, 2026