What we keep out of orchestration in data platforms
We use orchestration to sequence work, not to become the real home of model semantics, cleanup logic, or hidden branching behavior in the data platform.
The rule
We keep orchestration thin.
Workflow tools are for sequencing, triggering, retries, and operational control. They are not where we want the meaning of a transformation to live. If the orchestration layer becomes the place where business rules, cleanup semantics, and model shape actually get decided, the platform gets harder to review, harder to trust, and uglier to operate.
The more explicit version of that boundary lives in layer boundaries. This page is the operating rule that falls out of it.
Business logic should not live in workflow glue
A scheduler should not be the place where a table means one thing on Monday and something slightly different on Friday.
If business logic depends on task arguments, branch conditions, or run-specific flags in orchestration, the warehouse no longer has a clear semantic layer. At that point, the workflow is no longer coordinating the system. It is quietly defining it.
That’s usually a sign the model never got a clean boundary in the first place. It’s the same problem behind decision boundaries. The model should own the meaning. The workflow should own the order.
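A minimal sketch of that split, using plain Python rather than any real scheduler API (all names here are illustrative): in the anti-pattern, a run-specific flag from the workflow decides what the table means; in the fix, the model owns one fixed definition and the workflow only sequences named steps.

```python
# Anti-pattern: a workflow flag decides what "revenue" means on this run.
def build_revenue_bad(rows, include_refunds):
    total = sum(r["amount"] for r in rows)
    if include_refunds:                         # Monday vs Friday behavior
        total -= sum(r.get("refund", 0) for r in rows)
    return total

# Better: the model owns one fixed meaning; no run argument can change it.
def build_net_revenue(rows):
    """Net revenue: amounts minus refunds. The definition never varies by run."""
    return sum(r["amount"] - r.get("refund", 0) for r in rows)

def run_pipeline(rows, steps):
    """The workflow's only job: run named steps in order."""
    return {step.__name__: step(rows) for step in steps}

rows = [{"amount": 100, "refund": 10}, {"amount": 50}]
print(run_pipeline(rows, [build_net_revenue]))  # {'build_net_revenue': 140}
```

The point is not the arithmetic; it is that reading `build_net_revenue` tells you everything the table means, with no orchestration context required.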
Cleanup logic should not sprawl into workflows
One of the fastest ways to ruin orchestration is to start patching model problems there.
A stale-row issue appears, so a branch gets added for backfills. A table needs selective cleanup, so a pre-step starts deleting partitions. A late-arriving edge case shows up once, and now there’s a workflow path that exists forever because nobody wants to be the one who removes it and finds out the hard way.
Sometimes those patches are necessary. But once the workflow becomes the main place where cleanup behavior lives, the system starts getting harder to reason about. You can’t understand the model by reading the model anymore. You have to inspect the operational glue around it and hope nothing important is hiding there.
That’s usually a sign the cleanup boundary belongs closer to the model than the workflow.
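One way to move that boundary, sketched against an in-memory stand-in for a warehouse (the names and the dict-based storage are illustrative, not a real API): if each run replaces its own partition idempotently, a backfill is just a rerun, and no workflow pre-step has to delete stale rows first.

```python
warehouse = {}  # table -> {partition: rows}; a toy stand-in for real storage

def load_partition_idempotent(table, partition, rows):
    """Cleanup lives with the model: a run overwrites its own partition,
    so reruns and backfills need no special workflow branch."""
    warehouse.setdefault(table, {})[partition] = list(rows)

load_partition_idempotent("orders", "2024-06-01", [{"id": 1}])
# Late data arrives; the rerun replaces the partition rather than duplicating it.
load_partition_idempotent("orders", "2024-06-01", [{"id": 1}, {"id": 2}])
print(warehouse["orders"]["2024-06-01"])  # [{'id': 1}, {'id': 2}]
```

Real warehouses express the same idea as `MERGE` statements or partition-overwrite writes; the structural point is that the cleanup rule is part of the load, not a sibling task in the DAG.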
Thin orchestration is easier to inspect
Orchestration gets ugly for the same reason piles of scripts do. Logic leaks into the layer that was supposed to stay simple.
A task graph should be understandable at a glance. Which models run? What depends on what? Where does a failure retry? Which path is manual versus automatic? Those are good orchestration questions. Hidden transformation rules, cleanup semantics, and run-specific business logic are not.
That’s why thin workflows pair naturally with reviewable transformations. The more model behavior stays in named transformations, the less the workflow layer has to compensate for logic it should never have owned.
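When the graph really is thin, "what depends on what" is answerable mechanically. A sketch with the standard library's `graphlib` and made-up model names: the graph holds only edges, and a valid run order falls out of it.

```python
from graphlib import TopologicalSorter

# Illustrative dependency graph: edges answer "what depends on what?"
# and carry no transformation logic of their own.
deps = {
    "stg_orders": set(),
    "stg_customers": set(),
    "fct_orders": {"stg_orders", "stg_customers"},
    "rpt_revenue": {"fct_orders"},
}

order = list(TopologicalSorter(deps).static_order())
print(order)  # every model appears after everything it depends on
```

A graph you can topologically sort from a plain data structure is also a graph a reviewer can hold in their head.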
Hidden branching makes review worse
The orchestration layer becomes dangerous when important behavior hides behind conditionals people stop noticing.
A task name looks harmless, but it runs a different path based on some flag. A backfill mode changes how a model is built. A retry path quietly replays work with different assumptions. A manual run behaves differently from the scheduled one in ways nobody can see from the graph itself.
That’s why reviewability matters here too. If the operational layer can’t be reviewed clearly, the team starts borrowing confidence from memory instead of structure.
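A small sketch of the repair, with illustrative names: instead of one task whose behavior forks on a hidden flag, give each execution path its own named entry point and keep the semantics in one shared function, so any divergence between paths is visible in the graph itself.

```python
def valid_rows(rows):
    """The model's meaning lives here, once."""
    return [r for r in rows if r.get("valid", False)]

# Anti-pattern: one task name, two behaviors behind a flag nobody sees.
def run_model(rows, backfill=False):
    if backfill:
        return list(rows)       # backfill quietly skips validation
    return valid_rows(rows)

# Better: each path is its own reviewable node, sharing one set of semantics.
def run_scheduled(rows):
    return valid_rows(rows)

def run_backfill(rows):
    return valid_rows(rows)     # explicitly identical; a future difference
                                # would have to be written where reviewers look

rows = [{"valid": True}, {"valid": False}]
print(run_scheduled(rows))  # [{'valid': True}]
```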
Operational clarity is part of the platform
Readable orchestration isn’t just an operator convenience. It’s part of whether the platform scales without becoming dependent on a few people who know where the weird parts are.
When incidents happen, people need to be able to tell quickly whether the problem is a failed dependency, a bad model change, a stale input, or a workflow path that retried the wrong thing. If that takes too long, the platform starts accumulating operational superstition. People stop trusting what ran, what will rerun, and what side effects are attached to each path.
That is not a tooling issue. It’s a structure issue.
It’s also the same family of rule as Pulumi config boundaries. Configuration and workflow layers should support the system, not quietly become the place where the real behavior hides.
What orchestration should own
We want orchestration to own the things orchestration is actually good at.
Scheduling. Dependency execution. Retry policy. Triggering. Parallelism. Failure handling. Manual versus automatic entry points. Operational controls that help the system run predictably.
Those are all useful concerns, and they’re enough. Once a workflow layer starts taking on semantic decisions about what a model means or how it should correct itself, it begins to compete with the modeling layer instead of supporting it.
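Retry policy is a good example of a concern that belongs entirely to this layer. A minimal sketch (the wrapper and its parameters are illustrative): the retry logic wraps the task without changing what the task means.

```python
import time

def with_retries(task, attempts=3, base_delay=0.01):
    """Orchestration-level concern: apply a retry policy around a task
    without touching the task's semantics."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff

calls = {"n": 0}
def flaky():
    """Fails twice with a transient error, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(with_retries(flaky))  # 'ok' after two transient failures
```

Note what the wrapper never does: inspect the task, branch on its output, or alter its inputs between attempts. That is the boundary staying intact.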
That’s usually where the mess starts.
The point
We keep orchestration out of model semantics because workflow glue is a bad semantic layer.
When orchestration stays thin, models stay easier to understand, incidents stay easier to trace, and the platform is cheaper to change.