
Dataform vs. script piles: how we keep transformations reviewable

We prefer a declarative transformation layer over ad hoc script piles once warehouse logic becomes shared, incremental, and worth reviewing as a system.

By Ivan Richter

Last updated: Mar 24, 2026

5 min read


The rule

Once transformations start carrying shared business meaning, we want a real modeling layer, not a pile of scripts.

That usually means Dataform, SQLX, explicit dependencies, named models, and code only where code is actually doing something worth isolating. This isn’t an aesthetics argument against Python. It’s a maintainability decision. We want the part of the system that defines business logic to live in a place where another person can open it, review it, and still trust what it’ll do a month later.
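For readers who haven't worked in SQLX, here's a sketch of what "named model with explicit dependencies" looks like in practice. The table and column names are hypothetical; the shape is the point:

```sqlx
config {
  type: "table",
  description: "One row per order, deduplicated to the latest event."
}

SELECT *
FROM ${ref("raw_order_events")}
-- Grain stated in the model itself: one row per order_id, latest wins.
QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) = 1
```

The `ref()` call is what makes the dependency graph explicit: Dataform resolves it to the upstream model and orders the build accordingly, so a reviewer can see what feeds this table without opening a scheduler.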

Scripts feel cheap early because they usually are. One file extracts. Another mutates. Another handles a backfill. Somebody adds a special case, then another one, then one more because the source system did something annoying on a Friday night. That can work for a while. The problem starts when those transformations stop being one person’s local workflow and turn into platform behavior other people depend on.

Why script piles stop being cheap

The failure mode isn’t that scripts are bad. It’s that they spread behavior across too many layers too quickly.

A filter lives in one SQL file. A dedupe rule lives in a helper. A parameter in orchestration decides whether late rows get picked up. A cleanup step only runs in a separate script because someone once found an edge case and patched it in the fastest place available. Everything still “works,” but now the behavior of one table is no longer visible in one place.

That’s where review starts getting expensive.

A reviewer is no longer reading a model and its direct dependencies. They’re reconstructing behavior across scripts, scheduler inputs, temporary assumptions, and side effects. At that point, the repo may still look productive, but it’s already becoming dependent on memory.

That’s exactly why reviewability matters. Review isn’t a cleanliness preference. It’s part of how the system stays legible once multiple people are changing it.

Shorter code can still be worse

A wrapper, helper, or macro only helps if it makes the behavior easier to see. If it just hides a messy sequence of steps behind a cleaner entry point, the code got shorter and the system got harder to inspect. That isn’t an improvement. It’s a visibility trade where the wrong side won.

That’s the same judgment behind earned abstraction. We don’t compress logic just because it repeats. We compress it when the shared shape is real and the result is easier to reason about than the duplication it replaces.

The same standard applies here. A transformation layer should make the system easier to inspect, not more elegant from ten thousand feet.

Why a declarative layer helps

A declarative transformation layer helps because it keeps more of the important behavior attached to named models.

A model has a grain. It has inputs. It has a contract with downstream readers. Its dependencies are visible. Its assertions sit close to the thing they protect. A change can usually be reviewed by reading the SQLX, the model config, and the upstream models it depends on. You don’t have to replay a procedural workflow in your head just to answer “what builds this table” or “what changes if I touch this logic.”
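To make "assertions sit close to the thing they protect" concrete, a sketch of a model that declares its own contract in the config block (again with hypothetical names):

```sqlx
config {
  type: "table",
  description: "Daily revenue per customer.",
  assertions: {
    -- The grain and the non-null contract live next to the SQL they guard.
    uniqueKey: ["customer_id", "revenue_date"],
    nonNull: ["customer_id", "revenue_date"]
  }
}

SELECT
  customer_id,
  DATE(ordered_at) AS revenue_date,
  SUM(amount) AS revenue
FROM ${ref("orders")}
GROUP BY customer_id, revenue_date
```

A reviewer reading this file sees the grain, the inputs, and the checks in one place, which is most of what they need to evaluate a change.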

That’s the real advantage behind declarative models. The benefit isn’t that the syntax is cleaner. It’s that change stays legible as the system grows.

Dataform is useful here because it gives the repo a center of gravity. Models live where people expect them. Dependencies are explicit. Assertions and tests stay near the transformations they belong to. Incremental behavior is declared with the model instead of being passed in sideways through a script argument or scheduler flag.

Incrementals expose weak structure fast

The moment a model updates over time, the system needs to be explicit about row identity, merge behavior, refresh scope, and stale-row handling. Those aren’t implementation details. They’re part of the model’s correctness.

That’s why unique keys matter. If the model can’t state what makes a row the same row over time, the incremental path is already on weak ground.

It’s also why stale-row handling matters. Once records can change after first arrival, you need a real plan for how old results get corrected instead of hoping the next run somehow makes them true.

And if the model depends on late events, changed children, status corrections, or other forms of drift, the refresh path has to detect those changes on purpose. That’s why explicit change detection matters.
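Put together, an incremental model can state row identity, merge behavior, and change detection in one file. A minimal sketch, assuming an `updated_at` column that moves when a source row is corrected:

```sqlx
config {
  type: "incremental",
  -- Row identity: declaring uniqueKey makes the incremental run a merge,
  -- so a corrected status overwrites the stale row instead of duplicating it.
  uniqueKey: ["order_id"]
}

SELECT
  order_id,
  status,
  updated_at
FROM ${ref("raw_order_events")}
-- Change detection: on incremental runs, only pick up rows that moved
-- since the last build of this table.
${when(incremental(), `WHERE updated_at > (SELECT MAX(updated_at) FROM ${self()})`)}
```

None of this is passed in through a scheduler flag. The refresh scope and the stale-row plan are readable in the same review as the transformation itself.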

Scripts don’t remove any of these problems. They just distribute them across more files, more flags, and more room for quiet mistakes.

Boundaries matter more than language preference

We don’t default to SQLX because SQL is morally superior to code. We default to SQLX when the work is best expressed as a named transformation with visible grain and a readable dependency graph.

Once the logic stops being straightforward transformation logic, we make a boundary decision. Does this belong in the model, in a helper, or in orchestration? That is the practical question behind layer boundaries.

The point isn’t to ban code. The point is to keep behavior in the layer where reviewers can still see what matters without following a trail of indirection.

That’s the same instinct behind config boundaries. Different system, same discipline. Values should live in one place. Behavior in another. Workflow glue shouldn’t quietly become the real program.
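SQLX even gives you a small, visible place for that boundary inside a model. As a sketch, a `js` block keeps a named value next to the SQL that uses it, instead of threading it through orchestration (statuses here are hypothetical; shared helpers would live in `includes/` instead):

```sqlx
js {
  // A value with a name, defined once, next to the model that reads it.
  const ACTIVE_STATUSES = ["active", "trialing"];
}

SELECT
  customer_id,
  status
FROM ${ref("customers")}
WHERE status IN (${ACTIVE_STATUSES.map(s => `'${s}'`).join(", ")})
```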

What we’re optimizing for

A good transformation repo lets somebody open a model, understand what it does, inspect what it depends on, and review a change without needing a guided tour from the person who built it. A bad one turns every change into a scavenger hunt through scripts, scheduler arguments, temporary tables, and tribal memory.

That’s why we prefer Dataform over script piles once the warehouse starts to matter. Not because scripts never work. They do. But they stop being cheap at exactly the point where the platform needs more predictability, not less.

The point

We keep transformations reviewable by giving them a declarative home, clear boundaries, and visible contracts.

If the work is a quick one-off, a script may be fine. If it’s turning into platform behavior, it deserves named models, explicit dependencies, and a structure other people can safely change.
