
Dataform vs. script piles: how we keep transformations reviewable

We prefer a declarative transformation layer over ad hoc script piles once warehouse logic becomes shared, incremental, and worth reviewing as a system.

By Ivan Richter

Last updated: Mar 24, 2026

5 min read


The rule

Once transformations start carrying shared business meaning, we want a real modeling layer, not a pile of scripts.

That usually means Dataform, SQLX, explicit dependencies, named models, and code only where code is actually doing something worth isolating. This isn’t an aesthetics argument against Python. It’s a maintainability decision. We want the part of the system that defines business logic to live in a place where another person can open it, review it, and still trust what it’ll do a month later.
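For readers who haven't worked in SQLX, here's a sketch of what "named model with explicit dependencies" looks like in practice. The table and column names are hypothetical; the shape is the point:

```sqlx
config {
  type: "table",
  description: "One row per order, deduplicated to the latest event."
}

SELECT *
FROM ${ref("raw_order_events")}
-- Grain stated in the model itself: one row per order_id, latest wins.
QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) = 1
```

The `ref()` call is what makes the dependency graph explicit: Dataform resolves it to the upstream model and orders the build accordingly, so a reviewer can see what feeds this table without opening a scheduler.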

Scripts feel cheap early because they usually are. One file extracts. Another mutates. Another handles a backfill. Somebody adds a special case, then another one, then one more because the source system did something annoying on a Friday night. That can work for a while. The problem starts when those transformations stop being one person’s local workflow and turn into platform behavior other people depend on.

Why script piles stop being cheap

The failure mode isn’t that scripts are bad. It’s that they spread behavior across too many layers too quickly.

A filter lives in one SQL file. A dedupe rule lives in a helper. A parameter in orchestration decides whether late rows get picked up. A cleanup step only runs in a separate script because someone once found an edge case and patched it in the fastest place available. Everything still “works,” but now the behavior of one table is no longer visible in one place.

That’s where review starts getting expensive.

A reviewer is no longer reading a model and its direct dependencies. They’re reconstructing behavior across scripts, scheduler inputs, temporary assumptions, and side effects. At that point, the repo may still look productive, but it’s already becoming dependent on memory.

That’s exactly why reviewability matters. Review isn’t a cleanliness preference. It’s part of how the system stays legible once multiple people are changing it.

Shorter code can still be worse

A wrapper, helper, or macro only helps if it makes the behavior easier to see. If it just hides a messy sequence of steps behind a cleaner entry point, the code got shorter and the system got harder to inspect. That isn’t an improvement. It’s a visibility trade where the wrong side won.

That’s the same judgment behind earned abstraction. We don’t compress logic just because it repeats. We compress it when the shared shape is real and the result is easier to reason about than the duplication it replaces.

The same standard applies here. A transformation layer should make the system easier to inspect, not more elegant from ten thousand feet.

Why a declarative layer helps

A declarative transformation layer helps because it keeps more of the important behavior attached to named models.

A model has a grain. It has inputs. It has a contract with downstream readers. Its dependencies are visible. Its assertions sit close to the thing they protect. A change can usually be reviewed by reading the SQLX, the model config, and the upstream models it depends on. You don’t have to replay a procedural workflow in your head just to answer “what builds this table” or “what changes if I touch this logic.”
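To make "assertions sit close to the thing they protect" concrete, a sketch of a model that declares its own contract in the config block (again with hypothetical names):

```sqlx
config {
  type: "table",
  description: "Daily revenue per customer.",
  assertions: {
    -- The grain and the non-null contract live next to the SQL they guard.
    uniqueKey: ["customer_id", "revenue_date"],
    nonNull: ["customer_id", "revenue_date"]
  }
}

SELECT
  customer_id,
  DATE(ordered_at) AS revenue_date,
  SUM(amount) AS revenue
FROM ${ref("orders")}
GROUP BY customer_id, revenue_date
```

A reviewer reading this file sees the grain, the inputs, and the checks in one place, which is most of what they need to evaluate a change.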

That’s the real advantage behind declarative models. The benefit isn’t that the syntax is cleaner. It’s that change stays legible as the system grows.

Dataform is useful here because it gives the repo a center of gravity. Models live where people expect them. Dependencies are explicit. Assertions and tests stay near the transformations they belong to. Incremental behavior is declared with the model instead of being passed in sideways through a script argument or scheduler flag.

Incrementals expose weak structure fast

The moment a model updates over time, the system needs to be explicit about row identity, merge behavior, refresh scope, and stale-row handling. Those aren’t implementation details. They’re part of the model’s correctness.

That’s why unique keys matter. If the model can’t state what makes a row the same row over time, the incremental path is already on weak ground.

It’s also why stale-row handling matters. Once records can change after first arrival, you need a real plan for how old results get corrected instead of hoping the next run somehow makes them true.

And if the model depends on late events, changed children, status corrections, or other forms of drift, the refresh path has to detect those changes on purpose. That’s why explicit change detection matters.
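Put together, an incremental model can state row identity, merge behavior, and change detection in one file. A minimal sketch, assuming an `updated_at` column that moves when a source row is corrected:

```sqlx
config {
  type: "incremental",
  -- Row identity: declaring uniqueKey makes the incremental run a merge,
  -- so a corrected status overwrites the stale row instead of duplicating it.
  uniqueKey: ["order_id"]
}

SELECT
  order_id,
  status,
  updated_at
FROM ${ref("raw_order_events")}
-- Change detection: on incremental runs, only pick up rows that moved
-- since the last build of this table.
${when(incremental(), `WHERE updated_at > (SELECT MAX(updated_at) FROM ${self()})`)}
```

None of this is passed in through a scheduler flag. The refresh scope and the stale-row plan are readable in the same review as the transformation itself.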

Scripts don’t remove any of these problems. They just distribute them across more files, more flags, and more room for quiet mistakes.

Boundaries matter more than language preference

We don’t default to SQLX because SQL is morally superior to code. We default to SQLX when the work is best expressed as a named transformation with visible grain and a readable dependency graph.

Once the logic stops being straightforward transformation logic, we make a boundary decision. Does this belong in the model, in a helper, or in orchestration? That is the practical question behind layer boundaries.

The point isn’t to ban code. The point is to keep behavior in the layer where reviewers can still see what matters without following a trail of indirection.

That’s the same instinct behind config boundaries. Different system, same discipline. Values should live in one place. Behavior in another. Workflow glue shouldn’t quietly become the real program.
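SQLX even gives you a small, visible place for that boundary inside a model. As a sketch, a `js` block keeps a named value next to the SQL that uses it, instead of threading it through orchestration (statuses here are hypothetical; shared helpers would live in `includes/` instead):

```sqlx
js {
  // A value with a name, defined once, next to the model that reads it.
  const ACTIVE_STATUSES = ["active", "trialing"];
}

SELECT
  customer_id,
  status
FROM ${ref("customers")}
WHERE status IN (${ACTIVE_STATUSES.map(s => `'${s}'`).join(", ")})
```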

What we’re optimizing for

A good transformation repo lets somebody open a model, understand what it does, inspect what it depends on, and review a change without needing a guided tour from the person who built it. A bad one turns every change into a scavenger hunt through scripts, scheduler arguments, temporary tables, and tribal memory.

That’s why we prefer Dataform over script piles once the warehouse starts to matter. Not because scripts never work. They do. But they stop being cheap at exactly the point where the platform needs more predictability, not less.

The point

We keep transformations reviewable by giving them a declarative home, clear boundaries, and visible contracts.

If the work is a quick one-off, a script may be fine. If it’s turning into platform behavior, it deserves named models, explicit dependencies, and a structure other people can safely change.
