
Why we model around decision boundaries, not source cleanup

We shape analytical models around the business decision or entity they need to represent, not around the temporary cleanup steps needed to tame source data on the way in.

By Ivan Richter

Last updated: Mar 24, 2026


The default

If a model is meant to represent an order, a subscription state, a customer interaction, a claim decision, or any other analytical entity, its shape should be driven by that responsibility. We don’t want the semantic layer inheriting the accidental shape of raw source messiness any more than necessary.

That’s the broader posture behind reviewable transformations. We want named models that explain what they mean, not a sequence of cleanup steps that only happens to leave something useful behind.

Cleanup is real, but it isn’t the model

Source cleanup is real work.

Types need normalizing. Duplicates need handling. Bad records need quarantine paths. Nested payloads may need flattening before they become usable. We aren’t pretending any of that disappears just because the warehouse would look cleaner without it.
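Those cleanup steps can be sketched as one small staging-layer function. This is a minimal illustration, not a real pipeline: the field names (`event_id`, `amount`, `payload`, `customer_id`) are invented for the example.

```python
def stage_raw_orders(raw_rows):
    """Cleanup only: normalize types, drop duplicates, quarantine bad
    records, and flatten one nested payload field. No business decisions."""
    seen, clean, quarantined = set(), [], []
    for row in raw_rows:
        if row.get("event_id") in seen:
            continue  # duplicate source event: handle it here, once
        seen.add(row["event_id"])
        try:
            amount = float(row["amount"])  # normalize a string-typed amount
        except (KeyError, ValueError, TypeError):
            quarantined.append(row)  # bad record: park it, don't guess
            continue
        payload = row.get("payload") or {}
        clean.append({
            "event_id": row["event_id"],
            "amount": amount,
            "customer_id": payload.get("customer_id"),  # flattened field
        })
    return clean, quarantined
```

The point of keeping this in its own function (or its own staging model) is that nothing here claims to be the business entity; it only makes the source usable.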

What we resist is letting those steps define the final model.

A table called orders_cleaned_v7 may describe the history of the pipeline, but it doesn’t tell the reader what the data is for. It tells them the source was ugly and somebody kept patching it. That’s not a semantic layer. That’s a maintenance diary.

A useful model explains a business shape. Its grain says what one row represents. Its columns support that meaning. The cleanup work may still exist upstream, but it shouldn’t be the main thing the downstream table is organized around.

The model should describe the business shape

Once a table enters the analytical layer, we want it to answer a business question honestly.

What is this row? One order? One order line? One active subscription state? One claim outcome? One customer-day snapshot? Those are the questions that should decide the grain, not whichever transformation step was hardest to implement in staging.
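The grain question can be made executable: declare the key that is supposed to identify one row, then check that it actually does. A minimal sketch, with hypothetical column names:

```python
from collections import Counter

def check_grain(rows, key_columns):
    """Return the key tuples that appear more than once.
    An empty result means the declared grain actually holds."""
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)
    return [key for key, n in counts.items() if n > 1]
```

A model that claims "one row per order line" would run `check_grain(rows, ["order_id", "order_line"])` and fail loudly on duplicates, instead of letting downstream joins discover them.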

This matters because model shape doesn’t stay local. Downstream joins, metrics, filters, marts, and dashboards all inherit the boundary the model chose. If that boundary came from source cleanup residue instead of a real analytical entity, every later consumer has to keep repairing the same ambiguity.

That’s where a lot of downstream nonsense starts. The warehouse keeps returning answers, but half the effort goes into reconstructing what the table should’ve represented in the first place.

Cleanup, semantics, and workflow are different jobs

One of the easier ways to make a repo confusing is to blur cleanup work, model semantics, and orchestration into one mushy responsibility.

A transformation starts by normalizing source fields. Then it quietly decides business grain. Then some runtime flag changes what gets materialized. Then a scheduler branch decides which cleanup path is real. Now the system still runs, but nobody can point to one clear place and say, “this is where the model means what it means.”

Layer boundaries matter. Cleanup, model semantics, and workflow behavior are not the same job. Once they get blended together, every later change becomes harder to review because the actual boundary keeps moving.
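The separation can be shown in miniature. In this sketch (all names invented for illustration), each job lives in exactly one place, and the orchestration layer only sequences:

```python
def stage_orders(raw):
    """Cleanup only: dedupe by source id. No business decisions."""
    seen = set()
    out = []
    for r in raw:
        if r["id"] not in seen:
            seen.add(r["id"])
            out.append(r)
    return out

def model_orders(staged):
    """Semantics only: one row per order. This is where the model
    means what it means."""
    return [{"order_id": r["id"], "amount": r["amount"]} for r in staged]

def run_pipeline(raw):
    """Orchestration only: sequences the steps, decides nothing
    about grain or meaning."""
    return model_orders(stage_orders(raw))
```

If a reviewer asks "where is the business grain decided?", the answer is one function, not a trail through scheduler branches.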

Orchestration shouldn’t own business meaning

Workflow tools are useful for sequencing work. They’re bad homes for business semantics.

If the real definition of a model lives in scheduler branches, runtime parameters, or task-level special cases, the warehouse no longer has a clear semantic layer. It has a logic trail.

Orchestration boundaries matter here too. Orchestration can coordinate models. Once it starts carrying model meaning, it’s usually compensating for a boundary the SQL layer never defined cleanly.

Stable boundaries make change easier to handle

Analytical entities don’t stop changing just because we gave them a cleaner table name.

A subscription can be reactivated. An order can be refunded. A claim can be reopened. A customer state can be corrected after another upstream record arrives. Those changes are normal. The question is whether the model boundary is strong enough to absorb them without turning each case into special repair logic.
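One way to make a boundary absorb these changes is to define the model as a fold over the event history, so a reactivation or correction is just another event rather than a special repair path. A sketch, with hypothetical event fields:

```python
def current_subscription_state(events):
    """One row per subscription: replay the ordered event history,
    latest state wins. Reactivations, cancellations, and late
    corrections all flow through the same fold."""
    state = {}
    for ev in sorted(events, key=lambda e: e["at"]):
        state[ev["subscription_id"]] = ev["status"]
    return state
```

Because the model is defined by its grain (one row per subscription) rather than by a cleanup sequence, a newly arrived upstream record changes the output without changing the definition.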

Stale rows show the same boundary problem from another angle. A stable model boundary makes it easier to decide which upstream changes belong to the model and how those rows should be revisited over time.

If the boundary is weak, every later correction becomes an argument about whether the table was really meant to represent that case at all.

Good shape affects cost too

A semantically clean model isn’t just nicer for analysts. It’s usually cheaper to run.

When the table shape follows a real decision boundary, downstream work gets simpler. Fewer duplicated rows leak into marts. Fewer downstream queries need to reconstruct the intended entity again. Fewer rebuilds happen for problems that were really caused by model shape in the first place. That’s the same cost pattern behind cost spikes: bad shape creates repeat work everywhere.

What we do instead

We let cleanup support the model. We don’t let cleanup define it.

If a transformation is still mostly about taming a raw source, we name it accordingly and keep it in the preparation layer. If it’s meant to represent a business entity or decision, we model that entity directly and keep the grain honest.

That doesn’t mean the line is always easy. Sometimes the source is ugly enough that the cleanup work is half the effort. Fine. The point is still to keep the final model organized around what the row means, not around the pain it took to get there.

The point

We model around decision boundaries because cleanup artifacts make bad semantic layers.

The goal isn’t a prettier warehouse diagram. It’s a model shape that stays understandable when the system changes, the team changes, and the source keeps doing annoying source-system things like it was born to do.
