
Why we model around decision boundaries, not source cleanup

We shape analytical models around the business decision or entity they need to represent, not around the temporary cleanup steps needed to tame source data on the way in.

By Ivan Richter

Last updated: Mar 24, 2026


The default

If a model is meant to represent an order, a subscription state, a customer interaction, a claim decision, or any other analytical entity, its shape should be driven by that responsibility. We don’t want the semantic layer inheriting the accidental shape of raw source messiness any more than necessary.

That’s the broader posture behind reviewable transformations. We want named models that explain what they mean, not a sequence of cleanup steps that only happens to leave something useful behind.

Cleanup is real, but it isn’t the model

Source cleanup is real work.

Types need normalizing. Duplicates need handling. Bad records need quarantine paths. Nested payloads may need flattening before they become usable. We aren’t pretending any of that disappears just because the warehouse would look cleaner without it.
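Those cleanup steps can be sketched as one small staging-layer function. This is a minimal illustration, not a real pipeline: the field names (`event_id`, `amount`, `payload`, `customer_id`) are invented for the example.

```python
def stage_raw_orders(raw_rows):
    """Cleanup only: normalize types, drop duplicates, quarantine bad
    records, and flatten one nested payload field. No business decisions."""
    seen, clean, quarantined = set(), [], []
    for row in raw_rows:
        if row.get("event_id") in seen:
            continue  # duplicate source event: handle it here, once
        seen.add(row["event_id"])
        try:
            amount = float(row["amount"])  # normalize a string-typed amount
        except (KeyError, ValueError, TypeError):
            quarantined.append(row)  # bad record: park it, don't guess
            continue
        payload = row.get("payload") or {}
        clean.append({
            "event_id": row["event_id"],
            "amount": amount,
            "customer_id": payload.get("customer_id"),  # flattened field
        })
    return clean, quarantined
```

The point of keeping this in its own function (or its own staging model) is that nothing here claims to be the business entity; it only makes the source usable.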

What we resist is letting those steps define the final model.

A table called orders_cleaned_v7 may describe the history of the pipeline, but it doesn’t tell the reader what the data is for. It tells them the source was ugly and somebody kept patching it. That’s not a semantic layer. That’s a maintenance diary.

A useful model explains a business shape. Its grain says what one row represents. Its columns support that meaning. The cleanup work may still exist upstream, but it shouldn’t be the main thing the downstream table is organized around.

The model should describe the business shape

Once a table enters the analytical layer, we want it to answer a business question honestly.

What is this row? One order? One order line? One active subscription state? One claim outcome? One customer-day snapshot? Those are the questions that should decide the grain, not whichever transformation step was hardest to implement in staging.
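The grain question can be made executable: declare the key that is supposed to identify one row, then check that it actually does. A minimal sketch, with hypothetical column names:

```python
from collections import Counter

def check_grain(rows, key_columns):
    """Return the key tuples that appear more than once.
    An empty result means the declared grain actually holds."""
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)
    return [key for key, n in counts.items() if n > 1]
```

A model that claims "one row per order line" would run `check_grain(rows, ["order_id", "order_line"])` and fail loudly on duplicates, instead of letting downstream joins discover them.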

This matters because model shape doesn’t stay local. Downstream joins, metrics, filters, marts, and dashboards all inherit the boundary the model chose. If that boundary came from source cleanup residue instead of a real analytical entity, every later consumer has to keep repairing the same ambiguity.

That’s where a lot of downstream nonsense starts. The warehouse keeps returning answers, but half the effort goes into reconstructing what the table should’ve represented in the first place.

Cleanup, semantics, and workflow are different jobs

One of the easier ways to make a repo confusing is to blur cleanup work, model semantics, and orchestration into one mushy responsibility.

A transformation starts by normalizing source fields. Then it quietly decides business grain. Then some runtime flag changes what gets materialized. Then a scheduler branch decides which cleanup path is real. Now the system still runs, but nobody can point to one clear place and say, “this is where the model means what it means.”

Layer boundaries matter. Cleanup, model semantics, and workflow behavior are not the same job. Once they get blended together, every later change becomes harder to review because the actual boundary keeps moving.
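The separation can be shown in miniature. In this sketch (all names invented for illustration), each job lives in exactly one place, and the orchestration layer only sequences:

```python
def stage_orders(raw):
    """Cleanup only: dedupe by source id. No business decisions."""
    seen = set()
    out = []
    for r in raw:
        if r["id"] not in seen:
            seen.add(r["id"])
            out.append(r)
    return out

def model_orders(staged):
    """Semantics only: one row per order. This is where the model
    means what it means."""
    return [{"order_id": r["id"], "amount": r["amount"]} for r in staged]

def run_pipeline(raw):
    """Orchestration only: sequences the steps, decides nothing
    about grain or meaning."""
    return model_orders(stage_orders(raw))
```

If a reviewer asks "where is the business grain decided?", the answer is one function, not a trail through scheduler branches.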

Orchestration shouldn’t own business meaning

Workflow tools are useful for sequencing work. They’re bad homes for business semantics.

If the real definition of a model lives in scheduler branches, runtime parameters, or task-level special cases, the warehouse no longer has a clear semantic layer. It has a logic trail.

Orchestration boundaries matter here too. Orchestration can coordinate models. Once it starts carrying model meaning, it’s usually compensating for a boundary the SQL layer never defined cleanly.

Stable boundaries make change easier to handle

Analytical entities don’t stop changing just because we gave them a cleaner table name.

A subscription can be reactivated. An order can be refunded. A claim can be reopened. A customer state can be corrected after another upstream record arrives. Those changes are normal. The question is whether the model boundary is strong enough to absorb them without turning each case into special repair logic.
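One way to make a boundary absorb these changes is to define the model as a fold over the event history, so a reactivation or correction is just another event rather than a special repair path. A sketch, with hypothetical event fields:

```python
def current_subscription_state(events):
    """One row per subscription: replay the ordered event history,
    latest state wins. Reactivations, cancellations, and late
    corrections all flow through the same fold."""
    state = {}
    for ev in sorted(events, key=lambda e: e["at"]):
        state[ev["subscription_id"]] = ev["status"]
    return state
```

Because the model is defined by its grain (one row per subscription) rather than by a cleanup sequence, a newly arrived upstream record changes the output without changing the definition.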

Stale rows show the same boundary problem from another angle. A stable model boundary makes it easier to decide which upstream changes belong to the model and how those rows should be revisited over time.

If the boundary is weak, every later correction becomes an argument about whether the table was really meant to represent that case at all.

Good shape affects cost too

A semantically clean model isn’t just nicer for analysts. It’s usually cheaper to run.

When the table shape follows a real decision boundary, downstream work gets simpler. Fewer duplicated rows leak into marts. Fewer downstream queries need to reconstruct the intended entity again. Fewer rebuilds happen for problems that were really caused by model shape in the first place. That’s the same cost pattern behind cost spikes: bad shape creates repeat work everywhere.

What we do instead

We let cleanup support the model. We don’t let cleanup define it.

If a transformation is still mostly about taming a raw source, we name it accordingly and keep it in the preparation layer. If it’s meant to represent a business entity or decision, we model that entity directly and keep the grain honest.

That doesn’t mean the line is always easy. Sometimes the source is ugly enough that the cleanup work is half the effort. Fine. The point is still to keep the final model organized around what the row means, not around the pain it took to get there.

The point

We model around decision boundaries because cleanup artifacts make bad semantic layers.

The goal isn’t a prettier warehouse diagram. It’s a model shape that stays understandable when the system changes, the team changes, and the source keeps doing annoying source-system things like it was born to do.
