
Incremental models are only safe when change detection is explicit

Incremental models are trustworthy only when they can deliberately identify which records need another pass after late or changed upstream data shows up.

By Ivan Richter

Last updated: Mar 24, 2026

4 min read

The rule

Change detection has to be explicit.

It’s not enough to say “we merge by key” and assume later changes will somehow find their way back into the target. The harder question is the one that actually decides whether the model stays trustworthy: which existing rows should be reconsidered on this run, and why?

If that answer is vague, the model may still run fast. It just won’t stay correct for very long.

Row identity is only half of the contract

Every incremental model needs a stable notion of row identity. That’s why unique keys matter. The key tells the model how to match one target row to one analytical entity over time.

But identity only tells you how to match a row once you’ve decided it belongs in the update set. It doesn’t tell the model when a previously built row needs another pass. A merge can only update the rows that make it into the merge input. Change detection is what decides that input.

Without that second piece, “incremental” just means the system is fast at preserving mistakes.
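The split between identity and change detection can be made concrete with a toy merge. This is an illustrative sketch with made-up data, not anyone's production pattern: the merge matches perfectly on the key, yet a correction that change detection failed to select simply never reaches the target.

```python
# Minimal sketch (hypothetical data): a key-based merge only touches rows
# that appear in its input batch. A corrected row that change detection
# left out of the batch stays stale no matter how good the key is.

def merge_by_key(target: dict, batch: list[dict]) -> dict:
    """Upsert batch rows into target, matching on 'order_id'."""
    for row in batch:
        target[row["order_id"]] = row  # insert or overwrite by key
    return target

# Target already holds an order that the source later corrected.
target = {
    1: {"order_id": 1, "status": "shipped"},
    2: {"order_id": 2, "status": "pending"},
}

# Change detection picked only "new rows since the last run", so the
# upstream correction of order 1 (now cancelled) never enters the batch.
batch = [{"order_id": 3, "status": "pending"}]

merge_by_key(target, batch)
print(target[1]["status"])  # still "shipped": fast at preserving a mistake
```

The key did its job on every row it saw; the stale row was never in the rows it saw. That is the gap change detection has to close.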

Missed updates are usually predictable

The dangerous cases are rarely exotic.

A child record arrives late and changes an aggregate. A source system corrects a status. A deletion means a count should go down. A linked dimension changes in a way that affects downstream classification. A load replay republishes an older business event with newer extraction metadata. None of this is weird. It’s normal system behavior once data starts arriving out of order or getting corrected after the fact.

Those are exactly the situations that create stale rows when the model only looks at “new data since the last run” and calls it a day.

So we want the change-detection rule to be concrete. Maybe it’s a lookback window. Maybe it’s a set of changed business keys. Maybe it’s a partition rebuild rule. Maybe it’s a dependency-driven recompute set. The specific mechanism matters less than the clarity. A reviewer should be able to read the model and understand why a changed upstream record will trigger the right downstream reprocessing.

The model decides what counts as change

Change detection isn’t generic plumbing. It depends on what the model is supposed to represent.

A late update only matters if it changes the analytical entity the row stands for. That takes us back to decision boundaries. If the grain is unclear, the change-detection rule will usually be unclear too, because nobody can say which business changes are supposed to alter the row and which ones are just noise.

This is why vague models age badly. They don’t just make queries uglier. They make it harder to decide what deserves a revisit, so the incremental path either misses real changes or starts reprocessing far more than it should.

Weak detection usually turns into waste

Teams usually compensate for weak change detection with brute force.

They widen the lookback. They rebuild more partitions than necessary. They schedule frequent backfills just to feel safe. They rerun heavy joins because that feels less risky than understanding the actual change boundary. The model becomes “safe” only in the sense that it’s now doing a lot more work than the business question required.

That’s how a correctness problem turns into a cost problem. It’s one reason cost spikes often have more to do with model design than with one ugly query. If the model can’t cheaply identify the rows that need work, the platform ends up paying to reprocess the rows that didn’t.

What we want to be able to explain

For an incremental model to be trustworthy, we want to be able to answer a few plain questions without hand-waving.

What does one row represent? What key identifies that row over time? What kinds of upstream change should force a revisit? How does the model find those cases? Where does a lookback help, and where is it not enough? When do we merge, and when do we selectively replace?
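Those answers can live next to the model as a written-down contract rather than tribal knowledge. A minimal sketch, assuming a simple dataclass; the field names and the example values are hypothetical, chosen only to mirror the questions above:

```python
# Illustrative sketch: recording the answers as an explicit contract a
# reviewer can check, instead of inferring them from the model's SQL.
# All field names and example values here are assumptions for illustration.

from dataclasses import dataclass

@dataclass(frozen=True)
class IncrementalContract:
    grain: str                # what one row represents
    unique_key: str           # how a row is identified over time
    revisit_triggers: tuple   # upstream changes that force another pass
    detection: str            # how those cases are found
    strategy: str             # merge vs. selective replace, and when

orders_daily = IncrementalContract(
    grain="one row per order per day",
    unique_key="order_id, order_date",
    revisit_triggers=("late line items", "status corrections", "deletions"),
    detection="3-day lookback on event time plus a changed-key feed",
    strategy="merge on key; rebuild the partition when deletions are possible",
)
```

None of the values need to be fancy. What matters is that each question has one answer someone was forced to write down.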

If those answers aren’t clear, the model isn’t safe just because it finishes quickly.

The point

Safe incrementals aren’t defined by speed. They’re defined by whether the model can deliberately revisit the right records when reality changes.

A key tells us what one row is. Explicit change detection tells us when that row is no longer current. We need both.
