
How we prevent stale rows in incremental fact models

Incremental fact models stay trustworthy only when record identity, reprocessing rules, and cleanup boundaries are designed on purpose instead of patched after drift shows up.

By Ivan Richter

Last updated: Mar 24, 2026

5 min read


The problem

Stale rows show up when an incremental fact model stops revisiting the records that can still change.

A lot of teams treat that as a cleanup issue they discover later. We don’t. We treat it as part of the model design from the start. If a fact table can be corrected, enriched, reopened, refunded, linked to later children, or otherwise changed after first arrival, the incremental path has to account for that on purpose.

Otherwise the table starts drifting in a very predictable way. Old rows stay behind. New logic only fixes fresh data. Trust drops quietly until someone notices a discrepancy in a dashboard, at which point everybody gets to pretend it was somehow surprising.

Record identity has to be real

The first defense against stale rows is explicit row identity.

If the model can’t reliably say which target row corresponds to a changed source entity, every later update rule gets shaky fast. At that point, merge logic is mostly theater. The system is still writing rows, but it isn’t doing it from a stable understanding of what one row is supposed to represent.

That’s why unique keys matter. A merge strategy without a trustworthy key isn’t really a merge strategy. It’s just a polite way to accumulate drift.

We want the target row to represent a clear analytical entity and the key to identify that entity in a stable way over time. That makes update behavior reviewable instead of mysterious.
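As a minimal sketch of what "real identity" buys you, here is a key-based merge using Python's stdlib sqlite3. The table, column, and key names are illustrative, not from any system the article describes; the point is that the declared key is what lets a later correction revisit the same row instead of appending a duplicate.

```python
import sqlite3

# One row = one order. The PRIMARY KEY is the explicit statement of identity.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE fct_orders (
        order_id   TEXT PRIMARY KEY,
        status     TEXT,
        amount     REAL,
        loaded_at  TEXT
    )
""")

def merge_batch(rows):
    """Upsert changed source rows by their stable key.

    Without the PRIMARY KEY above, this would silently accumulate
    duplicates instead of correcting old rows."""
    conn.executemany(
        """
        INSERT INTO fct_orders (order_id, status, amount, loaded_at)
        VALUES (:order_id, :status, :amount, :loaded_at)
        ON CONFLICT(order_id) DO UPDATE SET
            status    = excluded.status,
            amount    = excluded.amount,
            loaded_at = excluded.loaded_at
        """,
        rows,
    )

merge_batch([{"order_id": "o1", "status": "placed", "amount": 50.0, "loaded_at": "d1"}])
# A later refund revisits the same row rather than adding a second one.
merge_batch([{"order_id": "o1", "status": "refunded", "amount": 0.0, "loaded_at": "d2"}])
```

Swap the key for something unstable (a load timestamp, a surrogate assigned per batch) and the same merge statement becomes the "polite way to accumulate drift" described above.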

We plan for late change, not just first arrival

Most stale-row problems come from cases that aren’t even unusual.

A child record arrives after the parent fact was first built. A status gets corrected later. A cancellation changes what should still count. A slowly changing attribute affects a row that was already materialized. A source reload republishes an event with important differences but an older event timestamp. None of that is exotic. It’s just how systems behave once they’ve been alive for more than ten minutes.

Those are all versions of the same problem. The model needs explicit logic for what kinds of upstream change should force old rows to be revisited.

That’s why change detection matters. Lookbacks, changed-parent sets, changed-child sets, selective rebuilds, and targeted replacements aren’t hacks. They’re normal machinery for keeping an incremental model honest.
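One of those pieces of machinery, the changed-parent set, can be sketched in a few lines. The structures and names here are hypothetical: child events carry a `parent_id`, and any newly arrived or corrected child forces its parent fact row back into scope, no matter how old that parent is.

```python
def changed_parents(new_children):
    """Parents touched by any newly arrived or corrected child."""
    return {child["parent_id"] for child in new_children}

def rebuild_rows(all_children, parents_to_rebuild):
    """Recompute the parent aggregate, but only for affected parents."""
    totals = {}
    for child in all_children:
        pid = child["parent_id"]
        if pid in parents_to_rebuild:
            totals[pid] = totals.get(pid, 0) + child["amount"]
    return totals

all_children = [
    {"parent_id": "p1", "amount": 10},
    {"parent_id": "p1", "amount": 5},   # late-arriving child for an old parent
    {"parent_id": "p2", "amount": 7},
]
late_batch = [all_children[1]]
scope = changed_parents(late_batch)       # {"p1"}; p2 is left alone
print(rebuild_rows(all_children, scope))  # {"p1": 15}
```

The rebuild scope is derived from what actually changed, not from a guess about how far back change is allowed to reach.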

Lookbacks help, but they aren’t the whole plan

A lookback window is useful, but it isn’t a real strategy by itself.

Sometimes it’s enough. If the source behavior is simple and the delay pattern is tight, revisiting the last few days may catch what matters. But plenty of stale-row cases don’t respect a neat time window. A child change can hit an older parent. A correction can arrive long after the original row. A source reload can touch data outside the range you felt good about last month.

That’s where teams get into trouble. The lookback starts as a small safety measure, then keeps getting widened because trust keeps slipping. Soon the model is technically incremental and operationally halfway to a rebuild.

We’d rather be explicit about which changes matter than keep buying confidence with a fatter time window.
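A sketch of that posture: keep the lookback modest and union it with an explicit changed-key set from upstream. The row shapes, dates, and helper name are illustrative.

```python
from datetime import date, timedelta

def keys_to_reprocess(target_rows, changed_keys, today, lookback_days=3):
    """Union of (a) rows updated inside the lookback window and
    (b) keys upstream explicitly flagged as changed, however old."""
    window_start = today - timedelta(days=lookback_days)
    recent = {r["key"] for r in target_rows if r["updated_on"] >= window_start}
    return recent | set(changed_keys)

target_rows = [
    {"key": "a", "updated_on": date(2026, 3, 23)},  # inside the window
    {"key": "b", "updated_on": date(2026, 1, 2)},   # far outside any sane window
]
# Upstream republished "b" with a correction; the lookback alone would miss it.
print(sorted(keys_to_reprocess(target_rows, changed_keys={"b"},
                               today=date(2026, 3, 24))))  # ['a', 'b']
```

The window stays small because it only has to cover ordinary delay; everything outside it is caught by an explicit signal instead of a wider and wider net.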

Cleanup should stay close to the model

When stale rows start appearing, teams usually patch the fastest layer available.

A scheduler gets a backfill branch. A script runs a delete before the main build. A helper computes partitions to rewrite. Sometimes that is necessary. But it’s also how the system turns into a scavenger hunt.

We want the cleanup path to stay aligned with the model boundary. The model should explain what needs to be recomputed and why. If code or orchestration is involved, it should be because the logic really exceeds what belongs in SQL, not because nobody decided the boundary and the mess leaked outward.

That’s the same question behind layer boundaries. The point isn’t to keep everything in one file. It’s to keep the behavior in the layer where a reviewer can still see the actual system.
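One way to keep that boundary visible is to let the model itself declare its recompute scope, so the scheduler stays a dumb executor. This class and its method names are hypothetical, a shape rather than a prescription:

```python
class IncrementalModel:
    """The model owns the answer to 'what needs rebuilding and why'.

    Orchestration calls these methods; it does not run its own
    deletes or compute its own partition lists on the side."""

    def recompute_scope(self, upstream_changes):
        # Only the change kinds this model cares about force a rewrite.
        return {
            c["partition"]
            for c in upstream_changes
            if c["kind"] in {"correction", "late_child"}
        }

    def build(self, partitions):
        # Delete-and-rewrite only the partitions the model itself asked for.
        return sorted(partitions)

model = IncrementalModel()
scope = model.recompute_scope([
    {"partition": "2026-03-01", "kind": "correction"},
    {"partition": "2026-03-20", "kind": "new"},  # normal incremental path
])
print(model.build(scope))  # → ['2026-03-01']
```

A reviewer reading the model sees the whole reprocessing story in one place, instead of chasing it through a scheduler branch, a pre-build delete script, and a helper.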

Bad incrementals get expensive fast

Stale rows are a correctness problem first. They turn into a cost problem almost immediately after that.

Once a team stops trusting the incremental path, the response is usually predictable. Widen the lookback. Rewrite more partitions. Rebuild more often. Rerun extra cleanup to be safe. Keep paying for uncertainty because the model isn’t specific enough to control where change lands.

That’s one reason cost spikes often have more to do with model design than with one bad query. Weak incrementals make the warehouse scan and rewrite far more data than the analytical question actually required.

Our default posture

We assume fact models can drift unless the model gives us a good reason to trust it.

So we ask the boring questions early, because they’re only boring right up until the table starts lying. What does one row represent? What can change after initial load? Which upstream events should force a revisit? How far back do we need to look, and where is that not enough? Which cases need replacement instead of merge? Which child changes should recompute an older parent row?

If we can’t answer those questions, the incremental path isn’t finished. It’s just been allowed to run.

The point

We prevent stale rows by treating reprocessing as part of the model, not as cleanup somebody bolts on later.

Identity has to be explicit. Change detection has to be deliberate. Cleanup has to respect the model boundary. If those pieces stay fuzzy, stale rows aren’t bad luck. They’re the expected outcome.
