
How we prevent stale rows in incremental fact models

Incremental fact models stay trustworthy only when record identity, reprocessing rules, and cleanup boundaries are designed on purpose instead of patched after drift shows up.

By Ivan Richter

Last updated: Mar 24, 2026

5 min read


The problem

Stale rows show up when an incremental fact model stops revisiting the records that can still change.

A lot of teams treat that as a cleanup issue they discover later. We don’t. We treat it as part of the model design from the start. If a fact table can be corrected, enriched, reopened, refunded, linked to later children, or otherwise changed after first arrival, the incremental path has to account for that on purpose.

Otherwise the table starts drifting in a very predictable way. Old rows stay behind. New logic only fixes fresh data. Trust drops quietly until someone notices a discrepancy in a dashboard, at which point everybody gets to pretend it was somehow surprising.

Record identity has to be real

The first defense against stale rows is explicit row identity.

If the model can’t reliably say which target row corresponds to a changed source entity, every later update rule gets shaky fast. At that point, merge logic is mostly theater. The system is still writing rows, but it isn’t doing it from a stable understanding of what one row is supposed to represent.

That’s why unique keys matter. A merge strategy without a trustworthy key isn’t really a merge strategy. It’s just a polite way to accumulate drift.

We want the target row to represent a clear analytical entity and the key to identify that entity in a stable way over time. That makes update behavior reviewable instead of mysterious.
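As a minimal sketch of what "real identity" buys you, here is a key-based merge using Python's stdlib sqlite3. The table, column, and key names are illustrative, not from any system the article describes; the point is that the declared key is what lets a later correction revisit the same row instead of appending a duplicate.

```python
import sqlite3

# One row = one order. The PRIMARY KEY is the explicit statement of identity.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE fct_orders (
        order_id   TEXT PRIMARY KEY,
        status     TEXT,
        amount     REAL,
        loaded_at  TEXT
    )
""")

def merge_batch(rows):
    """Upsert changed source rows by their stable key.

    Without the PRIMARY KEY above, this would silently accumulate
    duplicates instead of correcting old rows."""
    conn.executemany(
        """
        INSERT INTO fct_orders (order_id, status, amount, loaded_at)
        VALUES (:order_id, :status, :amount, :loaded_at)
        ON CONFLICT(order_id) DO UPDATE SET
            status    = excluded.status,
            amount    = excluded.amount,
            loaded_at = excluded.loaded_at
        """,
        rows,
    )

merge_batch([{"order_id": "o1", "status": "placed", "amount": 50.0, "loaded_at": "d1"}])
# A later refund revisits the same row rather than adding a second one.
merge_batch([{"order_id": "o1", "status": "refunded", "amount": 0.0, "loaded_at": "d2"}])
```

Swap the key for something unstable (a load timestamp, a surrogate assigned per batch) and the same merge statement becomes the "polite way to accumulate drift" described above.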

We plan for late change, not just first arrival

Most stale-row problems come from cases that aren’t even unusual.

A child record arrives after the parent fact was first built. A status gets corrected later. A cancellation changes what should still count. A slowly changing attribute affects a row that was already materialized. A source reload republishes an event with important differences but an older event timestamp. None of that is exotic. It’s just how systems behave once they’ve been alive for more than ten minutes.

Those are all versions of the same problem. The model needs explicit logic for what kinds of upstream change should force old rows to be revisited.

That’s why change detection matters. Lookbacks, changed-parent sets, changed-child sets, selective rebuilds, and targeted replacements aren’t hacks. They’re normal machinery for keeping an incremental model honest.
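One of those pieces of machinery, the changed-parent set, can be sketched in a few lines. The structures and names here are hypothetical: child events carry a `parent_id`, and any newly arrived or corrected child forces its parent fact row back into scope, no matter how old that parent is.

```python
def changed_parents(new_children):
    """Parents touched by any newly arrived or corrected child."""
    return {child["parent_id"] for child in new_children}

def rebuild_rows(all_children, parents_to_rebuild):
    """Recompute the parent aggregate, but only for affected parents."""
    totals = {}
    for child in all_children:
        pid = child["parent_id"]
        if pid in parents_to_rebuild:
            totals[pid] = totals.get(pid, 0) + child["amount"]
    return totals

all_children = [
    {"parent_id": "p1", "amount": 10},
    {"parent_id": "p1", "amount": 5},   # late-arriving child for an old parent
    {"parent_id": "p2", "amount": 7},
]
late_batch = [all_children[1]]
scope = changed_parents(late_batch)       # {"p1"}; p2 is left alone
print(rebuild_rows(all_children, scope))  # {"p1": 15}
```

The rebuild scope is derived from what actually changed, not from a guess about how far back change is allowed to reach.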

Lookbacks help, but they aren’t the whole plan

A lookback window is useful, but it isn’t a real strategy by itself.

Sometimes it’s enough. If the source behavior is simple and the delay pattern is tight, revisiting the last few days may catch what matters. But plenty of stale-row cases don’t respect a neat time window. A child change can hit an older parent. A correction can arrive long after the original row. A source reload can touch data outside the range you felt good about last month.

That’s where teams get into trouble. The lookback starts as a small safety measure, then keeps getting widened because trust keeps slipping. Soon the model is technically incremental and operationally halfway to a rebuild.

We’d rather be explicit about which changes matter than keep buying confidence with a fatter time window.
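A sketch of that posture: keep the lookback modest and union it with an explicit changed-key set from upstream. The row shapes, dates, and helper name are illustrative.

```python
from datetime import date, timedelta

def keys_to_reprocess(target_rows, changed_keys, today, lookback_days=3):
    """Union of (a) rows updated inside the lookback window and
    (b) keys upstream explicitly flagged as changed, however old."""
    window_start = today - timedelta(days=lookback_days)
    recent = {r["key"] for r in target_rows if r["updated_on"] >= window_start}
    return recent | set(changed_keys)

target_rows = [
    {"key": "a", "updated_on": date(2026, 3, 23)},  # inside the window
    {"key": "b", "updated_on": date(2026, 1, 2)},   # far outside any sane window
]
# Upstream republished "b" with a correction; the lookback alone would miss it.
print(sorted(keys_to_reprocess(target_rows, changed_keys={"b"},
                               today=date(2026, 3, 24))))  # ['a', 'b']
```

The window stays small because it only has to cover ordinary delay; everything outside it is caught by an explicit signal instead of a wider and wider net.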

Cleanup should stay close to the model

When stale rows start appearing, teams usually patch the fastest layer available.

A scheduler gets a backfill branch. A script runs a delete before the main build. A helper computes partitions to rewrite. Sometimes that is necessary. But it’s also how the system turns into a scavenger hunt.

We want the cleanup path to stay aligned with the model boundary. The model should explain what needs to be recomputed and why. If code or orchestration is involved, it should be because the logic really exceeds what belongs in SQL, not because nobody decided the boundary and the mess leaked outward.

That’s the same question behind layer boundaries. The point isn’t to keep everything in one file. It’s to keep the behavior in the layer where a reviewer can still see the actual system.
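One way to keep that boundary visible is to let the model itself declare its recompute scope, so the scheduler stays a dumb executor. This class and its method names are hypothetical, a shape rather than a prescription:

```python
class IncrementalModel:
    """The model owns the answer to 'what needs rebuilding and why'.

    Orchestration calls these methods; it does not run its own
    deletes or compute its own partition lists on the side."""

    def recompute_scope(self, upstream_changes):
        # Only the change kinds this model cares about force a rewrite.
        return {
            c["partition"]
            for c in upstream_changes
            if c["kind"] in {"correction", "late_child"}
        }

    def build(self, partitions):
        # Delete-and-rewrite only the partitions the model itself asked for.
        return sorted(partitions)

model = IncrementalModel()
scope = model.recompute_scope([
    {"partition": "2026-03-01", "kind": "correction"},
    {"partition": "2026-03-20", "kind": "new"},  # normal incremental path
])
print(model.build(scope))  # → ['2026-03-01']
```

A reviewer reading the model sees the whole reprocessing story in one place, instead of chasing it through a scheduler branch, a pre-build delete script, and a helper.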

Bad incrementals get expensive fast

Stale rows are a correctness problem first. They turn into a cost problem almost immediately after that.

Once a team stops trusting the incremental path, the response is usually predictable. Widen the lookback. Rewrite more partitions. Rebuild more often. Rerun extra cleanup to be safe. Keep paying for uncertainty because the model isn’t specific enough to control where change lands.

That’s one reason cost spikes often have more to do with model design than with one bad query. Weak incrementals make the warehouse scan and rewrite far more data than the analytical question actually required.

Our default posture

We assume fact models can drift unless the model gives us a good reason to trust it.

So we ask the boring questions early, because they’re only boring right up until the table starts lying. What does one row represent? What can change after initial load? Which upstream events should force a revisit? How far back do we need to look, and where is that not enough? Which cases need replacement instead of merge? Which child changes should recompute an older parent row?

If we can’t answer those questions, the incremental path isn’t finished. It’s just been allowed to run.

The point

We prevent stale rows by treating reprocessing as part of the model, not as cleanup somebody bolts on later.

Identity has to be explicit. Change detection has to be deliberate. Cleanup has to respect the model boundary. If those pieces stay fuzzy, stale rows aren’t bad luck. They’re the expected outcome.
