Most data teams ship the same SCD pattern for every dimension in their warehouse: Type 2, hand-rolled or via a dbt snapshot, with an is_current flag and effective dates. It’s the only type they remember from the Kimball book they read once. It produces, in a meaningful share of the engagements we audit, the wrong answers on regulated workloads, exploded dimension tables on volatile attributes, and a stack of is_current = true filters baked into BI queries that quietly break the moment a fact arrives late.
There are three original Slowly Changing Dimension types (Type 1, 2, 3) plus five hybrid types (0, 4, 5, 6, 7) standardized by Kimball’s Design Tip #152, three modern variants that ship as native database features, and three hybrid construction patterns. Picking the right one is not a matter of tradition or convenience — it’s a diagnostic process you run against the incoming data signals and the query patterns the dimension has to serve. This article walks the entire loop, type by type, with the signals that point to each, a worked example, the dbt + Iceberg implementation, the trade-off you accepted when you picked it, and the hybrids that combine atomic types when no single one fits.
By the end, you should be able to look at a new dimension in your own pipeline and pattern-match it to the right SCD type — or to a hybrid you compose deliberately — without guessing.
The diagnostic workflow, end to end
Every SCD decision follows the same four-step loop. Most teams skip steps 1 and 2 and jump straight to step 3 (“which type?”), which is exactly backwards — the right type almost always falls out of the first two answers.
Step 1 asks: what is your data actually doing right now? Mutation frequency, attribute volatility, whether the same business key mutates over time or each “change” is really a new entity, whether the source system even tells you a row changed (does it ship an updated_at? is there logical replication?). Step 2 asks: what queries does this dimension need to serve six months from now? Audit lookups, “as-of date” reporting, current-state filters, both lenses simultaneously, regulatory restatements. Steps 3 and 4 — pick the type, compose a hybrid if needed — are mechanical once 1 and 2 are answered honestly.
Worked diagnostic, one engagement
Concretely: a Series-B SaaS customer just provisioned AWS, with their customers table sitting in a Postgres replica. They want clean ARR cohorts in Looker AND a churn-prediction feature store. The customer dim is the obvious place to start.
Step 1 — incoming data signals:
customer.email,customer.name,customer.addresschange rarely (a few per thousand customers per month). Mostly typo fixes.customer.plan_tierchanges monthly-ish (people upgrade/downgrade as the business evolves).customer.behavior_scoreis recomputed nightly for every customer based on the past 30 days of product events. Every row mutates every day.- Postgres has logical replication enabled. Source ships
updated_at. Change detection is reliable.
Step 2 — query-pattern needs:
- Finance wants ARR cohorts reported as of historical plan tier (cohorts must be reconstructible for the past 18 months — billing reconciliation).
- BI wants dashboards filtered by current plan tier (today’s segmentation).
- The ML team wants “behavior score as it was at decision time” for training the churn model.
- Compliance wants the customer’s historical email retained for 7 years, but expects right-to-be-forgotten requests.
Step 3 — pick the atomic types:
email,name,address→ Type 2 (compliance retention, but rare changes mean storage cost is negligible). With pseudonymization at GDPR-erasure time.plan_tier→ Type 2 (finance needs as-of, BI needs current).behavior_score→ Type 4 mini-dim. Daily mutation × all customers would explode a Type 2 dim. Mini-dim of distinct (score-band, engagement-band) combinations is at most a few hundred rows.
Step 4 — compose hybrids if needed:
plan_tierneeds both historical (finance) AND current (BI) on the same fact row. The atomic Type 2 covers the historical lens. The current-lens demand pushes us to Type 7: facts carry both the surrogate key (historical) and the business key (current view). Cleaner than Type 6 in this engagement because the BI tool is Looker, which handles both joins via LookML naturally.behavior_scoreneeds “as-of decision time” for the ML feature store — that’s a bitemporal requirement on the mini-dim. Addrecorded_from/recorded_toso the ML pipeline can reproduce features that were known at any training-set cutoff date.
The result: the customer dim is a deliberate stack of three atomic patterns plus two hybrid compositions, chosen because steps 1 and 2 forced the choice. Nobody guessed. No section said “Type 2 because tradition.” If a new attribute is added next quarter (account_manager_id), we re-run steps 1 and 2 on it specifically and pick again.
This is the loop. The rest of this article walks each atomic type and each hybrid in detail — what signals point to each, what they’re built for, how to implement them in dbt + Iceberg, and what you accepted by picking them.
Type 0 retain original
Value never changes after first write.
business_key signup_tier
CUST_42 Bronze ← unchanged forever Use for: signup tier, original credit score, date of birth.
Data signals that point here: the value is born immutable in the world. It was true at the moment the row was created, and any later “change” is either an error or a fundamentally new entity. Date of birth. Original credit score at account opening. Original sale price. The fiscal-year-bucket on a date dimension. (Kimball Design Tip #152)
Query-pattern need: every report grouped by this attribute uses the original value forever. “Customers who opened accounts in 2024” stays “Customers who opened accounts in 2024” even when 2025 rolls over.
What Type 0 does: writes the value once and locks it. Subsequent updates from the source are rejected, not absorbed. The dimension column ignores future change events from CDC streams.
Worked example. A SaaS product opens accounts continuously. The account_opening_tier column on dim_customer is Type 0 — it records whether the customer signed up under the Free, Starter, or Enterprise plan at signup, regardless of any later upgrade or downgrade. Six months later, when the customer’s current_tier changes from Starter to Enterprise, that change rolls through the Type 2 column (current_tier) but never touches account_opening_tier. Quarterly board reports on “growth from signup cohort” stay consistent forever.
dbt + Iceberg implementation. Type 0 columns are populated once at insert and never overwritten. The dbt pattern uses incremental_strategy='merge' with an empty merge_update_columns list, so matched rows are not updated — only net-new business keys are inserted:
{{ config(
materialized='incremental',
unique_key='customer_business_key',
incremental_strategy='merge',
merge_update_columns=[] -- empty: matched rows are NOT updated
) }}
select
customer_business_key,
signup_tier as account_opening_tier, -- Type 0: written once, never updated
...
from {{ ref('stg_customer') }}
{% if is_incremental() %}
where customer_business_key not in (select customer_business_key from {{ this }})
{% endif %} The empty merge_update_columns list is what makes this Type 0: dbt compiles the MERGE INTO ... WHEN MATCHED THEN DO NOTHING pattern (or its adapter-specific equivalent). The WHERE NOT IN further filters source rows so existing customers don’t even enter the merge as candidates. Two safety mechanisms because Type 0 is the one type where silent overwrites are the entire failure mode.
Adapter-specific note for Iceberg: the merge_update_columns=[] pattern is supported on dbt-athena, dbt-spark, and dbt-glue. On older adapter versions, drop to a hand-rolled pre_hook that asserts count(distinct customer_business_key) == count(*) before each run.
What you give up. If the source system mutates the attribute (e.g. a CRM admin edits a “signup tier” field after the fact), Type 0 silently drops the edit. That is intentional — but it requires governance discipline so a future engineer doesn’t “fix” the WHERE NOT IN and accidentally re-open the column to overwrites.
Type 1 overwrite
Update in place. No history kept.
business_key plan_tier
CUST_42 Gold ← Bronze gone forever Use for: typo fixes, current email, current phone.
Data signals that point here: the value can change, you know it can change, and you have made a conscious choice to not track the history. Either because the change is an error correction (a typo in a customer’s last name), or because history has no analytic value (the customer’s current phone number — nobody runs “phone-number-cohort” analyses), or because the workload is operational not analytical.
Query-pattern need: every report wants the current value, no historical view ever required. “Show me customers in CA” should reflect today’s state of the dim, including the customer who moved from NY to CA last month.
What Type 1 does: overwrites the existing row when the source changes. No history, no audit trail in the dimension itself.
Worked example. A customer’s email address. The source-system field mutates frequently as people change jobs, providers, or move to a personal address. Reporting almost never wants the historical email (and if it does, that’s a separate audit-table use case). The email column on dim_customer is Type 1: every CDC update from Postgres overwrites the prior value.
dbt + Iceberg implementation. Standard dbt incremental materialization with merge strategy:
{{ config(
materialized='incremental',
unique_key='customer_business_key',
incremental_strategy='merge'
) }}
select
customer_business_key,
email,
phone_number,
updated_at
from {{ ref('stg_customer') }}
{% if is_incremental() %}
where updated_at > (select max(updated_at) from {{ this }})
{% endif %} The MERGE INTO ... WHEN MATCHED THEN UPDATE pattern Iceberg compiles this to is what does the in-place overwrite.
What you give up. All history of this attribute, forever. If finance comes back six months from now asking “what was this customer’s billing address on Mar 15?” and you only modeled the address as Type 1, the answer is gone unless you can recover it from the source system’s own audit log. Kimball is explicit: Type 1 is “prohibited in regulated environments” precisely because regulators ask exactly that question. (Kimball Group: SCDs)
Type 2 add new row, the workhorse
New surrogate row per change, with effective dates.
sk bk plan valid_from valid_to curr
3001 CUST_42 Bronze 2025-08-01 2026-02-15 false
3247 CUST_42 Gold 2026-02-15 9999-12-31 true Use for: customer plan, employee manager, product price. Join facts via date range — never on is_current = true.
Data signals that point here: the value changes, the changes matter, and at least one stakeholder will eventually ask “what was it as-of <date>?” Customer plan tier. Employee manager. Product price. Subscription state. Sales territory. Address (when used for tax or compliance reporting).
Query-pattern need: “as-of date” reporting is real. Late-arriving facts must join to the historical version of the dim, not the current one. Auditors or regulators may demand reconstructible state at any past timestamp. Or simpler: BI dashboards filter cohorts by attributes that have already mutated for some customers.
What Type 2 does: every time the attribute changes, insert a new row keyed by a surrogate primary key. Carry the business key as a non-unique column. Add effective_from, effective_to, and is_current columns. The intervals are half-open — [effective_from, effective_to) — meaning a row is valid from effective_from inclusive up to but not including effective_to. The previous row’s effective_to is set to the new row’s effective_from; the previous row’s is_current is flipped to false. Half-open intervals are the right convention because they make the boundary unambiguous: the exact instant of the change belongs to the new row, never both. (Wikipedia: Slowly Changing Dimension)
Worked example. A customer moves from the Bronze plan to the Gold plan on 2026-02-15T00:00:00 UTC. Before the change:
surrogate_key business_key plan_tier effective_from effective_to is_current
───────────── ──────────── ───────── ────────────── ──────────── ──────────
3001 CUST_42 Bronze 2025-08-01 9999-12-31 true After the change:
surrogate_key business_key plan_tier effective_from effective_to is_current
───────────── ──────────── ───────── ────────────── ──────────── ──────────
3001 CUST_42 Bronze 2025-08-01 2026-02-15 false
3247 CUST_42 Gold 2026-02-15 9999-12-31 true A fact with event_time = 2026-01-12 joins to the Bronze-era row (within [2025-08-01, 2026-02-15)). A fact with event_time = 2026-02-15T00:00:00 joins to the Gold-era row (within [2026-02-15, 9999-12-31)) — the boundary instant belongs to the new row, exactly once. A fact arriving today joins the Gold-era row as well.
dbt + Iceberg implementation. dbt’s native snapshot is the path of least resistance:
{% snapshot dim_customer_plan_snapshot %}
{{
config(
target_schema='snapshots',
strategy='timestamp',
unique_key='customer_business_key',
updated_at='updated_at',
dbt_valid_to_current="cast('9999-12-31' as date)",
snapshot_meta_column_names={
'dbt_valid_from': 'effective_from',
'dbt_valid_to': 'effective_to',
'dbt_scd_id': 'surrogate_key',
'dbt_is_deleted': 'is_deleted'
}
)
}}
select customer_business_key, plan_tier, updated_at
from {{ source('app_db', 'customers') }}
{% endsnapshot %} The snapshot runs on a schedule (typically hourly to daily — dbt docs explicitly warn against more frequent snapshotting). The snapshot_meta_column_names block renames dbt’s defaults (dbt_valid_from, dbt_valid_to, dbt_scd_id) to canonical Kimball names so the downstream models and BI tools see the standard column layout.
When you need a real integer surrogate key (rather than dbt’s dbt_scd_id hash) or finer control over the merge semantics, drop to a hand-rolled incremental model with a two-phase close-and-insert pattern. The Xebia and phData guides walk through this in detail. (Xebia: Practical Guide to SCD2 in dbt, phData: SCD with dbt Snapshots)
The non-negotiable join pattern. Facts must join the dim via the date range, never is_current = true:
-- ✗ wrong — late-arriving fact gets the current dim version
select f.*, d.plan_tier
from fct_subscription_event f
join dim_customer d on f.customer_business_key = d.business_key
where d.is_current = true;
-- ✓ right — late-arriving fact gets the historical dim version
select f.*, d.plan_tier
from fct_subscription_event f
join dim_customer d
on f.customer_business_key = d.business_key
and f.event_time >= d.effective_from
and f.event_time < d.effective_to; What you give up. Storage cost (one row per attribute change per business key — can balloon for volatile dimensions; see Type 4 below). Some query complexity (the date-range join is unfamiliar to analysts trained on simple equi-joins). And the discipline of never training analysts to use is_current = true as a join filter. The day they slip is the day a late-arriving fact gets the wrong attribution. (SystemOverflow: SCD Failure Modes)
Type 3 add new attribute
Limited history via parallel columns.
bk curr_plan prev_plan change_date
CUST_42 Gold Bronze 2026-02-15 Use for: territory reorgs, alternate-reality reporting. Doesn't scale past 1-2 prior versions.
Data signals that point here: the value changes infrequently, you only ever care about “current” and “previous” (or a small fixed number of historical versions), and the change is dimensional in a way that requires side-by-side reporting — both versions visible in the same row of the same query. The canonical Kimball example is sales-territory reorganizations: management wants to see Q3 sales reported under both the old territory boundaries and the new ones, on the same dashboard.
Query-pattern need: “alternate-reality” reports where both the old and new attribute values must appear in adjacent columns of a single row. Not a date-range scan — a flat comparison.
What Type 3 does: adds new columns to the dimension instead of new rows. A typical layout: current_territory, previous_territory, territory_change_date. Both are populated on the same dim row; the column count grows with the number of historical versions you want to keep visible.
Worked example. A B2B sales team reorganizes from 5 to 8 regional territories on 2026-04-01. The board wants Q1 reported under the old 5-region layout (because that’s how quotas were set) AND under the new 8-region layout (because that’s how Q2 will be measured). One dim row per customer, two columns:
business_key current_territory previous_territory territory_change_date
──────────── ───────────────── ────────────────── ─────────────────────
CUST_42 NORTHEAST_NEW NORTHEAST_OLD 2026-04-01 A single Looker explore now groups Q1 sales by either current_territory or previous_territory without joining historical versions.
dbt + Iceberg implementation. Type 3 requires preserving the existing row’s current_territory into the new row’s previous_territory whenever a change is detected. That can’t be expressed inside a single SELECT against stg_customer — the prior value lives in the target table, not the source. The pattern is a merge model with explicit merge_update_columns plus a CTE that joins source against target:
{{ config(
materialized='incremental',
unique_key='business_key',
incremental_strategy='merge'
) }}
with src as (
select business_key, territory as current_territory_source, observed_at
from {{ ref('stg_customer') }}
),
{% if is_incremental() %}
diffed as (
select
s.business_key,
s.current_territory_source as current_territory,
-- preserve prior current_territory as the new previous_territory ONLY on change
case
when t.current_territory != s.current_territory_source
then t.current_territory
else t.previous_territory
end as previous_territory,
case
when t.current_territory != s.current_territory_source
then s.observed_at
else t.territory_change_date
end as territory_change_date
from src s
left join {{ this }} t on s.business_key = t.business_key
)
{% else %}
diffed as (
select business_key, current_territory_source as current_territory,
null as previous_territory, null as territory_change_date
from src
)
{% endif %}
select * from diffed The LEFT JOIN {{ this }} is what gives us access to the existing row. The CASE preserves whatever was previously in current_territory only when the source value has actually changed — otherwise both columns stay stable. The shape works for one historical version. Each additional history slot adds another column triple, and so on — Type 3 doesn’t scale to many changes.
What you give up. Beyond two or three historical versions, you’re stuffing the schema. Type 3 is also useless for “as-of date” queries — there’s no way to express “as-of 2026-03-15 specifically” because the dim only carries snapshot pointers, not date-range validity. Pick Type 2 for time-grained history, Type 3 only for the very specific dual-reality-on-the-same-row use case.
Type 4 mini-dimension
Carve volatile attributes into a separate small dim.
dim_customer + dim_customer_behavior (~200 distinct rows)
facts carry BOTH FKs: slow dim + mini-dim active at fact-event time Use for: rapidly changing scores, segments, demographics on a high-cardinality dim.
Data signals that point here: the attribute changes frequently — daily, hourly, sometimes more — and the rest of the dimension changes slowly. Customer demographics (income band, RFM segment, behavior score). Real-time engagement state (last-session-at, current-page, session-duration-bucket). Telemetry attributes on a device dim (battery level band, signal strength band). If you put these on a Type 2 dim, the dim row count explodes 100×–1000× and storage and join performance both degrade. Kimball calls this the “rapidly changing monster dimension” problem. (Kimball Forum: Monster Dimensions)
Query-pattern need: facts need to be filtered or grouped by the volatile attribute at the time of the fact event. But the main dim’s stable attributes (name, address, signup date) should NOT carry the high-volume change burden.
What Type 4 does: splits the volatile attributes out into a separate dim — the mini-dimension — keyed independently. Each fact carries two foreign keys: one to the slow-changing dim, one to the mini-dim that was active at the fact’s event time. The mini-dim is small (one row per distinct combination of volatile attribute values) and de-duplicates across customers — every customer with the same “high-income / weekly-engaged / score-band-7” profile gets the same mini-dim FK. (Kimball Design Tip #152)
Worked example. A customer dimension has stable attributes (name, signup date, region) AND a behavior score that updates daily based on app activity. With 5M customers and a daily behavior-score recomputation, putting the score on the main Type 2 dim would force a new row every day the score changes for any customer — at the limit, 5M × 365 ≈ 1.825B rows per year. (In practice fewer, because not every customer’s score changes every day — but on the order of hundreds of millions of new rows per year, against a stable-attribute dim that would otherwise grow by tens of thousands of rows per year.) The mini-dim approach instead: keep dim_customer slow-moving (Type 2 on the stable attributes only — growth rate driven by name / address / region changes, not by behavior). Carve out dim_customer_behavior as the mini-dim — keyed not by customer, but by distinct combination of (engagement_band, score_band, last-active-band). The cardinality is bounded by combinations, not by customers — typically a few hundred to a few thousand rows total, no matter how large the customer base. Each fct_session row, at insert time, carries both customer_sk (to the slow dim) and behavior_sk (to the mini-dim row matching that customer’s behavior on that day).
dim_customer_behavior (mini-dim, ~200 rows):
behavior_sk engagement_band score_band last_active_band
─────────── ─────────────── ────────── ──────────────────
B_001 weekly 7 within_7d
B_002 weekly 6 within_7d
B_003 daily 8 within_24h
... The customer-to-behavior linkage moves to the fact table; the mini-dim itself just enumerates the combinations.
dbt + Iceberg implementation. Two dbt models — one snapshot for the slow-moving dim_customer, one incremental model for the mini-dim. The mini-dim builds via:
{{ config(materialized='incremental', unique_key='behavior_combination_hash', incremental_strategy='merge') }}
select
{{ dbt_utils.generate_surrogate_key([
'engagement_band', 'score_band', 'last_active_band'
]) }} as behavior_combination_hash,
engagement_band,
score_band,
last_active_band
from (
select distinct engagement_band, score_band, last_active_band
from {{ ref('int_customer_daily_behavior') }}
)
{% if is_incremental() %}
where /* only new combinations */
{% endif %} Facts carry the behavior_combination_hash as their FK, computed from the customer’s behavior state at fact-event time.
What you give up. A more complex join — facts now have two FKs to navigate. The customer’s current behavior is not on dim_customer anymore, so “show me Customer 42 and their current engagement band” requires either joining to the most recent fact for that customer, OR adopting Type 5 (next).
Type 5 outrigger
Mini-dim plus Type 1 current-pointer on the main dim.
dim_customer.current_behavior_sk → dim_customer_behavior (Type 1)
BI joins customer → current segment without scanning facts Use for: Type 4 plus "show me this customer's current segment".
Data signals that point here: you’ve already moved to a Type 4 mini-dim because of volatility, but a stakeholder constantly asks “what’s customer 42’s current behavior band?” without wanting to scan their fact history. The BI tool wants a direct join from the customer dim to a behavior summary without going through facts.
Query-pattern need: “Customer + their current volatile attributes” as a single-row lookup, on top of the historical-attribution that Type 4 already provides via facts.
What Type 5 does: Type 4 + Type 1. Keep the Type 4 mini-dim as the historical store (each fact still gets the right point-in-time behavior FK). Also add a Type 1 current_behavior_sk column to dim_customer and overwrite it whenever the customer’s behavior recomputes. The mini-dim FK on facts gives you “as-of” attribution; the Type 1 column on the main dim gives you “current” lookup without a fact scan. (Kimball Design Tip #152)
Worked example. Continuing the previous case: dim_customer now carries current_behavior_sk as an outrigger pointing to the mini-dim. Two things happen on each daily behavior recomputation: (a) newly inserted fct_session rows for that day are stamped with the day’s behavior_sk at insert time — historical fact rows are not updated, because facts in a Kimball warehouse are immutable; (b) current_behavior_sk on the customer’s dim_customer row is overwritten in place (Type 1 mechanics on a single outrigger column). The historical lineage lives in the fact-table FKs; the “current state” pointer lives on the dim.
dim_customer:
customer_sk business_key name signup_date region current_behavior_sk
─────────── ──────────── ────────── ─────────── ────── ────────────────────
1001 CUST_42 J. Doe 2024-08-01 NE B_003 Looker can now join fct_revenue → dim_customer → dim_customer_behavior and surface “Customer J. Doe, currently in engagement band: daily / score 8 / last-active within 24h” without a fact aggregation.
dbt + Iceberg implementation. A combined Type 2 dim model with the Type 1 outrigger column managed in the same model:
{% snapshot dim_customer_snapshot %}
-- standard Type 2 on stable attrs (region, name, etc.)
{% endsnapshot %}
-- materialized table downstream that adds the outrigger
{{ config(materialized='table') }}
select
d.*,
b.behavior_sk as current_behavior_sk
from {{ ref('dim_customer_snapshot') }} d
left join {{ ref('current_customer_behavior') }} b on d.business_key = b.business_key The downstream table refreshes whenever the behavior recomputation runs.
What you give up. A second moving part — the Type 1 outrigger has to be kept in sync with the mini-dim’s most recent value. Stale outriggers (“J. Doe’s current_behavior_sk still points to last week’s combination”) are a common bug. Schedule the outrigger refresh as a downstream step of the behavior recomputation, and add a freshness test.
Type 6 in-row dual history
Type 2 row plus historical AND current value columns.
sk bk hist_plan curr_plan valid_from curr
3001 CUST_42 Bronze Gold 2025-08-01 false
3247 CUST_42 Gold Gold 2026-02-15 true Use for: "all-time revenue grouped by current segment + historical attribution." Write-amplified: every change overwrites curr_* on all prior rows.
Data signals that point here: the same dim attribute needs to be reported under both its historical value (for “what was true at fact-event time”) AND its current value (for “show the customer’s current segment across all their historical revenue”), and you want both lenses available on every row of the dim without composing a hybrid yourself.
Query-pattern need: dashboards need to slice the same fact volume by both historical and current dim values, often in adjacent columns of the same chart. “All-time revenue grouped by the customer’s signup-tier (historical) vs grouped by the customer’s current-tier (today).”
What Type 6 does: combines Type 1 + Type 2 + Type 3 on the same dim. Every Type 2 row carries TWO copies of the volatile column — a historical_value (Type 2 — the value at the time this row was effective) and a current_value (Type 1 — overwritten on every prior row whenever the latest version mutates). Kimball: “fact rows can be filtered or grouped by either the type 2 value in effect when the measurement occurred or the attribute’s current value.” (Kimball Design Tip #152)
Worked example. Customer plan tier with Type 6 columns:
sk business_key historical_tier current_tier effective_from effective_to is_current
──── ──────────── ─────────────── ──────────── ────────────── ──────────── ──────────
3001 CUST_42 Bronze Gold 2025-08-01 2026-02-15 false
3247 CUST_42 Gold Gold 2026-02-15 9999-12-31 true A fact from January 2026 joins the Bronze-era row and reports historical_tier='Bronze' (under the Type 2 lens) or current_tier='Gold' (under the Type 1 lens). If the customer later downgrades to Silver, BOTH rows’ current_tier columns get overwritten to Silver, but the historical_tier values stay at Bronze and Gold respectively.
dbt + Iceberg implementation. Snapshots don’t support this shape natively. You need a hand-rolled incremental model that runs the snapshot logic AND a second-pass UPDATE that overwrites current_tier across all prior versions of the same business key:
{{ config(materialized='incremental', unique_key='surrogate_key', incremental_strategy='merge') }}
with snapshot_logic as (
-- standard Type 2 mechanics: close prior row, insert new row
...
)
select * from snapshot_logic Followed by a separate post-hook macro that overwrites current_tier on all rows where business_key matches the new insert. The two-step pattern is documented in the Servian Type 1 + 2 dbt guide.
What you give up. The Type 1 column overwrites complicate the snapshot mechanics — dbt’s native snapshot doesn’t handle this cleanly. Storage is unchanged versus Type 2, but write amplification is higher (you’re touching every prior row of a business key whenever the latest version changes). Worth it when the BI workload genuinely needs both lenses in adjacent columns — overkill otherwise.
Type 7 dual foreign keys
Fact carries BOTH the surrogate key AND the business key.
fct_revenue.customer_sk → v_dim_customer_historical
fct_revenue.customer_bk → v_dim_customer_current Same need as Type 6 without in-row overwrites. Cleaner audit, two FKs on the fact.
Data signals that point here: same query need as Type 6 — analysts want both the historical and the current attribute available on the same query — but the Type 6 overwrite mechanics are a poor fit for your situation. Specifically you hit Type 7 when (a) the dim is high-cardinality and Type 6’s “rewrite every prior row on every change” amplifies writes unacceptably, (b) auditors don’t want any in-place overwrites on a Type 2 chain, or (c) the modeling team prefers explicit dual-lookup semantics over the implicit “look at the right column” affordance Type 6 offers.
Query-pattern need: identical to Type 6 — same fact volume sliced by both historical and current dim values. The difference is purely how you expose the two lenses to downstream BI.
What Type 7 does: facts carry two foreign keys to the same dim. One is the standard Type 2 surrogate key (points to the historical row valid at fact-event time). The other is the durable business key itself, used to join to a Type 1 view over the dim that exposes only is_current = true. The dim itself stays a pure, audit-clean Type 2 chain — no in-row overwrites. The “current” lens is materialized as a view, not a column. (Kimball Design Tip #152)
Worked example. A fct_revenue row carries both customer_sk (the Type 2 surrogate, points to the dim row valid at event_time) and customer_business_key (the natural key, joinable to a Type 1 view). Same customer plan-tier scenario as the Type 6 worked example — but the dim itself stays clean:
sk business_key plan_tier effective_from effective_to is_current
──── ──────────── ───────── ────────────── ──────────── ──────────
3001 CUST_42 Bronze 2025-08-01 2026-02-15 false
3247 CUST_42 Gold 2026-02-15 9999-12-31 true Two views built on top:
create view v_dim_customer_historical as
select * from dim_customer;
create view v_dim_customer_current as
select * from dim_customer where is_current = true; A fact arriving from January 2026 joins:
v_dim_customer_historicalviacustomer_sk = 3001→ reports plan_tier = Bronze (the value at fact-event time)v_dim_customer_currentviacustomer_business_key = CUST_42→ reports plan_tier = Gold (today’s value)
Both lenses available on the same query. If the customer later downgrades to Silver, the dim closes the Gold row and opens a new Silver row — no prior rows are rewritten. The current view now reports Silver; the historical view continues to report Bronze for January 2026 facts.
dbt + Iceberg implementation. A standard Type 2 snapshot plus two thin views — no special materialization needed:
-- models/marts/dim_customer.sql — standard Type 2 snapshot
{{ config(materialized='table') }}
select * from {{ ref('dim_customer_snapshot') }}
-- models/marts/dim_customer_current.sql — Type 1 lens
{{ config(materialized='view') }}
select * from {{ ref('dim_customer_snapshot') }} where is_current = true The fact-loading job stamps both FKs at insert time. The historical FK is resolved by joining stage facts against dim_customer on business_key AND event_time >= effective_from AND event_time < effective_to; the current FK is just the business key passed through.
Type 6 vs Type 7 — which to pick. Both serve the same query need. The decision tree:
| Question | Pick Type 6 | Pick Type 7 |
|---|---|---|
| Dim row count | Low-to-mid (thousands to low millions) | High (tens of millions+) |
| Write amplification budget | Tolerant — overwrites on every prior row of a changed BK are fine | Tight — every change must touch only one new row |
| Audit posture | Lenient — Type 1 columns mixed into the Type 2 chain are acceptable | Strict — auditors require an immutable Type 2 chain |
| Analyst skill assumption | Beginners — “current_X and historical_X on the same row” is teachable in one diagram | Modelers — analysts can be trained to pick the right view per question |
| Storage budget | Tighter — adding columns is cheaper than carrying dual FKs on every fact | Looser — fact tables can absorb a second FK column |
| BI tool — does it support multiple joins to the same dim cleanly? | Either works; Type 6 is single-join | Type 7 needs the BI tool (Looker LookML, Power BI, Tableau) to handle two relationships to one dim cleanly. Most modern BI does. |
Our default: Type 7 for high-cardinality customer / product / location dims in regulated industries; Type 6 for small-to-mid B2B dims where the in-row dual columns let SMB analysts answer both questions without learning two views. We do not use Type 6 once dim_X crosses ~5M rows or when audit ever asks “show me only Type 2 rows that have not been touched since their original insert” — at that point the rewrite mechanics work against you.
What you give up. Slightly more complex fact schema (two FKs to the same dim looks redundant until you understand the pattern). Some BI tools require explicit “alias” or “view” definitions to navigate dual-FK relationships — adds setup cost. The upside vs Type 6: no in-row overwrites, clean Type 2 audit trail, lower write amplification at scale.
Modern variants
Outside the Kimball taxonomy, three database-level features ship “historical capture” as a primitive. They are not substitutes for the dimensional-modeling decisions above, but they change what’s available to compose with.
SQL Server / Azure SQL system-versioned temporal tables. Define a table with GENERATED ALWAYS AS ROW START/END columns plus a paired history table; queries against the table use FOR SYSTEM_TIME AS OF '<date>' to read any historical state. Microsoft explicitly lists “maintaining a slowly changing dimension for decision support applications” as a use case. (Microsoft Learn: Temporal Tables) — but the engineering pattern still requires you to model surrogate keys, mini-dimensions, and dual-FK lookups on top of the row-versioned storage. Temporal tables are a primitive, not a dimensional model.
PostgreSQL 18 temporal constraints. Adds PRIMARY KEY (id, valid_at WITHOUT OVERLAPS) and exclusion constraints (EXCLUDE USING gist (id WITH =, valid_at WITH &&)) that enforce non-overlapping validity intervals per business key — this is the database guaranteeing you can’t have two is_current = true rows for the same customer. (Neon: PostgreSQL 18 Temporal Constraints, Aiven Blog) Pair this with your dbt-managed SCD2 on top and you get a hard constraint against the “overlapping period” pitfall.
Apache Iceberg time-travel. Every Iceberg table commits as a snapshot; SELECT * FROM tbl FOR VERSION AS OF <snapshot_id> reads any past state. Free audit log — but snapshots are whole-table, not per-business-key, and the default Glue maintenance prunes them after ~5 days. (AWS: Apache Iceberg, Athena time-travel docs, DZone: Dark Side of Iceberg Time Travel) Time-travel is your forensic-debugging tool, not your dimension of record.
Datomic / XTDB event-sourced dims. Stores valid time and transaction time on every datom natively. Dimensions are projections of an immutable event log. Excellent when the source system is already event-driven; awkward when you’re trying to retrofit on a CRUD database. (Datomic Transaction Model, XTDB Bitemporality)
The integration pattern: use the modern variant as the audit substrate under your modeled SCD. Iceberg time-travel + dbt-modeled SCD2 is the canonical TwiceData layering — the substrate gives you forensic recovery, the modeled dim gives you business-grade queryability. See “The layered approach” below.
Bitemporal modeling
Three real-world classes of workload force you past Type 2 into bitemporal:
- Insurance / late knowledge. A claim is filed today (transaction time) for an accident that happened in March (valid time). Both timestamps must be queryable.
- Regulatory / financial restatements. Q1 revenue was filed at $X; an audit later restates it to $Y. Both the original Q1 close ($X) and the restated value ($Y) must remain query-able forever.
- Risk / decision-time analytics. “What did our credit model know about this borrower when it approved the loan?” requires reconstructing the state of every dim as of the decision timestamp — not as of today.
What bitemporal does: every dim row carries four timestamps:
business_key attribute valid_from valid_to recorded_from recorded_to
──────────── ───────── ────────── ────────── ───────────── ─────────────
CUST_42 Bronze 2025-08-01 2026-02-15 2025-08-01 2026-04-01 ← original record
CUST_42 Bronze 2025-08-01 2026-02-15 2026-04-01 9999-12-31 ← restatement
CUST_42 Gold 2026-02-15 9999-12-31 2026-02-15 9999-12-31 Queries pivot on which time dimension answers the question. “What plan tier was active on 2026-01-15 as we know it now?” uses valid_* only. “What did our system report the plan tier as on 2026-03-01?” pivots on recorded_* (XTDB: Bitemporality, SQL2011 Temporal).
Worked example. A SaaS customer’s plan tier was originally recorded as Bronze for the period 2025-08-01 to 2026-02-15. On 2026-04-01, a billing audit reveals it should have been Silver for that entire window (the customer was on a grandfathered Silver plan that was misclassified). Bitemporal lets the original Bronze record AND the corrected Silver record both exist — original reports filed before 2026-04-01 used the Bronze value (correct as of when they were filed), audit reports filed after 2026-04-01 use the Silver value (corrected world-state).
When to reach for bitemporal: billing reconciliations, regulated financial reporting, ML feature stores that need “as-of decision time” reproducibility. Insurance, healthcare, fintech. Not the default for most B2B SaaS — but the right answer when the workload demands it, and the wrong layer to skip when it does.
The layered approach
The three-layer pattern: each layer answers a different question, and none replaces the others. Iceberg time-travel is the forensic audit log (whole-table snapshots, free with the table format, pruned by maintenance — useful when you need to ask “what did the table look like before yesterday’s bad merge”). The modeled SCD2 dim is the dimension of record (per-business-key effective dates, joined by facts via half-open intervals — this is what answers “what was customer 42’s plan tier on Jan 3”). Bitemporal is the restatement layer (adds recorded_from/recorded_to, engaged only when valid time and transaction time diverge — answers “what did we report the value as, on a given date”).
The recurring audit finding: teams routinely try to answer modeled-SCD2 questions using Iceberg time-travel and get wrong answers, because time-travel is snapshot-granular (not business-key-granular) and gets pruned by table maintenance after ~5 days on default AWS Glue settings.
Default stack we deploy on an AWS Iceberg-on-S3 + dbt project at mid-market scale:
- Bronze (raw CDC). Postgres → Debezium → MSK → Iceberg append-only bronze, merge-on-read. Compaction nightly. Iceberg time-travel here is the forensic substrate.
- Silver (modeled SCD2 dim). dbt
snapshotwithstrategy='timestamp',updated_atfrom the source,dbt_valid_to_current: '9999-12-31'. Column names re-mapped to canonical Kimball:effective_from,effective_to,is_current,surrogate_key. One snapshot per slow-moving entity. - Mini-dim. Any attribute that mutates more often than weekly per business key carves out to its own Type 4 dimension. Customer demographics, behavior scores, RFM bands.
- Type 6 or Type 7 added selectively when BI explicitly needs “historical fact volumes grouped by current segment.” Type 6 for small-to-mid dims, Type 7 once dim row count climbs.
- Gold (Type 1 aggregates). dbt incremental tables filtered to
is_current = true. Convenience layer for BI, never the source of truth. - Bitemporal layer. Opt-in. Adds
recorded_from/recorded_to. Engaged only for billing reconciliations, regulatory reports, and ML feature stores. - Joins. Always
fact.event_time >= dim.effective_from AND fact.event_time < dim.effective_to(half-open). Neveris_current = trueon the join. - Iceberg time-travel. Reserved for data engineers’ incident response. Not exposed in BI.
The recipe deviates when the data deviates: pure event-sourced products go declarative-CDC (or stay event-native, no relational dim at all). Pure Type 1 dims with no audit obligation skip the entire SCD2 layer. Bitemporal turns on only when the regulatory or billing math demands it.
Common pitfalls
Ranked by how often we find them in production audits:
1. is_current = true as the default join filter. Quietly produces wrong answers the moment a fact arrives late. Train analysts and BI builders to always use the date-range join. Reserve is_current exclusively for the “as-of latest” gold-tier views. (SystemOverflow: SCD Failure Modes)
2. Overlapping validity periods / duplicate currents. Two rows with is_current = true for the same business key. Causes double-counting. Detect:
select business_key, count(*)
from dim_customer
where is_current = true
group by business_key
having count(*) > 1; Prevent: PG18-style temporal exclusion constraints if your warehouse supports them, otherwise a CI-runtime test in dbt.
3. Out-of-order CDC events. Debezium can emit events out of order under retries or partition rebalances. Always sequence by source LSN / commit timestamp, never by arrival time. (DataOps School)
4. Dimension explosion (the monster dimension). A volatile attribute on a Type 2 dim of millions of business keys generates a row-count explosion. Move to Type 4 mini-dim, or accept Type 1 + an audit log on the side. (Kimball: Monster Dimensions)
5. Timezone drift in effective dates. Storing effective_from in local time on the dim and UTC on the fact produces phantom join misses around DST transitions. Standardize on UTC timestamptz everywhere, or carry the IANA timezone alongside.
6. Backfilling history that the source never tracked. You can’t conjure prior versions out of thin air. Either accept “history starts now,” or import a snapshot from any legacy backup that does carry pre-history. Document the cutover line.
7. Surrogate-key collisions on reactivation. A customer churns then signs up again. Same business key, new chain — decide upfront whether the new chain stitches to the old (continuous history) or starts fresh (separate dim lineage). Both are valid, the choice is policy.
8. GDPR right-to-be-forgotten vs audit retention. Type 2 preserves PII indefinitely; GDPR Article 17 mandates erasure on request, “across all systems, including backups.” (Reform: Right to Erasure) Pseudonymize PII columns (hash, redact, or replace with a stable opaque token) rather than literally deleting rows — the dim’s analytic shape survives, the personally-identifying content does not.
What we will not claim (anti-fabrication)
- “Iceberg time-travel replaces SCD2.” The AWS Big Data blog says the opposite — it proposes layering an SCD2 view on top of Iceberg’s change-log view. Time-travel is snapshot-granular and gets pruned. (AWS blog)
- “dbt snapshots are SCD2-complete.” They implement the canonical Type 2 pattern but don’t give you a Kimball-style integer surrogate key (only
dbt_scd_id), and they can’t shape Type 6 in a single model. Custom incremental models are required for the richer shapes. (dbt Docs: Snapshots) - “Temporal tables replace SCD2.” SQL Server temporal tables, PG18 temporal constraints, and Datomic give system-versioned history at the row level. They do not model surrogate keys, junk dimensions, or mini-dimensions. They are an implementation primitive, not a dimensional model.
- “SCD2 is always the right default.” Kimball, dbt’s own docs, and DataOps School all caution against Type 2 for rapidly changing attributes (use Type 4), raw event streams (use append-only), and cases where current state alone suffices (use Type 1).
- “Type 6 is universally documented.” Wikipedia flags the attribution of Type 6 to a Kimball/Pace conversation at Kalido as needing a citation. We use the pattern; we don’t claim its lineage is bulletproof.
- Specific performance numbers (e.g. “SCD2 makes queries 10× slower”) are workload-dependent and not generalizable without your actual schema and data volume. We refuse to quote them.
- “We saved client X $Y by moving them from Snowflake to Iceberg.” We’ve helped teams cut warehouse spend by switching to Iceberg-on-S3 + Athena/Trino for the right workloads — but the savings number is a function of your storage, your query mix, your reservation commits, and your team’s tooling familiarity. We will model your specific case during a Stack Audit; we will not quote a generic multiplier.
- “PostgreSQL 18 temporal constraints make bitemporal trivial.” They enforce the non-overlap constraint on a single time axis. Bitemporal still requires you to model and maintain two time axes; PG18 covers part of the constraint surface, not the modeling work. (Aiven Blog)
- “Type 4 mini-dims always reduce storage.” They reduce storage when the volatile-attribute combinations are bounded (RFM bands, score buckets). They do not if the attribute is genuinely high-cardinality continuous (raw scores, exact timestamps) — at that point a mini-dim degenerates back into a Type 2 monster. Banding the attribute is what makes the mini-dim cheap.
- “Half-open intervals are universal.” Some warehouses, dbt-utils snippets, and legacy Kimball ETL ship with closed-on-both-ends
BETWEENjoins. Both can be made to work, but mixing them in the same warehouse silently double-counts on boundary instants. Pick one convention per project and enforce it in code review.
How to start
A free 60-minute architecture call is the entry point. We use it to map the SCD types in your existing pipeline (if any), identify which ones are wrong for the workload, and quote a fixed-bid scope to remediate. Most engagements that touch this layer fall under the Stack Diagnostic or Embedded Sprint shapes; full pipeline rebuilds under Quarter Stack.
For the broader picture of how dimensional modeling fits into the AWS-native data stack we deploy, read From a fresh AWS account to dashboards and AI chat — this article is its companion deep-dive on the historical-capture layer specifically.
––