Most mid-market teams don’t realize their analytics roadmap is implicitly a $90K–$180K/yr Snowflake or a $70K–$300K/yr Databricks subscription. Below we walk through the alternative pipeline we build for $5K–$25K/yr all-in, on the same AWS account the customer just provisioned. Same dashboards. Same governed metrics. Same AI-chat surface on top. Roughly 75–85% cheaper. Real, sourced numbers throughout.
This is what a typical TwiceData engagement looks like when you zoom in. The shape we’re describing is what we call the canonical engagement: a customer arrives with a freshly provisioned AWS account, their transactional data sitting in a Postgres database that the vendor’s application writes to, and a mandate from the board to produce governed metrics — fast. The board wants ARR, retention, and product-usage numbers in Looker every Monday morning. The data team wants natural-language chat over the same numbers so non-analysts can self-serve. Today, both of those exist as a half-broken Looker model nobody trusts and a spreadsheet circulated by email on Sunday night.
Twelve weeks later they have a working end-to-end pipeline they own: AWS-native, governed, observable, and consumable through both a dashboard surface and a chat surface. This post walks the entire build, week by week, with every architectural decision documented and every tradeoff named. It is the reference engagement we wire for clients on this stack, generalized so other teams can read it. No specific customer is named; the architecture is real, the choices are the ones we make in production.
If you want the abstract positioning, the data-teammate manifesto is the place. This article is the detailed proof.
Week 0 the starting position
The customer’s first call usually sounds like one of three sentences:
- “We bought AWS last quarter and we have no idea what we’re doing.”
- “We have data in Postgres but no governed metric anywhere downstream.”
- “Finance and product are arguing about ARR again.”
- “Our Snowflake bill is about to hit $20K/mo and our CFO wants to know what we can do about it.”
All four describe the same engagement. They just enter from different doors. The fourth one — the cost door — has become the most common since 2025, and is the one most directly addressed by the architecture below.
The starting position we accept as a given:
- A freshly provisioned AWS account. No VPCs, no IAM roles, no S3 buckets yet. Maybe a Lightsail VM somewhere from an earlier experiment. We assume nothing.
- A Postgres database the vendor’s application writes to. Could be RDS, could be self-hosted on EC2, could be a third-party SaaS platform’s exposed read replica. The key fact is: rows mutate constantly, and we cannot run heavy analytical queries against it without affecting the application.
- A small data team — usually one or two analysts, no dedicated platform engineer. The customer is hiring, or they are not hiring; either way, the engagement assumes they want to operate the result themselves on day 91.
- No Looker workspace yet, or one with a broken legacy model. Sometimes we inherit a Looker instance that an earlier consultancy half-built. Always cheaper to discard the LookML and re-derive it from the semantic layer than to refactor what is there.
The audit phase identifies which of these assumptions actually hold for the specific customer. Two days of conversations and read-only credentials, no commitments yet.
Week 1 the audit and the SOW
We run the Stack Diagnostic shape for week one. The deliverable is a fixed-bid scope for everything that follows. Three things have to be true before the SOW gets signed:
- The data we need is genuinely in the Postgres source. Not “we think it is” — confirmed by sampling. The number of engagements we have walked away from after week one because the user-event table we were supposed to model from did not actually contain user events is non-zero.
- The customer has decision-rights over the AWS account. No “we need to ask IT” friction. The teammate model breaks if we cannot provision infrastructure in the customer’s own cloud.
- The metric definitions we will encode are signed off by someone who can sign off. ARR is not just a SQL query. ARR is a policy. The CFO has to agree.
Once those three pass, the SOW is fixed-bid for 11 weeks (week 1 already running). Day 91 is the handoff date. We hold ourselves to it because we have wired the pricing that way: cost overruns are our problem, not the customer’s.
Tools picked in week 1 and why:
Mature; the customer's future hires will know it. Step Functions is faster to set up but harder to debug at hour three of a midnight pipeline failure.
Sources: Apache Airflow · AWS MWAA
Cheap, native, and the right scale for CDC-style incremental loads. We graduate to ECS Fargate only if Lambda's 15-minute ceiling becomes a real constraint.
Sources: Lambda pricing · 15-min timeout quota
The vendor-neutral substrate. S3 storage is ~$23/TB-mo; Iceberg gives time-travel and schema evolution. The same substrate Snowflake and Databricks themselves now read — adopting Iceberg directly skips ~70–85% of their bill at mid-market scale.
Sources: Apache Iceberg · Snowflake-managed Iceberg · Databricks reads Iceberg
Athena is on-demand at $5/TB scanned — mid-market dbt workloads land at ~$25–$100/mo of query cost. Trino-on-EKS when query concurrency outgrows Athena's slot pool (~$1.5K–$4K/mo for a small Spot fleet). DuckDB/MotherDuck for embedded analytics and CI dbt-locally.
Sources: Athena pricing · data-on-eks Trino blueprint · DuckDB
dbt SL holds the metric definitions in version control. Now an industry standard — the Open Semantic Interchange (Jan 2026, signed by Snowflake, dbt, Cube, AtScale, Databricks, 40+ others) means this YAML format is portable across vendors. LookML proxies through to it for Looker-specific shapes.
Sources: dbt Semantic Layer docs · LookML overview
Customer's pick, common in our engagement profile. Mode, Hex, and Metabase work too — the semantic layer abstracts the dashboarding choice.
Sources: Looker docs
The LLM never writes SQL directly. It picks certified metrics from the semantic layer; the semantic layer compiles the SQL deterministically. Raw NL→SQL hits ~85–90% accuracy with frontier models, ~10% with cheap open models; the semantic-layer wrapper takes both to 98–100% for in-scope questions. The wrapper, not the model, is what makes it reliable — which is why we can use a cheap open model and still ship a production-grade chat surface.
Sources: dbt 2026 NL→SQL benchmark
This is opinionated. It is the engagement we know how to run cleanly. We will deviate on customer constraint (regulatory, existing tooling, geographic data residency), but the default stack above is what week 1 picks unless something forces a change.
Weeks 2 to 3 the ingest layer
The naive way to ingest from Postgres is to run a nightly cron that selects everything and dumps it into S3 as CSV. We have seen it. It works for thirty days, then schema drifts upstream, the parser fails silently, and the customer’s CFO comes in on Monday with the wrong ARR number.
The right way is more disciplined:
- Set up a Postgres logical-replication slot on the source. We do this read-only, with a dedicated user role that has REPLICATION privilege and nothing else. Audit logs flow to CloudWatch.
- Lambda functions consume the replication stream via the
wal2jsonplugin. One Lambda per source table (or per group of related tables — a billing-events Lambda, a user-events Lambda, etc.). - Each Lambda validates the row against a JSON Schema contract before writing. The contract lives in the dbt repo’s
contracts/directory — it is a real artifact the customer can read. Schema drifts fail loudly (CloudWatch alarm, Slack ping), do not propagate silently. - Validated rows land in the bronze layer of the Lakehouse as Iceberg tables, partitioned by ingest date.
- Airflow does NOT pull the stream — Lambdas handle the real-time ingest. Airflow’s job is to schedule the dbt rebuilds and the BI-cache refreshes that consume the bronze tables.
The choice of “Airflow orchestrates rebuilds; Lambda does the ingest” is deliberate. Airflow is a great scheduler and a poor real-time event processor. Lambda is the opposite. Each tool runs the workload it is best at; neither runs the workload it is bad at.
Schema-drift example we always plan for: the vendor adds a new column to users.preferences (a JSONB column gets a new nested key). Most ingest pipelines silently swallow it. Ours fails the JSON Schema contract on the next row, alarms the data team, and the column gets added to the contract as a deliberate human review — not a silent regression three months later.
Week 4 the storage layer
We land everything in three tiers in the Lakehouse:
- Bronze — raw, validated against the contract, untransformed. One row per source-system event. Partitioned by ingest date. Append-only. This is the audit trail; we never overwrite bronze.
- Silver — cleaned, deduplicated, type-cast. One row per business-entity event. The result of light dbt transformations that handle the “garbage in” without yet making business-logic decisions.
- Gold — modeled, certified, semantic-layer-ready. ARR, MRR, retention cohorts, product-usage rollups. Every gold table maps 1:1 to a metric or entity the semantic layer exposes.
Partition strategy in each tier is tuned during week 4. Bronze partitions by ingest_date because that is how we’ll query it during audits. Gold partitions by business_date (often different from ingest_date by hours or days) because that is how the dashboards and the chat layer will query it.
Cost-control measures wired in week 4:
- S3 lifecycle rules: bronze data older than 365 days goes to Glacier.
- Iceberg time-travel retention: 30 days on silver, 7 days on bronze. (Time-travel is cheap but not free.)
- Query-engine warm pools are auto-stop after 10 minutes of idle. We have seen customers waste tens of thousands of dollars on always-on warehouses.
- A nightly
OPTIMIZEjob compacts small files. OptionalVACUUMruns weekly with conservative retention.
The cost comparison at this layer, on the typical mid-market profile:
| Stack | Storage + query | Source |
|---|---|---|
| Iceberg on S3 + Athena | ~$25–$100/mo + ~$50–$150/mo S3 + ~$200/mo Glue | Athena pricing + S3 standard tier |
| Snowflake equivalent | $5K–$15K/mo (Standard/Enterprise edition) | Fivetran benchmark + TBDC case study |
| Databricks equivalent | $5K–$25K/mo (DBU + EC2/S3 cloud cost) | Databricks pricing breakdown |
We are choosing the substrate Snowflake and Databricks themselves now read. Iceberg on S3 is not “going off-piste” — it is the format both vendors expose through their own external-table features (CREATE EXTERNAL TABLE ... USING ICEBERG). The Open Semantic Interchange announcement in January 2026, signed by Snowflake, dbt, Cube, AtScale, Databricks, and 40+ others, locks this in as the industry-standard direction. Vendor lock-in starts in the proprietary compute and the proprietary semantic layer — at the storage layer there is no lock-in to escape from.
A note on time-travel and historical capture, since teams confuse these. Iceberg time-travel (VERSION AS OF / TIMESTAMP AS OF) is the audit log, not the dimension of record. Default snapshot retention is short — AWS Glue’s managed maintenance keeps ~5 days by default, configurable up. If you need to answer “what was this customer’s plan on 2024-03-15?” — that lives in a modeled SCD Type-2 dimension table, not in VERSION AS OF. AWS’s own guidance is explicit on this point (AWS big-data blog on SCD2 + Iceberg). We cover the historical-capture pattern in detail in its own section below.
By end of week 4 the customer’s data is sitting in the lakehouse, structured, queryable, and cost-bounded. No dashboards yet, but the foundation is sound — and the storage layer alone is already 50–250× cheaper than the proprietary alternative would have been.
Weeks 5 to 7 the modeling layer
This is the longest part of the engagement and the most consequential. The week-1 audit identified the metric set; weeks 5-7 implement it.
dbt project structure we lay down:
models/
bronze/ # bronze-→-silver staging models
stg_users.sql
stg_events.sql
stg_subscriptions.sql
silver/ # silver-→-silver normalization
int_active_subscriptions.sql
int_user_sessions.sql
gold/ # business-facing certified models
fct_arr_daily.sql # daily ARR snapshots
fct_retention_cohorts.sql
fct_product_usage.sql
dim_customer.sql
metrics/ # dbt Semantic Layer definitions
arr.yml
mrr.yml
retention.yml
active_users_28d.yml
tests/
generic/ # cross-cutting tests (not_null, unique, etc.)
singular/ # per-model business-logic tests
contracts/ # JSON Schema files (referenced by ingest Lambdas) The ARR model is the headline. ARR is the metric that finance, product, sales, and the board all care about and almost always disagree on. The dbt model encodes the customer’s signed-off definition — typically something like “sum of annualized recurring contract value at the customer level, excluding one-time charges, excluding paused subscriptions older than 60 days, denominated in USD as of the snapshot date.”
The model is ~50-100 lines of well-commented SQL. Every CTE has a purpose. Every JOIN is documented in the model’s YAML file. We test:
- ARR cannot be negative.
- ARR_t+1 must not deviate from ARR_t by more than 25% (catches accidental joins blowing up the count).
- The customer count in the ARR snapshot matches the customer count in the source system within tolerance.
The tests fail loudly in CI, and CI runs on every PR to the dbt repo.
Other gold models follow the same pattern: certified business definitions, documented assumptions, tested invariants. Retention cohorts are tricky (the join logic that defines “active” varies by customer); product usage is straightforward (sum events over a window).
By end of week 7 there is a working dbt project producing certified gold tables in the Lakehouse on a schedule. Airflow runs dbt build --select state:modified+ nightly; full rebuilds run weekly.
Week 8 the semantic layer
The semantic layer is where the metric definitions become exposable to downstream consumers — Looker AND the AI chat layer — without each consumer reinventing the SQL.
We write the metric definitions in dbt Semantic Layer YAML format, version-controlled in the same repo as the dbt models:
metrics:
- name: arr
label: "Annual Recurring Revenue (USD)"
description: >
Customer-signed ARR definition. Snapshot at midnight UTC.
Excludes one-time charges, paused-60+ subscriptions, and
non-USD plans below $10K total contract value.
type: simple
type_params:
measure: arr_usd
filter: "{{ Dimension('customer__is_paused_60d') }} = false" Once the metric is defined in dbt SL, both Looker and the AI chat layer query it via the same API — no double-definition. If the CFO changes the ARR definition in week 32, we edit one YAML file and both surfaces update on the next sync.
LookML is generated for Looker compatibility. A small generator script reads the dbt SL definitions and emits a LookML model that proxies through to the dbt SL API. Looker users see normal LookML explores; the SQL underneath is always the certified definition.
This layer is also the AI-chat substrate. The dbt MCP server exposes the semantic layer to LLMs over the Model Context Protocol — metric definitions, dimensions, entity relationships, the time-grain options. An LLM does NOT write SQL against the warehouse; it picks the certified metric + dimensions + filters, and the semantic layer compiles SQL deterministically. The accuracy difference is not subtle. From dbt’s published 2026 benchmark (Semantic Layer vs Text-to-SQL):
| Model | Raw text-to-SQL | Via Semantic Layer |
|---|---|---|
| Sonnet-class | 90.0% | 98.2% |
| GPT-5.3-codex | 84.1% | 100.0% |
The single most useful sentence from that benchmark, and the one we put in front of every prospect skeptical that AI chat over their data can work:
“With text-to-SQL, failure looks like a plausible but incorrect answer. With the Semantic Layer, failure looks like an error message.”
That is the structural guarantee. Not “the AI is better” — the architecture refuses to silently lie. The chat layer can route a question to a certified metric or it can return “I don’t know” — it cannot fabricate a number from raw SQL.
Week 9 the dashboards layer
We build the first set of dashboards in week 9 — typically four or five:
- Board-level ARR dashboard — ARR snapshot, MoM growth, top customer movements, retention overview. One page. Sized for an executive screen.
- Product usage dashboard — DAU/WAU/MAU, feature adoption funnels, cohort retention.
- Sales pipeline overview — pipeline value, close rates, time-in-stage (if Salesforce or HubSpot is wired).
- Operational health — pipeline freshness (when did each layer last update), data-contract failures, model run times.
- Cost dashboard — query-engine spend, Lakehouse storage costs, ingest Lambda invocations.
These are our dashboards. The customer’s team almost always wants more — segment views, custom slices, marketing-specific cuts. Week 9 deliberately stops short of building those, because the customer’s team should be the one designing them. We provide the explores, the certified metrics, and a working LookML pattern; the customer’s analysts use Looker’s UI to build the dashboards they actually need.
This is the “your team designs the viz” moment. We hand them functional dashboards plus a thirty-minute screencast walking through how the LookML explore works. They take it from there.
Week 10 the AI chat layer
This is the layer that did not exist in most consulting engagements five years ago, and increasingly cannot be left out of new builds. The board wants to ask “what was our churn last quarter for enterprise customers signed in 2024?” without learning LookML.
Architecture (the wrapper-not-writer pattern):
User question (chat UI)
↓
NL → structured query (FastAPI service)
↓
LLM ← dbt MCP server (semantic-layer metadata)
← metric definitions, dimensions, time grains
↓ outputs {metrics, dimensions, filters, time_grain} JSON (NOT SQL)
↓
dbt Semantic Layer compiles certified SQL deterministically
↓
Result + provenance (which metric, which model, which run)
↓
Chat UI renders answer + citation The critical design choice: the LLM never writes SQL. It picks a metric, dimensions, filters, and a time grain — a structured intent JSON — and the semantic layer compiles SQL deterministically. The benchmark numbers we showed in Week 8 are what this buys us: 98–100% accuracy for in-scope questions, structural refusal (not silent hallucination) for out-of-scope ones.
Why this matters for LLM choice and cost. Raw NL→SQL needs a frontier-grade model to hit ~90% accuracy. With the wrapper pattern, a cheap open model (DeepSeek V4 Flash at $0.14/$0.28 per M tokens, or distilled R1 on g5 spot at ~$1/hr) hits the same 98–100% — because the model only has to identify intent, not generate correct SQL. We are paying for routing, not synthesis.
| Hosting option | Approx cost | When to pick |
|---|---|---|
| Open API (DeepSeek V4 Flash) | ~$0.14 in / $0.28 out per M tokens — ~$30–$100/mo at mid-market usage | Default for most engagements. Data-residency tradeoff: the structured intent JSON (NOT raw rows) flies to the API. |
| Self-hosted distilled R1 on AWS g5 spot | ~$1/hr (4× L4) = ~$200–$300/mo for a 24/7 single-replica deployment | When the customer’s data-residency posture rules out any external API call. |
| Self-hosted full DeepSeek V3 / R1 (671 B) on multi-GPU H100 cluster | $5K+/mo | Only at very high query volume; almost never justified at mid-market scale. |
Key constraints we wire in week 10:
- No raw warehouse access for the LLM. It calls the dbt Semantic Layer HTTP API; the API only exposes certified metrics + certified dimensions. The chat layer literally cannot ask “show me the raw users table.”
- Provenance on every answer. The UI shows which metric was queried, which dbt model defined it, and the timestamp of the model run. No black-box numbers.
- Cost ceiling configurable. Token + query-engine spend counted per query; a daily budget alarms the operator if usage spikes.
- Data residency stays in the customer’s AWS account for the self-hosted option, or only the structured intent JSON leaves for the API option. The raw rows the model is reasoning about never leave the warehouse.
Honest about what we will not claim. The LLMs we choose at this layer (DeepSeek V4, distilled R1, smaller open Llama derivatives) score badly on raw NL→SQL benchmarks — DeepSeek V3 hits ~8.8% on Spider 2.0 (Spider 2.0 paper). They are nowhere near GPT-5.x or Sonnet-class on that specific task. The reason the chat surface works anyway is the semantic-layer wrapper. We are not claiming “open models beat the premium ones at writing SQL” — we are claiming “the architecture removes the need for them to.” That is a defensible engineering claim. The cheap, accurate, vendor-portable result is the consequence.
Historical capture, change-data-capture, and SCD2
Three patterns get conflated here. They are not interchangeable. The engagement wires the right one for the customer’s scale and audit requirements.
1. Iceberg time-travel. Snapshots of the table at points in time. SELECT * FROM fct_arr FOR VERSION AS OF '2024-03-15'. Default snapshot retention on AWS Glue’s managed maintenance is ~5 days, configurable up. Time-travel is the audit log — answers “what did the system see on that date” — but it expires. It is not a long-term dimension of record. AWS’s own guidance is explicit on this point (SCD2 on Iceberg).
2. SCD Type-2 modeled dimensions. A dim_customer_history table with surrogate_key, business_key, effective_from, effective_to, is_current. Built nightly by dbt incremental + MERGE INTO from the silver tables. This is the dimension of record — answers “what plan was this customer on as of 2024-03-15” — and lives forever. Pair this with time-travel on the silver source for full auditability.
3. Streaming CDC into Iceberg. When the source mutation rate exceeds ~1M rows/day or the latency target is sub-hour. Canonical stack: Postgres logical replication → Debezium → Kafka → Flink (or Spark Structured Streaming) → Iceberg merge-on-read tables with equality deletes. Operational reality is that exactly-once is hard; managed tools (Estuary Flow, Streamkap, RisingWave, Fivetran Managed Lakes) absorb the operational burden. (streamkap.com CDC→Iceberg, RisingWave production lessons)
Decision rule by scale:
| Source mutation rate | Pattern | Why |
|---|---|---|
| Under 1M rows/day, batch-OK | dbt incremental SCD2 on Iceberg + Lambda CDC | Simplest; one tool (dbt) owns the entire history layer; the Lambda ingest path is enough. |
| 1–10M rows/day, near-real-time | Debezium + Kafka + Flink → Iceberg MOR + dbt SCD2 on top | Streaming for freshness; dbt-modeled SCD2 still owns the dimension of record. |
| >10M rows/day, sub-second | Managed CDC service (Estuary/Streamkap/RisingWave) + dbt SCD2 | The operational burden of self-hosting exactly-once is real; pay someone to handle it. |
Building bespoke when off-the-shelf doesn’t fit. Sometimes the customer’s source system or audit requirement does not map cleanly to any of the above. Regulatory contexts that require full immutability with cryptographic chain-of-custody. Multi-tenant SaaS platforms where the “tenant” dimension is itself versioned. Industrial sensor data where the “history” is millisecond-resolution append-only with no updates. We have built bespoke change-tracking layers in all three contexts — usually a thin Python service that produces Iceberg-compatible append-only writes with a custom provenance schema, downstream of which dbt models build the same SCD2 / fact-table shapes. The bespoke piece replaces Debezium-or-equivalent; everything downstream stays standard. The decision to build bespoke is always cost-justified: we will not build a custom layer when a managed tool does the job within ±20% of the budget.
Weeks 11 to 12 handoff prep
The last two weeks are about transferring the operating burden off TwiceData and onto the customer’s team (or onto our subscription tier, depending on which option they pick).
Runbooks we write:
- Pipeline failure runbook. “If the nightly dbt build fails, here is how to diagnose: 1) check the dbt Cloud / MWAA log, 2) look for failing tests, 3) inspect the contract-failure CloudWatch dashboard, 4) if the failure is in bronze→silver, the upstream schema drifted.” Five pages of these.
- Cost-anomaly runbook. “If lakehouse compute bill spikes 30% week-over-week, run this query to identify the cause. Most often it is a Looker explore that turned off caching.”
- Metric-definition-change runbook. “If the CFO wants to change the ARR definition, here is the PR template, the test you must add, and the audit-trail commit you must reference.”
- Adding a new data source runbook. Step-by-step for adding the next upstream system after the engagement ends.
- AI-chat retraining runbook. When to refresh the vector index, when to retrain the NL→SQL fine-tune, how to evaluate quality regressions.
Lineage documentation:
- Auto-generated from dbt’s lineage graph plus OpenLineage exports.
- Published to a static site hosted on S3 + CloudFront. Searchable.
- One link the customer’s team bookmarks. Lives forever.
Day-91 verification:
- A handoff checklist the customer signs off on:
- Every model in the dbt repo runs from a clean clone.
- Every dashboard loads in under 5 seconds.
- Every AI-chat sample question returns the expected answer.
- Every runbook has been walked through with the customer’s team — not just written.
- A 90-day spending forecast on the AWS bills, broken down by line item. Typical mid-market profile lands here: storage + ingest Lambda ~$200–$500/mo, Athena query spend ~$25–$100/mo, AI-chat compute ~$0–$300/mo depending on hosting choice (API vs self-hosted). Total well under $1K/mo at typical mid-market scale. Compare to Snowflake equivalent ($7.5K–$15K/mo) and Databricks equivalent ($5K–$25K/mo) for the same workload. The 75–85% reduction holds across the 90-day forecast and improves further as the team gets disciplined about partitioning and column pruning.
- A list of every tool, every credential, every IAM role, every secret. All transferred.
Day 91 onward and the handoff question
This is where the engagement diverges. The same shipping artifact, two ownership models:
Option A — You maintain it. Day-91 handoff completes; we walk away from active operation. Your engineers run the nightly pipelines, watch the alerts, add new metrics, retrain the chat layer. We are on call for thirty days of post-handoff office hours (included in the SOW). After that, we are a phone call away on a pay-as-you-go basis, or you cancel us out of the picture entirely. Most customers who pick this option are hiring their first dedicated data engineer in the same quarter as the engagement; the engagement seeds them with a working system to onboard against.
Option B — We maintain it. Subscription tier kicks in at day 91. Monthly cadence:
- Pipeline health review (data contract failures, slow dbt models, cost anomalies)
- Up to N metric-definition changes per month with full CI + audit trail
- Semantic-layer refresh as new business questions emerge
- AI-chat retraining as the question distribution evolves
- BI adapter upgrades as Looker or alternatives release new features
- Quarterly architecture review
Cancellable annually, no lock-in. The customer can flip from Option B back to Option A whenever they decide they want full ownership of operations.
Either way, you own it. The AWS account is yours. The lakehouse workspace is yours. The dbt repo is yours. The Looker workspace is yours. The chat surface — model weights, vector index, FastAPI service — is yours. You can hire any consulting shop tomorrow to take over from us and they will find a clean, documented system. There is no proprietary TwiceData layer in the stack; everything is open-source or your-vendor-of-choice.
That is the deal.
Why this stack, why now
Three reasons this particular combination of tools is the default we deploy in 2026:
- The substrate is now industry-standard, not vendor pitch. The Open Semantic Interchange (Jan 2026, signed by Snowflake, dbt, Cube, AtScale, Databricks, and 40+ others — see overview) standardized the YAML format that defines metrics across vendors. Iceberg on S3 is the storage substrate the proprietary warehouses themselves now read. Choosing the open path is not “going off-piste” — it is choosing the format the industry agreed on.
- AI chat works at production accuracy now, with the semantic-layer wrapper. dbt’s 2026 benchmark shows raw text-to-SQL accuracy on real-world schemas tops out at ~85–90% even on frontier models; the semantic-layer wrapper takes it to 98–100% on in-scope questions, with structural refusal (not silent fabrication) on out-of-scope ones. That gap is the difference between a science project and a board-trusted surface.
- The economics: $80K–$150K/yr saved at typical mid-market scale. Real-world Snowflake bill at this profile: $90K–$180K/yr (Fivetran benchmark, public TBDC case study at $24K/mo pre-optimization). Databricks at the same profile: $70K–$300K/yr. Our Iceberg-on-S3 + Athena stack delivers the same dashboards and the same AI chat surface for ~$5K–$25K/yr all-in. Most customers save 75–85% of their warehouse bill.
Two paths in, keep your stack or migrate it
We are cost-aware, not stack-religious. Half our engagements stay on the proprietary substrate the customer already pays for. The other half migrate — when the math says they should. The conversation goes one of two ways in week 0.
Scenario A — “Can you save me money?” This is the migration engagement. We do the cost analysis: your current bill, your data volume, your query patterns, the actual proportion of your Snowflake credits that go to dbt rebuilds vs interactive BI queries vs ad-hoc analyst exploration vs ML/feature engineering. The decision threshold we use: above ~$100K/yr current spend at mid-market scale, with a workload that is mostly dbt-driven analytical SQL (i.e. the workload our stack is good at), migration almost always pencils out within ~8–12 weeks of payback. Below that threshold, or if 30%+ of the bill is heavy ML workloads, the math is more nuanced and we will tell you so. The Scenario A engagement looks like a Stack Audit + Quarter Stack: we land an OSS-stack rebuild in parallel with your current Snowflake/Databricks workload, run them side-by-side for a quarter to verify metric equivalence, then cut over. Your old warehouse contract finishes its term, you don’t renew. We have done this enough times to underwrite the savings claim with a fixed-bid SOW.
Scenario B — “I like Snowflake, can we still work?” Yes. We work in your existing stack. We do not tear down what’s built; we improve it. dbt + Snowflake is genuinely a great combination — we will optimize your warehouse sizing, refactor your model layer for less compute waste, add the governed semantic layer, and wire the AI chat surface on top of your existing substrate. The skills transfer cleanly because the deeper layer (dbt + a semantic layer) is the same underneath either choice. You keep your Snowflake contract. We keep delivering the same engagement shapes (Stack Audit, Embedded Sprint, Quarter Stack). The cost-savings story in this article does not apply, but the AI-chat, semantic-governance, and pipeline-discipline parts do.
The honest framing: we have a strong opinion about the cost economics at mid-market scale, but we have zero religious attachment to forcing it. A customer who has a stable Snowflake setup and a budget that absorbs the bill should keep it. A customer who is bleeding $20K/mo on a workload that could run for $1K/mo should hear about that. We will tell you which one you are in the audit, with numbers.
The economics, with real numbers
| Stack | Lakehouse layer / yr | Source |
|---|---|---|
| Snowflake | $90K–$180K all-in | Fivetran 1TB benchmark · TBDC case study, $288K→$109K post-optimization |
| Databricks | $70K–$300K all-in | Fivetran benchmark · Databricks pricing breakdown |
| OSS on AWS (this article) | $5K–$25K all-in | Athena pricing $5/TB scanned · S3 standard tier · AWS data-on-eks Trino blueprint |
Defensible marketing claim: Most mid-market customers save $80K–$150K per year on warehouse spend — roughly 75–85% of their previous bill — by running an open-source Iceberg lakehouse on AWS instead of Snowflake or Databricks. Conservative fallback: Most customers cut their warehouse bill by more than half.
The math assumes the mid-market reference profile (50–150 dbt models, 1–5 TB warehouse, ~40 BI dashboards, single-region AWS). Above that profile, the savings can be larger in absolute dollars but the percentage compresses as Snowflake/Databricks contract discounts kick in. Below it (very small teams, under 1 TB), DuckDB/MotherDuck alone may be enough — the engagement shape compresses too.
What we will not do
Honest exclusions, so the engagement is not over-sold. Half of these are anti-fabrication rules we hold to in marketing too:
- We will not migrate off Postgres. If the customer’s source system is genuinely the wrong choice (rare), that is a separate conversation. The engagement assumes Postgres or equivalent is staying.
- We will not own the customer’s business decisions. ARR definitions, retention rules, “what is an active user” — those are CFO and product calls. We encode them; we do not legislate them.
- We will not name a fake customer in our marketing. The reference architecture above is the pattern. Real customer engagements get published only with named consent.
- We will not claim Athena is faster than Snowflake. It is not. The TPC-H Iceberg comparison shows Snowflake’s optimizer is cheaper on 18 of 22 queries and faster on 14 of 22 (source). The honest line is “fast enough for the dbt workload, at a fraction of the cost.”
- We will not claim DuckDB scales infinitely. It is single-node. The right framing is “the right tool below ~1 TB working sets and for embedded analytics” (source). Over that, push to Athena or Trino.
- We will not claim Iceberg time-travel replaces SCD2. AWS’s own blog says explicitly that it does not — model SCD2 separately (source). Time-travel is the audit log, the modeled dim table is the dimension of record.
- We will not claim our open-LLM choice beats premium models at NL→SQL. It doesn’t — DeepSeek V3 scores ~8.8% on Spider 2.0 vs ~23% for frontier models (source). The semantic-layer wrapper is what makes the architecture accurate. We make that explicit in every engagement.
- We will not promise exactly-once Postgres CDC to Iceberg out of the box. Production lessons say it’s hard and requires managed tooling or careful Flink + compaction work (source). We will tell you what level of “exactly-once” your use case actually needs and whether a managed CDC service is worth the line item.
How to start
30-minute architecture call. We sketch your stack-on-AWS plan and tell you within 48 hours whether the cost-savings story applies to your specific scale, your specific workload, and your specific procurement constraints. If your team is already happy on Snowflake or Databricks and the bill isn’t the problem, we will tell you that too — Scenario B is real and we run it often.
Book the call — we send back a scoped proposal within 48 hours.
For the positioning narrative behind this engagement, see your data teammate — from ingest to dataviz, not your vendor. For the playbook on how we build the AI chat layer specifically — its own deep-dive — see programmatic prompts with DSPy. For the broader engineering portfolio see TD Labs.
––