Data validation in the lakehouse

[Figure: hero diagram, "Three checkpoint locations". Source (Postgres: vendor app DB, CDC stream out) → GX A · Ingest gate (pre-bronze contract: schema + nullability + value-range; rejects bad rows) → Bronze + dbt (Iceberg + transforms; validated rows feed bronze, silver, gold) → GX B · Gold gate (business-rule assertions: row count, distribution, referential integrity; halts on fail) → Consumers (Looker · AI chat; read only validated gold partitions). GX C · Drift watch (scheduled checkpoints: hourly distribution + freshness; alerts on drift).]

A board ARR number is wrong on Monday morning. The data team has eight hours to figure out which of three things broke: the source database changed a column type without telling anyone, a dbt model silently absorbed the type cast and produced the wrong aggregation, or the BI layer is reading a stale partition. By Monday afternoon someone is asking why we don’t have tests, and the data lead is explaining — again — that dbt tests run after the model. They catch the broken state. They don’t prevent it from being written.

This is the failure mode data validation exists for. It is not “tests in a different folder.” It is a contract layer that sits at the seams between systems — between the source database and the bronze table, between the dbt gold model and the BI cache, between the gold tables and the AI chat surface that paraphrases them. We use Great Expectations Core 1.x (docs, PyPI) as the OSS validation engine on every engagement, paired with Soda Core for SQL-native production observability, and with a clear-eyed read on what each tool does well in mid-2026. This article walks the integration end-to-end, names the patterns that work, and is explicit about two recent changes that should reshape every team’s evaluation: GX Cloud is sunsetting, and dbt-expectations (Calogica) has been formally unmaintained since late 2024.

Why dbt tests alone are not enough

dbt tests are excellent at one specific job: asserting properties of a modeled table after it is built. The canonical patterns — unique, not_null, accepted_values, relationships — are post-build assertions. If fct_arr_daily violates unique on (customer_id, business_date), dbt’s test framework will fail the run after the table has been materialized.

That is the right shape for some data quality work. But it leaves three classes of failure uncovered:

Source-schema drift. The upstream Postgres table adds a column, drops a column, or changes a column type. dbt’s source freshness check (dbt docs) catches staleness, but not a schema mutation that succeeds downstream because the new shape is implicitly cast. We have audited engagements where a numeric(18,4) got widened to numeric(20,6) upstream, the dbt model absorbed it without error, and three months of revenue rolled up wrong because the implicit precision change shifted a rounding boundary.
Pre-write contract enforcement. Bronze layer ingestion is the moment to reject a malformed row, not the moment to absorb it and hope a downstream test catches it. dbt does not run pre-ingest.
Continuous freshness and distribution checks. Is yesterday’s row count within the historical band? Is the distribution of plan_tier consistent with the prior 30 days? Is a column’s null-rate creeping up? These are probabilistic assertions that dbt’s deterministic test shapes do not express well.

A validation engine like Great Expectations covers all three gaps. It runs before dbt, after dbt, and on a schedule between dbt runs. The integration patterns are documented and stable in 2026 (GX + dbt official tutorial, Airflow provider).
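The first gap, source-schema drift, is easy to picture as a plain comparison between the contract and what the source reports today. The following is a minimal sketch in standard-library Python, not GX code: the `EXPECTED_SCHEMA` mapping, the `diff_schema` helper, and the `discount_code` column are all hypothetical, and we assume the observed schema has already been read from something like the source's `information_schema.columns`.

```python
# Hypothetical contract: column name -> declared SQL type at the source.
EXPECTED_SCHEMA = {
    "subscription_id": "bigint",
    "customer_id": "bigint",
    "mrr_usd": "numeric(18,4)",
}


def diff_schema(expected: dict, observed: dict) -> dict:
    """Return added, dropped, and retyped columns between two schemas."""
    added = sorted(set(observed) - set(expected))
    dropped = sorted(set(expected) - set(observed))
    retyped = sorted(
        (col, expected[col], observed[col])
        for col in set(expected) & set(observed)
        if expected[col] != observed[col]
    )
    return {"added": added, "dropped": dropped, "retyped": retyped}


# Assumed snapshot of the source schema after an unannounced upstream change.
observed = {
    "subscription_id": "bigint",
    "customer_id": "bigint",
    "mrr_usd": "numeric(20,6)",  # widened upstream
    "discount_code": "text",     # new column
}

drift = diff_schema(EXPECTED_SCHEMA, observed)
schema_ok = not any(drift.values())  # False here: one added, one retyped column
```

A type change like the widened `mrr_usd` is exactly the case that an implicit cast lets through silently, so the sketch compares declared types as strings rather than trusting that the downstream load succeeded.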

Important context for anyone evaluating now: the popular dbt-expectations package by Calogica, which many teams used to bring expectation-style assertions into dbt, is officially unmaintained. The repository’s README carries the literal banner “This package is no longer actively supported” with the last release in September 2024 (repo). If a current data stack relies on it, the team should know it’s frozen — and look at Elementary for dbt-native anomaly detection (Elementary’s “Top 3 dbt testing packages” overview) and Great Expectations OSS for the heavier validation work.

What Great Expectations actually does

Stripped of marketing language, Great Expectations is four things. The current API (Core 1.x, GA August 2024, release blog) replaced the older YAML/CLI workflow with a Fluent Python API — about an 86% reduction in setup boilerplate vs the v0 era (Fluent Datasources blog). The CLI itself was retired in v1.0 (farewell to the CLI).

01 · Expectations

Assertions about a single column or table. ~300 built-in expectation types in the current gallery.

import great_expectations.expectations as gxe

# Column-level (Core 1.x expectation classes)
gxe.ExpectColumnValuesToBeUnique(column="customer_id")
gxe.ExpectColumnValuesToNotBeNull(column="event_time")
gxe.ExpectColumnValuesToBeBetween(
    column="plan_tier_price", min_value=0, max_value=1_000_000
)
gxe.ExpectColumnDistinctValuesToBeInSet(
    column="plan_tier",
    value_set=["Bronze", "Silver", "Gold", "Enterprise"],
)

# Table-level
gxe.ExpectTableRowCountToBeBetween(min_value=1_000, max_value=10_000_000)
gxe.ExpectTableColumnsToMatchSet(column_set=[...])

Custom expectations are plain Python subclasses. See the expectation gallery for the current catalog.

02 · Expectation Suites

Named collections of expectations, version-controlled in your repo.

import great_expectations as gx

context = gx.get_context()

# Register a named suite, then attach expectations to it
suite = context.suites.add(
    gx.ExpectationSuite(name="subscriptions_bronze")
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(column="customer_id")
)
suite.add_expectation(
    gx.expectations.ExpectColumnDistinctValuesToBeInSet(
        column="plan_tier",
        value_set=["Bronze", "Silver", "Gold", "Enterprise"],
    )
)
suite.save()

One suite per table per layer. Bronze suites enforce the contract; gold suites enforce business-rule invariants.

03 · Validation Definitions + Checkpoints

A runnable unit that pairs a suite with a data source. The 1.x flow is Batch → ValidationDefinition → Checkpoint.

import great_expectations as gx

context = gx.get_context()

# Connect to data via the Fluent API (the 1.x replacement for YAML)
data_source = context.data_sources.add_pandas(name="staging")
data_asset = data_source.add_dataframe_asset(name="subscriptions_batch")
batch_definition = data_asset.add_batch_definition_whole_dataframe(
    "subscriptions_latest"
)

# Wire suite + batch into a validation definition
validation_definition = context.validation_definitions.add(
    gx.ValidationDefinition(
        name="subscriptions_ingest_gate",
        data=batch_definition,
        suite=context.suites.get("subscriptions_bronze"),
    )
)

# Run against a concrete dataframe
results = validation_definition.run(
    batch_parameters={"dataframe": df_subscriptions_batch},
)
if not results.success:
    raise ValueError(f"Ingest gate failed: {results.data_docs_url}")

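v1.17.0 (May 2026) removed the legacy Batch args (data_context, datasource_name, batch_parameters, batch_kwargs). Any code that still passes these to Batch is broken on current GX; see the v0→v1 migration guide. This is separate from the batch_parameters keyword of validation_definition.run() used above, which is how the 1.x flow passes a dataframe at run time.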

04 · Data Docs

HTML reports of every checkpoint run, hostable as a static site.

Every run renders to a sortable HTML report: which expectations passed, which failed, what the actual values were, links to the underlying batches. Pushed to S3 + CloudFront in our default deploy, the URL lives next to Looker dashboards as "data quality status."

Three long-running issues to know about: S3 reports overwriting themselves (#6314), DataDocs S3 access-denied on run links (#1235), HTML not updating in S3 (#5314). The fix in all three is the same — the validations-prefix and Data Docs-prefix in S3 must be disjoint (neither a substring of the other).
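The disjoint-prefix rule is simple enough to enforce in a deploy-time check. This is a minimal sketch that implements the rule exactly as stated above (neither prefix is a substring of the other); the function name and the example prefixes are hypothetical, not GX configuration keys.

```python
def prefixes_are_disjoint(validations_prefix: str, data_docs_prefix: str) -> bool:
    """True if neither S3 prefix is a substring of the other."""
    a = validations_prefix.strip("/")
    b = data_docs_prefix.strip("/")
    if not a or not b:
        # An empty prefix covers the whole bucket, so it overlaps everything.
        return False
    return a not in b and b not in a


# Hypothetical layouts: the first is safe, the second nests one store in the other.
safe = prefixes_are_disjoint("gx/validations/", "gx/data_docs/")
nested = prefixes_are_disjoint("gx", "gx/data_docs")
```

Running a check like this in CI before the stores are configured is cheaper than discovering the overlap when reports start overwriting each other.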

That is the entire conceptual surface. Everything else in the documentation is integration plumbing.

Adoption signal. GX OSS is healthy in 2026: ~11,500 GitHub stars, ~7M weekly PyPI downloads, ~31M monthly (pypistats). The codebase has 13,600+ commits; the release cadence is steady (1.16.0 in April 2026, 1.17.2 in May 2026 per the changelog). Apache-2.0 licensed. Named production users include LOGEX, THINKMD, Vimeo, Heineken, Rent the Runway, Provectus, Avanade, Calm, and Komodo Health (case studies, Komodo writeup). No quantitative scale figures are published on the case-study pages, so the depth-of-deployment is unverifiable from public sources — we hedge accordingly.

GX Cloud is sunsetting. Use OSS.

This is the most material change to GX evaluation in 2026, and many teams haven’t picked it up yet.

On May 7, 2026, DataKitchen published “When the cloud goes dark — a note to Great Expectations customers” reporting that the company behind Great Expectations was acquired and that the Cloud edition would shut down approximately 30 days later — putting the cutoff around June 6, 2026 (source). The acquirer has not been publicly named in any source we could find as of the time of writing.

Meanwhile, the OSS package (now formally branded as “GX Core”) is healthy and shipping fast — 1.17.2 dropped May 14, 2026. The greatexpectations.io marketing site, as of the same week, still actively sells GX Cloud with no shutdown banner. That contradiction itself is a story; the takeaway for any team evaluating now is:

GX OSS (Core) is the only honest recommendation today. Skip Cloud entirely until ownership and continuity are publicly resolved.

This is also why our default stack does not depend on the Cloud product: every component runs in your own AWS account, on your own infrastructure, with no managed-service dependency. Data Docs ship to S3 + CloudFront. Suites version-control to your own repo. The validation context is plain Python in your own orchestrator.

Where GX sits in our stack: three checkpoint locations

We run Great Expectations in three checkpoint locations on every AWS lakehouse engagement — shown in the hero diagram at the top of this article. Each one catches a class of failure the others can’t.

Checkpoint A — Ingest gate. Runs against each CDC batch before it is written to the bronze Iceberg table. Expectations enforce the contract with the source system: column set, nullability, value ranges, accepted values. A row that violates the contract is either rejected and quarantined (the strict pattern), or routed to a bronze_quarantine table with the failure metadata (the soft pattern — used when the customer is mid-migration and can’t reject yet). The bronze table is never written until the suite passes.
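The soft pattern can be sketched as a per-row split that keeps the failure metadata alongside each quarantined row. This is an illustration in plain Python, not how GX itself routes rows: the rule names, the `_failed_rules` and `_checked_at` fields, and both helper functions are hypothetical, and null `mrr_usd` values are skipped to mirror how a range expectation ignores nulls.

```python
from datetime import datetime, timezone

ALLOWED_TIERS = {"Bronze", "Silver", "Gold", "Enterprise"}


def contract_violations(row: dict) -> list:
    """Return the names of the contract rules this row breaks."""
    failures = []
    if row.get("subscription_id") is None:
        failures.append("subscription_id_not_null")
    if row.get("customer_id") is None:
        failures.append("customer_id_not_null")
    if row.get("plan_tier") not in ALLOWED_TIERS:
        failures.append("plan_tier_in_set")
    mrr = row.get("mrr_usd")
    if mrr is not None and not (0 <= mrr <= 1_000_000):
        failures.append("mrr_usd_in_range")
    return failures


def route_batch(rows: list) -> tuple:
    """Split a batch into passing rows and quarantine records with metadata."""
    passing, quarantined = [], []
    checked_at = datetime.now(timezone.utc).isoformat()
    for row in rows:
        failures = contract_violations(row)
        if failures:
            quarantined.append(
                {**row, "_failed_rules": failures, "_checked_at": checked_at}
            )
        else:
            passing.append(row)
    return passing, quarantined


batch = [
    {"subscription_id": 1, "customer_id": 10, "plan_tier": "Gold", "mrr_usd": 499.0},
    {"subscription_id": 2, "customer_id": None, "plan_tier": "Platinum", "mrr_usd": -5},
]
passing, quarantined = route_batch(batch)
```

Under the strict pattern the same split would be followed by raising whenever `quarantined` is non-empty, so nothing reaches bronze until the batch is clean.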

Checkpoint B — Gold gate. Runs after dbt has built the gold-tier tables, before the BI / chat consumers can read them. Expectations enforce business invariants: ARR must be monotonic when grouped by customer-month and bounded by the cohort sum; dim_customer.is_current = true must have exactly one row per business key; the fct_arr_daily row count must be within historical bounds. If Checkpoint B fails, the Airflow DAG halts the partition swap — the dashboards keep reading yesterday’s gold tables, not today’s potentially broken ones.
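Two of the gold-gate invariants can be sketched as plain functions whose combined result decides whether the partition swap proceeds. This is a hedged illustration rather than our production suite: the `business_key` field name, the 20% tolerance around the historical mean, and the function names are assumptions made for the example.

```python
from collections import Counter


def current_row_violations(dim_rows: list) -> list:
    """Business keys that do not have exactly one is_current row."""
    all_keys = {r["business_key"] for r in dim_rows}
    current = Counter(r["business_key"] for r in dim_rows if r["is_current"])
    return sorted(k for k in all_keys if current[k] != 1)


def row_count_in_band(today: int, history: list, tolerance: float = 0.2) -> bool:
    """True if today's count is within +/- tolerance of the historical mean."""
    mean = sum(history) / len(history)
    return mean * (1 - tolerance) <= today <= mean * (1 + tolerance)


def gold_gate(dim_rows: list, today_count: int, history: list) -> bool:
    """Return True only when it is safe to swap the new partition in."""
    return not current_row_violations(dim_rows) and row_count_in_band(
        today_count, history
    )


dim_customer = [
    {"business_key": "c1", "is_current": True},
    {"business_key": "c1", "is_current": False},
    {"business_key": "c2", "is_current": True},
    {"business_key": "c2", "is_current": True},  # duplicate current row
]
swap_allowed = gold_gate(dim_customer, today_count=1000, history=[980, 1010, 1000])
```

When `swap_allowed` is false the orchestrator simply skips the swap task, which is what keeps consumers on the previous day's tables.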

Checkpoint C — Drift watch. Runs on a schedule (hourly or daily) against gold tables, comparing the current snapshot’s distribution to the rolling 30-day baseline. This catches the slow-failure modes: a plan_tier value that’s gradually disappearing, a region distribution silently shifting, a mrr_usd distribution whose tail is changing shape. Drift alerts go to a Slack channel and a CloudWatch alarm; they do not halt the pipeline. (Compare to Soda Core’s SodaCL change-detection checks, which are SQL-native and run in the same shape — see the next section for when we use which.)
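One simple way to quantify a shifting categorical distribution such as plan_tier is total variation distance between the baseline window and the current snapshot. This is a sketch of that one metric under stated assumptions, not the specific statistic our drift checkpoints or GX use; the 0.1 alert threshold and the function names are hypothetical, and both inputs are assumed non-empty.

```python
from collections import Counter


def category_shares(values: list) -> dict:
    """Map each category to its share of the total."""
    counts = Counter(values)
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}


def total_variation_distance(baseline: list, current: list) -> float:
    """Half the L1 distance between two categorical distributions.

    0.0 means identical shares; 1.0 means no categories in common.
    """
    p, q = category_shares(baseline), category_shares(current)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


# Baseline window vs. a snapshot where the Gold tier has disappeared.
baseline = ["Bronze"] * 50 + ["Silver"] * 30 + ["Gold"] * 20
current = ["Bronze"] * 70 + ["Silver"] * 30

tvd = total_variation_distance(baseline, current)
DRIFT_THRESHOLD = 0.1  # hypothetical alerting threshold
should_alert = tvd > DRIFT_THRESHOLD
```

Because the drift watch alerts rather than halts, a threshold like this can start loose and be tightened as the team learns what normal variation looks like.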

A worked example: validating a SaaS subscriptions table

Concretely: the source system has a subscriptions table in Postgres. CDC streams it via Debezium + MSK to an S3-staging bucket, where a Lambda reads each batch, validates it against the GX checkpoint, and writes the passing rows to the bronze Iceberg table. This is the contract we encode.

The subscriptions expectation suite, version 1:

import great_expectations as gx
from great_expectations import expectations as gxe

context = gx.get_context()
suite = context.suites.add(
    gx.ExpectationSuite(name="subscriptions_bronze")
)

# 1. Schema contract — column set is exactly what we expect
suite.add_expectation(gxe.ExpectTableColumnsToMatchSet(
    column_set=[
        "subscription_id", "customer_id", "plan_tier",
        "started_at", "ended_at", "mrr_usd",
        "currency", "status", "updated_at",
    ],
))

# 2. Primary key — subscription_id never null, always unique within a batch
suite.add_expectation(gxe.ExpectColumnValuesToNotBeNull(column="subscription_id"))
suite.add_expectation(gxe.ExpectColumnValuesToBeUnique(column="subscription_id"))

# 3. Foreign key — customer_id never null
suite.add_expectation(gxe.ExpectColumnValuesToNotBeNull(column="customer_id"))

# 4. Enum constraint — plan_tier in known set
suite.add_expectation(gxe.ExpectColumnDistinctValuesToBeInSet(
    column="plan_tier",
    value_set=["Bronze", "Silver", "Gold", "Enterprise"],
))

# 5. Value range: mrr_usd is non-negative and within plausible bounds
suite.add_expectation(gxe.ExpectColumnValuesToBeBetween(
    column="mrr_usd",
    min_value=0,
    max_value=1_000_000,
    mostly=1.0,
))

# 6. Timestamp sanity: started_at falls below a far-future sentinel date
#    (a strict "not in the future" check needs max_value computed at run time)
suite.add_expectation(gxe.ExpectColumnValuesToBeBetween(
    column="started_at",
    max_value="2099-12-31",
    parse_strings_as_datetimes=True,
))

# 7. Currency: three-character codes (length check only; does not verify ISO 4217 membership)
suite.add_expectation(gxe.ExpectColumnValueLengthsToEqual(
    column="currency",
    value=3,
))

# 8. Row count sanity — batches between 1 and 100K rows
suite.add_expectation(gxe.ExpectTableRowCountToBeBetween(
    min_value=1,
    max_value=100_000,
))

suite.save()

Nine expectations across eight numbered rules. Each one is a specific, testable assertion. Together they form a contract: the source system has implicitly promised this shape, and the bronze layer refuses to absorb anything that doesn’t match.

The validation definition that runs the suite against an incoming batch:

# Fluent Datasource → asset → batch_definition
data_source = context.data_sources.add_pandas(name="s3_staging")
data_asset = data_source.add_dataframe_asset(name="subscriptions_batch")
batch_definition = data_asset.add_batch_definition_whole_dataframe(
    "subscriptions_latest"
)

# Wire suite + batch
validation_definition = context.validation_definitions.add(
    gx.ValidationDefinition(
        name="subscriptions_ingest_gate",
        data=batch_definition,
        suite=context.suites.get("subscriptions_bronze"),
    )
)

# Run against the incoming dataframe
results = validation_definition.run(
    batch_parameters={"dataframe": df_subscriptions_batch},
)
if not results.success:
    # Slack notification, halt the DAG, route bad rows to quarantine
    raise ValueError(
        f"Subscriptions ingest gate failed. "
        f"See data docs for details: {results.data_docs_url}"
    )

Failure raises, the Airflow task fails, the DAG halts. The data team gets a Slack ping with a link to the HTML report listing every failed expectation and the actual values that caused the failure. The bronze table is never written.

A note on Iceberg. The GX Application Integration Support page does NOT list Iceberg as a directly-supported data source as of May 2026 — there is no native pyiceberg adapter. The integration pattern in production is via Spark or Trino — read the Iceberg table into a Spark DataFrame or query it via Trino, then validate that result. The Dremio + Nessie + Iceberg + GX walkthrough is a good reference for the Write-Audit-Publish (WAP) pattern on Iceberg-backed lakehouses. If a customer’s stack pushes us toward native pyiceberg validation, we layer Pandera or Soda’s SQL backend instead — see the competitive section below.

The integration patterns that work in 2026

We run Great Expectations alongside three orchestrators and several adjacent tools. The integration health matters — some of the popular paths are quietly degrading.

Airflow · airflow-provider-great-expectations

A community-maintained Great Expectations provider for Apache Airflow supplies a GreatExpectationsOperator task type. The DAG declares the checkpoint, the operator runs it, and the task fails loudly if the suite fails. Because downstream tasks do not run when an upstream task fails, a failed checkpoint automatically stops the consumers that depend on it, which is the entire point of using Airflow rather than ad-hoc cron.

Maintenance status: alive but quiet. ~173 GitHub stars, last push April 1, 2026. Not archived. Originally owned by Astronomer, then transferred to the GX org. Airflow 2.9.0+ is the documented compatibility floor.

Sources: provider source · Airflow GX provider docs · Astronomer backgrounder · Compatibility reference

dbt · Side-by-side orchestration (NOT dbt-expectations)

dbt does not natively call Great Expectations, and there is no first-class dbt-GX bridge package. The official tutorial pattern is side-by-side orchestration: Airflow (or Dagster) coordinates dbt run + GX checkpoint as separate steps. dbt's own pre/post-hooks execute SQL in the warehouse rather than shell or Python, so the checkpoint is triggered by the orchestrator as the task immediately after the gold models build, which is how we wire Checkpoint B (the gold gate).

The dbt-expectations package by Calogica is NOT a GX integration — it's a pure-dbt reimplementation of expectation semantics as dbt macros. It does not import GX. It has been officially unmaintained since late 2024. If your stack uses it, you should know it's frozen. For dbt-native anomaly detection in 2026 we use Elementary, which adds Z-score based drift detection on a configurable historical window.

dbt's native tests still run alongside — they catch the deterministic invariants cheaply. GX layers on the contract-shape and distributional assertions dbt tests do not express well.

Sources: GX official dbt tutorial · dbt-expectations repo (unmaintained banner) · Elementary's dbt testing roundup · dbt pre/post-hooks

Dagster · dagster-ge (BETA)

The dagster-ge integration provides a ge_data_context resource and a ge_validation_op_factory for invoking GX inside a Dagster job. Status: officially marked beta, with documented breaking changes in minor releases. Active long-running issue thread on Dagster v3 compatibility.

We use Dagster when the customer's stack already runs on it, and accept the integration is less mature than the Airflow path. Greenfield engagements default to Airflow + GX for that reason.

Sources: dagster-ge integration docs · dagster-ge API ref ("beta") · v3 support tracking issue

Iceberg · Via Spark / Trino (no native adapter)

Iceberg is not on the GX compatibility reference page as of May 2026. There's no native pyiceberg adapter. Integration in production is via Spark — read the Iceberg table into a Spark DataFrame and run GX expectations against it, using GX's Spark execution engine.

For Iceberg-on-S3 + AWS Glue Catalog (the stack we deploy), this means the validation step is a small EMR/EKS Spark job, not a Python-only Lambda. The Dremio + Nessie + Iceberg + GX walkthrough is the cleanest reference for the Write-Audit-Publish pattern on Iceberg.

Sources: GX compatibility reference (Iceberg absent) · Dremio Iceberg + WAP + GX

CI · GitHub Actions running suites against staging data

Every PR that modifies an expectation suite triggers a CI job that runs the suite against a staging sample. The job also diffs the Data Docs HTML against the previous run — a reviewer sees exactly which expectations changed, what the change is asserting, and which historical batches would now pass/fail.

This catches "drive-by relaxation" — the engineer who weakens an expectation to make a failing CI run pass, instead of fixing the underlying data problem.
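A reviewer-facing check for drive-by relaxation can be sketched as a diff between the old and new suite definitions. This is a minimal illustration, assuming each expectation is available as a simplified dict with `type` and `kwargs` keys; that shape, the function names, and the finding strings are hypothetical and are not the GX serialization format or an existing GX feature.

```python
def _key(exp: dict) -> tuple:
    """Identify an expectation by its type and target column."""
    return (exp["type"], exp["kwargs"].get("column"))


def find_relaxations(old: list, new: list) -> list:
    """List changes that make a suite easier to pass."""
    new_by_key = {_key(e): e for e in new}
    findings = []
    for exp in old:
        key = _key(exp)
        if key not in new_by_key:
            findings.append(f"removed: {key}")
            continue
        before, after = exp["kwargs"], new_by_key[key]["kwargs"]
        if after.get("mostly", 1.0) < before.get("mostly", 1.0):
            findings.append(f"mostly lowered: {key}")
        # A bound is relaxed if it was dropped or moved outward.
        for bound, loosened in (
            ("min_value", lambda a, b: a < b),
            ("max_value", lambda a, b: a > b),
        ):
            b, a = before.get(bound), after.get(bound)
            if b is not None and (a is None or loosened(a, b)):
                findings.append(f"{bound} relaxed: {key}")
    return findings


old_suite = [
    {"type": "expect_column_values_to_be_between",
     "kwargs": {"column": "mrr_usd", "min_value": 0, "max_value": 1_000_000, "mostly": 1.0}},
    {"type": "expect_column_values_to_not_be_null",
     "kwargs": {"column": "customer_id"}},
]
new_suite = [
    {"type": "expect_column_values_to_be_between",
     "kwargs": {"column": "mrr_usd", "min_value": 0, "max_value": 5_000_000, "mostly": 0.95}},
]
findings = find_relaxations(old_suite, new_suite)
```

Posting `findings` as a PR comment does not block a legitimate relaxation; it just forces the change to be made in the open with a reviewer looking at it.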

Sources: GX in CI patterns

Data docs · S3 + CloudFront static site

Every checkpoint run publishes its HTML Data Docs to an S3 bucket fronted by CloudFront. The URL gets linked from Looker as a "data quality status" tile. Auditors get a read-only login that exposes the full historical run set. There is no separate "data quality dashboard" to build — GX renders it for you on every run.

Known issue to wire around: the validations-prefix and Data Docs-prefix in S3 must be disjoint (neither a substring of the other) or reports start overwriting themselves (issue #6314).

Sources: GX data-docs hosting

GX vs Soda vs Pandera vs the rest: what to use when

The honest 2026 verdict, after auditing more than a dozen engagements: no single validation tool covers all the cases well. The teams that ship reliable data quality run two or three tools, each playing to its strengths.

GX Core (OSS) · Python-flexible validation in dev / CI

Best for: Python-native validation flows, custom expectations with non-trivial logic, suites version-controlled alongside the producer's code, ML pipelines where the validation logic itself is non-trivial Python. The strongest deployments use GX as the dev/CI engine — write the suite once, version it, run it locally, run it in CI, run it in the orchestrator.

Where it strains: heavy at small scale, the Fluent API has a learning curve, and the split between Spark execution and the result store creates real bottlenecks on giant batches (see Performance section). The framing from multiple third-party reviews, in paraphrase: GX was designed to make data quality easier and became a platform that requires its own care and feeding. (This paraphrases the widely cited techoc.blog post; we could not re-fetch the original text, so we do not present it as a verbatim quote.)

Sources: GX repo · Pebblous deep-dive critique

Soda Core · SQL/YAML production observability

Best for: SQL-comfortable teams running production data observability. Declarative SodaCL YAML, 50+ built-in checks, in-warehouse query execution (data stays in the warehouse). Soda 4.0 added AI anomaly detection with vendor-claimed reductions in false positives vs naive baselines. SodaGPT translates natural language to SodaCL — useful for getting analysts to author their own contracts.

Where it strains: less flexible than GX for complex Python validation, fewer hooks for custom logic, doesn't replace dev/CI workflows.

The hybrid pattern most production teams converge on: GX in CI / dev / pre-ingest, Soda for production scheduled monitoring. Each plays to its strengths.

Sources: Soda Core repo · Soda vs GX comparison (DataExpert)

Pandera · DataFrame-native, Pythonic, fast

Best for: pandas/polars in-process validation, ML notebook code, anywhere you want a schema-as-code pattern that lives inside Python with minimal ceremony. Significantly fewer dependencies than GX. An informal 5M-row benchmark by endjin found Pandera consistently outperformed GX on equivalent schema validations.

Where it sits: the de-facto split many teams adopt — Pydantic at the API boundary, Pandera inside ML/notebook code, GX/Soda for warehouse-monitoring CI. If your team's pipeline is mostly notebook-driven, Pandera is often the right primary choice and GX is overkill.

Sources: endjin Pandera-vs-GX writeup

dbt-native · dbt tests + Elementary

Best for: teams whose entire data layer is dbt models on Snowflake/BigQuery/Redshift, and whose validation needs fit deterministic invariants (uniqueness, not-null, referential integrity, accepted values) plus rolling anomaly detection. dbt's built-in tests cover the deterministic side; Elementary adds Z-score-based anomaly detection with a configurable historical window (default 14 days).

Why we mention it: if a customer's needs really are only this, recommending GX is over-engineering. The teams that ship the best mid-market data quality often start here and add GX/Soda only when the dbt-native shape genuinely doesn't fit.
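The idea behind Z-score anomaly detection on a trailing window is small enough to show directly. This is a sketch of the general technique only, not Elementary's implementation: the function name, the threshold of 3.0, and the sample row counts are assumptions for illustration.

```python
from statistics import mean, stdev


def zscore_anomaly(history: list, today: float, threshold: float = 3.0) -> bool:
    """Flag today's value if it sits more than `threshold` sample standard
    deviations away from the mean of the trailing window."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu  # a flat history makes any change anomalous
    return abs(today - mu) / sigma > threshold


# Hypothetical 14-day window of daily row counts for one model.
window = [1000, 1010, 990, 1005, 995, 1002, 998,
          1007, 993, 1001, 999, 1004, 996, 1000]

normal_day = zscore_anomaly(window, 1003)
broken_day = zscore_anomaly(window, 600)
```

The window length is the main tuning knob: a short window adapts quickly to real growth but also normalizes a slow failure sooner.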

Sources: dbt tests reference · Elementary roundup

Anomalo · AI-driven, no-rules-required

Best for: enterprises with budget for a fully-managed AI-driven validation platform. Connects to Snowflake/Databricks/BigQuery/Redshift; deploys SaaS, hybrid, in-VPC, or as a Snowflake Native App. Strong "unknown-unknowns" detection — surfaces issues without you needing to write the assertions first.

Pricing: contact-sales. One third-party review cites "starts around $60K/yr" but the vendor does not publish numbers and we cannot verify the figure with primary sourcing. Treat as a directional estimate, not a quote.

Sources: Anomalo product page · G2 pricing page (contact sales)

Monte Carlo, Bigeye, Lightup · Data observability (different category)

Important distinction: observability ≠ validation. Monte Carlo pioneered the "data observability" category — its core is metadata, lineage, freshness alerts, anomaly detection at the metric layer. It doesn't replace GX's assertion-against-data shape; the two are complementary. Bigeye is similar in category with ML-driven baselines. Lightup runs in-warehouse so data never leaves your account — strong for residency-constrained shops.

Pricing (all secondary-sourced, hedged): Monte Carlo reportedly enterprise-tier at ~$100K+/yr; Bigeye reported at $5K–$15K/mo ($60K–$180K/yr) at mid-market scale. Primary vendor sources are all gated.

Sources: Monte Carlo positioning · 2026 comparison (Basedash) · Atlan tools list

Market consolidation context. The data observability and validation space has been consolidating fast: Datadog acquired Metaplane in 2025, Snowflake acquired Select Star, and the broader DataKitchen 2026 landscape report observes that “data quality and observability have become too strategic to ignore, and the big players are buying their way in” (source). Perhaps as a consequence, the OSS layer (GX Core, Soda Core, Pandera, dbt+Elementary) is the most stable place to invest in 2026: vendor M&A risk is real, and the GX Cloud shutdown is the canary.

Performance characteristics: what actually breaks

Real-world performance is the part of GX evaluations that gets the least public benchmark data. We’ve stitched together what’s documented in primary sources and community postmortems.

The documented bottlenecks:

Serialization + result-store path is single-Python-process. When the Spark execution engine is in use, validation itself parallelizes across executors — but the result dict that lists failed rows + their primary keys is materialized as JSON in a single Python process for the result store. Multiple community writeups describe this as the load-bearing pain point at scale, particularly when result_format=COMPLETE (Databricks optimization writeup, Prefect-side discussion).
Profiler memory. Issue #5389 documents the UserConfigurableProfiler consuming ~1.6 GB on TPCH SF1 lineitem and ~4 GB on TPCH SF10 lineitem (and not completing) in a GX v0.15.0 run against Snowflake. The issue is closed against subsequent fixes to DataAssistant and UserConfigurableProfiler, but the shape of the failure mode (profiler workloads don’t scale linearly with table size) is worth remembering — sample, don’t full-scan, when defining a new suite.
Long checkpoint runs in cloud environments. Issue #3620: context.run_checkpoint takes 30 seconds on staging but 30+ minutes on production GCP despite output files appearing in the bucket within 30 seconds. The pattern (checkpoint “stuck” after data is already written) recurs in community discussions; the fix is usually result-store configuration.
Spark count/collect overhead. The Pebblous deep-dive (blog.pebblous.ai) reports that “cases of count/collect operations taking more than 2× longer on large Spark datasets have been reported.” This is consistent with the result-store bottleneck above.
DataHub action memory leak. A long-running Airflow + GX + DataHub deployment pattern can leak memory in the DataHub action handler (DataHub issue #4531). Workaround: don’t run the DataHub action inline; run it asynchronously.

The mitigations we apply by default:

result_format="SUMMARY" instead of COMPLETE on production checkpoints. We do NOT materialize every failed row’s primary key in the result store — the summary is enough for alerting, and the underlying batch is queryable for forensics.
Sample, don’t full-scan, on tables over ~10M rows. A representative batch + a separate scheduled full-scan once a day catches the same drift at a fraction of the cost.
Push expectations down to the warehouse where supported. GX’s SQL backend can execute many expectations as database-side queries, avoiding the Spark serialization path entirely.
Set max_workers appropriately when validating concurrently.
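For the sampling mitigation, one option is a deterministic hash of the primary key, so the same rows land in the sample on every run and results stay comparable from day to day. This is a sketch of that approach under stated assumptions, not a GX feature: the `in_sample` helper, the 1% rate, and the synthetic rows are hypothetical.

```python
import hashlib


def in_sample(primary_key, sample_rate: float) -> bool:
    """Deterministically keep roughly `sample_rate` of keys, stable across runs."""
    digest = hashlib.sha256(str(primary_key).encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


# Synthetic stand-in for a large table; only the key matters for sampling.
rows = [{"subscription_id": i} for i in range(100_000)]
sample = [r for r in rows if in_sample(r["subscription_id"], 0.01)]
```

The validation suite then runs against `sample` on every batch, with the separate daily full scan covering anything the sample cannot see.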

The honest hedge: there is no canonical, peer-reviewed “what does GX cost on a 10M-row table” public benchmark from GX itself. The numbers above come from GitHub issues, community blog posts, and the May 2025 MDPI data-cleaning-tools benchmark (source) which compared GX alongside OpenRefine, Dedupe, TidyData, and Pandas. We use those as directional evidence, not as a quote you can put in a SOW.

Anti-patterns we audit in client engagements

Five failure modes recur across the engagements we audit. Most are not GX-specific — they’re patterns any validation engine will exhibit when wired poorly.

1. Profiler-generated suites shipped to production unchanged. GX’s own documentation says the profiler emits expectations that are “deliberately over-fitted on your data” (reference). If a sample table has 10,000 rows, the profiler will emit expect_table_row_count_to_equal(10000). That is meant as scaffolding — to be reviewed, relaxed, and adjusted before going to production. Teams that skip that step ship a suite that throws on every benign row-count change, and the team mutes the alerts within a month.

2. Suite graveyard. Hundreds of expectations accumulate across dozens of suites. Nobody owns them. Failures get muted as noise. The suite stops being a quality signal — it becomes a graveyard. The Pebblous report calls this “expectations overload” and we see it in roughly a third of audits.

3. Validation-as-compliance-theater. When validation runs exist primarily to satisfy an audit checklist rather than to drive engineering action, the failure modes are predictable: alerts route to a no-op Slack channel, suites are never tightened, and the entire layer becomes a tax. We know of no published source that attributes this critique specifically to GX, but the pattern is well known in software testing: auto-generated suites + nobody-reads-results + muted alerts = theater. The fix is governance, not tooling: a named human owner per suite, a CI diff on every change, a quarterly review.

4. Heavyweight DataContext for small projects. GX’s architecture (DataContext + DataSources + Asset + BatchDefinition + ExpectationSuite + ValidationDefinition + Checkpoint + Stores) is a lot of conceptual surface to learn. If you only need six schema checks on a single dbt project, Pandera or dbt-native tests do it in a fraction of the lines. Elementary’s writeup names this directly: “requires more setup and conceptual overhead than adding a dbt package or writing SodaCL checks” (source). The right answer is not to force GX onto every project — it’s to pick the tool with the right weight class for the job.

5. v0 → v1 migration debt. GX 1.0 (August 2024) was a hard break — Fluent API replaced YAML/CLI config; v1.17.0 (May 2026) removed the legacy Batch args. Codebases still on v0.x with code paths that pass data_context, datasource_name, batch_parameters, or batch_kwargs to the modern Batch API are broken on current GX. Discourse threads describe the migration friction (example); the v0.18 migration guide is the canonical reference. Plan migration as a discrete sprint, not a side-quest.

What it can do for your company

Translated from the engineering layer into language a buyer can use inside their own organization:

Stop wrong-data dashboards before they reach the board. If fct_arr_daily violates an expectation, the partition swap is blocked. The Monday board deck reads yesterday’s correct number — not today’s broken one. Time-to-detection drops from “the CFO asks about it” to “the Airflow task fails at 03:42 UTC.”
Make compliance and audit cheap. The Data Docs HTML is the audit log. SOX, SOC 2, GDPR, regulated finance — every audit asks “show me how you knew the data was correct.” Pointing the auditor at a historical Data Docs URL is materially cheaper than producing the same evidence from logs and runbook screenshots.
Catch schema drift at the source. When the vendor app team adds a column without telling the data team, the contract fails on the next CDC batch. The data team finds out in minutes. The alternative — finding out in three months because a quarterly retention number is wrong — is the failure mode every data engineer has lived through.
Reduce data-team firefighting. The most expensive hour in a data team’s week is the one spent figuring out “where did this number go wrong.” A working validation layer collapses that hour into a Slack ping with a URL. The team that previously spent two days a quarter on incident forensics spends an afternoon.
Give the AI chat surface a contract. When natural-language-to-SQL grounding pulls from gold tables, the model is only as reliable as the data underneath it. A wrong-data answer from an AI chat surface destroys trust faster than a wrong dashboard, because the user can argue with a chat reply. The gold gate (Checkpoint B) is the contract that lets the chat surface be confident.

A concrete buyer-internal pitch: “We are adding a contract layer at the seams of the pipeline. It costs us X engineering hours to wire and Y compute dollars to run. It is what stops wrong-data weeks from happening, and it is what makes our compliance audits cheap.” That is the framing.

What we will not claim (anti-fabrication)

“Great Expectations replaces dbt tests.” It does not. The right architecture runs both. dbt tests catch deterministic invariants cheaply; GX catches contract-shape and distributional assertions dbt cannot express. (dbt docs on tests)
“GX catches every data problem.” It catches what you encode. An expectation that wasn’t written cannot fire. The discipline is in what you assert, not which tool you use.
“Validation is free.” Each checkpoint adds compute time proportional to the batch size and the number of expectations, plus engineering time to maintain the suite as the source schema evolves. Mid-market lakehouse engagements typically add ~5–10% to monthly compute spend and ~half a day per quarter to suite maintenance. We will model both for your specific case during a Stack Audit.
“GX Cloud is the managed option for the long term.” It is not: the Cloud shutdown has been announced for around June 6, 2026 (source). Even if the marketing site still actively sells Cloud as of mid-May 2026, the OSS (Core) path is the only honest recommendation right now. The acquirer behind the change has not been publicly named in any source we can cite.
“dbt-expectations is a current option.” It isn’t. Calogica’s README says “This package is no longer actively supported,” with the last release in September 2024. If a stack relies on it, treat it as frozen. (source)
“GX has native Iceberg support.” It doesn’t — Iceberg is not on the GX compatibility reference (source). Integration is via Spark or Trino. No pyiceberg adapter as of mid-2026.
“Auto-generated profiler suites are production-ready.” GX’s own docs say they are deliberately over-fitted to the sample. Treat them as scaffolding only, then hand-tighten.
“GX scales to anything via Spark.” A Spark execution engine exists, but the result-store and serialization path runs in a single Python process. Real bottlenecks are documented in issue #5389, issue #3620, and the Pebblous report. Sample your batches at scale; don’t claim infinite linear scaling.
“Data Docs are a data contract.” They are not — Data Docs are HTML reports of validation runs. A real data contract is a producer-consumer agreement enforced at the schema layer (see PayPal’s data-contract template or Gable for the contract layer). Data Docs report on whether one side of a contract passed; they are not the contract itself.
Specific vendor pricing for Anomalo / Monte Carlo / Bigeye. All numbers in this article come from secondary sources because the vendors do not publish pricing. Treat the ranges as directional.
Headline performance claims like “GX is X% slower than Soda on real production data.” There is no peer-reviewed head-to-head benchmark. The MDPI 2025 study (source) and the endjin Pandera-vs-GX 5M-row informal test are the best public evidence we found. Both are limited.
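
On the “sample your batches at scale” point in the list above: the article does not prescribe a sampling method, so the following is a minimal sketch of one common approach, deterministic hash-based sampling. The function names, the salt value, and the customer_id key are illustrative assumptions. The idea is that the same key always lands in or out of the sample, so repeated validation runs look at a stable subset of rows.

```python
import hashlib


def in_sample(key: str, rate: float, salt: str = "validation-sample-v1") -> bool:
    """Deterministically decide whether the row identified by `key` is sampled.

    Hypothetical helper: `rate` is the target fraction of rows to keep (0.0 to 1.0).
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(f"{salt}:{key}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64) and compare it
    # against the matching fraction of that range.
    return int.from_bytes(digest[:8], "big") < rate * 2**64


def sample_rows(rows, key_field: str, rate: float):
    """Yield only the rows whose key falls inside the sample."""
    for row in rows:
        if in_sample(str(row[key_field]), rate):
            yield row
```

For example, list(sample_rows(rows, "customer_id", 0.1)) keeps roughly one row in ten and returns the same rows on every run. On a real lakehouse batch the same predicate would more likely be pushed down into the Spark or Trino query than applied row by row in Python; the in-process version here only shows the logic.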

How to start

A free 60-minute architecture call is the entry point. We use it to map the seams in your existing pipeline that should have a contract gate, identify the three highest-leverage places to wire validation first, and quote a fixed-bid scope. Most data-validation engagements fall under the Stack Diagnostic (audit + remediation plan) or Embedded Sprint (wire it end-to-end on one critical pipeline) shapes. Full lakehouse builds with validation included from day one fall under Quarter Stack.

For the broader picture of how data validation fits into the AWS-native data stack we deploy, read From a fresh AWS account to dashboards and AI chat. For the historical-capture layer that sits next to validation, read Slowly Changing Dimensions — a diagnostic walkthrough. This article is the contract layer; those two are the structure and history layers; together they describe the lakehouse we ship.

––