Bad data is quiet. It does not announce itself. It flows through your pipelines, populates your dashboards, and informs decisions while being wrong in ways that are difficult to detect without actively looking for problems.
The customer lifetime value calculation that produces negative numbers because refund records were double-counted. The churn model that silently receives a column of all nulls because an upstream schema changed. The marketing report that overstates conversion by fourteen percent because a timestamp filter has an off-by-one error. These problems share a common characteristic: they were preventable with data quality checks that nobody put in place.
Great Expectations is one of the most widely adopted open source frameworks for data quality testing in Python. It lets you define explicit assertions about your data, run them automatically as part of your pipelines, generate human-readable documentation of what your data is supposed to look like, and get immediate alerts when reality diverges from expectation.
This tutorial walks through the complete Great Expectations workflow. By the end you will have working data quality tests that you can drop into any pipeline.
What Great Expectations Does
Great Expectations works by letting you define expectations about your data and then validating actual data against those expectations. An expectation is a declarative assertion: this column should never be null, values in this column should fall between zero and one, this column should only contain these specific values, the row count should be between ten thousand and fifty thousand.
The framework runs those expectations against your actual data and tells you which passed and which failed. It generates a data documentation site that shows every expectation defined for every dataset, the last validation result, and historical pass rates. And it fits into orchestrated pipelines, for example as a task in an Airflow DAG, so that quality checks run automatically as part of your pipeline rather than as a manual afterthought.
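To make "declarative assertion" concrete, here is a minimal sketch in plain Python. It is not the Great Expectations API: `SimpleExpectation` and `evaluate` are hypothetical names invented for illustration, and the rows are made-up sample values. The point is only that an expectation is data (a name, a column, a rule) that a generic runner can evaluate.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class SimpleExpectation:
    """Hypothetical stand-in for an expectation: a named rule for one column."""
    name: str
    column: str
    rule: Callable[[Any], bool]


def evaluate(expectation: SimpleExpectation, rows: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Apply the rule to every row and summarize the outcome."""
    unexpected = [
        row[expectation.column]
        for row in rows
        if not expectation.rule(row[expectation.column])
    ]
    return {
        "expectation": expectation.name,
        "success": len(unexpected) == 0,
        "unexpected_values": unexpected,
    }


# Made-up sample rows: one score falls outside the allowed range
rows = [{"score": 0.2}, {"score": 0.9}, {"score": 1.4}]

score_in_range = SimpleExpectation(
    name="score_between_zero_and_one",
    column="score",
    rule=lambda value: value is not None and 0 <= value <= 1,
)

result = evaluate(score_in_range, rows)
print(result["success"], result["unexpected_values"])
```

The real framework adds what this sketch leaves out: a large library of prebuilt rules, multiple execution backends, and result storage.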
Installation and Project Setup
bash
pip install great-expectations pandas sqlalchemy
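Great Expectations uses a context object that manages your configuration, data sources, expectation suites, and checkpoints. The code in this tutorial uses the fluent Python API from the 1.x releases (context.data_sources, context.suites, context.validation_definitions), which is significantly cleaner than the legacy file-based configuration. Earlier releases use different method names, so check gx.__version__ if a call is missing.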
python
import great_expectations as gx
import pandas as pd
import numpy as np
# Create a data context (manages all GX configuration)
context = gx.get_context()
print(f"Great Expectations version: {gx.__version__}")
print(f"Context type: {type(context)}")
Loading Sample Data
For this tutorial, use a realistic customer transaction dataset with intentional data quality problems embedded so you can see expectations catch them.
python
np.random.seed(42)
n_rows = 5000
# Generate realistic transaction data with embedded quality issues
data = {
    "transaction_id": range(1, n_rows + 1),
    "customer_id": np.random.randint(1000, 9999, n_rows),
    "transaction_date": pd.date_range("2024-01-01", periods=n_rows, freq="h"),
    "amount": np.random.lognormal(mean=4.0, sigma=1.2, size=n_rows).round(2),
    "currency": np.random.choice(
        ["USD", "EUR", "GBP", "INVALID", None],
        n_rows,
        p=[0.7, 0.15, 0.1, 0.03, 0.02]
    ),
    "status": np.random.choice(
        ["completed", "pending", "failed", "refunded", "UNKNOWN"],
        n_rows,
        p=[0.75, 0.1, 0.08, 0.05, 0.02]
    ),
    "merchant_category": np.random.choice(
        ["retail", "food", "travel", "entertainment", None],
        n_rows,
        p=[0.35, 0.30, 0.20, 0.12, 0.03]
    )
}
df = pd.DataFrame(data)
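# Introduce data quality problems
# Negative amounts: pick the rows once so the same rows are read and written
negative_idx = np.random.choice(df.index, 50, replace=False)
df.loc[negative_idx, "amount"] = -df.loc[negative_idx, "amount"].abs()
# Duplicate IDs: copy IDs from one random set of rows onto another.
# .to_numpy() drops the index so pandas assigns by position instead of aligning labels
dup_target_idx = np.random.choice(df.index, 30, replace=False)
dup_source_idx = np.random.choice(df.index, 30, replace=False)
df.loc[dup_target_idx, "transaction_id"] = df.loc[dup_source_idx, "transaction_id"].to_numpy()
# Missing customer IDs (the nulls turn this integer column into floats)
df.loc[np.random.choice(df.index, 25, replace=False), "customer_id"] = None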
print(f"Dataset shape: {df.shape}")
print(f"\nSample data:")
print(df.head())
print(f"\nNull counts:\n{df.isnull().sum()}")
Connecting a Data Source
Great Expectations needs to know where your data lives. You connect data sources and create batch definitions that tell the framework how to load your data for validation.
python
# Add a pandas data source
data_source = context.data_sources.add_pandas(name="transaction_data_source")
# Add a data asset representing the transactions table
data_asset = data_source.add_dataframe_asset(name="transactions")
# Create a batch definition for how to load the data
batch_definition = data_asset.add_batch_definition_whole_dataframe(
    name="full_transactions_batch"
)
# Create a batch from the dataframe
batch = batch_definition.get_batch(
    batch_parameters={"dataframe": df}
)
print("Data source connected successfully")
print(f"Batch loaded with {len(df)} rows")
Building an Expectation Suite
An expectation suite is a named collection of expectations about a dataset. Think of it as a test file for your data. You define the suite once and run it every time new data arrives.
python
# Create an expectation suite
suite = context.suites.add(
    gx.ExpectationSuite(name="transactions_quality_suite")
)
print(f"Created suite: {suite.name}")
Now define individual expectations. Each expectation targets a specific quality dimension of the data.
python
from great_expectations.expectations import (
    ExpectColumnValuesToNotBeNull,
    ExpectColumnValuesToBeUnique,
    ExpectColumnValuesToBeBetween,
    ExpectColumnValuesToBeInSet,
    ExpectTableRowCountToBeBetween,
    ExpectTableColumnCountToEqual,
    ExpectColumnMeanToBeBetween,
)
# Table-level expectations
suite.add_expectation(
    ExpectTableRowCountToBeBetween(min_value=1000, max_value=100000)
)
suite.add_expectation(
    ExpectTableColumnCountToEqual(value=7)
)
# Primary key expectations
suite.add_expectation(
    ExpectColumnValuesToNotBeNull(column="transaction_id")
)
suite.add_expectation(
    ExpectColumnValuesToBeUnique(column="transaction_id")
)
# Customer ID expectations
suite.add_expectation(
    ExpectColumnValuesToNotBeNull(
        column="customer_id",
        mostly=0.99  # Allow up to 1% nulls
    )
)
suite.add_expectation(
    ExpectColumnValuesToBeBetween(
        column="customer_id",
        min_value=1000,
        max_value=9999
    )
)
# Amount expectations
suite.add_expectation(
    ExpectColumnValuesToNotBeNull(column="amount")
)
suite.add_expectation(
    ExpectColumnValuesToBeBetween(
        column="amount",
        min_value=0,
        max_value=100000,
        mostly=0.999
    )
)
suite.add_expectation(
    ExpectColumnMeanToBeBetween(
        column="amount",
        min_value=50,
        max_value=500
    )
)
# Categorical expectations
suite.add_expectation(
    ExpectColumnValuesToBeInSet(
        column="currency",
        value_set=["USD", "EUR", "GBP", "JPY", "CAD"],
        mostly=0.98
    )
)
suite.add_expectation(
    ExpectColumnValuesToBeInSet(
        column="status",
        value_set=["completed", "pending", "failed", "refunded"],
        mostly=0.99
    )
)
# Null rate expectations for optional fields
suite.add_expectation(
    ExpectColumnValuesToNotBeNull(
        column="merchant_category",
        mostly=0.95
    )
)
print(f"Suite contains {len(suite.expectations)} expectations")
The mostly parameter is one of the most important concepts in Great Expectations. It sets the minimum fraction of rows that must meet the expectation. A mostly of 0.99 means the expectation passes if at least 99 percent of rows satisfy the condition, allowing for a realistic tolerance in messy real-world data without treating every minor anomaly as a failure.
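The arithmetic behind mostly can be sketched in a few lines of plain Python. This is an illustration of the threshold logic only, not the library's implementation: `passes_with_mostly` is a hypothetical helper, the amounts are made up, and the sketch ignores details such as how the library treats null values.

```python
from typing import Any, Callable, Sequence, Tuple


def passes_with_mostly(
    values: Sequence[Any],
    condition: Callable[[Any], bool],
    mostly: float = 1.0,
) -> Tuple[bool, float]:
    """Return (success, observed_fraction) for a row-level condition."""
    if len(values) == 0:
        # Nothing can violate the rule, so treat an empty batch as passing
        return True, 1.0
    met = sum(1 for value in values if condition(value))
    observed_fraction = met / len(values)
    return observed_fraction >= mostly, observed_fraction


def is_non_negative(amount: float) -> bool:
    return amount >= 0


# 1,000 made-up amounts, 8 of which are negative
amounts = [25.0] * 992 + [-25.0] * 8

lenient_ok, fraction = passes_with_mostly(amounts, is_non_negative, mostly=0.99)
strict_ok, _ = passes_with_mostly(amounts, is_non_negative, mostly=0.999)
print(fraction, lenient_ok, strict_ok)
```

With 992 of 1,000 rows meeting the condition, the same batch passes at mostly=0.99 and fails at mostly=0.999, which is why the threshold deserves as much thought as the rule itself.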
Running Validation
Validation runs the expectation suite against a batch of data and produces a validation result object containing detailed pass/fail information for every expectation.
python
# Create a validation definition
validation_definition = context.validation_definitions.add(
    gx.ValidationDefinition(
        name="transactions_validation",
        data=batch_definition,
        suite=suite
    )
)
# Run validation
validation_result = validation_definition.run(
    batch_parameters={"dataframe": df}
)
# Inspect results
print(f"Validation success: {validation_result.success}")
print("\nResults summary:")
passed = 0
failed = 0
for result in validation_result.results:
    if result.success:
        passed += 1
        continue
    # Only failures are printed, along with the details GX recorded for them
    failed += 1
    expectation_type = result.expectation_config.type
    print(f"  FAIL: {expectation_type}")
    if result.result:
        print(f"    Details: {result.result}")
print(f"\nPassed: {passed}")
print(f"Failed: {failed}")
print(f"Total: {passed + failed}")
Checkpoints for Pipeline Integration
A checkpoint connects a validation definition to one or more actions that run after validation. The most common action is saving the validation result to build the data docs site, but checkpoints also support Slack notifications, email alerts, and custom Python callbacks.
python
# Create a checkpoint with actions
checkpoint = context.checkpoints.add(
    gx.Checkpoint(
        name="transactions_checkpoint",
        validation_definitions=[validation_definition],
        actions=[
            gx.checkpoint.UpdateDataDocsAction(name="update_data_docs")
        ],
        result_format={
            "result_format": "COMPLETE",
            "include_unexpected_rows": True,
            "unexpected_index_column_names": ["transaction_id"]
        }
    )
)
# Run the checkpoint
checkpoint_result = checkpoint.run(
    batch_parameters={"dataframe": df}
)
print(f"Checkpoint passed: {checkpoint_result.success}")
In a production pipeline, the checkpoint run replaces the manual validation call. You call it from an Airflow task or another scheduled Python step, and it runs automatically every time data is processed.
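Conceptually, a checkpoint is "run validation once, then hand the result to each action". The sketch below shows that shape in plain Python so the control flow is easy to see. Every name in it (`run_checkpoint_sketch`, `record_run`, `alert_on_failure`, `fake_validation`) is hypothetical and stands in for the real library objects.

```python
from typing import Any, Callable, Dict, List

ValidationResult = Dict[str, Any]
Action = Callable[[ValidationResult], None]


def run_checkpoint_sketch(
    validate: Callable[[], ValidationResult],
    actions: List[Action],
) -> ValidationResult:
    """Run validation once, then pass the same result to every action."""
    result = validate()
    for action in actions:
        action(result)
    return result


# Hypothetical actions: one records every run, one reacts only to failures
run_log: List[str] = []
alerts: List[str] = []


def record_run(result: ValidationResult) -> None:
    run_log.append(f"{result['name']}: success={result['success']}")


def alert_on_failure(result: ValidationResult) -> None:
    if not result["success"]:
        alerts.append(f"{result['name']} failed: {result['failed']}")


def fake_validation() -> ValidationResult:
    # Stand-in for a real validation run with one failing rule
    return {"name": "transactions", "success": False, "failed": ["currency_in_set"]}


outcome = run_checkpoint_sketch(fake_validation, [record_run, alert_on_failure])
print(run_log)
print(alerts)
```

Keeping actions separate from validation is what lets one set of expectations feed documentation, alerting, and custom callbacks without being rewritten for each.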
Custom Expectations
Great Expectations ships with over fifty built-in expectations covering the most common quality dimensions. For domain-specific rules that the built-in set does not cover, you write custom expectations.
python
from great_expectations.expectations.expectation import BatchExpectation
class ExpectColumnValuesToBePositiveForCompletedTransactions(BatchExpectation):
    """
    Custom expectation: amount must not be negative for completed transactions.
    Rows with any other status (such as refunded) are not checked.
    """
    metric_dependencies = ("table.columns",)
    success_keys = ("amount_column", "status_column")
    def _validate(self, metrics, runtime_configuration=None, execution_engine=None):
        # Reads the active pandas batch directly, which relies on GX internals
        # and may need adjusting for the version you have installed
        df = execution_engine.batch_manager.active_batch_data.dataframe
        # Column names fall back to these defaults when not configured
        kwargs = self.configuration.kwargs
        amount_col = kwargs.get("amount_column", "amount")
        status_col = kwargs.get("status_column", "status")
        completed = df[df[status_col] == "completed"]
        if len(completed) == 0:
            return {"success": True, "result": {"observed_value": "No completed transactions"}}
        negative_completed = completed[completed[amount_col] < 0]
        failed_count = len(negative_completed)
        total_count = len(completed)
        return {
            "success": failed_count == 0,
            "result": {
                "observed_value": f"{failed_count} negative amounts in completed transactions",
                "total_completed": total_count,
                "failed_count": failed_count,
                "unexpected_percent": round(failed_count / total_count * 100, 2)
            }
        }
For simpler custom rules, column map expectations are more concise. They apply a condition to each row and report the fraction that fails.
python
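from great_expectations.execution_engine import PandasExecutionEngine
from great_expectations.expectations.expectation import ColumnMapExpectation
from great_expectations.expectations.metrics import (
    ColumnMapMetricProvider,
    column_condition_partial,
)
class ColumnValuesAmountReasonable(ColumnMapMetricProvider):
    """Row-level condition that backs the expectation below."""
    condition_metric_name = "column_values.amount_reasonable_for_currency"
    @column_condition_partial(engine=PandasExecutionEngine)
    def _pandas(cls, column, **kwargs):
        # True for rows that meet the condition
        return column < 10000
class ExpectAmountToBeReasonableForCurrency(ColumnMapExpectation):
    """Amount should be under 10000. A column map rule sees one column at a time,
    so a limit that applies to EUR rows only would need a multi-column expectation."""
    map_metric = "column_values.amount_reasonable_for_currency"
    success_keys = ("mostly",)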
Generating Data Docs
Data docs are the human-readable documentation site Great Expectations generates from your expectations and validation results. They show every expectation defined for every dataset, the last validation status, and historical pass rates over time.
python
# Build and view data docs
context.build_data_docs()
# Open in browser (works locally)
# context.open_data_docs()
print("Data docs built successfully")
print("Open the generated HTML to review expectation results")
The data docs site is a static HTML site that can be hosted on S3, GCS, or any static hosting service. For teams, hosting data docs centrally gives every stakeholder visibility into data quality without needing to run Python themselves.
Integrating With a Pipeline
In production, Great Expectations runs as part of your data pipeline. Here is how it looks embedded in an Airflow-style batch job.
python
def run_daily_transaction_validation(df: pd.DataFrame, context=None) -> bool:
    """
    Run data quality validation as part of daily pipeline.
    Returns True if validation passes, raises exception if it fails.
    """
    # Reuse the context that already holds the suite when one is passed in.
    # In production, use a file or cloud context so the suite persists between runs.
    context = context or gx.get_context()
    data_source = context.data_sources.add_or_update_pandas(
        name="pipeline_data_source"
    )
    data_asset = data_source.add_dataframe_asset(name="daily_transactions")
    batch_definition = data_asset.add_batch_definition_whole_dataframe(
        name="daily_batch"
    )
    # Load existing suite rather than redefining it
    suite = context.suites.get("transactions_quality_suite")
    validation_definition = context.validation_definitions.add_or_update(
        gx.ValidationDefinition(
            name="daily_transactions_validation",
            data=batch_definition,
            suite=suite
        )
    )
    result = validation_definition.run(
        batch_parameters={"dataframe": df}
    )
    if not result.success:
        failed_expectations = [
            r.expectation_config.type
            for r in result.results
            if not r.success
        ]
        raise ValueError(
            f"Data quality validation failed. "
            f"Failed expectations: {failed_expectations}"
        )
    print(f"Validation passed: {len(result.results)} expectations checked")
    return True
# Simulate pipeline execution, passing the context that holds the suite defined earlier
try:
    run_daily_transaction_validation(df, context)
    print("Pipeline proceeding with validated data")
except ValueError as e:
    print(f"Pipeline halted: {e}")
Raising an exception on validation failure stops the pipeline before bad data propagates downstream. The failed expectation names in the error message tell the on-call engineer which rules were violated without requiring them to dig through logs.
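One possible refinement, sketched here under stated assumptions, is a dedicated exception type so callers can tell data quality failures apart from other errors. `DataQualityError`, `gated_load`, and `find_failures` are hypothetical names, and the list standing in for a warehouse is purely illustrative.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


class DataQualityError(Exception):
    """Raised when a batch fails validation; carries the failed rule names."""

    def __init__(self, failed_expectations: List[str]) -> None:
        self.failed_expectations = list(failed_expectations)
        super().__init__(
            "Data quality validation failed: " + ", ".join(self.failed_expectations)
        )


def gated_load(
    rows: List[Row],
    find_failures: Callable[[List[Row]], List[str]],
    load: Callable[[List[Row]], None],
) -> int:
    """Load rows only if validation reports no failures."""
    failures = find_failures(rows)
    if failures:
        raise DataQualityError(failures)
    load(rows)
    return len(rows)


def find_failures(rows: List[Row]) -> List[str]:
    # Stand-in for a real validation run: a single rule on the amount column
    if any(row["amount"] < 0 for row in rows):
        return ["amount_non_negative"]
    return []


warehouse: List[Row] = []  # stand-in for the downstream destination
good_batch = [{"amount": 10.0}, {"amount": 42.5}]
bad_batch = [{"amount": 10.0}, {"amount": -3.0}]

loaded = gated_load(good_batch, find_failures, warehouse.extend)

halted_with: List[str] = []
try:
    gated_load(bad_batch, find_failures, warehouse.extend)
except DataQualityError as error:
    halted_with = error.failed_expectations

print(loaded, len(warehouse), halted_with)
```

The bad batch never reaches the destination, and the caller gets the failed rule names as structured data rather than having to parse an error string.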
Great Expectations Cheat Sheet
| Concept | What It Is | When to Use It |
|---|---|---|
| Expectation | Single assertion about data | Define quality rules for each column |
| Expectation Suite | Named collection of expectations | Group related expectations per dataset |
| Batch | A unit of data to validate | The actual data being checked |
| Validation Definition | Links a suite to a batch definition | Defines what to validate |
| Checkpoint | Runs validation with actions | Pipeline integration and alerting |
| Data Docs | Generated documentation site | Sharing quality status with stakeholders |
| mostly | Tolerance parameter | Allow small fractions of failures |
| result_format | Verbosity of results | BOOLEAN_ONLY, BASIC, SUMMARY, or COMPLETE |
Common Mistakes
Defining expectations with zero tolerance is the most common mistake for teams adopting Great Expectations on real data. Real production data has noise. Setting mostly=1.0 on every expectation means the first slightly messy batch fails validation entirely and the team starts treating alerts as noise. Start with realistic tolerances and tighten them as data quality improves.
Skipping table-level expectations focuses quality checks on individual columns while missing problems that only appear at the dataset level. Row count expectations catch truncated loads. Column count expectations catch schema changes. These are the simplest expectations to define and often the first ones to catch a real problem.
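The two table-level checks named above are simple enough to sketch without any library. The helper below is a hypothetical plain-Python illustration (`check_table_shape` is not a Great Expectations function) and the rows are made up. It shows how a row count bound flags a truncated load and how a column comparison flags a renamed field.

```python
from typing import Any, Dict, Iterable, List


def check_table_shape(
    rows: List[Dict[str, Any]],
    expected_columns: Iterable[str],
    min_rows: int,
    max_rows: int,
) -> List[str]:
    """Return human-readable problems; an empty list means the shape looks right."""
    problems: List[str] = []
    if not (min_rows <= len(rows) <= max_rows):
        problems.append(f"row count {len(rows)} outside [{min_rows}, {max_rows}]")
    expected = set(expected_columns)
    observed = set()
    for row in rows:
        observed.update(row.keys())
    missing = sorted(expected - observed)
    unexpected = sorted(observed - expected)
    if missing:
        problems.append(f"missing columns: {missing}")
    if unexpected:
        problems.append(f"unexpected columns: {unexpected}")
    return problems


expected_columns = ["transaction_id", "amount", "currency"]

# A load that stopped after one row
truncated_load = [{"transaction_id": 1, "amount": 9.99, "currency": "USD"}]

# An upstream change renamed "currency" to "currency_code"
renamed_column = [
    {"transaction_id": i, "amount": 9.99, "currency_code": "USD"} for i in range(5)
]

truncated_problems = check_table_shape(truncated_load, expected_columns, min_rows=3, max_rows=10)
schema_problems = check_table_shape(renamed_column, expected_columns, min_rows=3, max_rows=10)
print(truncated_problems)
print(schema_problems)
```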
Not integrating checkpoints into the pipeline defeats the purpose of writing expectations. Expectations that are only run manually when someone remembers to run them are not data quality testing. They are occasional data spot-checking. The value of Great Expectations is in automated, continuous validation that catches problems before they reach consumers.
Defining expectations without looking at the actual data distribution first produces expectations that immediately fail because they do not match reality. Always profile the data before writing expectations so your initial suite reflects what the data actually looks like, not what you assume it looks like.
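Profiling does not need to be elaborate. As a rough sketch, the hypothetical helper below (`profile_column` is an invented name and the values are made up) collects the handful of numbers that usually drive a first suite: null fraction, distinct count, and either the numeric range or the value counts.

```python
from collections import Counter
from typing import Any, Dict, Sequence


def profile_column(values: Sequence[Any]) -> Dict[str, Any]:
    """Summarize one column so tolerances can be based on observed data."""
    non_null = [value for value in values if value is not None]
    profile: Dict[str, Any] = {
        "row_count": len(values),
        "null_fraction": (len(values) - len(non_null)) / len(values) if values else 0.0,
        "distinct_count": len(set(non_null)),
    }
    is_numeric = bool(non_null) and all(
        isinstance(value, (int, float)) and not isinstance(value, bool)
        for value in non_null
    )
    if is_numeric:
        profile["min"] = min(non_null)
        profile["max"] = max(non_null)
    else:
        profile["value_counts"] = dict(Counter(non_null))
    return profile


# Made-up columns resembling the tutorial dataset
currencies = ["USD"] * 14 + ["EUR"] * 3 + ["INVALID"] * 1 + [None] * 2
amounts = [12.5, 80.0, -3.0, 410.25]

currency_profile = profile_column(currencies)
amount_profile = profile_column(amounts)
print(currency_profile)
print(amount_profile)
```

In this made-up sample, 17 of the 18 non-null currency values are recognized codes (about 94 percent), so a set-membership rule with mostly=0.98 would fail on its first run. Seeing that number before writing the rule is the whole point of profiling.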
FAQs
What is Great Expectations and why should data engineers use it?
Great Expectations is an open source Python framework for data quality testing. It lets you define explicit assertions about your data, called expectations, run them automatically as part of your pipelines, and generate documentation of your data quality standards. Data engineers should use it because silent data quality failures are one of the most common causes of incorrect analytics and model degradation in production. Great Expectations catches these failures at the pipeline level before they reach downstream consumers.
What is an expectation in Great Expectations?
An expectation is a declarative assertion about a property of your data. Examples include: a column should never be null; values in a column should be between zero and one hundred; a column should only contain specific values; and the table should have between ten thousand and one million rows. Great Expectations provides over fifty built-in expectation types covering nullability, uniqueness, value ranges, set membership, regular expression matching, statistical distributions, and more. Custom expectations handle domain-specific rules the built-in set does not cover.
What is the mostly parameter in Great Expectations?
The mostly parameter sets the minimum fraction of rows that must satisfy an expectation for it to pass. A mostly of 0.99 means the expectation passes if at least 99 percent of rows meet the condition. It is essential for working with real-world data that inevitably has some level of noise or exceptions. Without mostly, a single unexpected null value in a million-row dataset fails the not-null expectation and halts the pipeline. Setting appropriate mostly values lets you enforce meaningful quality standards while tolerating realistic data imperfections.
How does Great Expectations integrate with Airflow or dbt?
For Airflow, run a checkpoint inside a PythonOperator task positioned after data loading and before downstream transformation tasks. If the task raises an exception when the checkpoint result is unsuccessful, Airflow marks the task as failed and stops the DAG from proceeding. dbt does not run Great Expectations itself. The closest equivalent is the dbt-expectations package, which reimplements many Great Expectations-style assertions as dbt tests that run in SQL inside the warehouse. Either way, data quality checks become first-class pipeline steps rather than optional manual processes.
How is Great Expectations different from dbt tests?
dbt tests are defined in YAML configuration files and run as part of the dbt build process, making them the natural choice for testing data that lives in a SQL warehouse and is managed by dbt models. Great Expectations is a Python framework that works with any data source including pandas DataFrames, databases, and files, making it more flexible for validating data before it enters the warehouse or in non-dbt pipelines. Teams often use both: dbt tests for warehouse-native data quality and Great Expectations for pipeline-level validation of incoming data.