Great Expectations Tutorial for Data Quality Testing

Bad data is quiet. It does not announce itself. It flows through your pipelines, populates your dashboards, and informs decisions while being wrong in ways that are difficult to detect without actively looking for problems.

The customer lifetime value calculation that produces negative numbers because refund records were double-counted. The churn model that silently receives a column of all nulls because an upstream schema changed. The marketing report that overstates conversion by fourteen percent because a timestamp filter has an off-by-one error. These problems share a common characteristic: they were preventable with data quality checks that nobody put in place.

Great Expectations is one of the most widely adopted open source frameworks for data quality testing in Python. It lets you define explicit assertions about your data, run them automatically as part of your pipelines, generate human-readable documentation of what your data is supposed to look like, and get immediate alerts when reality diverges from expectation.

This tutorial walks through the complete Great Expectations workflow. By the end you will have working data quality tests that you can drop into any pipeline.

What Great Expectations Does

Great Expectations works by letting you define expectations about your data and then validating actual data against those expectations. An expectation is a declarative assertion: this column should never be null, values in this column should fall between zero and one, this column should only contain these specific values, the row count should be between ten thousand and fifty thousand.
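
To make the idea concrete, here is a minimal plain-Python sketch of what a declarative expectation amounts to, independent of the library. The function name expect_values_between and the keys of the returned dictionary are illustrative assumptions, not the Great Expectations API.

```python
from typing import Iterable, Optional


def expect_values_between(
    values: Iterable[Optional[float]], min_value: float, max_value: float
) -> dict:
    """Hypothetical stand-in for an expectation: every non-null value
    must lie in the closed interval [min_value, max_value]."""
    non_null = [v for v in values if v is not None]
    unexpected = [v for v in non_null if not (min_value <= v <= max_value)]
    return {
        "success": len(unexpected) == 0,
        "element_count": len(non_null),
        "unexpected_count": len(unexpected),
        "unexpected_values": unexpected,
    }


# "Values in this column should fall between zero and one"
result = expect_values_between([0.1, 0.5, 1.2, None, 0.9], 0.0, 1.0)
print(result["success"], result["unexpected_values"])
```

The rule is stated once as data (a column, a minimum, a maximum) and the check reports what it observed rather than raising immediately, which is the pattern the framework generalizes.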

The framework runs those expectations against your actual data and tells you which passed and which failed. It generates a data documentation site that shows every expectation defined for every dataset, the last validation result, and historical pass rates. And it fits into pipelines built with tools like Airflow and dbt so that quality checks run automatically as part of your pipeline rather than as a manual afterthought.

Installation and Project Setup

bash

pip install great-expectations pandas sqlalchemy

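Great Expectations uses a context object that manages your configuration, data sources, expectation suites, and checkpoints. The code in this tutorial uses the fluent Python API as it exists in the 1.x releases (context.data_sources, context.suites, gx.ValidationDefinition), which is significantly cleaner than the legacy file-based configuration. Older 0.x releases expose different method names, so check the printed version if a call is missing.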

python

import great_expectations as gx
import pandas as pd
import numpy as np

# Create a data context (manages all GX configuration)
context = gx.get_context()

print(f"Great Expectations version: {gx.__version__}")
print(f"Context type: {type(context)}")

Loading Sample Data

For this tutorial, use a realistic customer transaction dataset with intentional data quality problems embedded so you can see expectations catch them.

python

np.random.seed(42)
n_rows = 5000

# Generate realistic transaction data with embedded quality issues
data = {
    "transaction_id": range(1, n_rows + 1),
    "customer_id": np.random.randint(1000, 9999, n_rows),
    "transaction_date": pd.date_range("2024-01-01", periods=n_rows, freq="h"),
    "amount": np.random.lognormal(mean=4.0, sigma=1.2, size=n_rows).round(2),
    "currency": np.random.choice(
        ["USD", "EUR", "GBP", "INVALID", None],
        n_rows,
        p=[0.7, 0.15, 0.1, 0.03, 0.02]
    ),
    "status": np.random.choice(
        ["completed", "pending", "failed", "refunded", "UNKNOWN"],
        n_rows,
        p=[0.75, 0.1, 0.08, 0.05, 0.02]
    ),
    "merchant_category": np.random.choice(
        ["retail", "food", "travel", "entertainment", None],
        n_rows,
        p=[0.35, 0.30, 0.20, 0.12, 0.03]
    )
}

df = pd.DataFrame(data)

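# Introduce data quality problems
# Select each set of rows once. Assigning a Series taken from a different
# random index would align on index labels and fill unmatched rows with NaN.
negative_idx = np.random.choice(df.index, 50, replace=False)
df.loc[negative_idx, "amount"] = -df.loc[negative_idx, "amount"].abs()

# Overwrite 30 transaction IDs with IDs copied from other rows to create duplicates
target_idx = np.random.choice(df.index, 30, replace=False)
source_idx = np.random.choice(df.index, 30, replace=False)
df.loc[target_idx, "transaction_id"] = df.loc[source_idx, "transaction_id"].to_numpy()

# Blank out 25 customer IDs
df.loc[np.random.choice(df.index, 25, replace=False), "customer_id"] = None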

print(f"Dataset shape: {df.shape}")
print(f"\nSample data:")
print(df.head())
print(f"\nNull counts:\n{df.isnull().sum()}")

Connecting a Data Source

Great Expectations needs to know where your data lives. You connect data sources and create batch definitions that tell the framework how to load your data for validation.

python

# Add a pandas data source
data_source = context.data_sources.add_pandas(name="transaction_data_source")

# Add a data asset representing the transactions table
data_asset = data_source.add_dataframe_asset(name="transactions")

# Create a batch definition for how to load the data
batch_definition = data_asset.add_batch_definition_whole_dataframe(
    name="full_transactions_batch"
)

# Create a batch from the dataframe
batch = batch_definition.get_batch(
    batch_parameters={"dataframe": df}
)

print("Data source connected successfully")
print(f"Batch loaded with {len(df)} rows")

Building an Expectation Suite

An expectation suite is a named collection of expectations about a dataset. Think of it as a test file for your data. You define the suite once and run it every time new data arrives.

python

# Create an expectation suite
suite = context.suites.add(
    gx.ExpectationSuite(name="transactions_quality_suite")
)

print(f"Created suite: {suite.name}")

Now define individual expectations. Each expectation targets a specific quality dimension of the data.

python

from great_expectations.expectations import (
    ExpectColumnValuesToNotBeNull,
    ExpectColumnValuesToBeUnique,
    ExpectColumnValuesToBeBetween,
    ExpectColumnValuesToBeInSet,
    ExpectColumnValuesToMatchRegex,
    ExpectTableRowCountToBeBetween,
    ExpectTableColumnCountToEqual,
    ExpectColumnProportionOfUniqueValuesToBeBetween,
    ExpectColumnMeanToBeBetween,
    ExpectColumnMedianToBeBetween
)

# Table-level expectations
suite.add_expectation(
    ExpectTableRowCountToBeBetween(min_value=1000, max_value=100000)
)

suite.add_expectation(
    ExpectTableColumnCountToEqual(value=7)
)

# Primary key expectations
suite.add_expectation(
    ExpectColumnValuesToNotBeNull(column="transaction_id")
)

suite.add_expectation(
    ExpectColumnValuesToBeUnique(column="transaction_id")
)

# Customer ID expectations
suite.add_expectation(
    ExpectColumnValuesToNotBeNull(
        column="customer_id",
        mostly=0.99  # Allow up to 1% nulls
    )
)

suite.add_expectation(
    ExpectColumnValuesToBeBetween(
        column="customer_id",
        min_value=1000,
        max_value=9999
    )
)

# Amount expectations
suite.add_expectation(
    ExpectColumnValuesToNotBeNull(column="amount")
)

suite.add_expectation(
    ExpectColumnValuesToBeBetween(
        column="amount",
        min_value=0,
        max_value=100000,
        mostly=0.999
    )
)

suite.add_expectation(
    ExpectColumnMeanToBeBetween(
        column="amount",
        min_value=50,
        max_value=500
    )
)

# Categorical expectations
suite.add_expectation(
    ExpectColumnValuesToBeInSet(
        column="currency",
        value_set=["USD", "EUR", "GBP", "JPY", "CAD"],
        mostly=0.98
    )
)

suite.add_expectation(
    ExpectColumnValuesToBeInSet(
        column="status",
        value_set=["completed", "pending", "failed", "refunded"],
        mostly=0.99
    )
)

# Null rate expectations for optional fields
suite.add_expectation(
    ExpectColumnValuesToNotBeNull(
        column="merchant_category",
        mostly=0.95
    )
)

print(f"Suite contains {len(suite.expectations)} expectations")

The mostly parameter is one of the most important concepts in Great Expectations. It sets the minimum fraction of rows that must meet the expectation. A mostly of 0.99 means the expectation passes if at least 99 percent of rows satisfy the condition, allowing for a realistic tolerance in messy real-world data without treating every minor anomaly as a failure.
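
The arithmetic behind mostly can be sketched in a few lines of plain Python. This is an illustration of the threshold logic only, using a hypothetical helper name; it is not how the library implements it internally.

```python
def passes_with_mostly(total_rows: int, unexpected_rows: int, mostly: float = 1.0) -> bool:
    """Return True when the fraction of rows meeting the condition is at least `mostly`."""
    if total_rows == 0:
        # Nothing can violate the rule; treating this as a pass is an
        # assumption of this sketch.
        return True
    fraction_ok = (total_rows - unexpected_rows) / total_rows
    return fraction_ok >= mostly


# 25 null customer_id values out of 5000 rows: 99.5% of rows are non-null
tolerant = passes_with_mostly(5000, 25, mostly=0.99)
# The same data with zero tolerance
strict = passes_with_mostly(5000, 25, mostly=1.0)
print(tolerant, strict)
```

With the sample dataset, the customer_id not-null expectation passes at mostly=0.99 because only 0.5 percent of rows are null, while the same check with no tolerance would fail.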

Running Validation

Validation runs the expectation suite against a batch of data and produces a validation result object containing detailed pass/fail information for every expectation.

python

# Create a validation definition
validation_definition = context.validation_definitions.add(
    gx.ValidationDefinition(
        name="transactions_validation",
        data=batch_definition,
        suite=suite
    )
)

# Run validation
validation_result = validation_definition.run(
    batch_parameters={"dataframe": df}
)

# Inspect results
print(f"Validation success: {validation_result.success}")
print(f"\nResults summary:")

passed = 0
failed = 0

for result in validation_result.results:
    status = "PASS" if result.success else "FAIL"
    expectation_type = result.expectation_config.type
    kwargs = result.expectation_config.kwargs

    if not result.success:
        failed += 1
        print(f"  {status}: {expectation_type}")
        if hasattr(result, "result") and result.result:
            print(f"         Details: {result.result}")
    else:
        passed += 1

print(f"\nPassed: {passed}")
print(f"Failed: {failed}")
print(f"Total: {passed + failed}")
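
When a suite grows, a per-type tally of failures can be easier to scan than the full loop output. The helper below is a small sketch with a hypothetical name; it assumes each result exposes the same .success and .expectation_config.type attributes used in the loop above, and it is demonstrated on stand-in objects rather than real validation results.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_failures(results) -> Counter:
    """Count failed results per expectation type."""
    return Counter(r.expectation_config.type for r in results if not r.success)


def _fake(success: bool, type_name: str) -> SimpleNamespace:
    # Stand-in for a validation result, for illustration only
    return SimpleNamespace(
        success=success,
        expectation_config=SimpleNamespace(type=type_name),
    )


fake_results = [
    _fake(True, "expect_table_row_count_to_be_between"),
    _fake(False, "expect_column_values_to_be_unique"),
    _fake(False, "expect_column_values_to_be_in_set"),
    _fake(False, "expect_column_values_to_be_in_set"),
]

summary = summarize_failures(fake_results)
for expectation_type, count in summary.most_common():
    print(f"{expectation_type}: {count} failed")
```

In the tutorial's flow you would pass validation_result.results instead of fake_results.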

Checkpoints for Pipeline Integration

A checkpoint connects a validation definition to one or more actions that run after validation. The most common action is saving the validation result to build the data docs site, but checkpoints also support Slack notifications, email alerts, and custom Python callbacks.

python

# Create a checkpoint with actions
checkpoint = context.checkpoints.add(
    gx.Checkpoint(
        name="transactions_checkpoint",
        validation_definitions=[validation_definition],
        actions=[
            gx.checkpoint.UpdateDataDocsAction(name="update_data_docs")
        ],
        result_format={
            "result_format": "COMPLETE",
            "include_unexpected_rows": True,
            "unexpected_index_column_names": ["transaction_id"]
        }
    )
)

# Run the checkpoint
checkpoint_result = checkpoint.run(
    batch_parameters={"dataframe": df}
)

print(f"Checkpoint passed: {checkpoint_result.success}")

In a production pipeline, the checkpoint run replaces the manual validation call. You call it from a task in your Airflow DAG, or as a separate step that runs after your dbt job, so validation happens automatically every time data is processed.

Custom Expectations

Great Expectations ships with over fifty built-in expectations covering the most common quality dimensions. For domain-specific rules that the built-in set does not cover, you write custom expectations.

python

# BatchExpectation is defined in the expectation module; the unused
# execution-engine and renderer imports have been dropped.
from great_expectations.expectations.expectation import BatchExpectation

class ExpectColumnValuesToBePositiveForCompletedTransactions(BatchExpectation):
    """
    Custom expectation: amount must be positive for completed transactions.
    Allows negative amounts only for refunded status.
    """

    metric_dependencies = ("table.columns",)
    success_keys = ("amount_column", "status_column")

    def _validate(self, metrics, runtime_configuration=None, execution_engine=None):
        df = execution_engine.batch_manager.active_batch_data.dataframe

        amount_col = self.kwargs.get("amount_column", "amount")
        status_col = self.kwargs.get("status_column", "status")

        completed = df[df[status_col] == "completed"]

        if len(completed) == 0:
            return {"success": True, "result": {"observed_value": "No completed transactions"}}

        negative_completed = completed[completed[amount_col] < 0]
        failed_count = len(negative_completed)
        total_count = len(completed)

        success = failed_count == 0

        return {
            "success": success,
            "result": {
                "observed_value": f"{failed_count} negative amounts in completed transactions",
                "total_completed": total_count,
                "failed_count": failed_count,
                "unexpected_percent": round(failed_count / total_count * 100, 2)
                if total_count > 0 else 0
            }
        }
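
Before wiring a rule like this into the framework, it can help to test the core logic in isolation. The sketch below restates the same business rule over plain dictionaries; the function name and sample rows are hypothetical, and skipping null amounts is a choice made for this illustration.

```python
def find_negative_completed(rows, amount_key="amount", status_key="status"):
    """Return rows whose status is 'completed' but whose amount is negative."""
    return [
        row
        for row in rows
        if row.get(status_key) == "completed"
        and row.get(amount_key) is not None
        and row[amount_key] < 0
    ]


sample_rows = [
    {"transaction_id": 1, "amount": 25.0, "status": "completed"},
    {"transaction_id": 2, "amount": -10.0, "status": "refunded"},   # allowed
    {"transaction_id": 3, "amount": -5.0, "status": "completed"},   # violation
    {"transaction_id": 4, "amount": None, "status": "completed"},   # skipped: no amount
]

violations = find_negative_completed(sample_rows)
violation_ids = [row["transaction_id"] for row in violations]
print(violation_ids)
```

Keeping the rule in a small pure function makes it easy to unit test, and the custom expectation then only has to adapt the result into the success and result structure shown above.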

For simpler custom rules, column map expectations are more concise. They apply a condition to each row and report the fraction that fails. The skeleton below only declares the expectation; the metric named in map_metric must also be registered through a ColumnMapMetricProvider subclass before the expectation can run.

python

from great_expectations.expectations.expectation import ColumnMapExpectation

class ExpectAmountToBeReasonableForCurrency(ColumnMapExpectation):
    """Amount should be under 10000 for EUR transactions."""

    # Name of the row-level metric this expectation evaluates.
    # The metric itself is defined separately by a ColumnMapMetricProvider.
    map_metric = "column_values.amount_reasonable_for_currency"

    # ColumnMapExpectation already provides the mostly argument
    success_keys = ("mostly",)

Generating Data Docs

Data docs are the human-readable documentation site Great Expectations generates from your expectations and validation results. They show every expectation defined for every dataset, the last validation status, and historical pass rates over time.

python

# Build and view data docs
context.build_data_docs()

# Open in browser (works locally)
# context.open_data_docs()

print("Data docs built successfully")
print("Open the generated HTML to review expectation results")

The data docs site is a static HTML site that can be hosted on S3, GCS, or any static hosting service. For teams, hosting data docs centrally gives every stakeholder visibility into data quality without needing to run Python themselves.

Integrating With a Pipeline

In production, Great Expectations runs as part of your data pipeline. Here is how it looks embedded in an Airflow-style batch job.

python

def run_daily_transaction_validation(df: pd.DataFrame) -> bool:
    """
    Run data quality validation as part of daily pipeline.
    Returns True if validation passes, raises exception if it fails.
    """
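    # Assumes a persistent context (for example gx.get_context(mode="file"))
    # in which the suite was saved earlier; a fresh in-memory context
    # would not contain it.
    context = gx.get_context()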

    data_source = context.data_sources.add_or_update_pandas(
        name="pipeline_data_source"
    )
    data_asset = data_source.add_dataframe_asset(name="daily_transactions")
    batch_definition = data_asset.add_batch_definition_whole_dataframe(
        name="daily_batch"
    )

    # Load existing suite rather than redefining it
    suite = context.suites.get("transactions_quality_suite")

    validation_definition = context.validation_definitions.add_or_update(
        gx.ValidationDefinition(
            name="daily_transactions_validation",
            data=batch_definition,
            suite=suite
        )
    )

    result = validation_definition.run(
        batch_parameters={"dataframe": df}
    )

    if not result.success:
        failed_expectations = [
            r.expectation_config.type
            for r in result.results
            if not r.success
        ]
        raise ValueError(
            f"Data quality validation failed. "
            f"Failed expectations: {failed_expectations}"
        )

    print(f"Validation passed: {len(result.results)} expectations checked")
    return True

# Simulate pipeline execution
try:
    run_daily_transaction_validation(df)
    print("Pipeline proceeding with validated data")
except ValueError as e:
    print(f"Pipeline halted: {e}")

Raising an exception on validation failure stops the pipeline before bad data propagates downstream. The failed expectation names in the error message tell the on-call engineer exactly what went wrong without requiring them to dig through logs.
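
One possible refinement, sketched here as an assumption rather than as part of the tutorial's code, is a dedicated exception type. It lets the orchestrator or alerting code distinguish data quality failures from unrelated ValueErrors and read the failed expectation names without parsing the message. The names DataQualityError and halt_on_failures are hypothetical.

```python
class DataQualityError(Exception):
    """Hypothetical exception that carries the failed expectation names."""

    def __init__(self, failed_expectations):
        self.failed_expectations = list(failed_expectations)
        super().__init__(
            "Data quality validation failed: " + ", ".join(self.failed_expectations)
        )


def halt_on_failures(failed_expectations) -> None:
    """Raise DataQualityError if any expectation failed; otherwise do nothing."""
    if failed_expectations:
        raise DataQualityError(failed_expectations)


caught = None
try:
    halt_on_failures(
        ["expect_column_values_to_be_unique", "expect_column_values_to_be_in_set"]
    )
except DataQualityError as err:
    caught = err
    print(f"Pipeline halted: {err}")
```

Inside run_daily_transaction_validation, the raise ValueError(...) call could be swapped for halt_on_failures(failed_expectations) with the same effect on the pipeline.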

Great Expectations Cheat Sheet

Concept | What It Is | When to Use It
Expectation | Single assertion about data | Define quality rules for each column
Expectation Suite | Named collection of expectations | Group related expectations per dataset
Batch | A unit of data to validate | The actual data being checked
Validation Definition | Links a suite to a batch definition | Defines what to validate
Checkpoint | Runs validation with actions | Pipeline integration and alerting
Data Docs | Generated documentation site | Sharing quality status with stakeholders
mostly | Tolerance parameter | Allow small fractions of failures
result_format | Verbosity of results | Choose BOOLEAN_ONLY, BASIC, SUMMARY, or COMPLETE

Common Mistakes

Defining expectations with zero tolerance is the most common mistake for teams adopting Great Expectations on real data. Real production data has noise. Setting mostly=1.0 on every expectation means the first slightly messy batch fails validation entirely and the team starts treating alerts as noise. Start with realistic tolerances and tighten them as data quality improves.

Skipping table-level expectations focuses quality checks on individual columns while missing problems that only appear at the dataset level. Row count expectations catch truncated loads. Column count expectations catch schema changes. These are the simplest expectations to define and often the first ones to catch a real problem.
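
The kind of problem table-level checks catch can be illustrated without the library. The sketch below is plain Python over a list of row dictionaries, with a hypothetical function name; in the tutorial itself the equivalent checks are ExpectTableRowCountToBeBetween and ExpectTableColumnCountToEqual.

```python
def check_table_shape(rows, expected_columns, min_rows, max_rows):
    """Return a list of human-readable problems with row count or columns."""
    problems = []
    if not (min_rows <= len(rows) <= max_rows):
        problems.append(f"row count {len(rows)} outside [{min_rows}, {max_rows}]")

    # Collect every column name that appears in any row
    observed = set()
    for row in rows:
        observed.update(row.keys())

    missing = sorted(set(expected_columns) - observed)
    extra = sorted(observed - set(expected_columns))
    if missing:
        problems.append(f"missing columns: {missing}")
    if extra:
        problems.append(f"unexpected columns: {extra}")
    return problems


# A truncated load in which an upstream change renamed a column
rows = [{"transaction_id": 1, "amt": 10.0}, {"transaction_id": 2, "amt": 12.5}]
problems = check_table_shape(
    rows, expected_columns=["transaction_id", "amount"], min_rows=1000, max_rows=100000
)
for problem in problems:
    print(problem)
```

No column-level rule on amount would have run meaningfully here, because the column no longer exists; the table-level checks are what surface the failure.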

Not integrating checkpoints into the pipeline defeats the purpose of writing expectations. Expectations that are only run manually when someone remembers to run them are not data quality testing. They are occasional data spot-checking. The value of Great Expectations is in automated, continuous validation that catches problems before they reach consumers.

Defining expectations without looking at the actual data distribution first produces expectations that immediately fail because they do not match reality. Always profile the data before writing expectations so your initial suite reflects what the data actually looks like, not what you assume it looks like.
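
A lightweight way to profile before writing a range expectation is to derive candidate bounds from percentiles of observed values. The sketch below uses only the standard library; the function name and the choice of the 1st and 99th percentiles are assumptions for illustration, and the suggested bounds should still be reviewed against domain knowledge.

```python
import statistics


def suggest_bounds(values, lower_pct=1, upper_pct=99):
    """Suggest (min_value, max_value) from percentiles of the non-null values."""
    clean = sorted(v for v in values if v is not None)
    # quantiles(n=100) returns 99 cut points; cut_points[i] is the (i + 1)-th percentile
    cut_points = statistics.quantiles(clean, n=100, method="inclusive")
    return cut_points[lower_pct - 1], cut_points[upper_pct - 1]


observed = list(range(1, 101)) + [None]  # toy data: 1..100 plus one null
low, high = suggest_bounds(observed)
print(f"Suggested range: {low:.2f} to {high:.2f}")
```

Bounds obtained this way can seed min_value and max_value in a range expectation, with mostly covering the small tail that falls outside them.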

FAQs

What is Great Expectations and why should data engineers use it?

Great Expectations is an open source Python framework for data quality testing. It lets you define explicit assertions about your data, called expectations, run them automatically as part of your pipelines, and generate documentation of your data quality standards. Data engineers should use it because silent data quality failures are one of the most common causes of incorrect analytics and model degradation in production. Great Expectations catches these failures at the pipeline level before they reach downstream consumers.

What is an expectation in Great Expectations?

An expectation is a declarative assertion about a property of your data. Examples include a column should never be null, values in this column should be between zero and one hundred, this column should only contain these specific values, and the table should have between ten thousand and one million rows. Great Expectations provides over fifty built-in expectation types covering nullability, uniqueness, value ranges, set membership, regular expression matching, statistical distributions, and more. Custom expectations handle domain-specific rules the built-in set does not cover.

What is the mostly parameter in Great Expectations?

The mostly parameter sets the minimum fraction of rows that must satisfy an expectation for it to pass. A mostly of 0.99 means the expectation passes if at least 99 percent of rows meet the condition. It is essential for working with real-world data that inevitably has some level of noise or exceptions. Without mostly, a single unexpected null value in a million-row dataset fails the not-null expectation and halts the pipeline. Setting appropriate mostly values lets you enforce meaningful quality standards while tolerating realistic data imperfections.

How does Great Expectations integrate with Airflow or dbt?

For Airflow, run a checkpoint inside a PythonOperator task positioned after data loading and before downstream transformation tasks. If the task raises an exception when the checkpoint fails, Airflow marks the task as failed and stops the DAG from proceeding. For dbt, the dbt-expectations package, which is inspired by Great Expectations, implements many of the same assertions as native dbt tests; alternatively, run a Great Expectations checkpoint as a separate step after dbt finishes. Both approaches turn data quality checks into first-class pipeline steps rather than optional manual processes.

How is Great Expectations different from dbt tests?

dbt tests are defined in YAML configuration files and run as part of the dbt build process, making them the natural choice for testing data that lives in a SQL warehouse and is managed by dbt models. Great Expectations is a Python framework that works with any data source including pandas DataFrames, databases, and files, making it more flexible for validating data before it enters the warehouse or in non-dbt pipelines. Teams often use both: dbt tests for warehouse-native data quality and Great Expectations for pipeline-level validation of incoming data.
