Pydantic Tutorial for Data Applications

Data applications often receive information from unreliable sources.

Examples include:

  • CSV files
  • APIs
  • Databases
  • Event streams
  • User input
  • Third-party services

Unfortunately, incoming data is not always clean.

You may encounter:

  • Missing fields
  • Invalid data types
  • Incorrect dates
  • Negative values
  • Unexpected formats

Manually validating data using dozens of if statements quickly becomes difficult to maintain.

This is where Pydantic becomes extremely useful.

Pydantic is one of the most popular Python data validation libraries. It uses Python type hints to automatically validate, parse, and structure data, making applications more reliable and easier to maintain.

In this tutorial, you’ll learn how Pydantic works and how data professionals use it in ETL pipelines, analytics workflows, and machine learning projects.

What Is Pydantic?

Pydantic is a Python library that validates and parses data using type annotations. It automatically converts incoming data into structured Python objects while enforcing schemas and business rules.

Pydantic allows developers to define data models using Python classes.

Example:

from pydantic import BaseModel

class Customer(BaseModel):
    id: int
    name: str
    revenue: float

The model defines the expected structure of incoming data.

When data is provided, Pydantic validates it and creates a strongly typed object.

Why Data Teams Use Pydantic

Data applications frequently process data from multiple sources.

Without validation:

customer = {
    "id": "abc",
    "revenue": "unknown"
}

Nothing stops this dictionary from moving through the pipeline, so the bad values may not surface until a much later step fails.

With Pydantic:

Customer(**customer)

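Validation occurs immediately: this call raises a ValidationError that reports the non-numeric id, the missing name, and the non-numeric revenue.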

This helps catch data quality issues early.

Installing Pydantic

Install using pip:

pip install pydantic

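Pydantic supports modern Python type hints and provides fast validation powered by a Rust-based core. The examples in this tutorial use the Pydantic v2 API (for example field_validator and model_dump).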

Creating Your First Model

Example:

from pydantic import BaseModel

class Product(BaseModel):
    product_id: int
    name: str
    price: float

Create an instance:

product = Product(
    product_id=1,
    name="Laptop",
    price=1200
)

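Inspecting product in an interactive session shows:

Product(product_id=1, name='Laptop', price=1200.0)

The data is validated automatically, and the integer price 1200 is stored as the float 1200.0.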

Automatic Type Conversion

One powerful feature is type coercion.

Example:

product = Product(
    product_id="1",
    name="Laptop",
    price="1200"
)

Result:

product.product_id

Output:

1

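Pydantic converts compatible values into the expected types automatically: here the string "1" becomes the integer 1 and "1200" becomes the float 1200.0. Values that cannot be converted, such as "abc" for an integer field, raise a validation error instead.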
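As a minimal sketch of where coercion stops (assuming Pydantic v2), the example below repeats the Product model, checks the converted types, and then validates the same input in strict mode, where strings are no longer accepted for numeric fields. The raw dictionary is illustrative.

```python
from pydantic import BaseModel, ValidationError


class Product(BaseModel):
    product_id: int
    name: str
    price: float


raw = {"product_id": "1", "name": "Laptop", "price": "1200"}

# Default (lax) mode: numeric strings are converted to int and float.
product = Product.model_validate(raw)
print(type(product.product_id).__name__, type(product.price).__name__)

# Strict mode: the same strings are rejected instead of converted.
try:
    Product.model_validate(raw, strict=True)
    strict_failed_fields = []
except ValidationError as exc:
    strict_failed_fields = [err["loc"][0] for err in exc.errors()]

print(strict_failed_fields)
```

Lax mode is convenient for text sources such as CSV files, where every value arrives as a string. Strict mode may suit sources that should already be typed, such as parsed JSON from an internal service.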

Validation Errors

Invalid data triggers clear exceptions.

Example:

Product(
    product_id="abc",
    name="Laptop",
    price="unknown"
)

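Result:

Pydantic raises a ValidationError that lists each failing field (here product_id and price) together with the reason and the rejected input.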

This makes debugging much easier than manual validation.
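One way to work with these errors programmatically is to catch the exception and read its errors() list. The sketch below assumes Pydantic v2 and reuses the Product model from earlier in the tutorial.

```python
from pydantic import BaseModel, ValidationError


class Product(BaseModel):
    product_id: int
    name: str
    price: float


try:
    Product(product_id="abc", name="Laptop", price="unknown")
    errors = []
except ValidationError as exc:
    # errors() returns one dictionary per failing field.
    errors = exc.errors()

for err in errors:
    # "loc" is the field path, "type" a machine-readable code, "msg" a readable message.
    print(err["loc"], err["type"], err["msg"])
```

Because every failing field is reported in a single exception, one pass over a record is enough to log all of its problems.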

Required and Optional Fields

Required:

class Customer(BaseModel):
    id: int
    name: str

Optional:

from typing import Optional

class Customer(BaseModel):
    id: int
    name: Optional[str] = None

Optional fields are useful when processing incomplete datasets.
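A detail worth illustrating: in Pydantic v2, Optional[str] only means the value may be None. The field can be omitted only if it also has a default. In the sketch below, segment is a hypothetical field added to show the difference.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class Customer(BaseModel):
    id: int
    name: Optional[str] = None  # may be omitted; defaults to None
    segment: Optional[str]      # hypothetical field: may be None, but must be supplied


with_defaults = Customer(id=1, segment=None)
print(with_defaults)

try:
    Customer(id=2)  # segment is not supplied at all
    missing = []
except ValidationError as exc:
    missing = [err["loc"][0] for err in exc.errors() if err["type"] == "missing"]

print(missing)
```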

Using Default Values

Example:

class Order(BaseModel):
    status: str = "Pending"

If no status is supplied, "Pending" is used automatically.
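A short sketch of defaults in use, assuming Pydantic v2. The tags field is hypothetical and shows default_factory, which builds a new value for each instance.

```python
from pydantic import BaseModel, Field


class Order(BaseModel):
    status: str = "Pending"
    # default_factory builds a fresh list for every instance (hypothetical field).
    tags: list[str] = Field(default_factory=list)


first = Order()
second = Order(status="Shipped", tags=["priority"])

# Changing one instance's list does not affect later instances.
first.tags.append("gift")
print(first, second, Order())
```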

Field Constraints

Pydantic supports validation rules.

Example:

from pydantic import BaseModel, Field

class Product(BaseModel):
    price: float = Field(gt=0)

Rule:

price > 0

A price of zero or any negative value fails validation.

This is useful for enforcing business rules.
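The sketch below combines several constraints on one model, assuming Pydantic v2. The name and stock fields, their limits, and the failing_fields helper are illustrative additions, not part of the original example.

```python
from pydantic import BaseModel, Field, ValidationError


class Product(BaseModel):
    name: str = Field(min_length=1)      # non-empty name
    price: float = Field(gt=0)           # strictly positive
    stock: int = Field(ge=0, le=10_000)  # hypothetical inventory range


def failing_fields(data: dict) -> list:
    """Return the names of the fields that fail validation (empty if valid)."""
    try:
        Product(**data)
        return []
    except ValidationError as exc:
        return [err["loc"][0] for err in exc.errors()]


ok = failing_fields({"name": "Laptop", "price": 1200, "stock": 5})
bad = failing_fields({"name": "", "price": -5, "stock": 20_000})
print(ok, bad)
```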

Validating Email Addresses

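EmailStr relies on the optional email-validator package, which can be installed together with Pydantic:

pip install "pydantic[email]"

Example:

from pydantic import BaseModel, EmailStr

class User(BaseModel):
    email: EmailStr

Pydantic checks that the value is a syntactically valid email address. It does not confirm that the mailbox exists.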

Custom Validators

Sometimes custom logic is required.

Example:

from pydantic import BaseModel, field_validator

class Employee(BaseModel):
    age: int

    @field_validator("age")
    @classmethod
    def validate_age(cls, value: int) -> int:
        # Reject employees younger than 18
        if value < 18:
            raise ValueError("Age must be 18+")
        return value

This ensures business-specific rules are enforced.
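To see the validator in action, the sketch below (assuming Pydantic v2) repeats the Employee model, creates one valid instance, and captures the message produced for an invalid one.

```python
from pydantic import BaseModel, ValidationError, field_validator


class Employee(BaseModel):
    age: int

    @field_validator("age")
    @classmethod
    def validate_age(cls, value: int) -> int:
        if value < 18:
            raise ValueError("Age must be 18+")
        return value


# By default the validator runs after type conversion, so it receives an int.
adult = Employee(age="30")

try:
    Employee(age=17)
    message = None
except ValidationError as exc:
    # The ValueError text is carried into the validation error message.
    message = exc.errors()[0]["msg"]

print(adult.age, message)
```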

Data Engineering Example: CSV Validation

Suppose a CSV contains:

customer_id,name,revenue
101,John,500
102,Sarah,700
103,David,abc

Model:

class Customer(BaseModel):
    customer_id: int
    name: str
    revenue: float

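Validation:

# rows is assumed to be an iterable of dictionaries, one per CSV row
for row in rows:
    Customer(**row)

The first invalid record, here the row whose revenue is abc, raises a ValidationError as soon as it is reached. To keep processing the remaining rows, wrap the call in try/except ValidationError and record the failure.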

This pattern is common in ETL pipelines. Community discussions frequently highlight row-level validation as a major Pydantic use case.
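A minimal, self-contained sketch of row-level validation that keeps going after a bad row. It assumes Pydantic v2 and uses an in-memory string in place of a real file; the valid and rejected lists are illustrative names.

```python
import csv
import io

from pydantic import BaseModel, ValidationError


class Customer(BaseModel):
    customer_id: int
    name: str
    revenue: float


# In-memory stand-in for a CSV file on disk.
csv_text = """customer_id,name,revenue
101,John,500
102,Sarah,700
103,David,abc
"""

valid, rejected = [], []
# start=2 because line 1 of the file is the header.
for line_no, row in enumerate(csv.DictReader(io.StringIO(csv_text)), start=2):
    try:
        valid.append(Customer(**row))
    except ValidationError as exc:
        rejected.append((line_no, row, exc.errors()))

print(len(valid), "valid;", len(rejected), "rejected")
```

Keeping the line number and the error details next to each rejected row makes it straightforward to write a reject file or report back to the data provider.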

Data Engineering Example: API Responses

Many pipelines consume APIs.

Example:

response = {
    "id": 10,
    "name": "Alice"
}

Model:

class User(BaseModel):
    id: int
    name: str

Validate:

user = User(**response)

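If validation succeeds, user is guaranteed to match the schema. If the response is missing a field or contains a value of the wrong type, a ValidationError is raised instead.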
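API clients often hold the response as raw JSON text rather than a dictionary. The sketch below, assuming Pydantic v2, validates a JSON string directly with model_validate_json; the payloads are illustrative.

```python
from pydantic import BaseModel, ValidationError


class User(BaseModel):
    id: int
    name: str


# Raw JSON text as it might arrive from an HTTP client (illustrative payload).
body = '{"id": 10, "name": "Alice", "plan": "free"}'

# Parses and validates in one step; unknown keys such as "plan" are ignored by default.
user = User.model_validate_json(body)

try:
    User.model_validate_json('{"id": 10}')
    problem = None
except ValidationError as exc:
    problem = exc.errors()[0]

print(user, problem["type"] if problem else None)
```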

Data Engineering Example: ETL Pipelines

ETL workflow:

Extract
   ↓
Validate
   ↓
Transform
   ↓
Load

Pydantic often serves as the validation layer.

Benefits include:

  • Consistent schemas
  • Early error detection
  • Better data quality
  • Easier debugging
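The four stages can be sketched as plain functions with Pydantic as the gate between extract and transform. Everything here is hypothetical scaffolding under Pydantic v2: the Sale schema, the sample records, and the in-memory warehouse list stand in for real sources and targets.

```python
from pydantic import BaseModel, Field, ValidationError


class Sale(BaseModel):  # hypothetical schema for this sketch
    order_id: int
    amount: float = Field(gt=0)


def extract() -> list[dict]:
    # Stand-in for reading from a file, API, or queue.
    return [
        {"order_id": "1", "amount": "19.99"},
        {"order_id": "2", "amount": "-5"},
        {"order_id": "x", "amount": "10"},
    ]


def validate(records: list[dict]) -> tuple[list[Sale], list[dict]]:
    # Split records into validated models and rejected raw dictionaries.
    good, bad = [], []
    for record in records:
        try:
            good.append(Sale(**record))
        except ValidationError:
            bad.append(record)
    return good, bad


def transform(sales: list[Sale]) -> list[dict]:
    # Example transformation: convert the amount to integer cents.
    return [{"order_id": s.order_id, "amount_cents": round(s.amount * 100)} for s in sales]


def load(rows: list[dict], target: list) -> None:
    # Stand-in for a database insert.
    target.extend(rows)


warehouse: list[dict] = []
sales, rejects = validate(extract())
load(transform(sales), warehouse)
print(warehouse, rejects)
```

Because transform only ever receives Sale instances, it can rely on order_id being an integer and amount being a positive float without re-checking them.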

Nested Models

Real-world data is often hierarchical.

Example:

from pydantic import BaseModel

class Address(BaseModel):
    city: str
    country: str

class Customer(BaseModel):
    id: int
    address: Address

Input:

customer = Customer(
    id=1,
    address={
        "city": "Lagos",
        "country": "Nigeria"
    }
)

Nested structures are validated automatically.
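When a nested value is invalid, the error location is reported as a path through the nesting. A small sketch, assuming Pydantic v2, with an illustrative second record that lacks a country:

```python
from pydantic import BaseModel, ValidationError


class Address(BaseModel):
    city: str
    country: str


class Customer(BaseModel):
    id: int
    address: Address


customer = Customer(id=1, address={"city": "Lagos", "country": "Nigeria"})
# The nested dictionary has become an Address instance.
print(type(customer.address).__name__, customer.address.city)

try:
    Customer(id=2, address={"city": "Accra"})  # country is missing
    error_path = None
except ValidationError as exc:
    # The location is a path through the nesting.
    error_path = exc.errors()[0]["loc"]

print(error_path)
```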

Converting Models to Dictionaries

Export data:

customer.model_dump()

Output:

{
    "id": 1,
    "address": {
        "city": "Lagos",
        "country": "Nigeria"
    }
}

This is useful when loading validated data into databases.
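model_dump also accepts include and exclude arguments, which can help when a target table needs only some of the fields. A brief sketch, assuming Pydantic v2 and the Customer and Address models from the nested example:

```python
from pydantic import BaseModel


class Address(BaseModel):
    city: str
    country: str


class Customer(BaseModel):
    id: int
    address: Address


customer = Customer(id=1, address={"city": "Lagos", "country": "Nigeria"})

full = customer.model_dump()  # nested models become nested dictionaries
without_address = customer.model_dump(exclude={"address"})
only_id = customer.model_dump(include={"id"})

print(full, without_address, only_id)
```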

Working with JSON

Convert models to JSON:

customer.model_dump_json()

This simplifies API integrations and event-driven systems.
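The reverse direction is available as well. The sketch below, assuming Pydantic v2, serializes a model to a JSON string and then rebuilds an equal model from that string with model_validate_json.

```python
from pydantic import BaseModel


class Address(BaseModel):
    city: str
    country: str


class Customer(BaseModel):
    id: int
    address: Address


customer = Customer(id=1, address={"city": "Lagos", "country": "Nigeria"})

payload = customer.model_dump_json()              # a str containing JSON
restored = Customer.model_validate_json(payload)  # parse and validate in one step

print(payload)
print(restored == customer)
```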

Environment Variable Validation

Pydantic works well with configuration management.

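Settings support lives in the separate pydantic-settings package:

pip install pydantic-settings

Example:

from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    database_url: str
    api_key: str

Instantiating Settings() reads the values from environment variables such as DATABASE_URL and API_KEY. Missing configuration values trigger validation errors during startup.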

This helps prevent deployment mistakes.

Machine Learning Example

Machine learning projects often receive prediction requests.

Example:

class PredictionInput(BaseModel):
    age: int
    income: float

Before inference:

# payload is the incoming request data as a dictionary
data = PredictionInput(**payload)

The machine learning model then receives clean, validated input.
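A minimal sketch of this step, assuming Pydantic v2. The value ranges and the to_features helper are assumptions made for illustration; a real service would pass the resulting features to its own model.

```python
from pydantic import BaseModel, Field, ValidationError


class PredictionInput(BaseModel):
    age: int = Field(ge=0, le=120)  # assumed plausible range
    income: float = Field(ge=0)     # assumed non-negative


def to_features(payload: dict) -> list[float]:
    """Validate a request payload and return features in a fixed order."""
    data = PredictionInput(**payload)
    return [float(data.age), data.income]


features = to_features({"age": "42", "income": "55000.5"})
print(features)

try:
    to_features({"age": 42, "income": "n/a"})
    rejected = False
except ValidationError:
    # The bad request is stopped before it reaches the model.
    rejected = True

print(rejected)
```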

Common Beginner Mistakes

Using Dictionaries Everywhere

Structured models are easier to maintain.

Ignoring Validation Errors

Always handle exceptions properly.

Overcomplicating Models

Start simple and expand gradually.

Validating Too Late

Validate immediately after data ingestion.

Skipping Type Hints

Type annotations drive Pydantic’s validation system.

Best Practices

Create Models for Every Data Source

Define schemas for:

  • CSV files
  • APIs
  • Database records
  • Event streams

Validate Early

Catch problems before transformations begin.

Use Field Constraints

Enforce business rules directly in models.

Reuse Models

Share schemas across multiple pipelines.

Keep Validation Separate from Business Logic

Models should focus on structure and quality checks.

Real-World Example

Imagine a sales pipeline receiving daily files.

Without Pydantic:

Load CSV
Transform Data
Pipeline Fails

Finding the bad record may take hours.

With Pydantic:

Load CSV
Validate Rows
Reject Invalid Records
Continue Processing

Errors are detected immediately and reported clearly.

Conclusion

Pydantic is one of the most valuable tools for data engineers, analysts, and machine learning practitioners because it simplifies data validation while keeping code clean and maintainable. By defining schemas with Python type hints, teams can automatically validate CSV files, API responses, configuration settings, and machine learning inputs.

As data systems grow more complex, using Pydantic helps ensure data quality, reduces bugs, and creates more reliable pipelines. For anyone building production-grade Python data applications, learning Pydantic is a worthwhile investment.

FAQ

What is Pydantic?

Pydantic is a Python library that validates and parses data using type hints and structured models.

Why is Pydantic useful in data engineering?

It helps validate incoming data, enforce schemas, and improve pipeline reliability.

Does Pydantic replace Pandas?

No. Pydantic focuses on validation, while Pandas focuses on data analysis and manipulation.

Can Pydantic validate CSV files?

Yes. CSV rows can be converted into Pydantic models for schema validation.

Is Pydantic used in production applications?

Yes. It is widely used in APIs, ETL pipelines, machine learning systems, and configuration management.
