How to Build a Data Pipeline in Python with Airflow

In the modern data ecosystem, data pipelines are the backbone of analytics, machine learning, and automation. They allow organizations to collect, transform, and move data seamlessly across systems, from APIs and databases to dashboards and AI models.

One of the most powerful tools for building and managing these pipelines is Apache Airflow, an open-source platform for scheduling, monitoring, and automating workflows.

In this guide, you’ll learn how to build a simple yet powerful data pipeline in Python using Airflow, from setup to deployment, complete with practical code examples.

What Is Apache Airflow?

Apache Airflow is a workflow orchestration tool developed by Airbnb and now maintained by the Apache Software Foundation. It allows you to define complex workflows as Directed Acyclic Graphs (DAGs) where each task represents a step in your data pipeline.

Why Use Airflow?

  • Automates repetitive data tasks
  • Handles dependencies and scheduling
  • Provides monitoring through a web UI
  • Integrates with AWS, GCP, and databases
  • Scales easily for big data workflows

Prerequisites

Before you begin, make sure you have:

  • Python 3.8+ installed
  • pip or conda package manager
  • Basic knowledge of pandas and SQL
  • Airflow installed via:
pip install apache-airflow

Initialize the Airflow environment:

airflow db init
airflow users create --username admin --firstname Data --lastname Engineer --role Admin --email admin@example.com
airflow webserver --port 8080   # run in one terminal
airflow scheduler               # run in a second terminal

Step 1: Create a Simple DAG

A DAG (Directed Acyclic Graph) defines your pipeline workflow.

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def extract_data():
    print("Extracting data from API...")

def transform_data():
    print("Transforming data...")

def load_data():
    print("Loading data into database...")

default_args = {
    'owner': 'codewithfimi',
    'start_date': datetime(2025, 1, 1),
    'retries': 1
}

with DAG(
    dag_id='simple_data_pipeline',
    default_args=default_args,
    schedule_interval='@daily',
    catchup=False
) as dag:
    
    extract = PythonOperator(task_id='extract', python_callable=extract_data)
    transform = PythonOperator(task_id='transform', python_callable=transform_data)
    load = PythonOperator(task_id='load', python_callable=load_data)

    extract >> transform >> load

Step 2: Understanding ETL (Extract, Transform, Load)

  • Extract: get data from APIs, CSVs, or databases. Example: pulling stock data via an API.
  • Transform: clean or modify data. Example: removing null values or aggregating data.
  • Load: store data in a database or data warehouse. Example: uploading to PostgreSQL or BigQuery.

Each step is handled by Airflow tasks connected through dependencies.
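
The three stages can be sketched as plain Python functions before wiring them into Airflow (the data here is illustrative, with no external services involved):

```python
def extract():
    # Extract: pretend these rows came from an API or a CSV file.
    return [{"price": 100.0}, {"price": None}, {"price": 103.5}]

def transform(rows):
    # Transform: drop rows with missing prices.
    return [r for r in rows if r["price"] is not None]

def load(rows):
    # Load: return the row count; a real task would write to a database.
    return len(rows)
```

Each function maps directly onto one Airflow task, which is why the DAG in Step 1 chains them with `extract >> transform >> load`.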

Step 3: Adding Real Data

Let’s extract and load real-world data using pandas and requests.

import pandas as pd
import requests

def extract_data():
    url = "https://api.coindesk.com/v1/bpi/currentprice.json"
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail fast so Airflow can retry the task
    data = response.json()
    df = pd.DataFrame([{
        "currency": "USD",
        "rate": data["bpi"]["USD"]["rate_float"],
        "time": data["time"]["updatedISO"]
    }])
    df.to_csv('/tmp/bitcoin_rate.csv', index=False)

Then add a load function:

import sqlite3

def load_data():
    # Use an absolute path: Airflow tasks may run from a different working directory.
    conn = sqlite3.connect('/tmp/bitcoin.db')
    df = pd.read_csv('/tmp/bitcoin_rate.csv')
    df.to_sql('bitcoin_rates', conn, if_exists='append', index=False)
    conn.close()
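
The DAG also has a transform step between extract and load. A minimal sketch, assuming the CSV layout written by extract_data, could clean the file in place:

```python
import pandas as pd

def clean_rates(df):
    # Drop rows missing a rate and round prices to two decimals.
    df = df.dropna(subset=["rate"]).copy()
    df["rate"] = df["rate"].round(2)
    return df

def transform_data():
    df = pd.read_csv("/tmp/bitcoin_rate.csv")
    clean_rates(df).to_csv("/tmp/bitcoin_rate.csv", index=False)
```

Keeping the cleaning logic in a separate function like `clean_rates` makes it easy to unit-test without touching the filesystem.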

This creates a daily Bitcoin price pipeline, fetching live data, cleaning it, and storing it in a local database.

Step 4: Monitor and Automate

Once your DAG runs successfully:

  • View execution logs from the Airflow UI
  • Set email alerts for failures
  • Use @hourly, @daily, or custom cron schedules
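
Retries, failure alerts, and cron schedules are all configured on the DAG. A sketch of stricter defaults (the email address is illustrative, and email alerts also require SMTP settings in airflow.cfg):

```python
from datetime import timedelta

# Hypothetical alerting defaults; adjust the address to your environment.
default_args = {
    "owner": "codewithfimi",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
    "email": ["alerts@example.com"],
    "email_on_failure": True,
}

# Instead of a preset like '@daily', schedule_interval also accepts
# a cron expression; this one runs at 06:00 on weekdays.
cron_schedule = "0 6 * * 1-5"
```

Pass `default_args=default_args` and `schedule_interval=cron_schedule` to the DAG constructor from Step 1 to apply these settings.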

Step 5: Scale and Deploy

To move from local to production:

  • Host Airflow on AWS MWAA, Google Composer, or Docker
  • Connect to cloud storage (S3, GCS) and databases (Postgres, Snowflake)
  • Use XComs for data sharing between tasks
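
XComs let tasks exchange small values through the task instance (`ti`) that Airflow passes to each callable. A minimal sketch, with illustrative task and key names:

```python
def extract(ti):
    # Push a value to XCom for downstream tasks; in practice
    # this would come from an API call.
    rate = 42000.0
    ti.xcom_push(key="btc_rate", value=rate)

def load(ti):
    # Pull the value pushed by the task with task_id 'extract'.
    rate = ti.xcom_pull(task_ids="extract", key="btc_rate")
    print(f"Loading rate: {rate}")
```

XComs are stored in Airflow's metadata database, so they suit small values like file paths or row counts; large datasets should flow through files or external storage instead.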

Real-World Applications

  • Data Warehousing: Automate daily database refreshes
  • Machine Learning Pipelines: Retrain models periodically
  • Business Reporting: Schedule analytics dashboards
  • API Data Collection: Aggregate external data daily

Building a data pipeline in Python with Airflow empowers you to automate complex workflows, reduce manual effort, and maintain data consistency across systems.
Once you master the basics, you can scale your pipelines to handle big data, AI, and real-time analytics — all powered by Python.

FAQs

1. What is a data pipeline in simple terms?

A data pipeline is a system that moves and transforms data from source to destination automatically.

2. Why use Airflow for data pipelines?

Airflow helps you automate, schedule, and monitor workflows efficiently. It works well for both small and enterprise projects.

3. Can I use Airflow with cloud platforms?

Yes, Airflow integrates with AWS, GCP, Azure, and many SaaS APIs.

4. Is Airflow good for beginners?

Absolutely. It’s intuitive, Python-based, and widely used in the data industry.

5. What’s next after learning Airflow?

Try combining it with Docker, Spark, and cloud services to build production-grade data pipelines.
