How to Build a Scheduled ETL Job Using Cron and Python

In today’s data-driven world, automating your ETL (Extract, Transform, Load) workflows is a must. Whether you’re moving data from APIs to databases or cleaning up CSV files, manually running scripts every few hours just doesn’t scale. That’s where Cron and Python come in.

In this guide, you’ll learn how to set up an automated ETL job using Cron and Python, from writing the ETL script to scheduling it to run at your chosen time intervals. By the end, you’ll have a hands-free data pipeline running right on your system.

What Is ETL (Extract, Transform, Load)?

ETL stands for Extract, Transform, Load, and it’s a common process in data engineering and analytics:

  • Extract: Pull data from a source (API, database, CSV, etc.)
  • Transform: Clean, filter, or reshape the data for analysis
  • Load: Store it in a database, warehouse, or file format

When automated, this pipeline keeps your data fresh. This is perfect for dashboards, reports, or machine learning models.

Step 1: Write the Python ETL Script

Let’s create a simple Python script that extracts data from an API, transforms it, and loads it into a local CSV file.

import requests
import pandas as pd
from datetime import datetime

def extract_data():
    url = "https://jsonplaceholder.typicode.com/posts"
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors instead of silently returning bad data
    return response.json()

def transform_data(data):
    df = pd.DataFrame(data)
    df = df[['userId', 'id', 'title']]  # keep only the columns we need
    df['timestamp'] = datetime.now()    # record when each batch was processed
    return df

def load_data(df):
    df.to_csv('/home/yourname/data/etl_output.csv', index=False)
    print("Data loaded successfully!")

if __name__ == "__main__":
    data = extract_data()
    df = transform_data(data)
    load_data(df)

Save this script as etl_job.py.
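One common gotcha when a script runs under Cron: the job fails if the output directory doesn’t exist yet. A small, optional tweak to load_data creates it first (the path here is a placeholder, just like in the script above):

```python
from pathlib import Path

def load_data(df, out_path="/home/yourname/data/etl_output.csv"):
    path = Path(out_path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create the data/ directory if it's missing
    df.to_csv(path, index=False)
    print(f"Data loaded to {path}")
```

This also makes the output path a parameter, which is handy for testing the script locally before scheduling it.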

Step 2: Schedule the Job Using Cron

Cron is a time-based job scheduler in Unix-like systems. It lets you run scripts automatically at fixed times, dates, or intervals.

To open your user’s crontab for editing:

crontab -e

Then, add a line like this:

0 8 * * * /usr/bin/python3 /home/yourname/etl_job.py >> /home/yourname/logs/etl_log.txt 2>&1

This command runs your Python ETL script every day at 8 AM and logs output to a text file.
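The five fields before the command are minute, hour, day of month, month, and day of week. A few more schedules you might adapt (the script path is a placeholder, as above):

```
# Every 15 minutes
*/15 * * * * /usr/bin/python3 /home/yourname/etl_job.py
# Every 6 hours, on the hour
0 */6 * * * /usr/bin/python3 /home/yourname/etl_job.py
# 8 AM on weekdays only (1-5 = Mon-Fri)
0 8 * * 1-5 /usr/bin/python3 /home/yourname/etl_job.py
```

Because Cron runs with a minimal environment, always use the full path to the interpreter (check it with `which python3`) and confirm your entry was saved with `crontab -l`.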

Step 3: Verify and Monitor Your Job

Check your logs regularly to ensure the script runs smoothly. You can also test it manually:

python3 /home/yourname/etl_job.py

If the file updates and your log shows no errors, congratulations! You’ve successfully built a scheduled ETL job.

Step 4: Add Improvements

Once it’s running, you can enhance your ETL pipeline by:

  • Adding error handling and retries
  • Sending email or Slack notifications on success/failure
  • Storing data in a database or cloud storage
  • Using Airflow or Prefect for more complex scheduling
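The first bullet can be sketched as a small retry helper. This is an illustration rather than part of the script above; wire it around your own main() and alerting logic:

```python
import time

def run_with_retries(fn, attempts=3, delay=5):
    """Call fn(); on failure, wait and try again, re-raising after the last attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise  # surface the final failure so Cron's log captures it
            time.sleep(delay)
```

A usage sketch, assuming the three functions from Step 1: `run_with_retries(lambda: load_data(transform_data(extract_data())))`. Re-raising on the last attempt matters: it keeps the failure visible in your Cron log instead of swallowing it.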

Building a scheduled ETL pipeline using Cron and Python is one of the simplest ways to automate your data workflows. With just a few lines of code and a single Cron command, you can free up hours of manual work every week — keeping your data fresh, consistent, and reliable.

FAQ

1. How do I check if my Cron job ran successfully?

You can check if your Cron job ran by reviewing the log file specified in your Cron command (for example, /home/yourname/logs/etl_log.txt).
On Debian/Ubuntu systems you can also run:
grep CRON /var/log/syslog

This shows recent Cron executions. On distributions that use systemd journals instead, try journalctl -u cron (or journalctl -u crond).

2. Can I schedule Python ETL jobs on Windows?

Yes. While Windows doesn’t have Cron, you can use Task Scheduler to achieve the same result.
Simply create a new task, point it to your Python script, and set your desired schedule, whether daily, hourly, or even at system startup.

3. What’s the difference between Airflow and Cron for scheduling ETL jobs?

Cron is lightweight and great for simple recurring scripts, while Apache Airflow is designed for complex workflows, task dependencies, monitoring, and retry mechanisms.
If you’re running just one or two scripts, Cron is perfect. For large-scale data pipelines, go with Airflow or Prefect.

4. How do I prevent my Cron ETL jobs from failing silently?

Always redirect output to a log file, add error handling in your script, and consider sending email or Slack alerts when a job fails.
Example (run_etl and send_alert stand in for your own pipeline and notification functions):

try:
    run_etl()  # your ETL logic
except Exception as e:
    send_alert(f"ETL Job Failed: {e}")

5. Where should I store data from my ETL pipeline?

It depends on your use case:
  • CSV files or SQLite for small projects
  • PostgreSQL or MySQL for medium-scale data
  • Cloud data warehouses (like BigQuery, Snowflake, or Redshift) for production pipelines
For most beginners, writing to a CSV or local database is a great start.
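For instance, swapping the CSV step in Step 1 for SQLite only requires changing load_data; pandas’ to_sql creates the table for you. The database path and table name below are placeholders:

```python
import sqlite3

def load_data(df, db_path="/home/yourname/data/etl.db"):
    with sqlite3.connect(db_path) as conn:
        # Append this run's rows to the 'posts' table, creating it on first run
        df.to_sql("posts", conn, if_exists="append", index=False)
```

Using if_exists="append" means each scheduled run adds new rows; the timestamp column from the transform step then tells you which batch each row came from.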
