Most people who discover DuckDB have the same reaction. They install it in thirty seconds, run a SQL query against a CSV file without importing it into anything, get the result back in under a second on a dataset that would have taken minutes in another tool, and immediately want to know why nobody told them about this earlier.
DuckDB is an in-process analytical database. That phrase sounds technical but the practical meaning is simple. It runs inside your Python script, your Jupyter notebook, or your terminal session with no server to set up, no connection string to configure, and no infrastructure to maintain. You install it like a Python package and start querying data immediately. When your script finishes, there is nothing running in the background waiting to be shut down.
This tutorial covers everything a beginner needs to go from zero to productive with DuckDB. By the end you will know how to install it, query CSV and Parquet files directly, use it alongside pandas, build real analytical queries, and understand where DuckDB fits in a practical data workflow.
What DuckDB Actually Is and Why It Exists
Before getting into the how, it is worth spending a few minutes on the why, because understanding what problem DuckDB was designed to solve makes it much easier to use it correctly.
Databases generally fall into two categories based on what they are optimized for. Transactional databases, also called OLTP databases, are optimized for reading and writing individual rows quickly. PostgreSQL, MySQL, and SQLite are transactional databases. They are built for applications that need to insert a new order record, update a customer’s address, or look up a specific row by ID thousands of times per second.
Analytical databases, also called OLAP databases, are optimized for reading many rows at once and computing aggregations across them. When you want to know total revenue by product category for the past twelve months, you are scanning potentially millions of rows, grouping them, and summing a column. Transactional databases do this slowly because their storage format is optimized for row-level operations. Analytical databases store data in columns rather than rows, which means they can read only the columns a query needs instead of reading entire rows, and they can apply compression and vectorized processing that makes aggregations dramatically faster.
DuckDB is an analytical database that runs in process rather than as a separate server. The in-process part is what distinguishes it from cloud analytical databases like BigQuery and Snowflake, which run as remote services you connect to over a network. DuckDB runs right where your code runs. It reads files from your local filesystem, processes them using columnar execution, and returns results directly to your Python session or your terminal without any data leaving your machine or any network round trip happening.
The result is a tool that brings warehouse-speed analytical SQL to a laptop, a notebook, or a script without any of the infrastructure overhead those tools normally require.
Installing DuckDB
DuckDB installation is one of the simplest in the data tooling ecosystem. There are no system dependencies, no configuration files, and no server process to start.
For Python, install it with pip:
bash
pip install duckdb
That is the complete installation. DuckDB has no required dependencies beyond Python itself. The package includes the full DuckDB engine as a compiled binary that runs on Windows, macOS, and Linux.
To verify the installation worked, open a Python session and run:
python
import duckdb
print(duckdb.__version__)
You should see a version number printed. If you see that, DuckDB is installed and ready to use.
DuckDB also has a standalone command-line interface that you can use to run SQL queries interactively in your terminal without any Python code. On macOS with Homebrew:
bash
brew install duckdb
On Linux, download the binary from the DuckDB releases page on GitHub and place it somewhere on your PATH. On Windows, the same download page provides a Windows binary.
To open the DuckDB command-line interface, run:
bash
duckdb
This opens an interactive SQL shell where you can type queries and see results immediately. Type .quit to exit. For this tutorial, all examples will use Python because that is where most analytical work happens, but everything shown in Python translates directly to the command-line interface.
Your First DuckDB Query
The fastest way to understand what DuckDB does differently is to run a query on a file that exists on your filesystem without importing it into anything first.
Create a simple CSV file to work with. Save this as sales.csv in your working directory:
order_id,order_date,customer_id,product,category,quantity,unit_price,region
1001,2024-01-05,C001,Laptop,Electronics,1,899.99,North
1002,2024-01-07,C002,Desk Chair,Furniture,2,249.99,South
1003,2024-01-08,C001,Monitor,Electronics,2,349.99,North
1004,2024-01-10,C003,Keyboard,Electronics,1,79.99,East
1005,2024-01-12,C004,Standing Desk,Furniture,1,599.99,West
1006,2024-01-14,C002,Laptop,Electronics,1,899.99,South
1007,2024-01-15,C005,Mouse,Electronics,3,49.99,North
1008,2024-01-18,C003,Bookshelf,Furniture,1,189.99,East
1009,2024-01-20,C001,Webcam,Electronics,1,129.99,North
1010,2024-01-22,C006,Monitor,Electronics,1,349.99,West
Now run a SQL query against this file directly in Python:
python
import duckdb
result = duckdb.sql("SELECT * FROM 'sales.csv'")
print(result)
DuckDB reads the CSV file, infers the column types automatically, and returns the result. You did not create a table, did not define a schema, and did not import anything. The file path in the FROM clause is all DuckDB needs.
Run a more meaningful query:
python
result = duckdb.sql("""
SELECT
category,
COUNT(*) as order_count,
SUM(quantity * unit_price) as total_revenue,
ROUND(AVG(unit_price), 2) as avg_unit_price
FROM 'sales.csv'
GROUP BY category
ORDER BY total_revenue DESC
""")
print(result)
This returns total revenue and average unit price by category without touching pandas, without creating a database, and without any setup beyond the import statement. For a ten-row CSV this feels like a curiosity. On a file with ten million rows, the same query runs in seconds on a laptop.
Connecting to DuckDB and Creating a Persistent Database
The examples above use DuckDB in its default in-memory mode, where results are computed but nothing is stored permanently. For most analytical work this is exactly what you want. But DuckDB also supports persistent databases that save data to a file and persist between sessions.
To create a persistent database, pass a file path when creating a connection:
python
import duckdb
# Creates a persistent database file called analytics.db
conn = duckdb.connect("analytics.db")
Everything you create in this connection, tables, views, indexes, is saved to analytics.db and will be there the next time you connect. To connect to an existing database, use the same file path.
For in-memory work without persistence, connect without a file path:
python
conn = duckdb.connect() # In-memory, nothing persisted
Or use the module-level duckdb.sql() as shown in the first examples, which also operates in memory.
The connection object has an execute method for running queries and a sql method for running queries and returning a result relation:
python
# Using execute for queries that don't return results
conn.execute("CREATE TABLE products (id INTEGER, name VARCHAR, price DECIMAL(10,2))")
conn.execute("INSERT INTO products VALUES (1, 'Laptop', 899.99)")
# Using sql for queries that return results
result = conn.sql("SELECT * FROM products")
print(result)
# Close the connection when done
conn.close()
For most Python workflows, using a connection object rather than the module-level functions is the better practice because it gives you explicit control over when the connection is open and when it is closed, and it lets you run multiple related queries against the same session state.
Querying CSV Files at Scale
DuckDB’s CSV reading capability goes well beyond the basic example shown earlier. It handles a wide range of CSV formats and gives you control over how the file is parsed through options you can pass in the read_csv function.
For simple cases, the auto-detection works well:
python
# DuckDB automatically detects delimiter, header, and column types
result = duckdb.sql("SELECT * FROM 'sales.csv' LIMIT 10")
For files with non-standard formatting, use read_csv with explicit options:
python
result = duckdb.sql("""
SELECT * FROM read_csv(
'sales.csv',
header=true,
delimiter=',',
columns={
'order_id': 'INTEGER',
'order_date': 'DATE',
'customer_id': 'VARCHAR',
'product': 'VARCHAR',
'category': 'VARCHAR',
'quantity': 'INTEGER',
'unit_price': 'DECIMAL(10,2)',
'region': 'VARCHAR'
}
)
LIMIT 5
""")
print(result)
One of DuckDB’s most useful features for analysts who work with multiple files is the ability to query an entire folder of CSV files as if they were a single table. If you have monthly sales files named sales_2024_01.csv, sales_2024_02.csv, and so on in a folder called data, you can query all of them at once:
python
result = duckdb.sql("SELECT * FROM 'data/sales_*.csv'")
DuckDB reads all matching files and unions them into a single result set. This works for CSV files and for Parquet files, which are covered next. For analysts who receive periodic data dumps as separate files and need to analyze them together, this pattern eliminates the preprocessing step of combining files before analysis can begin.
Querying Parquet Files
Parquet is a columnar file format that stores data far more efficiently than CSV for analytical workloads. A CSV file stores all values as text and reads every column even when a query only needs a subset of them. A Parquet file stores data in columns, compresses each column using type-appropriate compression algorithms, and allows DuckDB to read only the columns a query actually needs. The result is dramatically smaller file sizes and much faster query times on large datasets.
DuckDB queries Parquet files using exactly the same syntax as CSV files:
python
result = duckdb.sql("SELECT * FROM 'data.parquet' LIMIT 10")
To see how much faster Parquet is than CSV on a meaningful dataset, create a Parquet version of your data using pandas and then compare query performance:
python
import duckdb
import pandas as pd
# Create a larger dataset to make the comparison meaningful
df = pd.read_csv('sales.csv')
# Write it as Parquet
df.to_parquet('sales.parquet', index=False)
# Query both and compare
csv_result = duckdb.sql("""
SELECT category, SUM(quantity * unit_price) as revenue
FROM 'sales.csv'
GROUP BY category
""")
parquet_result = duckdb.sql("""
SELECT category, SUM(quantity * unit_price) as revenue
FROM 'sales.parquet'
GROUP BY category
""")
print(csv_result)
print(parquet_result)
On a ten-row file the difference is imperceptible. On a file with millions of rows, the Parquet query is typically two to ten times faster depending on the query and the data characteristics. For any analytical project where you control the file format, storing data as Parquet instead of CSV is one of the highest-leverage performance improvements available.
Using DuckDB With Pandas
DuckDB and pandas work together seamlessly, and this integration is one of the most practically useful aspects of DuckDB for analysts who already have a Python data workflow. You can query a pandas DataFrame using SQL, convert DuckDB query results to a pandas DataFrame, and mix DuckDB queries and pandas operations in the same workflow without any explicit data transfer step.
Query a pandas DataFrame directly with SQL:
python
import duckdb
import pandas as pd
# Create a pandas DataFrame
df = pd.DataFrame({
'order_id': [1001, 1002, 1003, 1004, 1005],
'customer_id': ['C001', 'C002', 'C001', 'C003', 'C004'],
'product': ['Laptop', 'Chair', 'Monitor', 'Keyboard', 'Desk'],
'category': ['Electronics', 'Furniture', 'Electronics', 'Electronics', 'Furniture'],
'revenue': [899.99, 499.98, 699.98, 79.99, 599.99]
})
# Query the DataFrame using SQL - reference it by variable name
result = duckdb.sql("SELECT category, SUM(revenue) as total FROM df GROUP BY category")
print(result)
DuckDB can see pandas DataFrames in the local Python scope by their variable name. You reference the DataFrame in the FROM clause as if it were a table name and DuckDB handles the integration automatically. No explicit registration step is required.
Convert a DuckDB result to a pandas DataFrame:
python
# Convert result to pandas DataFrame for further pandas operations
result_df = duckdb.sql("""
SELECT
category,
COUNT(*) as order_count,
SUM(revenue) as total_revenue
FROM df
GROUP BY category
ORDER BY total_revenue DESC
""").df()
print(type(result_df)) # <class 'pandas.core.frame.DataFrame'>
print(result_df)
The .df() method on any DuckDB result converts it to a pandas DataFrame. Similarly, .pl() converts to a Polars DataFrame if you have Polars installed, and .arrow() converts to a PyArrow table.
This bidirectional integration means you can use DuckDB for the operations where SQL is more expressive or faster, like complex aggregations and joins across large datasets, and use pandas for the operations where it is more convenient, like applying custom Python functions to individual rows or using pandas-specific transformations that have no SQL equivalent.
A practical workflow pattern that combines both:
python
import duckdb
import pandas as pd
# Load raw data into pandas for initial inspection
raw_df = pd.read_csv('sales.csv')
# Use DuckDB SQL for complex aggregation that would be verbose in pandas
summary = duckdb.sql("""
SELECT
region,
category,
COUNT(DISTINCT customer_id) as unique_customers,
SUM(quantity * unit_price) as total_revenue,
ROUND(SUM(quantity * unit_price) / COUNT(DISTINCT customer_id), 2) as revenue_per_customer
FROM raw_df
GROUP BY region, category
ORDER BY region, total_revenue DESC
""").df()
# Use pandas for further processing of the summarized result
summary['revenue_share'] = summary['total_revenue'] / summary['total_revenue'].sum()
print(summary)
Creating Tables and Loading Data
For projects where you want to persist data across sessions or run many queries against the same dataset without re-reading files each time, creating tables in a DuckDB database is the right approach.
Create a table from a CSV file:
python
import duckdb
conn = duckdb.connect("analytics.db")
# Create a table by reading from CSV
conn.execute("""
CREATE TABLE IF NOT EXISTS sales AS
SELECT * FROM read_csv('sales.csv', header=true)
""")
# Verify the table was created
result = conn.sql("SELECT COUNT(*) as row_count FROM sales")
print(result)
Create a table from a pandas DataFrame:
python
import duckdb
import pandas as pd
conn = duckdb.connect("analytics.db")
df = pd.read_csv('sales.csv')
# Register the DataFrame and create a table from it
conn.execute("CREATE TABLE IF NOT EXISTS sales AS SELECT * FROM df")
Once the table exists in the database, queries against it are fast because DuckDB uses its own internal columnar storage format rather than re-reading the source file each time.
Insert additional data into an existing table:
python
# Insert a single row
conn.execute("""
INSERT INTO sales VALUES (
1011, '2024-01-25', 'C007', 'Headphones', 'Electronics', 1, 199.99, 'South'
)
""")
# Insert from another CSV file
conn.execute("INSERT INTO sales SELECT * FROM read_csv('sales_new.csv', header=true)")
conn.close()
Writing Real Analytical Queries in DuckDB
The SQL that runs in DuckDB is standard SQL with extensions for analytical functions. If you know SQL from any other database, your existing knowledge transfers directly. The analytical functions that DuckDB adds on top of standard SQL are where its power for data analysis becomes fully visible.
Window functions for running totals and rankings:
python
result = duckdb.sql("""
SELECT
order_date,
region,
quantity * unit_price as daily_revenue,
SUM(quantity * unit_price) OVER (
PARTITION BY region
ORDER BY order_date
) as cumulative_revenue_by_region,
RANK() OVER (
PARTITION BY region
ORDER BY quantity * unit_price DESC
) as revenue_rank_in_region
FROM 'sales.csv'
ORDER BY region, order_date
""")
print(result)
Common Table Expressions for readable multi-step queries:
python
result = duckdb.sql("""
WITH customer_totals AS (
SELECT
customer_id,
COUNT(*) as order_count,
SUM(quantity * unit_price) as total_spent
FROM 'sales.csv'
GROUP BY customer_id
),
customer_segments AS (
SELECT
customer_id,
order_count,
total_spent,
CASE
WHEN total_spent >= 1000 THEN 'High Value'
WHEN total_spent >= 500 THEN 'Mid Value'
ELSE 'Low Value'
END as segment
FROM customer_totals
)
SELECT
segment,
COUNT(*) as customer_count,
ROUND(AVG(total_spent), 2) as avg_spend,
ROUND(AVG(order_count), 1) as avg_orders
FROM customer_segments
GROUP BY segment
ORDER BY avg_spend DESC
""")
print(result)
Date functions for time-based analysis:
python
result = duckdb.sql("""
SELECT
DATE_TRUNC('week', CAST(order_date AS DATE)) as week_start,
COUNT(*) as orders,
SUM(quantity * unit_price) as weekly_revenue,
ROUND(AVG(quantity * unit_price), 2) as avg_order_value
FROM 'sales.csv'
GROUP BY week_start
ORDER BY week_start
""")
print(result)
PIVOT for reshaping data from rows to columns:
python
result = duckdb.sql("""
PIVOT (
SELECT region, category, ROUND(SUM(quantity * unit_price), 2) as revenue
FROM 'sales.csv'
GROUP BY region, category
)
ON category
USING SUM(revenue)
GROUP BY region
ORDER BY region
""")
print(result)
The PIVOT syntax in DuckDB is one of the cleanest implementations of row-to-column reshaping in any SQL database and is significantly more readable than the CASE WHEN aggregation approach that standard SQL requires for the same operation.
DuckDB Extensions
DuckDB has an extension system that adds functionality beyond the core engine. Extensions are installed with a single SQL command and loaded into a session before use.
The most useful extensions for analytical work are the httpfs extension for reading files directly from S3 and other cloud storage, the spatial extension for geographic data analysis, the json extension for parsing JSON data, and the postgres extension for connecting directly to a PostgreSQL database.
Install and load an extension:
python
import duckdb
conn = duckdb.connect()
# Install the httpfs extension for reading from S3
conn.execute("INSTALL httpfs")
conn.execute("LOAD httpfs")
# Now you can read Parquet files directly from S3
result = conn.sql("""
SELECT COUNT(*)
FROM 's3://your-bucket/your-file.parquet'
""")
The httpfs extension is particularly useful because it means DuckDB can query data stored in cloud object storage without downloading it first. DuckDB uses Parquet’s metadata to read only the row groups and columns that a query needs, which means queries against cloud-stored Parquet files are often faster than you would expect given that the data is remote.
The postgres extension lets DuckDB connect directly to a running PostgreSQL instance and query its tables:
python
conn.execute("INSTALL postgres")
conn.execute("LOAD postgres")
conn.execute("""
ATTACH 'dbname=mydb user=myuser host=localhost' AS pg (TYPE POSTGRES)
""")
# Query a PostgreSQL table directly from DuckDB
result = conn.sql("SELECT * FROM pg.public.customers LIMIT 10")
This is useful for running complex analytical queries against PostgreSQL data using DuckDB’s faster analytical execution engine without having to export and reimport the data.
DuckDB vs SQLite: Understanding the Difference
SQLite and DuckDB are both embedded databases that require no server, and beginners often wonder which one to use. The answer comes down to what kind of queries you are running.
SQLite is a row-oriented database optimized for transactional workloads. It is the right choice when you need to store application data, look up individual records by ID, insert and update rows frequently, and run queries that return a small number of rows. SQLite is embedded in billions of devices and applications for exactly this purpose. It is extraordinarily reliable and well-suited to the transactional use case it was designed for.
DuckDB is a column-oriented database optimized for analytical workloads. It is the right choice when you need to run aggregations across many rows, group and filter large datasets, join multiple tables on analytical queries, and compute summary statistics. For these workloads DuckDB is typically ten to one hundred times faster than SQLite because its columnar storage and vectorized execution are designed for reading many rows at once rather than finding individual rows quickly.
A useful mental model is that SQLite is the right database for an application that stores its own data and needs to retrieve individual records. DuckDB is the right database for an analyst or data engineer who needs to run SQL against data files and get analytical results fast. They solve different problems and for most projects you will not need to choose between them because the use cases rarely overlap.
Building a Complete Analytical Workflow With DuckDB
Putting everything together, here is a complete example of a realistic analytical workflow that uses DuckDB from data loading through final output:
python
import duckdb
import pandas as pd
# Connect to a persistent database
conn = duckdb.connect("sales_analytics.db")
# Load data from CSV into a persistent table
conn.execute("""
CREATE TABLE IF NOT EXISTS sales AS
SELECT * FROM read_csv('sales.csv',
header=true,
columns={
'order_id': 'INTEGER',
'order_date': 'DATE',
'customer_id': 'VARCHAR',
'product': 'VARCHAR',
'category': 'VARCHAR',
'quantity': 'INTEGER',
'unit_price': 'DECIMAL(10,2)',
'region': 'VARCHAR'
}
)
""")
# Build a summary table for reporting
conn.execute("""
CREATE OR REPLACE TABLE regional_summary AS
WITH base AS (
SELECT
region,
category,
DATE_TRUNC('month', order_date) as month,
SUM(quantity * unit_price) as revenue,
COUNT(DISTINCT customer_id) as unique_customers,
COUNT(*) as order_count
FROM sales
GROUP BY region, category, month
)
SELECT
*,
ROUND(revenue / unique_customers, 2) as revenue_per_customer,
ROUND(revenue / order_count, 2) as avg_order_value
FROM base
""")
# Pull the summary into pandas for export or visualization
summary_df = conn.sql("""
SELECT *
FROM regional_summary
ORDER BY month, region, revenue DESC
""").df()
print(summary_df)
print(f"\nTotal rows: {len(summary_df)}")
print(f"\nRevenue by region:")
print(summary_df.groupby('region')['revenue'].sum().sort_values(ascending=False))
# Export to CSV and Parquet
summary_df.to_csv('regional_summary.csv', index=False)
summary_df.to_parquet('regional_summary.parquet', index=False)
conn.close()
print("\nWorkflow complete. Results saved.")
This workflow covers loading raw data, transforming it into a summary table using analytical SQL, pulling results into pandas for final processing, and exporting in multiple formats. It is a pattern that scales from a ten-row example file to a dataset with millions of rows with no changes to the code structure.
When to Use DuckDB and When Not To
DuckDB is the right tool when you need fast analytical SQL on local or cloud-stored files, when you want to query data without loading it into a database first, when you are working in Python and want SQL-based data manipulation alongside pandas, when you need to aggregate large datasets quickly on a single machine, or when you are building a data pipeline that runs locally and needs analytical query capability without infrastructure overhead.
DuckDB is not the right tool when you need multiple users or applications writing to the same database simultaneously because DuckDB is designed for single-writer scenarios. It is not the right tool when you need the transactional guarantees of a production database with ACID compliance across concurrent connections. It is not the right tool when your dataset is genuinely too large for a single machine to process, at which point a distributed system like Spark or a cloud data warehouse is the appropriate infrastructure.
For the analytical work that most data analysts and data engineers do day to day, those limitations almost never apply. DuckDB fits the vast majority of analytical workflows that involve a single analyst or pipeline querying data to produce results, which covers most of what the tool is used for in practice.
FAQs
What is DuckDB and what is it used for?
DuckDB is an in-process analytical database that runs inside your Python script, Jupyter notebook, or terminal without requiring a separate server or any infrastructure setup. It is used for running fast SQL queries against CSV files, Parquet files, pandas DataFrames, and other data sources for analytical work like aggregations, joins, filtering, and time-based analysis. It is particularly useful for data analysts and data engineers who want SQL-based analytical capability on local data without the overhead of setting up a database server.
Is DuckDB faster than pandas for data analysis?
For aggregation-heavy operations on large datasets, DuckDB is typically significantly faster than pandas because it uses columnar storage and vectorized query execution designed specifically for analytical workloads. For row-level operations, custom Python transformations, and workflows that use pandas-specific functionality, pandas may be more convenient. The most productive approach for many analysts is using both together, DuckDB for the SQL-based aggregation and transformation steps and pandas for the Python-specific processing steps, since they integrate seamlessly.
Can DuckDB replace a data warehouse for small teams?
For small teams working with datasets that fit on a single machine, DuckDB can handle many of the analytical workloads that teams might otherwise use a cloud data warehouse for. It queries Parquet files stored locally or in S3 using the httpfs extension, supports persistent databases that store processed data, and runs complex analytical SQL at speeds that match cloud data warehouses for single-machine dataset sizes. For teams that need multiple concurrent users querying a shared dataset or datasets that are genuinely too large for a single machine, a cloud data warehouse remains the appropriate solution.
Does DuckDB work with Google Colab and Jupyter notebooks?
Yes. DuckDB installs in Google Colab and Jupyter notebooks with a standard pip install duckdb command and works exactly the same as in any other Python environment. It is particularly convenient in notebooks because the integration with pandas means you can alternate between SQL-based analysis with DuckDB and DataFrame-based analysis with pandas in the same notebook without any friction. The in-memory operation mode means there is nothing to clean up when you close the notebook session.
What file formats does DuckDB support?
DuckDB natively reads and writes CSV, Parquet, JSON, and Arrow files. It also reads Excel files through a community extension. For remote data, the httpfs extension enables reading files directly from S3, Google Cloud Storage, and Azure Blob Storage without downloading them first. The Parquet format is the best choice for analytical workloads because DuckDB can use Parquet’s columnar structure and metadata to read only the data a query needs, which produces the fastest query times and smallest memory footprint.