If you have ever written a SQL query that worked perfectly on a small dataset but suddenly crawled to a halt when run against a table with millions of rows, you have experienced the pain of unoptimized SQL firsthand.
Slow queries are one of the most common and costly problems in data engineering and analytics. They slow down dashboards, block applications, frustrate analysts, and consume expensive compute resources. A query that takes 45 minutes to run is not just annoying — it is a business problem.
The good news is that most slow SQL queries can be dramatically improved using a handful of well-understood techniques. You do not need to be a database administrator to apply them; you just need to understand how SQL engines execute queries and what makes them work harder than they need to.
In this guide, we will walk through every major SQL query optimization technique for large tables with clear explanations, before-and-after examples, and practical guidance on when to apply each one.
Why SQL Queries Get Slow on Large Tables
Before optimizing, it helps to understand why queries get slow in the first place.
When you run a SQL query against a table with millions of rows, the database engine has to do real work, sometimes a lot of it. The most common causes of slow queries are:
Full table scans — The engine reads every single row in the table looking for matches instead of jumping directly to the relevant rows
Missing or unused indexes — Without indexes, the engine has no map to navigate to specific rows efficiently
Inefficient joins — Joining large tables without proper indexes forces the engine to compare every combination of rows
Retrieving too many columns — Selecting more data than you need wastes I/O, memory, and network bandwidth
Non-sargable WHERE clauses — Filtering conditions written in ways the engine cannot use indexes to evaluate
Suboptimal query structure — Correlated subqueries, unnecessary DISTINCT, improper use of HAVING vs WHERE
Poor data types — Joining or filtering on columns with mismatched or inefficient data types
Understanding these root causes points directly toward the solutions.
Setting Up Example Tables
We will use realistic large-table examples throughout this guide.
sql
-- Orders fact table — 50 million rows
CREATE TABLE orders (
order_id BIGINT PRIMARY KEY,
customer_id INT NOT NULL,
product_id INT NOT NULL,
order_date DATE NOT NULL,
status VARCHAR(20) NOT NULL,
amount DECIMAL(12, 2) NOT NULL,
region VARCHAR(50) NOT NULL,
sales_rep_id INT,
discount DECIMAL(5, 2) DEFAULT 0,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Customers dimension table — 2 million rows
CREATE TABLE customers (
customer_id INT PRIMARY KEY,
name VARCHAR(100) NOT NULL,
email VARCHAR(150) UNIQUE,
country VARCHAR(50),
segment VARCHAR(30),
created_date DATE
);
-- Products dimension table — 500,000 rows
CREATE TABLE products (
product_id INT PRIMARY KEY,
product_name VARCHAR(200),
category VARCHAR(50),
subcategory VARCHAR(50),
unit_price DECIMAL(10, 2),
supplier_id INT
);
Technique 1: Use EXPLAIN / EXPLAIN ANALYZE — Understand Before Optimizing
The single most important skill in query optimization is reading the execution plan: the roadmap the database engine creates showing exactly how it plans to execute your query.
Never optimize blindly. Always start by understanding what the engine is doing.
How to Get the Execution Plan
sql
-- MySQL / PostgreSQL
EXPLAIN SELECT * FROM orders WHERE customer_id = 12345;
-- PostgreSQL — with actual execution statistics
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 12345;
-- SQL Server
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
SELECT * FROM orders WHERE customer_id = 12345;
-- Or use graphical execution plan in SQL Server Management Studio
-- Query menu → Include Actual Execution Plan (Ctrl+M)
Reading the Execution Plan — Key Things to Look For
Seq Scan / Table Scan — The engine is reading the entire table. On large tables, this is usually a red flag.
Index Scan / Index Seek — The engine is using an index to find rows efficiently. This is what you want.
Nested Loop / Hash Join / Merge Join — How the engine is joining tables. Each has different performance characteristics.
Rows — The estimated number of rows at each step. Large row estimates in early steps mean expensive operations.
Cost — The relative cost the optimizer assigns to each step. Higher cost = more work.
sql
-- Example execution plan output (PostgreSQL)
EXPLAIN ANALYZE
SELECT order_id, amount
FROM orders
WHERE customer_id = 12345
AND order_date >= '2024-01-01';
-- Output (before optimization):
-- Seq Scan on orders (cost=0.00..2847293.00 rows=127 width=20)
-- (actual time=0.042..45823.112 rows=127 loops=1)
-- Filter: ((customer_id = 12345) AND (order_date >= '2024-01-01'))
-- Rows Removed by Filter: 49999873
-- Planning Time: 0.152 ms
-- Execution Time: 45823.334 ms ← 45 seconds!
The execution plan shows a Seq Scan reading all 50 million rows and discarding 49,999,873 of them. Adding an index on customer_id transforms this.
Technique 2: Indexing — The Most Impactful Optimization
Indexes are the single most powerful tool for speeding up SQL queries on large tables. An index is a separate data structure that the engine maintains, allowing it to find rows matching a condition without scanning the entire table.
Think of an index like the index at the back of a textbook: instead of reading every page to find “photosynthesis”, you look it up in the index and jump directly to page 342.
Single Column Index
sql
-- Create index on the most commonly filtered column
CREATE INDEX idx_orders_customer_id
ON orders (customer_id);
CREATE INDEX idx_orders_order_date
ON orders (order_date);
CREATE INDEX idx_orders_status
ON orders (status);
After adding the customer_id index:
sql
EXPLAIN ANALYZE
SELECT order_id, amount
FROM orders
WHERE customer_id = 12345;
-- Output (after indexing):
-- Index Scan using idx_orders_customer_id on orders
-- (cost=0.56..892.34 rows=127 width=20)
-- (actual time=0.034..1.247 rows=127 loops=1)
-- Execution Time: 1.489 ms ← From 45 seconds to 1.5ms!
Composite Index — Multiple Columns
When your WHERE clause filters on multiple columns, a composite index covering all those columns is far more efficient than separate single-column indexes.
sql
-- Frequently used together: filter by customer, then date
CREATE INDEX idx_orders_customer_date
ON orders (customer_id, order_date);
-- Query that benefits from composite index
SELECT order_id, amount, status
FROM orders
WHERE customer_id = 12345
AND order_date BETWEEN '2024-01-01' AND '2024-12-31';
The Column Order in Composite Indexes Matters
In a composite index, the order of columns is critical. The index is most effective when queries filter on the leading columns first.
sql
-- Index: (customer_id, order_date, status)
-- Uses the index efficiently — filters on leading columns
SELECT * FROM orders
WHERE customer_id = 12345
AND order_date >= '2024-01-01';
-- Uses the index efficiently — filters on first column only
SELECT * FROM orders
WHERE customer_id = 12345;
-- Does NOT use the index efficiently — skips the leading column
SELECT * FROM orders
WHERE order_date >= '2024-01-01'
AND status = 'Completed';
Rule: Build composite indexes with the most selective column first, and ensure queries that filter on the middle or end columns also filter on the leading columns.
Covering Index — Include All Needed Columns
A covering index includes all the columns needed by a query — so the engine never has to go back to the main table to fetch data. This is one of the most powerful optimizations available.
sql
-- Query needs: customer_id (filter), order_date (filter), amount and status (select)
-- Covering index includes all four columns
CREATE INDEX idx_orders_covering
ON orders (customer_id, order_date)
INCLUDE (amount, status);
-- SQL Server / PostgreSQL syntax
-- MySQL: just include all columns in the index definition
CREATE INDEX idx_orders_covering
ON orders (customer_id, order_date, amount, status);
With a covering index, the engine satisfies the entire query from the index alone — no I/O to the main table at all.
Partial Index — Index Only What You Query
If you frequently query only a subset of a table, such as active orders or recent records, a partial index on that subset is smaller, faster, and uses less storage.
sql
-- Most queries filter on active orders only
-- Partial index covering only active orders
CREATE INDEX idx_orders_active
ON orders (customer_id, order_date)
WHERE status = 'Active';
-- PostgreSQL syntax
-- Much smaller than a full index — only indexes ~15% of rows if 15% are active
Index Maintenance — Do Not Over-Index
Indexes have a cost — they slow down INSERT, UPDATE, and DELETE operations because the engine must update every index whenever data changes. More indexes = faster reads, slower writes.
Guidelines:
- Index columns used in WHERE, JOIN, and ORDER BY clauses
- Index foreign key columns in child tables
- Remove unused indexes — they slow writes with no benefit
- Monitor index usage with database-specific system views
sql
-- PostgreSQL: find unused indexes
SELECT schemaname, tablename, indexname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
AND indexname NOT LIKE '%pkey%'
ORDER BY tablename;
-- MySQL: find unused indexes
SELECT * FROM sys.schema_unused_indexes;
Technique 3: SELECT Only What You Need
One of the most common and easily fixed performance mistakes is using SELECT * when you only need a few columns.
The Problem With SELECT *
sql
-- Bad — retrieves ALL columns from a 50 million row table
-- Transfers potentially hundreds of bytes per row across the network
SELECT *
FROM orders
WHERE customer_id = 12345;
-- Good — retrieves only the 3 columns you actually need
SELECT order_id, order_date, amount
FROM orders
WHERE customer_id = 12345;
Why SELECT * is expensive:
- Forces the engine to retrieve every column from disk
- Increases memory usage for storing and transmitting results
- Prevents covering indexes from being used
- Increases network I/O between database and application
- Forces unnecessary data type conversions in the application layer
Always specify the exact columns you need. This applies especially when joining multiple large tables — the column count multiplies across every joined table.
Technique 4: Write Sargable WHERE Clauses
Sargable (Search ARGument ABLE) means the WHERE clause is written in a way that allows the database engine to use an index to evaluate it.
Non-sargable conditions force full table scans even when indexes exist on the relevant columns.
Functions on Indexed Columns — Non-Sargable
sql
-- Non-sargable — wraps column in function, index cannot be used
SELECT * FROM orders
WHERE YEAR(order_date) = 2024;
-- Sargable — range condition, index CAN be used
SELECT * FROM orders
WHERE order_date >= '2024-01-01'
AND order_date < '2025-01-01';
-- Non-sargable — function on column
SELECT * FROM customers
WHERE UPPER(email) = 'ALICE@EXAMPLE.COM';
-- Sargable — use consistent case or case-insensitive collation
SELECT * FROM customers
WHERE email = 'alice@example.com';
-- Non-sargable — arithmetic on column
SELECT * FROM orders
WHERE amount * 1.1 > 1000;
-- Sargable — move arithmetic to the other side
SELECT * FROM orders
WHERE amount > 1000 / 1.1;
LIKE With Leading Wildcard — Non-Sargable
sql
-- Non-sargable — leading wildcard prevents index use
SELECT * FROM customers
WHERE name LIKE '%Smith';
-- Sargable — trailing wildcard only, index CAN be used
SELECT * FROM customers
WHERE name LIKE 'Smith%';
-- For full-text search needs, use full-text indexes instead
CREATE FULLTEXT INDEX idx_customers_name ON customers(name);
SELECT * FROM customers
WHERE MATCH(name) AGAINST ('Smith' IN BOOLEAN MODE);
NULL Handling — Sargability Considerations
sql
-- Non-sargable on most databases
SELECT * FROM orders
WHERE ISNULL(sales_rep_id, 0) = 0;
-- Sargable
SELECT * FROM orders
WHERE sales_rep_id IS NULL;
-- Or use COALESCE on the parameter side, not the column
SELECT * FROM orders
WHERE sales_rep_id = COALESCE(@rep_id, sales_rep_id);
Technique 5: Optimize JOIN Operations
Joins on large tables are one of the most expensive SQL operations. Optimizing them has a dramatic impact on query performance.
Always Join on Indexed Columns
sql
-- Both customer_id columns must be indexed for efficient join
SELECT o.order_id, o.amount, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= '2024-01-01';
-- orders.customer_id should have an index
-- customers.customer_id is the primary key (already indexed)
Filter Before Joining — Use Subqueries or CTEs
Reducing the size of a table before joining it dramatically reduces the work the join must do.
sql
-- Slow — joins all 50 million orders to customers, then filters
SELECT o.order_id, o.amount, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= '2024-01-01'
AND o.status = 'Completed'
AND c.country = 'USA';
-- Faster — filter each table first, then join the reduced result sets
WITH recent_orders AS (
SELECT order_id, customer_id, amount
FROM orders
WHERE order_date >= '2024-01-01'
AND status = 'Completed'
),
us_customers AS (
SELECT customer_id, name
FROM customers
WHERE country = 'USA'
)
SELECT o.order_id, o.amount, c.name
FROM recent_orders o
JOIN us_customers c ON o.customer_id = c.customer_id;
Choose the Right Join Type
sql
-- INNER JOIN — only rows matching in both tables
-- Use when you only need matched records
SELECT o.order_id, c.name
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id;
-- LEFT JOIN — all rows from left table, matched from right
-- Use when you need all orders regardless of customer data availability
SELECT o.order_id, c.name
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id;
-- Avoid CROSS JOIN on large tables — produces m x n rows
-- 50M orders × 2M customers = 100 trillion rows
SELECT * FROM orders CROSS JOIN customers; -- Never do this!
Join Order in Queries
Most modern optimizers determine join order automatically, but in complex queries with many joins, explicit hints or CTEs can help the optimizer make better choices.
sql
-- Join smaller tables first to reduce intermediate result sets
SELECT o.order_id, o.amount, c.name, p.product_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN products p ON o.product_id = p.product_id
WHERE c.segment = 'Enterprise' -- Filter on dimension table
AND p.category = 'Electronics'; -- Filter on dimension table
Technique 6: Use WHERE Instead of HAVING for Row Filtering
HAVING filters are applied after GROUP BY aggregation, meaning the engine groups millions of rows and then discards groups you do not want.
WHERE filters are applied before GROUP BY, meaning the engine filters down to a smaller dataset before doing the expensive aggregation work.
sql
-- Slow — GROUP BY aggregates all 50 million rows, HAVING discards most groups
SELECT customer_id, SUM(amount) AS total_sales
FROM orders
GROUP BY customer_id
HAVING SUM(amount) > 10000
AND customer_id IN (SELECT customer_id FROM customers WHERE country = 'USA');
-- Fast — WHERE reduces dataset before grouping
SELECT customer_id, SUM(amount) AS total_sales
FROM orders
WHERE customer_id IN (SELECT customer_id FROM customers WHERE country = 'USA')
GROUP BY customer_id
HAVING SUM(amount) > 10000;
Rule: Use WHERE for filtering rows. Use HAVING only for filtering on aggregated values (like SUM, COUNT, AVG) that cannot be evaluated before GROUP BY.
Technique 7: Replace Correlated Subqueries With Joins or CTEs
Correlated subqueries execute once for every row in the outer query — turning an O(n) problem into an O(n²) problem on large tables.
sql
-- Very slow — correlated subquery runs once for each of 50 million rows
SELECT order_id, amount,
(SELECT name FROM customers WHERE customer_id = o.customer_id) AS customer_name
FROM orders o
WHERE order_date >= '2024-01-01';
-- Fast — JOIN executes once, same result
SELECT o.order_id, o.amount, c.name AS customer_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= '2024-01-01';
-- Another correlated subquery example — finding orders above average
-- Slow — recalculates average for every row
SELECT order_id, amount
FROM orders o
WHERE amount > (
SELECT AVG(amount)
FROM orders
WHERE customer_id = o.customer_id
);
-- Fast — calculate averages once with CTE, then join
WITH customer_avg AS (
SELECT customer_id, AVG(amount) AS avg_amount
FROM orders
GROUP BY customer_id
)
SELECT o.order_id, o.amount
FROM orders o
JOIN customer_avg ca ON o.customer_id = ca.customer_id
WHERE o.amount > ca.avg_amount;
Technique 8: Use EXISTS Instead of IN for Large Subqueries
When checking membership in a large subquery result, EXISTS is often faster than IN because EXISTS stops as soon as it finds the first match, while IN must evaluate the entire subquery first.
sql
-- Potentially slow — IN evaluates entire subquery result
SELECT order_id, amount
FROM orders
WHERE customer_id IN (
SELECT customer_id
FROM customers
WHERE country = 'USA'
AND segment = 'Enterprise'
);
-- Often faster — EXISTS stops at first match
SELECT o.order_id, o.amount
FROM orders o
WHERE EXISTS (
SELECT 1
FROM customers c
WHERE c.customer_id = o.customer_id
AND c.country = 'USA'
AND c.segment = 'Enterprise'
);
-- Often best — explicit JOIN with proper indexes
SELECT o.order_id, o.amount
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE c.country = 'USA'
AND c.segment = 'Enterprise';
Note: Modern query optimizers often transform IN to EXISTS or JOIN automatically, but writing it explicitly ensures the behavior you intend.
Technique 9: Avoid SELECT DISTINCT When Possible
DISTINCT forces the engine to sort and de-duplicate the entire result set, which is expensive on large tables.
sql
-- Slow — DISTINCT forces sort of all results
SELECT DISTINCT customer_id
FROM orders
WHERE order_date >= '2024-01-01';
-- Often faster — GROUP BY achieves the same result
SELECT customer_id
FROM orders
WHERE order_date >= '2024-01-01'
GROUP BY customer_id;
-- Better yet — query the customers table directly if that is what you need
SELECT customer_id
FROM customers
WHERE customer_id IN (
SELECT customer_id -- DISTINCT is unnecessary inside IN; duplicates do not change the result
FROM orders
WHERE order_date >= '2024-01-01'
);
If you find yourself using DISTINCT frequently, it often signals a deeper problem: either a missing JOIN condition causing duplicate rows, or a query that should be redesigned.
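To see how a missing or incomplete join condition creates the duplicates that DISTINCT merely hides, consider this sketch. It assumes a hypothetical customer_addresses table (one row per address, with an is_primary flag) that is not part of the schema defined earlier in this guide:

```sql
-- Duplicates: each order matches every address row for its customer,
-- and DISTINCT papers over the duplication instead of fixing the join
SELECT DISTINCT o.order_id, o.amount
FROM orders o
JOIN customer_addresses a ON o.customer_id = a.customer_id;

-- Fix the join itself: restrict the right side to one row per customer
SELECT o.order_id, o.amount
FROM orders o
JOIN customer_addresses a
ON o.customer_id = a.customer_id
AND a.is_primary = TRUE;
```

The second query returns the same orders without a sort-and-deduplicate step, because the join never produced duplicates in the first place.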
Technique 10: Use CTEs and Temp Tables to Break Up Complex Queries
Very complex queries with many joins and subqueries are hard for the optimizer to handle efficiently. Breaking them into steps using CTEs or temporary tables can dramatically improve performance and readability.
CTEs for Readable Multi-Step Queries
sql
-- Complex query broken into clear, optimizable steps
WITH
-- Step 1: Get recent high-value orders
high_value_orders AS (
SELECT order_id, customer_id, product_id, amount, order_date
FROM orders
WHERE order_date >= '2024-01-01'
AND amount >= 500
AND status = 'Completed'
),
-- Step 2: Get enterprise customers in key markets
target_customers AS (
SELECT customer_id, name, country
FROM customers
WHERE segment = 'Enterprise'
AND country IN ('USA', 'UK', 'Germany', 'France')
),
-- Step 3: Get electronics products
electronics AS (
SELECT product_id, product_name, category
FROM products
WHERE category = 'Electronics'
),
-- Step 4: Join everything together
final_result AS (
SELECT
hvo.order_id,
hvo.amount,
hvo.order_date,
tc.name AS customer_name,
tc.country,
e.product_name
FROM high_value_orders hvo
JOIN target_customers tc ON hvo.customer_id = tc.customer_id
JOIN electronics e ON hvo.product_id = e.product_id
)
SELECT *
FROM final_result
ORDER BY amount DESC
LIMIT 1000;
Temporary Tables for Intermediate Results
When a CTE is referenced multiple times in a query, it may be recalculated each time. A temporary table calculates once and stores the result.
sql
-- Calculate expensive intermediate result once
CREATE TEMPORARY TABLE temp_customer_stats AS
SELECT
customer_id,
COUNT(*) AS order_count,
SUM(amount) AS total_spent,
AVG(amount) AS avg_order_value,
MAX(order_date) AS last_order_date
FROM orders
WHERE order_date >= '2023-01-01'
GROUP BY customer_id;
-- Index the temp table for subsequent joins
CREATE INDEX idx_temp_customer_id
ON temp_customer_stats (customer_id);
-- Now use the pre-calculated stats in multiple queries efficiently
SELECT c.name, c.country, tcs.total_spent, tcs.order_count
FROM customers c
JOIN temp_customer_stats tcs ON c.customer_id = tcs.customer_id
WHERE tcs.total_spent > 50000
ORDER BY tcs.total_spent DESC;
-- Clean up
DROP TEMPORARY TABLE temp_customer_stats;
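In PostgreSQL 12 and later there is also a middle ground: the MATERIALIZED keyword forces a CTE to be computed once and reused, without the overhead of creating and dropping a temp table. A minimal sketch (PostgreSQL-specific syntax; other engines make their own inlining decisions):

```sql
-- Force the CTE to be evaluated once, even if referenced multiple times
WITH customer_stats AS MATERIALIZED (
SELECT customer_id, SUM(amount) AS total_spent
FROM orders
WHERE order_date >= '2023-01-01'
GROUP BY customer_id
)
SELECT cs.customer_id, cs.total_spent
FROM customer_stats cs
JOIN customers c ON c.customer_id = cs.customer_id
WHERE c.country = 'USA';
```

Unlike a temp table, a materialized CTE cannot be indexed, so the temp-table approach still wins when later steps need indexed lookups into the intermediate result.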
Technique 11: Use LIMIT and Pagination for Large Result Sets
If you do not need all rows at once, add LIMIT to retrieve only what you need.
sql
-- Return first 1000 results only
SELECT order_id, amount, order_date
FROM orders
WHERE customer_id = 12345
ORDER BY order_date DESC
LIMIT 1000;
-- Efficient pagination using keyset pagination (better than OFFSET for large pages)
-- Page 1
SELECT order_id, amount, order_date
FROM orders
WHERE customer_id = 12345
ORDER BY order_date DESC, order_id DESC
LIMIT 100;
-- Next page — use last values from previous page as cursor
SELECT order_id, amount, order_date
FROM orders
WHERE customer_id = 12345
AND (order_date, order_id) < ('2024-06-15', 99845)
ORDER BY order_date DESC, order_id DESC
LIMIT 100;
Why keyset pagination is better than OFFSET:
sql
-- Slow OFFSET pagination — scans all skipped rows
SELECT order_id, amount
FROM orders
ORDER BY order_date DESC
LIMIT 100 OFFSET 500000; -- Must scan 500,000 rows to skip them
-- Fast keyset pagination — uses index to start at the right position
SELECT order_id, amount
FROM orders
WHERE order_date < '2023-08-20' -- Last date from previous page
ORDER BY order_date DESC
LIMIT 100;
Technique 12: Partitioning Large Tables
Table partitioning divides a very large table into smaller, more manageable physical segments while presenting a single logical table to queries. Queries that filter on the partition key scan only the relevant partitions, dramatically reducing I/O.
Range Partitioning by Date (Most Common)
sql
-- MySQL range partitioning by year
CREATE TABLE orders_partitioned (
order_id BIGINT,
customer_id INT,
order_date DATE,
amount DECIMAL(12,2),
status VARCHAR(20)
)
PARTITION BY RANGE (YEAR(order_date)) (
PARTITION p2021 VALUES LESS THAN (2022),
PARTITION p2022 VALUES LESS THAN (2023),
PARTITION p2023 VALUES LESS THAN (2024),
PARTITION p2024 VALUES LESS THAN (2025),
PARTITION p_future VALUES LESS THAN MAXVALUE
);
-- Query that benefits from partition pruning
-- Only scans the p2024 partition instead of all 50 million rows
SELECT order_id, amount
FROM orders_partitioned
WHERE order_date >= '2024-01-01'
AND order_date < '2025-01-01';
List Partitioning by Category
sql
-- PostgreSQL list partitioning by region
CREATE TABLE orders_by_region (
order_id BIGINT,
customer_id INT,
order_date DATE,
amount DECIMAL(12, 2),
region VARCHAR(50) NOT NULL
)
PARTITION BY LIST (region);
CREATE TABLE orders_north PARTITION OF orders_by_region
FOR VALUES IN ('North', 'Northeast', 'Northwest');
CREATE TABLE orders_south PARTITION OF orders_by_region
FOR VALUES IN ('South', 'Southeast', 'Southwest');
CREATE TABLE orders_east PARTITION OF orders_by_region
FOR VALUES IN ('East', 'Midwest');
CREATE TABLE orders_west PARTITION OF orders_by_region
FOR VALUES IN ('West', 'Pacific');
Technique 13: Query Result Caching and Materialized Views
Some queries are expensive to compute but do not change frequently. Materialized views pre-compute and store the result, allowing instant access without recomputing from scratch each time.
Materialized Views
sql
-- PostgreSQL materialized view
CREATE MATERIALIZED VIEW mv_customer_monthly_summary AS
SELECT
customer_id,
DATE_TRUNC('month', order_date) AS month,
COUNT(*) AS order_count,
SUM(amount) AS total_amount,
AVG(amount) AS avg_amount
FROM orders
WHERE status = 'Completed'
GROUP BY customer_id, DATE_TRUNC('month', order_date);
-- Create index on the materialized view
CREATE INDEX idx_mv_customer_month
ON mv_customer_monthly_summary (customer_id, month);
-- Query the pre-computed result instantly
SELECT customer_id, month, total_amount
FROM mv_customer_monthly_summary
WHERE month >= '2024-01-01'
AND total_amount > 10000;
-- Refresh when source data changes
REFRESH MATERIALIZED VIEW mv_customer_monthly_summary;
-- Refresh without locking reads
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_customer_monthly_summary;
Real-World Optimization Examples
Example 1: Dashboard Summary Query
sql
-- BEFORE: 3 minutes 42 seconds
SELECT
YEAR(o.order_date) AS year,
MONTH(o.order_date) AS month,
c.country,
SUM(o.amount) AS total_sales,
COUNT(DISTINCT o.customer_id) AS unique_customers
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.status = 'Completed'
GROUP BY YEAR(o.order_date), MONTH(o.order_date), c.country
ORDER BY year, month, total_sales DESC;
-- AFTER: 4.2 seconds
-- Changes:
-- 1. Used date range instead of YEAR() function (sargable)
-- 2. Filtered customers before joining
-- 3. Added covering index on orders(status, order_date, customer_id, amount)
-- 4. Added index on customers(customer_id, country)
CREATE INDEX idx_orders_status_date ON orders (status, order_date, customer_id, amount);
WITH completed_orders AS (
SELECT customer_id, order_date, amount
FROM orders
WHERE status = 'Completed'
AND order_date >= '2022-01-01'
)
SELECT
EXTRACT(YEAR FROM co.order_date) AS year,
EXTRACT(MONTH FROM co.order_date) AS month,
c.country,
SUM(co.amount) AS total_sales,
COUNT(DISTINCT co.customer_id) AS unique_customers
FROM completed_orders co
JOIN customers c ON co.customer_id = c.customer_id
GROUP BY
EXTRACT(YEAR FROM co.order_date),
EXTRACT(MONTH FROM co.order_date),
c.country
ORDER BY year, month, total_sales DESC;
Example 2: Customer Ranking Query
sql
-- BEFORE: 8 minutes 15 seconds — correlated subquery
SELECT
customer_id,
name,
(SELECT SUM(amount)
FROM orders
WHERE customer_id = c.customer_id
AND status = 'Completed') AS total_spent
FROM customers c
WHERE country = 'USA'
ORDER BY total_spent DESC
LIMIT 100;
-- AFTER: 1.1 seconds — pre-aggregate with CTE
WITH customer_totals AS (
SELECT customer_id, SUM(amount) AS total_spent
FROM orders
WHERE status = 'Completed'
GROUP BY customer_id
)
SELECT c.customer_id, c.name, ct.total_spent
FROM customers c
JOIN customer_totals ct ON c.customer_id = ct.customer_id
WHERE c.country = 'USA'
ORDER BY ct.total_spent DESC
LIMIT 100;
Optimization Techniques
| Technique | Effort | Impact | Best For |
|---|---|---|---|
| Add indexes | Low | Very High | All slow queries |
| SELECT specific columns | Very Low | Medium | Wide tables, covered indexes |
| Sargable WHERE clauses | Low | High | Queries with functions on columns |
| Optimize JOINs | Medium | High | Multi-table queries |
| WHERE vs HAVING | Very Low | Medium | Aggregation queries |
| Replace correlated subqueries | Medium | Very High | Row-by-row subquery patterns |
| EXISTS vs IN | Low | Medium | Large subquery membership tests |
| CTEs and temp tables | Medium | High | Complex multi-step queries |
| LIMIT and pagination | Very Low | Medium | Large result set queries |
| Partitioning | High | Very High | Very large tables, date-range queries |
| Materialized views | High | Very High | Expensive repeated aggregations |
| EXPLAIN / execution plan | Low | Very High | Diagnosing any slow query |
Advantages and Disadvantages of Query Optimization
Advantages
- Dramatically reduced query execution time — minutes to seconds or seconds to milliseconds
- Lower compute costs — faster queries use less CPU and I/O
- Better user experience — dashboards and applications respond faster
- Reduced database server load — optimized queries free resources for other workloads
- More scalable systems — optimized queries handle growing data volumes gracefully
Disadvantages
- Indexes consume storage space and slow down write operations
- Over-indexing can hurt write-heavy workloads (ETL, data ingestion)
- Partitioning adds schema complexity and management overhead
- Materialized views require refresh strategies and storage
- Optimization requires understanding execution plans — a learning curve for beginners
- Optimizations may need revisiting as data volumes and query patterns change
Common Mistakes to Avoid
- Optimizing without checking the execution plan first — Always EXPLAIN before and after changes to verify your optimization actually helped
- Adding indexes on every column — Indexes on low-cardinality columns (like boolean flags or status with only 3 values) often provide little benefit and slow down writes
- Using functions on indexed columns in WHERE clauses — WHERE YEAR(order_date) = 2024 destroys index usability. Use range conditions instead
- Forgetting to analyze statistics — The optimizer uses table statistics to make decisions. Outdated statistics lead to bad execution plans. Run ANALYZE (PostgreSQL) or UPDATE STATISTICS (SQL Server) regularly
- Using OFFSET for deep pagination on large tables — OFFSET n requires scanning and discarding n rows. Use keyset pagination for large datasets
- Using SELECT * in production queries — Always specify exactly the columns you need
- Ignoring write performance when adding indexes — Every index you add must be maintained on every INSERT, UPDATE, and DELETE. Balance read and write needs carefully
- Not cleaning up temporary tables — Temporary tables left in the database consume space and can interfere with subsequent sessions
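The statistics advice above translates to simple commands, though the exact syntax varies by engine. A sketch using the orders table from this guide:

```sql
-- PostgreSQL: refresh planner statistics for one table
ANALYZE orders;

-- MySQL: rebuild key distribution statistics
ANALYZE TABLE orders;

-- SQL Server: refresh all statistics on a table
UPDATE STATISTICS orders;
```

Many databases also run these automatically (e.g. PostgreSQL's autovacuum), but an explicit refresh after a large bulk load is cheap insurance against bad plans.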
Quick Reference Cheat Sheet
| Slow Pattern | Faster Alternative |
|---|---|
| SELECT * | SELECT col1, col2, col3 |
| WHERE YEAR(date) = 2024 | WHERE date >= '2024-01-01' AND date < '2025-01-01' |
| WHERE col LIKE '%value' | Full-text index or redesign |
| Correlated subquery | JOIN or CTE |
| IN (SELECT ...) large subquery | EXISTS or JOIN |
| GROUP BY ... HAVING col = x | WHERE col = x ... GROUP BY |
| No indexes on JOIN columns | Add indexes on foreign keys |
| OFFSET pagination | Keyset pagination |
| Repeated expensive aggregation | Materialized view |
| Complex query, no steps | Break into CTEs or temp tables |
SQL query optimization is one of the highest-leverage skills in data engineering. A few hours spent optimizing a critical query can save thousands of hours of waiting time and significant compute costs over the life of a system.
Here is the simplest optimization workflow to follow every time:
- Measure first — Run EXPLAIN ANALYZE to understand what the engine is doing
- Check for full table scans — These are almost always fixable with indexes
- Make WHERE clauses sargable — Remove functions from indexed columns
- Select only what you need — Never use SELECT * in production
- Fix joins — Ensure join columns are indexed and filter before joining
- Replace correlated subqueries — Use CTEs or JOINs instead
- Measure again — Verify the improvement with another EXPLAIN ANALYZE
Start with indexes — they are the highest-impact, lowest-effort optimization available. Then work through sargable WHERE clauses, join optimization, and query restructuring. For the most expensive recurring queries, consider materialized views and partitioning.
Optimization is not a one-time task. As your data grows and query patterns evolve, revisit your execution plans regularly and keep your statistics up to date.
FAQs
What is the most effective SQL query optimization technique?
Proper indexing is typically the highest-impact optimization. Adding the right index on a filtered column can transform a 45-second full table scan into a 1-millisecond index seek. Always start with EXPLAIN to understand the execution plan before optimizing.
What is a sargable SQL query?
A sargable query is one where the WHERE clause conditions can be evaluated using an index. Wrapping indexed columns in functions (like YEAR(), UPPER(), or arithmetic operations) makes conditions non-sargable and forces full table scans.
When should I use a CTE vs a temporary table?
Use CTEs for readable, single-use intermediate results within one query. Use temporary tables when an intermediate result is expensive to compute and will be referenced multiple times — temp tables compute once and store the result with its own index support.
How many indexes should I add to a large table?
There is no universal answer — it depends on your read/write ratio and query patterns. A read-heavy analytics table might have 10 indexes. A write-heavy transactional table might have only 2 or 3. Always monitor index usage and remove unused indexes.
What is partition pruning in SQL?
Partition pruning is when the query optimizer recognizes that a query’s WHERE clause only needs data from specific partitions of a partitioned table and skips scanning irrelevant partitions entirely. This can reduce I/O by 90% or more on time-partitioned tables when filtering by date.
How do I find slow queries in my database?
Use the slow query log in MySQL, pg_stat_statements in PostgreSQL, or Query Store in SQL Server to identify the most frequently executed and most resource-intensive queries. These are your highest-priority optimization targets.
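For PostgreSQL, a minimal sketch of querying pg_stat_statements, assuming the extension is installed and preloaded via shared_preload_libraries (column names follow PostgreSQL 13+, where total_exec_time replaced total_time):

```sql
-- One-time setup per database
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top 10 queries by cumulative execution time
SELECT query, calls, total_exec_time, mean_exec_time, rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```

The queries at the top of this list are usually the best candidates for the techniques in this guide.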