SQL Query Optimization Techniques for Large Tables

If you have ever written a SQL query that worked perfectly on a small dataset but suddenly crawled to a halt when run against a table with millions of rows, you have experienced the pain of unoptimized SQL firsthand.

Slow queries are one of the most common and costly problems in data engineering and analytics. They slow down dashboards, block applications, frustrate analysts, and consume expensive compute resources. A query that takes 45 minutes to run is not just annoying — it is a business problem.

The good news is that most slow SQL queries can be dramatically improved using a handful of well-understood techniques. You do not need to be a database administrator to apply them; you just need to understand how SQL engines execute queries and what makes them work harder than necessary.

In this guide, we will walk through every major SQL query optimization technique for large tables with clear explanations, before-and-after examples, and practical guidance on when to apply each one.

Why SQL Queries Get Slow on Large Tables

Before optimizing, it helps to understand why queries get slow in the first place.

When you run a SQL query against a table with millions of rows, the database engine has to do work, and sometimes a lot of it. The most common causes of slow queries are:

Full table scans — The engine reads every single row in the table looking for matches instead of jumping directly to the relevant rows

Missing or unused indexes — Without indexes, the engine has no map to navigate to specific rows efficiently

Inefficient joins — Joining large tables without proper indexes forces the engine to compare every combination of rows

Retrieving too many columns — Selecting more data than you need wastes I/O, memory, and network bandwidth

Non-sargable WHERE clauses — Filtering conditions written in ways the engine cannot use indexes to evaluate

Suboptimal query structure — Correlated subqueries, unnecessary DISTINCT, improper use of HAVING vs WHERE

Poor data types — Joining or filtering on columns with mismatched or inefficient data types

Understanding these root causes points directly toward the solutions.

Setting Up Example Tables

We will use realistic large-table examples throughout this guide.

sql

-- Orders fact table — 50 million rows
CREATE TABLE orders (
    order_id        BIGINT PRIMARY KEY,
    customer_id     INT NOT NULL,
    product_id      INT NOT NULL,
    order_date      DATE NOT NULL,
    status          VARCHAR(20) NOT NULL,
    amount          DECIMAL(12, 2) NOT NULL,
    region          VARCHAR(50) NOT NULL,
    sales_rep_id    INT,
    discount        DECIMAL(5, 2) DEFAULT 0,
    created_at      TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Customers dimension table — 2 million rows
CREATE TABLE customers (
    customer_id     INT PRIMARY KEY,
    name            VARCHAR(100) NOT NULL,
    email           VARCHAR(150) UNIQUE,
    country         VARCHAR(50),
    segment         VARCHAR(30),
    created_date    DATE
);

-- Products dimension table — 500,000 rows
CREATE TABLE products (
    product_id      INT PRIMARY KEY,
    product_name    VARCHAR(200),
    category        VARCHAR(50),
    subcategory     VARCHAR(50),
    unit_price      DECIMAL(10, 2),
    supplier_id     INT
);

Technique 1: Use EXPLAIN / EXPLAIN ANALYZE — Understand Before Optimizing

The single most important skill in query optimization is reading the execution plan: the roadmap the database engine creates showing exactly how it plans to execute your query.

Never optimize blindly. Always start by understanding what the engine is doing.

How to Get the Execution Plan

sql

-- MySQL / PostgreSQL
EXPLAIN SELECT * FROM orders WHERE customer_id = 12345;

-- PostgreSQL — with actual execution statistics
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 12345;

-- SQL Server
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
SELECT * FROM orders WHERE customer_id = 12345;

-- Or use graphical execution plan in SQL Server Management Studio
-- Query menu → Include Actual Execution Plan (Ctrl+M)

Reading the Execution Plan — Key Things to Look For

Seq Scan / Table Scan — The engine is reading the entire table. On large tables, this is usually a red flag.

Index Scan / Index Seek — The engine is using an index to find rows efficiently. This is what you want.

Nested Loop / Hash Join / Merge Join — How the engine is joining tables. Each has different performance characteristics.

Rows — The estimated number of rows at each step. Large row estimates in early steps mean expensive operations.

Cost — The relative cost the optimizer assigns to each step. Higher cost = more work.

sql

-- Example execution plan output (PostgreSQL)
EXPLAIN ANALYZE
SELECT order_id, amount
FROM orders
WHERE customer_id = 12345
AND order_date >= '2024-01-01';

-- Output (before optimization):
-- Seq Scan on orders  (cost=0.00..2847293.00 rows=127 width=20)
--                     (actual time=0.042..45823.112 rows=127 loops=1)
--   Filter: ((customer_id = 12345) AND (order_date >= '2024-01-01'))
--   Rows Removed by Filter: 49999873
-- Planning Time: 0.152 ms
-- Execution Time: 45823.334 ms  ← 45 seconds!

The execution plan shows a Seq Scan reading all 50 million rows and discarding 49,999,873 of them. Adding an index on customer_id transforms this.

Technique 2: Indexing — The Most Impactful Optimization

Indexes are the single most powerful tool for speeding up SQL queries on large tables. An index is a separate data structure that the engine maintains, allowing it to find rows matching a condition without scanning the entire table.

Think of an index like the index at the back of a textbook: instead of reading every page to find “photosynthesis”, you look it up in the index and jump directly to page 342.

Single Column Index

sql

-- Create index on the most commonly filtered column
CREATE INDEX idx_orders_customer_id
ON orders (customer_id);

CREATE INDEX idx_orders_order_date
ON orders (order_date);

CREATE INDEX idx_orders_status
ON orders (status);

After adding the customer_id index:

sql

EXPLAIN ANALYZE
SELECT order_id, amount
FROM orders
WHERE customer_id = 12345;

-- Output (after indexing):
-- Index Scan using idx_orders_customer_id on orders
--   (cost=0.56..892.34 rows=127 width=20)
--   (actual time=0.034..1.247 rows=127 loops=1)
-- Execution Time: 1.489 ms  ← From 45 seconds to 1.5ms!
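This scan-to-seek transformation can be reproduced at small scale with SQLite's EXPLAIN QUERY PLAN from Python. A minimal sketch with illustrative table and index names; SQLite's plan wording differs from PostgreSQL's, but it shows the same shift:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        order_date  TEXT NOT NULL,
        amount      REAL NOT NULL
    )
""")

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row describes the step
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(r[-1] for r in rows)

query = "SELECT order_id, amount FROM orders WHERE customer_id = 12345"

before = plan(query)  # no index yet: a full table scan
conn.execute("CREATE INDEX idx_orders_customer_id ON orders (customer_id)")
after = plan(query)   # now an index search

print("before:", before)
print("after: ", after)
```

On recent SQLite versions the first plan contains SCAN while the second contains USING INDEX idx_orders_customer_id, mirroring the Seq Scan to Index Scan change shown above.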

Composite Index — Multiple Columns

When your WHERE clause filters on multiple columns, a composite index covering all those columns is far more efficient than separate single-column indexes.

sql

-- Frequently used together: filter by customer, then date
CREATE INDEX idx_orders_customer_date
ON orders (customer_id, order_date);

-- Query that benefits from composite index
SELECT order_id, amount, status
FROM orders
WHERE customer_id = 12345
AND order_date BETWEEN '2024-01-01' AND '2024-12-31';

The Column Order in Composite Indexes Matters

In a composite index, the order of columns is critical. The index is most effective when queries filter on the leading columns first.

sql

-- Index: (customer_id, order_date, status)

-- Uses the index efficiently — filters on leading columns
SELECT * FROM orders
WHERE customer_id = 12345
AND order_date >= '2024-01-01';

-- Uses the index efficiently — filters on first column only
SELECT * FROM orders
WHERE customer_id = 12345;

-- Does NOT use the index efficiently — skips the leading column
SELECT * FROM orders
WHERE order_date >= '2024-01-01'
AND status = 'Completed';

Rule: Build composite indexes with the most selective column first, and ensure queries that filter on the middle or end columns also filter on the leading columns.
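The leading-column rule can be checked the same way. In this SQLite sketch (schema illustrative), the composite index is used when the query filters on customer_id, and ignored when it filters only on order_date:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER,
        order_date  TEXT,
        status      TEXT
    )
""")
conn.execute("CREATE INDEX idx_customer_date ON orders (customer_id, order_date)")

def plan(sql):
    return " | ".join(r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Filters on the leading column: the composite index is usable
leading = plan("SELECT * FROM orders "
               "WHERE customer_id = 12345 AND order_date >= '2024-01-01'")

# Skips the leading column: the engine falls back to scanning the table
skipping = plan("SELECT * FROM orders WHERE order_date >= '2024-01-01'")

print(leading)
print(skipping)
```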

Covering Index — Include All Needed Columns

A covering index includes all the columns needed by a query — so the engine never has to go back to the main table to fetch data. This is one of the most powerful optimizations available.

sql

-- Query needs: customer_id (filter), order_date (filter), amount and status (select)
-- SQL Server / PostgreSQL 11+ syntax: key columns plus INCLUDE columns
CREATE INDEX idx_orders_covering
ON orders (customer_id, order_date)
INCLUDE (amount, status);

-- MySQL has no INCLUDE clause: put all four columns in the index key
CREATE INDEX idx_orders_covering
ON orders (customer_id, order_date, amount, status);

With a covering index, the engine satisfies the entire query from the index alone — no I/O to the main table at all.
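SQLite marks index-only access explicitly in its plans, which makes covering behavior easy to observe locally. A sketch with an illustrative schema: selecting only indexed columns yields a COVERING INDEX step, while selecting a non-indexed column forces a lookup back into the table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER,
        order_date  TEXT,
        amount      REAL,
        status      TEXT
    )
""")
# Every column the first query needs lives in the index itself
conn.execute("CREATE INDEX idx_cust_date_amt ON orders (customer_id, order_date, amount)")

def plan(sql):
    return " | ".join(r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

covered = plan("SELECT order_date, amount FROM orders WHERE customer_id = 7")
not_covered = plan("SELECT status FROM orders WHERE customer_id = 7")

print(covered)      # index-only access: USING COVERING INDEX
print(not_covered)  # index search plus a table lookup to fetch status
```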

Partial Index — Index Only What You Query

If you frequently query only a subset of a table, such as active orders or recent records, a partial index on that subset is smaller, faster, and uses less storage.

sql

-- PostgreSQL syntax (SQL Server offers similar filtered indexes)
-- Most queries filter on active orders only
CREATE INDEX idx_orders_active
ON orders (customer_id, order_date)
WHERE status = 'Active';

-- Much smaller than a full index — only indexes ~15% of rows if 15% are active
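SQLite also supports partial indexes, so the behavior can be sketched locally (schema illustrative). The planner uses the partial index only when the query's WHERE clause implies the index's WHERE clause:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER,
        order_date  TEXT,
        status      TEXT
    )
""")
conn.execute("""
    CREATE INDEX idx_orders_active
    ON orders (customer_id, order_date)
    WHERE status = 'Active'
""")

def plan(sql):
    return " | ".join(r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Query condition implies the index predicate: partial index is used
active = plan("SELECT * FROM orders WHERE status = 'Active' AND customer_id = 5")

# Different status value: the partial index does not apply
cancelled = plan("SELECT * FROM orders WHERE status = 'Cancelled' AND customer_id = 5")

print(active)
print(cancelled)
```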

Index Maintenance — Do Not Over-Index

Indexes have a cost — they slow down INSERT, UPDATE, and DELETE operations because the engine must update every index whenever data changes. More indexes = faster reads, slower writes.

Guidelines:

  • Index columns used in WHERE, JOIN, and ORDER BY clauses
  • Index foreign key columns in child tables
  • Remove unused indexes — they slow writes with no benefit
  • Monitor index usage with database-specific system views

sql

-- PostgreSQL: find unused indexes
SELECT schemaname, tablename, indexname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
AND indexname NOT LIKE '%pkey%'
ORDER BY tablename;

-- MySQL: find unused indexes
SELECT * FROM sys.schema_unused_indexes;

Technique 3: SELECT Only What You Need

One of the most common and easily fixed performance mistakes is using SELECT * when you only need a few columns.

The Problem With SELECT *

sql

-- Bad — retrieves ALL columns from a 50 million row table
-- Transfers potentially hundreds of bytes per row across the network
SELECT *
FROM orders
WHERE customer_id = 12345;

-- Good — retrieves only the 3 columns you actually need
SELECT order_id, order_date, amount
FROM orders
WHERE customer_id = 12345;

Why SELECT * is expensive:

  • Forces the engine to retrieve every column from disk
  • Increases memory usage for storing and transmitting results
  • Prevents covering indexes from being used
  • Increases network I/O between database and application
  • Forces unnecessary data type conversions in the application layer

Always specify the exact columns you need. This applies especially when joining multiple large tables — the column count multiplies across every joined table.

Technique 4: Write Sargable WHERE Clauses

Sargable (Search ARGument ABLE) means the WHERE clause is written in a way that allows the database engine to use an index to evaluate it.

Non-sargable conditions force full table scans even when indexes exist on the relevant columns.

Functions on Indexed Columns — Non-Sargable

sql

-- Non-sargable — wraps column in function, index cannot be used
SELECT * FROM orders
WHERE YEAR(order_date) = 2024;

-- Sargable — range condition, index CAN be used
SELECT * FROM orders
WHERE order_date >= '2024-01-01'
AND order_date < '2025-01-01';

-- Non-sargable — function on column
SELECT * FROM customers
WHERE UPPER(email) = 'ALICE@EXAMPLE.COM';

-- Sargable — use consistent case or case-insensitive collation
SELECT * FROM customers
WHERE email = 'alice@example.com';

-- Non-sargable — arithmetic on column
SELECT * FROM orders
WHERE amount * 1.1 > 1000;

-- Sargable — move arithmetic to the other side
SELECT * FROM orders
WHERE amount > 1000 / 1.1;
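The sargability effect is easy to verify locally. In this SQLite sketch (strftime plays the role of YEAR, schema illustrative), wrapping the column in a function forces a scan, while the equivalent range condition uses the index:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id   INTEGER PRIMARY KEY,
        order_date TEXT,
        amount     REAL
    )
""")
conn.execute("CREATE INDEX idx_order_date ON orders (order_date)")

def plan(sql):
    return " | ".join(r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Non-sargable: the function hides the column from the index
non_sargable = plan(
    "SELECT * FROM orders WHERE strftime('%Y', order_date) = '2024'")

# Sargable: a plain range over the raw column
sargable = plan(
    "SELECT * FROM orders "
    "WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01'")

print(non_sargable)
print(sargable)
```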

LIKE With Leading Wildcard — Non-Sargable

sql

-- Non-sargable — leading wildcard prevents index use
SELECT * FROM customers
WHERE name LIKE '%Smith';

-- Sargable — trailing wildcard only, index CAN be used
SELECT * FROM customers
WHERE name LIKE 'Smith%';

-- For full-text search needs, use full-text indexes instead (MySQL syntax shown)
CREATE FULLTEXT INDEX idx_customers_name ON customers(name);
SELECT * FROM customers
WHERE MATCH(name) AGAINST ('Smith' IN BOOLEAN MODE);

NULL Handling — Sargability Considerations

sql

-- Non-sargable on most databases
SELECT * FROM orders
WHERE ISNULL(sales_rep_id, 0) = 0;

-- Sargable
SELECT * FROM orders
WHERE sales_rep_id IS NULL;

-- Or apply COALESCE to the parameter side, not the column
-- Caveat: when @rep_id is NULL this matches only rows with non-NULL sales_rep_id
SELECT * FROM orders
WHERE sales_rep_id = COALESCE(@rep_id, sales_rep_id);

Technique 5: Optimize JOIN Operations

Joins on large tables are one of the most expensive SQL operations. Optimizing them has a dramatic impact on query performance.

Always Join on Indexed Columns

sql

-- Both customer_id columns must be indexed for efficient join
SELECT o.order_id, o.amount, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= '2024-01-01';

-- orders.customer_id should have an index
-- customers.customer_id is the primary key (already indexed)

Filter Before Joining — Use Subqueries or CTEs

Reducing the size of a table before joining it dramatically reduces the work the join must do.

sql

-- Slow — joins all 50 million orders to customers, then filters
SELECT o.order_id, o.amount, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= '2024-01-01'
AND o.status = 'Completed'
AND c.country = 'USA';

-- Faster — filter each table first, then join the reduced result sets
WITH recent_orders AS (
    SELECT order_id, customer_id, amount
    FROM orders
    WHERE order_date >= '2024-01-01'
    AND status = 'Completed'
),
us_customers AS (
    SELECT customer_id, name
    FROM customers
    WHERE country = 'USA'
)
SELECT o.order_id, o.amount, c.name
FROM recent_orders o
JOIN us_customers c ON o.customer_id = c.customer_id;

Choose the Right Join Type

sql

-- INNER JOIN — only rows matching in both tables
-- Use when you only need matched records
SELECT o.order_id, c.name
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id;

-- LEFT JOIN — all rows from left table, matched from right
-- Use when you need all orders regardless of customer data availability
SELECT o.order_id, c.name
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id;

-- Avoid CROSS JOIN on large tables — produces m x n rows
-- 50M orders × 2M customers = 100 trillion rows
SELECT * FROM orders CROSS JOIN customers;  -- Never do this!

Join Order in Queries

Most modern optimizers determine join order automatically, but in complex queries with many joins, explicit hints or CTEs can help the optimizer make better choices.

sql

-- Join smaller tables first to reduce intermediate result sets
SELECT o.order_id, o.amount, c.name, p.product_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN products p ON o.product_id = p.product_id
WHERE c.segment = 'Enterprise'  -- Filter on dimension table
AND p.category = 'Electronics'; -- Filter on dimension table

Technique 6: Use WHERE Instead of HAVING for Row Filtering

HAVING filters are applied after GROUP BY aggregation, meaning the engine groups millions of rows and then discards the groups you do not want.

WHERE filters are applied before GROUP BY, meaning the engine filters down to a smaller dataset before doing the expensive aggregation work.

sql

-- Slow — GROUP BY aggregates all 50 million rows, HAVING discards most groups
SELECT customer_id, SUM(amount) AS total_sales
FROM orders
GROUP BY customer_id
HAVING SUM(amount) > 10000
AND customer_id IN (SELECT customer_id FROM customers WHERE country = 'USA');

-- Fast — WHERE reduces dataset before grouping
SELECT customer_id, SUM(amount) AS total_sales
FROM orders
WHERE customer_id IN (SELECT customer_id FROM customers WHERE country = 'USA')
GROUP BY customer_id
HAVING SUM(amount) > 10000;

Rule: Use WHERE for filtering rows. Use HAVING only for filtering on aggregated values (like SUM, COUNT, AVG) that cannot be evaluated before GROUP BY.
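Both forms return identical results; only the amount of work done before aggregation differs. This SQLite sketch verifies the equivalence on a toy dataset (all values illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                         customer_id INTEGER, amount REAL);
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, country TEXT);
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "USA"), (2, "UK"), (3, "USA")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 1, 8000.0), (2, 1, 5000.0),
                  (3, 2, 20000.0), (4, 3, 4000.0)])

# Row filter buried in HAVING: every group is built, then most are discarded
having_version = conn.execute("""
    SELECT customer_id, SUM(amount) FROM orders
    GROUP BY customer_id
    HAVING SUM(amount) > 10000
    AND customer_id IN (SELECT customer_id FROM customers WHERE country = 'USA')
""").fetchall()

# Row filter moved to WHERE: rows are pruned before grouping
where_version = conn.execute("""
    SELECT customer_id, SUM(amount) FROM orders
    WHERE customer_id IN (SELECT customer_id FROM customers WHERE country = 'USA')
    GROUP BY customer_id
    HAVING SUM(amount) > 10000
""").fetchall()

print(having_version, where_version)
```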

Technique 7: Replace Correlated Subqueries With Joins or CTEs

Correlated subqueries execute once for every row in the outer query — turning an O(n) problem into an O(n²) problem on large tables.

sql

-- Very slow — correlated subquery runs once for each of 50 million rows
SELECT order_id, amount,
       (SELECT name FROM customers WHERE customer_id = o.customer_id) AS customer_name
FROM orders o
WHERE order_date >= '2024-01-01';

-- Fast — JOIN executes once, same result
SELECT o.order_id, o.amount, c.name AS customer_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= '2024-01-01';

-- Another correlated subquery example — finding orders above average
-- Slow — recalculates average for every row
SELECT order_id, amount
FROM orders o
WHERE amount > (
    SELECT AVG(amount)
    FROM orders
    WHERE customer_id = o.customer_id
);

-- Fast — calculate averages once with CTE, then join
WITH customer_avg AS (
    SELECT customer_id, AVG(amount) AS avg_amount
    FROM orders
    GROUP BY customer_id
)
SELECT o.order_id, o.amount
FROM orders o
JOIN customer_avg ca ON o.customer_id = ca.customer_id
WHERE o.amount > ca.avg_amount;
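The above-average example can be sanity-checked on a toy dataset. This SQLite sketch (values illustrative) confirms the correlated form and the CTE rewrite return the same rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                                     customer_id INTEGER, amount REAL)""")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 1, 100.0), (2, 1, 300.0),
                  (3, 2, 50.0), (4, 2, 150.0), (5, 2, 400.0)])

# Correlated: the inner AVG is conceptually recomputed for every outer row
correlated = conn.execute("""
    SELECT order_id, amount FROM orders o
    WHERE amount > (SELECT AVG(amount) FROM orders
                    WHERE customer_id = o.customer_id)
    ORDER BY order_id
""").fetchall()

# Rewritten: averages computed once, then joined
rewritten = conn.execute("""
    WITH customer_avg AS (
        SELECT customer_id, AVG(amount) AS avg_amount
        FROM orders GROUP BY customer_id
    )
    SELECT o.order_id, o.amount
    FROM orders o
    JOIN customer_avg ca ON o.customer_id = ca.customer_id
    WHERE o.amount > ca.avg_amount
    ORDER BY o.order_id
""").fetchall()

print(correlated)
print(rewritten)
```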

Technique 8: Use EXISTS Instead of IN for Large Subqueries

When checking membership in a large subquery result, EXISTS is often faster than IN because EXISTS stops as soon as it finds the first match, while IN must evaluate the entire subquery first.

sql

-- Potentially slow — IN evaluates entire subquery result
SELECT order_id, amount
FROM orders
WHERE customer_id IN (
    SELECT customer_id
    FROM customers
    WHERE country = 'USA'
    AND segment = 'Enterprise'
);

-- Often faster — EXISTS stops at first match
SELECT o.order_id, o.amount
FROM orders o
WHERE EXISTS (
    SELECT 1
    FROM customers c
    WHERE c.customer_id = o.customer_id
    AND c.country = 'USA'
    AND c.segment = 'Enterprise'
);

-- Often best — explicit JOIN with proper indexes
SELECT o.order_id, o.amount
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE c.country = 'USA'
AND c.segment = 'Enterprise';

Note: Modern query optimizers often transform IN to EXISTS or JOIN automatically, but writing it explicitly ensures the behavior you intend.

Technique 9: Avoid SELECT DISTINCT When Possible

DISTINCT forces the engine to sort and de-duplicate the entire result set, which is expensive on large tables.

sql

-- Slow — DISTINCT forces sort of all results
SELECT DISTINCT customer_id
FROM orders
WHERE order_date >= '2024-01-01';

-- Often faster — GROUP BY achieves the same result
SELECT customer_id
FROM orders
WHERE order_date >= '2024-01-01'
GROUP BY customer_id;

-- Better yet — drive from the customers table with EXISTS if that is what you need
SELECT c.customer_id
FROM customers c
WHERE EXISTS (
    SELECT 1
    FROM orders o
    WHERE o.customer_id = c.customer_id
    AND o.order_date >= '2024-01-01'
);

If you find yourself using DISTINCT frequently, it often signals a deeper problem: either a missing JOIN condition causing duplicate rows, or a query that should be redesigned.

Technique 10: Use CTEs and Temp Tables to Break Up Complex Queries

Very complex queries with many joins and subqueries are hard for the optimizer to handle efficiently. Breaking them into steps using CTEs or temporary tables can dramatically improve performance and readability.

CTEs for Readable Multi-Step Queries

sql

-- Complex query broken into clear, optimizable steps
WITH
-- Step 1: Get recent high-value orders
high_value_orders AS (
    SELECT order_id, customer_id, product_id, amount, order_date
    FROM orders
    WHERE order_date >= '2024-01-01'
    AND amount >= 500
    AND status = 'Completed'
),
-- Step 2: Get enterprise customers in key markets
target_customers AS (
    SELECT customer_id, name, country
    FROM customers
    WHERE segment = 'Enterprise'
    AND country IN ('USA', 'UK', 'Germany', 'France')
),
-- Step 3: Get electronics products
electronics AS (
    SELECT product_id, product_name, category
    FROM products
    WHERE category = 'Electronics'
),
-- Step 4: Join everything together
final_result AS (
    SELECT
        hvo.order_id,
        hvo.amount,
        hvo.order_date,
        tc.name AS customer_name,
        tc.country,
        e.product_name
    FROM high_value_orders hvo
    JOIN target_customers tc ON hvo.customer_id = tc.customer_id
    JOIN electronics e ON hvo.product_id = e.product_id
)
SELECT *
FROM final_result
ORDER BY amount DESC
LIMIT 1000;

Temporary Tables for Intermediate Results

When a CTE is referenced multiple times in a query, it may be recalculated each time. A temporary table calculates once and stores the result.

sql

-- Calculate expensive intermediate result once
CREATE TEMPORARY TABLE temp_customer_stats AS
SELECT
    customer_id,
    COUNT(*) AS order_count,
    SUM(amount) AS total_spent,
    AVG(amount) AS avg_order_value,
    MAX(order_date) AS last_order_date
FROM orders
WHERE order_date >= '2023-01-01'
GROUP BY customer_id;

-- Index the temp table for subsequent joins
CREATE INDEX idx_temp_customer_id
ON temp_customer_stats (customer_id);

-- Now use the pre-calculated stats in multiple queries efficiently
SELECT c.name, c.country, tcs.total_spent, tcs.order_count
FROM customers c
JOIN temp_customer_stats tcs ON c.customer_id = tcs.customer_id
WHERE tcs.total_spent > 50000
ORDER BY tcs.total_spent DESC;

-- Clean up
DROP TEMPORARY TABLE temp_customer_stats;

Technique 11: Use LIMIT and Pagination for Large Result Sets

If you do not need all rows at once, add LIMIT to retrieve only what you need.

sql

-- Return first 1000 results only
SELECT order_id, amount, order_date
FROM orders
WHERE customer_id = 12345
ORDER BY order_date DESC
LIMIT 1000;

-- Efficient pagination using keyset pagination (better than OFFSET for large pages)
-- Page 1
SELECT order_id, amount, order_date
FROM orders
WHERE customer_id = 12345
ORDER BY order_date DESC, order_id DESC
LIMIT 100;

-- Next page — use last values from previous page as cursor
SELECT order_id, amount, order_date
FROM orders
WHERE customer_id = 12345
AND (order_date, order_id) < ('2024-06-15', 99845)
ORDER BY order_date DESC, order_id DESC
LIMIT 100;

Why keyset pagination is better than OFFSET:

sql

-- Slow OFFSET pagination — scans all skipped rows
SELECT order_id, amount
FROM orders
ORDER BY order_date DESC
LIMIT 100 OFFSET 500000;  -- Must scan 500,000 rows to skip them

-- Fast keyset pagination — uses index to start at the right position
SELECT order_id, amount
FROM orders
WHERE order_date < '2023-08-20'  -- Last date from previous page
ORDER BY order_date DESC
LIMIT 100;
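A cursor-style page fetcher built on this pattern might look like the following SQLite sketch (illustrative data; the row-value comparison needs SQLite 3.15 or newer):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                                     order_date TEXT, amount REAL)""")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(i, f"2024-01-{i:02d}", i * 10.0) for i in range(1, 11)])

def fetch_page(cursor_row=None, page_size=3):
    """cursor_row is the (order_date, order_id) of the last row already seen."""
    if cursor_row is None:
        return conn.execute(
            "SELECT order_date, order_id FROM orders "
            "ORDER BY order_date DESC, order_id DESC LIMIT ?",
            (page_size,)).fetchall()
    return conn.execute(
        "SELECT order_date, order_id FROM orders "
        "WHERE (order_date, order_id) < (?, ?) "
        "ORDER BY order_date DESC, order_id DESC LIMIT ?",
        (*cursor_row, page_size)).fetchall()

page1 = fetch_page()
page2 = fetch_page(cursor_row=page1[-1])  # resume after page 1's last row

print(page1)
print(page2)
```

Each page picks up exactly where the previous one ended, with no rows skipped or repeated and no OFFSET scan.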

Technique 12: Partitioning Large Tables

Table partitioning divides a very large table into smaller, more manageable physical segments while presenting a single logical table to queries. Queries that filter on the partition key scan only the relevant partitions, dramatically reducing I/O.

Range Partitioning by Date (Most Common)

sql

-- MySQL range partitioning by year
CREATE TABLE orders_partitioned (
    order_id    BIGINT,
    customer_id INT,
    order_date  DATE,
    amount      DECIMAL(12,2),
    status      VARCHAR(20)
)
PARTITION BY RANGE (YEAR(order_date)) (
    PARTITION p2021 VALUES LESS THAN (2022),
    PARTITION p2022 VALUES LESS THAN (2023),
    PARTITION p2023 VALUES LESS THAN (2024),
    PARTITION p2024 VALUES LESS THAN (2025),
    PARTITION p_future VALUES LESS THAN MAXVALUE
);

-- Query that benefits from partition pruning
-- Only scans the p2024 partition instead of all 50 million rows
SELECT order_id, amount
FROM orders_partitioned
WHERE order_date >= '2024-01-01'
AND order_date < '2025-01-01';

List Partitioning by Category

sql

-- PostgreSQL list partitioning by region
CREATE TABLE orders_by_region (
    order_id    BIGINT,
    customer_id INT,
    order_date  DATE,
    amount      DECIMAL(12,2),
    region      VARCHAR(50) NOT NULL
)
PARTITION BY LIST (region);

CREATE TABLE orders_north PARTITION OF orders_by_region
FOR VALUES IN ('North', 'Northeast', 'Northwest');

CREATE TABLE orders_south PARTITION OF orders_by_region
FOR VALUES IN ('South', 'Southeast', 'Southwest');

CREATE TABLE orders_east PARTITION OF orders_by_region
FOR VALUES IN ('East', 'Midwest');

CREATE TABLE orders_west PARTITION OF orders_by_region
FOR VALUES IN ('West', 'Pacific');

Technique 13: Query Result Caching and Materialized Views

Some queries are expensive to compute but do not change frequently. Materialized views pre-compute and store the result, allowing instant access without recomputing from scratch each time.

Materialized Views

sql

-- PostgreSQL materialized view
CREATE MATERIALIZED VIEW mv_customer_monthly_summary AS
SELECT
    customer_id,
    DATE_TRUNC('month', order_date) AS month,
    COUNT(*) AS order_count,
    SUM(amount) AS total_amount,
    AVG(amount) AS avg_amount
FROM orders
WHERE status = 'Completed'
GROUP BY customer_id, DATE_TRUNC('month', order_date);

-- Create index on the materialized view
CREATE INDEX idx_mv_customer_month
ON mv_customer_monthly_summary (customer_id, month);

-- Query the pre-computed result instantly
SELECT customer_id, month, total_amount
FROM mv_customer_monthly_summary
WHERE month >= '2024-01-01'
AND total_amount > 10000;

-- Refresh when source data changes
REFRESH MATERIALIZED VIEW mv_customer_monthly_summary;

-- Refresh without locking reads
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_customer_monthly_summary;
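SQLite has no materialized views, but the idea (precompute, store, refresh on demand) can be simulated with a summary table. A sketch with illustrative data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                customer_id INTEGER, amount REAL, status TEXT)""")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)",
                 [(1, 1, 100.0, "Completed"),
                  (2, 1, 200.0, "Completed"),
                  (3, 2, 50.0, "Pending")])

def refresh_summary():
    # Rebuild the precomputed aggregate, analogous to REFRESH MATERIALIZED VIEW
    conn.execute("DROP TABLE IF EXISTS mv_customer_totals")
    conn.execute("""
        CREATE TABLE mv_customer_totals AS
        SELECT customer_id, SUM(amount) AS total_amount, COUNT(*) AS order_count
        FROM orders WHERE status = 'Completed'
        GROUP BY customer_id
    """)

refresh_summary()
first = conn.execute("SELECT total_amount, order_count FROM mv_customer_totals "
                     "WHERE customer_id = 1").fetchone()

# The stored summary goes stale until the next refresh
conn.execute("INSERT INTO orders VALUES (4, 1, 400.0, 'Completed')")
refresh_summary()
second = conn.execute("SELECT total_amount, order_count FROM mv_customer_totals "
                      "WHERE customer_id = 1").fetchone()

print(first, second)
```

Reads hit the small precomputed table; the cost of the aggregation is paid only at refresh time, which is the same trade-off a real materialized view makes.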

Real-World Optimization Examples

Example 1: Dashboard Summary Query

sql

-- BEFORE: 3 minutes 42 seconds
SELECT
    YEAR(o.order_date) AS year,
    MONTH(o.order_date) AS month,
    c.country,
    SUM(o.amount) AS total_sales,
    COUNT(DISTINCT o.customer_id) AS unique_customers
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.status = 'Completed'
GROUP BY YEAR(o.order_date), MONTH(o.order_date), c.country
ORDER BY year, month, total_sales DESC;

-- AFTER: 4.2 seconds
-- Changes: 
-- 1. Used date range instead of YEAR() function (sargable)
-- 2. Filtered customers before joining
-- 3. Added covering index on orders(status, order_date, customer_id, amount)
-- 4. Added index on customers(customer_id, country)

CREATE INDEX idx_orders_status_date ON orders (status, order_date, customer_id, amount);

WITH completed_orders AS (
    SELECT customer_id, order_date, amount
    FROM orders
    WHERE status = 'Completed'
    AND order_date >= '2022-01-01'
)
SELECT
    EXTRACT(YEAR FROM co.order_date) AS year,
    EXTRACT(MONTH FROM co.order_date) AS month,
    c.country,
    SUM(co.amount) AS total_sales,
    COUNT(DISTINCT co.customer_id) AS unique_customers
FROM completed_orders co
JOIN customers c ON co.customer_id = c.customer_id
GROUP BY
    EXTRACT(YEAR FROM co.order_date),
    EXTRACT(MONTH FROM co.order_date),
    c.country
ORDER BY year, month, total_sales DESC;

Example 2: Customer Ranking Query

sql

-- BEFORE: 8 minutes 15 seconds — correlated subquery
SELECT
    customer_id,
    name,
    (SELECT SUM(amount)
     FROM orders
     WHERE customer_id = c.customer_id
     AND status = 'Completed') AS total_spent
FROM customers c
WHERE country = 'USA'
ORDER BY total_spent DESC
LIMIT 100;

-- AFTER: 1.1 seconds — pre-aggregate with CTE
WITH customer_totals AS (
    SELECT customer_id, SUM(amount) AS total_spent
    FROM orders
    WHERE status = 'Completed'
    GROUP BY customer_id
)
SELECT c.customer_id, c.name, ct.total_spent
FROM customers c
JOIN customer_totals ct ON c.customer_id = ct.customer_id
WHERE c.country = 'USA'
ORDER BY ct.total_spent DESC
LIMIT 100;

Optimization Techniques

| Technique | Effort | Impact | Best For |
|---|---|---|---|
| Add indexes | Low | Very High | All slow queries |
| SELECT specific columns | Very Low | Medium | Wide tables, covered indexes |
| Sargable WHERE clauses | Low | High | Queries with functions on columns |
| Optimize JOINs | Medium | High | Multi-table queries |
| WHERE vs HAVING | Very Low | Medium | Aggregation queries |
| Replace correlated subqueries | Medium | Very High | Row-by-row subquery patterns |
| EXISTS vs IN | Low | Medium | Large subquery membership tests |
| CTEs and temp tables | Medium | High | Complex multi-step queries |
| LIMIT and pagination | Very Low | Medium | Large result set queries |
| Partitioning | High | Very High | Very large tables, date-range queries |
| Materialized views | High | Very High | Expensive repeated aggregations |
| EXPLAIN / execution plan | Low | Very High | Diagnosing any slow query |

Advantages and Disadvantages of Query Optimization

Advantages

  • Dramatically reduced query execution time — minutes to seconds or seconds to milliseconds
  • Lower compute costs — faster queries use less CPU and I/O
  • Better user experience — dashboards and applications respond faster
  • Reduced database server load — optimized queries free resources for other workloads
  • More scalable systems — optimized queries handle growing data volumes gracefully

Disadvantages

  • Indexes consume storage space and slow down write operations
  • Over-indexing can hurt write-heavy workloads (ETL, data ingestion)
  • Partitioning adds schema complexity and management overhead
  • Materialized views require refresh strategies and storage
  • Optimization requires understanding execution plans — a learning curve for beginners
  • Optimizations may need revisiting as data volumes and query patterns change

Common Mistakes to Avoid

  • Optimizing without checking the execution plan first — Always EXPLAIN before and after changes to verify your optimization actually helped
  • Adding indexes on every column — Indexes on low-cardinality columns (like boolean flags or status with only 3 values) often provide little benefit and slow down writes
  • Using functions on indexed columns in WHERE clauses — WHERE YEAR(order_date) = 2024 destroys index usability. Use range conditions instead
  • Forgetting to analyze statistics — The optimizer uses table statistics to make decisions. Outdated statistics lead to bad execution plans. Run ANALYZE (PostgreSQL) or UPDATE STATISTICS (SQL Server) regularly
  • Using OFFSET for deep pagination on large tables — OFFSET n requires scanning and discarding n rows. Use keyset pagination for large datasets
  • Using SELECT * in production queries — Always specify exactly the columns you need
  • Ignoring write performance when adding indexes — Every index you add must be maintained on every INSERT, UPDATE, and DELETE. Balance read and write needs carefully
  • Not cleaning up temporary tables — Temporary tables left in the database consume space and can interfere with subsequent sessions

Quick Reference Cheat Sheet

| Slow Pattern | Faster Alternative |
|---|---|
| SELECT * | SELECT col1, col2, col3 |
| WHERE YEAR(date) = 2024 | WHERE date >= '2024-01-01' AND date < '2025-01-01' |
| WHERE col LIKE '%value' | Full-text index or redesign |
| Correlated subquery | JOIN or CTE |
| IN (SELECT ...) large subquery | EXISTS or JOIN |
| GROUP BY ... HAVING col = x | WHERE col = x ... GROUP BY |
| No indexes on JOIN columns | Add indexes on foreign keys |
| OFFSET pagination | Keyset pagination |
| Repeated expensive aggregation | Materialized view |
| Complex query, no steps | Break into CTEs or temp tables |

SQL query optimization is one of the highest-leverage skills in data engineering. A few hours spent optimizing a critical query can save thousands of hours of waiting time and significant compute costs over the life of a system.

Here is the simplest optimization workflow to follow every time:

  1. Measure first — Run EXPLAIN ANALYZE to understand what the engine is doing
  2. Check for full table scans — These are almost always fixable with indexes
  3. Make WHERE clauses sargable — Remove functions from indexed columns
  4. Select only what you need — Never use SELECT * in production
  5. Fix joins — Ensure join columns are indexed and filter before joining
  6. Replace correlated subqueries — Use CTEs or JOINs instead
  7. Measure again — Verify the improvement with another EXPLAIN ANALYZE

Start with indexes — they are the highest-impact, lowest-effort optimization available. Then work through sargable WHERE clauses, join optimization, and query restructuring. For the most expensive recurring queries, consider materialized views and partitioning.

Optimization is not a one-time task. As your data grows and query patterns evolve, revisit your execution plans regularly and keep your statistics up to date.

FAQs

What is the most effective SQL query optimization technique?

Proper indexing is typically the highest-impact optimization. Adding the right index on a filtered column can transform a 45-second full table scan into a 1-millisecond index seek. Always start with EXPLAIN to understand the execution plan before optimizing.

What is a sargable SQL query?

A sargable query is one where the WHERE clause conditions can be evaluated using an index. Wrapping indexed columns in functions (like YEAR(), UPPER(), or arithmetic operations) makes conditions non-sargable and forces full table scans.

When should I use a CTE vs a temporary table?

Use CTEs for readable, single-use intermediate results within one query. Use temporary tables when an intermediate result is expensive to compute and will be referenced multiple times; a temp table computes once, stores the result, and can have its own indexes.

How many indexes should I add to a large table?

There is no universal answer — it depends on your read/write ratio and query patterns. A read-heavy analytics table might have 10 indexes. A write-heavy transactional table might have only 2 or 3. Always monitor index usage and remove unused indexes.

What is partition pruning in SQL?

Partition pruning is when the query optimizer recognizes that a query’s WHERE clause only needs data from specific partitions of a partitioned table and skips scanning irrelevant partitions entirely. This can reduce I/O by 90% or more on time-partitioned tables when filtering by date.

How do I find slow queries in my database?

Use the slow query log in MySQL, pg_stat_statements in PostgreSQL, or Query Store in SQL Server to identify the most frequently executed and most resource-intensive queries. These are your highest-priority optimization targets.
