If you have ever wondered why some customers keep coming back month after month while others disappear after their first purchase — cohort analysis is the technique that answers that question with data.
Cohort analysis is one of the most powerful tools in any analyst’s toolkit. It reveals retention patterns, product improvement signals, and customer lifetime behavior that aggregate metrics completely hide. And unlike many advanced analytics techniques, it is genuinely achievable in both Excel and SQL without specialized tools or data science expertise.
In this guide, we will walk through exactly how to perform cohort analysis — from defining cohorts to building a complete retention matrix in both Excel and SQL, with real examples.
What Is Cohort Analysis?
Cohort analysis is a technique that groups users or customers by a shared characteristic — typically the time period when they first engaged and then tracks the behavior of each group over time.
The most common cohort is the acquisition cohort — grouping users by the month they first signed up, made their first purchase, or first used a product. You then track what percentage of each group returned in subsequent months.
Simple Analogy
Imagine you opened a restaurant in January. You want to know whether customers who came for the first time in January are more loyal than customers who first came in June.
Without cohort analysis, you just see total visitors each month — a number that mixes new and returning customers together.
With cohort analysis, you track the January cohort separately from the June cohort. You see that 40% of January customers returned in February, and only 25% of June customers returned in July. That difference tells you something important maybe your summer marketing attracted less loyal customers, or your winter menu drives better retention.
Why Aggregate Metrics Hide the Truth
Consider a product with these monthly active users:
January: 10,000
February: 11,000
March: 12,500
April: 13,000
This looks like healthy growth. But what if the breakdown is:
January cohort retained: 2,000 (20%)
February new users: 9,000
February cohort retained: 1,800 (16%)
March new users: 10,700
Each cohort is actually retaining worse than the one before. The growth is masking deteriorating retention — new user acquisition is outpacing the losses. Without cohort analysis, you would never see this until growth slows and the underlying problem becomes a crisis.
The Three Types of Cohorts
Acquisition Cohorts (Most Common)
Groups users by when they first joined, signed up, or made their first purchase.
Example: All users who signed up in January 2024 form the January cohort.
Behavioral Cohorts
Groups users by a specific action they took — not when they joined.
Example: All users who completed onboarding, all users who purchased a specific product, or all users who used a premium feature.
Size Cohorts
Groups users by a characteristic at the time of acquisition — company size, plan tier, geographic region.
Example: Enterprise customers vs SMB customers vs individual users — tracked separately to see if retention differs by segment.
This guide focuses on acquisition cohorts because they are the most universally applicable and the clearest starting point.
Setting Up the Dataset
We will use the same dataset for both the SQL and Excel approaches.
Table: user_orders
Columns: user_id, order_date, order_amount
sql
-- Sample data structure
CREATE TABLE user_orders (
user_id INT,
order_date DATE,
order_amount DECIMAL(10, 2)
);
-- Sample records (abbreviated)
INSERT INTO user_orders VALUES
(1001, '2024-01-05', 85.00),
(1001, '2024-02-18', 120.00),
(1001, '2024-04-03', 95.00),
(1002, '2024-01-12', 200.00),
(1002, '2024-01-28', 75.00),
(1003, '2024-02-07', 150.00),
(1003, '2024-03-15', 88.00),
(1003, '2024-04-22', 210.00),
(1004, '2024-02-14', 95.00),
(1005, '2024-03-01', 175.00),
(1005, '2024-04-10', 90.00),
(1006, '2024-01-20', 310.00),
-- ... (thousands more rows in practice)
Cohort Analysis in SQL — Step by Step
Step 1: Find Each User’s First Order Date (Cohort Assignment)
The first step is identifying when each user made their very first purchase. This defines which cohort they belong to.
sql
-- Step 1: Get the first order date for each user
WITH user_cohorts AS (
SELECT
user_id,
MIN(order_date) AS first_order_date,
DATE_TRUNC('month', MIN(order_date)) AS cohort_month
FROM user_orders
GROUP BY user_id
)
SELECT * FROM user_cohorts LIMIT 10;
Output (sample):
| user_id | first_order_date | cohort_month |
|---|---|---|
| 1001 | 2024-01-05 | 2024-01-01 |
| 1002 | 2024-01-12 | 2024-01-01 |
| 1006 | 2024-01-20 | 2024-01-01 |
| 1003 | 2024-02-07 | 2024-02-01 |
| 1004 | 2024-02-14 | 2024-02-01 |
| 1005 | 2024-03-01 | 2024-03-01 |
Users 1001, 1002, and 1006 all joined in January — they form the January 2024 cohort.
Step 2: Calculate Months Since First Order
For each order, calculate how many months have passed since the user’s first order. This is the “cohort age” — Month 0 is the acquisition month, Month 1 is the following month, and so on.
sql
-- Step 2: Join orders with cohort assignments and calculate age
WITH user_cohorts AS (
SELECT
user_id,
DATE_TRUNC('month', MIN(order_date)) AS cohort_month
FROM user_orders
GROUP BY user_id
),
orders_with_cohort AS (
SELECT
o.user_id,
uc.cohort_month,
DATE_TRUNC('month', o.order_date) AS order_month,
-- Calculate months between order and cohort start
DATEDIFF('month', uc.cohort_month,
DATE_TRUNC('month', o.order_date)) AS months_since_first
FROM user_orders o
JOIN user_cohorts uc ON o.user_id = uc.user_id
)
SELECT * FROM orders_with_cohort
ORDER BY cohort_month, user_id, months_since_first
LIMIT 15;
Output (sample):
| user_id | cohort_month | order_month | months_since_first |
|---|---|---|---|
| 1001 | 2024-01-01 | 2024-01-01 | 0 |
| 1001 | 2024-01-01 | 2024-02-01 | 1 |
| 1001 | 2024-01-01 | 2024-04-01 | 3 |
| 1002 | 2024-01-01 | 2024-01-01 | 0 |
| 1002 | 2024-01-01 | 2024-01-01 | 0 |
| 1003 | 2024-02-01 | 2024-02-01 | 0 |
| 1003 | 2024-02-01 | 2024-03-01 | 1 |
| 1003 | 2024-02-01 | 2024-04-01 | 2 |
User 1001 (January cohort) placed orders at Month 0, Month 1, and Month 3 — skipping Month 2.
Note on DATEDIFF syntax: The syntax varies by database. Use the appropriate version for your platform:
sql
-- PostgreSQL
DATE_PART('month', order_month) - DATE_PART('month', cohort_month)
+ 12 * (DATE_PART('year', order_month) - DATE_PART('year', cohort_month))
-- MySQL
TIMESTAMPDIFF(MONTH, cohort_month, order_month)
-- BigQuery / Snowflake
DATE_DIFF(order_month, cohort_month, MONTH)
-- SQL Server
DATEDIFF(MONTH, cohort_month, order_month)
Step 3: Count Active Users Per Cohort and Month
Count distinct users in each cohort who made at least one order in each month period.
sql
-- Step 3: Count distinct active users per cohort per month
WITH user_cohorts AS (
SELECT
user_id,
DATE_TRUNC('month', MIN(order_date)) AS cohort_month
FROM user_orders
GROUP BY user_id
),
orders_with_cohort AS (
SELECT
o.user_id,
uc.cohort_month,
DATEDIFF('month', uc.cohort_month,
DATE_TRUNC('month', o.order_date)) AS months_since_first
FROM user_orders o
JOIN user_cohorts uc ON o.user_id = uc.user_id
),
cohort_activity AS (
SELECT
cohort_month,
months_since_first,
COUNT(DISTINCT user_id) AS active_users
FROM orders_with_cohort
GROUP BY cohort_month, months_since_first
)
SELECT * FROM cohort_activity
ORDER BY cohort_month, months_since_first;
Output (sample):
| cohort_month | months_since_first | active_users |
|---|---|---|
| 2024-01-01 | 0 | 850 |
| 2024-01-01 | 1 | 340 |
| 2024-01-01 | 2 | 255 |
| 2024-01-01 | 3 | 212 |
| 2024-02-01 | 0 | 720 |
| 2024-02-01 | 1 | 252 |
| 2024-02-01 | 2 | 180 |
| 2024-03-01 | 0 | 650 |
| 2024-03-01 | 1 | 195 |
Step 4: Get Cohort Size (Month 0 Users)
The cohort size is the number of users active at Month 0 — the total users in that cohort.
sql
-- Step 4: Add cohort size for percentage calculation
WITH user_cohorts AS (
SELECT
user_id,
DATE_TRUNC('month', MIN(order_date)) AS cohort_month
FROM user_orders
GROUP BY user_id
),
orders_with_cohort AS (
SELECT
o.user_id,
uc.cohort_month,
DATEDIFF('month', uc.cohort_month,
DATE_TRUNC('month', o.order_date)) AS months_since_first
FROM user_orders o
JOIN user_cohorts uc ON o.user_id = uc.user_id
),
cohort_activity AS (
SELECT
cohort_month,
months_since_first,
COUNT(DISTINCT user_id) AS active_users
FROM orders_with_cohort
GROUP BY cohort_month, months_since_first
),
cohort_sizes AS (
SELECT
cohort_month,
active_users AS cohort_size
FROM cohort_activity
WHERE months_since_first = 0
)
SELECT
ca.cohort_month,
ca.months_since_first,
ca.active_users,
cs.cohort_size,
ROUND(ca.active_users * 100.0 / cs.cohort_size, 1) AS retention_rate
FROM cohort_activity ca
JOIN cohort_sizes cs ON ca.cohort_month = cs.cohort_month
ORDER BY ca.cohort_month, ca.months_since_first;
Output:
| cohort_month | months_since_first | active_users | cohort_size | retention_rate |
|---|---|---|---|---|
| 2024-01-01 | 0 | 850 | 850 | 100.0 |
| 2024-01-01 | 1 | 340 | 850 | 40.0 |
| 2024-01-01 | 2 | 255 | 850 | 30.0 |
| 2024-01-01 | 3 | 212 | 850 | 24.9 |
| 2024-02-01 | 0 | 720 | 720 | 100.0 |
| 2024-02-01 | 1 | 252 | 720 | 35.0 |
| 2024-02-01 | 2 | 180 | 720 | 25.0 |
| 2024-03-01 | 0 | 650 | 650 | 100.0 |
| 2024-03-01 | 1 | 195 | 650 | 30.0 |
Every cohort starts at 100% at Month 0 by definition. Retention is tracked as a percentage of that starting cohort size.
Step 5: Pivot Into a Retention Matrix
The final step pivots the long-format result into the classic cohort heatmap matrix — cohorts as rows, months as columns.
sql
-- Step 5: Pivot to retention matrix
-- (Using conditional aggregation — works in most SQL dialects)
WITH user_cohorts AS (
SELECT user_id,
DATE_TRUNC('month', MIN(order_date)) AS cohort_month
FROM user_orders GROUP BY user_id
),
orders_with_cohort AS (
SELECT o.user_id, uc.cohort_month,
DATEDIFF('month', uc.cohort_month,
DATE_TRUNC('month', o.order_date)) AS months_since_first
FROM user_orders o JOIN user_cohorts uc ON o.user_id = uc.user_id
),
cohort_activity AS (
SELECT cohort_month, months_since_first,
COUNT(DISTINCT user_id) AS active_users
FROM orders_with_cohort GROUP BY cohort_month, months_since_first
),
cohort_sizes AS (
SELECT cohort_month, active_users AS cohort_size
FROM cohort_activity WHERE months_since_first = 0
),
retention AS (
SELECT ca.cohort_month,
ca.months_since_first,
ROUND(ca.active_users * 100.0 / cs.cohort_size, 1) AS retention_rate
FROM cohort_activity ca
JOIN cohort_sizes cs ON ca.cohort_month = cs.cohort_month
)
SELECT
cohort_month,
MAX(CASE WHEN months_since_first = 0 THEN retention_rate END) AS month_0,
MAX(CASE WHEN months_since_first = 1 THEN retention_rate END) AS month_1,
MAX(CASE WHEN months_since_first = 2 THEN retention_rate END) AS month_2,
MAX(CASE WHEN months_since_first = 3 THEN retention_rate END) AS month_3,
MAX(CASE WHEN months_since_first = 4 THEN retention_rate END) AS month_4,
MAX(CASE WHEN months_since_first = 5 THEN retention_rate END) AS month_5
FROM retention
GROUP BY cohort_month
ORDER BY cohort_month;
Result — Retention Matrix:
| cohort_month | month_0 | month_1 | month_2 | month_3 | month_4 | month_5 |
|---|---|---|---|---|---|---|
| 2024-01-01 | 100.0 | 40.0 | 30.0 | 24.9 | 19.5 | 15.2 |
| 2024-02-01 | 100.0 | 35.0 | 25.0 | 18.8 | 14.2 | NULL |
| 2024-03-01 | 100.0 | 30.0 | 21.5 | 15.9 | NULL | NULL |
| 2024-04-01 | 100.0 | 28.0 | 18.2 | NULL | NULL | NULL |
NULL appears where data does not yet exist — a March cohort cannot have Month 4 data if it is currently July. This is expected and correct.
Cohort Analysis in Excel — Step by Step
Step 1: Prepare Your Data
In Excel, start with a clean table:
| user_id | order_date | order_amount |
|---|---|---|
| 1001 | 05/01/2024 | 85.00 |
| 1001 | 18/02/2024 | 120.00 |
| … | … | … |
Ensure order_date is formatted as a date — not text. Select the column → Format Cells → Date.
Step 2: Calculate First Order Date Per User
In a new column or separate summary table, find the first order date for each user.
If your data is in columns A (user_id), B (order_date), C (amount):
In a separate column D, add a header “First Order Date” and use:
=MINIFS($B$2:$B$1000, $A$2:$A$1000, A2)
This finds the minimum date in column B where column A matches the current user ID.
Step 3: Extract Cohort Month
In column E, extract just the month and year from the first order date:
=DATE(YEAR(D2), MONTH(D2), 1)
Format this column as date with format “MMM-YYYY” for readability.
Step 4: Calculate Order Month and Months Since First
In column F, extract the month of each order:
=DATE(YEAR(B2), MONTH(B2), 1)
In column G, calculate months since first order:
=DATEDIF(E2, F2, "M")
Or using YEARFRAC:
=(YEAR(F2)-YEAR(E2))*12 + MONTH(F2) - MONTH(E2)
Step 5: Build the Summary Table Using COUNTIFS
Create a summary table with cohort months as rows and months since first (0, 1, 2, 3…) as columns.
Summary table structure (starting at cell I1):
| Month 0 | Month 1 | Month 2 | Month 3 | |
|---|---|---|---|---|
| Jan-2024 | ||||
| Feb-2024 | ||||
| Mar-2024 |
In cell J2 (January cohort, Month 0 — active users):
=COUNTIFS($E$2:$E$1000, $I2, $G$2:$G$1000, J$1_value)
Where $I2 is the cohort month date and J$1_value is the month number (0 for Month 0).
More precisely, using numeric month values in row 1 (0, 1, 2, 3…):
=COUNTIFS($E$2:$E$1000, $I2, $G$2:$G$1000, J$1)
Copy this formula across all cells in the table.
Step 6: Calculate Retention Percentages
Create a parallel table showing percentages. If your count table is in columns I through N, create a percentage table in columns P through U.
In the percentage table, cell Q2 (January cohort, Month 1 retention):
=IF(J2=0, "", ROUND(K2/J2*100, 1))
Where J2 is the Month 0 count (cohort size) and K2 is the Month 1 count.
For Month 0, always show 100%:
=IF(J2=0, "", 100)
Step 7: Apply Conditional Formatting — The Heatmap
The retention matrix becomes dramatically more readable with a color heatmap — green for high retention, red for low.
- Select all percentage cells in your retention table
- Home tab → Conditional Formatting → Color Scales
- Select Green-Yellow-Red (or Red-White-Green depending on your preference)
- Power BI will automatically apply gradient colors based on values
Alternatively, use custom conditional formatting:
- Home → Conditional Formatting → New Rule → Format cells based on their values
- Set minimum (0%) as red, midpoint (30%) as yellow, maximum (100%) as green
The resulting heatmap makes patterns immediately visible — healthy cohorts appear as green rows that fade slowly. Problematic cohorts appear as rows that turn red quickly.
Step 8: Add a Pivot Table for Flexibility
For more dynamic cohort analysis in Excel, use a Pivot Table.
- Select your full data table
- Insert → Pivot Table
- Row: Cohort Month (column E)
- Column: Months Since First (column G)
- Values: Count of user_id (Distinct Count — requires Power Pivot)
For distinct count:
- Right-click on the value field → Value Field Settings → Summarize Values By → Distinct Count
This requires the Data Model (Power Pivot). Enable it:
- When creating the pivot, check “Add this data to the Data Model”
- Then distinct count becomes available
Reading and Interpreting the Retention Matrix
What to Look For
Down each column — Compares cohorts at the same age. Month 1 retention for January vs February vs March. If later cohorts show higher Month 1 retention, your product is improving. If it is declining, something is getting worse.
Across each row — Shows how retention decays for one cohort over time. A flat or slow-declining row indicates strong retention. A steep drop-off indicates a problem with keeping users engaged.
The diagonal — Each diagonal position represents the same calendar month but different cohort ages. The diagonal helps you see whether absolute activity levels are growing.
Patterns and What They Mean
Healthy retention pattern:
Month 0: 100%
Month 1: 40% → Large initial drop is normal
Month 2: 35% → Slows down — retained users are loyal
Month 3: 33% → Stabilizes — this is your "true" retained base
Month 4: 32%
Month 5: 31%
The curve flattens after the initial drop — users who stay past Month 2 tend to stay long-term.
Worsening retention across cohorts:
Jan cohort Month 1: 40%
Feb cohort Month 1: 35%
Mar cohort Month 1: 28%
Apr cohort Month 1: 22%
Month 1 retention is declining with each cohort — a signal that recent acquisition is bringing in lower-quality users, or a product change is reducing early engagement.
Step-change improvement:
Jan cohort Month 1: 25%
Feb cohort Month 1: 26%
Mar cohort Month 1: 38% ← Something changed
Apr cohort Month 1: 40%
A sudden improvement in March suggests a product update, onboarding change, or acquisition channel shift that took effect around that time.
Real-World Use Cases
SaaS Product — Feature Adoption Cohort
A SaaS company tracks cohorts by signup month and measures weekly active usage. They notice that cohorts from months when they introduced guided onboarding show 15 percentage points higher Month 2 retention than cohorts from before onboarding. This becomes the business case for investing further in onboarding improvement.
E-commerce — First Purchase Cohort
An e-commerce platform groups customers by their first purchase month and tracks repeat purchase rates. January cohort customers (acquired during holiday sales) show poor long-term retention — they were deal-seekers who never returned at full price. The marketing team shifts acquisition budget away from discount-heavy campaigns toward loyalty-building channels.
Mobile App — Cohort by Acquisition Channel
A mobile app splits its cohort analysis by acquisition channel — organic, paid social, app store search. Organic cohorts show 3x better Month 3 retention than paid social cohorts. The growth team reallocates budget from paid social toward content and SEO that drives organic discovery.
Common Mistakes to Avoid
- Using calendar month active users instead of cohort-based counting — Aggregate monthly active users mixes cohorts together and hides the patterns that make cohort analysis valuable. Always track users relative to their acquisition date, not the calendar date
- Defining cohort membership too broadly — A cohort should have a clear, single definition — first purchase month, first login month. Mixing different definitions of “first engagement” across users corrupts the analysis
- Ignoring the diagonal when comparing cohorts — When comparing Month 1 retention across cohorts, you are comparing activity in different calendar months. External factors (seasonality, market events, platform changes) affect all cohorts simultaneously. Factor this in before attributing changes to cohort-specific factors
- Not waiting long enough before concluding — A March cohort analyzed in April only has Month 0 and Month 1 data. Wait at least 3–6 months before drawing conclusions about long-term retention for any cohort
- Treating NULL as zero in the retention matrix — NULL means data does not yet exist — the future. Zero means zero users were retained. These are completely different. Never fill NULL cells with 0 in your retention matrix
- Analyzing too many cohorts simultaneously — With two years of monthly data you have 24 cohorts. Trying to find patterns across all 24 simultaneously is overwhelming. Focus on recent cohorts for improvement signals and older cohorts for long-term behavior benchmarks
Cohort analysis is one of the most revealing analytical techniques available and now you have both the SQL and Excel implementations to apply it immediately.
Here is a quick recap of the complete process:
- Assign each user to a cohort based on their first order or signup month
- Calculate how many months have passed between each activity and their first activity
- Count distinct active users per cohort per month period
- Divide by cohort size to get retention percentages
- Pivot to create the retention matrix
- Apply conditional formatting to create a heatmap for easy pattern recognition
- Read down columns to compare cohorts at the same age, across rows to see individual cohort decay
Start with three to six months of data and your most important engagement event — first purchase, first login, first feature use. Build the matrix, apply the heatmap, and look for the patterns. They are almost always there — and they almost always point directly to where the most impactful product and marketing improvements can be made.
FAQs
What is cohort analysis?
Cohort analysis groups users by a shared characteristic — typically the time period of their first engagement and tracks the behavior of each group over time. It reveals retention patterns and product improvement signals that aggregate metrics completely hide.
What is the difference between cohort analysis and retention analysis?
Retention analysis measures what percentage of users return over time. Cohort analysis is the method used to perform retention analysis accurately — by grouping users into cohorts so you compare users at the same stage of their lifecycle rather than mixing different groups together.
How do I perform cohort analysis in SQL?
Use four steps: find each user’s first order date (MIN with GROUP BY), calculate months since first order (DATEDIFF), count distinct active users per cohort per month (COUNT DISTINCT with GROUP BY), and calculate retention percentages by dividing by cohort size. Pivot using CASE WHEN to create the matrix.
How do I perform cohort analysis in Excel?
Use MINIFS to find first order dates, calculate months since first using DATEDIF or date arithmetic, then use COUNTIFS to count active users per cohort per month period. Create a retention percentage table and apply color scale conditional formatting for the heatmap.
What is a good retention rate for cohort analysis?
It depends entirely on your industry and business model. SaaS products typically aim for Month 1 retention above 30–40%. E-commerce businesses vary widely — 20–30% Month 1 repeat purchase rate is often considered healthy. Always benchmark against your own historical cohorts and industry comparables.
How many months of data do I need for cohort analysis?
You need at least three to six months of data to see meaningful patterns. With only one or two months, you can only compare Month 0 and Month 1 behavior — not enough to see where retention stabilizes or how cohorts compare at the same age.