How to Create Synthetic Data for Machine Learning

Real data is messy, expensive, scarce, and sometimes legally untouchable. The customer records that would make your fraud detection model genuinely good contain PII that cannot leave your organization’s secure environment. The rare disease cases that would improve your diagnostic classifier appear maybe a dozen times in years of records. The edge cases that break production models are, by definition, underrepresented in historical data. And the labeled dataset you need to train a supervised model costs more in annotation time than the initial project budget allowed.

Synthetic data is how the machine learning community has learned to work around all of these constraints simultaneously. Instead of waiting for real data to accumulate or negotiating access to restricted datasets, you generate data that has the statistical properties you need. When the generation is done well, models trained on synthetic data generalize effectively to real data. When it is done poorly, they fit the synthetic distribution and fail on real inputs in confusing ways.

This guide covers the major approaches to synthetic data generation, when each one is appropriate, how to implement them in Python, and how to validate that the synthetic data you have generated is actually fit for purpose.

When Synthetic Data Makes Sense

Synthetic data is worth generating when at least one of four conditions applies.

Data scarcity. You have too few real examples to train a reliable model. Class imbalance is the most common version of this: ninety-nine percent of transactions are legitimate and one percent are fraudulent, so the model sees almost no examples of what it is supposed to detect.

Privacy constraints. The data you need contains personally identifiable information, protected health information, or other sensitive attributes that cannot be shared, moved, or used without restrictions that make model development impractical.

Annotation cost. Labeling real data requires expensive human expertise. Medical imaging, legal document classification, and specialized manufacturing defect detection all involve labeling costs that quickly exceed project budgets. Generating pre-labeled synthetic data bypasses the bottleneck.

Edge case coverage. Real data reflects historical reality. If your system has never encountered a specific failure mode, your training data contains no examples of it. Synthetic data lets you deliberately generate examples of scenarios you want the model to handle correctly.

Method 1: Statistical Methods for Tabular Data

For structured tabular data, statistical approaches are often the fastest path to useful synthetic data and require no deep learning infrastructure.

Gaussian Copula

A Gaussian copula captures the correlation structure between columns in your real dataset and uses it to generate new rows that preserve those correlations. It models each column’s marginal distribution separately and then uses the copula to model how columns co-vary.

python

from sdv.single_table import GaussianCopulaSynthesizer
from sdv.metadata import SingleTableMetadata
import pandas as pd

# Load real data
real_data = pd.read_csv("customer_transactions.csv")

# Define metadata
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)

# Fit the synthesizer
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)

# Generate synthetic data
synthetic_data = synthesizer.sample(num_rows=10000)

print(f"Real data shape: {real_data.shape}")
print(f"Synthetic data shape: {synthetic_data.shape}")
print("\nReal data statistics:")
print(real_data.describe())
print("\nSynthetic data statistics:")
print(synthetic_data.describe())

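The SDV (Synthetic Data Vault) library detects the metadata automatically, inferring each column’s type from the dataframe without manual specification. Review the detected metadata before fitting, because automatic detection can misclassify columns such as numeric IDs or dates stored as strings. Newer SDV releases also provide a unified Metadata class, so check the documentation for the version you have installed.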
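To make the mechanics concrete, here is a minimal from-scratch sketch of the Gaussian copula idea using only NumPy and SciPy. It is not SDV's implementation: the toy data, the function names, and the use of empirical quantiles for the marginals are all assumptions made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Toy "real" data (hypothetical): two correlated columns with non-normal marginals.
n = 5000
z = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=n)
real = np.column_stack([np.exp(z[:, 0]), stats.norm.cdf(z[:, 1]) ** 2])

def fit_gaussian_copula(data):
    """Estimate the copula correlation matrix from rank-transformed columns."""
    n_rows = data.shape[0]
    # Ranks -> uniforms strictly inside (0, 1) -> standard normal scores.
    u = stats.rankdata(data, axis=0) / (n_rows + 1)
    normal_scores = stats.norm.ppf(u)
    return np.corrcoef(normal_scores, rowvar=False)

def sample_gaussian_copula(data, corr, n_samples, rng):
    """Draw correlated normals, then map each column through its empirical quantiles."""
    latent = rng.multivariate_normal(np.zeros(corr.shape[0]), corr, size=n_samples)
    u = stats.norm.cdf(latent)
    columns = [np.quantile(data[:, j], u[:, j]) for j in range(data.shape[1])]
    return np.column_stack(columns)

corr = fit_gaussian_copula(real)
synthetic = sample_gaussian_copula(real, corr, n_samples=5000, rng=rng)

real_rho = stats.spearmanr(real[:, 0], real[:, 1])[0]
synth_rho = stats.spearmanr(synthetic[:, 0], synthetic[:, 1])[0]
print(f"Spearman correlation, real: {real_rho:.3f}, synthetic: {synth_rho:.3f}")
```

The marginals are handled column by column and only the dependence structure passes through the multivariate normal, which is why the rank correlation survives even though neither column is normally distributed.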

CTGAN for Complex Distributions

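A Gaussian copula assumes that, once each column has been transformed, the dependence between columns can be described by a multivariate normal distribution, and its parametric marginals can fit awkward columns poorly. Real tabular data often has multimodal distributions, heavy tails, and complex non-linear relationships between columns. CTGAN uses a conditional GAN architecture designed specifically for tabular data and often handles these complexities more effectively.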

python

from sdv.single_table import CTGANSynthesizer

synthesizer = CTGANSynthesizer(
    metadata,
    epochs=300,
    batch_size=500,
    verbose=True
)

synthesizer.fit(real_data)
synthetic_data = synthesizer.sample(num_rows=len(real_data))

# Evaluate quality
from sdv.evaluation.single_table import evaluate_quality

quality_report = evaluate_quality(
    real_data=real_data,
    synthetic_data=synthetic_data,
    metadata=metadata
)
print(quality_report.get_score())

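CTGAN training takes longer than Gaussian copula fitting and generally needs more rows to train well, but it can produce better results on complex distributions. For datasets with more than a few dozen columns, multimodal columns, or strongly non-linear dependencies, the quality improvement is often worth the additional training time. Compare the quality scores of both synthesizers on your own data before committing to one.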

Method 2: SMOTE for Class Imbalance

When the specific problem is class imbalance rather than general data scarcity, SMOTE (Synthetic Minority Oversampling Technique) is a targeted and widely validated approach. SMOTE generates new minority class examples by interpolating between existing ones in feature space rather than sampling from a learned distribution.

python

from imblearn.over_sampling import SMOTE, ADASYN, BorderlineSMOTE
from sklearn.datasets import make_classification
from collections import Counter
import numpy as np

# Simulate imbalanced dataset
X, y = make_classification(
    n_samples=10000,
    n_features=20,
    weights=[0.97, 0.03],  # 97% majority, 3% minority
    random_state=42
)

print(f"Original class distribution: {Counter(y)}")

# Apply SMOTE
smote = SMOTE(
    sampling_strategy=0.3,  # Minority becomes 30% of majority count
    k_neighbors=5,
    random_state=42
)
X_resampled, y_resampled = smote.fit_resample(X, y)
print(f"After SMOTE: {Counter(y_resampled)}")

# BorderlineSMOTE focuses on samples near the decision boundary
borderline_smote = BorderlineSMOTE(
    sampling_strategy=0.3,
    random_state=42
)
X_borderline, y_borderline = borderline_smote.fit_resample(X, y)
print(f"After BorderlineSMOTE: {Counter(y_borderline)}")

# ADASYN adapts the number of synthetic samples per region
adasyn = ADASYN(
    sampling_strategy=0.3,
    random_state=42
)
X_adasyn, y_adasyn = adasyn.fit_resample(X, y)
print(f"After ADASYN: {Counter(y_adasyn)}")

BorderlineSMOTE generates synthetic examples specifically near the decision boundary, where the classifier needs the most help. ADASYN generates more synthetic samples around minority examples that are harder to learn, meaning those with many majority-class neighbors. Both can improve on vanilla SMOTE for datasets where the minority class is not uniformly distributed, although neither is guaranteed to, so compare them on a held-out real test set.
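The interpolation step that all three variants share can be sketched in a few lines. This is a simplified illustration rather than imbalanced-learn's implementation; the function name `smote_like_samples` and the toy minority points are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Create points on line segments between minority samples and their neighbors."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbors because each point is its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neighbor_idx = nn.kneighbors(X_minority)

    base = rng.integers(0, len(X_minority), size=n_new)            # starting points
    chosen = neighbor_idx[base, rng.integers(1, k + 1, size=n_new)]  # one neighbor each
    gap = rng.random((n_new, 1))                                   # position along the segment
    return X_minority[base] + gap * (X_minority[chosen] - X_minority[base])

rng = np.random.default_rng(1)
X_min = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(30, 2))  # hypothetical minority class
X_new = smote_like_samples(X_min, n_new=100)

print(X_new.shape)
print("Inside the minority bounding box:",
      bool(np.all(X_new >= X_min.min(axis=0) - 1e-9) and
           np.all(X_new <= X_min.max(axis=0) + 1e-9)))
```

Every generated point lies between two existing minority points, so the method fills in the region the minority class already occupies and never reaches beyond it.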

A critical warning: apply SMOTE only to your training set, never to validation or test sets. Applying it to validation or test data contaminates your evaluation with synthetic examples and produces optimistically biased performance metrics that do not reflect real-world performance.

python

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Split first, then resample only training data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Apply SMOTE to training data only
smote = SMOTE(sampling_strategy=0.3, random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# Train and evaluate
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train_resampled, y_train_resampled)

# Evaluate on real test data
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

Method 3: LLM-Based Synthetic Data Generation

Large language models have become one of the most practical tools for generating synthetic text data, particularly for creating labeled training examples for NLP tasks. The approach is straightforward: prompt the LLM to generate examples that match the format and distribution of your real data.

python

from anthropic import Anthropic
import json
import pandas as pd

client = Anthropic()

def generate_synthetic_examples(
    task_description: str,
    label_schema: dict,
    real_examples: list,
    n_examples: int = 50
) -> list:

    few_shot = "\n".join([
        f"Text: {ex['text']}\nLabel: {ex['label']}"
        for ex in real_examples[:5]
    ])

    prompt = f"""You are generating synthetic training data for a machine learning classifier.

Task: {task_description}

Labels and their meanings:
{json.dumps(label_schema, indent=2)}

Here are real examples to match in style and distribution:
{few_shot}

Generate {n_examples} diverse synthetic examples. Return a JSON array where each
element has 'text' and 'label' fields. Ensure realistic class distribution,
varied vocabulary, different sentence structures, and edge cases.

Return only the JSON array with no other text."""

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4000,
        messages=[{"role": "user", "content": prompt}]
    )

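    # Assumes the reply is a bare JSON array; json.loads raises
    # json.JSONDecodeError if the model adds other text or the reply is truncated
    return json.loads(response.content[0].text)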

# Example: customer support ticket classification
label_schema = {
    "billing": "Questions about charges, invoices, or payment issues",
    "technical": "Product bugs, errors, or functionality problems",
    "shipping": "Delivery status, tracking, or logistics questions",
    "returns": "Refund requests or return process questions"
}

real_examples = [
    {"text": "My card was charged twice for the same order", "label": "billing"},
    {"text": "The app crashes whenever I try to upload a photo", "label": "technical"},
    {"text": "Where is my package? It was supposed to arrive yesterday", "label": "shipping"},
    {"text": "How do I return an item that does not fit?", "label": "returns"}
]

synthetic_examples = generate_synthetic_examples(
    task_description="Classify customer support tickets by category",
    label_schema=label_schema,
    real_examples=real_examples,
    n_examples=100
)

synthetic_df = pd.DataFrame(synthetic_examples)
print(synthetic_df["label"].value_counts())
print(synthetic_df.head(10))

LLM-based generation is particularly effective for text classification, named entity recognition, and question-answering datasets where the task is well-defined and describable. The key to quality is providing real examples as few-shot demonstrations, specifying the class distribution you want, and asking for diversity in vocabulary and structure.

For structured tabular data, LLMs are less reliable than statistical methods because they tend to produce round numbers, common values, and unrealistic correlations between columns. Use statistical methods for tabular data and LLMs for text.
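Whatever the model returns for a text task, it is worth filtering before it reaches a training set. The sketch below is one possible post-processing step, not part of any SDK: the function name `clean_generated_examples` and the sample reply are hypothetical. It tolerates a reply wrapped in a Markdown code fence, drops items with unknown labels or empty text, and removes case-insensitive duplicates.

```python
import json

FENCE = "`" * 3  # Markdown code fence marker

def clean_generated_examples(raw_text, allowed_labels):
    """Parse an LLM reply and keep only well-formed, unique, correctly labeled items."""
    text = raw_text.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line and everything from the closing fence on.
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    try:
        items = json.loads(text)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []

    seen, cleaned = set(), []
    for item in items:
        if not isinstance(item, dict):
            continue
        body = str(item.get("text", "")).strip()
        label = item.get("label")
        if not body or not isinstance(label, str) or label not in allowed_labels:
            continue
        key = body.lower()
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"text": body, "label": label})
    return cleaned

# Hypothetical reply containing a duplicate, an unknown label, and an empty text.
raw = (
    FENCE + "json\n"
    '[{"text": "I was billed twice", "label": "billing"},'
    ' {"text": "i was billed twice", "label": "billing"},'
    ' {"text": "Where is my parcel?", "label": "delivery"},'
    ' {"text": "", "label": "returns"}]\n' + FENCE
)
allowed = {"billing", "technical", "shipping", "returns"}
cleaned = clean_generated_examples(raw, allowed)
print(cleaned)
```

Exact-match deduplication is a deliberately simple choice here. Near-duplicate detection would catch more repetition but needs a similarity measure suited to your data.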

Method 4: Rule-Based Generation With Faker

When you need synthetic data that looks realistic at the individual field level without needing to preserve statistical relationships between fields, rule-based generation with libraries like Faker produces clean, readable synthetic records quickly.

python

from faker import Faker
import pandas as pd
import numpy as np
import random

fake = Faker()
# Seed all three random sources so the dataset is reproducible
Faker.seed(42)
np.random.seed(42)
random.seed(42)

def generate_customer_record():
    signup_date = fake.date_between(start_date="-3y", end_date="today")
    country = fake.country_code()

    # Realistic correlations through conditional logic
    english_market_currencies = {"US": "USD", "CA": "CAD", "GB": "GBP", "AU": "AUD"}
    if country in english_market_currencies:
        currency = english_market_currencies[country]
        avg_order = np.random.lognormal(mean=4.5, sigma=0.8)
    else:
        # Simplification: every other country is billed in EUR
        currency = "EUR"
        avg_order = np.random.lognormal(mean=4.0, sigma=0.7)

    num_orders = np.random.poisson(lam=8)
    is_churned = 1 if (
        num_orders < 2 or
        avg_order < 20 or
        random.random() < 0.15
    ) else 0

    return {
        "customer_id": fake.uuid4(),
        "email": fake.email(),
        "first_name": fake.first_name(),
        "last_name": fake.last_name(),
        "country": country,
        "currency": currency,
        "signup_date": signup_date,
        "num_orders": num_orders,
        "avg_order_value": round(avg_order, 2),
        "total_spent": round(avg_order * num_orders, 2),
        "is_churned": is_churned
    }

# Generate dataset
records = [generate_customer_record() for _ in range(5000)]
synthetic_customers = pd.DataFrame(records)

print(synthetic_customers.head())
print(f"\nChurn rate: {synthetic_customers['is_churned'].mean():.2%}")
print(f"\nOrders by country:\n{synthetic_customers.groupby('country')['num_orders'].mean()}")

The critical technique in rule-based generation is encoding realistic conditional relationships through explicit logic rather than letting fields be generated independently. If you generate country and currency independently, you will get US customers paying in Yen. Encoding the dependency explicitly produces data that passes the common-sense checks that pure statistical approaches sometimes fail.
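The difference between independent and dependent generation is easy to measure. The sketch below uses only the standard library; the small `COUNTRY_CURRENCY` table and the helper names are hypothetical and stand in for whatever reference data your domain provides.

```python
import random

# Hypothetical reference table encoding the country -> currency dependency.
COUNTRY_CURRENCY = {
    "US": "USD", "CA": "CAD", "GB": "GBP", "AU": "AUD",
    "JP": "JPY", "DE": "EUR", "FR": "EUR",
}
COUNTRIES = list(COUNTRY_CURRENCY)
CURRENCIES = sorted(set(COUNTRY_CURRENCY.values()))

def independent_record(rng):
    """Fields drawn separately: nothing ties the currency to the country."""
    return {"country": rng.choice(COUNTRIES), "currency": rng.choice(CURRENCIES)}

def dependent_record(rng):
    """Currency is looked up from the country, so the pair is always consistent."""
    country = rng.choice(COUNTRIES)
    return {"country": country, "currency": COUNTRY_CURRENCY[country]}

def mismatch_rate(records):
    """Fraction of records whose currency contradicts the reference table."""
    bad = sum(r["currency"] != COUNTRY_CURRENCY[r["country"]] for r in records)
    return bad / len(records)

rng = random.Random(42)
independent = [independent_record(rng) for _ in range(1000)]
dependent = [dependent_record(rng) for _ in range(1000)]

print(f"Mismatch rate, independent fields: {mismatch_rate(independent):.1%}")
print(f"Mismatch rate, dependent fields:   {mismatch_rate(dependent):.1%}")
```

A lookup table scales better than nested conditionals once more than a handful of values are involved, and the same table doubles as a validation rule.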

Validating Synthetic Data

Generating synthetic data is half the work. Validating that it is actually useful for your machine learning task is the other half and is frequently skipped with costly consequences.

Statistical validation

python

from scipy import stats
import pandas as pd

def validate_distributions(real_df, synthetic_df, columns):
    results = {}
    for col in columns:
        # Covers every numeric dtype (int32, float32, ...), not only int64 and float64
        if pd.api.types.is_numeric_dtype(real_df[col]):
            stat, p_value = stats.ks_2samp(
                real_df[col].dropna(),
                synthetic_df[col].dropna()
            )
            results[col] = {
                "ks_statistic": round(stat, 4),
                "p_value": round(p_value, 4),
                "distributions_similar": p_value > 0.05,
                "real_mean": round(real_df[col].mean(), 4),
                "synthetic_mean": round(synthetic_df[col].mean(), 4)
            }
    return pd.DataFrame(results).T

numeric_cols = ["num_orders", "avg_order_value", "total_spent"]
validation_results = validate_distributions(
    real_data[numeric_cols],
    synthetic_data[numeric_cols],
    numeric_cols
)
print(validation_results)

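The two-sample Kolmogorov-Smirnov test compares the distribution of each numeric column between real and synthetic data. A small KS statistic means the two empirical distributions are close. A p-value above 0.05 means the test found no significant difference, which is weaker than proof that the distributions match. With large samples, even negligible differences produce very small p-values, so read the statistic alongside the p-value. The test also examines one column at a time and says nothing about whether relationships between columns are preserved.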
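Two properties of the KS test are worth seeing on toy data before trusting it as a gate. The sketch below uses simulated columns, all hypothetical. It shows first how the p-value reacts to sample size for the same tiny shift, and second how a dataset with perfectly matching marginals can still lose its correlation structure.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)

# 1) Same negligible shift (0.02 standard deviations), two sample sizes.
results = {}
for n in (500, 500_000):
    real_col = rng.normal(0.0, 1.0, size=n)
    synth_col = rng.normal(0.02, 1.0, size=n)
    results[n] = stats.ks_2samp(real_col, synth_col)
    print(f"n={n}: KS statistic={results[n].statistic:.4f}, "
          f"p-value={results[n].pvalue:.2e}")

# 2) Identical marginals, broken joint distribution.
x = rng.normal(size=5000)
real_df = pd.DataFrame({"a": x, "b": 0.9 * x + rng.normal(scale=0.3, size=5000)})
# Shuffling each column separately keeps every value but destroys the correlation.
shuffled_df = pd.DataFrame({
    "a": rng.permutation(real_df["a"].to_numpy()),
    "b": rng.permutation(real_df["b"].to_numpy()),
})

ks_a = stats.ks_2samp(real_df["a"], shuffled_df["a"]).statistic
corr_gap = (real_df.corr() - shuffled_df.corr()).abs().to_numpy().max()
print(f"KS statistic for column a: {ks_a:.4f}")
print(f"Largest absolute gap between correlation matrices: {corr_gap:.3f}")
```

Comparing correlation matrices is only one way to check joint structure, but it is cheap and catches the failure that per-column tests miss entirely.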

Train on synthetic, test on real

The most important validation is the one that answers the actual question: does a model trained on synthetic data perform adequately on real data?

python

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def evaluate_synthetic_utility(real_df, synthetic_df, target_col, feature_cols):
    # Assumes numeric feature columns and a binary target

    # Prepare real data
    X_real = real_df[feature_cols].copy()
    y_real = real_df[target_col]

    # Prepare synthetic data
    X_synthetic = synthetic_df[feature_cols].copy()
    y_synthetic = synthetic_df[target_col]

    # Stratify so the real test split keeps the original class balance
    X_train_real, X_test_real, y_train_real, y_test_real = train_test_split(
        X_real, y_real, test_size=0.2, stratify=y_real, random_state=42
    )

    # Train on real, test on real (upper bound)
    clf_real = GradientBoostingClassifier(random_state=42)
    clf_real.fit(X_train_real, y_train_real)
    auc_real = roc_auc_score(y_test_real, clf_real.predict_proba(X_test_real)[:, 1])

    # Train on synthetic, test on real (utility metric)
    clf_synthetic = GradientBoostingClassifier(random_state=42)
    clf_synthetic.fit(X_synthetic, y_synthetic)
    auc_synthetic = roc_auc_score(y_test_real, clf_synthetic.predict_proba(X_test_real)[:, 1])

    print(f"Train on real, test on real AUC: {auc_real:.4f}")
    print(f"Train on synthetic, test on real AUC: {auc_synthetic:.4f}")
    print(f"Utility ratio: {auc_synthetic / auc_real:.4f}")
    print(f"Performance gap: {auc_real - auc_synthetic:.4f}")

    return auc_real, auc_synthetic

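As a rule of thumb, a utility ratio above 0.9 means the synthetic data retains at least ninety percent of the real data’s measured performance on the downstream task, while a ratio below 0.8 suggests the synthetic data is too different from real data to be reliably useful. For the comparison to be fair, fit the synthesizer only on the real training split, so that the held-out real rows never influence the synthetic data.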
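One caveat when reading an AUC-based ratio is that a classifier ranking at random already scores about 0.5, so a raw ratio can look healthier than the share of above-chance performance actually retained. The helper below is a hypothetical illustration of that arithmetic with made-up AUC values, not an established metric.

```python
def utility_summary(auc_real, auc_synthetic, chance_level=0.5):
    """Return the raw AUC ratio and the share of above-chance AUC that is retained."""
    ratio = auc_synthetic / auc_real
    retained_above_chance = (auc_synthetic - chance_level) / (auc_real - chance_level)
    return ratio, retained_above_chance

# Made-up example values for illustration only.
ratio, retained = utility_summary(auc_real=0.92, auc_synthetic=0.84)
print(f"Utility ratio: {ratio:.4f}")
print(f"Above-chance performance retained: {retained:.4f}")
```

Reporting both numbers, along with the absolute performance gap, gives a fuller picture than the ratio alone.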

Synthetic Data Cheat Sheet

| Method | Best For | Library | Key Limitation |
| --- | --- | --- | --- |
| Gaussian Copula | Tabular data, normal distributions | SDV | Struggles with multimodal distributions |
| CTGAN | Complex tabular distributions | SDV | Slower training, needs more data |
| SMOTE | Class imbalance correction | imbalanced-learn | Interpolates, does not extrapolate |
| LLM generation | Text classification, NLP tasks | Anthropic, OpenAI | Unreliable for numeric tabular data |
| Rule-based Faker | Realistic-looking individual records | Faker | Manual correlation encoding required |
| Image augmentation | Computer vision | Albumentations, torchvision | Domain-specific transforms needed |

Common Mistakes

Applying SMOTE to validation and test data is the most consequential mistake. It makes your evaluation metrics look better than they are because the model is being tested on synthetic examples similar to its training data. Always split first, then apply any resampling only to the training fold.

Skipping distribution validation produces synthetic datasets that look reasonable but have subtle statistical differences that cause models trained on them to generalize poorly. The KS test and the train-on-synthetic-test-on-real evaluation together catch most problems before they reach production.

Over-generating synthetic data relative to real data can cause models to overfit to the synthetic distribution rather than learning the real one. A rough starting point is two to five times as many synthetic examples as real ones, but the right ratio depends on the dataset, so confirm it by checking performance on held-out real data. As the ratio grows, the synthetic data increasingly dominates the training signal.
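One simple way to keep the ratio under control is to cap the number of synthetic rows when building the training set. The sketch below assumes both datasets are pandas DataFrames with the same columns. The function name, the `max_ratio` parameter, and the `is_synthetic` marker column are hypothetical.

```python
import pandas as pd

def mix_real_and_synthetic(real_df, synthetic_df, max_ratio=2.0, seed=0):
    """Combine real and synthetic rows, keeping at most max_ratio synthetic rows per real row."""
    cap = int(len(real_df) * max_ratio)
    if len(synthetic_df) > cap:
        synthetic_df = synthetic_df.sample(n=cap, random_state=seed)
    # The marker column makes it easy to filter or reweight synthetic rows later.
    return pd.concat(
        [real_df.assign(is_synthetic=0), synthetic_df.assign(is_synthetic=1)],
        ignore_index=True,
    )

# Toy frames standing in for real and generated data.
real_toy = pd.DataFrame({"feature": range(100), "label": [0, 1] * 50})
synthetic_toy = pd.DataFrame({"feature": range(1000), "label": [0, 1] * 500})

combined = mix_real_and_synthetic(real_toy, synthetic_toy, max_ratio=2.0)
print(combined["is_synthetic"].value_counts())
```

Remember to drop the marker column before training, and treat `max_ratio` as a setting to tune against real validation data rather than a fixed constant.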

Generating synthetic data without domain review allows statistically plausible but domain-implausible records to enter the training set. A synthetic medical dataset might contain patients with contradictory conditions or impossible combinations of lab values: records that a statistical model treats as merely unusual but that a clinician would recognize as impossible. Domain expert review of a sample of synthetic records catches these issues.
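Rules that experts articulate during review can be turned into automated checks that run on every generated batch. The sketch below is a minimal example with pandas. The column names and the three rules are hypothetical, and real constraints should come from the domain experts themselves.

```python
import pandas as pd

# Each rule returns a boolean Series that is True where a row violates it.
RULES = {
    "age_out_of_range": lambda df: ~df["age"].between(0, 120),
    "pregnant_male": lambda df: (df["sex"] == "M") & (df["pregnant"] == 1),
    "diastolic_not_below_systolic": lambda df: df["diastolic_bp"] >= df["systolic_bp"],
}

def flag_implausible(df, rules):
    """Return one boolean column per rule, marking the rows that break it."""
    return pd.DataFrame({name: rule(df) for name, rule in rules.items()})

# Hypothetical synthetic patient records; rows 1 to 3 each break one rule.
patients = pd.DataFrame({
    "age": [34, 51, 140, 29],
    "sex": ["F", "M", "F", "M"],
    "pregnant": [1, 1, 0, 0],
    "systolic_bp": [120, 130, 118, 90],
    "diastolic_bp": [80, 85, 76, 95],
})

flags = flag_implausible(patients, RULES)
print(flags.sum())                       # violations per rule
clean_patients = patients[~flags.any(axis=1)]
print(f"Kept {len(clean_patients)} of {len(patients)} records")
```

Automated rules complement expert review rather than replacing it, since they only catch the problems someone has already thought to write down.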

FAQs

What is synthetic data in machine learning?

Synthetic data is artificially generated data that mimics the statistical properties of real data without containing actual real records. It is created using statistical models, generative neural networks, large language models, or rule-based systems. Machine learning models trained on high-quality synthetic data can generalize to real data effectively, making synthetic data valuable when real data is scarce, expensive to label, or legally restricted.

When should I use synthetic data instead of collecting more real data?

Synthetic data is most valuable when real data is genuinely scarce or inaccessible. Class imbalance where rare events are underrepresented, privacy constraints that prevent using real customer or patient data, annotation costs that exceed budget, and edge case coverage for scenarios that have never occurred historically are all strong justifications for synthetic data generation. If real data is accessible and affordable, collecting more real data is usually preferable because it avoids the distribution mismatch risks that synthetic data introduces.

What is SMOTE and when should I use it?

SMOTE is Synthetic Minority Oversampling Technique, an algorithm that generates new minority class examples by interpolating between existing ones in feature space. Use it when you have a class imbalance problem where one class has significantly fewer examples than others and you want to give the model more examples of the minority class to learn from. Always apply SMOTE only to training data after splitting your dataset. Never apply it to validation or test sets.

Can LLMs generate training data for machine learning?

Yes, and it has become one of the most practical approaches for NLP tasks. LLMs can generate labeled text examples for classification, named entity recognition, and question answering tasks when given a clear task description, label schema, and a few real examples as demonstrations. The quality is highest when the task is well-defined and the LLM can draw on general language understanding. LLMs are less reliable for generating realistic numeric tabular data, where statistical methods like Gaussian copula or CTGAN produce better results.

How do I know if my synthetic data is good enough to train on?

Use two validation approaches together. First, run statistical validation: compare column distributions between real and synthetic data with the Kolmogorov-Smirnov test to check that the synthetic data preserves the statistical properties of the real data. Second, run a train-on-synthetic-test-on-real evaluation: train a model on synthetic data, evaluate it on held-out real data, and compare its performance against a model trained on real data. As a rule of thumb, a utility ratio above 0.9 indicates the synthetic data retains most of the real data’s value for the downstream task.
