The Credit Card Fraud Detection Dataset, created by the Machine Learning Group (MLG) at Université Libre de Bruxelles (ULB) in collaboration with Worldline, is one of the most widely used benchmark datasets for fraud detection and imbalanced classification problems. It contains real credit card transactions made by European cardholders in September 2013, in which fraudulent transactions represent only 0.172% of all transactions, an extreme class imbalance that mirrors real-world fraud detection challenges.
Available on Kaggle, this dataset is excellent for building anomaly detection models, mastering techniques for handling severely imbalanced data, developing real-time fraud scoring systems, and understanding the critical balance between catching fraud and minimizing customer friction: essential skills for financial technology and security applications.
Key Features
- Records: 284,807 transactions collected over 2 days (48 hours).
  - Legitimate transactions: 284,315 (99.828%)
  - Fraudulent transactions: 492 (0.172%)
- Variables: 31 features including:
  - Time: Seconds elapsed between each transaction and the first transaction in the dataset
  - V1-V28: 28 principal components obtained with a PCA transformation
    - The original features are confidential and cannot be disclosed for privacy reasons
    - The underlying features relate to transaction details, cardholder information, and merchant data
  - Amount: Transaction amount (in unspecified currency, likely EUR)
  - Class: Target variable (0 = legitimate, 1 = fraudulent)
- Data Type: All numerical features (continuous values).
- Format: CSV file (approximately 150 MB uncompressed).
- Anonymization: PCA transformation applied to protect confidentiality of original features.
- Class Imbalance: Extreme imbalance ratio of roughly 578:1 (legitimate:fraudulent).
- Temporal Scope: All transactions occur within 48 hours.
Why This Dataset
This dataset represents one of the most realistic and challenging problems in applied machine learning: identifying extremely rare fraudulent transactions among millions of legitimate ones while minimizing false alarms that damage customer experience. It's ideal for projects that aim to:
- Build binary classification models for fraud detection under extreme class imbalance.
- Master techniques for handling imbalanced data (SMOTE, undersampling, class weights, ensemble methods).
- Optimize models for business-relevant metrics (precision-recall trade-offs, cost-based evaluation).
- Develop anomaly detection systems that identify outlier behavior patterns.
- Implement real-time transaction scoring with millisecond latency requirements.
- Understand the critical balance between fraud detection rates and false positive costs.
- Work with anonymized PCA-transformed features where interpretability is limited.
- Build production-ready fraud detection pipelines for financial services.
How to Use the Dataset
- Download the CSV file from Kaggle (creditcard.csv, ~150MB).
- Load into Python using Pandas:
import pandas as pd
df = pd.read_csv('creditcard.csv')
- Explore the structure:
print(df.info())
print(df.describe())
print(df['Class'].value_counts())
- Analyze class imbalance:
fraud_percentage = df['Class'].sum() / len(df) * 100
print(f"Fraud percentage: {fraud_percentage:.3f}%")
# Output: Fraud percentage: 0.173%
- Check for missing values (none expected):
print(df.isnull().sum())
- Visualize key patterns:
- Transaction distribution over time
- Amount distribution for fraud vs. legitimate
- Box plots comparing V1-V28 features by class
- Correlation heatmap for V features
- Fraud transaction timing patterns
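As a minimal sketch of the first two plots above (transaction timing and amount by class), here is a self-contained example; it uses synthetic data with the dataset's column names so it runs without creditcard.csv, and the fraud rate is exaggerated purely for visibility:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; plots are saved, not shown
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in so the sketch runs without creditcard.csv
rng = np.random.default_rng(42)
n = 5000
df = pd.DataFrame({
    'Time': rng.uniform(0, 172800, n),            # 48 hours in seconds
    'Amount': rng.lognormal(3, 1.2, n),
    'Class': (rng.random(n) < 0.01).astype(int),  # fraud rate exaggerated for visibility
})

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(df['Time'] / 3600, bins=48)
axes[0].set(title='Transactions over time', xlabel='Hour', ylabel='Count')
for label, name in [(0, 'legitimate'), (1, 'fraud')]:
    axes[1].hist(df.loc[df['Class'] == label, 'Amount'],
                 bins=50, alpha=0.6, density=True, label=name)
axes[1].set(title='Amount distribution by class', xlabel='Amount')
axes[1].legend()
fig.tight_layout()
fig.savefig('eda_overview.png')
```

Swap the synthetic frame for the real DataFrame to produce the actual exploratory plots.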
- Feature scaling:
- Time: Scale using StandardScaler or RobustScaler
- Amount: Critical to scale (highly skewed, use RobustScaler)
- V1-V28: Already PCA-transformed and scaled, but verify
from sklearn.preprocessing import RobustScaler
# Use separate scaler instances so each can be persisted for inference
amount_scaler = RobustScaler()
time_scaler = RobustScaler()
df['scaled_amount'] = amount_scaler.fit_transform(df['Amount'].values.reshape(-1, 1))
df['scaled_time'] = time_scaler.fit_transform(df['Time'].values.reshape(-1, 1))
- Handle extreme class imbalance (critical step):
- Random Undersampling: Reduce majority class (risk losing information)
- SMOTE (Synthetic Minority Oversampling): Create synthetic fraud examples
- ADASYN: Adaptive synthetic sampling for borderline cases
- Tomek Links: Remove majority class samples near decision boundary
- Class Weights: Penalize fraud misclassification more heavily
- Ensemble Methods: BalancedRandomForest, EasyEnsemble, BalancedBagging
- Anomaly Detection: Isolation Forest, One-Class SVM, Autoencoders
- Engineer features (limited due to PCA):
import numpy as np
# Time-based features
df['hour'] = (df['Time'] / 3600) % 24
# Amount-based features
df['amount_log'] = np.log1p(df['Amount'])
# Statistical interactions (if multiple transactions per card available)
# Note: Card IDs are not provided in this dataset
- Split data strategically:
from sklearn.model_selection import train_test_split
X = df.drop(['Class'], axis=1)
y = df['Class']
# Stratified split to maintain fraud ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Or time-based split (more realistic)
# Train on first 80% chronologically, test on last 20%
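The chronological alternative mentioned in the comment above can be sketched as follows; a synthetic frame with the dataset's column layout stands in for creditcard.csv:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in with the dataset's column layout (Time in seconds over 48h)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'Time': np.sort(rng.uniform(0, 172800, 1000)),
    'Amount': rng.lognormal(3, 1, 1000),
    'Class': (rng.random(1000) < 0.01).astype(int),
})

# Chronological split: train on the first 80% of transactions, test on the rest
df = df.sort_values('Time').reset_index(drop=True)
cutoff = int(len(df) * 0.8)
train, test = df.iloc[:cutoff], df.iloc[cutoff:]

X_train, y_train = train.drop(columns='Class'), train['Class']
X_test, y_test = test.drop(columns='Class'), test['Class']

# Sanity check: nothing in train happens after anything in test
assert X_train['Time'].max() <= X_test['Time'].min()
```

This mirrors deployment more faithfully than a random split, since a production model only ever sees past transactions when scoring new ones.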
- Train models with imbalance handling:
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
# SMOTE oversampling
smote = SMOTE(random_state=42)
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)
# XGBoost with scale_pos_weight
scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()
xgb = XGBClassifier(scale_pos_weight=scale_pos_weight)
# Random Forest with class weights
rf = RandomForestClassifier(class_weight='balanced')
- Optimize decision threshold (don't use default 0.5):
from sklearn.metrics import precision_recall_curve
# y_pred_proba = model.predict_proba(X_test)[:, 1]  # probability of the fraud class
# Find optimal threshold based on business requirements
precision, recall, thresholds = precision_recall_curve(y_test, y_pred_proba)
# Select a threshold that maximizes F1 or meets business constraints
- Evaluate with appropriate metrics (avoid accuracy!):
from sklearn.metrics import (
    confusion_matrix, classification_report,
    precision_recall_curve, roc_auc_score,
    average_precision_score, f1_score
)
# Primary metrics
pr_auc = average_precision_score(y_test, y_pred_proba) # PR-AUC
roc_auc = roc_auc_score(y_test, y_pred_proba) # ROC-AUC
# Confusion matrix breakdown
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()
# Calculate business metrics
precision = tp / (tp + fp) # Of predicted frauds, how many are real
recall = tp / (tp + fn) # Of actual frauds, how many we catch
f1 = f1_score(y_test, y_pred)
Possible Project Ideas
- Optimized fraud classifier achieving 90%+ recall with minimal false positives using ensemble methods.
- Imbalanced learning comparison benchmarking SMOTE, ADASYN, undersampling, class weights, and cost-sensitive learning.
- Anomaly detection system using Isolation Forest, One-Class SVM, or autoencoders to identify fraud as outliers.
- Threshold optimization study finding optimal decision boundary based on business cost analysis.
- Real-time fraud scoring API building Flask/FastAPI service with sub-second response time.
- Deep learning approach using neural networks with focal loss or class-weighted cross-entropy.
- Ensemble fraud detector combining XGBoost, Random Forest, and Neural Networks for robust predictions.
- Cost-benefit calculator quantifying financial impact of different model configurations (FP cost vs. FN cost).
- Temporal fraud analysis investigating how fraud patterns evolve over the 48-hour period.
- Explainability project using SHAP values to interpret predictions despite PCA anonymization.
- Streaming fraud detection simulating real-time transaction processing with Apache Kafka or similar.
- Model monitoring system detecting concept drift and triggering retraining.
- Hybrid detection approach combining supervised classification with unsupervised anomaly detection.
- Alert prioritization engine scoring transactions by fraud probability for manual review.
- Production deployment pipeline containerizing model with Docker, deploying with Kubernetes.
Dataset Challenges and Considerations
- Extreme Class Imbalance: 0.172% fraud rate (roughly 1 in 580 transactions) renders standard accuracy meaningless.
- PCA Anonymization: V1-V28 features lack interpretability, preventing domain-based feature engineering.
- Unknown Original Features: Cannot leverage knowledge about merchant categories, card types, locations, etc.
- Temporal Constraints: Only 48 hours of data may not capture all fraud patterns or seasonal variations.
- Limited Fraud Samples: Only 492 fraud cases limits deep learning approaches and robust validation.
- Feature Scaling Issues: Time and Amount are not pre-scaled unlike V1-V28.
- Deployment Latency: Real-world systems must score transactions in <100ms.
- Cost Asymmetry: Missing fraud costs banks thousands; false positives annoy customers and reduce revenue.
- Evaluation Complexity: Standard metrics don't capture business value or customer experience impact.
- Evolving Fraud Patterns: Fraudsters adapt; models trained on 2013 data may not generalize to current fraud.
- Privacy Constraints: PCA transformation protects privacy but limits model interpretability.
Critical Performance Metrics
Primary Metrics (Use These):
- Precision-Recall AUC (PR-AUC): Best overall metric for extreme imbalance
- Focuses on minority class performance
- More informative than ROC-AUC for rare events
- Target: >0.80 (good), >0.90 (excellent)
- Recall (Sensitivity): Proportion of actual frauds detected
- Critical for minimizing fraud losses
- Target: 85-95% depending on business tolerance
- Formula: TP / (TP + FN)
- Precision: Proportion of fraud predictions that are correct
- Critical for customer experience (avoid false alarms)
- Target: 90-95% to minimize investigation costs
- Formula: TP / (TP + FP)
- F1-Score: Harmonic mean of precision and recall
- Balanced metric when both FP and FN matter
- Target: >0.75 (good), >0.85 (excellent)
- Formula: 2 × (Precision × Recall) / (Precision + Recall)
- Confusion Matrix: Detailed breakdown
- True Positives (TP): Fraud correctly identified
- False Positives (FP): Legitimate flagged as fraud (customer friction)
- False Negatives (FN): Fraud missed (financial loss)
- True Negatives (TN): Legitimate correctly cleared
Secondary Metrics:
- ROC-AUC: Less informative for extreme imbalance but still useful
- Specificity: TN / (TN + FP) - proportion of legitimate transactions correctly cleared
- Cost-based Metric: Custom metric incorporating business costs of FP and FN
Metrics to AVOID:
- Accuracy: Predicting all transactions as legitimate gives 99.828% accuracy, which is completely useless!
- Balanced Accuracy: Better than accuracy but still not ideal for extreme imbalance
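The accuracy trap is easy to demonstrate: a dummy "always legitimate" predictor, scored on labels with this dataset's exact class counts, looks nearly perfect while catching nothing:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Labels with this dataset's exact class counts: 492 frauds in 284,807 transactions
y_true = np.zeros(284807, dtype=int)
y_true[:492] = 1

# "Model" that predicts every transaction as legitimate
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)
print(f"accuracy={acc:.5f}, recall={rec:.1f}, f1={f1:.1f}")
# accuracy is ~0.998 even though not a single fraud is caught
```

Recall and F1 immediately expose the failure that accuracy hides.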
Handling Imbalanced Data - Detailed Strategies
1. Resampling Techniques:
from imblearn.over_sampling import SMOTE, ADASYN, BorderlineSMOTE
from imblearn.under_sampling import RandomUnderSampler, TomekLinks
from imblearn.combine import SMOTETomek, SMOTEENN
# SMOTE - Synthetic Minority Oversampling
smote = SMOTE(sampling_strategy=0.5, random_state=42)  # minority grown to 50% of majority; use sampling_strategy=1.0 for 50:50
X_sm, y_sm = smote.fit_resample(X_train, y_train)
# ADASYN - Adaptive Synthetic Sampling (focuses on harder cases)
adasyn = ADASYN(sampling_strategy=0.5, random_state=42)
X_ad, y_ad = adasyn.fit_resample(X_train, y_train)
# Random Undersampling (fast but loses information)
rus = RandomUnderSampler(sampling_strategy=0.5, random_state=42)
X_rus, y_rus = rus.fit_resample(X_train, y_train)
# Combined approach - SMOTE + Tomek Links
smote_tomek = SMOTETomek(random_state=42)
X_st, y_st = smote_tomek.fit_resample(X_train, y_train)
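One pitfall with every resampler above: resample the training split only, after splitting, so synthetic or duplicated minority points never leak into the test set. A sketch of the correct ordering, using plain random duplication as a lightweight stand-in for SMOTE on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem: ~2% minority class
rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 5))
y = (rng.random(2000) < 0.02).astype(int)
X[y == 1] += 2.0  # shift "fraud" rows so they are learnable

# Split FIRST, then resample only the training portion
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Oversample the minority class in the training split only (stand-in for SMOTE)
minority = np.where(y_tr == 1)[0]
extra = rng.choice(minority, size=(y_tr == 0).sum() - len(minority))
X_bal = np.vstack([X_tr, X_tr[extra]])
y_bal = np.concatenate([y_tr, y_tr[extra]])

clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
# The test split stays untouched and imbalanced, like production traffic
test_fraud_rate = y_te.mean()
```

When using SMOTE itself, imblearn's Pipeline applies resampling only during fit, which keeps cross-validation leak-free without hand-managing the splits.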
2. Algorithmic Approaches:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
# Class weights - penalize minority class errors more
lr = LogisticRegression(class_weight='balanced') # Automatic balancing
rf = RandomForestClassifier(class_weight={0:1, 1:100}) # Manual weights
# XGBoost scale_pos_weight
scale = (y_train == 0).sum() / (y_train == 1).sum()
xgb = XGBClassifier(scale_pos_weight=scale, max_depth=3)
# LightGBM with class weights
lgbm = LGBMClassifier(class_weight='balanced', n_estimators=100)
3. Threshold Optimization:
from sklearn.metrics import precision_recall_curve
# Get probability predictions
y_proba = model.predict_proba(X_test)[:, 1]
# Find optimal threshold
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
# Option 1: Maximize F1-Score
# precision and recall have one more element than thresholds, so drop the last point
f1_scores = 2 * (precision[:-1] * recall[:-1]) / (precision[:-1] + recall[:-1] + 1e-10)
optimal_idx = np.argmax(f1_scores)
optimal_threshold = thresholds[optimal_idx]
# Option 2: Business constraint (e.g., minimum 90% recall)
min_recall = 0.90
valid_indices = recall[:-1] >= min_recall  # align mask with thresholds
if valid_indices.any():
    max_precision_idx = np.argmax(precision[:-1][valid_indices])
    optimal_threshold = thresholds[valid_indices][max_precision_idx]
# Apply optimal threshold
y_pred = (y_proba >= optimal_threshold).astype(int)
4. Anomaly Detection:
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
from sklearn.covariance import EllipticEnvelope
# Isolation Forest (unsupervised)
iso_forest = IsolationForest(contamination=0.00172, random_state=42)
iso_forest.fit(X_train[y_train == 0]) # Train on legitimate only
y_pred = iso_forest.predict(X_test) # -1 = anomaly (fraud), 1 = normal
# One-Class SVM
oc_svm = OneClassSVM(nu=0.00172, kernel='rbf', gamma='auto')
oc_svm.fit(X_train[y_train == 0])
y_pred = oc_svm.predict(X_test)
# Autoencoder (deep learning anomaly detection)
# Train on legitimate transactions, high reconstruction error = fraud
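A full autoencoder needs a deep learning framework, but the core idea can be sketched with its linear cousin: fit PCA on legitimate points only and flag high reconstruction error. Synthetic data and an illustrative threshold are used here; a real autoencoder would replace PCA with an encoder/decoder network:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
# "Legitimate" points lie near a 2-D subspace of a 10-D space
normal = rng.normal(size=(1000, 2)) @ rng.normal(size=(2, 10))
normal += 0.05 * rng.normal(size=normal.shape)
# "Fraud" points scatter far off that subspace
fraud = rng.normal(size=(20, 10)) * 3

# As with an autoencoder: fit the compressor on legitimate data only
pca = PCA(n_components=2).fit(normal)

def reconstruction_error(X):
    X_hat = pca.inverse_transform(pca.transform(X))
    return np.mean((X - X_hat) ** 2, axis=1)

err_normal = reconstruction_error(normal)
err_fraud = reconstruction_error(fraud)
# Flag anything above a high percentile of legitimate reconstruction error
threshold = np.percentile(err_normal, 99)
flagged = float((err_fraud > threshold).mean())  # fraction of fraud flagged
```

The same train-on-legitimate, threshold-on-error recipe carries over directly to a neural autoencoder.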
5. Ensemble Methods for Imbalance:
from imblearn.ensemble import (
    BalancedRandomForestClassifier,
    EasyEnsembleClassifier,
    RUSBoostClassifier,
    BalancedBaggingClassifier
)
# Balanced Random Forest
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
brf.fit(X_train, y_train)
# EasyEnsemble (multiple undersampled subsets)
easy = EasyEnsembleClassifier(n_estimators=10, random_state=42)
easy.fit(X_train, y_train)
# RUSBoost (Random UnderSampling + Boosting)
rus_boost = RUSBoostClassifier(n_estimators=100, random_state=42)
rus_boost.fit(X_train, y_train)
Model Performance Expectations
Baseline (Predicting all as legitimate):
- Accuracy: 99.828% (misleading!)
- Recall: 0% (catches no fraud)
- Precision: Undefined (no fraud predicted)
- Completely useless despite high "accuracy"
Acceptable Performance:
- PR-AUC: 0.70-0.80
- Recall: 75-85%
- Precision: 85-92%
- F1-Score: 0.70-0.80
Good Performance:
- PR-AUC: 0.80-0.90
- Recall: 85-92%
- Precision: 92-96%
- F1-Score: 0.80-0.88
State-of-the-Art:
- PR-AUC: 0.90-0.95
- Recall: 92-97%
- Precision: 96-98%
- F1-Score: 0.88-0.92
Important Notes:
- Perfect performance is impossible due to inherent label noise
- Trade-offs exist between recall (catch fraud) and precision (avoid false alarms)
- Business requirements determine optimal operating point
Business Context and Cost Analysis
False Negative (FN) - Missing Fraud:
- Average fraud transaction: $100-500
- Chargebacks and fees to bank
- Customer dissatisfaction (fraud victim)
- Cost per FN: $100-1000+ depending on amount
False Positive (FP) - Legitimate Flagged as Fraud:
- Customer friction and annoyance
- Abandoned purchases (lost revenue)
- Manual review costs ($5-20 per investigation)
- Customer support time
- Cost per FP: $10-50 depending on process
Cost Ratio:
- FN typically 10-100x more costly than FP
- Justifies optimizing for higher recall even with some FP increase
- Exact ratio depends on business model and fraud amounts
Threshold Selection Example:
# Business costs
cost_fn = 500 # Missing $500 fraud
cost_fp = 20 # $20 manual review cost
# Calculate cost for different thresholds
def calculate_cost(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)
    tn, fp, fn, tp = cm.ravel()
    total_cost = (fn * cost_fn) + (fp * cost_fp)
    return total_cost, fn, fp
# Find threshold that minimizes cost
best_cost = float('inf')
best_threshold = 0.5
for threshold in np.arange(0.1, 0.9, 0.01):
    y_pred = (y_proba >= threshold).astype(int)
    cost, fn, fp = calculate_cost(y_test, y_pred)
    if cost < best_cost:
        best_cost = cost
        best_threshold = threshold
Feature Engineering Limitations
Due to PCA anonymization, traditional feature engineering is severely limited:
What You CAN'T Do:
- Create merchant-based features (unknown)
- Use geographic/location information (not provided)
- Leverage transaction category knowledge (anonymized)
- Build cardholder behavioral profiles (no card IDs)
- Time-of-day patterns beyond basic hour extraction
What You CAN Do:
- Time-based patterns (transactions per hour, day/night)
- Amount-based features (log transformation, binning, z-scores)
- Statistical interactions between V1-V28 features
- Polynomial features or interaction terms between V components
- Aggregate statistics if assuming sequential transactions (risky assumption)
Example Limited Engineering:
import numpy as np
# Amount transformations
df['amount_log'] = np.log1p(df['Amount'])
df['amount_zscore'] = (df['Amount'] - df['Amount'].mean()) / df['Amount'].std()
# Time features
df['hour'] = (df['Time'] / 3600) % 24
df['day'] = (df['Time'] / (3600 * 24)).astype(int)
# Interaction terms (V features)
df['V1_V2_interaction'] = df['V1'] * df['V2']
df['V1_squared'] = df['V1'] ** 2
Practical Deployment Considerations
Real-Time Requirements:
- Latency: <100ms per transaction (ideally <50ms)
- Throughput: Thousands of transactions per second
- Availability: 99.99%+ uptime (fraud detection can't go down)
Model Serving:
# Lightweight model for production
import joblib
from flask import Flask, request, jsonify
app = Flask(__name__)
model = joblib.load('fraud_model.pkl')
scaler = joblib.load('scaler.pkl')
@app.route('/predict', methods=['POST'])
def predict():
    data = request.json
    features = scaler.transform([data['features']])
    prob = model.predict_proba(features)[0][1]
    return jsonify({
        'fraud_probability': float(prob),
        'is_fraud': bool(prob > 0.3),  # Optimized threshold
        'confidence': 'high' if prob > 0.7 or prob < 0.1 else 'medium'
    })
Monitoring and Maintenance:
- Track precision/recall daily
- Monitor fraud patterns for concept drift
- A/B test new models against production
- Retrain quarterly or when performance degrades
- Log all predictions for audit trail
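Concept-drift monitoring can start as simply as comparing score distributions between a reference window and live traffic, for example via the Population Stability Index. A sketch on synthetic score samples; the 0.1 / 0.25 cutoffs are industry rules of thumb, not laws:

```python
import numpy as np

def psi(reference, live, bins=10):
    """Population Stability Index between a reference and a live score sample."""
    # Bin edges from the reference window's quantiles
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    # Clip both samples into the reference range so nothing falls outside the bins
    ref_frac = np.histogram(np.clip(reference, edges[0], edges[-1]), edges)[0] / len(reference)
    live_frac = np.histogram(np.clip(live, edges[0], edges[-1]), edges)[0] / len(live)
    # Small floor avoids log(0) on empty bins
    ref_frac = np.clip(ref_frac, 1e-6, None)
    live_frac = np.clip(live_frac, 1e-6, None)
    return float(np.sum((live_frac - ref_frac) * np.log(live_frac / ref_frac)))

rng = np.random.default_rng(1)
ref_scores = rng.beta(1, 20, 50000)  # fraud scores at training time
stable = rng.beta(1, 20, 50000)      # same distribution -> low PSI
shifted = rng.beta(2, 10, 50000)     # drifted distribution -> high PSI

psi_stable = psi(ref_scores, stable)    # rule of thumb: < 0.1, no action needed
psi_shifted = psi(ref_scores, shifted)  # rule of thumb: > 0.25, investigate/retrain
```

Running this daily on model output scores gives a cheap first alarm for drift before precision and recall visibly degrade.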
Scalability:
- Use lightweight models (XGBoost, LightGBM over deep learning)
- Consider model quantization for faster inference
- Batch predictions where possible
- Cache feature transformations
- Horizontal scaling with load balancers
Regulatory Compliance:
- Explainability requirements (SHAP, LIME despite PCA)
- Fair lending laws (no discrimination)
- Data privacy (GDPR, PCI-DSS)
- Audit trails for all fraud decisions
- Model documentation and validation
Ethical Considerations
Fairness:
- Ensure model doesn't discriminate by demographics (even if encoded in V features)
- Test for disparate impact across customer segments
- Regular bias audits
Transparency:
- Communicate AI role in fraud detection to customers
- Provide appeal process for declined transactions
- Explain decision factors when possible (challenging with PCA)
Privacy:
- PCA transformation protects original features
- Comply with data protection regulations
- Secure model and data storage
- Limited retention of transaction details
Customer Impact:
- Balance fraud prevention with customer experience
- Minimize false positives that frustrate legitimate customers
- Fast resolution process for false fraud flags
- Clear communication when fraud is suspected