The Credit Card Fraud Detection Dataset, created by the Machine Learning Group (MLG) at Université Libre de Bruxelles (ULB) in collaboration with Worldline, is one of the most widely used benchmark datasets for fraud detection and imbalanced classification problems. It contains real credit card transactions made by European cardholders in September 2013, in which fraudulent transactions represent only 0.172% of all transactions, an extreme class imbalance that mirrors real-world fraud detection challenges.
Available on Kaggle, this dataset is excellent for building anomaly detection models, mastering techniques for handling severely imbalanced data, developing real-time fraud scoring systems, and understanding the critical balance between catching fraud and minimizing customer friction: essential skills for financial technology and security applications.
Key Features
- Records: 284,807 transactions collected over 2 days (48 hours).
  - Legitimate transactions: 284,315 (99.828%)
  - Fraudulent transactions: 492 (0.172%)
- Variables: 31 features including:
  - Time: Seconds elapsed between each transaction and the first transaction in the dataset
  - V1-V28: 28 principal components obtained with a PCA transformation
    - The original features are confidential and cannot be disclosed for privacy reasons
    - The underlying features relate to transaction details, cardholder information, and merchant data
  - Amount: Transaction amount (in unspecified currency, likely EUR)
  - Class: Target variable (0 = legitimate, 1 = fraudulent)
- Data Type: All numerical features (continuous values).
- Format: CSV file (approximately 150 MB uncompressed).
- Anonymization: PCA transformation applied to protect confidentiality of original features.
- Class Imbalance: Extreme imbalance ratio of roughly 578:1 (legitimate:fraudulent).
- Temporal Scope: All transactions occur within 48 hours.
Why This Dataset
This dataset represents one of the most realistic and challenging problems in applied machine learning: identifying extremely rare fraudulent transactions among millions of legitimate ones while minimizing false alarms that damage customer experience. It's ideal for projects that aim to:
- Build binary classification models for fraud detection under extreme class imbalance.
- Master techniques for handling imbalanced data (SMOTE, undersampling, class weights, ensemble methods).
- Optimize models for business-relevant metrics (precision-recall trade-offs, cost-based evaluation).
- Develop anomaly detection systems that identify outlier behavior patterns.
- Implement real-time transaction scoring with millisecond latency requirements.
- Understand the critical balance between fraud detection rates and false positive costs.
- Work with anonymized PCA-transformed features where interpretability is limited.
- Build production-ready fraud detection pipelines for financial services.
How to Use the Dataset
- Download the CSV file from Kaggle (creditcard.csv, ~150MB).
- Load into Python using Pandas:
import pandas as pd
df = pd.read_csv('creditcard.csv')
- Explore the structure:
print(df.info())
print(df.describe())
print(df['Class'].value_counts())
- Analyze class imbalance:
fraud_percentage = df['Class'].sum() / len(df) * 100
print(f"Fraud percentage: {fraud_percentage:.3f}%")
# Output: Fraud percentage: 0.173%
- Check for missing values (none expected):
print(df.isnull().sum())
- Visualize key patterns:
- Transaction distribution over time
- Amount distribution for fraud vs. legitimate
- Box plots comparing V1-V28 features by class
- Correlation heatmap for V features
- Fraud transaction timing patterns
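As a minimal sketch of the first two plots above (transaction timing and amount by class), here is a self-contained example; it uses synthetic data with the dataset's column names so it runs without creditcard.csv, and the fraud rate is exaggerated purely for visibility:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; plots are saved, not shown
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in so the sketch runs without creditcard.csv
rng = np.random.default_rng(42)
n = 5000
df = pd.DataFrame({
    'Time': rng.uniform(0, 172800, n),            # 48 hours in seconds
    'Amount': rng.lognormal(3, 1.2, n),
    'Class': (rng.random(n) < 0.01).astype(int),  # fraud rate exaggerated for visibility
})

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(df['Time'] / 3600, bins=48)
axes[0].set(title='Transactions over time', xlabel='Hour', ylabel='Count')
for label, name in [(0, 'legitimate'), (1, 'fraud')]:
    axes[1].hist(df.loc[df['Class'] == label, 'Amount'],
                 bins=50, alpha=0.6, density=True, label=name)
axes[1].set(title='Amount distribution by class', xlabel='Amount')
axes[1].legend()
fig.tight_layout()
fig.savefig('eda_overview.png')
```

Swap the synthetic frame for the real DataFrame to produce the actual exploratory plots.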
- Feature scaling:
- Time: Scale using StandardScaler or RobustScaler
- Amount: Critical to scale (highly skewed, use RobustScaler)
- V1-V28: Already PCA-transformed and scaled, but verify
from sklearn.preprocessing import RobustScaler
# Use separate scaler instances so each can be persisted for inference
amount_scaler = RobustScaler()
time_scaler = RobustScaler()
df['scaled_amount'] = amount_scaler.fit_transform(df['Amount'].values.reshape(-1, 1))
df['scaled_time'] = time_scaler.fit_transform(df['Time'].values.reshape(-1, 1))
- Handle extreme class imbalance (critical step):
- Random Undersampling: Reduce majority class (risk losing information)
- SMOTE (Synthetic Minority Oversampling): Create synthetic fraud examples
- ADASYN: Adaptive synthetic sampling for borderline cases
- Tomek Links: Remove majority class samples near decision boundary
- Class Weights: Penalize fraud misclassification more heavily
- Ensemble Methods: BalancedRandomForest, EasyEnsemble, BalancedBagging
- Anomaly Detection: Isolation Forest, One-Class SVM, Autoencoders
- Engineer features (limited due to PCA):
import numpy as np
# Time-based features
df['hour'] = (df['Time'] / 3600) % 24
# Amount-based features
df['amount_log'] = np.log1p(df['Amount'])
# Statistical interactions (if multiple transactions per card available)
# Note: Card IDs are not provided in this dataset
- Split data strategically:
from sklearn.model_selection import train_test_split
X = df.drop(['Class'], axis=1)
y = df['Class']
# Stratified split to maintain fraud ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Or time-based split (more realistic)
# Train on first 80% chronologically, test on last 20%
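The chronological alternative mentioned in the comment above can be sketched as follows; a synthetic frame with the dataset's column layout stands in for creditcard.csv:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in with the dataset's column layout (Time in seconds over 48h)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'Time': np.sort(rng.uniform(0, 172800, 1000)),
    'Amount': rng.lognormal(3, 1, 1000),
    'Class': (rng.random(1000) < 0.01).astype(int),
})

# Chronological split: train on the first 80% of transactions, test on the rest
df = df.sort_values('Time').reset_index(drop=True)
cutoff = int(len(df) * 0.8)
train, test = df.iloc[:cutoff], df.iloc[cutoff:]

X_train, y_train = train.drop(columns='Class'), train['Class']
X_test, y_test = test.drop(columns='Class'), test['Class']

# Sanity check: nothing in train happens after anything in test
assert X_train['Time'].max() <= X_test['Time'].min()
```

This mirrors deployment more faithfully than a random split, since a production model only ever sees past transactions when scoring new ones.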
- Train models with imbalance handling:
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
# SMOTE oversampling
smote = SMOTE(random_state=42)
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)
# XGBoost with scale_pos_weight
scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()
xgb = XGBClassifier(scale_pos_weight=scale_pos_weight)
# Random Forest with class weights
rf = RandomForestClassifier(class_weight='balanced')
- Optimize decision threshold (don't use default 0.5):
from sklearn.metrics import precision_recall_curve
# y_pred_proba = model.predict_proba(X_test)[:, 1]  # probability of the fraud class
# Find optimal threshold based on business requirements
precision, recall, thresholds = precision_recall_curve(y_test, y_pred_proba)
# Select a threshold that maximizes F1 or meets business constraints
- Evaluate with appropriate metrics (avoid accuracy!):
from sklearn.metrics import (
    confusion_matrix, classification_report,
    precision_recall_curve, roc_auc_score,
    average_precision_score, f1_score
)
# Primary metrics
pr_auc = average_precision_score(y_test, y_pred_proba) # PR-AUC
roc_auc = roc_auc_score(y_test, y_pred_proba) # ROC-AUC
# Confusion matrix breakdown
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()
# Calculate business metrics
precision = tp / (tp + fp) # Of predicted frauds, how many are real
recall = tp / (tp + fn) # Of actual frauds, how many we catch
f1 = f1_score(y_test, y_pred)
Possible Project Ideas
- Optimized fraud classifier achieving 90%+ recall with minimal false positives using ensemble methods.
- Imbalanced learning comparison benchmarking SMOTE, ADASYN, undersampling, class weights, and cost-sensitive learning.
- Anomaly detection system using Isolation Forest, One-Class SVM, or autoencoders to identify fraud as outliers.
- Threshold optimization study finding optimal decision boundary based on business cost analysis.
- Real-time fraud scoring API building Flask/FastAPI service with sub-second response time.
- Deep learning approach using neural networks with focal loss or class-weighted cross-entropy.
- Ensemble fraud detector combining XGBoost, Random Forest, and Neural Networks for robust predictions.
- Cost-benefit calculator quantifying financial impact of different model configurations (FP cost vs. FN cost).
- Temporal fraud analysis investigating how fraud patterns evolve over the 48-hour period.
- Explainability project using SHAP values to interpret predictions despite PCA anonymization.
- Streaming fraud detection simulating real-time transaction processing with Apache Kafka or similar.
- Model monitoring system detecting concept drift and triggering retraining.
- Hybrid detection approach combining supervised classification with unsupervised anomaly detection.
- Alert prioritization engine scoring transactions by fraud probability for manual review.
- Production deployment pipeline containerizing model with Docker, deploying with Kubernetes.
Dataset Challenges and Considerations
- Extreme Class Imbalance: 0.172% fraud rate (roughly 1 in 580 transactions) renders standard accuracy meaningless.
- PCA Anonymization: V1-V28 features lack interpretability, preventing domain-based feature engineering.
- Unknown Original Features: Cannot leverage knowledge about merchant categories, card types, locations, etc.
- Temporal Constraints: Only 48 hours of data may not capture all fraud patterns or seasonal variations.
- Limited Fraud Samples: Only 492 fraud cases limits deep learning approaches and robust validation.
- Feature Scaling Issues: Time and Amount are not pre-scaled unlike V1-V28.
- Deployment Latency: Real-world systems must score transactions in <100ms.
- Cost Asymmetry: Missing fraud costs banks thousands; false positives annoy customers and reduce revenue.
- Evaluation Complexity: Standard metrics don't capture business value or customer experience impact.
- Evolving Fraud Patterns: Fraudsters adapt; models trained on 2013 data may not generalize to current fraud.
- Privacy Constraints: PCA transformation protects privacy but limits model interpretability.
Critical Performance Metrics
Primary Metrics (Use These):
- Precision-Recall AUC (PR-AUC): Best overall metric for extreme imbalance
- Focuses on minority class performance
- More informative than ROC-AUC for rare events
- Target: >0.80 (good), >0.90 (excellent)
- Recall (Sensitivity): Proportion of actual frauds detected
- Critical for minimizing fraud losses
- Target: 85-95% depending on business tolerance
- Formula: TP / (TP + FN)
- Precision: Proportion of fraud predictions that are correct
- Critical for customer experience (avoid false alarms)
- Target: 90-95% to minimize investigation costs
- Formula: TP / (TP + FP)
- F1-Score: Harmonic mean of precision and recall
- Balanced metric when both FP and FN matter
- Target: >0.75 (good), >0.85 (excellent)
- Formula: 2 × (Precision × Recall) / (Precision + Recall)
- Confusion Matrix: Detailed breakdown
- True Positives (TP): Fraud correctly identified
- False Positives (FP): Legitimate flagged as fraud (customer friction)
- False Negatives (FN): Fraud missed (financial loss)
- True Negatives (TN): Legitimate correctly cleared
Secondary Metrics:
- ROC-AUC: Less informative for extreme imbalance but still useful
- Specificity: TN / (TN + FP) - proportion of legitimate transactions correctly cleared
- Cost-based Metric: Custom metric incorporating business costs of FP and FN
Metrics to AVOID:
- Accuracy: Predicting all transactions as legitimate gives 99.828% accuracy, which is completely useless!
- Balanced Accuracy: Better than accuracy but still not ideal for extreme imbalance
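The accuracy trap is easy to demonstrate: a dummy "always legitimate" predictor, scored on labels with this dataset's exact class counts, looks nearly perfect while catching nothing:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Labels with this dataset's exact class counts: 492 frauds in 284,807 transactions
y_true = np.zeros(284807, dtype=int)
y_true[:492] = 1

# "Model" that predicts every transaction as legitimate
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)
print(f"accuracy={acc:.5f}, recall={rec:.1f}, f1={f1:.1f}")
# accuracy is ~0.998 even though not a single fraud is caught
```

Recall and F1 immediately expose the failure that accuracy hides.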
Handling Imbalanced Data - Detailed Strategies
1. Resampling Techniques:
from imblearn.over_sampling import SMOTE, ADASYN, BorderlineSMOTE
from imblearn.under_sampling import RandomUnderSampler, TomekLinks
from imblearn.combine import SMOTETomek, SMOTEENN
# SMOTE - Synthetic Minority Oversampling
smote = SMOTE(sampling_strategy=0.5, random_state=42)  # minority grown to 50% of majority; use sampling_strategy=1.0 for 50:50
X_sm, y_sm = smote.fit_resample(X_train, y_train)
# ADASYN - Adaptive Synthetic Sampling (focuses on harder cases)
adasyn = ADASYN(sampling_strategy=0.5, random_state=42)
X_ad, y_ad = adasyn.fit_resample(X_train, y_train)
# Random Undersampling (fast but loses information)
rus = RandomUnderSampler(sampling_strategy=0.5, random_state=42)
X_rus, y_rus = rus.fit_resample(X_train, y_train)
# Combined approach - SMOTE + Tomek Links
smote_tomek = SMOTETomek(random_state=42)
X_st, y_st = smote_tomek.fit_resample(X_train, y_train)
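One pitfall with every resampler above: resample the training split only, after splitting, so synthetic or duplicated minority points never leak into the test set. A sketch of the correct ordering, using plain random duplication as a lightweight stand-in for SMOTE on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem: ~2% minority class
rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 5))
y = (rng.random(2000) < 0.02).astype(int)
X[y == 1] += 2.0  # shift "fraud" rows so they are learnable

# Split FIRST, then resample only the training portion
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Oversample the minority class in the training split only (stand-in for SMOTE)
minority = np.where(y_tr == 1)[0]
extra = rng.choice(minority, size=(y_tr == 0).sum() - len(minority))
X_bal = np.vstack([X_tr, X_tr[extra]])
y_bal = np.concatenate([y_tr, y_tr[extra]])

clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
# The test split stays untouched and imbalanced, like production traffic
test_fraud_rate = y_te.mean()
```

When using SMOTE itself, imblearn's Pipeline applies resampling only during fit, which keeps cross-validation leak-free without hand-managing the splits.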
2. Algorithmic Approaches:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
# Class weights - penalize minority class errors more
lr = LogisticRegression(class_weight='balanced') # Automatic balancing
rf = RandomForestClassifier(class_weight={0:1, 1:100}) # Manual weights
# XGBoost scale_pos_weight
scale = (y_train == 0).sum() / (y_train == 1).sum()
xgb = XGBClassifier(scale_pos_weight=scale, max_depth=3)
# LightGBM with class weights
lgbm = LGBMClassifier(class_weight='balanced', n_estimators=100)
3. Threshold Optimization:
from sklearn.metrics import precision_recall_curve
# Get probability predictions
y_proba = model.predict_proba(X_test)[:, 1]
# Find optimal threshold
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
# Option 1: Maximize F1-Score
# precision and recall have one more element than thresholds, so drop the last point
f1_scores = 2 * (precision[:-1] * recall[:-1]) / (precision[:-1] + recall[:-1] + 1e-10)
optimal_idx = np.argmax(f1_scores)
optimal_threshold = thresholds[optimal_idx]
# Option 2: Business constraint (e.g., minimum 90% recall)
min_recall = 0.90
valid_indices = recall[:-1] >= min_recall  # align mask with thresholds
if valid_indices.any():
    max_precision_idx = np.argmax(precision[:-1][valid_indices])
    optimal_threshold = thresholds[valid_indices][max_precision_idx]
# Apply optimal threshold
y_pred = (y_proba >= optimal_threshold).astype(int)
4. Anomaly Detection:
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
from sklearn.covariance import EllipticEnvelope
# Isolation Forest (unsupervised)
iso_forest = IsolationForest(contamination=0.00172, random_state=42)
iso_forest.fit(X_train[y_train == 0]) # Train on legitimate only
y_pred = iso_forest.predict(X_test) # -1 = anomaly (fraud), 1 = normal
# One-Class SVM
oc_svm = OneClassSVM(nu=0.00172, kernel='rbf', gamma='auto')
oc_svm.fit(X_train[y_train == 0])
y_pred = oc_svm.predict(X_test)
# Autoencoder (deep learning anomaly detection)
# Train on legitimate transactions, high reconstruction error = fraud
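A full autoencoder needs a deep learning framework, but the core idea can be sketched with its linear cousin: fit PCA on legitimate points only and flag high reconstruction error. Synthetic data and an illustrative threshold are used here; a real autoencoder would replace PCA with an encoder/decoder network:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
# "Legitimate" points lie near a 2-D subspace of a 10-D space
normal = rng.normal(size=(1000, 2)) @ rng.normal(size=(2, 10))
normal += 0.05 * rng.normal(size=normal.shape)
# "Fraud" points scatter far off that subspace
fraud = rng.normal(size=(20, 10)) * 3

# As with an autoencoder: fit the compressor on legitimate data only
pca = PCA(n_components=2).fit(normal)

def reconstruction_error(X):
    X_hat = pca.inverse_transform(pca.transform(X))
    return np.mean((X - X_hat) ** 2, axis=1)

err_normal = reconstruction_error(normal)
err_fraud = reconstruction_error(fraud)
# Flag anything above a high percentile of legitimate reconstruction error
threshold = np.percentile(err_normal, 99)
flagged = float((err_fraud > threshold).mean())  # fraction of fraud flagged
```

The same train-on-legitimate, threshold-on-error recipe carries over directly to a neural autoencoder.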
5. Ensemble Methods for Imbalance:
from imblearn.ensemble import (
    BalancedRandomForestClassifier,
    EasyEnsembleClassifier,
    RUSBoostClassifier,
    BalancedBaggingClassifier
)
# Balanced Random Forest
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
brf.fit(X_train, y_train)
# EasyEnsemble (multiple undersampled subsets)
easy = EasyEnsembleClassifier(n_estimators=10, random_state=42)
easy.fit(X_train, y_train)
# RUSBoost (Random UnderSampling + Boosting)
rus_boost = RUSBoostClassifier(n_estimators=100, random_state=42)
rus_boost.fit(X_train, y_train)
Model Performance Expectations
Baseline (Predicting all as legitimate):
- Accuracy: 99.828% (misleading!)
- Recall: 0% (catches no fraud)
- Precision: Undefined (no fraud predicted)
- Completely useless despite high "accuracy"
Acceptable Performance:
- PR-AUC: 0.70-0.80
- Recall: 75-85%
- Precision: 85-92%
- F1-Score: 0.70-0.80
Good Performance:
- PR-AUC: 0.80-0.90
- Recall: 85-92%
- Precision: 92-96%
- F1-Score: 0.80-0.88
State-of-the-Art:
- PR-AUC: 0.90-0.95
- Recall: 92-97%
- Precision: 96-98%
- F1-Score: 0.88-0.92
Important Notes:
- Perfect performance is impossible due to inherent label noise
- Trade-offs exist between recall (catch fraud) and precision (avoid false alarms)
- Business requirements determine optimal operating point
Business Context and Cost Analysis
False Negative (FN) - Missing Fraud:
- Average fraud transaction: $100-500
- Chargebacks and fees to bank
- Customer dissatisfaction (fraud victim)
- Cost per FN: $100-1000+ depending on amount
False Positive (FP) - Legitimate Flagged as Fraud:
- Customer friction and annoyance
- Abandoned purchases (lost revenue)
- Manual review costs ($5-20 per investigation)
- Customer support time
- Cost per FP: $10-50 depending on process
Cost Ratio:
- FN typically 10-100x more costly than FP
- Justifies optimizing for higher recall even with some FP increase
- Exact ratio depends on business model and fraud amounts
Threshold Selection Example:
# Business costs
cost_fn = 500 # Missing $500 fraud
cost_fp = 20 # $20 manual review cost
# Calculate cost for different thresholds
def calculate_cost(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)
    tn, fp, fn, tp = cm.ravel()
    total_cost = (fn * cost_fn) + (fp * cost_fp)
    return total_cost, fn, fp
# Find threshold that minimizes cost
best_cost = float('inf')
best_threshold = 0.5
for threshold in np.arange(0.1, 0.9, 0.01):
    y_pred = (y_proba >= threshold).astype(int)
    cost, fn, fp = calculate_cost(y_test, y_pred)
    if cost < best_cost:
        best_cost = cost
        best_threshold = threshold
Feature Engineering Limitations
Due to PCA anonymization, traditional feature engineering is severely limited:
What You CAN'T Do:
- Create merchant-based features (unknown)
- Use geographic/location information (not provided)
- Leverage transaction category knowledge (anonymized)
- Build cardholder behavioral profiles (no card IDs)
- Time-of-day patterns beyond basic hour extraction
What You CAN Do:
- Time-based patterns (transactions per hour, day/night)
- Amount-based features (log transformation, binning, z-scores)
- Statistical interactions between V1-V28 features
- Polynomial features or interaction terms between V components
- Aggregate statistics if assuming sequential transactions (risky assumption)
Example Limited Engineering:
import numpy as np
# Amount transformations
df['amount_log'] = np.log1p(df['Amount'])
df['amount_zscore'] = (df['Amount'] - df['Amount'].mean()) / df['Amount'].std()
# Time features
df['hour'] = (df['Time'] / 3600) % 24
df['day'] = (df['Time'] / (3600 * 24)).astype(int)
# Interaction terms (V features)
df['V1_V2_interaction'] = df['V1'] * df['V2']
df['V1_squared'] = df['V1'] ** 2
Practical Deployment Considerations
Real-Time Requirements:
- Latency: <100ms per transaction (ideally <50ms)
- Throughput: Thousands of transactions per second
- Availability: 99.99%+ uptime (fraud detection can't go down)
Model Serving:
# Lightweight model for production
import joblib
from flask import Flask, request, jsonify
app = Flask(__name__)
model = joblib.load('fraud_model.pkl')
scaler = joblib.load('scaler.pkl')
@app.route('/predict', methods=['POST'])
def predict():
    data = request.json
    features = scaler.transform([data['features']])
    prob = model.predict_proba(features)[0][1]
    return jsonify({
        'fraud_probability': float(prob),
        'is_fraud': bool(prob > 0.3),  # Optimized threshold
        'confidence': 'high' if prob > 0.7 or prob < 0.1 else 'medium'
    })
Monitoring and Maintenance:
- Track precision/recall daily
- Monitor fraud patterns for concept drift
- A/B test new models against production
- Retrain quarterly or when performance degrades
- Log all predictions for audit trail
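Concept-drift monitoring can start as simply as comparing score distributions between a reference window and live traffic, for example via the Population Stability Index. A sketch on synthetic score samples; the 0.1 / 0.25 cutoffs are industry rules of thumb, not laws:

```python
import numpy as np

def psi(reference, live, bins=10):
    """Population Stability Index between a reference and a live score sample."""
    # Bin edges from the reference window's quantiles
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    # Clip both samples into the reference range so nothing falls outside the bins
    ref_frac = np.histogram(np.clip(reference, edges[0], edges[-1]), edges)[0] / len(reference)
    live_frac = np.histogram(np.clip(live, edges[0], edges[-1]), edges)[0] / len(live)
    # Small floor avoids log(0) on empty bins
    ref_frac = np.clip(ref_frac, 1e-6, None)
    live_frac = np.clip(live_frac, 1e-6, None)
    return float(np.sum((live_frac - ref_frac) * np.log(live_frac / ref_frac)))

rng = np.random.default_rng(1)
ref_scores = rng.beta(1, 20, 50000)  # fraud scores at training time
stable = rng.beta(1, 20, 50000)      # same distribution -> low PSI
shifted = rng.beta(2, 10, 50000)     # drifted distribution -> high PSI

psi_stable = psi(ref_scores, stable)    # rule of thumb: < 0.1, no action needed
psi_shifted = psi(ref_scores, shifted)  # rule of thumb: > 0.25, investigate/retrain
```

Running this daily on model output scores gives a cheap first alarm for drift before precision and recall visibly degrade.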
Scalability:
- Use lightweight models (XGBoost, LightGBM over deep learning)
- Consider model quantization for faster inference
- Batch predictions where possible
- Cache feature transformations
- Horizontal scaling with load balancers
Regulatory Compliance:
- Explainability requirements (SHAP, LIME despite PCA)
- Fair lending laws (no discrimination)
- Data privacy (GDPR, PCI-DSS)
- Audit trails for all fraud decisions
- Model documentation and validation
Ethical Considerations
Fairness:
- Ensure model doesn't discriminate by demographics (even if encoded in V features)
- Test for disparate impact across customer segments
- Regular bias audits
Transparency:
- Communicate AI role in fraud detection to customers
- Provide appeal process for declined transactions
- Explain decision factors when possible (challenging with PCA)
Privacy:
- PCA transformation protects original features
- Comply with data protection regulations
- Secure model and data storage
- Limited retention of transaction details
Customer Impact:
- Balance fraud prevention with customer experience
- Minimize false positives that frustrate legitimate customers
- Fast resolution process for false fraud flags
- Clear communication when fraud is suspected