Medical Insurance Cost Dataset

The Medical Insurance Cost Dataset, created by Mosab Abdelghany, contains individual medical insurance cost data along with various demographic and lifestyle factors that influence insurance premiums. This dataset provides insights into how insurance companies assess risk and determine pricing based on personal characteristics, health indicators, and behavioral factors.

Available on Kaggle, this dataset is well suited to building regression models that predict insurance costs, understanding the factors that drive healthcare expenses, exploring actuarial pricing mechanisms, and developing fair and transparent insurance pricing models. It is useful both for healthcare analytics practice and for applied work in insurance, healthcare policy, and risk assessment.

Key Features

Records: 1,338 individual insurance beneficiaries.
Variables: 7 columns (6 predictors plus the target):
- Age: Age of the primary beneficiary (years)
- Sex: Gender of the insurance contractor (male, female)
- BMI: Body Mass Index - weight divided by height squared (kg/m²), used as a proxy for body fat
- Children: Number of children/dependents covered by insurance
- Smoker: Smoking status (yes, no)
- Region: Beneficiary's residential area in the US (northeast, northwest, southeast, southwest)
- Charges: Individual medical insurance costs billed by health insurance (target variable in USD)
Data Type: Mixed (numerical: age, BMI, children, charges; categorical: sex, smoker, region).
Format: CSV file.
Cost Distribution: Right-skewed with most costs in lower range and some very high-cost cases.

Why This Dataset

This dataset represents real-world healthcare economics where multiple personal and lifestyle factors combine to determine insurance costs. It provides clear relationships between predictors and outcomes while maintaining complexity through interactions and non-linear patterns. It's ideal for projects that aim to:

Predict individual medical insurance costs using regression models.
Understand how demographic and lifestyle factors influence healthcare expenses.
Quantify the cost impact of smoking, obesity, and other risk factors.
Identify interaction effects between variables (e.g., smoking × BMI, age × smoking).
Build fair and explainable pricing models for insurance applications.
Perform feature engineering to capture non-linear relationships.
Handle outliers and skewed distributions in cost data.
Compare different regression algorithms for cost prediction.

How to Use the Dataset

Download the CSV file from Kaggle.
Load into Python using Pandas: import pandas as pd, then df = pd.read_csv('insurance.csv') (adjust the filename to match your download).
Explore the structure using .info(), .head(), .describe() to understand data composition.
Check for missing values using .isnull().sum() (typically none in this dataset).
Analyze target distribution:
- Histogram of charges showing right-skewed distribution
- Box plot to identify outliers
- Consider log transformation for normality
Visualize relationships:
- Scatter plots of age, BMI vs. charges
- Box plots comparing charges by smoker status, sex, region
- Pair plots to see feature interactions
- Correlation heatmap for numerical features
Explore key patterns:
- Smokers have dramatically higher costs
- Age positively correlates with costs
- BMI impact varies by smoking status
- Regional differences exist but are smaller
Encode categorical variables:
- Binary encoding for sex (0/1) and smoker (0/1)
- One-hot encoding for region
- Reserve label (integer) encoding for ordered categories such as engineered age or BMI bins; sex, smoker, and region have no natural order
Engineer features:
- Age groups/bins (young, middle-aged, senior)
- BMI categories (underweight, normal, overweight, obese)
- Interaction terms: smoker × BMI, smoker × age
- Polynomial features for age and BMI
- Family size (children + 1 for the beneficiary)
- Risk score combining multiple factors
Handle outliers carefully - high costs may be legitimate, not errors.
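Transform target variable using log transformation to handle skewness: np.log(df['charges']). Convert predictions back to dollars with np.exp before reporting dollar-based metrics.
Scale features using StandardScaler or RobustScaler if needed for certain algorithms (e.g., SVR, Ridge/Lasso, neural networks; tree-based models do not need it). Fit the scaler on the training split only to avoid leakage.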
Split data using train-test split (70-30 or 80-20).
Train regression models including:
- Linear Regression
- Ridge/Lasso Regression (regularization)
- Decision Tree Regression
- Random Forest Regression
- Gradient Boosting (XGBoost, LightGBM, CatBoost)
- Support Vector Regression (SVR)
- Neural Networks
Evaluate performance using:
- R² score (coefficient of determination)
- RMSE (Root Mean Squared Error)
- MAE (Mean Absolute Error)
- MAPE (Mean Absolute Percentage Error)
Interpret results using coefficient analysis, feature importance, SHAP values, or partial dependence plots.
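
The steps above can be sketched end to end as follows. This is a minimal baseline under stated assumptions, not the dataset author's method: it assumes lowercase column names (age, sex, bmi, children, smoker, region, charges), and encode_features and fit_baseline are hypothetical helper names introduced for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split


def encode_features(df: pd.DataFrame) -> pd.DataFrame:
    """Binary-encode sex and smoker, one-hot encode region."""
    X = df.drop(columns=["charges"]).copy()
    X["sex"] = (X["sex"] == "male").astype(int)
    X["smoker"] = (X["smoker"] == "yes").astype(int)
    # drop_first removes one region column that would be perfectly collinear
    return pd.get_dummies(X, columns=["region"], drop_first=True, dtype=int)


def fit_baseline(df: pd.DataFrame, test_size: float = 0.2, seed: int = 42) -> dict:
    """Fit linear regression on log(charges) and report metrics in USD."""
    X = encode_features(df)
    y_log = np.log(df["charges"])
    X_train, X_test, y_train, y_test = train_test_split(
        X, y_log, test_size=test_size, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    # Back-transform so errors are in dollars rather than log-dollars
    pred_usd = np.exp(model.predict(X_test))
    true_usd = np.exp(y_test)
    return {
        "model": model,
        "r2": r2_score(true_usd, pred_usd),
        "mae": mean_absolute_error(true_usd, pred_usd),
        "n_test": len(y_test),
    }
```

Typical usage would be df = pd.read_csv('insurance.csv') followed by result = fit_baseline(df). The split happens before fitting, so any scaler added later should be fit on X_train only.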

Possible Project Ideas

Insurance cost predictor building accurate regression models for premium estimation.
Feature importance analysis identifying which factors most influence insurance costs.
Smoking impact study quantifying the cost difference between smokers and non-smokers.
BMI threshold analysis determining at what BMI level costs significantly increase.
Interaction effect exploration investigating how factors combine (e.g., smoking + high BMI).
Regional cost comparison analyzing geographic variations in insurance pricing.
Risk stratification system categorizing individuals into low, medium, high-cost groups.
Interactive cost calculator with Streamlit allowing users to input characteristics and estimate costs.
Polynomial regression study capturing non-linear age and BMI effects.
Ensemble learning approach combining multiple models for improved predictions.
Outlier analysis investigating extremely high-cost cases and their characteristics.
Fairness assessment evaluating if pricing models exhibit demographic bias.
Explainable AI dashboard using SHAP to show cost drivers for individual predictions.
Cost optimization advisor suggesting lifestyle changes to reduce insurance premiums.
Comparative algorithm study benchmarking linear vs. tree-based vs. neural network approaches.

Dataset Challenges and Considerations

Right-skewed Distribution: Charges are heavily right-skewed; log transformation often helps.
Outliers: Some very high-cost individuals exist; determine if legitimate or errors.
Sample Size: 1,338 samples is modest; complex models may overfit.
Smoking Dominance: Smoking status is such a strong predictor it may overshadow other factors.
Interaction Effects: Simple linear models miss important interactions between variables.
Limited Features: Only 6 predictors; real insurance pricing uses many more factors.
Regional Simplification: Only 4 US regions; real geography is more nuanced.
Missing Health Conditions: Dataset lacks chronic disease information that affects costs.
Temporal Aspect: Static snapshot; doesn't capture cost changes over time.
Selection Bias: Dataset may not represent entire insured population.

Key Cost Drivers (Insights)

Smoking Status:

Smokers' average charges are roughly 3-4 times those of non-smokers
Strongest single predictor in the dataset
Interaction with BMI amplifies costs
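
The smoking figures above are easy to check on your own copy of the data. The sketch below assumes lowercase column names (smoker, bmi, charges) with smoker coded as "yes"/"no"; both function names are hypothetical.

```python
import pandas as pd


def smoker_cost_ratio(df: pd.DataFrame) -> float:
    """Mean charges of smokers divided by mean charges of non-smokers."""
    means = df.groupby("smoker")["charges"].mean()
    return float(means["yes"] / means["no"])


def mean_charges_by_smoker_and_obesity(df: pd.DataFrame) -> pd.DataFrame:
    """Mean charges for each smoker status, split at BMI >= 30."""
    labels = (df["bmi"] >= 30).map({True: "obese", False: "not obese"})
    table = (
        df.assign(bmi_group=labels)
        .groupby(["smoker", "bmi_group"])["charges"]
        .mean()
    )
    # Rows: smoker status; columns: BMI group
    return table.unstack("bmi_group")
```

If the smoker × BMI interaction holds, the gap between the two BMI columns should be much wider in the smoker row than in the non-smoker row.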

Age:

Positive, roughly linear relationship with costs
Costs increase steadily with age
Smoking shifts costs sharply upward at every age; test an age × smoker term on the data rather than assuming a strong interaction

BMI (Body Mass Index):

Moderate positive correlation with costs
Impact is amplified for smokers
Non-linear relationship (costs accelerate at higher BMI)

Children:

Weak positive correlation
More dependents slightly increase costs
Effect is relatively minor compared to other factors

Sex:

Minimal impact on costs
Slight differences exist but not statistically significant in most analyses

Region:

Small variations between regions
Southeast tends to have slightly higher costs
Differences are minor compared to lifestyle factors

Feature Engineering Strategies

Age-based Features:

Age bins: 18-30, 31-45, 46-60, 61+
Age squared or polynomial terms
Age × Smoker interaction

BMI-based Features:

BMI categories: Underweight (<18.5), Normal (18.5 to <25), Overweight (25 to <30), Obese (≥30)
BMI squared for non-linearity
BMI × Smoker interaction (critical)
Distance from healthy BMI range
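
The age and BMI features above could be built with pandas roughly as follows. This is a sketch under assumptions: lowercase column names (age, bmi), integer ages, a last age bin of 61 and older, and the 18.5 to 25 range as the healthy BMI band. The function and new column names are hypothetical.

```python
import numpy as np
import pandas as pd


def add_age_bmi_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with binned and polynomial age/BMI features."""
    out = df.copy()
    # Right-closed bins on integer ages: 18-30, 31-45, 46-60, 61+
    out["age_group"] = pd.cut(
        out["age"],
        bins=[17, 30, 45, 60, np.inf],
        labels=["18-30", "31-45", "46-60", "61+"],
    )
    # right=False gives half-open bins [low, high), so a BMI of exactly 25.0
    # is "overweight" and no value falls into a gap between categories
    out["bmi_category"] = pd.cut(
        out["bmi"],
        bins=[0, 18.5, 25, 30, np.inf],
        labels=["underweight", "normal", "overweight", "obese"],
        right=False,
    )
    out["age_sq"] = out["age"] ** 2
    out["bmi_sq"] = out["bmi"] ** 2
    # Distance from the healthy band: 0 inside it, positive outside
    below = (18.5 - out["bmi"]).clip(lower=0)
    above = (out["bmi"] - 25).clip(lower=0)
    out["bmi_distance"] = below + above
    return out
```

Half-open BMI bins matter because BMI is continuous: a value such as 24.95 would otherwise fall between "18.5-24.9" and "25-29.9".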

Composite Risk Scores:

Illustrative score: Risk = (Age/100) + (BMI/50) + (Smoker × 2), with Smoker coded 0/1; the weights are arbitrary, not actuarial
Weighted combination of factors
Binary high-risk flag
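
As a worked example of the composite score above, the sketch below implements the formula directly. The threshold used for the binary flag is a hypothetical cut-off chosen only for illustration, and both function names are assumptions.

```python
def risk_score(age: float, bmi: float, smoker: bool) -> float:
    """Illustrative score from the text: age/100 + bmi/50 + 2 if smoker."""
    return age / 100 + bmi / 50 + (2.0 if smoker else 0.0)


def is_high_risk(age: float, bmi: float, smoker: bool, threshold: float = 2.0) -> bool:
    """Binary flag; the default threshold is a hypothetical cut-off."""
    return risk_score(age, bmi, smoker) >= threshold
```

For a 50-year-old smoker with a BMI of 25, the score is 0.5 + 0.5 + 2 = 3.0. Because the smoker term adds 2 while the age and BMI terms rarely exceed 1 each, this particular weighting is dominated by smoking status; a weighted combination fitted to the data would be a more defensible choice.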

Interaction Terms:

Smoker × BMI (most important)
Smoker × Age
Age × BMI
Three-way: Smoker × Age × BMI
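
The interaction terms listed above are plain products of columns. A minimal sketch, assuming lowercase column names (age, bmi, smoker) with smoker coded "yes"/"no"; the function and new column names are hypothetical.

```python
import pandas as pd


def add_interactions(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with two-way and three-way interaction columns."""
    out = df.copy()
    smoker = (out["smoker"] == "yes").astype(int)  # 1 for smokers, 0 otherwise
    out["smoker_x_bmi"] = smoker * out["bmi"]
    out["smoker_x_age"] = smoker * out["age"]
    out["age_x_bmi"] = out["age"] * out["bmi"]
    out["smoker_x_age_x_bmi"] = smoker * out["age"] * out["bmi"]
    return out
```

These columns mainly help linear models; tree-based models can learn such interactions from the raw features.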

Family Features:

Has_children (binary)
Family_size (children + 1)

Model Performance Expectations

Linear Regression:

R² score: ~0.75-0.78
RMSE: ~6,000-6,500
Simple baseline, interpretable

Polynomial Regression (degree 2-3):

R² score: ~0.82-0.85
RMSE: ~5,000-5,500
Captures non-linearity better

Random Forest:

R² score: ~0.85-0.87
RMSE: ~4,500-5,000
Handles interactions well

Gradient Boosting (XGBoost/LightGBM):

R² score: ~0.86-0.89
RMSE: ~4,200-4,800
Best performance typically

Neural Networks:

R² score: ~0.84-0.87
RMSE: ~4,500-5,200
May overfit on small dataset

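Note: These are rough, indicative ranges, with RMSE in USD on the original charges scale. Actual results vary with the train-test split, feature engineering quality, and hyperparameter tuning.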
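
Since these figures depend on the setup, it is worth reproducing them on your own split. The sketch below compares two of the model families with k-fold cross-validation in scikit-learn. It assumes X is an already-encoded numeric feature table and y is charges in USD; benchmark_models and its defaults are hypothetical choices, not tuned settings.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate


def benchmark_models(X, y, n_splits: int = 5, n_estimators: int = 200,
                     seed: int = 0) -> pd.DataFrame:
    """Cross-validated mean R² and RMSE for a linear and a forest model."""
    models = {
        "linear": LinearRegression(),
        "random_forest": RandomForestRegressor(
            n_estimators=n_estimators, random_state=seed
        ),
    }
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    rows = []
    for name, model in models.items():
        scores = cross_validate(
            model, X, y, cv=cv,
            scoring=("r2", "neg_root_mean_squared_error"),
        )
        rows.append({
            "model": name,
            "r2_mean": scores["test_r2"].mean(),
            # scikit-learn reports RMSE negated so that higher is better
            "rmse_mean": -scores["test_neg_root_mean_squared_error"].mean(),
        })
    return pd.DataFrame(rows).set_index("model")
```

Other estimators with the scikit-learn interface can be added to the models dictionary in the same way.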

Evaluation Considerations

R² Score: Measures proportion of variance explained
RMSE: Penalizes large errors more heavily (important for cost prediction)
MAE: More interpretable as average dollar error
MAPE: Percentage error gives sense of relative accuracy
Residual Analysis: Check for patterns in prediction errors
Cross-Validation: Use k-fold CV for reliable performance estimates
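
The four metrics above can be written out in a few lines of NumPy, which makes their definitions explicit. This is a sketch for illustration (regression_metrics is a hypothetical name); scikit-learn provides equivalent functions.

```python
import numpy as np


def regression_metrics(y_true, y_pred) -> dict:
    """Compute R², RMSE, MAE and MAPE (in percent) for two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    ss_res = np.sum(err ** 2)                        # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return {
        "r2": float(1.0 - ss_res / ss_tot),
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(np.abs(err))),
        # MAPE is undefined if any true value is 0; charges are positive here
        "mape": float(np.mean(np.abs(err / y_true)) * 100),
    }
```

For true values [100, 200, 300] and predictions [110, 190, 330], this gives MAE ≈ 16.67, RMSE ≈ 19.15, MAPE ≈ 8.33% and R² = 0.945. RMSE exceeds MAE because the single 30-dollar miss is squared before averaging.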

Ethical and Practical Considerations

Fairness Concerns:

Ensure models don't discriminate based on protected characteristics
Sex and region should have minimal weight if not actuarially justified
Transparency in how factors affect pricing
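
One simple way to start on the fairness points above is to compare prediction errors across groups such as sex or region. The sketch below is only a first diagnostic under assumptions (error_by_group is a hypothetical name) and is not a complete fairness audit.

```python
import pandas as pd


def error_by_group(y_true, y_pred, groups) -> pd.DataFrame:
    """Per-group count, mean signed error and mean absolute error."""
    frame = pd.DataFrame({
        "group": list(groups),
        "error": pd.Series(list(y_pred), dtype=float)
        - pd.Series(list(y_true), dtype=float),
    })
    frame["abs_error"] = frame["error"].abs()
    return frame.groupby("group").agg(
        n=("error", "size"),
        mean_error=("error", "mean"),  # > 0: overestimates on average
        mae=("abs_error", "mean"),
    )
```

A mean_error that differs clearly between groups suggests the model systematically over- or under-prices one of them, which warrants closer inspection.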

Interpretability:

Insurance pricing must be explainable to customers and regulators
Use SHAP or coefficient analysis for transparency
Provide clear reasoning for cost estimates

Actionable Insights:

Identify modifiable risk factors (smoking, BMI)
Provide guidance on cost reduction through lifestyle changes
Support preventive healthcare initiatives

Regulatory Compliance:

Insurance pricing is heavily regulated
Models must comply with state/federal laws
Some factors may be prohibited in certain jurisdictions

Business Application:

Balance accuracy with simplicity for customer understanding
Consider competitive pricing pressures
Ensure sustainable risk pools

Create Date January 30, 2026
Last Updated April 14, 2026