- Create Date January 30, 2026
- Last Updated January 30, 2026
The Medical Insurance Cost Dataset, created by Mosab Abdelghany, contains individual medical insurance cost data along with various demographic and lifestyle factors that influence insurance premiums. This dataset provides insights into how insurance companies assess risk and determine pricing based on personal characteristics, health indicators, and behavioral factors.
Available on Kaggle, this dataset is well suited to building regression models that predict insurance costs, understanding the factors that drive healthcare expenses, exploring actuarial pricing mechanisms, and developing fair and transparent insurance pricing models. That makes it valuable both for healthcare analytics practice and for real-world applications in insurance, healthcare policy, and risk assessment.
Key Features
- Records: 1,338 individual insurance beneficiaries.
- Variables: 7 columns (6 predictors plus the target):
- Age: Age of the primary beneficiary (years)
- Sex: Gender of the insurance contractor (male, female)
- BMI: Body Mass Index - measure of body fat based on height and weight (kg/m²)
- Children: Number of children/dependents covered by insurance
- Smoker: Smoking status (yes, no)
- Region: Beneficiary's residential area in the US (northeast, northwest, southeast, southwest)
- Charges: Individual medical insurance costs billed by health insurance (target variable in USD)
- Data Type: Mixed (numerical age, BMI, children, charges; categorical sex, smoker, region).
- Format: CSV file.
- Cost Distribution: Right-skewed with most costs in lower range and some very high-cost cases.
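The column layout described above can be checked right after loading. The following is a minimal sketch, assuming the CSV uses lowercase column names (age, sex, bmi, children, smoker, region, charges); the helper name check_schema is hypothetical and not part of any library.

```python
import pandas as pd

# Assumed column names (lowercase); adjust if the downloaded file differs.
EXPECTED_COLUMNS = ["age", "sex", "bmi", "children", "smoker", "region", "charges"]
ALLOWED_VALUES = {
    "sex": {"male", "female"},
    "smoker": {"yes", "no"},
    "region": {"northeast", "northwest", "southeast", "southwest"},
}


def check_schema(df: pd.DataFrame) -> list:
    """Return a list of problems; an empty list means the frame matches the description."""
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        # Without the expected columns the remaining checks cannot run.
        return [f"missing columns: {missing}"]
    problems = []
    for col, allowed in ALLOWED_VALUES.items():
        unexpected = set(df[col].dropna().unique()) - allowed
        if unexpected:
            problems.append(f"unexpected values in {col}: {sorted(unexpected)}")
    null_count = int(df[EXPECTED_COLUMNS].isnull().sum().sum())
    if null_count:
        problems.append(f"{null_count} missing values")
    if (df["charges"].dropna() <= 0).any():
        problems.append("non-positive charges")
    return problems
```

Calling check_schema(pd.read_csv('insurance.csv')) and getting an empty list back would suggest the file matches the description; any returned messages point at what to inspect first.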
Why This Dataset
This dataset represents real-world healthcare economics where multiple personal and lifestyle factors combine to determine insurance costs. It provides clear relationships between predictors and outcomes while maintaining complexity through interactions and non-linear patterns. It's ideal for projects that aim to:
- Predict individual medical insurance costs using regression models.
- Understand how demographic and lifestyle factors influence healthcare expenses.
- Quantify the cost impact of smoking, obesity, and other risk factors.
- Identify interaction effects between variables (e.g., smoking × BMI, age × smoking).
- Build fair and explainable pricing models for insurance applications.
- Perform feature engineering to capture non-linear relationships.
- Handle outliers and skewed distributions in cost data.
- Compare different regression algorithms for cost prediction.
How to Use the Dataset
- Download the CSV file from Kaggle.
- Load into Python using Pandas: df = pd.read_csv('insurance.csv').
- Explore the structure using .info(), .head(), and .describe() to understand data composition.
- Check for missing values using .isnull().sum() (typically none in this dataset).
- Analyze target distribution:
- Histogram of charges showing right-skewed distribution
- Box plot to identify outliers
- Consider log transformation for normality
- Visualize relationships:
- Scatter plots of age, BMI vs. charges
- Box plots comparing charges by smoker status, sex, region
- Pair plots to see feature interactions
- Correlation heatmap for numerical features
- Explore key patterns:
- Smokers have dramatically higher costs
- Age positively correlates with costs
- BMI impact varies by smoking status
- Regional differences exist but are smaller
- Encode categorical variables:
- Binary encoding for sex (0/1) and smoker (0/1)
- One-hot encoding for region
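- Consider label encoding only for ordinal features (sex, smoker, and region are not ordinal, so this mainly applies to engineered bins such as age groups or BMI categories)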
- Engineer features:
- Age groups/bins (young, middle-aged, senior)
- BMI categories (underweight, normal, overweight, obese)
- Interaction terms: smoker × BMI, smoker × age
- Polynomial features for age and BMI
- Family size (children + 1 for the beneficiary)
- Risk score combining multiple factors
- Handle outliers carefully - high costs may be legitimate, not errors.
- Transform target variable using log transformation to handle skewness: np.log(charges).
- Scale features using StandardScaler or RobustScaler if needed for certain algorithms.
- Split data using train-test split (70-30 or 80-20).
- Train regression models including:
- Linear Regression
- Ridge/Lasso Regression (regularization)
- Decision Tree Regression
- Random Forest Regression
- Gradient Boosting (XGBoost, LightGBM, CatBoost)
- Support Vector Regression (SVR)
- Neural Networks
- Evaluate performance using:
- R² score (coefficient of determination)
- RMSE (Root Mean Squared Error)
- MAE (Mean Absolute Error)
- MAPE (Mean Absolute Percentage Error)
- Interpret results using coefficient analysis, feature importance, SHAP values, or partial dependence plots.
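The encoding, scaling, log-transform, split, training, and evaluation steps above can be combined into one scikit-learn pipeline. This is a sketch under stated assumptions, not a reference implementation: it assumes lowercase column names, uses plain linear regression as the baseline, and the function names build_model and evaluate are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERIC = ["age", "bmi", "children"]       # assumed column names
CATEGORICAL = ["sex", "smoker", "region"]  # assumed column names


def build_model():
    """Linear regression on scaled/one-hot features, fitted to log(charges)."""
    preprocess = ColumnTransformer([
        ("num", StandardScaler(), NUMERIC),
        # Full one-hot encoding; unseen categories at predict time become all zeros.
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
    ])
    pipeline = Pipeline([("preprocess", preprocess), ("regress", LinearRegression())])
    # Train on log(charges) and map predictions back to dollars with exp.
    return TransformedTargetRegressor(regressor=pipeline, func=np.log, inverse_func=np.exp)


def evaluate(df, test_size=0.2, seed=42):
    """80-20 split by default; metrics are reported on the original dollar scale."""
    X = df[NUMERIC + CATEGORICAL]
    y = df["charges"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    model = build_model().fit(X_train, y_train)
    pred = model.predict(X_test)
    return {
        "r2": float(r2_score(y_test, pred)),
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
        "mae": float(mean_absolute_error(y_test, pred)),
    }
```

With the CSV in the working directory, evaluate(pd.read_csv('insurance.csv')) would return the three metrics for this baseline. Wrapping the preprocessing in the pipeline keeps the scaler and encoder fitted on the training split only, which avoids leaking test information.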
Possible Project Ideas
- Insurance cost predictor building accurate regression models for premium estimation.
- Feature importance analysis identifying which factors most influence insurance costs.
- Smoking impact study quantifying the cost difference between smokers and non-smokers.
- BMI threshold analysis determining at what BMI level costs significantly increase.
- Interaction effect exploration investigating how factors combine (e.g., smoking + high BMI).
- Regional cost comparison analyzing geographic variations in insurance pricing.
- Risk stratification system categorizing individuals into low, medium, high-cost groups.
- Interactive cost calculator with Streamlit allowing users to input characteristics and estimate costs.
- Polynomial regression study capturing non-linear age and BMI effects.
- Ensemble learning approach combining multiple models for improved predictions.
- Outlier analysis investigating extremely high-cost cases and their characteristics.
- Fairness assessment evaluating if pricing models exhibit demographic bias.
- Explainable AI dashboard using SHAP to show cost drivers for individual predictions.
- Cost optimization advisor suggesting lifestyle changes to reduce insurance premiums.
- Comparative algorithm study benchmarking linear vs. tree-based vs. neural network approaches.
Dataset Challenges and Considerations
- Right-skewed Distribution: Charges are heavily right-skewed; log transformation often helps.
- Outliers: Some very high-cost individuals exist; determine whether they are legitimate cases or data errors before removing anything.
- Sample Size: 1,338 samples is modest; complex models may overfit.
- Smoking Dominance: Smoking status is such a strong predictor it may overshadow other factors.
- Interaction Effects: Simple linear models miss important interactions between variables.
- Limited Features: Only 6 predictors plus the target; real insurance pricing uses many more factors.
- Regional Simplification: Only 4 US regions; real geography is more nuanced.
- Missing Health Conditions: Dataset lacks chronic disease information that affects costs.
- Temporal Aspect: Static snapshot; doesn't capture cost changes over time.
- Selection Bias: Dataset may not represent entire insured population.
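The skewness and outlier points above can be quantified before deciding on a transformation. The sketch below uses only NumPy and runs on synthetic log-normal values as a stand-in for the charges column (an assumption for illustration, not the real data); the helper names skewness and iqr_outlier_mask are hypothetical.

```python
import numpy as np


def skewness(values):
    """Sample skewness (third standardized moment); 0 for a symmetric sample."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    return float((centered ** 3).mean() / (centered ** 2).mean() ** 1.5)


def iqr_outlier_mask(values, k=1.5):
    """Boolean mask of points more than k * IQR outside the quartiles."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)


# Synthetic stand-in for df["charges"]; replace with the real column.
rng = np.random.default_rng(0)
charges = rng.lognormal(mean=9.0, sigma=0.9, size=2000)

print("skewness of raw values:", round(skewness(charges), 2))
print("skewness after log:", round(skewness(np.log(charges)), 2))
print("points flagged by the IQR rule:", int(iqr_outlier_mask(charges).sum()))
```

The mask only flags rows for review. Since high charges may be legitimate, flagged rows are better inspected than dropped automatically.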
Key Cost Drivers (Insights)
Smoking Status:
- Smokers have approximately 3-4x higher insurance costs on average
- Strongest single predictor in the dataset
- Interaction with BMI amplifies costs
Age:
- Positive linear relationship with costs
- Costs increase steadily with age
- Interaction with smoking compounds the increase (worth testing with an age × smoker term)
BMI (Body Mass Index):
- Moderate positive correlation with costs
- Impact is amplified for smokers
- Non-linear relationship (costs accelerate at higher BMI)
Children:
- Weak positive correlation
- More dependents slightly increase costs
- Effect is relatively minor compared to other factors
Sex:
- Minimal impact on costs
- Slight differences exist but not statistically significant in most analyses
Region:
- Small variations between regions
- Southeast tends to have slightly higher costs
- Differences are minor compared to lifestyle factors
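The claims in this section can be checked directly on the data with a few group-level summaries. This is a minimal sketch assuming lowercase column names and smoker values of yes/no; the helper names are hypothetical.

```python
import pandas as pd


def mean_charges_by(df, column):
    """Mean charges per level of a categorical column, highest first."""
    return df.groupby(column)["charges"].mean().sort_values(ascending=False)


def smoker_cost_ratio(df):
    """Mean charges of smokers divided by mean charges of non-smokers."""
    means = df.groupby("smoker")["charges"].mean()
    return float(means["yes"] / means["no"])


def numeric_correlations(df):
    """Pearson correlation of each numeric predictor with charges."""
    corr = df[["age", "bmi", "children", "charges"]].corr()["charges"]
    return corr.drop("charges")
```

For example, smoker_cost_ratio(df) gives the multiple quoted for smokers, mean_charges_by(df, "region") ranks the regions, and numeric_correlations(df) shows how age, BMI, and children compare. Correlations only capture linear association, so the interaction effects listed above need separate checks.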
Feature Engineering Strategies
Age-based Features:
- Age bins: 18-30, 31-45, 46-60, 61+
- Age squared or polynomial terms
- Age × Smoker interaction
BMI-based Features:
- BMI categories: Underweight (<18.5), Normal (18.5-24.9), Overweight (25-29.9), Obese (≥30)
- BMI squared for non-linearity
- BMI × Smoker interaction (critical)
- Distance from healthy BMI range
Composite Risk Scores:
- Example: Risk = (Age/100) + (BMI/50) + (Smoker × 2), with Smoker coded 0/1 (illustrative weights, not calibrated)
- Weighted combination of factors
- Binary high-risk flag
Interaction Terms:
- Smoker × BMI (most important)
- Smoker × Age
- Age × BMI
- Three-way: Smoker × Age × BMI
Family Features:
- Has_children (binary)
- Family_size (children + 1)
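The strategies above translate into a handful of pandas operations. The sketch below assumes lowercase column names and a smoker column holding yes/no; the function name add_features and the new column names are hypothetical, the last age bin starts at 61 so the bins do not overlap, and the risk score uses the illustrative weights from the list rather than calibrated ones.

```python
import numpy as np
import pandas as pd


def add_features(df):
    """Return a copy of df with engineered age, BMI, interaction, and family columns."""
    out = df.copy()
    smoker = (out["smoker"] == "yes").astype(int)
    out["smoker_flag"] = smoker

    # Age bins are right-inclusive: (17, 30], (30, 45], (45, 60], (60, inf).
    out["age_bin"] = pd.cut(
        out["age"], bins=[17, 30, 45, 60, np.inf],
        labels=["18-30", "31-45", "46-60", "61+"],
    )
    # BMI bins are left-inclusive so that 18.5 is normal and 30 is obese.
    out["bmi_category"] = pd.cut(
        out["bmi"], bins=[0, 18.5, 25, 30, np.inf], right=False,
        labels=["underweight", "normal", "overweight", "obese"],
    )
    # Distance from the 18.5-24.9 range; 0 inside the range.
    out["bmi_distance"] = (out["bmi"] - 24.9).clip(lower=0) + (18.5 - out["bmi"]).clip(lower=0)

    # Polynomial and interaction terms.
    out["age_sq"] = out["age"] ** 2
    out["bmi_sq"] = out["bmi"] ** 2
    out["smoker_x_bmi"] = smoker * out["bmi"]
    out["smoker_x_age"] = smoker * out["age"]
    out["age_x_bmi"] = out["age"] * out["bmi"]
    out["smoker_x_age_x_bmi"] = smoker * out["age"] * out["bmi"]

    # Family features.
    out["has_children"] = (out["children"] > 0).astype(int)
    out["family_size"] = out["children"] + 1

    # Illustrative composite score; the weights are not calibrated.
    out["risk_score"] = out["age"] / 100 + out["bmi"] / 50 + smoker * 2
    return out
```

Tree-based models can often discover such interactions on their own, so these columns tend to matter most for linear models.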
Model Performance Expectations
Linear Regression:
- R² score: ~0.75-0.78
- RMSE: ~6,000-6,500
- Simple baseline, interpretable
Polynomial Regression (degree 2-3):
- R² score: ~0.82-0.85
- RMSE: ~5,000-5,500
- Captures non-linearity better
Random Forest:
- R² score: ~0.85-0.87
- RMSE: ~4,500-5,000
- Handles interactions well
Gradient Boosting (XGBoost/LightGBM):
- R² score: ~0.86-0.89
- RMSE: ~4,200-4,800
- Best performance typically
Neural Networks:
- R² score: ~0.84-0.87
- RMSE: ~4,500-5,200
- May overfit on small dataset
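Note: These ranges are indicative, with RMSE in USD on the original charges scale. Performance varies with the train-test split, feature engineering quality, and hyperparameter tuning.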
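Rather than relying on the ranges above, the models can be compared on the same folds. This is a hedged sketch: it assumes X is an already-encoded numeric matrix, uses scikit-learn's GradientBoostingRegressor as a stand-in for XGBoost or LightGBM, keeps default hyperparameters, and the function name compare_models is hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score


def compare_models(X, y, n_splits=5, seed=42):
    """Mean cross-validated R² per model, using identical shuffled folds."""
    models = {
        "linear": LinearRegression(),
        "random_forest": RandomForestRegressor(n_estimators=100, random_state=seed),
        "gradient_boosting": GradientBoostingRegressor(random_state=seed),
    }
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return {
        name: float(cross_val_score(model, X, y, cv=cv, scoring="r2").mean())
        for name, model in models.items()
    }
```

The scores obtained this way depend on the encoding and features passed in, so they may fall outside the ranges listed above.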
Evaluation Considerations
- R² Score: Measures proportion of variance explained
- RMSE: Penalizes large errors more heavily (important for cost prediction)
- MAE: More interpretable as average dollar error
- MAPE: Percentage error gives sense of relative accuracy
- Residual Analysis: Check for patterns in prediction errors
- Cross-Validation: Use k-fold CV for reliable performance estimates
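The four metrics above follow directly from their definitions. Here is a small NumPy sketch; the function name regression_report is hypothetical, and MAPE is only meaningful when no true value is zero, which holds for positive charges.

```python
import numpy as np


def regression_report(y_true, y_pred):
    """Compute R², RMSE, MAE, and MAPE (in percent) from true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_pred - y_true
    ss_res = float((errors ** 2).sum())                      # residual sum of squares
    ss_tot = float(((y_true - y_true.mean()) ** 2).sum())    # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": float(np.sqrt((errors ** 2).mean())),
        "mae": float(np.abs(errors).mean()),
        "mape": float((np.abs(errors) / np.abs(y_true)).mean() * 100),
    }


print(regression_report([100, 200, 300], [110, 190, 300]))
```

Because RMSE squares the errors before averaging, it is never smaller than MAE; a large gap between the two suggests a few very large misses, which is common with high-cost cases.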
Ethical and Practical Considerations
Fairness Concerns:
- Ensure models don't discriminate based on protected characteristics
- Sex and region should have minimal weight if not actuarially justified
- Transparency in how factors affect pricing
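One simple way to start checking the fairness concerns above is to compare prediction errors across groups. This is a minimal diagnostic sketch, not a full fairness audit; the function name error_by_group is hypothetical.

```python
import numpy as np
import pandas as pd


def error_by_group(y_true, y_pred, groups):
    """Mean signed error, mean absolute error, and row count per group."""
    errors = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    frame = pd.DataFrame({"group": list(groups), "error": errors})
    frame["abs_error"] = frame["error"].abs()
    return frame.groupby("group").agg(
        mean_error=("error", "mean"),          # positive means over-prediction
        mean_abs_error=("abs_error", "mean"),
        n=("error", "size"),
    )
```

Passing the test-set charges, the model's predictions, and a column such as sex or region shows whether one group is systematically over- or under-predicted. Similar errors across groups do not by themselves establish that a pricing model is fair.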
Interpretability:
- Insurance pricing must be explainable to customers and regulators
- Use SHAP or coefficient analysis for transparency
- Provide clear reasoning for cost estimates
Actionable Insights:
- Identify modifiable risk factors (smoking, BMI)
- Provide guidance on cost reduction through lifestyle changes
- Support preventive healthcare initiatives
Regulatory Compliance:
- Insurance pricing is heavily regulated
- Models must comply with state/federal laws
- Some factors may be prohibited in certain jurisdictions
Business Application:
- Balance accuracy with simplicity for customer understanding
- Consider competitive pricing pressures
- Ensure sustainable risk pools