Wine Quality Dataset

The Wine Quality Dataset, uploaded by Yasser H and originally from the UCI Machine Learning Repository, contains physicochemical measurements and sensory quality ratings for red and white variants of Portuguese "Vinho Verde" wine. Each sample pairs its chemical measurements with a quality rating assigned by wine experts, which makes the dataset suitable for both classification and regression tasks in predictive analytics.

Available on Kaggle, this dataset is well suited to building models that predict wine quality from objective chemical measurements, understanding feature relationships in sensory evaluation, and exploring how different physicochemical properties influence wine quality. It is valuable for both data science practice and real-world applications in food science and quality control.

Key Features

Records: 6,497 wine samples in the original UCI data (red and white combined).
- Red wine: 1,599 samples
- White wine: 4,898 samples
Variables: 11 physicochemical features, an optional wine type column, and the target variable:
- Fixed Acidity: Tartaric acid content (g/dm³)
- Volatile Acidity: Acetic acid content (g/dm³); high levels can lead to an unpleasant vinegar taste
- Citric Acid: Adds freshness and flavor (g/dm³)
- Residual Sugar: Sugar remaining after fermentation (g/dm³)
- Chlorides: Salt content (g/dm³)
- Free Sulfur Dioxide: Free form of SO₂ (mg/dm³)
- Total Sulfur Dioxide: Total SO₂ content (mg/dm³)
- Density: Density of wine (g/cm³)
- pH: Acidity level (0-14 scale)
- Sulphates: Wine additive contributing to SO₂ levels (g/dm³)
- Alcohol: Alcohol percentage by volume
- Wine Type: Red or white wine (if included)
- Quality: Target variable, a score between 0 and 10 (typically 3-9 in practice)
Data Type: Numerical (continuous physicochemical measurements) with ordinal target variable.
Format: CSV file.
Quality Distribution: Imbalanced - most wines rated 5-7, few rated 3-4 or 8-9.
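
If the data arrives as separate red and white files, the two can be combined into one table with a wine type column before checking the quality distribution. The sketch below is only an illustration: the helper name `load_wine` is hypothetical, the semicolon delimiter follows the original UCI files (a Kaggle CSV may be comma-delimited and already combined), and the in-memory rows are made-up stand-ins so the example runs without the real files.

```python
import io

import pandas as pd


def load_wine(red_source, white_source, sep=";"):
    """Combine red and white files into one frame with a wine_type column.

    Hypothetical helper: sources may be file paths or file-like objects.
    """
    red = pd.read_csv(red_source, sep=sep)
    white = pd.read_csv(white_source, sep=sep)
    red["wine_type"] = "red"
    white["wine_type"] = "white"
    return pd.concat([red, white], ignore_index=True)


# Made-up stand-in rows (not real samples) so the sketch is runnable as-is.
red_csv = io.StringIO("alcohol;pH;quality\n9.4;3.51;5\n9.8;3.20;5\n")
white_csv = io.StringIO("alcohol;pH;quality\n8.8;3.00;6\n")

wine = load_wine(red_csv, white_csv)

# Samples per wine type and per quality score.
print(wine["wine_type"].value_counts())
print(wine["quality"].value_counts().sort_index())
```

With the real files, pass their paths instead of the `StringIO` objects.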

Why This Dataset

This dataset bridges chemistry and sensory science, allowing exploration of how objective measurements relate to subjective quality assessments. It provides real-world complexity with imbalanced classes and correlated features. It's ideal for projects that aim to:

Predict wine quality ratings from chemical properties (classification or regression).
Identify which physicochemical factors most influence quality perception.
Compare red and white wine characteristics and quality determinants.
Handle ordinal target variables in machine learning problems.
Address class imbalance in multi-class classification scenarios.
Perform feature engineering and selection for chemical data.
Build interpretable models for domain experts in oenology.
Explore relationships between correlated chemical properties.

How to Use the Dataset

Download the CSV file from Kaggle.
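Load into Python using Pandas: import pandas as pd, then df = pd.read_csv('wine_quality.csv'). The original UCI files are semicolon-delimited, so pass sep=';' when loading those.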
Explore the structure using .info(), .head(), .describe() to understand feature distributions.
Check for missing values using .isnull().sum() (typically none in this dataset).
Analyze target distribution using .value_counts() on quality scores to understand class imbalance.
Visualize distributions:
- Histograms for each chemical property
- Box plots comparing features across quality ratings
- Correlation heatmap to identify multicollinearity
- Scatter plots of key features vs. quality
Handle wine types if dataset includes both red and white:
- Analyze separately or create binary feature
- Compare chemical profiles between types
Address quality variable:
- Classification approach: Treat as multi-class (3, 4, 5, 6, 7, 8, 9)
- Simplified classification: Binary (good vs. bad) or three-class (low, medium, high)
- Regression approach: Predict exact quality score
Engineer features:
- Ratio features (e.g., free_SO2/total_SO2)
- Acidity combinations (fixed + volatile + citric)
- Polynomial features for non-linear relationships
- Binning continuous variables
- Interaction terms between related chemicals
Handle outliers using IQR method or domain knowledge about realistic ranges.
Scale features using StandardScaler or RobustScaler for chemical measurements.
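Address class imbalance:
- SMOTE for oversampling minority classes (apply to the training split only)
- Class weights in algorithms
- Combine adjacent quality scores to reduce classes
Split data using stratified train-test split to maintain quality distribution. Split before fitting scalers or resampling, so that no information from the test set leaks into training.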
Train models including:
- Logistic Regression (for classification)
- Linear/Ridge/Lasso Regression (for regression)
- Random Forest
- Gradient Boosting (XGBoost, LightGBM, CatBoost)
- SVM
- Neural Networks
Evaluate performance:
- Classification: Accuracy, F1-score (macro/weighted), confusion matrix, per-class metrics
- Regression: RMSE, MAE, R², mean absolute percentage error
Interpret results using feature importance plots, SHAP values, or coefficient analysis.
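
The split, scale, train, and evaluate steps above can be sketched as one small scikit-learn pipeline. This is a minimal illustration under stated assumptions, not a tuned solution: the function name `train_quality_classifier` is hypothetical, the data is synthetic (generated here so the example runs without the CSV), and class weights stand in for SMOTE as the imbalance remedy.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def train_quality_classifier(df, target="quality", seed=0):
    """Stratified split, then scaler + random forest fitted on the training part only."""
    X = df.drop(columns=[target])
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    # The pipeline fits the scaler on the training data only, avoiding leakage.
    model = make_pipeline(
        StandardScaler(),
        RandomForestClassifier(
            n_estimators=100, class_weight="balanced", random_state=seed
        ),
    )
    model.fit(X_train, y_train)
    macro_f1 = f1_score(y_test, model.predict(X_test), average="macro")
    return model, macro_f1


# Synthetic stand-in data (assumption): quality rises with alcohol and
# falls with volatile acidity, with imbalanced classes 5, 6, and 7.
rng = np.random.default_rng(0)
n = 400
alcohol = rng.uniform(8, 14, n)
volatile_acidity = rng.uniform(0.1, 1.0, n)
score = alcohol - 4 * volatile_acidity + rng.normal(0, 0.5, n)
q_lo, q_hi = np.quantile(score, [0.3, 0.8])
quality = np.where(score > q_hi, 7, np.where(score > q_lo, 6, 5))

demo = pd.DataFrame(
    {"alcohol": alcohol, "volatile acidity": volatile_acidity, "quality": quality}
)
model, macro_f1 = train_quality_classifier(demo)
print(f"macro F1 on synthetic data: {macro_f1:.2f}")
```

Scores on this synthetic data say nothing about the real dataset; the point is the order of operations.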

Possible Project Ideas

Multi-class quality classifier predicting exact quality ratings (3-9 scale).
Binary quality classifier distinguishing good wines (≥7) from average/poor wines (<7).
Regression model predicting continuous quality scores with prediction intervals.
Feature importance study identifying which chemical properties most influence quality.
Red vs. white wine comparison analyzing differences in quality determinants.
Optimal wine profile generator finding ideal chemical combinations for high-quality ratings.
Imbalanced learning study comparing SMOTE, class weights, and ensemble methods.
Ensemble learning system combining multiple models using voting or stacking.
Clustering analysis discovering natural wine groupings based on chemical profiles.
Outlier detection system identifying unusual wines or potential data errors.
Interactive wine quality predictor with Streamlit allowing chemical input for quality estimation.
Feature engineering comparison evaluating impact of different engineered features.
Explainable AI dashboard using SHAP to show how each chemical contributes to predictions.
Ordinal regression model specifically handling ordinal nature of quality scores.
Domain-specific validation testing if predictions align with oenological knowledge.

Dataset Challenges and Considerations

Class Imbalance: Most wines are rated 5-7, with scores 5 and 6 alone covering the majority; very few are rated 3-4 or 8-9, which makes predicting extreme quality difficult.
Subjectivity: Quality scores are based on sensory evaluation by experts, which can vary.
Ordinal Target: Quality is ordinal (7 > 6 > 5), but many algorithms treat it as nominal or continuous.
Feature Correlation: Many chemical properties are correlated (e.g., density with alcohol and sugar).
Limited Quality Range: Actual scores typically range 3-9, not full 0-10 scale.
Sample Representation: Dataset represents specific Portuguese wine region (Vinho Verde).
Outliers: Some chemical measurements may be outliers or measurement errors.
Feature Scaling: Different chemical properties have vastly different scales and units.
Non-linear Relationships: Wine quality may depend on non-linear combinations of chemicals.
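
For the outlier item above, one common first pass is the IQR rule. The sketch below flags values beyond 1.5 × IQR from the quartiles; the helper name `iqr_outlier_mask` and the sample chloride values are made up for illustration, and a flagged value is only a candidate for review, not proof of a measurement error.

```python
import pandas as pd


def iqr_outlier_mask(series, k=1.5):
    """Return a boolean mask that is True where a value lies outside
    [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Made-up chloride values with one suspicious entry at the end.
chlorides = pd.Series([0.05, 0.04, 0.06, 0.05, 0.045, 0.055, 0.61])
mask = iqr_outlier_mask(chlorides)
print(chlorides[mask])
```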

Key Chemical Relationships

Quality Influencers (Generally):

Positive correlations: Higher alcohol content, moderate acidity levels, optimal sulfur dioxide
Negative correlations: High volatile acidity (vinegar taste), excessive chlorides
Complex relationships: Residual sugar (varies by wine type), pH interactions

Red vs. White Wine Differences:

White wines typically have higher residual sugar and sulfur dioxide
Red wines often have higher pH and volatile acidity
Quality determinants may differ between types

Chemical Interactions:

Sulfur dioxide (free vs. total) affects preservation and taste
Acidity types (fixed, volatile, citric) contribute differently to flavor
Alcohol and density are inversely related
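
These relationships are easy to check on the actual data with a rank correlation against the quality score. The sketch below assumes the UCI column names (`alcohol`, `volatile acidity`, `density`, `quality`); the helper name `quality_correlations` and the five rows are hypothetical, so the printed values only demonstrate the call, not the real correlations.

```python
import pandas as pd


def quality_correlations(df, target="quality", method="spearman"):
    """Correlation of every numeric feature with the target, sorted ascending."""
    numeric = df.select_dtypes("number")
    return numeric.corr(method=method)[target].drop(target).sort_values()


# Made-up rows for illustration only.
demo = pd.DataFrame(
    {
        "alcohol": [9.0, 9.5, 10.5, 11.0, 12.5],
        "volatile acidity": [0.9, 0.7, 0.6, 0.4, 0.3],
        "density": [0.998, 0.997, 0.996, 0.995, 0.992],
        "quality": [4, 5, 5, 6, 7],
    }
)

corr = quality_correlations(demo)
print(corr)

# Pairwise check of the alcohol/density relationship.
alcohol_density = demo["alcohol"].corr(demo["density"])
print(alcohol_density)
```

Running the same function separately on red and white samples is one way to compare quality determinants between types.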

Feature Engineering Strategies

Ratio Features:

Free SO₂ / Total SO₂ (preservation efficiency)
Fixed acidity / Volatile acidity (acidity quality)
Citric acid / Total acidity

Aggregate Features:

Total acidity = Fixed + Volatile + Citric
Sulfur dioxide balance metrics

Chemical Category Features:

Acidity level bins (low, medium, high)
Alcohol strength categories
Sweetness levels from residual sugar
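
Category features like these can be built with `pd.qcut` (data-driven bins) or `pd.cut` (fixed thresholds). In the sketch below the column names follow the UCI files, the helper name is hypothetical, and every threshold is an illustrative assumption rather than an oenological standard; adjust them with domain knowledge.

```python
import pandas as pd


def add_category_features(df):
    """Add binned versions of acidity, alcohol, and residual sugar."""
    out = df.copy()
    # Terciles of fixed acidity: bin edges come from the data itself.
    out["acidity_level"] = pd.qcut(
        out["fixed acidity"], q=3, labels=["low", "medium", "high"]
    )
    # Fixed cut points (hypothetical): up to 10%, 10-12%, above 12% alcohol.
    out["alcohol_strength"] = pd.cut(
        out["alcohol"],
        bins=[0, 10, 12, float("inf")],
        labels=["light", "medium", "strong"],
    )
    # Fixed cut points in g/dm³ (hypothetical sweetness bands).
    out["sweetness"] = pd.cut(
        out["residual sugar"],
        bins=[0, 4, 12, 45, float("inf")],
        labels=["dry", "medium dry", "medium", "sweet"],
    )
    return out


# Made-up rows for illustration only.
demo = pd.DataFrame(
    {
        "fixed acidity": [5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
        "alcohol": [9.0, 9.5, 10.5, 11.5, 12.5, 13.0],
        "residual sugar": [1.5, 2.0, 6.0, 10.0, 20.0, 50.0],
    }
)
binned = add_category_features(demo)
print(binned[["acidity_level", "alcohol_strength", "sweetness"]])
```

Note that `pd.qcut` can raise an error when many values are identical and the bin edges collide; passing `duplicates="drop"` is one way around that.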

Interaction Terms:

Alcohol × Acidity
Sugar × Alcohol (sweetness perception)
pH × Sulphates
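
The ratio, aggregate, and interaction features in this section translate directly into column arithmetic. The following is a sketch, assuming the UCI column names; the helper name and the new column names are hypothetical, and the single demo row is made up.

```python
import numpy as np
import pandas as pd


def add_engineered_features(df):
    """Add ratio, aggregate, and interaction features to a copy of df."""
    out = df.copy()
    # Aggregate feature
    out["total_acidity"] = (
        out["fixed acidity"] + out["volatile acidity"] + out["citric acid"]
    )
    # Ratio features
    out["free_so2_ratio"] = out["free sulfur dioxide"] / out["total sulfur dioxide"]
    out["fixed_to_volatile"] = out["fixed acidity"] / out["volatile acidity"]
    out["citric_share"] = out["citric acid"] / out["total_acidity"]
    # Interaction terms
    out["alcohol_x_acidity"] = out["alcohol"] * out["total_acidity"]
    out["sugar_x_alcohol"] = out["residual sugar"] * out["alcohol"]
    out["ph_x_sulphates"] = out["pH"] * out["sulphates"]
    # A zero denominator yields inf; turn it into NaN for later handling.
    return out.replace([np.inf, -np.inf], np.nan)


# One made-up row for illustration only.
demo = pd.DataFrame(
    {
        "fixed acidity": [7.0],
        "volatile acidity": [0.5],
        "citric acid": [0.5],
        "residual sugar": [2.0],
        "free sulfur dioxide": [10.0],
        "total sulfur dioxide": [40.0],
        "pH": [3.0],
        "sulphates": [0.5],
        "alcohol": [10.0],
    }
)
features = add_engineered_features(demo)
print(features.iloc[0])
```

Tree-based models can often find such interactions on their own, so it is worth comparing scores with and without the engineered columns.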

Model Performance Expectations

Classification (Multi-class 3-9):

Baseline (predict mode): ~43-45% accuracy
Simple models (Logistic Regression): ~52-58% accuracy
Ensemble methods (Random Forest, XGBoost): ~60-68% accuracy
Top approaches: ~65-70% accuracy

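Binary Classification (Good ≥7 vs. Bad <7):

Note: only about 20% of the combined samples score 7 or higher, so always predicting "bad" already reaches roughly 80% accuracy. Read the accuracy ranges below against that baseline, and prefer F1 or balanced accuracy on the "good" class.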

Simple models: ~70-75% accuracy
Ensemble methods: ~75-82% accuracy
Advanced techniques: ~80-85% accuracy

Regression (Predicting exact scores):

Linear models: RMSE ~0.7-0.8
Ensemble methods: RMSE ~0.6-0.7
Best approaches: RMSE ~0.55-0.65
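
Any accuracy figure is easier to interpret next to the majority-class baseline for the same target definition. A minimal sketch, using a hypothetical helper and a made-up series of ten quality scores:

```python
import pandas as pd


def majority_baseline_accuracy(y):
    """Accuracy obtained by always predicting the most frequent class."""
    return y.value_counts(normalize=True).max()


# Made-up quality scores for illustration only.
y = pd.Series([5, 5, 6, 6, 6, 7, 7, 8, 4, 6])

multiclass_baseline = majority_baseline_accuracy(y)   # most frequent score
binary_baseline = majority_baseline_accuracy(y >= 7)  # good (>=7) vs. bad

print(multiclass_baseline, binary_baseline)
```

Computing both numbers on the real quality column before modelling shows how much of a reported accuracy comes from class frequencies alone.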

Evaluation Considerations

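Macro vs. Weighted F1: Weighted F1 tracks overall performance because it follows the class frequencies, while macro F1 weights every class equally and exposes weak performance on the rare quality levels; report both given the imbalance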
Per-class Performance: Examine confusion matrix to see which quality levels are hardest to predict
Adjacent Class Errors: Error predicting 6 as 7 is less severe than predicting 6 as 3
Business Context: False negatives (rating good wine as bad) may have different costs than false positives
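
To make the adjacent-error point concrete, the sketch below compares two sets of predictions with identical accuracy but different error distances. It uses scikit-learn's `mean_absolute_error` and `cohen_kappa_score` with quadratic weights as distance-aware measures; the helper name `ordinal_report` and the toy labels are made up for illustration.

```python
from sklearn.metrics import (
    accuracy_score,
    cohen_kappa_score,
    f1_score,
    mean_absolute_error,
)

# Full score range seen in practice, so kappa weights reflect true distances.
QUALITY_LABELS = list(range(3, 10))


def ordinal_report(y_true, y_pred):
    """Collect plain and distance-aware metrics for ordinal quality labels."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        "weighted_f1": f1_score(y_true, y_pred, average="weighted"),
        "mae": mean_absolute_error(y_true, y_pred),
        "quadratic_kappa": cohen_kappa_score(
            y_true, y_pred, labels=QUALITY_LABELS, weights="quadratic"
        ),
    }


y_true = [5, 5, 5, 6, 6, 6, 7, 7]
near = [5, 5, 6, 6, 6, 7, 7, 7]  # two errors, each off by one score
far = [5, 5, 7, 6, 6, 6, 7, 5]   # two errors, each off by two scores

near_report = ordinal_report(y_true, near)
far_report = ordinal_report(y_true, far)
print(near_report)
print(far_report)
```

Both prediction sets get 6 of 8 labels right, yet the mean absolute error and the quadratic-weighted kappa separate them, which plain accuracy cannot do.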

Domain Knowledge Integration

Oenological Validity: Predictions should align with wine-making principles
Chemical Feasibility: Feature combinations should be chemically realistic
Expert Consultation: Model insights should be validated with wine experts
Regulatory Limits: Some chemicals have legal limits in wine production
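
Some feasibility checks can be automated before consulting experts. The sketch below flags rows where free SO₂ exceeds total SO₂ (not possible by definition), where any measurement is negative, or where pH falls outside a plausible window. The helper name, the UCI column names, and the pH window of 2.5-4.5 are assumptions for illustration; no legal limits are encoded here.

```python
import pandas as pd


def feasibility_flags(df, ph_range=(2.5, 4.5)):
    """Return one boolean column per sanity check; True marks a suspicious row."""
    flags = pd.DataFrame(index=df.index)
    # Free SO2 is a component of total SO2, so it cannot be larger.
    flags["free_exceeds_total_so2"] = (
        df["free sulfur dioxide"] > df["total sulfur dioxide"]
    )
    # Concentrations and pH cannot be negative.
    flags["negative_value"] = (df.select_dtypes("number") < 0).any(axis=1)
    # Hypothetical plausibility window for wine pH.
    flags["ph_out_of_range"] = ~df["pH"].between(*ph_range)
    return flags


# Made-up rows: one plausible, one with free > total SO2, one with odd pH.
demo = pd.DataFrame(
    {
        "free sulfur dioxide": [10.0, 50.0, 5.0],
        "total sulfur dioxide": [40.0, 40.0, 20.0],
        "pH": [3.2, 3.3, 7.0],
    }
)
flags = feasibility_flags(demo)
print(flags)
```

The same checks can be applied to model inputs in an interactive predictor, so users cannot request predictions for chemically impossible wines.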

Create Date January 22, 2026
Last Updated April 14, 2026
