- Create Date January 22, 2026
- Last Updated January 22, 2026
The Wine Quality Dataset, created by Yasser H and originally from the UCI Machine Learning Repository, contains detailed physicochemical and sensory data for Portuguese "Vinho Verde" wine variants. This dataset includes measurements of various chemical properties along with quality ratings assigned by wine experts, making it ideal for both classification and regression tasks in predictive analytics.
Available on Kaggle, this dataset is excellent for building models that predict wine quality based on objective chemical measurements, understanding feature relationships in sensory evaluation, and exploring how different physicochemical properties influence wine quality - valuable for both data science practice and real-world applications in food science and quality control.
Key Features
- Records: Approximately 6,497 wine samples (combined red and white wines).
- Red wine: ~1,599 samples
- White wine: ~4,898 samples
- Variables: 12 features plus target variable:
- Fixed Acidity: Tartaric acid content (g/dm³)
- Volatile Acidity: Acetic acid content (g/dm³) - high levels can lead to unpleasant vinegar taste
- Citric Acid: Adds freshness and flavor (g/dm³)
- Residual Sugar: Sugar remaining after fermentation (g/dm³)
- Chlorides: Salt content (g/dm³)
- Free Sulfur Dioxide: Free form of SO₂ (mg/dm³)
- Total Sulfur Dioxide: Total SO₂ content (mg/dm³)
- Density: Density of wine (g/cm³)
- pH: Acidity level (0-14 scale)
- Sulphates: Wine additive contributing to SO₂ levels (g/dm³)
- Alcohol: Alcohol percentage by volume
- Wine Type: Red or white wine (if included)
- Quality: Target variable - score from 0 to 10 (observed scores typically fall between 3 and 9)
- Data Type: Numerical (continuous physicochemical measurements) with ordinal target variable.
- Format: CSV file.
- Quality Distribution: Imbalanced - most wines rated 5-7, few rated 3-4 or 8-9.
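If the red and white samples arrive as separate tables, one way to get the combined layout described above is to tag each table with its wine type before stacking them. The sketch below uses tiny stand-in frames rather than real data; the helper name `combine_wines` and the column name `wine_type` are assumptions for illustration, not names from the dataset.

```python
import pandas as pd


def combine_wines(red: pd.DataFrame, white: pd.DataFrame) -> pd.DataFrame:
    """Stack red and white samples and tag each row with its wine type."""
    red = red.assign(wine_type="red")
    white = white.assign(wine_type="white")
    return pd.concat([red, white], ignore_index=True)


# Stand-in frames; real data would come from pd.read_csv(...)
red = pd.DataFrame({"alcohol": [9.4, 9.8], "quality": [5, 5]})
white = pd.DataFrame({"alcohol": [8.8, 9.5, 10.1], "quality": [6, 6, 7]})

wines = combine_wines(red, white)

# Samples per wine type and the share of each quality score
type_counts = wines["wine_type"].value_counts()
quality_share = wines["quality"].value_counts(normalize=True).sort_index()
print(type_counts)
print(quality_share)
```

On the real data, the same two `value_counts()` calls give a quick check of the red/white split and of the quality imbalance.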
Why This Dataset
This dataset bridges chemistry and sensory science, allowing exploration of how objective measurements relate to subjective quality assessments. It provides real-world complexity with imbalanced classes and correlated features. It's ideal for projects that aim to:
- Predict wine quality ratings from chemical properties (classification or regression).
- Identify which physicochemical factors most influence quality perception.
- Compare red and white wine characteristics and quality determinants.
- Handle ordinal target variables in machine learning problems.
- Address class imbalance in multi-class classification scenarios.
- Perform feature engineering and selection for chemical data.
- Build interpretable models for domain experts in oenology.
- Explore relationships between correlated chemical properties.
How to Use the Dataset
- Download the CSV file from Kaggle.
- Load into Python using Pandas: df = pd.read_csv('wine_quality.csv').
- Explore the structure using .info(), .head(), .describe() to understand feature distributions.
- Check for missing values using .isnull().sum() (typically none in this dataset).
- Analyze target distribution using .value_counts() on quality scores to understand class imbalance.
- Visualize distributions:
- Histograms for each chemical property
- Box plots comparing features across quality ratings
- Correlation heatmap to identify multicollinearity
- Scatter plots of key features vs. quality
- Handle wine types if dataset includes both red and white:
- Analyze separately or create binary feature
- Compare chemical profiles between types
- Address quality variable:
- Classification approach: Treat as multi-class (3, 4, 5, 6, 7, 8, 9)
- Simplified classification: Binary (good vs. bad) or three-class (low, medium, high)
- Regression approach: Predict exact quality score
- Engineer features:
- Ratio features (e.g., free_SO2/total_SO2)
- Acidity combinations (fixed + volatile + citric)
- Polynomial features for non-linear relationships
- Binning continuous variables
- Interaction terms between related chemicals
- Handle outliers using IQR method or domain knowledge about realistic ranges.
- Scale features using StandardScaler or RobustScaler for chemical measurements.
- Address class imbalance:
- SMOTE for oversampling minority classes
- Class weights in algorithms
- Combine adjacent quality scores to reduce classes
- Split data using stratified train-test split to maintain quality distribution.
- Train models including:
- Logistic Regression (for classification)
- Linear/Ridge/Lasso Regression (for regression)
- Random Forest
- Gradient Boosting (XGBoost, LightGBM, CatBoost)
- SVM
- Neural Networks
- Evaluate performance:
- Classification: Accuracy, F1-score (macro/weighted), confusion matrix, per-class metrics
- Regression: RMSE, MAE, R², mean absolute percentage error
- Interpret results using feature importance plots, SHAP values, or coefficient analysis.
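The split, scale, train and evaluate steps above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not a tuned solution: the data is synthetic (11 numeric features and three imbalanced quality levels) so the snippet runs without the CSV, and class weights stand in for the imbalance handling (SMOTE would need the separate imbalanced-learn package).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine table (assumption: not real measurements)
rng = np.random.default_rng(0)
n_samples = 600
y = rng.choice([5, 6, 7], size=n_samples, p=[0.40, 0.45, 0.15])
X = rng.normal(size=(n_samples, 11))
X[:, 0] += 2.0 * y  # make the first feature informative about quality

# Stratified split keeps the quality distribution similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaling lives inside the pipeline so it is fitted on training data only
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(X_train, y_train)

macro_f1 = f1_score(y_test, model.predict(X_test), average="macro")
print(f"macro F1 on held-out data: {macro_f1:.2f}")
```

Swapping `LogisticRegression` for a tree ensemble such as `RandomForestClassifier` leaves the rest of the pipeline unchanged; the scaler is then unnecessary but harmless.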
Possible Project Ideas
- Multi-class quality classifier predicting exact quality ratings (3-9 scale).
- Binary quality classifier distinguishing good wines (≥7) from average/poor wines (<7).
- Regression model predicting continuous quality scores with confidence intervals.
- Feature importance study identifying which chemical properties most influence quality.
- Red vs. white wine comparison analyzing differences in quality determinants.
- Optimal wine profile generator finding ideal chemical combinations for high-quality ratings.
- Imbalanced learning study comparing SMOTE, class weights, and ensemble methods.
- Ensemble learning system combining multiple models using voting or stacking.
- Clustering analysis discovering natural wine groupings based on chemical profiles.
- Outlier detection system identifying unusual wines or potential data errors.
- Interactive wine quality predictor with Streamlit allowing chemical input for quality estimation.
- Feature engineering comparison evaluating impact of different engineered features.
- Explainable AI dashboard using SHAP to show how each chemical contributes to predictions.
- Ordinal regression model specifically handling ordinal nature of quality scores.
- Domain-specific validation testing if predictions align with oenological knowledge.
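Several of these ideas start by deriving a simpler target from the raw quality score. A minimal sketch, using the good (≥7) threshold from the list above; the three-class cut points (low ≤4, medium 5-6, high ≥7) are an assumption chosen for illustration, since the text does not fix them.

```python
import pandas as pd

# Toy quality scores covering the range seen in practice
quality = pd.Series([3, 5, 6, 6, 7, 8, 9], name="quality")

# Binary target: 1 for "good" wines (quality >= 7), 0 otherwise
is_good = (quality >= 7).astype(int)

# Three-class target; bin edges are an assumption:
# (-1, 4] -> low, (4, 6] -> medium, (6, 10] -> high
quality_band = pd.cut(
    quality, bins=[-1, 4, 6, 10], labels=["low", "medium", "high"]
)

print(pd.DataFrame({"quality": quality, "is_good": is_good, "band": quality_band}))
```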
Dataset Challenges and Considerations
- Class Imbalance: Most wines are rated 5-7, with 5 and 6 dominating; very few are rated 3-4 or 8-9, which makes predicting extreme quality scores difficult.
- Subjectivity: Quality scores are based on sensory evaluation by experts, which can vary.
- Ordinal Target: Quality is ordinal (7 > 6 > 5), but many algorithms treat it as nominal or continuous.
- Feature Correlation: Many chemical properties are correlated (e.g., density with alcohol and sugar).
- Limited Quality Range: Actual scores typically range 3-9, not full 0-10 scale.
- Sample Representation: Dataset represents specific Portuguese wine region (Vinho Verde).
- Outliers: Some chemical measurements may be outliers or measurement errors.
- Feature Scaling: Different chemical properties have vastly different scales and units.
- Non-linear Relationships: Wine quality may depend on non-linear combinations of chemicals.
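For the outlier point above, the usual IQR rule flags values that lie more than 1.5 × IQR outside the quartiles. The sketch below applies it to a made-up chlorides column; the helper name `iqr_outlier_mask` is hypothetical, and whether a flagged value is a measurement error or a genuinely unusual wine still needs a domain check.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies more than k * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Made-up chloride readings with one suspicious value at the end
chlorides = pd.Series([0.045, 0.050, 0.048, 0.052, 0.047, 0.610])
mask = iqr_outlier_mask(chlorides)
print(chlorides[mask])
```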
Key Chemical Relationships
Quality Influencers (Generally):
- Positive correlations: Higher alcohol content, moderate acidity levels, optimal sulfur dioxide
- Negative correlations: High volatile acidity (vinegar taste), excessive chlorides
- Complex relationships: Residual sugar (varies by wine type), pH interactions
Red vs. White Wine Differences:
- White wines typically have higher residual sugar and sulfur dioxide
- Red wines often have higher pH and volatile acidity
- Quality determinants may differ between types
Chemical Interactions:
- Sulfur dioxide (free vs. total) affects preservation and taste
- Acidity types (fixed, volatile, citric) contribute differently to flavor
- Alcohol and density are inversely related
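Relationships like these are worth checking on the data rather than taking on trust. A minimal sketch that ranks features by their Pearson correlation with quality; the toy values are invented to mirror the directions described above, and the column names follow the UCI naming, which is an assumption about the file being used.

```python
import pandas as pd


def quality_correlations(df: pd.DataFrame, target: str = "quality") -> pd.Series:
    """Pearson correlation of every numeric feature with the target, sorted ascending."""
    numeric = df.select_dtypes("number")
    return numeric.corr()[target].drop(target).sort_values()


# Toy rows (not real measurements)
toy = pd.DataFrame(
    {
        "alcohol": [9.0, 9.5, 10.5, 11.0, 12.5],
        "volatile acidity": [0.70, 0.60, 0.45, 0.40, 0.30],
        "density": [0.9980, 0.9975, 0.9960, 0.9955, 0.9930],
        "quality": [4, 5, 6, 6, 8],
    }
)

corrs = quality_correlations(toy)
alcohol_density = toy["alcohol"].corr(toy["density"])
print(corrs)
print(f"alcohol vs. density: {alcohol_density:.2f}")
```

Because quality is ordinal, a rank-based correlation (for example `df.rank().corr()`) is a reasonable alternative to the Pearson default.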
Feature Engineering Strategies
Ratio Features:
- Free SO₂ / Total SO₂ (preservation efficiency)
- Fixed acidity / Volatile acidity (acidity quality)
- Citric acid / Total acidity
Aggregate Features:
- Total acidity = Fixed + Volatile + Citric
- Sulfur dioxide balance metrics
Chemical Category Features:
- Acidity level bins (low, medium, high)
- Alcohol strength categories
- Sweetness levels from residual sugar
Interaction Terms:
- Alcohol × Acidity
- Sugar × Alcohol (sweetness perception)
- pH × Sulphates
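A few of the strategies above, sketched in pandas. The column names follow the UCI naming (for example `free sulfur dioxide`), which is an assumption about the file in use; the new feature names and the helper `add_engineered_features` are hypothetical. Summing the three acidity columns follows the definition given above and is only a rough aggregate, since the columns measure different acids.

```python
import pandas as pd


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with ratio, aggregate, interaction and binned features."""
    out = df.copy()
    # Ratio feature (a total SO2 of zero would produce inf and need handling)
    out["free_so2_ratio"] = out["free sulfur dioxide"] / out["total sulfur dioxide"]
    # Aggregate feature
    out["total_acidity"] = (
        out["fixed acidity"] + out["volatile acidity"] + out["citric acid"]
    )
    out["citric_share"] = out["citric acid"] / out["total_acidity"]
    # Interaction terms
    out["alcohol_x_acidity"] = out["alcohol"] * out["total_acidity"]
    out["sugar_x_alcohol"] = out["residual sugar"] * out["alcohol"]
    out["ph_x_sulphates"] = out["pH"] * out["sulphates"]
    # Category feature: three equal-frequency alcohol bins
    out["alcohol_level"] = pd.qcut(
        out["alcohol"], q=3, labels=["low", "medium", "high"]
    )
    return out


# Three illustrative rows
sample = pd.DataFrame(
    {
        "fixed acidity": [7.4, 7.0, 6.0],
        "volatile acidity": [0.70, 0.27, 0.30],
        "citric acid": [0.00, 0.36, 0.30],
        "residual sugar": [1.9, 20.7, 2.0],
        "free sulfur dioxide": [11.0, 45.0, 30.0],
        "total sulfur dioxide": [34.0, 170.0, 120.0],
        "pH": [3.51, 3.00, 3.20],
        "sulphates": [0.56, 0.45, 0.50],
        "alcohol": [9.4, 8.8, 12.0],
    }
)

features = add_engineered_features(sample)
print(features[["free_so2_ratio", "total_acidity", "alcohol_level"]])
```

When these features feed a model, bins and any other data-dependent cut points should be derived from the training split only, to avoid leaking information from the test set.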
Model Performance Expectations
Classification (Multi-class 3-9):
- Baseline (predict mode): ~43-45% accuracy (the most common score covers roughly 44% of the combined samples)
- Simple models (Logistic Regression): ~52-58% accuracy
- Ensemble methods (Random Forest, XGBoost): ~60-68% accuracy
- Top approaches: ~65-70% accuracy
Binary Classification (Good ≥7 vs. Bad <7):
- Simple models: ~70-75% accuracy
- Ensemble methods: ~75-82% accuracy
- Advanced techniques: ~80-85% accuracy
Regression (Predicting exact scores):
- Linear models: RMSE ~0.7-0.8
- Ensemble methods: RMSE ~0.6-0.7
- Best approaches: RMSE ~0.55-0.65
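Figures like these only mean something next to a trivial baseline computed on the same labels. A minimal sketch of two such baselines, always predicting the most common class and always predicting the mean score, using a made-up label list; the function names are hypothetical.

```python
import math
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def mean_baseline_rmse(scores):
    """RMSE obtained by always predicting the mean score."""
    mean = sum(scores) / len(scores)
    return math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))


# Made-up quality labels
quality = [5, 5, 5, 6, 6, 6, 6, 7, 7, 8]

multi_baseline = majority_baseline_accuracy(quality)
binary_baseline = majority_baseline_accuracy([q >= 7 for q in quality])
rmse_baseline = mean_baseline_rmse(quality)

print(f"multi-class baseline accuracy: {multi_baseline:.2f}")
print(f"binary baseline accuracy:      {binary_baseline:.2f}")
print(f"mean-prediction RMSE:          {rmse_baseline:.3f}")
```

Running the same functions on the real quality column shows how much of a reported accuracy or RMSE comes from the model rather than from the label distribution. This matters most for the binary task, where the "bad" class dominates.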
Evaluation Considerations
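- Macro vs. Weighted F1: Weighted F1 summarizes overall performance but is dominated by the majority classes; macro F1 weights every quality level equally and exposes weak performance on rare scores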
- Per-class Performance: Examine confusion matrix to see which quality levels are hardest to predict
- Adjacent Class Errors: Error predicting 6 as 7 is less severe than predicting 6 as 3
- Business Context: False negatives (rating good wine as bad) may have different costs than false positives
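The considerations above can be made concrete with a small made-up example in which a model never predicts the rare scores 7 and 8. The sketch assumes scikit-learn is available; "within one" accuracy and MAE on the raw scores are one simple way to credit adjacent-class errors, not the only one.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Made-up labels: the model never predicts the rare scores 7 and 8
y_true = np.array([5, 5, 5, 5, 6, 6, 6, 6, 7, 8])
y_pred = np.array([5, 5, 5, 6, 6, 6, 6, 5, 6, 6])

# Macro F1 averages per-class F1 equally; weighted F1 weights by class support
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
weighted_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)

# Ordinal-aware views: how far off are the predictions?
abs_error = np.abs(y_true - y_pred)
within_one = float(np.mean(abs_error <= 1))
mae = float(np.mean(abs_error))

# Rows are true scores, columns are predicted scores
cm = confusion_matrix(y_true, y_pred, labels=[5, 6, 7, 8])

print(f"macro F1: {macro_f1:.4f}, weighted F1: {weighted_f1:.4f}")
print(f"within-one accuracy: {within_one:.2f}, MAE: {mae:.2f}")
print(cm)
```

Here the weighted F1 looks noticeably better than the macro F1 because the two missed classes carry little support, which is the pattern to watch for on the real data.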
Domain Knowledge Integration
- Oenological Validity: Predictions should align with wine-making principles
- Chemical Feasibility: Feature combinations should be chemically realistic
- Expert Consultation: Model insights should be validated with wine experts
- Regulatory Limits: Some chemicals have legal limits in wine production
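Some feasibility checks follow directly from definitions and can be automated before any expert review: free SO₂ cannot exceed total SO₂, pH must lie on the 0-14 scale, measurements cannot be negative, and quality must lie between 0 and 10. The sketch below encodes only these; it deliberately contains no regulatory limits, which would have to come from an authoritative source. Column names follow the UCI naming (an assumption) and the helper name is hypothetical.

```python
import pandas as pd


def feasibility_flags(df: pd.DataFrame) -> pd.DataFrame:
    """Flag rows that violate basic chemical or scoring constraints."""
    flags = pd.DataFrame(index=df.index)
    flags["free_exceeds_total_so2"] = (
        df["free sulfur dioxide"] > df["total sulfur dioxide"]
    )
    flags["ph_out_of_scale"] = ~df["pH"].between(0, 14)
    flags["negative_measurement"] = (df.drop(columns=["quality"]) < 0).any(axis=1)
    flags["quality_out_of_range"] = ~df["quality"].between(0, 10)
    return flags


# Illustrative rows: the second and third contain deliberate errors
rows = pd.DataFrame(
    {
        "free sulfur dioxide": [11.0, 60.0, 30.0],
        "total sulfur dioxide": [34.0, 40.0, 120.0],
        "pH": [3.51, 3.20, 3.30],
        "alcohol": [9.4, 10.0, -1.0],
        "quality": [5, 6, 12],
    }
)

flags = feasibility_flags(rows)
suspicious = flags.any(axis=1)
print(rows[suspicious])
```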