Statistical Modeling · Model Selection · Robustness & Stability Checks
Goal: identify the key physicochemical drivers behind wine quality ratings, and build interpretable models that generalize.

Dataset:
• Vinho Verde wine quality dataset (red + white variants)
• Outcome: sensory "quality" score
• Predictors: acidity, alcohol, sulphates, residual sugar, density, etc.

Approach:
• Ran full diagnostic checks (linearity, normality, heteroskedasticity, leverage/outliers).
• Applied transformations (e.g., log) where appropriate to stabilize variance and improve fit.
• Compared candidate models via selection criteria and validation logic.
• Emphasized interpretability: which variables matter most, and why.

Key takeaway: quality is not driven by a single variable; the best-performing models balance interpretability with predictive stability, and show consistent importance for alcohol and a small set of chemical indicators.
This project taught me that the "best model" depends on the goal:
• If the goal is explanation, you want a stable set of predictors and a clean story.
• If the goal is prediction, you may accept more complexity but must validate stability.

I focused on bridging both: (1) rigorous diagnostics + transformations, (2) selection with sanity checks, (3) conclusions that remain consistent across reasonable modeling choices.

If iterating further, I'd:
• add cross-validated performance comparisons,
• explore nonlinearities (splines/interactions),
• compare red vs. white with a unified model including type interactions,
• build a small interactive report (filters + coefficient explorer) for better storytelling.