Predictive Modeling · EDA · Model Comparison (Python / scikit-learn)
Using the employment dataset (1994–2024), we predicted Australia’s weekly total earnings from employment signals. After cleaning and standardizing the raw attributes, we built multiple models and compared performance via R² and MSE on an 80/20 train–test split. Ensemble learners (Random Forest / Gradient Boosting) achieved the best fit (R² ≈ 0.99, low MSE), while simpler baselines (KNN / Naive Bayes) underperformed. Feature importance indicates “Employed Persons” dominates predictive power; “Unemployed Persons” adds minor lift.
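A minimal sketch of that comparison loop is below. The column names, filename, and model subset are illustrative assumptions, not the exact project schema; the structure (scaled pipelines, shared split, R²/MSE reporting, importances from the fitted forest) mirrors the workflow described above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import r2_score, mean_squared_error

# Hypothetical file and column names; adjust to the real dataset.
df = pd.read_csv("employment_1994_2024.csv")
X = df[["Employed Persons", "Unemployed Persons"]]
y = df["Weekly Total Earnings"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": make_pipeline(StandardScaler(), LinearRegression()),
    "KNN": make_pipeline(StandardScaler(), KNeighborsRegressor()),
    "Random Forest": RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

# Fit each model on the same split and report test-set R² and MSE.
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(f"{name:>18}: R2 = {r2_score(y_test, pred):.3f}  "
          f"MSE = {mean_squared_error(y_test, pred):.1f}")

# Feature importances from the fitted Random Forest.
rf = models["Random Forest"]
for col, imp in zip(X.columns, rf.feature_importances_):
    print(f"{col}: {imp:.2f}")
```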
Two things stood out. First, model hierarchy matters less when signal-to-noise is high and relationships are close to linear: ensembles still win, but a well-specified linear baseline can be surprisingly competitive. Second, stakeholder value comes from translation, not just metrics: we used feature importance and error bands to explain what moves earnings and where predictions are less certain. If I iterate, I’ll first add macro covariates (CPI, IR, sector composition) to stress-test whether the non-linear gains hold; second, replace the random 80/20 split with cross-validation on a time-series split (sketched below) to de-bias the evaluation; third, quantify stability via rolling windows and tighten the plotting pipeline so every chart is reproducible from raw inputs. The goal isn’t just a slightly higher R², but a model that remains legible and robust when assumptions inevitably drift.
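A hedged sketch of the planned time-series validation (not yet part of the project): it assumes the rows are in chronological order and reuses the illustrative column names from the sketch above, so each fold trains on the past and tests on the future.

```python
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.ensemble import GradientBoostingRegressor

# Same hypothetical file/columns as the comparison sketch, ordered by date.
df = pd.read_csv("employment_1994_2024.csv")
X = df[["Employed Persons", "Unemployed Persons"]]
y = df["Weekly Total Earnings"]

# Five expanding-window folds: train on earlier periods, test on later ones.
tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(
    GradientBoostingRegressor(random_state=42), X, y, cv=tscv, scoring="r2"
)
print("Per-fold R2:", scores.round(3))
print(f"Mean R2: {scores.mean():.3f}")
```

Compared with a random 80/20 split, this ordering-aware evaluation avoids leaking future observations into training, so the per-fold scores also hint at how stable the model is across periods.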