Status: Completed

Modeling Australia’s Weekly Earnings with Employment Signals

Course Project · University of Sydney 2024 S2
Tags: Random Forest · Gradient Boosting · Linear Regression · Extra Trees · AdaBoost · Decision Tree · KNN · Naive Bayes · R² / MSE · Python · scikit-learn

Predictive Modeling · EDA · Model Comparison (Python / scikit-learn)

Project Overview

Using the employment dataset (1994–2024), we predict Australia’s weekly total earnings from employment signals. After cleaning and standardizing the raw attributes, we built multiple models and compared performance via R² and MSE on an 80/20 train–test split. Ensemble learners (Random Forest / Gradient Boosting) achieved the best fit (R² ≈ 0.99, low MSE), while simpler baselines (KNN / Naive Bayes) underperformed. Feature importance indicates that “Employed Persons” dominates predictive power, while “Unemployed Persons” adds only minor lift.

What I Did

  • Data prep: parsed and cleaned the historical series; standardized numeric fields; reproducible 80/20 train/test split (random_state=42); see the benchmarking sketch after this list.
  • Benchmarked 8 models: Random Forest, Gradient Boosting, Linear Regression, Extra Trees, AdaBoost, Decision Tree, KNN, Naive Bayes.
  • Top accuracy: Random Forest (R² ≈ 0.995, MSE ≈ 411) edges out Gradient Boosting (R² ≈ 0.994, MSE ≈ 556); the linear model was also strong on this dataset.
  • Diagnostics: inspected residuals/fit plots; checked variance and generalization gap; ensembles showed no material overfitting.
  • Interpretability: feature importance highlights Employed Persons as the primary driver, with Unemployed Persons contributing only marginally (see the feature-importance sketch below).
  • Packaging: produced publication-ready figures and a short model-selection narrative for non-technical stakeholders.
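
Below is a minimal sketch of the prep-and-benchmark loop described in the bullets. The file path and column names are assumptions for illustration, not taken from the project; the Naive Bayes baseline is omitted because scikit-learn’s GaussianNB is a classifier and would need a discretized target to apply here.

```python
# Sketch of the 80/20 benchmark described above. File path and column
# names are assumed for illustration.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import (
    RandomForestRegressor, GradientBoostingRegressor,
    ExtraTreesRegressor, AdaBoostRegressor,
)

df = pd.read_csv("employment_1994_2024.csv")          # assumed path
X = df[["Employed Persons", "Unemployed Persons"]]    # assumed feature columns
y = df["Weekly Total Earnings"]                       # assumed target column

# Reproducible 80/20 split, as in the write-up.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardize on the training set only, to avoid leakage into the test set.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

models = {
    "Random Forest": RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    "Linear Regression": LinearRegression(),
    "Extra Trees": ExtraTreesRegressor(random_state=42),
    "AdaBoost": AdaBoostRegressor(random_state=42),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "KNN": KNeighborsRegressor(),
}

# Fit each model and report held-out R² and MSE.
for name, model in models.items():
    model.fit(X_train_s, y_train)
    pred = model.predict(X_test_s)
    print(f"{name}: R2={r2_score(y_test, pred):.3f} "
          f"MSE={mean_squared_error(y_test, pred):.1f}")
```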
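
And a short feature-importance readout, continuing directly from the benchmarking sketch above (it reuses the fitted `models` dict and the `X` DataFrame defined there):

```python
# Rank features by the fitted Random Forest's impurity-based importances.
# Reuses `models` and `X` from the benchmarking sketch above.
rf = models["Random Forest"]
for feature, importance in sorted(
    zip(X.columns, rf.feature_importances_),
    key=lambda pair: pair[1], reverse=True,
):
    print(f"{feature}: {importance:.3f}")
```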

Reflection

Two things stood out. First, model hierarchy matters less when signal-to-noise is high and relationships are close to linear: ensembles still win, but a well-specified linear baseline can be surprisingly competitive. Second, stakeholder value comes from translation, not just metrics: we used feature importance and error bands to explain what moves earnings and where predictions are less certain. If I iterate, I’ll first add macro covariates (CPI, interest rates, sector composition) to stress-test non-linear gains; second, replace the random 80/20 split with cross-validation or a time-series split to de-bias the evaluation; and third, quantify stability via rolling windows and tighten the plotting pipeline so every chart is reproducible from raw inputs. The goal isn’t just a slightly higher R², but a model that remains legible and robust when assumptions inevitably drift.
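
For the time-series split mentioned above, a minimal sketch using scikit-learn’s TimeSeriesSplit; the `X`, `y` names follow the earlier sketches and are assumptions, and it assumes rows are already in chronological order:

```python
# Walk-forward evaluation: each fold trains on the past and tests on the
# following period, avoiding the look-ahead bias of a random 80/20 split.
from sklearn.model_selection import TimeSeriesSplit
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
    model = RandomForestRegressor(random_state=42)
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    pred = model.predict(X.iloc[test_idx])
    print(f"fold {fold}: R2={r2_score(y.iloc[test_idx], pred):.3f}")
```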