Completed

Classifier Accuracy & Runtime — Pima Diabetes vs Room Occupancy

Course Project · University of Sydney 2025 S1
k-NN
Naive Bayes
Ensemble
Random Forest
SVM
10-fold CV
Weka
Python

Machine Learning · Model Comparison (Weka vs Python)

Project Overview

We compare custom Python implementations (1NN/7NN, Gaussian Naive Bayes, and a majority-vote ensemble, "MyEns") with Weka baselines (ZeroR, 1R, Decision Tree, MLP, SVM, Random Forest) on two datasets: (1) Pima Indians Diabetes (768 × 8) and (2) Room Occupancy (2025 × 4). Using 10-fold stratified CV, we evaluate both predictive accuracy and runtime. Results: on Pima, Random Forest performs best (~77.44%), while SVM, MLP, 7NN, and MyEns sit close behind in the mid-70% band. On Occupancy, nearly all models reach 98–99% (e.g., RF ~99.71%, 1NN ~99.51%), reflecting cleaner signals and clearer class boundaries. Our Python implementations match Weka's accuracy closely but run slower (~8–12 s vs <2 s), highlighting language and implementation overheads. Takeaway: dataset structure dominates outcomes; ensembles help more on the noisier Pima data than on clean Occupancy. Accuracy parity across platforms validates our implementations; runtime favors Java/Weka unless the Python path is vectorized and optimized.
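As a rough illustration of the evaluation setup, here is a minimal sketch of a 10-fold stratified CV loop in plain NumPy. The names (stratified_folds, cv_accuracy) are illustrative, not the project's actual code; any model object exposing fit/predict would plug in.

    import numpy as np

    def stratified_folds(y, k=10, seed=0):
        # Assign each sample to one of k folds while keeping class proportions roughly equal.
        rng = np.random.default_rng(seed)
        fold_of = np.empty(len(y), dtype=int)
        for cls in np.unique(y):
            idx = np.flatnonzero(y == cls)
            rng.shuffle(idx)
            fold_of[idx] = np.arange(len(idx)) % k  # deal class members round-robin into folds
        return fold_of

    def cv_accuracy(model, X, y, k=10, seed=0):
        # Mean accuracy of `model` over k stratified folds; `model` needs fit/predict.
        fold_of = stratified_folds(y, k, seed)
        accs = []
        for f in range(k):
            train, test = fold_of != f, fold_of == f
            model.fit(X[train], y[train])
            accs.append(np.mean(model.predict(X[test]) == y[test]))
        return float(np.mean(accs))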

What I did

  • Re-implemented 1NN/7NN and Gaussian Naive Bayes in Python and set up a 10-fold stratified CV pipeline (minimal sketches of the classifiers follow this list).
  • Built a simple majority-vote ensemble (1NN + 7NN + NB) and compared it against the Weka baselines (see the ensemble sketch after this list).
  • Reproduced normalization, folds, and configs to ensure fair, like-for-like comparisons.
  • Analyzed accuracy deltas across datasets and models; explained why Occupancy reaches ~99% while Pima plateaus around ~75%.
  • Profiled runtime (Weka <2 s vs Python 8–12 s) and attributed the gap to platform optimizations and vectorization (a simple timing harness is sketched below).
  • Summarized implications: choose models by data regime; prioritize optimization only where latency matters.
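For the first bullet, a compact sketch of how 1NN/7NN and Gaussian Naive Bayes can be written with NumPy. This is an illustrative reconstruction assuming numeric features and integer class labels, not the submitted code.

    import numpy as np

    class KNN:
        # k-nearest-neighbours with Euclidean distance and majority vote (assumes integer labels).
        def __init__(self, k=1):
            self.k = k
        def fit(self, X, y):
            self.X, self.y = np.asarray(X, float), np.asarray(y, int)
            return self
        def predict(self, X):
            X = np.asarray(X, float)
            d = np.linalg.norm(X[:, None, :] - self.X[None, :, :], axis=2)  # (n_test, n_train)
            nearest = np.argsort(d, axis=1)[:, :self.k]
            return np.array([np.bincount(self.y[row]).argmax() for row in nearest])

    class GaussianNB:
        # Gaussian Naive Bayes: per-class mean/variance plus class priors; features assumed independent.
        def fit(self, X, y):
            X, y = np.asarray(X, float), np.asarray(y)
            self.classes = np.unique(y)
            self.mu = np.array([X[y == c].mean(axis=0) for c in self.classes])
            self.var = np.array([X[y == c].var(axis=0) + 1e-9 for c in self.classes])  # variance floor avoids divide-by-zero
            self.prior = np.array([np.mean(y == c) for c in self.classes])
            return self
        def predict(self, X):
            X = np.asarray(X, float)
            # log P(c) + sum_j log N(x_j | mu_cj, var_cj), evaluated for every class at once
            ll = (np.log(self.prior)
                  - 0.5 * np.sum(np.log(2 * np.pi * self.var), axis=1)
                  - 0.5 * ((X[:, None, :] - self.mu) ** 2 / self.var).sum(axis=2))
            return self.classes[np.argmax(ll, axis=1)]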
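And for the ensemble and runtime bullets, a sketch of the majority-vote wrapper together with a simple timing harness. MajorityVote and the time.perf_counter loop are illustrative stand-ins for how the comparison could be driven, not the project's exact harness.

    import time
    import numpy as np

    class MajorityVote:
        # Majority vote over base models; with three models and two classes a majority always exists.
        def __init__(self, models):
            self.models = models
        def fit(self, X, y):
            for m in self.models:
                m.fit(X, y)
            return self
        def predict(self, X):
            votes = np.stack([m.predict(X) for m in self.models])  # shape (n_models, n_samples)
            return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])

    # Hypothetical usage: time each model's full 10-fold CV run on a dataset (X, y).
    # models = {"1NN": KNN(1), "7NN": KNN(7), "NB": GaussianNB(),
    #           "MyEns": MajorityVote([KNN(1), KNN(7), GaussianNB()])}
    # for name, model in models.items():
    #     t0 = time.perf_counter()
    #     acc = cv_accuracy(model, X, y)
    #     print(f"{name}: {acc:.4f} in {time.perf_counter() - t0:.2f}s")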

Reflection

This exercise made me appreciate how strongly dataset structure drives model performance. On Pima, we saw modest ceilings (≈75–77%) even for strong models; on Occupancy, almost everything was near-perfect. That pushed me to interpret results beyond "which algorithm wins," toward "why this dataset favors certain inductive biases."

Re-implementing k-NN & NB deepened my grasp of distance metrics, smoothing, and independence assumptions, and why the same design can feel brittle on noisy, high-variance medical data but excel on clean sensor streams. Matching Weka's accuracy was a good correctness check; the runtime gap reminded me that production systems benefit from compiled paths (HotSpot JIT, optimized data structures) or Python acceleration (NumPy/Cython/Numba).

If I iterate, I'll (1) add precision/recall/F1 and calibration error, (2) test robustness under class imbalance and missingness, (3) vectorize and parallelize the Python path, and (4) probe explainability (feature importance, decision surfaces) to balance performance and interpretability for real deployments.
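On the vectorization point, here is a small self-contained benchmark sketch of the kind of change that closes much of the Python/Weka runtime gap: replacing a per-pair Python loop with a single matrix expression for all pairwise distances. The array sizes and function names are arbitrary illustrations.

    import time
    import numpy as np

    def pairwise_dists_loop(A, B):
        # Naive double loop: one Python-level distance computation per pair (slow).
        out = np.empty((len(A), len(B)))
        for i, a in enumerate(A):
            for j, b in enumerate(B):
                out[i, j] = np.sqrt(np.sum((a - b) ** 2))
        return out

    def pairwise_dists_vec(A, B):
        # Vectorised: ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, computed as one matrix product.
        sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
        return np.sqrt(np.maximum(sq, 0.0))  # clamp tiny negative values from floating-point error

    A, B = np.random.rand(700, 8), np.random.rand(700, 8)  # roughly Pima-sized
    for f in (pairwise_dists_loop, pairwise_dists_vec):
        t0 = time.perf_counter()
        f(A, B)
        print(f"{f.__name__}: {time.perf_counter() - t0:.3f}s")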