Completed

Classifying Tumour vs Immune Cells in H&E Images

Course Project · University of Sydney 2026 S1
Computer Vision · Medical Imaging · KNN · HOG · Random Forest · SVM · CNN · ResNet50 · Python · R · Explainable AI

Applied Machine Learning · Computer Vision · Pathology Image Classification

Project Overview

This project compares five machine learning models for classifying tumour cells versus immune cells in H&E pathology image patches. The client scenario was a pathologist who needed a reliable, interpretable, and clinically useful classifier. I evaluated models ranging from a simple pixel baseline through feature-engineered approaches to deep learning: Pixel KNN, HOG + Random Forest, Colour Histogram + SVM, a CNN, and ResNet50 transfer learning. The final recommended model was Colour Histogram + SVM, which achieved the strongest performance while remaining lightweight and interpretable.

What I Did

  • Built an end-to-end image classification pipeline using 2,000 H&E image patches: 1,000 immune and 1,000 tumour.
  • Applied an 80/20 stratified split to preserve class balance, resulting in 1,600 training images and 400 test images.
  • Preprocessed images by resizing to 50×50 RGB and normalising pixel values for fair model comparison.
  • Established Pixel KNN as a baseline model; its 0.50 accuracy, chance level on a balanced two-class test set, showed that raw pixel distance was not informative.
  • Engineered HOG features with Random Forest to capture edge and texture structure, improving accuracy to 0.65.
  • Designed a domain-informed Colour Histogram + SVM model using HSV colour distributions from H&E staining biology.
  • Tested CNN and ResNet50 to compare deep learning against interpretable feature-based approaches.
  • Selected Colour Histogram + SVM as the final model with accuracy 0.95, tumour sensitivity 0.96, and AUC 0.992.
  • Performed error analysis to identify failure cases: pale tumour nuclei and densely clustered immune cells.
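
The stratified split and the headline metrics from the list above can be sketched in a few lines. This is a minimal, dependency-free illustration under stated assumptions, not the project's actual code: the function names `stratified_split` and `accuracy_and_sensitivity`, the seed, and the placeholder label list are all invented for the example. In practice, scikit-learn's `train_test_split(..., stratify=y)` is the more usual way to get the same split.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with the same class ratio in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)  # shuffle within each class before cutting
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


def accuracy_and_sensitivity(y_true, y_pred, positive="tumour"):
    """Accuracy over all cases and sensitivity (recall) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    return correct / len(pairs), sensitivity


# Placeholder labels matching the dataset sizes described above.
labels = ["immune"] * 1000 + ["tumour"] * 1000
train_idx, test_idx = stratified_split(labels)
train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx))          # 1600 400
print(dict(train_counts), dict(test_counts))  # 800 per class, then 200 per class

# Hypothetical toy predictions (not project results) to show the metrics.
toy_true = ["tumour", "tumour", "immune", "immune"]
toy_pred = ["tumour", "immune", "immune", "immune"]
print(accuracy_and_sensitivity(toy_true, toy_pred))  # (0.75, 0.5)
```

Sensitivity is TP / (TP + FN), so every missed tumour patch lowers it directly, which is why it is reported alongside accuracy for a clinical client.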

Reflection

This project helped me understand that the strongest model is not always the most complex one. The most important insight was that H&E staining already encodes biological information through colour: haematoxylin stains nuclei purple-blue and eosin stains cytoplasm pink. By translating this domain knowledge into HSV colour histogram features, the SVM outperformed both the CNN and ResNet50. This taught me that on small medical imaging datasets, interpretable and biologically motivated feature engineering can sometimes outperform end-to-end deep learning. The project also strengthened my ability to communicate model trade-offs to a non-technical client, especially around sensitivity, false negatives, interpretability, and deployment risk.
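
To make the colour argument concrete, here is a minimal sketch of an HSV histogram feature, assuming 8 bins per channel and pixels given as (r, g, b) tuples scaled to [0, 1]. The function name `hsv_histogram`, the bin count, and the two example colours are assumptions for illustration; the project's actual binning is not stated here. The sketch uses the standard-library `colorsys` module to stay dependency-free, whereas a real pipeline would more likely vectorise this step with NumPy or an imaging library.

```python
import colorsys


def hsv_histogram(pixels, bins=8):
    """Concatenate normalised H, S and V histograms for one image patch.

    `pixels` is an iterable of (r, g, b) tuples with values in [0, 1].
    Returns a flat list of length 3 * bins; each channel's bins sum to 1.
    """
    hist = [[0] * bins for _ in range(3)]
    n = 0
    for r, g, b in pixels:
        hsv = colorsys.rgb_to_hsv(r, g, b)
        for channel, value in enumerate(hsv):
            # Clamp so that a value of exactly 1.0 falls into the last bin.
            hist[channel][min(int(value * bins), bins - 1)] += 1
        n += 1
    if n == 0:
        raise ValueError("patch contains no pixels")
    return [count / n for channel in hist for count in channel]


# Hypothetical stand-ins for stain colours (assumed values, not measured):
purple_blue = (0.35, 0.20, 0.60)  # haematoxylin-like nucleus colour
pink = (0.90, 0.55, 0.70)         # eosin-like cytoplasm colour

nucleus_patch = hsv_histogram([purple_blue] * 2500)  # 50 x 50 pixels
cytoplasm_patch = hsv_histogram([pink] * 2500)

# The first `bins` entries are the hue histogram; the two stains peak in
# different hue bins, which is the signal a classifier can pick up.
print(nucleus_patch[:8])
print(cytoplasm_patch[:8])
```

Feature vectors like these, one per patch, are what a classifier such as an SVM would be trained on. Because each bin corresponds to a readable colour range, the features stay easy to explain to a pathologist.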