Williamzhibo/BreastCancerClassifier

BreastCancerClassifier

A lightweight, from-scratch k-Nearest Neighbors (kNN) implementation (K = 5) for classifying breast cell clumps as benign or malignant, with a grayscale distance visualizer and JUnit test suite.

Figure (Pairwise Distance Visualizer): Grayscale heatmap of pairwise distances between training instances. Darker pixels indicate more similar (closer) samples. Benign clusters concentrate in the upper-left; malignant clusters in the lower-right.

Project Overview

This project is a custom-built k-Nearest Neighbors (kNN) algorithm in Java for classifying breast cancer data. It uses Euclidean distance to classify cell clumps as benign (2) or malignant (4), based on the original dataset's encoding. The project is built with Java 17 and Maven, and includes a JUnit 4 test suite for validation. A Swing-based visualizer is also included to display a distance matrix, offering a quick way to check feature separation.
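To make the distance step concrete, here is a minimal sketch of a Euclidean distance over the feature columns of two rows. The class and method names (DistanceSketch, distance) are hypothetical and not taken from the repository. It assumes integer-valued rows in the [id, features..., class] layout, so the first and last columns are skipped.

```java
final class DistanceSketch {
    /**
     * Euclidean distance between two rows, using feature columns only.
     * Assumption: index 0 holds the sample id and the last index holds
     * the class label, so both are excluded from the sum.
     */
    static double distance(int[] a, int[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("rows must have the same length");
        }
        double sum = 0.0;
        for (int i = 1; i < a.length - 1; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }
}
```

For example, two made-up rows {1, 0, 0, 2} and {2, 3, 4, 4} differ by 3 and 4 in their two feature columns, giving a distance of 5.0.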

Dataset

The training and testing data are provided as CSV files in the datasets/ directory. Each row in the files represents a sample and is formatted as [id, feature1, feature2, ..., featureN, class]. The train_data.csv file is used for model fitting, while test_data.csv is used for evaluation.
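As an illustration of this row format, the sketch below parses comma-separated lines into integer rows. It is not the repository's InputHandler; the names CsvSketch and parse are hypothetical, and it assumes every cell is an integer and that the files have no header row.

```java
import java.util.ArrayList;
import java.util.List;

final class CsvSketch {
    /**
     * Converts CSV lines into rows of the form [id, features..., class].
     * Assumptions: integer cells, comma delimiter, no header line.
     * Blank lines are skipped.
     */
    static int[][] parse(List<String> lines) {
        List<int[]> rows = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String[] cells = trimmed.split(",");
            int[] row = new int[cells.length];
            for (int i = 0; i < cells.length; i++) {
                row[i] = Integer.parseInt(cells[i].trim());
            }
            rows.add(row);
        }
        return rows.toArray(new int[0][]);
    }
}
```

The lines could be read with java.nio.file.Files.readAllLines on a path such as datasets/train_data.csv before being passed to parse.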

Architecture

The project is divided into four main components:

  • BreastCancerClassify: Contains the core kNN logic, the main entry point, and accuracy reporting.
  • InputHandler: Handles CSV data ingestion.
  • Grapher: A Swing UI for visualizing pairwise distances as a grayscale heatmap.
  • BreastCancerClassifyTest: A JUnit test suite for testing distance calculations, neighbor selection, classification, and accuracy formatting.
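The heatmap idea behind the Grapher component can be sketched without any UI code. The example below maps a distance matrix to gray levels from 0 (black, identical samples) to 255 (white, the largest distance). The linear scaling by the maximum distance is an assumption, and HeatmapSketch and toGray are hypothetical names, not the repository's API.

```java
final class HeatmapSketch {
    /**
     * Maps each distance to a gray level in [0, 255].
     * Assumption: linear scaling by the largest distance in the matrix,
     * so smaller distances come out darker.
     */
    static int[][] toGray(double[][] dist) {
        double max = 0.0;
        for (double[] row : dist) {
            for (double d : row) {
                max = Math.max(max, d);
            }
        }
        int[][] gray = new int[dist.length][];
        for (int i = 0; i < dist.length; i++) {
            gray[i] = new int[dist[i].length];
            for (int j = 0; j < dist[i].length; j++) {
                // An all-zero matrix is rendered fully black.
                gray[i][j] = max == 0.0 ? 0 : (int) Math.round(255.0 * dist[i][j] / max);
            }
        }
        return gray;
    }
}
```

Each gray level could then be painted as one pixel, for instance with new java.awt.Color(g, g, g) in a Swing component.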

How It Works

  1. The InputHandler loads the training and testing data from the CSV files.
  2. The Grapher visualizes the distances between instances in the training set.
  3. For each instance in the test set, the algorithm calculates the Euclidean distance to every instance in the training set, finds the K nearest neighbors (K = 5), and predicts the class by majority vote.
  4. The accuracy of the model is then calculated and printed to the console.
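The classification step above can be sketched as follows. This is a simplified illustration, not the repository's code: KnnSketch, distance, and classify are hypothetical names, rows are assumed to be integer arrays in the [id, features..., class] layout, and neighbors are found by sorting all training indices by distance.

```java
import java.util.Arrays;
import java.util.Comparator;

final class KnnSketch {
    static final int K = 5;          // neighbor count stated in the README
    static final int BENIGN = 2;     // class encoding from the dataset
    static final int MALIGNANT = 4;

    /** Euclidean distance over feature columns (skips id and class). */
    static double distance(int[] a, int[] b) {
        double sum = 0.0;
        for (int i = 1; i < a.length - 1; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Predicts the class of one sample by majority vote of its nearest neighbors. */
    static int classify(int[][] train, int[] sample) {
        Integer[] order = new Integer[train.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort training indices from nearest to farthest.
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> distance(train[i], sample)));

        int k = Math.min(K, train.length);
        int malignantVotes = 0;
        for (int n = 0; n < k; n++) {
            int[] neighbor = train[order[n]];
            if (neighbor[neighbor.length - 1] == MALIGNANT) {
                malignantVotes++;
            }
        }
        // With two classes and an odd k, a strict majority always exists.
        return malignantVotes * 2 > k ? MALIGNANT : BENIGN;
    }
}
```

Sorting every training row costs O(n log n) per test sample, which is simple to reason about and acceptable for small datasets.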

Testing

The JUnit 4 test suite checks the correctness of the implementation. It covers distance calculations, neighbor selection, classification logic, and accuracy formatting. Run the tests with Maven:

mvn test

Getting Started

To run the project, you'll need Java 17 and Maven 3.8+ installed.

First, build the project:

mvn clean compile

Then, run the main class:

java -cp target/classes BreastCancerClassify

This will open the Swing window and print the model's accuracy to the console.

Example Output

Model Accuracy: 80.00%

A Swing window displaying the grayscale distance matrix will also appear.
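A line in this format could be produced along the following lines. This is a sketch under assumptions: AccuracySketch and report are hypothetical names, and the inputs are parallel arrays of predicted and actual class labels.

```java
import java.util.Locale;

final class AccuracySketch {
    /**
     * Returns the share of matching labels as a percentage with two decimals.
     * Locale.ROOT keeps the decimal separator a dot on every machine.
     */
    static String report(int[] predicted, int[] actual) {
        if (predicted.length != actual.length || actual.length == 0) {
            throw new IllegalArgumentException("need two non-empty arrays of equal length");
        }
        int correct = 0;
        for (int i = 0; i < actual.length; i++) {
            if (predicted[i] == actual[i]) {
                correct++;
            }
        }
        double percent = 100.0 * correct / actual.length;
        return String.format(Locale.ROOT, "Model Accuracy: %.2f%%", percent);
    }
}
```

With four of five made-up predictions correct, report returns "Model Accuracy: 80.00%".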

Future Improvements

  • Performance: The current implementation can be optimized by using a more efficient data structure, like a min-heap, for finding the nearest neighbors.
  • Flexibility: The value of K could be made configurable through command-line arguments.
  • Metrics: Additional evaluation metrics, such as a confusion matrix, precision, recall, and F1-score, could be implemented.
  • Data Preprocessing: Options for data normalization and feature scaling could be added.
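As one possible shape for the performance item, the sketch below selects the k nearest indices with a heap instead of a full sort. Note that it uses a size-bounded max-heap, a common variant of the heap idea rather than a plain min-heap, which keeps at most k candidates in memory. HeapSelectSketch and nearestIndices are hypothetical names.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

final class HeapSelectSketch {
    /**
     * Returns the indices of the k smallest distances, nearest first.
     * The heap keeps the farthest current candidate on top, so it can be
     * evicted whenever the heap grows beyond k entries.
     */
    static int[] nearestIndices(double[] distances, int k) {
        if (k <= 0) {
            return new int[0];
        }
        PriorityQueue<Integer> heap = new PriorityQueue<>(
                Comparator.comparingDouble((Integer i) -> distances[i]).reversed());
        for (int i = 0; i < distances.length; i++) {
            heap.add(i);
            if (heap.size() > k) {
                heap.poll(); // drop the farthest candidate
            }
        }
        // Polling yields farthest first, so fill the result from the back.
        int[] result = new int[heap.size()];
        for (int pos = result.length - 1; pos >= 0; pos--) {
            result[pos] = heap.poll();
        }
        return result;
    }
}
```

This runs in O(n log k) per query rather than O(n log n), which matters mainly when the training set is much larger than K.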

About the Author

I created this project to explore the fundamentals of machine learning by implementing a classic algorithm from scratch in Java. The focus was on clear, correct, and well-tested code, with a simple visualization to aid in understanding the data.
