BreastCancerClassifier

A lightweight, from-scratch k-Nearest Neighbors (kNN) implementation (K = 5) for classifying breast cell clumps as benign or malignant, with a grayscale distance visualizer and JUnit test suite.

Above: Grayscale heatmap of pairwise distances between training instances. Darker pixels indicate more similar (closer) samples. Benign clusters concentrate in the upper-left; malignant clusters in the lower-right.

Project Overview

This project is a custom-built k-Nearest Neighbors (kNN) algorithm in Java for classifying breast cancer data. It uses Euclidean distance to classify cell clumps as benign (2) or malignant (4), based on the original dataset's encoding. The project is built with Java 17 and Maven, and includes a JUnit 4 test suite for validation. A Swing-based visualizer is also included to display a distance matrix, offering a quick way to check feature separation.
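As an illustration of the distance measure mentioned above, here is a minimal sketch of a Euclidean distance between two samples. The class and method names are hypothetical, not taken from the project's source. It assumes each row follows the layout [id, feature1, ..., featureN, class], so the first and last columns are left out of the calculation.

```java
final class DistanceSketch {
    // Hypothetical helper: Euclidean distance over the feature columns only.
    // Assumes index 0 holds the sample id and the last index holds the class label.
    static double euclideanDistance(int[] a, int[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("rows must have the same length");
        }
        double sum = 0.0;
        for (int i = 1; i < a.length - 1; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }
}
```

Skipping the id column matters here: ids are arbitrary large numbers and would otherwise dominate the distance.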

Dataset

The training and testing data are provided as CSV files in the datasets/ directory. Each row in the files represents a sample and is formatted as [id, feature1, feature2, ..., featureN, class]. The train_data.csv file is used for model fitting, while test_data.csv is used for evaluation.
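To make the row layout concrete, the sketch below parses one CSV line into an integer array. It is not the project's InputHandler; the names are hypothetical, and it assumes every cell is a plain integer with no header row and no missing values.

```java
final class CsvRowSketch {
    // Hypothetical parser for a single line shaped like "id,feature1,...,featureN,class".
    // Assumes all cells are integers; anything else throws NumberFormatException.
    static int[] parseRow(String line) {
        String[] cells = line.trim().split(",");
        int[] row = new int[cells.length];
        for (int i = 0; i < cells.length; i++) {
            row[i] = Integer.parseInt(cells[i].trim());
        }
        return row;
    }
}
```

With this layout, `row[0]` is the id and `row[row.length - 1]` is the class label (2 or 4).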

Architecture

The project is divided into four main components:

BreastCancerClassify: Contains the core kNN logic, the main entry point, and accuracy reporting.
InputHandler: Handles CSV data ingestion.
Grapher: A Swing UI for visualizing pairwise distances as a grayscale heatmap.
BreastCancerClassifyTest: A JUnit test suite for testing distance calculations, neighbor selection, classification, and accuracy formatting.
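The heatmap idea behind Grapher can be sketched without any Swing code: scale each pairwise distance into the 0 to 255 gray range so that smaller distances give darker pixels. The class and method names below are hypothetical, and scaling by the largest distance in the matrix is an assumption; the actual Grapher may normalize differently.

```java
final class GrayscaleSketch {
    // Hypothetical mapping from a distance matrix to gray levels (0 = black, 255 = white).
    // Assumes non-negative distances and scales by the maximum value found.
    static int[][] toGrayLevels(double[][] distances) {
        double max = 0.0;
        for (double[] row : distances) {
            for (double d : row) {
                max = Math.max(max, d);
            }
        }
        int[][] gray = new int[distances.length][];
        for (int r = 0; r < distances.length; r++) {
            gray[r] = new int[distances[r].length];
            for (int c = 0; c < distances[r].length; c++) {
                // An all-zero matrix maps to black everywhere.
                gray[r][c] = max == 0.0 ? 0 : (int) Math.round(255.0 * distances[r][c] / max);
            }
        }
        return gray;
    }
}
```

Each gray level could then be drawn as a pixel, for example with `new java.awt.Color(g, g, g)`.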

How It Works

The InputHandler loads the training and testing data from the CSV files.
The Grapher visualizes the distances between instances in the training set.
For each instance in the test set, the algorithm calculates the distance to all instances in the training set, finds the K-nearest neighbors, and predicts the class based on a majority vote.
The accuracy of the model is then calculated and printed to the console.
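The classification and accuracy steps above can be sketched as follows. This is an illustrative version under stated assumptions, not the code in BreastCancerClassify: all names are hypothetical, rows are assumed to be laid out as [id, features..., class], neighbors are found by sorting all training rows by distance, and a tie (only possible with an even k) falls back to benign as an arbitrary choice.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Locale;

final class KnnSketch {
    static final int BENIGN = 2;
    static final int MALIGNANT = 4;

    // Euclidean distance over feature columns (skips id at index 0 and class at the end).
    static double distance(int[] a, int[] b) {
        double sum = 0.0;
        for (int i = 1; i < a.length - 1; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Predicts the class of one sample by majority vote among its k nearest training rows.
    static int classify(int[][] train, int[] sample, int k) {
        Integer[] order = new Integer[train.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> distance(train[i], sample)));
        int limit = Math.min(k, train.length);
        int malignantVotes = 0;
        for (int n = 0; n < limit; n++) {
            int[] neighbor = train[order[n]];
            if (neighbor[neighbor.length - 1] == MALIGNANT) {
                malignantVotes++;
            }
        }
        // Strict majority needed for malignant; ties default to benign (assumption).
        return malignantVotes * 2 > limit ? MALIGNANT : BENIGN;
    }

    // Compares predictions with the labels in the test rows and formats the percentage.
    static String accuracyReport(int[][] train, int[][] test, int k) {
        int correct = 0;
        for (int[] row : test) {
            if (classify(train, row, k) == row[row.length - 1]) {
                correct++;
            }
        }
        double percent = 100.0 * correct / test.length;
        return String.format(Locale.ROOT, "Model Accuracy: %.2f%%", percent);
    }
}
```

Sorting every training row costs O(n log n) per test sample, which is simple to read but not the fastest way to select neighbors.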

Testing

The project includes a JUnit 4 test suite that checks distance calculations, neighbor selection, classification logic, and accuracy formatting. You can run the tests using Maven:

mvn test

Getting Started

To run the project, you'll need Java 17 and Maven 3.8+ installed.

First, build the project:

mvn clean compile

Then, run the main class:

java -cp target/classes BreastCancerClassify

This will open the Swing window and print the model's accuracy to the console.

Example Output

Model Accuracy: 80.00%

A Swing window displaying the grayscale distance matrix will also appear.

Future Improvements

Performance: The current implementation can be optimized by using a more efficient data structure, like a min-heap, for finding the nearest neighbors.
Flexibility: The value of K could be made configurable through command-line arguments.
Metrics: Additional evaluation metrics, such as a confusion matrix, precision, recall, and F1-score, could be implemented.
Data Preprocessing: Options for data normalization and feature scaling could be added.
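As a rough sketch of the heap idea from the Performance item, the example below selects the k smallest distances with a size-bounded heap. Note that it uses a max-heap capped at k entries rather than a min-heap over all entries; this is one common variant, swapped in here because it keeps only k candidates in memory. The class and method names are hypothetical.

```java
import java.util.PriorityQueue;

final class NearestHeapSketch {
    // Returns the indices of the k smallest distances, nearest first.
    // The heap's root is always the farthest of the current candidates,
    // so it is evicted whenever the heap grows beyond k entries.
    static int[] kNearestIndices(double[] distances, int k) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(
                (a, b) -> Double.compare(distances[b], distances[a]));
        for (int i = 0; i < distances.length; i++) {
            heap.offer(i);
            if (heap.size() > k) {
                heap.poll();
            }
        }
        // Polling yields farthest first, so fill the result from the back.
        int[] result = new int[heap.size()];
        for (int i = result.length - 1; i >= 0; i--) {
            result[i] = heap.poll();
        }
        return result;
    }
}
```

This runs in O(n log k) per query instead of the O(n log n) needed to sort every training distance.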

About the Author

This project was created by me as a way to explore the fundamentals of machine learning by implementing a classic algorithm from scratch in Java. The focus was on creating clear, correct, and well-tested code, with a simple visualization to aid in understanding the data.
