A lightweight, from-scratch k-Nearest Neighbors (kNN) implementation (K = 5) for classifying breast cell clumps as benign or malignant, with a grayscale distance visualizer and JUnit test suite.
Above: Grayscale heatmap of pairwise distances between training instances. Darker pixels indicate more similar (closer) samples. Benign clusters concentrate in the upper-left; malignant clusters in the lower-right.
This project is a custom-built k-Nearest Neighbors (kNN) algorithm in Java for classifying breast cancer data. It uses Euclidean distance to classify cell clumps as benign (2) or malignant (4), based on the original dataset's encoding. The project is built with Java 17 and Maven, and includes a JUnit 4 test suite for validation. A Swing-based visualizer is also included to display a distance matrix, offering a quick way to check feature separation.
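As a rough illustration of the distance measure mentioned above, the following minimal sketch computes the Euclidean distance between two feature vectors. The class and method names (`EuclideanDistanceSketch`, `distance`) are hypothetical and not taken from the project. The sketch assumes the id and class columns have already been stripped, so only feature values are compared.

```java
final class EuclideanDistanceSketch {

    /** Euclidean distance between two equal-length feature vectors. */
    static double distance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Feature vectors must have the same length");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        // Made-up feature values, not rows from the dataset.
        double[] first = {1.0, 2.0, 3.0};
        double[] second = {4.0, 6.0, 3.0};
        System.out.println(distance(first, second)); // prints 5.0
    }
}
```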
The training and testing data are provided as CSV files in the datasets/ directory. Each row in the files represents a sample and is formatted as [id, feature1, feature2, ..., featureN, class]. The train_data.csv file is used for model fitting, while test_data.csv is used for evaluation.
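To make the row layout concrete, here is a minimal sketch of parsing one CSV line into an id, a feature array, and a class label. It is not the project's `InputHandler`; the `CsvRowSketch` class, the `Sample` record, and `parseLine` are hypothetical names. The sketch assumes integer ids and labels, no header row, and no quoted fields.

```java
import java.util.ArrayList;
import java.util.List;

final class CsvRowSketch {

    /** One parsed sample: identifier, feature values, and class label (2 or 4). */
    record Sample(int id, double[] features, int label) {}

    /** Parses a line of the form id,feature1,...,featureN,class. */
    static Sample parseLine(String line) {
        String[] parts = line.trim().split(",");
        if (parts.length < 3) {
            throw new IllegalArgumentException("Expected id, at least one feature, and class: " + line);
        }
        int id = Integer.parseInt(parts[0].trim());
        int label = Integer.parseInt(parts[parts.length - 1].trim());
        // Everything between the first and last column is a feature.
        double[] features = new double[parts.length - 2];
        for (int i = 0; i < features.length; i++) {
            features[i] = Double.parseDouble(parts[i + 1].trim());
        }
        return new Sample(id, features, label);
    }

    /** Parses every non-blank line; assumes the file has no header row. */
    static List<Sample> parseLines(List<String> lines) {
        List<Sample> samples = new ArrayList<>();
        for (String line : lines) {
            if (!line.isBlank()) {
                samples.add(parseLine(line));
            }
        }
        return samples;
    }
}
```

The lines could come from something like `Files.readAllLines(Path.of("datasets/train_data.csv"))`. Keeping the id separate from the features matters because an identifier column would otherwise distort the distance calculation.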
The project is divided into four main components:
- BreastCancerClassify: Contains the core kNN logic, the main entry point, and accuracy reporting.
- InputHandler: Handles CSV data ingestion.
- Grapher: A Swing UI for visualizing pairwise distances as a grayscale heatmap.
- BreastCancerClassifyTest: A JUnit test suite for testing distance calculations, neighbor selection, classification, and accuracy formatting.
- The InputHandler loads the training and testing data from the CSV files.
- The Grapher visualizes the distances between instances in the training set.
- For each instance in the test set, the algorithm calculates the distance to all instances in the training set, finds the K nearest neighbors, and predicts the class based on a majority vote.
- The accuracy of the model is then calculated and printed to the console.
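The classification and accuracy steps above can be sketched roughly as follows. This is an illustrative outline, not the project's actual code: `KnnSketch`, `classify`, and `formatAccuracy` are hypothetical names, and the data in `main` is made up. It assumes features are already separated from ids and labels, and that ties (possible only for even K) resolve to benign.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Locale;

final class KnnSketch {
    static final int BENIGN = 2;
    static final int MALIGNANT = 4;

    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Predicts the label of query by majority vote among its k nearest training rows. */
    static int classify(double[][] trainFeatures, int[] trainLabels, double[] query, int k) {
        int n = trainFeatures.length;
        double[] distances = new double[n];
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            distances[i] = distance(trainFeatures[i], query);
            order[i] = i;
        }
        // Sort training indices by distance to the query, nearest first.
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> distances[i]));
        int neighbors = Math.min(k, n);
        int malignantVotes = 0;
        for (int rank = 0; rank < neighbors; rank++) {
            if (trainLabels[order[rank]] == MALIGNANT) {
                malignantVotes++;
            }
        }
        // Assumption: a tie falls back to benign.
        return 2 * malignantVotes > neighbors ? MALIGNANT : BENIGN;
    }

    /** Formats the share of matching labels as a percentage with two decimals. */
    static String formatAccuracy(int[] predicted, int[] actual) {
        int correct = 0;
        for (int i = 0; i < actual.length; i++) {
            if (predicted[i] == actual[i]) {
                correct++;
            }
        }
        return String.format(Locale.ROOT, "%.2f%%", 100.0 * correct / actual.length);
    }

    public static void main(String[] args) {
        double[][] train = {{1, 1}, {1, 2}, {2, 1}, {8, 9}, {9, 8}, {9, 9}};
        int[] labels = {BENIGN, BENIGN, BENIGN, MALIGNANT, MALIGNANT, MALIGNANT};
        int[] predicted = {
            classify(train, labels, new double[]{2, 2}, 3),
            classify(train, labels, new double[]{8, 8}, 3)
        };
        System.out.println("Model Accuracy: " + formatAccuracy(predicted, new int[]{BENIGN, MALIGNANT}));
    }
}
```

Sorting every training row costs O(n log n) per query. That is simple to verify in tests, but it is not the only way to select neighbors.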
The project includes a JUnit 4 test suite that covers distance calculations, neighbor selection, classification logic, and accuracy formatting. You can run the tests using Maven:

    mvn test

To run the project, you'll need Java 17 and Maven 3.8+ installed.
First, build the project:

    mvn clean compile

Then, run the main class:

    java -cp target/classes BreastCancerClassify

This opens a Swing window displaying the grayscale distance matrix and prints the model's accuracy to the console:

    Model Accuracy: 80.00%
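For readers curious how a distance matrix could become grayscale pixels, here is a minimal sketch. It is not the project's `Grapher`; `DistanceHeatmapSketch` and `toGrayscale` are hypothetical names. It assumes distances are scaled linearly against the largest value in the matrix, so that a distance of zero renders black and the largest distance renders white. This matches the "darker means closer" convention in the caption.

```java
import java.awt.image.BufferedImage;

final class DistanceHeatmapSketch {

    /** Renders a square, non-empty distance matrix as a grayscale image, one pixel per pair. */
    static BufferedImage toGrayscale(double[][] distances) {
        int n = distances.length;
        double max = 0.0;
        for (double[] row : distances) {
            for (double d : row) {
                max = Math.max(max, d);
            }
        }
        BufferedImage image = new BufferedImage(n, n, BufferedImage.TYPE_INT_RGB);
        for (int row = 0; row < n; row++) {
            for (int col = 0; col < n; col++) {
                // Scale to 0..255; an all-zero matrix stays black.
                int gray = max == 0.0 ? 0 : (int) Math.round(255.0 * distances[row][col] / max);
                int rgb = (gray << 16) | (gray << 8) | gray;
                image.setRGB(col, row, rgb);
            }
        }
        return image;
    }
}
```

The resulting image could then be drawn in a Swing component, for example through `Graphics.drawImage` inside a `paintComponent` override.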
- Performance: The current implementation can be optimized by using a more efficient data structure, like a min-heap, for finding the nearest neighbors.
- Flexibility: The value of K could be made configurable through command-line arguments.
- Metrics: Additional evaluation metrics, such as a confusion matrix, precision, recall, and F1-score, could be implemented.
- Data Preprocessing: Options for data normalization and feature scaling could be added.
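As one possible direction for the performance item, the sketch below selects the K nearest neighbors with a heap instead of a full sort. The list mentions a min-heap. This sketch uses a closely related variant, a max-heap bounded to K entries, which keeps the K smallest distances seen so far in O(n log K) time. `HeapNeighborsSketch` and `kNearest` are hypothetical names, and the distances are assumed to be precomputed.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

final class HeapNeighborsSketch {

    /** Returns the indices of the k smallest distances, ordered nearest first. */
    static int[] kNearest(double[] distances, int k) {
        if (k <= 0) {
            return new int[0];
        }
        // Max-heap on distance: the farthest of the current candidates sits on top.
        Comparator<Integer> farthestFirst =
                Comparator.comparingDouble((Integer i) -> distances[i]).reversed();
        PriorityQueue<Integer> heap = new PriorityQueue<>(farthestFirst);
        for (int i = 0; i < distances.length; i++) {
            if (heap.size() < k) {
                heap.add(i);
            } else if (distances[i] < distances[heap.peek()]) {
                // Closer than the current farthest candidate: swap it in.
                heap.poll();
                heap.add(i);
            }
        }
        // Polling yields farthest first, so fill the result from the back.
        int[] result = new int[heap.size()];
        for (int pos = result.length - 1; pos >= 0; pos--) {
            result[pos] = heap.poll();
        }
        return result;
    }

    public static void main(String[] args) {
        double[] distances = {0.9, 0.1, 0.5, 0.3, 0.7};
        System.out.println(java.util.Arrays.toString(kNearest(distances, 3))); // prints [1, 3, 2]
    }
}
```

With K fixed at 5, the heap never holds more than five indices, so memory stays constant regardless of training set size.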
I created this project as a way to explore the fundamentals of machine learning by implementing a classic algorithm from scratch in Java. The focus was on creating clear, correct, and well-tested code, with a simple visualization to aid in understanding the data.