A high-performance C++ program for computing the similarity between text documents. Designed for tasks such as plagiarism detection, duplicate content analysis, and document clustering, this tool delivers accuracy and efficiency for both small and large datasets.
- Multiple Similarity Metrics: Includes algorithms like cosine similarity, Jaccard similarity, and more.
- Text Preprocessing: Handles tokenization, case normalization, stopword removal, and stemming.
- Scalable: Optimized for handling large datasets and multiple comparisons.
- Configurable: Easy to extend or modify to suit specific text analysis needs.
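To make the metrics listed above concrete, here is a minimal C++17 sketch of tokenization with case normalization, Jaccard similarity, and cosine similarity. It is an illustration only, not the project's actual implementation: the function names `tokenize`, `jaccardSimilarity`, and `cosineSimilarity` are hypothetical, the cosine variant uses raw term counts as weights (an assumption), and stopword removal and stemming are omitted.

```cpp
#include <cctype>
#include <cmath>
#include <cstddef>
#include <set>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: lowercase the text and split it on any
// non-alphanumeric character.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char ch : text) {
        if (std::isalnum(ch)) {
            current.push_back(static_cast<char>(std::tolower(ch)));
        } else if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}

// Jaccard similarity: |A intersect B| / |A union B| over distinct tokens.
double jaccardSimilarity(const std::vector<std::string>& a,
                         const std::vector<std::string>& b) {
    const std::set<std::string> setA(a.begin(), a.end());
    const std::set<std::string> setB(b.begin(), b.end());
    if (setA.empty() && setB.empty()) return 1.0;  // convention chosen for this sketch
    std::size_t common = 0;
    for (const auto& token : setA) {
        if (setB.count(token) > 0) ++common;
    }
    const std::size_t unionSize = setA.size() + setB.size() - common;
    return static_cast<double>(common) / static_cast<double>(unionSize);
}

// Cosine similarity between raw term-frequency vectors.
double cosineSimilarity(const std::vector<std::string>& a,
                        const std::vector<std::string>& b) {
    std::unordered_map<std::string, int> freqA, freqB;
    for (const auto& token : a) ++freqA[token];
    for (const auto& token : b) ++freqB[token];

    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (const auto& entry : freqA) {
        normA += static_cast<double>(entry.second) * entry.second;
        const auto it = freqB.find(entry.first);
        if (it != freqB.end()) {
            dot += static_cast<double>(entry.second) * it->second;
        }
    }
    for (const auto& entry : freqB) {
        normB += static_cast<double>(entry.second) * entry.second;
    }
    if (normA == 0.0 || normB == 0.0) return 0.0;  // an empty document matches nothing
    return dot / (std::sqrt(normA) * std::sqrt(normB));
}
```

With these definitions, "the cat sat" and "the cat ran" share two of four distinct tokens, so the Jaccard score is 0.5, while the cosine score is 2/3 because each document has three terms with a count of one.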
Document_Similarity/
├── src/ # Source code files
├── include/ # Header files
├── samples/ # Example input documents
├── build/ # Directory for compiled files (generated)
├── Makefile # Build system configuration
├── README.md # Project documentation
└── LICENSE # License information
Before using the program, ensure you have the following installed:
- A compiler supporting C++17 or newer (for example, GCC or Clang): Required for compilation.
- Make: For building the project.
- CMake (optional): For advanced build configuration.
- Clone the repository:
    git clone https://github.com/Mohammed-3tef/Document_Similarity.git
    cd Document_Similarity
- Compile the program:
  - Using Make:
      make
  - Using CMake:
      mkdir build && cd build
      cmake ..
      make
- The executable file will be created in the build/ directory or the project root.
We welcome contributions from the community! To contribute:
- Fork the repository.
- Create a feature branch:
    git checkout -b feature-name
- Commit your changes:
    git commit -m "Add feature or fix a bug"
- Push to your fork and open a pull request.
- Name: Mohammed Atef Abd El-Kader
- ID: 20231143
- Version: 1.0
- Date: 15 Nov. 2024
This project is licensed under the MIT License. You are free to use, modify, and distribute this software under the terms of the license.