WikiFind is a command-line search engine for Wikipedia XML dumps. It allows you to index large Wikipedia datasets and perform fast searches on them.

Features:
- Index Wikipedia XML dumps
- Perform keyword searches
- Stemming for better search results
- Support for various Wikipedia markup elements (categories, infoboxes, links)
- Efficient compression for index storage

Requirements:
- Go 1.24 or later
- A Wikipedia XML dump file (e.g., from https://dumps.wikimedia.org/)

Clone the repository:
git clone https://github.com/PhantomInTheWire/wikifind.git
cd wikifind

Build the project:
go build -o wikifind ./cmd

To index a Wikipedia XML dump:
./wikifind index <xml_file> <index_path>
- <xml_file>: Path to the Wikipedia XML dump file
- <index_path>: Directory where the index will be stored

Example:
./wikifind index enwiki-20231201-pages-articles.xml index/

To search the indexed data:
./wikifind search <index_path>
- <index_path>: Directory containing the index

This starts an interactive search prompt. Enter a query at the > prompt to see the matching documents and their scores.
Example:
./wikifind search index/
> apple
Found 5 results:
1. DocID: Apple (Score: 0.95)
...

The project is organized into several packages:
- cmd/: Main application entry point
- indexer/: Indexing logic, including XML parsing, text processing, and inverted index creation
- search/: Search engine implementation with compression and query processing

Run the tests:
go test ./...

For test coverage:
go test -cover ./...

To contribute:
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Run the pre-commit hooks
- Submit a pull request
This project is licensed under the MIT License.