A high-performance semantic text deduplication tool built on Transformer + FAISS, aimed at large-scale corpora
Updated Aug 12, 2025 - Python
Tool to deduplicate file contents
High-performance website content extractor
A hyper-optimized, ultra-low-latency real-time text deduplication engine using C++20, AVX2 SIMD vectorization, MinHash, and Locality-Sensitive Hashing (LSH) for sub-millisecond duplicate checks on large-scale streams.
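For readers unfamiliar with the MinHash technique named in the description above, the following is a minimal Python sketch of the general idea. It is not taken from any repository listed here; the function names (`shingles`, `minhash_signature`, `estimate_jaccard`), the 3-word shingle size, and the use of 64 seeded BLAKE2b hashes are all assumptions chosen for illustration. An LSH step would normally band these signatures into buckets so that only likely duplicates are compared, and that step is omitted here.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(seed, item):
    # Hypothetical helper: a seeded 64-bit hash built from BLAKE2b,
    # standing in for a family of independent hash functions.
    digest = hashlib.blake2b(
        item.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_perm=64):
    """Keep the minimum hash value of the set under each seeded hash."""
    return [min(_seeded_hash(seed, it) for it in items) for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog"
    doc_b = "the quick brown fox jumps over the lazy cat"
    sig_a = minhash_signature(shingles(doc_a))
    sig_b = minhash_signature(shingles(doc_b))
    print(estimate_jaccard(sig_a, sig_b))
```

Two documents would be flagged as near-duplicates when the estimate exceeds a chosen threshold; the threshold and the number of hash functions trade accuracy against speed.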