👉 Interactive Mod Console Sandbox: check duplicates, LSH band collisions, and vector Venn diagram overlaps live in your browser.
A standalone data ingestion and streaming engine for real-time text deduplication, built for low-latency filtering in high-throughput environments.
Avoids per-shingle string allocations by using std::string_view (part of the standard since C++17). Input streams are parsed into contiguous buffers, and each shingle is a view (pointer plus length) into that original buffer rather than a copy.
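As a rough illustration of the zero-copy idea, here is a minimal sketch of character-level shingling with std::string_view. The function name `charShingles` and the window size `k` are hypothetical and are not taken from the project's Shingler class, whose actual tokenization may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper (not the project's Shingler): returns every window of
// k consecutive characters as a view into `text`. No characters are copied;
// the only allocation is the vector holding the (pointer, length) pairs.
std::vector<std::string_view> charShingles(std::string_view text, std::size_t k) {
    assert(k > 0);
    std::vector<std::string_view> out;
    if (text.size() < k) return out;  // input too short for a single shingle
    out.reserve(text.size() - k + 1);
    for (std::size_t i = 0; i + k <= text.size(); ++i) {
        out.push_back(text.substr(i, k));  // O(1): a new view, not a new string
    }
    return out;
}
```

Because the shingles only point into the input, the underlying buffer has to outlive them. Calling `charShingles("abcdef", 3)` yields four views: `abc`, `bcd`, `cde`, and `def`.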
Uses explicit AVX2 intrinsics to compute the 128 universal hash variants of a MinHash signature in parallel SIMD lanes. A 256-bit AVX2 register holds eight 32-bit values, so several hash functions are evaluated per instruction instead of one at a time in a scalar loop.
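The sketch below shows one way such a vectorized update could look. It assumes 32-bit hash values of the form h_i(x) = a[i] * x + b[i] with wraparound, which is a simplification and not necessarily the hash family the project uses. The function names and the pointer-based layout are hypothetical. A scalar reference is included for comparison, and the AVX2 path is only compiled when the compiler targets AVX2 (for example with -mavx2).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Scalar reference: sig[i] = min(sig[i], a[i] * x + b[i]), 32-bit wraparound.
inline void minhashUpdateScalar(const std::uint32_t* a, const std::uint32_t* b,
                                std::uint32_t* sig, std::size_t n, std::uint32_t x) {
    for (std::size_t i = 0; i < n; ++i) {
        const std::uint32_t h = a[i] * x + b[i];
        if (h < sig[i]) sig[i] = h;
    }
}

// Same update, eight hash functions per iteration when AVX2 is available.
// Falls back to the scalar loop otherwise. n must be a multiple of 8.
inline void minhashUpdate(const std::uint32_t* a, const std::uint32_t* b,
                          std::uint32_t* sig, std::size_t n, std::uint32_t x) {
    assert(n % 8 == 0);
#if defined(__AVX2__)
    const __m256i vx = _mm256_set1_epi32(static_cast<int>(x));  // broadcast x
    for (std::size_t i = 0; i < n; i += 8) {
        const __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a + i));
        const __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b + i));
        const __m256i vs = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(sig + i));
        // Low 32 bits of a*x, plus b: matches the scalar wraparound arithmetic.
        const __m256i h = _mm256_add_epi32(_mm256_mullo_epi32(va, vx), vb);
        // Unsigned per-lane minimum keeps the smallest hash seen so far.
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(sig + i), _mm256_min_epu32(vs, h));
    }
#else
    minhashUpdateScalar(a, b, sig, n, x);
#endif
}
```

In this sketch a signature starts with every slot set to 0xffffffff, and `minhashUpdate` is called once per shingle hash `x`. With n = 128 the AVX2 loop runs 16 iterations per shingle instead of 128 scalar steps.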
Replaces external database calls with an in-memory Concurrent, Lock-Free Sharded Hash Map. Keys are partitioned across shards so that threads working on different shards do not contend with each other, which suits high-concurrency workloads.
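To illustrate the sharding idea only, here is a minimal sketch that guards each shard with its own std::mutex. This is a deliberately simpler, lock-based stand-in and is not lock-free, so it should not be read as the project's index. The class name `ShardedIndex`, the 64-bit bucket keys, and the method signatures are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <mutex>
#include <unordered_map>
#include <vector>

// Hypothetical sharded index: each bucket key maps to the post IDs stored
// under it. A key always lands in the same shard, and each shard has its own
// lock, so operations on different shards never block each other.
class ShardedIndex {
public:
    explicit ShardedIndex(std::size_t numShards) : shards_(numShards) {
        assert(numShards > 0);
    }

    void insert(std::uint64_t bucketKey, std::uint64_t postId) {
        Shard& s = shards_[shardOf(bucketKey)];
        std::lock_guard<std::mutex> lock(s.mu);  // locks this shard only
        s.buckets[bucketKey].push_back(postId);
    }

    // Returns a copy so the caller holds no reference into a locked shard.
    std::vector<std::uint64_t> query(std::uint64_t bucketKey) const {
        const Shard& s = shards_[shardOf(bucketKey)];
        std::lock_guard<std::mutex> lock(s.mu);
        const auto it = s.buckets.find(bucketKey);
        if (it == s.buckets.end()) return {};
        return it->second;
    }

private:
    struct Shard {
        mutable std::mutex mu;
        std::unordered_map<std::uint64_t, std::vector<std::uint64_t>> buckets;
    };

    std::size_t shardOf(std::uint64_t key) const {
        return std::hash<std::uint64_t>{}(key) % shards_.size();
    }

    std::vector<Shard> shards_;
};
```

With more shards than worker threads, two threads rarely need the same lock at the same moment. A lock-free design would replace the per-shard mutex with atomic operations, at the cost of considerably more complex code.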
- Throughput: 22,000+ tokenized streams/sec.
- Memory Footprint: Reduced by 65% compared to baseline implementations.
- Speedup: 4.2x faster MinHash generation via SIMD.
- C++20 compatible compiler (GCC 10+, Clang 10+)
- CMake 3.16+
- CPU with AVX2 support
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make
./bench_repost

#include "shingler.h"
#include "minhash.h"
#include "lsh_index.h"
// ... initialize hasher and index ...
auto shingles = Shingler::computeShingles(text);
Signature sig;
hasher.computeSignature(shingles, sig);
auto candidates = index.query(sig);
index.insert(postId, sig);
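For readers unfamiliar with LSH banding, the following sketch shows how a signature could be turned into per-band bucket keys, so that two posts become candidates when any band key matches. How the project's `index.query` and `index.insert` work internally is not shown here. The function `bandKeys`, the `rowsPerBand` parameter, and the FNV-1a-style mixing are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: splits a MinHash signature into bands of `rowsPerBand`
// values and hashes each band to one 64-bit key. The band index is mixed in
// first, so equal values in different bands produce different keys.
std::vector<std::uint64_t> bandKeys(const std::vector<std::uint32_t>& sig,
                                    std::size_t rowsPerBand) {
    assert(rowsPerBand > 0 && sig.size() % rowsPerBand == 0);
    constexpr std::uint64_t kOffset = 14695981039346656037ull;  // FNV-1a offset basis
    constexpr std::uint64_t kPrime = 1099511628211ull;          // FNV-1a prime
    std::vector<std::uint64_t> keys;
    keys.reserve(sig.size() / rowsPerBand);
    for (std::size_t start = 0; start < sig.size(); start += rowsPerBand) {
        std::uint64_t h = (kOffset ^ (start / rowsPerBand)) * kPrime;  // mix band index
        for (std::size_t r = 0; r < rowsPerBand; ++r) {
            const std::uint32_t v = sig[start + r];
            for (int shift = 0; shift < 32; shift += 8) {  // feed the value byte by byte
                h ^= (v >> shift) & 0xffu;
                h *= kPrime;
            }
        }
        keys.push_back(h);
    }
    return keys;
}
```

With b bands of r rows each, two signatures that agree on a fraction s of their positions share at least one band key with probability roughly 1 - (1 - s^r)^b. A 128-value signature could, for example, be split into 32 bands of 4 rows, though the split this project uses is not stated.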