harsharajkumar-273/Repost-Radar

Repost Radar Engine (Ultra-Low Latency)

Live Sandbox

👉 Interactive Mod Console Sandbox: check duplicates, inspect LSH band collisions, and view vector Venn diagram overlaps live in your browser.

A standalone data ingestion and streaming engine for real-time text deduplication, built for low-latency filtering in high-throughput pipelines.

Advanced Architecture & Implementation

1. Zero-Copy Token Shingling

Avoids per-shingle string allocations by using std::string_view (standardized in C++17; the project itself builds as C++20). Input streams are parsed into contiguous buffers, and each shingle is a view into the original buffer rather than a copy of its characters.
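A minimal sketch of the idea, assuming whitespace-separated tokens and word-level shingles of k tokens. The names tokenize and make_shingles are hypothetical and are not the project's Shingler API; the point is only that every shingle is a view into the caller's buffer.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper: split `text` on ASCII spaces. Each token is a view
// into the caller's buffer, so no characters are copied.
std::vector<std::string_view> tokenize(std::string_view text) {
    std::vector<std::string_view> tokens;
    std::size_t pos = 0;
    while (pos < text.size()) {
        while (pos < text.size() && text[pos] == ' ') ++pos;  // skip separators
        const std::size_t start = pos;
        while (pos < text.size() && text[pos] != ' ') ++pos;  // consume one token
        if (pos > start) tokens.push_back(text.substr(start, pos - start));
    }
    return tokens;
}

// Hypothetical helper: build shingles of k consecutive tokens. A shingle spans
// from the first byte of token i to the last byte of token i + k - 1.
std::vector<std::string_view> make_shingles(std::string_view text, std::size_t k) {
    const std::vector<std::string_view> tokens = tokenize(text);
    std::vector<std::string_view> shingles;
    if (k == 0 || tokens.size() < k) return shingles;
    shingles.reserve(tokens.size() - k + 1);
    for (std::size_t i = 0; i + k <= tokens.size(); ++i) {
        const char* begin = tokens[i].data();
        const char* end = tokens[i + k - 1].data() + tokens[i + k - 1].size();
        shingles.emplace_back(begin, static_cast<std::size_t>(end - begin));
    }
    return shingles;
}
```

The vectors of views still allocate; what is avoided is copying the text itself. The views are valid only while the original buffer is alive, so the caller has to keep the input around until hashing is done.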

2. SIMD Vectorized MinHash Generation

Uses explicit AVX2 intrinsics to evaluate the 128 universal hash functions of a MinHash signature several lanes at a time in 256-bit registers. This replaces a sequential loop over hash functions with data-parallel instructions.
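The vectorized update can be sketched as follows. This is an illustration under several assumptions, not the project's hasher: 32-bit lanes (8 per AVX2 register), a multiply-add hash h_i(x) = a_i * x + b_i modulo 2^32, FNV-1a as the base shingle hash, and hypothetical names (HashParams, make_params, compute_signature, estimate_jaccard). A scalar fallback is compiled when AVX2 is not enabled, for example when building without -mavx2.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string_view>
#include <vector>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

constexpr std::size_t kNumHashes = 128;  // signature length stated in the README
using Signature = std::array<std::uint32_t, kNumHashes>;

// Hypothetical parameter block: one (a, b) pair per hash function.
struct HashParams {
    std::array<std::uint32_t, kNumHashes> a;
    std::array<std::uint32_t, kNumHashes> b;
};

// Deterministic parameters from xorshift32 (illustrative, not cryptographic).
HashParams make_params(std::uint32_t seed) {
    HashParams p{};
    std::uint32_t s = seed ? seed : 1u;
    auto next = [&s]() { s ^= s << 13; s ^= s >> 17; s ^= s << 5; return s; };
    for (std::size_t i = 0; i < kNumHashes; ++i) {
        p.a[i] = next() | 1u;  // odd multiplier keeps x -> a*x+b a bijection mod 2^32
        p.b[i] = next();
    }
    return p;
}

// 32-bit FNV-1a, used here as the base hash of a shingle (an assumption).
std::uint32_t fnv1a(std::string_view s) {
    std::uint32_t h = 2166136261u;
    for (unsigned char c : s) { h ^= c; h *= 16777619u; }
    return h;
}

// For every shingle, evaluate all 128 hash functions and keep the per-slot minimum.
Signature compute_signature(const std::vector<std::string_view>& shingles,
                            const HashParams& p) {
    Signature sig;
    sig.fill(std::numeric_limits<std::uint32_t>::max());
    for (std::string_view sh : shingles) {
        const std::uint32_t x = fnv1a(sh);
#if defined(__AVX2__)
        const __m256i vx = _mm256_set1_epi32(static_cast<int>(x));
        for (std::size_t i = 0; i < kNumHashes; i += 8) {  // 8 hash functions per step
            const __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(&p.a[i]));
            const __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(&p.b[i]));
            const __m256i vs = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(&sig[i]));
            const __m256i h = _mm256_add_epi32(_mm256_mullo_epi32(va, vx), vb);
            _mm256_storeu_si256(reinterpret_cast<__m256i*>(&sig[i]), _mm256_min_epu32(vs, h));
        }
#else
        for (std::size_t i = 0; i < kNumHashes; ++i) {  // scalar fallback, same arithmetic
            const std::uint32_t h = p.a[i] * x + p.b[i];
            if (h < sig[i]) sig[i] = h;
        }
#endif
    }
    return sig;
}

// Fraction of matching slots, the usual MinHash estimate of Jaccard similarity.
double estimate_jaccard(const Signature& x, const Signature& y) {
    std::size_t equal = 0;
    for (std::size_t i = 0; i < kNumHashes; ++i) equal += (x[i] == y[i]) ? 1 : 0;
    return static_cast<double>(equal) / static_cast<double>(kNumHashes);
}
```

Both paths compute the same unsigned 32-bit arithmetic, so a signature does not depend on which one was compiled. With 32-bit lanes, 128 hash functions take 16 register-wide steps per shingle rather than 128 scalar ones.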

3. Slab-Allocated In-Memory LSH Index

Replaces external database calls with a concurrent, sharded in-memory hash map. Keys are spread across independent shards, so threads that touch different shards do not contend on the same mutex. The index is designed for high-concurrency workloads.
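One way to picture the sharded index is the sketch below. It makes several assumptions that the README does not confirm: 32 bands of 4 rows each, 16 shards, one std::mutex per shard, and standard containers in place of a slab allocator. ShardedLshIndex, PostId, and band_key are hypothetical names.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <unordered_map>
#include <unordered_set>
#include <vector>

constexpr std::size_t kNumHashes = 128;             // signature length from the README
constexpr std::size_t kBands = 32;                  // assumption: 32 bands
constexpr std::size_t kRows = kNumHashes / kBands;  // 4 rows per band
constexpr std::size_t kShards = 16;                 // assumption: 16 shards
using Signature = std::array<std::uint32_t, kNumHashes>;
using PostId = std::uint64_t;

class ShardedLshIndex {
public:
    // Register `id` under every band key of its signature.
    void insert(PostId id, const Signature& sig) {
        for (std::size_t band = 0; band < kBands; ++band) {
            const std::uint64_t key = band_key(sig, band);
            Shard& shard = shards_[key % kShards];
            std::lock_guard<std::mutex> lock(shard.mutex);  // locks one shard only
            shard.buckets[key].push_back(id);
        }
    }

    // Collect every post that shares at least one band with `sig`.
    std::unordered_set<PostId> query(const Signature& sig) {
        std::unordered_set<PostId> candidates;
        for (std::size_t band = 0; band < kBands; ++band) {
            const std::uint64_t key = band_key(sig, band);
            Shard& shard = shards_[key % kShards];
            std::lock_guard<std::mutex> lock(shard.mutex);
            const auto it = shard.buckets.find(key);
            if (it != shard.buckets.end()) {
                candidates.insert(it->second.begin(), it->second.end());
            }
        }
        return candidates;
    }

private:
    struct Shard {
        std::mutex mutex;
        std::unordered_map<std::uint64_t, std::vector<PostId>> buckets;
    };

    // 64-bit FNV-1a over the rows of one band, seeded with the band index so
    // that equal rows in different bands map to different keys.
    static std::uint64_t band_key(const Signature& sig, std::size_t band) {
        std::uint64_t h = 0xcbf29ce484222325ull ^ static_cast<std::uint64_t>(band);
        for (std::size_t r = 0; r < kRows; ++r) {
            h ^= sig[band * kRows + r];
            h *= 0x100000001b3ull;
        }
        return h;
    }

    std::array<Shard, kShards> shards_;
};
```

Two signatures become candidates when all rows of at least one band match, so more bands with fewer rows catch lower-similarity pairs at the cost of more false candidates. Candidates still need a full signature comparison before a post is treated as a duplicate.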

Performance Metrics (Target)

  • Throughput: 22,000+ tokenized streams/sec.
  • Memory Footprint: Reduced by 65% compared to baseline implementations.
  • Speedup: 4.2x faster MinHash generation via SIMD.

Build & Run

Prerequisites

  • C++20 compatible compiler (GCC 10+, Clang 10+)
  • CMake 3.16+
  • CPU with AVX2 support

Compilation

mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make

Run Benchmark

./bench_repost

Usage Example

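#include "shingler.h"
#include "minhash.h"
#include "lsh_index.h"

// ... initialize hasher and index ...

// Split the input text into shingles.
auto shingles = Shingler::computeShingles(text);

// Compute the MinHash signature of the shingle set.
Signature sig;
hasher.computeSignature(shingles, sig);

// Look up previously indexed posts that collide with this signature.
auto candidates = index.query(sig);

// Index the new post so that later queries can find it.
index.insert(postId, sig);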

About

A hyper-optimized, ultra-low-latency real-time text deduplication engine using C++20, AVX2 SIMD vectorization, MinHash, and Locality-Sensitive Hashing (LSH) for sub-millisecond duplicate checks on large-scale streams.
