Dataflow Probabilistic Hot Key Logger

This project demonstrates an efficient approach to detecting and logging Hot Keys (keys with abnormally high frequencies) in streaming data using Google Cloud Dataflow (Apache Beam).

🚀 Key Features

1. Probabilistic Sketching (Count-Min Sketch)

In high-volume traffic environments, aggregating frequencies for millions of unique keys can lead to Out of Memory (OOM) errors. This project utilizes the Count-Min Sketch algorithm to provide:

Fixed Memory Footprint: Occupies a constant amount of memory regardless of the volume of data or the number of unique keys.
Probabilistic Estimation: Trading absolute accuracy for extreme efficiency, allowing for near real-time frequency estimation.
10% Sampling & Extrapolation: In ultra-high traffic scenarios, updating a sketch for every element can be costly. We now sample only 10% of elements for the sketch and extrapolate the results (x10) to detect hot keys, reducing sketch update overhead by 90%.
Combiner Lifting (Optimization): Utilizes globally() aggregation to enable Map-side combining. This ensures each worker performs local partial aggregation, dramatically reducing network shuffle and avoiding Hot Key bottlenecks at the data transfer level.

2. Sidecar Pattern

The detection logic is executed in a separate branch from the main business logic pipeline.

Stability: Load or latency in the detection logic does not impact the processing of actual business data.
Flexibility: Detection algorithms or thresholds can be modified independently without altering the main logic.

3. Optimized Logging & Scalability

Log Sampling: The main business logic branch samples and logs processing status only every 1,000 records to prevent log flooding while maintaining visibility.
Beam Metrics Integration: Built-in metrics for real-time monitoring:
- detected_hot_keys (Counter): Tracks the total count of hot keys identified.
- estimated_counts_dist (Distribution): Monitors the distribution of estimated frequencies across the window.
Dataflow Streaming Engine: Explicitly enabled to offload state management and improve autoscaling performance on Google Cloud.

🏗 Architecture Overview

graph TD
    A[Data Source] --> B[Fixed Window]
    B --> C[Main Business Logic Branch]
    B --> D[Sidecar Monitor Branch]
    D --> E[Build Count-Min Sketch]
    E --> F[Side Input]
    C --> G[Probabilistic Frequency Estimation]
    F --> G
    G --> H{Above Threshold?}
    H -- Yes --> I[Log Warning/Metric]
    H -- No --> J[Ignore]

🛠 Tech Stack

Language: Java 25
Framework: Apache Beam 2.70.0
Data Structure: Count-Min Sketch (beam-sdks-java-extensions-sketching)
Runner: DirectRunner (Local testing) / DataflowRunner (Production)

🏃 How to Run

Prerequisites

JDK 25 or higher
Apache Maven 3.9+
Google Cloud SDK (Required for running on Dataflow)

Local Execution (DirectRunner)

Generate synthetic data with skewed key distributions to test hot key detection.

mvn compile exec:java \
  -Dexec.mainClass=com.example.dataflow.HotKeyLoggerPipeline \
  -Dexec.args="--runner=DirectRunner \
               --windowDurationSeconds=10 \
               --hotKeyThreshold=500"

Main Parameters

--windowDurationSeconds: The window size for aggregating frequencies (in seconds).
--hotKeyThreshold: The minimum count to be considered a Hot Key.
--epsilon: Relative error of the sketching algorithm (Default: 0.01).
--confidence: Confidence level of the sketching algorithm (Default: 0.99).

📝 Code Structure

HotKeyLoggerPipeline.java: Pipeline configuration and execution entry point.
SketchBasedHotKeyDetector.java: Sidecar transform for probabilistic Count-Min Sketch based detection.
SyntheticDataGenerator.java: Generator of mock data with skewed key distributions (e.g., zipfian-like).

🔍 Monitoring

When running the pipeline, look for warning logs indicating potential hot keys: [Sketch-Sampling-Sidecar] Detected Potential HOT KEY: [hot-key-A], Extrapolated Count: [1465] (Sampled: 146)

Additionally, monitor the main business logic sampling: [Main-Business-Sample] Processed 1000 keys. Current Sample - Key: hot-key-A, Count: 482

The project is designed to integrate with Apache Beam Metrics for easy visualization in monitoring dashboards.

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
.vscode		.vscode
src/main/java/com/example/dataflow		src/main/java/com/example/dataflow
target		target
README.ko.md		README.ko.md
README.md		README.md
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Dataflow Probabilistic Hot Key Logger

🚀 Key Features

1. Probabilistic Sketching (Count-Min Sketch)

2. Sidecar Pattern

3. Optimized Logging & Scalability

🏗 Architecture Overview

🛠 Tech Stack

🏃 How to Run

Prerequisites

Local Execution (DirectRunner)

Main Parameters

📝 Code Structure

🔍 Monitoring

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Dataflow Probabilistic Hot Key Logger

🚀 Key Features

1. Probabilistic Sketching (Count-Min Sketch)

2. Sidecar Pattern

3. Optimized Logging & Scalability

🏗 Architecture Overview

🛠 Tech Stack

🏃 How to Run

Prerequisites

Local Execution (DirectRunner)

Main Parameters

📝 Code Structure

🔍 Monitoring

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages