Data Processing Challenge

Real-time data pipeline using Apache Kafka, Apache Spark, and Debezium — ingesting live transactions, running CDC, and training an incremental ML model on streaming data.

Author: Shivaathmajan P (23BIT101) — B.Tech IT, Kumaraguru College of Technology


What it does

This project implements a full real-time data processing and analytics pipeline:

  • Kafka Producer — generates and publishes raw transaction events to a Kafka topic
  • Spark Streaming Job — consumes events, preprocesses (deduplication, currency normalisation, feature engineering), runs rolling 1-minute aggregations by category, and incrementally trains an SGDRegressor model on each micro-batch
  • CDC with Debezium — captures live PostgreSQL database changes and streams them to Kafka; a separate Spark job consumes and processes CDC events
  • In-memory analytics — lightweight in-memory demo for fast exploratory aggregations without Kafka
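
As a concrete illustration of what the preprocessing stage does, here is a small pure-Python sketch of deduplication and currency normalisation. The field names (`txn_id`, `category`, `amount`, `currency`) and exchange rates are assumptions for illustration; the actual event schema lives in src/streaming/producer.py.

```python
# Hypothetical transaction event shape; actual fields are defined in the producer.
events = [
    {"txn_id": "t1", "category": "grocery", "amount": 12.5, "currency": "USD"},
    {"txn_id": "t1", "category": "grocery", "amount": 12.5, "currency": "USD"},  # duplicate
    {"txn_id": "t2", "category": "fuel", "amount": 30.0, "currency": "EUR"},
]

# Deduplicate on the transaction id, keeping the first occurrence.
seen, deduped = set(), []
for e in events:
    if e["txn_id"] not in seen:
        seen.add(e["txn_id"])
        deduped.append(e)

# Normalise every amount to USD with a fixed rate table (illustrative rates).
RATES = {"USD": 1.0, "EUR": 1.08}
for e in deduped:
    e["amount_usd"] = round(e["amount"] * RATES[e["currency"]], 2)

print(len(deduped))              # 2
print(deduped[1]["amount_usd"])  # 32.4
```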

Tech Stack

Component            Technology
Streaming broker     Apache Kafka
Stream processing    Apache Spark (PySpark)
Change Data Capture  Debezium + Kafka Connect
Database             PostgreSQL
ML                   scikit-learn (SGDRegressor, incremental)
Infrastructure       Docker Compose
Language             Python 3

Project Structure

dp_challenge_spark_kafka_final/
├── src/
│   ├── streaming/
│   │   ├── producer.py        # Kafka producer — sends transaction events
│   │   └── spark_job.py       # PySpark streaming consumer + ML training
│   ├── cdc/
│   │   └── cdc_spark_job.py   # CDC event consumer via Debezium
│   └── inmemory/
│       └── inmemory_demo.py   # In-memory analytics demo
├── connectors/
│   └── pg_inventory_connector.json  # Debezium PostgreSQL connector config
├── docker/
│   └── init.sql               # PostgreSQL schema init
├── data/
│   └── raw/transactions.csv   # Sample transaction data
├── docker-compose.yml         # Kafka, Zookeeper, PostgreSQL, Kafka Connect
└── requirements.txt

Getting Started

Prerequisites

  • Python 3.8+
  • Docker & Docker Compose
  • Java 11+ (required by Spark)

Installation

git clone https://github.com/akira2705/data-processing-challenge.git
cd data-processing-challenge/dp_challenge_spark_kafka_final

# Create and activate Python virtual environment
python -m venv .venv

# Windows
.\.venv\Scripts\activate

# macOS/Linux
source .venv/bin/activate

pip install -r requirements.txt

Running the Pipeline

Step 1 — Start infrastructure (Kafka, PostgreSQL, Kafka Connect)

docker compose up -d

Step 2 — Create the Kafka topic

docker exec -it dp_challenge_spark_kafka-kafka-1 \
  kafka-topics --create \
  --topic transactions \
  --bootstrap-server kafka:9092 \
  --partitions 1 \
  --replication-factor 1

Step 3 — Register the Debezium CDC connector

curl -X POST -H "Content-Type: application/json" \
  -d "@connectors/pg_inventory_connector.json" \
  http://localhost:8083/connectors
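
For orientation, a Debezium PostgreSQL connector registration body generally looks like the sketch below. These values are illustrative, not copied from the repo — the actual config is in connectors/pg_inventory_connector.json.

```python
import json

# Illustrative Debezium PostgreSQL connector config; the repo's real values
# live in connectors/pg_inventory_connector.json.
connector = {
    "name": "pg-inventory-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres",
        "database.port": "5432",
        "database.user": "postgres",
        "database.password": "postgres",
        "database.dbname": "inventory",
        "table.include.list": "public.inventory",
        "topic.prefix": "pg",
    },
}

# This JSON body is what the curl command above POSTs to Kafka Connect.
body = json.dumps(connector)
print(json.loads(body)["config"]["connector.class"])
```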

Step 4 — Send sample transaction events

python src/streaming/producer.py

Step 5 — Run Spark streaming job (preprocessing + ML)

spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1 \
  src/streaming/spark_job.py

Step 6 (Optional) — Run CDC Spark consumer

spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1 \
  src/cdc/cdc_spark_job.py
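
The CDC consumer has to unwrap Debezium's change-event envelope (op code plus before/after row images). A minimal pure-Python sketch of that unwrapping — the envelope structure is standard Debezium, but the table and columns here are assumptions:

```python
import json

# A simplified Debezium change event for an UPDATE on a PostgreSQL row.
# Real events carry a richer "source"/"schema" section; columns are illustrative.
raw = json.dumps({
    "payload": {
        "op": "u",                                  # c=create, u=update, d=delete
        "before": {"id": 1, "quantity": 10},
        "after":  {"id": 1, "quantity": 7},
        "source": {"table": "inventory"},
    }
})

def unwrap(message: str) -> dict:
    """Return the post-image of the row (or the pre-image for deletes)."""
    payload = json.loads(message)["payload"]
    row = payload["after"] if payload["op"] != "d" else payload["before"]
    return {"op": payload["op"], "table": payload["source"]["table"], "row": row}

change = unwrap(raw)
print(change)  # {'op': 'u', 'table': 'inventory', 'row': {'id': 1, 'quantity': 7}}
```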

Step 7 (Optional) — In-memory demo

python src/inmemory/inmemory_demo.py
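
The kind of rolling 1-minute aggregation the demo performs can be sketched in a few lines of plain Python — bucket each event's timestamp into its minute window and sum amounts per (window, category). Event tuples here are illustrative, not the demo's actual data:

```python
from collections import defaultdict

# (timestamp_seconds, category, amount) tuples standing in for live events.
events = [
    (0, "grocery", 10.0), (30, "fuel", 20.0),
    (61, "grocery", 5.0), (90, "grocery", 7.0),
]

# Roll amounts into 1-minute windows keyed by (window_start, category).
windows = defaultdict(float)
for ts, category, amount in events:
    windows[(ts // 60 * 60, category)] += amount

print(windows[(0, "grocery")])   # 10.0
print(windows[(60, "grocery")])  # 12.0
```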

How the ML works

Each Spark micro-batch is handed to foreach_batch_incremental, which:

  1. Converts the batch to Pandas
  2. One-hot encodes the top 5 transaction categories
  3. Calls SGDRegressor.partial_fit() — updating the model without retraining from scratch
  4. Persists the model to disk as a pickle file

This means the model continuously learns from the live stream without ever seeing the full dataset at once.
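
The four steps above can be sketched as a standalone batch handler. This is a minimal approximation, not the repo's foreach_batch_incremental: the category list, column names, and pickle path are assumptions, and the real function receives a Spark DataFrame that it first converts to Pandas.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd
from sklearn.linear_model import SGDRegressor

TOP_CATEGORIES = ["grocery", "fuel", "dining", "travel", "retail"]  # assumed top-5
model = SGDRegressor()

def foreach_batch_sketch(batch: pd.DataFrame) -> None:
    # One-hot encode against a fixed category list so every micro-batch
    # yields the same feature columns, which partial_fit requires.
    X = np.column_stack(
        [(batch["category"] == c).astype(float) for c in TOP_CATEGORIES]
    )
    y = batch["amount"].to_numpy()
    # Incremental update: the model state is refined, never retrained from scratch.
    model.partial_fit(X, y)
    # Persist the current model state (illustrative path).
    path = os.path.join(tempfile.gettempdir(), "model.pkl")
    with open(path, "wb") as f:
        pickle.dump(model, f)

batch = pd.DataFrame({"category": ["grocery", "fuel"], "amount": [12.5, 30.0]})
foreach_batch_sketch(batch)
print(hasattr(model, "coef_"))  # True
```

In the real job, Spark's foreachBatch hands each micro-batch DataFrame to this kind of function, so the fitted state accumulates across batches exactly as described above.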
