Real-time data pipeline using Apache Kafka, Apache Spark, and Debezium — ingesting live transactions, running CDC, and training an incremental ML model on streaming data.
Author: Shivaathmajan P (23BIT101) — B.Tech IT, Kumaraguru College of Technology
This project implements a full real-time data processing and analytics pipeline:
- Kafka Producer — generates and publishes raw transaction events to a Kafka topic
- Spark Streaming Job — consumes events, preprocesses them (deduplication, currency normalisation, feature engineering), runs rolling 1-minute aggregations by category, and incrementally trains an `SGDRegressor` model on each micro-batch (a minimal sketch of the aggregation follows this list)
- CDC with Debezium — captures live PostgreSQL database changes and streams them to Kafka; a separate Spark job consumes and processes CDC events
- In-memory analytics — lightweight in-memory demo for fast exploratory aggregations without Kafka
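To make the streaming step concrete, here is a minimal PySpark sketch of the rolling 1-minute aggregation. The event schema and field names (`transaction_id`, `category`, `amount`, `event_time`) and the host-side broker address `localhost:9092` are assumptions for illustration; the real schema and logic live in `src/streaming/spark_job.py`.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import (DoubleType, StringType, StructField,
                               StructType, TimestampType)

spark = SparkSession.builder.appName("txn-stream").getOrCreate()

# Hypothetical event schema; the real one is defined in src/streaming/spark_job.py
schema = StructType([
    StructField("transaction_id", StringType()),
    StructField("category", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "transactions")
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("t"))
    .select("t.*")
    .withWatermark("event_time", "30 seconds")   # bound state for dedup + windows
    .dropDuplicates(["transaction_id"])          # deduplication
)

# Rolling 1-minute aggregation per category
agg = (
    events.groupBy(F.window("event_time", "1 minute"), "category")
    .agg(F.sum("amount").alias("total_amount"), F.count("*").alias("n_txns"))
)

agg.writeStream.outputMode("update").format("console").start().awaitTermination()
```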
| Component | Technology |
|---|---|
| Streaming broker | Apache Kafka |
| Stream processing | Apache Spark (PySpark) |
| Change Data Capture | Debezium + Kafka Connect |
| Database | PostgreSQL |
| ML | scikit-learn (SGDRegressor, incremental) |
| Infrastructure | Docker Compose |
| Language | Python 3 |
```
dp_challenge_spark_kafka_final/
├── src/
│   ├── streaming/
│   │   ├── producer.py              # Kafka producer — sends transaction events
│   │   └── spark_job.py             # PySpark streaming consumer + ML training
│   ├── cdc/
│   │   └── cdc_spark_job.py         # CDC event consumer via Debezium
│   └── inmemory/
│       └── inmemory_demo.py         # In-memory analytics demo
├── connectors/
│   └── pg_inventory_connector.json  # Debezium PostgreSQL connector config
├── docker/
│   └── init.sql                     # PostgreSQL schema init
├── data/
│   └── raw/transactions.csv         # Sample transaction data
├── docker-compose.yml               # Kafka, Zookeeper, PostgreSQL, Kafka Connect
└── requirements.txt
```
- Python 3.8+
- Docker & Docker Compose
- Java 11+ (required by Spark)
```bash
git clone https://github.com/akira2705/data-processing-challenge.git
cd data-processing-challenge/dp_challenge_spark_kafka_final

# Create and activate a Python virtual environment
python -m venv .venv

# Windows
.\.venv\Scripts\activate
# macOS/Linux
source .venv/bin/activate

pip install -r requirements.txt
```

### Step 1 — Start infrastructure (Kafka, PostgreSQL, Kafka Connect)

```bash
docker compose up -d
```

### Step 2 — Create the Kafka topic
```bash
docker exec -it dp_challenge_spark_kafka-kafka-1 \
  kafka-topics --create \
  --topic transactions \
  --bootstrap-server kafka:9092 \
  --partitions 1 \
  --replication-factor 1
```
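Alternatively, the topic can be created from Python with kafka-python's `KafkaAdminClient`. This sketch assumes kafka-python is installed and that the broker is exposed to the host on `localhost:9092` (inside the Compose network the address is `kafka:9092`).

```python
from kafka.admin import KafkaAdminClient, NewTopic

# Assumes the broker is reachable from the host on localhost:9092
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([
    NewTopic(name="transactions", num_partitions=1, replication_factor=1)
])
admin.close()
```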
### Step 3 — Register the Debezium CDC connector

```bash
curl -X POST -H "Content-Type: application/json" \
  -d "@connectors/pg_inventory_connector.json" \
  http://localhost:8083/connectors
```
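To confirm the connector registered correctly, you can poll the Kafka Connect REST API. The connector name below is a placeholder; use the `name` field from `connectors/pg_inventory_connector.json`.

```python
import requests

# List all registered connectors
print(requests.get("http://localhost:8083/connectors").json())

# "pg-inventory-connector" is hypothetical -- substitute the "name" field
# from connectors/pg_inventory_connector.json
status = requests.get(
    "http://localhost:8083/connectors/pg-inventory-connector/status"
).json()
print(status["connector"]["state"])  # expect "RUNNING"
```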
### Step 4 — Send sample transaction events

```bash
python src/streaming/producer.py
```
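For reference, a minimal kafka-python producer along these lines publishes JSON events to the `transactions` topic. The event fields here are hypothetical; `src/streaming/producer.py` defines the actual payload.

```python
import json
import random
import time
import uuid

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Hypothetical event shape; see src/streaming/producer.py for the real one
event = {
    "transaction_id": str(uuid.uuid4()),
    "category": random.choice(["grocery", "electronics", "travel"]),
    "amount": round(random.uniform(1, 500), 2),
    "currency": "USD",
    "event_time": time.strftime("%Y-%m-%dT%H:%M:%S"),
}
producer.send("transactions", event)
producer.flush()
```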
### Step 5 — Run Spark streaming job (preprocessing + ML)

```bash
spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1 \
  src/streaming/spark_job.py
```
### Step 6 (Optional) — Run CDC Spark consumer

```bash
spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1 \
  src/cdc/cdc_spark_job.py
```
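Debezium wraps each change event in an envelope whose `payload` carries the operation type (`c`/`u`/`d`/`r`) and the `before`/`after` row images. A hedged sketch of unpacking it in PySpark follows; the topic name and row columns are assumptions (they depend on the connector config and `docker/init.sql`), and the real logic is in `src/cdc/cdc_spark_job.py`.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("cdc-consumer").getOrCreate()

# Hypothetical row shape -- match it to the table defined in docker/init.sql
row = StructType([
    StructField("id", StringType()),
    StructField("name", StringType()),
])

# Debezium envelope: payload.op plus before/after row images
envelope = StructType([
    StructField("payload", StructType([
        StructField("op", StringType()),
        StructField("before", row),
        StructField("after", row),
    ])),
])

cdc = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "pg.public.inventory")  # placeholder topic name
    .load()
    .select(F.from_json(F.col("value").cast("string"), envelope).alias("e"))
    .select(
        F.col("e.payload.op").alias("op"),
        F.col("e.payload.after.id").alias("id"),
        F.col("e.payload.after.name").alias("name"),
    )
)

cdc.writeStream.format("console").start().awaitTermination()
```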
### Step 7 (Optional) — In-memory demo

```bash
python src/inmemory/inmemory_demo.py
```

Each Spark micro-batch is handed to `foreach_batch_incremental`, which:
- Converts the batch to Pandas
- One-hot encodes the top 5 transaction categories
- Calls `SGDRegressor.partial_fit()` — updating the model without retraining from scratch
- Persists the model to disk as a pickle file
This means the model continuously learns from the live stream without ever seeing the full dataset at once.
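A minimal sketch of such a handler is below. The column names (`category`, `amount`) and the category list are hypothetical; the actual implementation lives in `src/streaming/spark_job.py`.

```python
import pickle

from sklearn.linear_model import SGDRegressor

# Hypothetical top-5 categories; the real job derives these from the data
TOP_CATEGORIES = ["grocery", "electronics", "clothing", "travel", "dining"]
model = SGDRegressor()

def foreach_batch_incremental(batch_df, batch_id):
    pdf = batch_df.toPandas()
    if pdf.empty:
        return
    # One-hot encode the top categories; anything else becomes all zeros
    for cat in TOP_CATEGORIES:
        pdf[f"cat_{cat}"] = (pdf["category"] == cat).astype(int)
    X = pdf[[f"cat_{c}" for c in TOP_CATEGORIES]].to_numpy()
    y = pdf["amount"].to_numpy()
    model.partial_fit(X, y)  # update weights in place, no full retrain
    with open("sgd_model.pkl", "wb") as f:
        pickle.dump(model, f)

# Wired into the stream roughly as:
# events.writeStream.foreachBatch(foreach_batch_incremental).start()
```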