Real-time data pipeline using Apache Kafka, Apache Spark, and Debezium — ingesting live transactions, running CDC, and training an incremental ML model on streaming data.
Author: Shivaathmajan P (23BIT101) — B.Tech IT, Kumaraguru College of Technology
This project implements a full real-time data processing and analytics pipeline:
- Kafka Producer — generates and publishes raw transaction events to a Kafka topic
- Spark Streaming Job — consumes events, preprocesses them (deduplication, currency normalisation, feature engineering), runs rolling 1-minute aggregations by category, and incrementally trains an `SGDRegressor` model on each micro-batch (a minimal sketch of the aggregation follows this list)
- CDC with Debezium — captures live PostgreSQL database changes and streams them to Kafka; a separate Spark job consumes and processes CDC events
- In-memory analytics — lightweight in-memory demo for fast exploratory aggregations without Kafka
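To make the streaming step concrete, here is a minimal PySpark sketch of the rolling 1-minute aggregation. The event schema and field names (`transaction_id`, `category`, `amount`, `event_time`) and the host-side broker address `localhost:9092` are assumptions for illustration; the real schema and logic live in `src/streaming/spark_job.py`.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import (DoubleType, StringType, StructField,
                               StructType, TimestampType)

spark = SparkSession.builder.appName("txn-stream").getOrCreate()

# Hypothetical event schema; the real one is defined in src/streaming/spark_job.py
schema = StructType([
    StructField("transaction_id", StringType()),
    StructField("category", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "transactions")
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("t"))
    .select("t.*")
    .withWatermark("event_time", "30 seconds")   # bound state for dedup + windows
    .dropDuplicates(["transaction_id"])          # deduplication
)

# Rolling 1-minute aggregation per category
agg = (
    events.groupBy(F.window("event_time", "1 minute"), "category")
    .agg(F.sum("amount").alias("total_amount"), F.count("*").alias("n_txns"))
)

agg.writeStream.outputMode("update").format("console").start().awaitTermination()
```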
| Component | Technology |
|---|---|
| Streaming broker | Apache Kafka |
| Stream processing | Apache Spark (PySpark) |
| Change Data Capture | Debezium + Kafka Connect |
| Database | PostgreSQL |
| ML | scikit-learn (SGDRegressor, incremental) |
| Infrastructure | Docker Compose |
| Language | Python 3 |
```
dp_challenge_spark_kafka_final/
├── src/
│   ├── streaming/
│   │   ├── producer.py              # Kafka producer — sends transaction events
│   │   └── spark_job.py             # PySpark streaming consumer + ML training
│   ├── cdc/
│   │   └── cdc_spark_job.py         # CDC event consumer via Debezium
│   └── inmemory/
│       └── inmemory_demo.py         # In-memory analytics demo
├── connectors/
│   └── pg_inventory_connector.json  # Debezium PostgreSQL connector config
├── docker/
│   └── init.sql                     # PostgreSQL schema init
├── data/
│   └── raw/transactions.csv         # Sample transaction data
├── docker-compose.yml               # Kafka, Zookeeper, PostgreSQL, Kafka Connect
└── requirements.txt
```
- Python 3.8+
- Docker & Docker Compose
- Java 11+ (required by Spark)
```bash
git clone https://github.com/akira2705/data-processing-challenge.git
cd data-processing-challenge/dp_challenge_spark_kafka_final

# Create and activate a Python virtual environment
python -m venv .venv

# Windows
.\.venv\Scripts\activate
# macOS/Linux
source .venv/bin/activate

pip install -r requirements.txt
```

### Step 1 — Start infrastructure (Kafka, PostgreSQL, Kafka Connect)

```bash
docker compose up -d
```

### Step 2 — Create the Kafka topic
```bash
docker exec -it dp_challenge_spark_kafka-kafka-1 \
  kafka-topics --create \
  --topic transactions \
  --bootstrap-server kafka:9092 \
  --partitions 1 \
  --replication-factor 1
```
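Alternatively, the topic can be created from Python with kafka-python's `KafkaAdminClient`. This sketch assumes kafka-python is installed and that the broker is exposed to the host on `localhost:9092` (inside the Compose network the address is `kafka:9092`).

```python
from kafka.admin import KafkaAdminClient, NewTopic

# Assumes the broker is reachable from the host on localhost:9092
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([
    NewTopic(name="transactions", num_partitions=1, replication_factor=1)
])
admin.close()
```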
### Step 3 — Register the Debezium CDC connector

```bash
curl -X POST -H "Content-Type: application/json" \
  -d "@connectors/pg_inventory_connector.json" \
  http://localhost:8083/connectors
```
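To confirm the connector registered correctly, you can poll the Kafka Connect REST API. The connector name below is a placeholder; use the `name` field from `connectors/pg_inventory_connector.json`.

```python
import requests

# List all registered connectors
print(requests.get("http://localhost:8083/connectors").json())

# "pg-inventory-connector" is hypothetical -- substitute the "name" field
# from connectors/pg_inventory_connector.json
status = requests.get(
    "http://localhost:8083/connectors/pg-inventory-connector/status"
).json()
print(status["connector"]["state"])  # expect "RUNNING"
```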
### Step 4 — Send sample transaction events

```bash
python src/streaming/producer.py
```
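For reference, a minimal kafka-python producer along these lines publishes JSON events to the `transactions` topic. The event fields here are hypothetical; `src/streaming/producer.py` defines the actual payload.

```python
import json
import random
import time
import uuid

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Hypothetical event shape; see src/streaming/producer.py for the real one
event = {
    "transaction_id": str(uuid.uuid4()),
    "category": random.choice(["grocery", "electronics", "travel"]),
    "amount": round(random.uniform(1, 500), 2),
    "currency": "USD",
    "event_time": time.strftime("%Y-%m-%dT%H:%M:%S"),
}
producer.send("transactions", event)
producer.flush()
```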
### Step 5 — Run Spark streaming job (preprocessing + ML)

```bash
spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1 \
  src/streaming/spark_job.py
```
### Step 6 (Optional) — Run CDC Spark consumer

```bash
spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1 \
  src/cdc/cdc_spark_job.py
```
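Debezium wraps each change event in an envelope whose `payload` carries the operation type (`c`/`u`/`d`/`r`) and the `before`/`after` row images. A hedged sketch of unpacking it in PySpark follows; the topic name and row columns are assumptions (they depend on the connector config and `docker/init.sql`), and the real logic is in `src/cdc/cdc_spark_job.py`.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("cdc-consumer").getOrCreate()

# Hypothetical row shape -- match it to the table defined in docker/init.sql
row = StructType([
    StructField("id", StringType()),
    StructField("name", StringType()),
])

# Debezium envelope: payload.op plus before/after row images
envelope = StructType([
    StructField("payload", StructType([
        StructField("op", StringType()),
        StructField("before", row),
        StructField("after", row),
    ])),
])

cdc = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "pg.public.inventory")  # placeholder topic name
    .load()
    .select(F.from_json(F.col("value").cast("string"), envelope).alias("e"))
    .select(
        F.col("e.payload.op").alias("op"),
        F.col("e.payload.after.id").alias("id"),
        F.col("e.payload.after.name").alias("name"),
    )
)

cdc.writeStream.format("console").start().awaitTermination()
```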
### Step 7 (Optional) — In-memory demo

```bash
python src/inmemory/inmemory_demo.py
```

Each Spark micro-batch is handed to `foreach_batch_incremental`, which:
- Converts the batch to Pandas
- One-hot encodes the top 5 transaction categories
- Calls `SGDRegressor.partial_fit()` — updating the model without retraining from scratch
- Persists the model to disk as a pickle file
This means the model continuously learns from the live stream without ever seeing the full dataset at once.
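A minimal sketch of such a handler is below. The column names (`category`, `amount`) and the category list are hypothetical; the actual implementation lives in `src/streaming/spark_job.py`.

```python
import pickle

from sklearn.linear_model import SGDRegressor

# Hypothetical top-5 categories; the real job derives these from the data
TOP_CATEGORIES = ["grocery", "electronics", "clothing", "travel", "dining"]
model = SGDRegressor()

def foreach_batch_incremental(batch_df, batch_id):
    pdf = batch_df.toPandas()
    if pdf.empty:
        return
    # One-hot encode the top categories; anything else becomes all zeros
    for cat in TOP_CATEGORIES:
        pdf[f"cat_{cat}"] = (pdf["category"] == cat).astype(int)
    X = pdf[[f"cat_{c}" for c in TOP_CATEGORIES]].to_numpy()
    y = pdf["amount"].to_numpy()
    model.partial_fit(X, y)  # update weights in place, no full retrain
    with open("sgd_model.pkl", "wb") as f:
        pickle.dump(model, f)

# Wired into the stream roughly as:
# events.writeStream.foreachBatch(foreach_batch_incremental).start()
```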