📈 MarketPulse — Distributed Stock Intelligence Platform

A real-time, end-to-end data engineering pipeline built on a 2-node home cluster (Mac + Windows), processing live stock market data through a Bronze/Silver/Gold medallion architecture.


🏗️ Architecture

```
Yahoo Finance API
       │
       ▼
  Kafka Producer (Mac)
  [stock-prices topic]
       │
       ▼
  Bronze Layer (Raw) ✅
  Delta Lake
       │
       ▼
  Silver Layer (Cleaned) ✅
  PySpark + Technical Indicators
  (RSI, MACD, Bollinger Bands, EMA)
       │
       ▼
  Gold Layer (Aggregated) ✅
  PySpark → PostgreSQL
  (stock_prices, technical_signals, stock_summary)
       │
       ▼
  Power BI Dashboard ✅
```

🖥️ Cluster Setup

| Node | OS | Role | CPU | Memory |
|------|----|------|-----|--------|
| tejs-MacBook-Pro.local | macOS (ARM) | Spark Master + Kafka Broker + PostgreSQL | 10 cores | 15 GiB |
| LAPTOP-46KBTNB2 (WSL2) | Ubuntu 22.04 | Spark Worker | 8 threads | 14.8 GiB |

Cluster Configuration

  • Spark Master: spark://172.23.181.20:7077
  • Kafka Broker: 172.23.181.20:9092
  • PostgreSQL: 172.23.181.20:5432
  • Data Sync: Syncthing (Mac ↔ Windows)
  • Python Version: 3.11 (unified across both nodes)
  • Path Resolution: Symlink on Windows WSL2 maps Mac paths to local paths

🛠️ Tech Stack

| Layer | Technology |
|-------|------------|
| Ingestion | Apache Kafka 3.7.0 (KRaft mode) |
| Processing | Apache Spark 3.5.1 (Standalone Cluster) |
| Storage | Delta Lake 3.1.0 |
| Serving | PostgreSQL 17.4 |
| Orchestration | Apache Airflow 2.8.1 ✅ |
| Dashboard | Power BI |
| Language | Python 3.11 (Anaconda) |
| Data Sync | Syncthing |

📁 Project Structure

```
MarketPulse/
├── producers/
│   └── stock_producer.py           # Live stock → Kafka (17 symbols)
├── pipelines/
│   ├── bronze/
│   │   └── kafka_to_bronze.py      # Kafka → Delta Lake (raw)
│   ├── silver/
│   │   └── bronze_to_silver.py     # Delta → RSI, MACD, Bollinger
│   └── gold/
│       └── silver_to_gold.py       # Delta → PostgreSQL (3 tables)
├── airflow/
│   ├── docker-compose.yaml         # Airflow stack (postgres + webserver + scheduler)
│   ├── .env                        # Airflow secrets (gitignored)
│   └── dags/
│       └── marketpulse_pipeline.py # Silver→Gold DAG
├── docs/
│   └── architecture-decisions.md   # Technical decision log
├── config/
│   └── .env.example                # Environment variables template
├── tests/
│   ├── delta_log_checking.py       # Delta log inspection
│   ├── test_spark_delta.py         # Silver layer verification
│   └── debug_cluster.py            # Cluster hostname + path debug
├── requirements.txt
└── README.md
```

🚀 Getting Started

Prerequisites

  • Python 3.11+
  • Java 11
  • Apache Spark 3.5.1
  • Apache Kafka 3.7.0
  • PostgreSQL 17+
  • Syncthing (for data sync between nodes)

1 — Clone the repo

```bash
git clone https://github.com/tejfaster/MarketPulse-Distributed-Stock-Intelligence-Platform.git
cd MarketPulse-Distributed-Stock-Intelligence-Platform
```

2 — Install dependencies

```bash
pip install -r requirements.txt
```

3 — Configure environment

```bash
cp config/.env.example config/.env
# Edit .env with your credentials
```

4 — Start Kafka (Mac)

```bash
/opt/kafka/bin/kafka-server-start.sh -daemon /opt/kafka/config/kraft/server.properties
```

5 — Start Spark Cluster (Mac)

```bash
/opt/spark/sbin/start-master.sh
```

6 — Start Spark Worker (Windows WSL2)

```bash
PYSPARK_PYTHON=/usr/bin/python3.11 /opt/spark/sbin/start-worker.sh spark://172.23.181.20:7077
```

7 — Run Stock Producer

```bash
python producers/stock_producer.py
```
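The producer polls yfinance and publishes one JSON message per tick to the `stock-prices` topic. Below is a minimal sketch of what such a payload builder might look like; the function name and field names are illustrative, not the actual schema used by `stock_producer.py`:

```python
import json
from datetime import datetime, timezone

def build_tick_message(symbol, row):
    """Serialize one OHLCV tick into the JSON payload sent to Kafka.

    `row` is assumed to be a dict of OHLCV fields as returned by yfinance;
    the real producer's schema may differ.
    """
    return json.dumps({
        "symbol": symbol,
        "open": round(float(row["Open"]), 4),
        "high": round(float(row["High"]), 4),
        "low": round(float(row["Low"]), 4),
        "close": round(float(row["Close"]), 4),
        "volume": int(row["Volume"]),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    })

# Example: serialize one tick for AAPL
msg = build_tick_message("AAPL", {
    "Open": 189.5, "High": 191.2, "Low": 188.9,
    "Close": 190.7, "Volume": 52_000_000,
})
print(msg)
```

In the real pipeline, a string like `msg` would be passed to a `KafkaProducer.send()` call against the broker at `172.23.181.20:9092`.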

8 — Run Bronze Layer

```bash
spark-submit \
  --master spark://172.23.181.20:7077 \
  --packages io.delta:delta-spark_2.12:3.1.0 \
  pipelines/bronze/kafka_to_bronze.py
```

9 — Run Silver Layer

```bash
spark-submit \
  --packages io.delta:delta-spark_2.12:3.1.0 \
  pipelines/silver/bronze_to_silver.py
```

10 — Run Gold Layer

```bash
spark-submit \
  --master spark://172.23.181.20:7077 \
  --jars /opt/spark/jars/postgresql-42.7.3.jar \
  --packages io.delta:delta-spark_2.12:3.1.0 \
  pipelines/gold/silver_to_gold.py
```

📊 Data Flow

Bronze Layer ✅

  • Raw tick data ingested from Kafka topic stock-prices
  • Stored as Delta Lake at MarketPulse-data/bronze/stocks
  • No transformations — raw as received from yfinance

Silver Layer ✅

  • Reads Bronze Delta, cleans and validates OHLCV data
  • Technical indicators computed per symbol:
    • RSI — Relative Strength Index (14-period)
    • MACD — Moving Average Convergence Divergence
    • Bollinger Bands — Upper, Lower, Width (20-period)
    • EMA — Exponential Moving Averages (12, 26)
  • Buy/Sell/Hold signal generated based on RSI thresholds
  • Runs on local Spark (local[*]) on Mac
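As a rough illustration of the Silver logic, here is a pandas sketch of the RSI calculation and the RSI-based signal rule. Wilder-style smoothing and the conventional 30/70 thresholds are assumed; the actual PySpark implementation in `bronze_to_silver.py` may differ:

```python
import pandas as pd

def rsi(close: pd.Series, period: int = 14) -> pd.Series:
    """14-period RSI via Wilder-style exponential smoothing (a common
    variant; the project's exact formula may differ)."""
    delta = close.diff()
    gain = delta.clip(lower=0).ewm(alpha=1 / period, min_periods=period).mean()
    loss = (-delta.clip(upper=0)).ewm(alpha=1 / period, min_periods=period).mean()
    rs = gain / loss
    return 100 - 100 / (1 + rs)

def signal_from_rsi(value: float, oversold: float = 30, overbought: float = 70) -> str:
    """Map an RSI value to a Buy/Sell/Hold signal using the usual thresholds."""
    if value < oversold:
        return "BUY"
    if value > overbought:
        return "SELL"
    return "HOLD"
```

For example, a symbol whose price has risen steadily for 30 sessions will have an RSI near 100, so `signal_from_rsi` returns `"SELL"` (overbought).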

Gold Layer ✅

  • Reads Silver Delta on full cluster (Mac Master + Windows Worker)
  • Writes three tables to PostgreSQL via JDBC:
| Table | Rows | Description |
|-------|------|-------------|
| gold.stock_prices | 18,694 | Historical OHLCV per symbol per date |
| gold.technical_signals | 18,694 | RSI, MACD, Bollinger per record |
| gold.stock_summary | 17 | Latest snapshot per symbol |
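Conceptually, `gold.stock_summary` is a "latest row per symbol" query over the Silver data. A pandas sketch of that aggregation (the real job does the equivalent in PySpark, e.g. with a window function; the column names here are hypothetical):

```python
import pandas as pd

# Hypothetical mini-frame standing in for the Silver Delta output
silver = pd.DataFrame({
    "symbol": ["AAPL", "AAPL", "TSLA", "TSLA"],
    "trade_date": pd.to_datetime(["2024-05-01", "2024-05-02",
                                  "2024-05-01", "2024-05-02"]),
    "close": [189.5, 190.7, 180.0, 182.3],
})

# Keep only the most recent row per symbol: sort by date,
# then take the last row of each symbol group
summary = (
    silver.sort_values("trade_date")
          .groupby("symbol", as_index=False)
          .tail(1)
)
print(summary)
```

With 17 tracked symbols, the same operation over the full Silver table yields the 17-row `stock_summary` snapshot.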

📈 Stocks Tracked (17 Symbols)

| Category | Symbols |
|----------|---------|
| US Stocks | AAPL, GOOGL, MSFT, AMZN, TSLA, NVDA, META |
| Indian Stocks | RELIANCE.NS, TCS.NS, INFY.NS |
| Crypto | BTC-USD, ETH-USD |
| Indices | ^NSEI, ^BSESN, ^GSPC, ^DJI, ^IXIC |
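The category groupings above can be derived mechanically from the ticker format. A small classifier sketch (rules inferred from the table, not taken from the project code):

```python
# US tickers have no suffix, so they are matched by an explicit set
US_SYMBOLS = {"AAPL", "GOOGL", "MSFT", "AMZN", "TSLA", "NVDA", "META"}

def categorize(symbol: str) -> str:
    """Classify a Yahoo Finance ticker into the dashboard's four categories.

    Heuristics: '^' prefix marks an index, '.NS' suffix marks an NSE-listed
    Indian stock, '-USD' suffix marks a crypto pair.
    """
    if symbol.startswith("^"):
        return "Indices"
    if symbol.endswith(".NS"):
        return "Indian Stocks"
    if symbol.endswith("-USD"):
        return "Crypto"
    if symbol in US_SYMBOLS:
        return "US Stocks"
    return "Unknown"

print(categorize("RELIANCE.NS"))  # → Indian Stocks
```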

🔧 Cluster Troubleshooting Notes

Windows WSL2 Python Version

Ubuntu 22.04 on WSL2 defaults to Python 3.10, but PySpark requires the driver and worker nodes to run matching Python versions.

```bash
# Fix: set Python 3.11 as system default on Windows WSL2
sudo update-alternatives --set python3 /usr/bin/python3.11
```

Path Resolution Between Nodes

Mac and Windows use different base paths for the same data. Resolved via symlink:

```bash
# On Windows WSL2
sudo ln -s /home/tejfaster/MarketPulse-Data /Users/tejfaster/Developer/Python/MarketPulse-data
```
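The symlink lets both nodes address the data with the Mac-style path. The mapping it encodes can be expressed as a small Python helper, shown here purely for illustration (the pipeline itself relies on the symlink, not on code like this):

```python
# Base paths from the cluster setup above
MAC_BASE = "/Users/tejfaster/Developer/Python/MarketPulse-data"
WSL_BASE = "/home/tejfaster/MarketPulse-Data"

def to_local_path(path: str) -> str:
    """Translate a Mac-style data path into its WSL2 equivalent.

    This mirrors what the symlink does transparently at the filesystem level.
    """
    if path.startswith(MAC_BASE):
        return WSL_BASE + path[len(MAC_BASE):]
    return path

print(to_local_path(MAC_BASE + "/bronze/stocks"))
# → /home/tejfaster/MarketPulse-Data/bronze/stocks
```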

PostgreSQL Network Access

PostgreSQL must accept connections from the cluster network:

```conf
# postgresql.conf
listen_addresses = '*'

# pg_hba.conf (allow the WSL2 worker's subnet; cluster IPs are 172.23.x.x)
host    all    all    172.16.0.0/12    md5
```


⚙️ Orchestration (Airflow)

The Silver and Gold batch layers are automated via Apache Airflow 2.8.1 running on Docker.

DAG: marketpulse_silver_gold

| Property | Value |
|----------|-------|
| Schedule | `0 17 * * 1-5` — 17:00 UTC, Monday–Friday |
| Timezone | UTC (markets close ~21:00 UTC, Silver runs on delayed data) |
| Executor | LocalExecutor |
| Retries | 1 retry, 5-minute delay |

Pipeline Flow

```
17:00 UTC (Mon–Fri)
       │
       ▼
  run_silver
  spark-submit bronze_to_silver.py (local[*] on Mac)
       │
       ▼
  wait_for_sync (90 seconds)
  Gives Syncthing time to propagate Silver Delta files to Windows
       │
       ▼
  run_gold
  spark-submit silver_to_gold.py (spark://172.23.181.20:7077)
  Distributed across Mac Master + Windows Worker
  Writes 3 tables to PostgreSQL
```

Why this design?

  • Bronze runs continuously — it's a streaming job consuming from Kafka 24/7. Airflow doesn't manage it.
  • Silver + Gold run once daily — batch jobs that depend on each other. Airflow enforces the dependency and schedule.
  • LocalExecutor — Spark already handles distribution. Airflow only needs to trigger jobs, not distribute them.
  • SSH-based execution — Airflow container SSHs into Mac host to call spark-submit natively in the Anaconda environment.
  • 90s sync wait — Syncthing needs time to propagate new Silver Delta files from Mac to Windows before Gold reads them.
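The cron expression `0 17 * * 1-5` can be read as a simple predicate over a timestamp. A sketch for illustration only; Airflow's scheduler evaluates the cron expression internally:

```python
from datetime import datetime, timezone

def matches_schedule(dt: datetime) -> bool:
    """True when `dt` matches the DAG's cron schedule 0 17 * * 1-5,
    i.e. 17:00 on the hour, Monday (weekday 0) through Friday (weekday 4)."""
    return dt.hour == 17 and dt.minute == 0 and dt.weekday() < 5

# Monday 2024-05-06 at 17:00 UTC matches; Saturday does not
print(matches_schedule(datetime(2024, 5, 6, 17, 0, tzinfo=timezone.utc)))   # → True
print(matches_schedule(datetime(2024, 5, 4, 17, 0, tzinfo=timezone.utc)))   # → False
```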

Starting Airflow

```bash
cd /Users/tejfaster/Developer/Python/MarketPulse/airflow
docker compose up airflow-webserver airflow-scheduler -d
```

UI available at: http://localhost:8081 (login: airflow / airflow)

Triggering manually

```bash
docker exec airflow-airflow-scheduler-1 airflow dags trigger marketpulse_silver_gold
```

Stopping Airflow

```bash
docker compose down
```

📊 Power BI Dashboard

A 5-page interactive dashboard built on top of the Gold PostgreSQL layer.

Pages

| Page | Name | Description |
|------|------|-------------|
| 1 | Market Overview | KPI cards, stock summary table, Buy/Hold/Sell donut chart, category + signal slicers |
| 2 | Signal Dashboard | RSI/MACD/Bollinger signal analysis, bar chart by category, overbought/oversold counts |
| 3 | Price History | Area chart, OHLCV table, volume distribution by symbol and category |
| 4 | Stock Detail | Drill-through page — RSI history, MACD chart, price trend per symbol |
| 5 | Tooltip | Hover popup showing quick stats for any symbol across all pages |

Features

  • Drill Through — right-click any symbol → jump to Stock Detail page filtered to that symbol
  • Tooltip Page — hover over any symbol → popup shows price, RSI, signal instantly
  • Cross Filtering — click any visual → all other visuals filter automatically
  • Category Slicer — filter by US Stocks / Indian Stocks / Crypto / Indices
  • Dynamic Title — Stock Detail page title updates automatically per symbol

DAX Measures (20+)

  • Buy / Sell / Hold Signal counts
  • RSI Status (Overbought / Oversold / Neutral)
  • Latest Price, RSI, MACD, Signal per symbol
  • Price Change % Daily
  • Overbought / Oversold counts
  • Positive MACD count
  • Category classification (US / Indian / Crypto / Indices)
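Two of these measures translate directly into plain functions, shown here in Python for illustration (the RSI thresholds are assumed to be the standard 30/70; the actual DAX definitions may differ):

```python
def rsi_status(rsi: float) -> str:
    """Python analogue of the 'RSI Status' measure."""
    if rsi > 70:
        return "Overbought"
    if rsi < 30:
        return "Oversold"
    return "Neutral"

def daily_change_pct(today_close: float, prev_close: float) -> float:
    """Python analogue of the 'Price Change % Daily' measure."""
    return (today_close - prev_close) / prev_close * 100

print(rsi_status(82.4))               # → Overbought
print(daily_change_pct(105.0, 100.0)) # → 5.0
```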

Data Connection

  • Source: PostgreSQL 17.4 (gold schema)
  • Mode: Import (daily refresh after Airflow pipeline)
  • Tables: gold.stock_prices, gold.technical_signals, gold.stock_summary
  • Refresh: Manual after Airflow DAG completes at 17:00 UTC Mon–Fri

🗺️ Roadmap

- [x] Cluster setup (Spark + Kafka)
- [x] Bronze layer ingestion
- [x] Silver layer transformations
- [x] Gold layer → PostgreSQL
- [x] Airflow orchestration
- [x] Power BI dashboard

👤 Author

Tej Pratap — Aspiring Data Engineer


📄 License

MIT License — feel free to use and modify.
