📈 MarketPulse — Distributed Stock Intelligence Platform

A real-time, end-to-end data engineering pipeline built on a 2-node home cluster (Mac + Windows), processing live stock market data through a Bronze/Silver/Gold medallion architecture.


🏗️ Architecture

```
Yahoo Finance API
       │
       ▼
  Kafka Producer (Mac)
  [stock-prices topic]
       │
       ▼
  Bronze Layer (Raw) ✅
  Delta Lake
       │
       ▼
  Silver Layer (Cleaned) ✅
  PySpark + Technical Indicators
  (RSI, MACD, Bollinger Bands, EMA)
       │
       ▼
  Gold Layer (Aggregated) ✅
  PySpark → PostgreSQL
  (stock_prices, technical_signals, stock_summary)
       │
       ▼
  Power BI Dashboard ✅
```

🖥️ Cluster Setup

| Node | OS | Role | CPU | Memory |
|------|----|------|-----|--------|
| tejs-MacBook-Pro.local | macOS (ARM) | Spark Master + Kafka Broker + PostgreSQL | 10 cores | 15 GiB |
| LAPTOP-46KBTNB2 (WSL2) | Ubuntu 22.04 | Spark Worker | 8 threads | 14.8 GiB |

Cluster Configuration

  • Spark Master: spark://172.23.181.20:7077
  • Kafka Broker: 172.23.181.20:9092
  • PostgreSQL: 172.23.181.20:5432
  • Data Sync: Syncthing (Mac ↔ Windows)
  • Python Version: 3.11 (unified across both nodes)
  • Path Resolution: Symlink on Windows WSL2 maps Mac paths to local paths

🛠️ Tech Stack

| Layer | Technology |
|-------|------------|
| Ingestion | Apache Kafka 3.7.0 (KRaft mode) |
| Processing | Apache Spark 3.5.1 (Standalone Cluster) |
| Storage | Delta Lake 3.1.0 |
| Serving | PostgreSQL 17.4 |
| Orchestration | Apache Airflow 2.8.1 ✅ |
| Dashboard | Power BI |
| Language | Python 3.11 (Anaconda) |
| Data Sync | Syncthing |

📁 Project Structure

```
MarketPulse/
├── producers/
│   └── stock_producer.py           # Live stock → Kafka (17 symbols)
├── pipelines/
│   ├── bronze/
│   │   └── kafka_to_bronze.py      # Kafka → Delta Lake (raw)
│   ├── silver/
│   │   └── bronze_to_silver.py     # Delta → RSI, MACD, Bollinger
│   └── gold/
│       └── silver_to_gold.py       # Delta → PostgreSQL (3 tables)
├── airflow/
│   ├── docker-compose.yaml         # Airflow stack (postgres + webserver + scheduler)
│   ├── .env                        # Airflow secrets (gitignored)
│   └── dags/
│       └── marketpulse_pipeline.py # Silver→Gold DAG
├── docs/
│   └── architecture-decisions.md   # Technical decision log
├── config/
│   └── .env.example                # Environment variables template
├── tests/
│   ├── delta_log_checking.py       # Delta log inspection
│   ├── test_spark_delta.py         # Silver layer verification
│   └── debug_cluster.py            # Cluster hostname + path debug
├── requirements.txt
└── README.md
```

🚀 Getting Started

Prerequisites

  • Python 3.11+
  • Java 11
  • Apache Spark 3.5.1
  • Apache Kafka 3.7.0
  • PostgreSQL 17+
  • Syncthing (for data sync between nodes)

1 — Clone the repo

```bash
git clone https://github.com/tejfaster/MarketPulse-Distributed-Stock-Intelligence-Platform.git
cd MarketPulse-Distributed-Stock-Intelligence-Platform
```

2 — Install dependencies

```bash
pip install -r requirements.txt
```

3 — Configure environment

```bash
cp config/.env.example config/.env
# Edit .env with your credentials
```

4 — Start Kafka (Mac)

```bash
/opt/kafka/bin/kafka-server-start.sh -daemon /opt/kafka/config/kraft/server.properties
```

5 — Start Spark Cluster (Mac)

```bash
/opt/spark/sbin/start-master.sh
```

6 — Start Spark Worker (Windows WSL2)

```bash
PYSPARK_PYTHON=/usr/bin/python3.11 /opt/spark/sbin/start-worker.sh spark://172.23.181.20:7077
```

7 — Run Stock Producer

```bash
python producers/stock_producer.py
```
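The producer polls yfinance and publishes one JSON message per tick to the `stock-prices` topic. Below is a minimal sketch of what such a payload builder might look like; the function name and field names are illustrative, not the actual schema used by `stock_producer.py`:

```python
import json
from datetime import datetime, timezone

def build_tick_message(symbol, row):
    """Serialize one OHLCV tick into the JSON payload sent to Kafka.

    `row` is assumed to be a dict of OHLCV fields as returned by yfinance;
    the real producer's schema may differ.
    """
    return json.dumps({
        "symbol": symbol,
        "open": round(float(row["Open"]), 4),
        "high": round(float(row["High"]), 4),
        "low": round(float(row["Low"]), 4),
        "close": round(float(row["Close"]), 4),
        "volume": int(row["Volume"]),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    })

# Example: serialize one tick for AAPL
msg = build_tick_message("AAPL", {
    "Open": 189.5, "High": 191.2, "Low": 188.9,
    "Close": 190.7, "Volume": 52_000_000,
})
print(msg)
```

In the real pipeline, a string like `msg` would be passed to a `KafkaProducer.send()` call against the broker at `172.23.181.20:9092`.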

8 — Run Bronze Layer

```bash
spark-submit \
  --master spark://172.23.181.20:7077 \
  --packages io.delta:delta-spark_2.12:3.1.0 \
  pipelines/bronze/kafka_to_bronze.py
```

9 — Run Silver Layer

```bash
spark-submit \
  --packages io.delta:delta-spark_2.12:3.1.0 \
  pipelines/silver/bronze_to_silver.py
```

10 — Run Gold Layer

```bash
spark-submit \
  --master spark://172.23.181.20:7077 \
  --jars /opt/spark/jars/postgresql-42.7.3.jar \
  --packages io.delta:delta-spark_2.12:3.1.0 \
  pipelines/gold/silver_to_gold.py
```

📊 Data Flow

Bronze Layer ✅

  • Raw tick data ingested from Kafka topic stock-prices
  • Stored as Delta Lake at MarketPulse-data/bronze/stocks
  • No transformations — raw as received from yfinance

Silver Layer ✅

  • Reads Bronze Delta, cleans and validates OHLCV data
  • Technical indicators computed per symbol:
    • RSI — Relative Strength Index (14-period)
    • MACD — Moving Average Convergence Divergence
    • Bollinger Bands — Upper, Lower, Width (20-period)
    • EMA — Exponential Moving Averages (12, 26)
  • Buy/Sell/Hold signal generated based on RSI thresholds
  • Runs on local Spark (local[*]) on Mac
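As a rough illustration of the Silver logic, here is a pandas sketch of the RSI calculation and the RSI-based signal rule. Wilder-style smoothing and the conventional 30/70 thresholds are assumed; the actual PySpark implementation in `bronze_to_silver.py` may differ:

```python
import pandas as pd

def rsi(close: pd.Series, period: int = 14) -> pd.Series:
    """14-period RSI via Wilder-style exponential smoothing (a common
    variant; the project's exact formula may differ)."""
    delta = close.diff()
    gain = delta.clip(lower=0).ewm(alpha=1 / period, min_periods=period).mean()
    loss = (-delta.clip(upper=0)).ewm(alpha=1 / period, min_periods=period).mean()
    rs = gain / loss
    return 100 - 100 / (1 + rs)

def signal_from_rsi(value: float, oversold: float = 30, overbought: float = 70) -> str:
    """Map an RSI value to a Buy/Sell/Hold signal using the usual thresholds."""
    if value < oversold:
        return "BUY"
    if value > overbought:
        return "SELL"
    return "HOLD"
```

For example, a symbol whose price has risen steadily for 30 sessions will have an RSI near 100, so `signal_from_rsi` returns `"SELL"` (overbought).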

Gold Layer ✅

  • Reads Silver Delta on full cluster (Mac Master + Windows Worker)
  • Writes three tables to PostgreSQL via JDBC:
| Table | Rows | Description |
|-------|------|-------------|
| gold.stock_prices | 18,694 | Historical OHLCV per symbol per date |
| gold.technical_signals | 18,694 | RSI, MACD, Bollinger per record |
| gold.stock_summary | 17 | Latest snapshot per symbol |
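Conceptually, `gold.stock_summary` is a "latest row per symbol" query over the Silver data. A pandas sketch of that aggregation (the real job does the equivalent in PySpark, e.g. with a window function; the column names here are hypothetical):

```python
import pandas as pd

# Hypothetical mini-frame standing in for the Silver Delta output
silver = pd.DataFrame({
    "symbol": ["AAPL", "AAPL", "TSLA", "TSLA"],
    "trade_date": pd.to_datetime(["2024-05-01", "2024-05-02",
                                  "2024-05-01", "2024-05-02"]),
    "close": [189.5, 190.7, 180.0, 182.3],
})

# Keep only the most recent row per symbol: sort by date,
# then take the last row of each symbol group
summary = (
    silver.sort_values("trade_date")
          .groupby("symbol", as_index=False)
          .tail(1)
)
print(summary)
```

With 17 tracked symbols, the same operation over the full Silver table yields the 17-row `stock_summary` snapshot.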

📈 Stocks Tracked (17 Symbols)

| Category | Symbols |
|----------|---------|
| US Stocks | AAPL, GOOGL, MSFT, AMZN, TSLA, NVDA, META |
| Indian Stocks | RELIANCE.NS, TCS.NS, INFY.NS |
| Crypto | BTC-USD, ETH-USD |
| Indices | ^NSEI, ^BSESN, ^GSPC, ^DJI, ^IXIC |
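The category groupings above can be derived mechanically from the ticker format. A small classifier sketch (rules inferred from the table, not taken from the project code):

```python
# US tickers have no suffix, so they are matched by an explicit set
US_SYMBOLS = {"AAPL", "GOOGL", "MSFT", "AMZN", "TSLA", "NVDA", "META"}

def categorize(symbol: str) -> str:
    """Classify a Yahoo Finance ticker into the dashboard's four categories.

    Heuristics: '^' prefix marks an index, '.NS' suffix marks an NSE-listed
    Indian stock, '-USD' suffix marks a crypto pair.
    """
    if symbol.startswith("^"):
        return "Indices"
    if symbol.endswith(".NS"):
        return "Indian Stocks"
    if symbol.endswith("-USD"):
        return "Crypto"
    if symbol in US_SYMBOLS:
        return "US Stocks"
    return "Unknown"

print(categorize("RELIANCE.NS"))  # → Indian Stocks
```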

🔧 Cluster Troubleshooting Notes

Windows WSL2 Python Version

Ubuntu 22.04 on WSL2 defaults to Python 3.10, but PySpark requires the driver and worker nodes to run matching Python versions.

```bash
# Fix: set Python 3.11 as system default on Windows WSL2
sudo update-alternatives --set python3 /usr/bin/python3.11
```

Path Resolution Between Nodes

Mac and Windows use different base paths for the same data. Resolved via symlink:

```bash
# On Windows WSL2
sudo ln -s /home/tejfaster/MarketPulse-Data /Users/tejfaster/Developer/Python/MarketPulse-data
```
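The symlink lets both nodes address the data with the Mac-style path. The mapping it encodes can be expressed as a small Python helper, shown here purely for illustration (the pipeline itself relies on the symlink, not on code like this):

```python
# Base paths from the cluster setup above
MAC_BASE = "/Users/tejfaster/Developer/Python/MarketPulse-data"
WSL_BASE = "/home/tejfaster/MarketPulse-Data"

def to_local_path(path: str) -> str:
    """Translate a Mac-style data path into its WSL2 equivalent.

    This mirrors what the symlink does transparently at the filesystem level.
    """
    if path.startswith(MAC_BASE):
        return WSL_BASE + path[len(MAC_BASE):]
    return path

print(to_local_path(MAC_BASE + "/bronze/stocks"))
# → /home/tejfaster/MarketPulse-Data/bronze/stocks
```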

PostgreSQL Network Access

PostgreSQL must accept connections from the cluster network:

```conf
# postgresql.conf
listen_addresses = '*'

# pg_hba.conf (allow the WSL2 worker's subnet; cluster IPs are 172.23.x.x)
host    all    all    172.16.0.0/12    md5
```


⚙️ Orchestration (Airflow)

The Silver and Gold batch layers are automated via Apache Airflow 2.8.1 running on Docker.

DAG: marketpulse_silver_gold

| Property | Value |
|----------|-------|
| Schedule | `0 17 * * 1-5` — 17:00 UTC, Monday–Friday |
| Timezone | UTC (markets close ~21:00 UTC, Silver runs on delayed data) |
| Executor | LocalExecutor |
| Retries | 1 retry, 5-minute delay |

Pipeline Flow

```
17:00 UTC (Mon–Fri)
       │
       ▼
  run_silver
  spark-submit bronze_to_silver.py (local[*] on Mac)
       │
       ▼
  wait_for_sync (90 seconds)
  Gives Syncthing time to propagate Silver Delta files to Windows
       │
       ▼
  run_gold
  spark-submit silver_to_gold.py (spark://172.23.181.20:7077)
  Distributed across Mac Master + Windows Worker
  Writes 3 tables to PostgreSQL
```

Why this design?

  • Bronze runs continuously — it's a streaming job consuming from Kafka 24/7. Airflow doesn't manage it.
  • Silver + Gold run once daily — batch jobs that depend on each other. Airflow enforces the dependency and schedule.
  • LocalExecutor — Spark already handles distribution. Airflow only needs to trigger jobs, not distribute them.
  • SSH-based execution — Airflow container SSHs into Mac host to call spark-submit natively in the Anaconda environment.
  • 90s sync wait — Syncthing needs time to propagate new Silver Delta files from Mac to Windows before Gold reads them.
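The cron expression `0 17 * * 1-5` can be read as a simple predicate over a timestamp. A sketch for illustration only; Airflow's scheduler evaluates the cron expression internally:

```python
from datetime import datetime, timezone

def matches_schedule(dt: datetime) -> bool:
    """True when `dt` matches the DAG's cron schedule 0 17 * * 1-5,
    i.e. 17:00 on the hour, Monday (weekday 0) through Friday (weekday 4)."""
    return dt.hour == 17 and dt.minute == 0 and dt.weekday() < 5

# Monday 2024-05-06 at 17:00 UTC matches; Saturday does not
print(matches_schedule(datetime(2024, 5, 6, 17, 0, tzinfo=timezone.utc)))   # → True
print(matches_schedule(datetime(2024, 5, 4, 17, 0, tzinfo=timezone.utc)))   # → False
```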

Starting Airflow

```bash
cd /Users/tejfaster/Developer/Python/MarketPulse/airflow
docker compose up airflow-webserver airflow-scheduler -d
```

UI available at: http://localhost:8081 (login: airflow / airflow)

Triggering manually

```bash
docker exec airflow-airflow-scheduler-1 airflow dags trigger marketpulse_silver_gold
```

Stopping Airflow

```bash
docker compose down
```

📊 Power BI Dashboard

A 5-page interactive dashboard built on top of the Gold PostgreSQL layer.

Pages

| Page | Name | Description |
|------|------|-------------|
| 1 | Market Overview | KPI cards, stock summary table, Buy/Hold/Sell donut chart, category + signal slicers |
| 2 | Signal Dashboard | RSI/MACD/Bollinger signal analysis, bar chart by category, overbought/oversold counts |
| 3 | Price History | Area chart, OHLCV table, volume distribution by symbol and category |
| 4 | Stock Detail | Drill-through page — RSI history, MACD chart, price trend per symbol |
| 5 | Tooltip | Hover popup showing quick stats for any symbol across all pages |

Features

  • Drill Through — right-click any symbol → jump to Stock Detail page filtered to that symbol
  • Tooltip Page — hover over any symbol → popup shows price, RSI, signal instantly
  • Cross Filtering — click any visual → all other visuals filter automatically
  • Category Slicer — filter by US Stocks / Indian Stocks / Crypto / Indices
  • Dynamic Title — Stock Detail page title updates automatically per symbol

DAX Measures (20+)

  • Buy / Sell / Hold Signal counts
  • RSI Status (Overbought / Oversold / Neutral)
  • Latest Price, RSI, MACD, Signal per symbol
  • Price Change % Daily
  • Overbought / Oversold counts
  • Positive MACD count
  • Category classification (US / Indian / Crypto / Indices)
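Two of these measures translate directly into plain functions, shown here in Python for illustration (the RSI thresholds are assumed to be the standard 30/70; the actual DAX definitions may differ):

```python
def rsi_status(rsi: float) -> str:
    """Python analogue of the 'RSI Status' measure."""
    if rsi > 70:
        return "Overbought"
    if rsi < 30:
        return "Oversold"
    return "Neutral"

def daily_change_pct(today_close: float, prev_close: float) -> float:
    """Python analogue of the 'Price Change % Daily' measure."""
    return (today_close - prev_close) / prev_close * 100

print(rsi_status(82.4))               # → Overbought
print(daily_change_pct(105.0, 100.0)) # → 5.0
```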

Data Connection

  • Source: PostgreSQL 17.4 (gold schema)
  • Mode: Import (daily refresh after Airflow pipeline)
  • Tables: gold.stock_prices, gold.technical_signals, gold.stock_summary
  • Refresh: Manual after Airflow DAG completes at 17:00 UTC Mon–Fri

🗺️ Roadmap

- [x] Cluster setup (Spark + Kafka)
- [x] Bronze layer ingestion
- [x] Silver layer transformations
- [x] Gold layer → PostgreSQL
- [x] Airflow orchestration
- [x] Power BI dashboard

👤 Author

Tej Pratap — Aspiring Data Engineer


📄 License

MIT License — feel free to use and modify.
