A real-time, end-to-end data engineering pipeline built on a 2-node home cluster (Mac + Windows), processing live stock market data through a Bronze/Silver/Gold medallion architecture.
```
Yahoo Finance API
        │
        ▼
Kafka Producer (Mac)
[stock-prices topic]
        │
        ▼
Bronze Layer (Raw) ✅
Delta Lake
        │
        ▼
Silver Layer (Cleaned) ✅
PySpark + Technical Indicators
(RSI, MACD, Bollinger Bands, EMA)
        │
        ▼
Gold Layer (Aggregated) ✅
PySpark → PostgreSQL
(stock_prices, technical_signals, stock_summary)
        │
        ▼
Power BI Dashboard ✅
```
| Node | OS | Role | CPU | Memory |
|---|---|---|---|---|
| tejs-MacBook-Pro.local | macOS (ARM) | Spark Master + Kafka Broker + PostgreSQL | 10 cores | 15 GiB |
| LAPTOP-46KBTNB2 (WSL2) | Ubuntu 22.04 | Spark Worker | 8 threads | 14.8 GiB |
- Spark Master: `spark://172.23.181.20:7077`
- Kafka Broker: `172.23.181.20:9092`
- PostgreSQL: `172.23.181.20:5432`
- Data Sync: Syncthing (Mac ↔ Windows)
- Python Version: 3.11 (unified across both nodes)
- Path Resolution: Symlink on Windows WSL2 maps Mac paths to local paths
| Layer | Technology |
|---|---|
| Ingestion | Apache Kafka 3.7.0 (KRaft mode) |
| Processing | Apache Spark 3.5.1 (Standalone Cluster) |
| Storage | Delta Lake 3.1.0 |
| Serving | PostgreSQL 17.4 |
| Orchestration | Apache Airflow 2.8.1 ✅ |
| Dashboard | Power BI |
| Language | Python 3.11 (Anaconda) |
| Data Sync | Syncthing |
```
MarketPulse/
├── producers/
│   └── stock_producer.py          # Live stock → Kafka (17 symbols)
├── pipelines/
│   ├── bronze/
│   │   └── kafka_to_bronze.py     # Kafka → Delta Lake (raw)
│   ├── silver/
│   │   └── bronze_to_silver.py    # Delta → RSI, MACD, Bollinger
│   └── gold/
│       └── silver_to_gold.py      # Delta → PostgreSQL (3 tables)
├── airflow/
│   ├── docker-compose.yaml        # Airflow stack (postgres + webserver + scheduler)
│   ├── .env                       # Airflow secrets (gitignored)
│   └── dags/
│       └── marketpulse_pipeline.py  # Silver→Gold DAG
├── docs/
│   └── architecture-decisions.md  # Technical decision log
├── config/
│   └── .env.example               # Environment variables template
├── tests/
│   ├── delta_log_checking.py      # Delta log inspection
│   ├── test_spark_delta.py        # Silver layer verification
│   └── debug_cluster.py           # Cluster hostname + path debug
├── requirements.txt
└── README.md
```
- Python 3.11+
- Java 11
- Apache Spark 3.5.1
- Apache Kafka 3.7.0
- PostgreSQL 17+
- Syncthing (for data sync between nodes)
```bash
git clone https://github.com/tejfaster/MarketPulse-Distributed-Stock-Intelligence-Platform.git
cd MarketPulse-Distributed-Stock-Intelligence-Platform
pip install -r requirements.txt
cp config/.env.example config/.env
# Edit .env with your credentials
```

```bash
# Start Kafka (KRaft mode) and the Spark cluster
/opt/kafka/bin/kafka-server-start.sh -daemon /opt/kafka/config/kraft/server.properties
/opt/spark/sbin/start-master.sh
PYSPARK_PYTHON=/usr/bin/python3.11 /opt/spark/sbin/start-worker.sh spark://172.23.181.20:7077
```

```bash
# Run the producer and the three pipeline stages
python producers/stock_producer.py

spark-submit \
  --master spark://172.23.181.20:7077 \
  --packages io.delta:delta-spark_2.12:3.1.0 \
  pipelines/bronze/kafka_to_bronze.py

spark-submit \
  --packages io.delta:delta-spark_2.12:3.1.0 \
  pipelines/silver/bronze_to_silver.py

spark-submit \
  --master spark://172.23.181.20:7077 \
  --jars /opt/spark/jars/postgresql-42.7.3.jar \
  --packages io.delta:delta-spark_2.12:3.1.0 \
  pipelines/gold/silver_to_gold.py
```

- Raw tick data ingested from Kafka topic `stock-prices`
- Stored as Delta Lake at `MarketPulse-data/bronze/stocks`
- No transformations; data stored raw as received from yfinance
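The Bronze layer's "no transformation" contract can be illustrated with a hypothetical producer payload. The field names below are assumed for illustration and are not taken from `stock_producer.py`:

```python
import json

# A hypothetical raw message as the producer might publish it to Kafka
# (symbol, timestamp, and OHLCV fields are assumed names).
raw = json.dumps({
    "symbol": "AAPL",
    "timestamp": "2024-01-03T15:30:00Z",
    "open": 184.2, "high": 185.9, "low": 183.8, "close": 185.4,
    "volume": 48210000,
})

# Bronze only parses and lands the record; no cleaning or validation here.
record = json.loads(raw)
```

Keeping Bronze untouched means any later bug in Silver can be fixed by replaying raw history instead of re-ingesting from the API.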
- Reads Bronze Delta, cleans and validates OHLCV data
- Technical indicators computed per symbol:
- RSI — Relative Strength Index (14-period)
- MACD — Moving Average Convergence Divergence
- Bollinger Bands — Upper, Lower, Width (20-period)
- EMA — Exponential Moving Averages (12, 26)
- Buy/Sell/Hold signal generated based on RSI thresholds
- Runs on local Spark (`local[*]`) on the Mac
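The RSI and signal logic above can be sketched in plain Python. This is an illustrative sketch, not the project's PySpark implementation; the simple (non-smoothed) average and the 30/70 Buy/Sell thresholds are assumptions:

```python
def rsi(closes, period=14):
    """Simple 14-period RSI over the last `period` price changes."""
    if len(closes) < period + 1:
        raise ValueError("need at least period + 1 closing prices")
    # Price changes over the final `period` intervals
    changes = [b - a for a, b in zip(closes[-period - 1:-1], closes[-period:])]
    avg_gain = sum(c for c in changes if c > 0) / period
    avg_loss = sum(-c for c in changes if c < 0) / period
    if avg_loss == 0:
        return 100.0  # all gains: maximally overbought
    rs = avg_gain / avg_loss
    return 100 - 100 / (1 + rs)

def signal(rsi_value, oversold=30, overbought=70):
    """Map an RSI value to a trade signal (thresholds assumed)."""
    if rsi_value < oversold:
        return "Buy"
    if rsi_value > overbought:
        return "Sell"
    return "Hold"
```

In the real job the same arithmetic runs per symbol over a Spark window ordered by date, so every row gets an indicator computed from its trailing 14 records.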
- Reads Silver Delta on full cluster (Mac Master + Windows Worker)
- Writes three tables to PostgreSQL via JDBC:
| Table | Rows | Description |
|---|---|---|
| `gold.stock_prices` | 18,694 | Historical OHLCV per symbol per date |
| `gold.technical_signals` | 18,694 | RSI, MACD, Bollinger per record |
| `gold.stock_summary` | 17 | Latest snapshot per symbol |
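The `stock_summary` idea (latest snapshot per symbol) reduces to keeping the max-date row per group. A plain-Python sketch with illustrative field names; the real job does this in PySpark before the JDBC write:

```python
def latest_snapshot(records):
    """Keep the most recent record per symbol.

    records: iterable of dicts with 'symbol' and a sortable 'date' key
    (ISO date strings sort correctly as text).
    """
    latest = {}
    for rec in records:
        cur = latest.get(rec["symbol"])
        if cur is None or rec["date"] > cur["date"]:
            latest[rec["symbol"]] = rec
    return latest
```

With 17 tracked symbols, this is why `gold.stock_summary` holds exactly 17 rows regardless of how much history the other two tables accumulate.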
| Category | Symbols |
|---|---|
| US Stocks | AAPL, GOOGL, MSFT, AMZN, TSLA, NVDA, META |
| Indian Stocks | RELIANCE.NS, TCS.NS, INFY.NS |
| Crypto | BTC-USD, ETH-USD |
| Indices | ^NSEI, ^BSESN, ^GSPC, ^DJI, ^IXIC |
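The categories above follow directly from Yahoo Finance symbol conventions (`.NS` suffix for NSE India, `-USD` for crypto pairs, `^` prefix for indices). A hypothetical helper sketching that mapping, which the dashboard's category slicer relies on:

```python
def categorize(symbol: str) -> str:
    """Classify a Yahoo Finance ticker by its naming convention."""
    if symbol.startswith("^"):       # market indices, e.g. ^GSPC
        return "Indices"
    if symbol.endswith(".NS"):       # NSE-listed Indian stocks
        return "Indian Stocks"
    if symbol.endswith("-USD"):      # crypto/USD pairs
        return "Crypto"
    return "US Stocks"               # plain tickers default to US equities
```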
WSL2 defaults to Python 3.10, but Spark requires matching Python versions across nodes.

```bash
# Fix: set Python 3.11 as the system default on Windows WSL2
sudo update-alternatives --set python3 /usr/bin/python3.11
```

Mac and Windows use different base paths for the same data. Resolved via a symlink:

```bash
# On Windows WSL2
sudo ln -s /home/tejfaster/MarketPulse-Data /Users/tejfaster/Developer/Python/MarketPulse-data
```

PostgreSQL must accept connections from the cluster network:

```conf
# postgresql.conf
listen_addresses = '*'

# pg_hba.conf
host all all 10.0.0.0/8 md5
```
The Silver and Gold batch layers are automated via Apache Airflow 2.8.1 running on Docker.
| Property | Value |
|---|---|
| Schedule | `0 17 * * 1-5` (17:00 UTC, Monday–Friday) |
| Timezone | UTC (markets close ~21:00 UTC, Silver runs on delayed data) |
| Executor | LocalExecutor |
| Retries | 1 retry, 5-minute delay |
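The cron expression `0 17 * * 1-5` fires at 17:00 UTC on weekdays. Airflow parses the cron string itself; the stdlib sketch below only makes the schedule concrete by computing the next run after a given instant:

```python
from datetime import datetime, timedelta, timezone

def next_run(after: datetime) -> datetime:
    """Next weekday 17:00 UTC strictly after `after`."""
    candidate = after.replace(hour=17, minute=0, second=0, microsecond=0)
    if candidate <= after:          # today's 17:00 already passed
        candidate += timedelta(days=1)
    while candidate.weekday() > 4:  # 5 = Saturday, 6 = Sunday
        candidate += timedelta(days=1)
    return candidate
```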
```
17:00 UTC (Mon–Fri)
        │
        ▼
run_silver
  spark-submit bronze_to_silver.py (local[*] on Mac)
        │
        ▼
wait_for_sync (90 seconds)
  Gives Syncthing time to propagate Silver Delta files to Windows
        │
        ▼
run_gold
  spark-submit silver_to_gold.py (spark://172.23.181.20:7077)
  Distributed across Mac Master + Windows Worker
  Writes 3 tables to PostgreSQL
```
- Bronze runs continuously — it's a streaming job consuming from Kafka 24/7. Airflow doesn't manage it.
- Silver + Gold run once daily — batch jobs that depend on each other. Airflow enforces the dependency and schedule.
- LocalExecutor — Spark already handles distribution. Airflow only needs to trigger jobs, not distribute them.
- SSH-based execution: the Airflow container SSHs into the Mac host to run `spark-submit` natively in the Anaconda environment.
- 90-second sync wait: Syncthing needs time to propagate new Silver Delta files from Mac to Windows before Gold reads them.
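The "1 retry, 5-minute delay" policy corresponds to Airflow task retries. A generic retry-with-delay sketch, independent of Airflow (the delay is a parameter so the behaviour is easy to test):

```python
import time

def run_with_retries(task, retries=1, delay_seconds=300):
    """Call `task`; on failure, wait and retry up to `retries` times."""
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise           # retries exhausted: surface the failure
            time.sleep(delay_seconds)
```

A single retry with a 5-minute gap is enough here because the usual failure mode is transient (a worker briefly offline or a sync still in flight), not a bad job.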
```bash
cd /Users/tejfaster/Developer/Python/MarketPulse/airflow
docker compose up airflow-webserver airflow-scheduler -d
```

UI available at http://localhost:8081 (login: airflow / airflow).

```bash
# Trigger the DAG manually
docker exec airflow-airflow-scheduler-1 airflow dags trigger marketpulse_silver_gold

# Stop the stack
docker compose down
```

A 5-page interactive dashboard built on top of the Gold PostgreSQL layer.
| Page | Name | Description |
|---|---|---|
| 1 | Market Overview | KPI cards, stock summary table, Buy/Hold/Sell donut chart, category + signal slicers |
| 2 | Signal Dashboard | RSI/MACD/Bollinger signal analysis, bar chart by category, overbought/oversold counts |
| 3 | Price History | Area chart, OHLCV table, volume distribution by symbol and category |
| 4 | Stock Detail | Drill through page — RSI history, MACD chart, price trend per symbol |
| 5 | Tooltip | Hover popup showing quick stats for any symbol across all pages |
- Drill Through — right-click any symbol → jump to Stock Detail page filtered to that symbol
- Tooltip Page — hover over any symbol → popup shows price, RSI, signal instantly
- Cross Filtering — click any visual → all other visuals filter automatically
- Category Slicer — filter by US Stocks / Indian Stocks / Crypto / Indices
- Dynamic Title — Stock Detail page title updates automatically per symbol
- Buy / Sell / Hold Signal counts
- RSI Status (Overbought / Oversold / Neutral)
- Latest Price, RSI, MACD, Signal per symbol
- Price Change % Daily
- Overbought / Oversold counts
- Positive MACD count
- Category classification (US / Indian / Crypto / Indices)
- Source: PostgreSQL 17.4 (`gold` schema)
- Mode: Import (daily refresh after Airflow pipeline)
- Tables: `gold.stock_prices`, `gold.technical_signals`, `gold.stock_summary`
- Refresh: Manual, after the Airflow DAG completes at 17:00 UTC Mon–Fri
- Cluster setup (Spark + Kafka)
- Bronze layer ingestion
- Silver layer transformations
- Gold layer → PostgreSQL
- Airflow orchestration
- Power BI dashboard
Tej Pratap — Aspiring Data Engineer
- GitHub: @tejfaster
- LinkedIn: Tej Pratap
MIT License — feel free to use and modify.