SPADAKI/bloomberg-commodities-de

# Bloomberg Commodities ETL – Simple & Scalable

Built in under an hour – no Bloomberg Terminal, no live data, just 6 real Kaggle commodity CSVs.


## Goal

Show a clean, maintainable ETL pipeline that:

  • Auto-detects Date and numeric columns (Price, High, Low, etc.)
  • Handles real-world messiness: commas in numbers, missing columns, mixed formats
  • Combines all assets into one clean file
  • Delivers an interactive dashboard for instant insight

Designed for junior onboarding – runs in 60 seconds.


## Tech Stack

| Tool | Purpose |
|------|---------|
| Polars | Fast, memory-efficient data processing |
| Streamlit | One-click interactive dashboard |
| CSV → CSV | Simple, portable, no external storage |

## Folder Structure

```
bloomberg-commodities-de/
├── data/
│   └── raw/          # ← Your 6 Kaggle CSVs go here
├── pipeline.py       # ← Ingest + clean + combine
├── dashboard.py      # ← Interactive chart
└── README.md
```

## Dashboard Snapshots

*(Five Streamlit dashboard screenshots, 2025-11-05.)*

## Run Instructions

```bash
# 1. Install (once)
pip3 install polars streamlit

# 2. Run ETL
python3 pipeline.py

# 3. Launch dashboard
streamlit run dashboard.py
```

# Bloomberg Commodities ETL – Day 1 Ready (ML + Anomaly Detection)

Built for Bloomberg interview prep: from mock Kaggle CSVs to a production-grade pipeline in under 2 hours.  
**Now with Data Quality (DQ) guards + Z-Score anomaly detection** — catches negative prices and 3σ outliers instantly.

## 🚀 What's New in Day 1
- **DQ Check**: Auto-removes invalid prices ≤ 0 (critical for finance).
- **Anomaly Detection**: Log-transform + Z-Score (3σ rule) on prices — industry standard for commodity volatility.
- **Robustness**: Column-name agnostic (works with "Price", "Close", etc.).
- **Zero Extra Deps**: Pure Polars + NumPy/SciPy — no Great Expectations bloat.

**Sample Output**:



## 🎯 Goal
Scalable ETL for commodity data:
- Ingests messy CSVs (commas, % signs, mixed formats).
- Cleans, combines, and validates.
- Flags anomalies for trading desk alerts.
- Ready for Prefect orchestration (Day 2).

## 🛠 Tech Stack
| Tool | Purpose |
|------|---------|
| **Polars** | Lightning-fast data processing (10x Pandas). |
| **NumPy/SciPy** | Z-Score stats (log-normal for prices). |
| **Pandas** (minimal) | Temp conversions only. |
| **CSV Output** | Portable + human-readable (Parquet ready for prod). |

## 📁 Folder Structure

```
bloomberg-commodities-de/
├── data/
│   ├── raw/                            # Drop your 6 Kaggle CSVs here (e.g., crude_oil.csv)
│   ├── combined_commodities.csv        # Raw combined output
│   └── combined_commodities_clean.csv  # DQ + anomaly-flagged
├── pipeline.py                         # Full ETL + Day 1 upgrades
├── dashboard.py                        # Interactive Streamlit viz (streamlit run dashboard.py)
├── flow.py                             # Prefect orchestration: logs, retries, scheduling
└── README.md
```

## Key Points
- B-Pipe → Kafka: exactly-once delivery, 100k+ msgs/sec
- Prefect: retries, observability, scheduling (`cron="@daily"`)
- Polars: 10–50x faster than Pandas on commodity ticks
- Z-Score anomalies → alert the trading desk instantly
- Final storage: columnar Parquet for BI tools (Tableau, etc.)



## About

Sample project using mock Kaggle data.
