
A real-time and batch-based crypto analytics platform built with Lambda Architecture using Kafka, Spark, HDFS, HBase, MongoDB, OpenAI, and XGBoost - providing live short‑term predictions via Flask and long‑term insights with Streamlit dashboards, integrating Yahoo Finance and Binance data for 10 major cryptocurrencies.


yammdd/crypto-bigdata-project


Crypto Big Data Platform

A full end-to-end real-time and batch Big Data system for cryptocurrency analytics.
Spark • Kafka • HBase • Airflow • HDFS • MongoDB • XGBoost • LLM Chatbot

This project demonstrates Lambda Architecture using:

  • Kafka for ingestion

  • Spark Streaming for real-time processing

  • Spark Batch + Airflow for scheduled ML training

  • HDFS + HBase + MongoDB for storage layers

  • XGBoost for crypto price prediction

  • Flask + Streamlit for UI layers (real-time & batch)

  • Gemini LLM + NewsAPI for sentiment-aware market analysis


🧩 Project Setup

1️⃣ Clone the Project

Open your terminal, navigate to the desired directory, and clone the repository:

git clone https://github.com/yammdd/crypto-bigdata-project
cd crypto-bigdata-project

2️⃣ Build and Start Docker Containers

Make sure Docker is running, then build and launch all services:

docker compose up -d --build

This will automatically start:

| Component | Purpose |
| --- | --- |
| Kafka + Zookeeper | Message queue |
| Spark Master + Worker | Batch + streaming processing |
| HDFS (Namenode/Datanode) | Distributed storage |
| HBase | Real-time serving layer |
| MongoDB | Batch aggregation storage |
| Airflow | Workflow scheduler |
| Streamlit | Batch dashboard |
| Flask | Real-time dashboard + chatbot |
| Producer | Binance → Kafka WebSocket bridge |
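
As an illustration of the Producer row, here is a minimal sketch of the step that shapes a Binance WebSocket message into a Kafka record. It assumes the producer subscribes to Binance's trade stream, whose JSON payload uses the short keys `e`, `s`, `p`, `q`, and `T`. The function name `to_kafka_record` and the output field names are hypothetical and not taken from this repository. The actual send call through a Kafka client is left out so the snippet has no dependencies.

```python
import json
from typing import Optional, Tuple


def to_kafka_record(raw: str) -> Optional[Tuple[bytes, bytes]]:
    """Turn one Binance trade message into a (key, value) pair for Kafka.

    Hypothetical helper. Assumes the Binance trade stream payload, where
    "e" is the event type, "s" the symbol, "p" the price, "q" the quantity
    and "T" the trade time in milliseconds.
    Returns None for messages that are not trade events.
    """
    msg = json.loads(raw)
    if msg.get("e") != "trade":
        return None
    value = {
        "symbol": msg["s"],
        "price": float(msg["p"]),      # Binance sends numbers as strings
        "quantity": float(msg["q"]),
        "trade_time_ms": int(msg["T"]),
    }
    # Keying by symbol means Kafka's default partitioner routes all trades
    # of one coin to the same partition, which preserves their order.
    return msg["s"].encode("utf-8"), json.dumps(value).encode("utf-8")
```

The returned pair could then be passed as the key and value of a Kafka producer's send call. The topic name and client library are left open here because the source does not state them.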

3️⃣ Setup Batch jobs

This step loads the historical data and trains the XGBoost price prediction models.

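After starting Docker, retrieve the Airflow login credentials from the container logs. The command below uses the Windows findstr filter; on Linux or macOS, pipe to grep -i "password" instead: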

docker logs airflow | findstr /i "password"

The output shows your Airflow username and password, for example:

Login with username: admin  password: nUPv6yYUp5WRRD94
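
For a cross-platform alternative to the shell filter, the same lookup can be sketched in Python. This assumes the log line keeps the format shown above and that the container is named `airflow`, as in the command. The helper names `find_credentials` and `read_airflow_logs` are hypothetical.

```python
import re
import subprocess
from typing import Optional, Tuple

# Matches the log line format shown above:
# "Login with username: <user>  password: <password>"
CREDENTIALS = re.compile(r"username:\s*(\S+)\s+password:\s*(\S+)")


def find_credentials(log_text: str) -> Optional[Tuple[str, str]]:
    """Return the last (username, password) pair found in the log text, if any."""
    matches = CREDENTIALS.findall(log_text)
    return matches[-1] if matches else None


def read_airflow_logs(container: str = "airflow") -> str:
    """Fetch the container logs; `docker logs` may use stdout or stderr."""
    result = subprocess.run(
        ["docker", "logs", container], capture_output=True, text=True
    )
    return result.stdout + result.stderr
```

Calling `find_credentials(read_airflow_logs())` returns the credentials as a tuple, or `None` if Airflow has not logged them yet.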

Log in at localhost:3636 with these credentials and trigger the DAG manually for the first run.

This pipeline will:

  • Pull 2 years of historical data

  • Store them in HDFS

  • Train XGBoost models

  • Write predictions to MongoDB

💡Tips:

If the command above shows no username or password, refresh localhost:3636 (press F5) once or twice, then run the command again.

In airflow/dags/batch_pipeline_dag.py, set schedule_interval="@daily" to have Airflow update the models once a day while the Docker containers are running.


📊 Dashboards

| Layer | Description | URL |
| --- | --- | --- |
| 🟣 Real-Time Dashboard (Flask) | Live price feed • AI chatbot • Real-time XGBoost predictions • Technical indicators overlay | http://localhost:5000 |
| 🟢 Batch Dashboard (Streamlit) | Historical trend analysis • Model metrics (RMSE, MAPE, MAE) • Risk dashboard • Portfolio allocation • Data explorer | http://localhost:8501 |
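
For readers unfamiliar with the model metrics listed for the batch dashboard, the sketch below computes RMSE, MAE, and MAPE using their standard definitions. It is not taken from this repository; the function name `regression_metrics` is hypothetical, and the dashboard may compute these values differently (for example with a library).

```python
import math
from typing import Dict, Sequence


def regression_metrics(
    actual: Sequence[float], predicted: Sequence[float]
) -> Dict[str, float]:
    """Compute RMSE, MAE and MAPE (in percent) for paired price series."""
    if len(actual) == 0 or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [p - a for a, p in zip(actual, predicted)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # MAPE divides by the actual value, so it is undefined for zeros;
    # prices are assumed to be strictly positive here.
    mape = 100.0 * sum(abs(e / a) for e, a in zip(errors, actual)) / n
    return {"rmse": rmse, "mae": mae, "mape": mape}
```

RMSE and MAE are in the same unit as the price, while MAPE is a percentage, which makes it easier to compare across coins with very different price levels.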

💻 Demo Videos

Batch Layer

batchDemo.mp4

Stream Layer

streamDemo.mp4

🧠 Used Technologies

| Category | Tools |
| --- | --- |
| Ingestion | Kafka, Binance WebSocket |
| Real-time Processing | Spark Structured Streaming |
| Batch Processing | Spark ML, XGBoost |
| Scheduling | Airflow |
| Storage | HDFS, HBase, MongoDB |
| Dashboards | Flask, Streamlit |
| AI | Gemini LLM, NewsAPI |
| Deployment | Docker Compose |

🤖 AI Chatbot Features

| Feature | Description |
| --- | --- |
| 📰 News Summarization | Extracts and summarizes relevant crypto news articles automatically. |
| 😃 Sentiment Detection | Identifies positive/negative/neutral sentiment from live news text. |
| 📈 Technical + Fundamental Analysis | Combines market data + indicators (SMA, RSI, support/resistance) with news sentiment. |
| 🇻🇳 Vietnamese Support | Fully understands and responds in Vietnamese (and other languages). |
| 🔀 Mixed Question Handling | Can answer hybrid or complex questions and filter only the crypto-relevant parts. |
| 🧠 Short-Term Memory | Remembers the latest 10 conversation turns for context-aware replies. |
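
The Short-Term Memory feature can be pictured as a fixed-size buffer of recent turns. The sketch below shows one simple way to get that behavior with a bounded deque. The class name `ShortTermMemory` and its methods are hypothetical; the project's chatbot may store its history differently.

```python
from collections import deque
from typing import Deque, List, Tuple

MAX_TURNS = 10  # matches the "latest 10 conversation turns" described above


class ShortTermMemory:
    """Keep only the most recent conversation turns (hypothetical helper)."""

    def __init__(self, max_turns: int = MAX_TURNS) -> None:
        # A deque with maxlen drops the oldest item automatically
        # once the limit is reached.
        self._turns: Deque[Tuple[str, str]] = deque(maxlen=max_turns)

    def add_turn(self, user_message: str, bot_reply: str) -> None:
        self._turns.append((user_message, bot_reply))

    def as_context(self) -> List[Tuple[str, str]]:
        """Return the remembered turns, oldest first."""
        return list(self._turns)
```

The list returned by `as_context()` could be prepended to each new prompt so the LLM sees the recent conversation without the history growing unbounded.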

👥 Contributors


Thanh Dan Bui

Tien Dung Pham

Nguyen Dan Vu

📝 Notes

We also wrote a LaTeX report for this project. See REPORT_EN.pdf for the English version and REPORT_VI.pdf for the Vietnamese version.
