A full end-to-end Real-Time + Batch Big Data system for cryptocurrency analytics
Spark • Kafka • HBase • Airflow • HDFS • MongoDB • XGBoost • LLM Chatbot
This project demonstrates a Lambda Architecture for real-time cryptocurrency analytics using:
- Kafka for ingestion
- Spark Streaming for real-time processing
- Spark Batch + Airflow for scheduled ML training
- HDFS + HBase + MongoDB for storage layers
- XGBoost for crypto price prediction
- Flask + Streamlit for UI layers (real-time & batch)
- Gemini LLM + NewsAPI for sentiment-aware market analysis
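
As a rough illustration of how the layers listed above fit together, the sketch below shows the query-time merge that a Lambda Architecture relies on: a batch view that is complete but stale is combined with a speed view that covers only recent events. The function name, the dictionary layout, and the sample numbers are hypothetical and are not taken from this project's code.

```python
from typing import Dict


def merge_views(batch_view: Dict[str, float],
                speed_view: Dict[str, float]) -> Dict[str, float]:
    """Combine per-symbol traded volume from the batch and speed layers.

    batch_view: totals computed by the scheduled batch job (complete but stale).
    speed_view: totals for events that arrived after the last batch run.
    """
    merged = dict(batch_view)
    for symbol, recent_volume in speed_view.items():
        merged[symbol] = merged.get(symbol, 0.0) + recent_volume
    return merged


# Hypothetical sample values, for illustration only.
batch = {"BTCUSDT": 1200.0, "ETHUSDT": 800.0}
speed = {"BTCUSDT": 15.5, "SOLUSDT": 3.0}
print(merge_views(batch, speed))
```

Keeping the merge in one small function means the dashboards never need to know which layer a number came from.
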
Open your terminal, navigate to the desired directory, and clone the repository:

    git clone https://github.com/yourusername/crypto-bigdata-project
    cd crypto-bigdata-project

Make sure Docker is running, then run the following command to build and launch all services:

    docker compose up -d --build

This will automatically start:

| Component | Purpose |
|---|---|
| Kafka + Zookeeper | Message queue |
| Spark Master + Worker | Batch + streaming processing |
| HDFS (Namenode/Datanode) | Distributed storage |
| HBase | Real-time serving layer |
| MongoDB | Batch aggregation storage |
| Airflow | Workflow scheduler |
| Streamlit | Batch dashboard |
| Flask | Real-time dashboard + chatbot |
| Producer | Binance → Kafka WebSocket bridge |
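
To make the Producer row more concrete, here is a minimal sketch of the kind of normalization a Binance → Kafka bridge might perform on each WebSocket message before publishing it. It assumes the Binance trade stream payload (fields `s`, `p`, `q`, `T`); the topic name `crypto-trades`, the selected fields, and both function names are hypothetical. The Kafka producer is replaced by a plain callable so the example runs without a broker.

```python
import json
from typing import Callable


def to_kafka_record(raw_message: str) -> dict:
    """Convert one Binance trade-stream message into a flat record."""
    event = json.loads(raw_message)
    return {
        "symbol": event["s"],
        "price": float(event["p"]),      # Binance sends numbers as strings
        "quantity": float(event["q"]),
        "trade_time_ms": event["T"],
    }


def forward(raw_message: str, send: Callable[[str, bytes], None],
            topic: str = "crypto-trades") -> None:
    """Serialize the record and hand it to a producer's send function."""
    record = to_kafka_record(raw_message)
    send(topic, json.dumps(record).encode("utf-8"))


# Stand-in for a Kafka producer so the sketch runs without a broker.
sent = []
sample = ('{"e":"trade","E":1700000000000,"s":"BTCUSDT","t":1,'
          '"p":"42000.50","q":"0.010","T":1700000000000,"m":false}')
forward(sample, lambda topic, value: sent.append((topic, value)))
print(sent[0])
```

In a real deployment, `send` would wrap whichever Kafka client the producer container uses.
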
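This step produces the XGBoost price prediction models and the historical data.
After starting Docker, retrieve the Airflow credentials from the container logs (`findstr /i` is the Windows filter; on Linux or macOS, `grep -i` does the same job):

    docker logs airflow | findstr /i "password"

The output shows your Airflow username and password, for example:

    Login with username: admin password: nUPv6yYUp5WRRD94

Use these credentials to log in at localhost:3636 and trigger the DAG for the first run.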
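
If you prefer a platform-independent way to read the credentials, the following sketch extracts them from log text with a regular expression. The pattern is based only on the example log line shown above, so it is an assumption that other Airflow versions print the same wording; the function name is hypothetical.

```python
import re
from typing import Optional, Tuple

# Pattern based on the example log line shown above.
CREDENTIALS = re.compile(r"username:\s*(\S+)\s+password:\s*(\S+)", re.IGNORECASE)


def find_credentials(log_text: str) -> Optional[Tuple[str, str]]:
    """Return (username, password) from the first matching log line, if any."""
    for line in log_text.splitlines():
        match = CREDENTIALS.search(line)
        if match:
            return match.group(1), match.group(2)
    return None


sample_log = "Login with username: admin password: nUPv6yYUp5WRRD94"
print(find_credentials(sample_log))
```

To use it against the running container, capture the output of `docker logs airflow` (for example with `subprocess.run(..., capture_output=True, text=True)`) and pass both the standard output and standard error text to `find_credentials`.
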
Once triggered, the batch pipeline will:
- Pull 2 years of historical data
- Store it in HDFS
- Train XGBoost models
- Write predictions to MongoDB
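
The first step, pulling two years of history, usually has to be split into several requests because market-data APIs cap the number of candles returned per call. The sketch below computes those request windows using only the standard library. The hourly candle interval and the 1000-candle page size are assumptions chosen for illustration; this README does not state which interval or endpoint the DAG uses, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterator, Tuple


def request_windows(end: datetime, days: int = 730,
                    candle: timedelta = timedelta(hours=1),
                    page_size: int = 1000) -> Iterator[Tuple[datetime, datetime]]:
    """Yield (start, stop) ranges that together cover the last `days` days."""
    start = end - timedelta(days=days)
    step = candle * page_size          # time span covered by one request
    cursor = start
    while cursor < end:
        stop = min(cursor + step, end)
        yield cursor, stop
        cursor = stop


end = datetime(2025, 1, 1, tzinfo=timezone.utc)
windows = list(request_windows(end))
print(len(windows), windows[0][0], windows[-1][1])
```

Each window can then be fetched independently, which also makes it easy to retry a single failed request without restarting the whole download.
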
💡 Tips:

- If the command above shows no username or password, press F5 once or twice to refresh localhost:3636 and check again.
- In `airflow/dags/batch_pipeline_dag.py`, set `schedule_interval="@daily"` so that the models are updated daily once Docker is running.
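
For readers unfamiliar with Airflow, here is a minimal sketch of where the `schedule_interval` setting from the tip above typically lives. Only the parameter name and the `"@daily"` value come from this README; the DAG id, start date, task, and `build_dag` helper are hypothetical, and the real `batch_pipeline_dag.py` may be structured differently. Airflow is imported lazily so the snippet can be loaded on a machine without Airflow installed. Note that newer Airflow releases rename this parameter to `schedule`.

```python
from datetime import datetime

SCHEDULE = "@daily"  # Airflow preset, equivalent to the cron expression "0 0 * * *"


def build_dag():
    """Build a one-task DAG; Airflow is imported lazily so the file loads anywhere."""
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    dag = DAG(
        dag_id="batch_pipeline",            # hypothetical id
        schedule_interval=SCHEDULE,
        start_date=datetime(2024, 1, 1),    # hypothetical start date
        catchup=False,                      # do not backfill missed days
    )
    BashOperator(
        task_id="train_models",             # hypothetical task
        bash_command="echo 'train XGBoost models here'",
        dag=dag,
    )
    return dag


# In a real DAG file, expose the DAG at module level so the scheduler finds it:
# dag = build_dag()
```

With `catchup=False`, starting Docker after a few days offline triggers only the most recent daily run instead of one run per missed day.
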
| Layer | Description | URL |
|---|---|---|
| 🟣 Real-Time Dashboard (Flask) | • Live price feed • AI chatbot • Real-time XGBoost predictions • Technical indicators overlay | http://localhost:5000 |
| 🟢 Batch Dashboard (Streamlit) | • Historical trend analysis • Model metrics (RMSE, MAPE, MAE) • Risk dashboard • Portfolio allocation • Data explorer | http://localhost:8501 |
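
The model metrics listed for the batch dashboard (RMSE, MAPE, MAE) have standard definitions, sketched below in plain Python. This is an illustration of the formulas only, not the project's evaluation code; the function name and the sample prices are hypothetical.

```python
import math
from typing import Sequence


def regression_metrics(actual: Sequence[float], predicted: Sequence[float]) -> dict:
    """Compute RMSE, MAE, and MAPE (in percent) for paired price series."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [p - a for a, p in zip(actual, predicted)]
    n = len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    # MAPE is undefined when an actual value is zero; prices are assumed positive.
    mape = 100.0 * sum(abs(e / a) for e, a in zip(errors, actual)) / n
    return {"rmse": rmse, "mae": mae, "mape": mape}


# Hypothetical prices, for illustration only.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]
print(regression_metrics(actual, predicted))
```

RMSE and MAE are expressed in price units, while MAPE is relative, which makes it the easier metric to compare across coins with very different price levels.
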

Demo videos: `batchDemo.mp4` (batch dashboard) and `streamDemo.mp4` (streaming dashboard).

| Category | Tools |
|---|---|
| Ingestion | Kafka, Binance WebSocket |
| Real-time Processing | Spark Structured Streaming |
| Batch Processing | Spark ML, XGBoost |
| Scheduling | Airflow |
| Storage | HDFS, HBase, MongoDB |
| Dashboards | Flask, Streamlit |
| AI | Gemini LLM, NewsAPI |
| Deployment | Docker Compose |
| Feature | Description |
|---|---|
| 📰 News Summarization | Extracts and summarizes relevant crypto news articles automatically. |
| 😃 Sentiment Detection | Identifies positive/negative/neutral sentiment from live news text. |
| 📈 Technical + Fundamental Analysis | Combines market data + indicators (SMA, RSI, support/resistance) with news sentiment. |
| 🇻🇳 Vietnamese Support | Fully understands and responds in Vietnamese (and other languages). |
| 🔀 Mixed Question Handling | Can answer hybrid or complex questions and filter only the crypto-relevant parts. |
| 🧠 Short-Term Memory | Remembers the latest 10 conversation turns for context-aware replies. |
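
The short-term memory feature can be approximated with a bounded queue. The sketch below keeps the latest 10 turns and renders them as context for the next prompt. The class and method names are hypothetical, and the actual chatbot may store and format its history differently.

```python
from collections import deque
from typing import Deque, List, Tuple


class ConversationMemory:
    """Keep only the most recent conversation turns."""

    def __init__(self, max_turns: int = 10) -> None:
        # A deque with maxlen silently drops the oldest item when full.
        self._turns: Deque[Tuple[str, str]] = deque(maxlen=max_turns)

    def add_turn(self, user_message: str, bot_reply: str) -> None:
        self._turns.append((user_message, bot_reply))

    def as_prompt_context(self) -> str:
        """Render the remembered turns as text to prepend to the next prompt."""
        lines: List[str] = []
        for user_message, bot_reply in self._turns:
            lines.append(f"User: {user_message}")
            lines.append(f"Assistant: {bot_reply}")
        return "\n".join(lines)


memory = ConversationMemory()
for i in range(12):
    memory.add_turn(f"question {i}", f"answer {i}")
print(memory.as_prompt_context().splitlines()[0])  # oldest remembered turn
```

After twelve turns, the two oldest have been discarded, so the context starts at the third question.
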

Contributors:

- Thanh Dan Bui
- Tien Dung Pham
- Nguyen Dan Vu

We also wrote a LaTeX report for this project. Refer to `REPORT_EN.pdf` for the English version and `REPORT_VI.pdf` for the Vietnamese version.


