```bash
git clone https://github.com/nhungoc1508/S25-BDM-Project.git
cd S25-BDM-Project
cd data-simulation
docker network create data-processing-network
docker compose up -d
```

Check that the PostgreSQL server is running and that the mock data has been loaded:

```bash
docker logs postgres_sis | grep "Mock data"
```

You should see either `Mock data inserted successfully!` or `Mock data already exists. Skipping insertion.`
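Beyond the logs, you can also query the database inside the container; a minimal sketch, assuming the default `postgres` user (the actual user and database in this setup may differ):

```bash
# List the tables that the mock-data load created
docker exec -it postgres_sis psql -U postgres -c "\dt"
```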
Check that the MongoDB server is running and that the mock data has been loaded:

```bash
docker logs mongo_counselors | grep "Mock data"
```

You should see `Mock data inserted successfully!`
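Similarly for MongoDB; a minimal sketch using `mongosh`, assuming it is available in the image:

```bash
# List databases; the counselors data should appear among them
docker exec -it mongo_counselors mongosh --quiet --eval "db.adminCommand({ listDatabases: 1 })"
```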
**Important:** If you are running locally, comment out this line in the `docker-compose.yaml` file (it appears 3 times, under `spark-master`, `spark-worker-1`, and `spark-worker-2`):

```yaml
platform: linux/arm64
```
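For reference, a sketch of the change under `spark-master` (the same edit applies under the two workers); only the `platform` line itself is taken from the original file, the surrounding structure is illustrative:

```yaml
spark-master:
  # platform: linux/arm64   # commented out for local runs
```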
Build the custom images for Spark and Airflow:

```bash
cd ../delta-lake
docker build -t custom-airflow -f Dockerfile.airflow .
docker build -t custom-spark -f Dockerfile.spark .
```
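To confirm that both images were built, you can list them; a quick check:

```bash
docker images | grep custom-
```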
Start MongoDB and Neo4j:

```bash
docker compose up counseling-db graph-db -d
```
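You can verify that both database containers are up before moving on; a quick check:

```bash
docker compose ps counseling-db graph-db
```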
Start the Spark master:

```bash
docker compose up spark-master -d
```
Check that the master is running:

```bash
docker logs spark-master | grep "I have been elected leader! New state: ALIVE"
```
Start the workers:

```bash
docker compose up spark-worker-1 spark-worker-2 -d
```

Check that all nodes are running and that the workers are registered with the master:
```bash
docker logs spark-worker-1 | grep "Successfully registered with master spark://spark-master:7077"
docker logs spark-worker-2 | grep "Successfully registered with master spark://spark-master:7077"
```

If the workers fail to register, bring the Spark services down and repeat the previous steps:
```bash
docker compose down spark-master spark-worker-1 spark-worker-2 -v
```

Once running, the Spark master UI is available at localhost:8081/ and will show 2 alive workers.
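As a command-line alternative to the UI, the Spark master also serves its status as JSON; a minimal sketch, assuming the port mapping to 8081 used above:

```bash
# Each registered worker should report "status" : "ALIVE"
curl -s http://localhost:8081/json | grep '"status"'
```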
Next, set up Airflow. Create the required directories, record the Airflow user ID, and run the initialization service:

```bash
mkdir -p ./data ./logs ./plugins ./config
echo -e "AIRFLOW_UID=$(id -u)" > .env
docker compose up airflow-init
```

Check that Airflow is using PostgreSQL for metadata (and not SQLite):
```bash
docker logs airflow-init | grep "DB: postgresql+psycopg2"
```

Start the rest of the Airflow-related services:
```bash
docker compose up airflow-worker airflow-scheduler airflow-dag-processor airflow-apiserver airflow-triggerer airflow-cli flower -d
```

Check that the webserver UI is up and running:
```bash
docker logs airflow-apiserver | grep "Application startup complete"
```

The Airflow webserver is available at localhost:8080/.
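If you prefer a command-line check, you can confirm the webserver responds before opening a browser; a minimal sketch that just prints the HTTP status code:

```bash
# 200 (or a redirect code such as 302) means the webserver is serving requests
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/
```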
Log in with username `airflow` and password `airflow`. After logging in, click the Dags tab in the left menu bar; the webserver UI will list all available DAGs.
In the Airflow webserver UI, go to Admin > Connections. Select Add Connection and add a connection to the Spark master with connection ID `spark-default`, type Spark, host `spark://spark-master`, and port 7077.
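The same connection can also be created with the Airflow CLI instead of the UI; a minimal sketch, run inside one of the running Airflow containers (the choice of `airflow-scheduler` for `exec` is arbitrary):

```bash
docker compose exec airflow-scheduler airflow connections add spark-default \
    --conn-type spark \
    --conn-host spark://spark-master \
    --conn-port 7077
```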
Either trigger the DAGs manually or wait for their scheduled runs, and monitor the DAG logs.
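DAGs can also be listed and triggered from the CLI; a minimal sketch (`<dag_id>` is a placeholder for one of the DAG IDs shown in the UI):

```bash
docker compose exec airflow-scheduler airflow dags list
docker compose exec airflow-scheduler airflow dags trigger <dag_id>
```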