This project showcases a professional-level data engineering solution: PostgreSQL sharded across four containers, integrated with Grafana for real-time monitoring and dashboards. The dataset is based on Airbnb listings in London. The project has been deployed to AWS using CloudFormation, EC2, an Elastic IP, and SSH access via key pairs, simulating a production-ready environment where the PostgreSQL shards run inside Docker containers on an EC2 instance.
```
├── data/                               # (Git-ignored) Raw + cleaned Airbnb data
├── docker/                             # Docker-related setup
│   ├── shard1/init.sql                 # Shard 1 schema
│   ├── shard2/init.sql                 # Shard 2 schema
│   ├── shard3/init.sql                 # Shard 3 schema
│   ├── shard4/init.sql                 # Shard 4 schema
│   ├── docker-compose.yml              # Multi-service setup: shards, pgAdmin, Grafana
│   └── .env                            # Stores credentials and ports
├── infrastructure/
│   └── ec2-sharded-postgres-stack.yml  # CloudFormation template
├── notebooks/
│   ├── eda_analysis.ipynb              # EDA to understand the data
│   └── clean_data.ipynb                # Cleans and saves Airbnb data
├── src/
│   └── shard_router_v1.py              # Routes cleaned data to the appropriate shard using .env
├── .gitignore
└── README.md
```
- Shard Airbnb listings by borough into 4 PostgreSQL instances
- Route the cleaned data programmatically to correct shards
- Visualise insights in Grafana with PostgreSQL data sources
- Use environment variables for security and config
- Avoid committing large files by excluding `/data` from Git
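The first two goals boil down to a lookup table: each borough maps to exactly one shard. A minimal sketch of that idea in Python; the borough groupings and shard names below are illustrative assumptions, not the project's actual mapping (which lives in `src/shard_router_v1.py`):

```python
# Illustrative borough -> shard lookup; the real groupings are defined
# in src/shard_router_v1.py and may differ.
BOROUGH_TO_SHARD = {
    "Camden": "shard1",
    "Westminster": "shard1",
    "Hackney": "shard2",
    "Tower Hamlets": "shard2",
    "Lambeth": "shard3",
    "Southwark": "shard3",
    "Croydon": "shard4",
    "Bromley": "shard4",
}

def route(borough: str) -> str:
    """Return the shard name for a borough; fail loudly on unknown boroughs."""
    try:
        return BOROUGH_TO_SHARD[borough]
    except KeyError:
        raise ValueError(f"No shard configured for borough: {borough!r}")

print(route("Hackney"))  # shard2
```

Failing loudly on an unmapped borough is deliberate: silently dropping rows would skew any per-borough analytics downstream.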
- Programming: Python (`pandas`, `psycopg2`, `dotenv`)
- Databases: PostgreSQL (sharded across 4 Docker containers)
- Containerisation & Orchestration: Docker, Docker Compose
- Monitoring & Visualisation: Grafana (connected to PostgreSQL shards)
- Database Admin UI: pgAdmin
- Security: Environment variables via `.env` (for credentials and configuration)
- Version Control: Git, GitHub (with `.gitignore` to exclude large files and data)
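On the security point: `dotenv` loads `.env` entries into the process environment at runtime, so no credentials are hard-coded. Conceptually this amounts to parsing `KEY=VALUE` lines; a simplified stdlib-only sketch (the real `python-dotenv` also handles quoting, comments, and variable interpolation):

```python
def load_env(text: str) -> dict:
    """Parse KEY=VALUE lines: a simplified version of what python-dotenv does."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comment lines
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

# Illustrative values, not real credentials
sample = """
# .env file
POSTGRES_USER=airbnb
PG_SHARD1_PORT=5433
"""
cfg = load_env(sample)
print(cfg["PG_SHARD1_PORT"])  # 5433
```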
```bash
git clone https://github.com/konomissira/airbnb_sharding_project.git
cd airbnb_sharding_project/docker
```

```env
# .env file
POSTGRES_USER=YourPostgresUsername
POSTGRES_PASSWORD=YourPostgresPassword
PG_SHARD1_PORT=5433
PG_SHARD2_PORT=5434
PG_SHARD3_PORT=5435
PG_SHARD4_PORT=5436
PGADMIN_DEFAULT_EMAIL=admin@example.com
PGADMIN_DEFAULT_PASSWORD=YourPgadminPassword
GRAFANA_ADMIN_USER=YourGrafanaUsername
GRAFANA_ADMIN_PASSWORD=YourGrafanaPassword
```

Because the `/data` folder is listed in `.gitignore` (to avoid slow Git pushes), you must download the dataset yourself from https://insideairbnb.com/fr/get-the-data/ and clean it locally.
You also need Jupyter Notebook to run `eda_analysis.ipynb` and `clean_data.ipynb`, which clean the raw data and save the result into the `data/` folder. Docker must also be installed on your machine.
```bash
cd ../docker
docker compose up -d
```

- Access pgAdmin at http://localhost:5050
- Use credentials from `.env`
When registering each shard server:
- Host: `pg_shard1`, `pg_shard2`, etc.
- Port: `5432`
- Username: from `.env`
- Password: from `.env`
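Note that `pg_shard1:5432` is the address inside the Docker network; from the host machine you instead connect to `localhost` on the mapped port from `.env`. A sketch of building per-shard connection settings programmatically (variable names match the `.env` shown above; the fallback defaults are placeholders):

```python
import os

def shard_conn_params(i: int) -> dict:
    """Connection settings for shard i (1-4), mirroring the pgAdmin registration.
    From the host machine, connect to localhost on the shard's mapped port;
    inside the Docker network the host would be pg_shard{i} on 5432."""
    return {
        "host": "localhost",
        "port": int(os.getenv(f"PG_SHARD{i}_PORT", str(5432 + i))),  # 5433..5436
        "dbname": f"shard{i}_db",
        "user": os.getenv("POSTGRES_USER", "postgres"),
        "password": os.getenv("POSTGRES_PASSWORD", ""),
    }

print(shard_conn_params(1)["port"])  # 5433 unless PG_SHARD1_PORT overrides it
```

These dictionaries can be passed directly as keyword arguments to `psycopg2.connect(**params)`.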
```bash
cd src
python3 shard_router_v1.py
```

This script reads `cleaned_airbnb_listings.csv`, maps each borough to its correct shard, and inserts records using `psycopg2` and `execute_values`.
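The routing-plus-insert flow can be sketched as below. The table and column names are assumptions, and `insert_batch` is shown but not executed here since it needs a live shard connection; the authoritative logic is in `src/shard_router_v1.py`:

```python
from collections import defaultdict

def group_rows_by_shard(rows, borough_to_shard, borough_index=2):
    """Bucket row tuples by target shard so each shard gets one batched insert."""
    batches = defaultdict(list)
    for row in rows:
        batches[borough_to_shard[row[borough_index]]].append(row)
    return dict(batches)

def insert_batch(conn, rows):
    """One batched INSERT via psycopg2's execute_values (needs a live connection,
    so it is not called in this sketch). Column names are illustrative."""
    from psycopg2.extras import execute_values
    with conn.cursor() as cur:
        execute_values(
            cur,
            "INSERT INTO listings (id, name, neighbourhood, room_type, price) VALUES %s",
            rows,
        )
    conn.commit()

# Illustrative rows: (id, name, neighbourhood, room_type, price)
rows = [
    (1, "Flat in Camden", "Camden", "Entire home/apt", 120.0),
    (2, "Room in Hackney", "Hackney", "Private room", 55.0),
    (3, "Studio in Camden", "Camden", "Entire home/apt", 95.0),
]
mapping = {"Camden": "shard1", "Hackney": "shard2"}
batches = group_rows_by_shard(rows, mapping)
print(sorted(batches))  # ['shard1', 'shard2']
```

Batching with `execute_values` sends many rows per round trip, which is far faster than one `INSERT` per row.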
- Access Grafana at http://localhost:3000
- Use credentials from `.env`
Repeat this step 4 times (one per shard):
- Name: `shard1`, `shard2`, etc.
- Host: `pg_shard1:5432`
- Database: `shard1_db`, etc.
- User: from `.env`
- Password: from `.env`
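Registering four near-identical data sources by hand is repetitive. Grafana also exposes an HTTP API (`POST /api/datasources`) that accepts JSON payloads, so the four definitions could be generated in a loop. A sketch that only builds the payloads (the exact field schema varies by Grafana version, and the credentials here are placeholders):

```python
def shard_datasource(i: int, user: str, password: str) -> dict:
    """Build a Grafana PostgreSQL data-source payload for shard i (1-4).
    Field names follow Grafana's POST /api/datasources JSON body."""
    return {
        "name": f"shard{i}",
        "type": "postgres",
        "access": "proxy",
        "url": f"pg_shard{i}:5432",           # host inside the Docker network
        "database": f"shard{i}_db",
        "user": user,
        "secureJsonData": {"password": password},  # kept out of plain jsonData
        "jsonData": {"sslmode": "disable"},
    }

payloads = [shard_datasource(i, "postgres", "secret") for i in range(1, 5)]
print(payloads[0]["url"])  # pg_shard1:5432
```

Each payload could then be sent to `http://localhost:3000/api/datasources` with the admin credentials from `.env`.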
```sql
-- Number of listings per room type
SELECT room_type, COUNT(*) FROM listings GROUP BY room_type;

-- Average price per borough
SELECT neighbourhood, AVG(price) FROM listings GROUP BY neighbourhood;
```

- Modular Docker setup for each shard
- Environment variables used for all credentials
- Clean, professional code with clear structure
- Interactive visual analytics via Grafana
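One caveat with queries like the average-price one above: since listings are split across four databases, a global average cannot be obtained by averaging the four per-shard averages, because shards hold different numbers of rows. Each shard should return partial aggregates (`COUNT`, `SUM`) that are merged afterwards. A sketch of the merge step, with hard-coded per-shard results standing in for real query output:

```python
def merge_avg(partials):
    """Combine per-shard (count, sum) pairs into one global average.
    Averaging the per-shard averages would weight shards unequally."""
    total_count = sum(c for c, _ in partials)
    total_sum = sum(s for _, s in partials)
    return total_sum / total_count if total_count else None

# (row_count, sum_of_prices) per shard -- illustrative numbers, not real data
shard_results = [(10, 1200.0), (4, 300.0), (6, 900.0), (0, 0.0)]
print(merge_avg(shard_results))  # 120.0
```

The same count/sum trick generalises to any distributive aggregate (`COUNT`, `SUM`, `MIN`, `MAX`); medians and percentiles need more care.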
- AWS account (Free Tier eligible)
- AWS CLI configured with your IAM user
- An EC2 key pair (RSA format, `.pem` file)
- GitHub repo cloned locally
- Docker & Docker Compose installed
Make sure your IAM user has the following permissions:
- `AmazonEC2FullAccess`
- `CloudFormationFullAccess`
Create a key pair to SSH into your EC2 instance:
```bash
aws ec2 create-key-pair --key-name kono-key \
    --query 'KeyMaterial' --output text > kono-key.pem
chmod 400 kono-key.pem
```

Move the `.pem` file from your Downloads folder to the project root:

```bash
mv ~/Downloads/kono-key.pem ./
```

Do not push the `.pem` file (e.g. `kono-key.pem`) to GitHub. Add it to `.gitignore`:

```
# .gitignore
kono-key.pem
```

The default AMI ID may not exist in eu-west-1. To get the latest Ubuntu AMI:
```bash
aws ec2 describe-images \
    --owners 099720109477 \
    --filters 'Name=name,Values=ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*' \
    --query 'Images[*].[ImageId,CreationDate]' \
    --output text \
    --region eu-west-1 | sort -k2 -r | head -n 1
```

Copy the AMI ID (e.g. ami-0dc0a.......) and replace it in the CloudFormation template, `infrastructure/ec2-sharded-postgres-stack.yml`. Then launch the infrastructure:
```bash
aws cloudformation create-stack \
    --stack-name postgres-shard-stack \
    --template-body file://infrastructure/ec2-sharded-postgres-stack.yml \
    --parameters ParameterKey=KeyName,ParameterValue=kono-key \
    --capabilities CAPABILITY_NAMED_IAM \
    --region eu-west-1
```

You should see a `CREATE_COMPLETE` status once the stack is deployed.
An Elastic IP (54.247.109.226 in this deployment) was associated so that the EC2 instance always keeps the same public IP address. SSH into the instance:

```bash
ssh -i kono-key.pem ubuntu@54.247.X.X
```

Once connected, navigate to the project folder and run:

```bash
docker compose up -d
```

This starts the following services:

- `pg_shard1`, `pg_shard2`, `pg_shard3`, `pg_shard4`
- pgAdmin (on port 5050)
- Grafana (on port 3000)
Stop the instance when you are done (replace the instance ID with your own, e.g. `i-0c6beb07XXXXXXXX`):

```bash
aws ec2 stop-instances --instance-ids i-0c6beb07XXXXXXXX --region eu-west-1
```

You can confirm it's stopped with:

```bash
aws ec2 describe-instances --instance-ids i-0c6beb07XXXXXXXX --region eu-west-1 \
    --query 'Reservations[*].Instances[*].State.Name' --output text
```

- Integrate Apache Airflow
- Add automated test scripts for ETL validation
- Provision Grafana panels programmatically
- This project is licensed under the MIT License.
- You are free to use, modify, and distribute this software with proper attribution.
- Name: Mahamadou
- Role: Data Engineer