This project implements a complete ETL (Extract, Transform, Load) pipeline for an airline's business intelligence system. It demonstrates extracting data from multiple sources (the airline's operational database and CSV files), transforming and loading it into a data warehouse, and building star schemas for analytical queries.
- Project Overview
- Architecture
- Prerequisites
- Installation
- Usage
- Project Structure
- Implementation Details
- Data Flow
- Visualization
- Development
## Project Overview

This project simulates a real-world business intelligence scenario in which data from multiple sources must be integrated and transformed for analytical purposes. The system:
- Generates synthetic data for a relational database and CSV files representing an airline's data stores
- Loads this data into a data warehouse
- Creates star schemas for analytical queries
- Handles orchestration through Prefect for scheduled incremental loads
- Sends push notifications on load completion
- Provides visualization capabilities through Metabase (the full Metabase dashboard can be found in images/metabase_dashboard.pdf)
## Architecture

The project uses a modern data stack:

- Data Sources:
  - Relational Database (PostgreSQL)
  - CSV Files
- ETL Orchestration: Prefect
- Data Warehouse: PostgreSQL
- Visualization: Metabase
## Prerequisites

- Docker and Docker Compose
- Python 3.12+
- Git
## Installation

- Clone the repository:

  ```bash
  git clone https://github.com/inconspicuoususername/airline-bi
  cd airline-bi
  ```

- Create and activate a virtual environment:

  ```bash
  python -m venv .venv
  source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Start the services using Docker Compose:

  ```bash
  docker compose up -d
  ```

  This will start:
  - PostgreSQL database
  - Prefect server and worker
  - Metabase

  The services are then available at:
  - Prefect UI: http://localhost:4200
  - Metabase: http://localhost:3000
## Usage

- Start the Prefect server, worker, and Metabase frontend:

  ```bash
  docker compose up -d
  ```

- Synthesize the database and CSV files:

  ```bash
  python3 -m data.synthesize_reldb
  python3 -m data.csv
  ```

- Run a full load:

  ```bash
  python3 -m index
  ```

- Deploy the flow for incremental loads:

  ```bash
  python3 -m flows.initial
  ```

- Either wait for the flow to be scheduled, or run it directly from the Prefect Server dashboard.
## Project Structure

```
.
├── data/                 # Data generation and storage
├── database/             # Database models and connections
├── etl/                  # ETL implementation
│   ├── warehouse.py      # Data warehouse operations
│   ├── star_schema.py    # Star schema transformations
│   └── utils.py          # Utility functions
├── flows/                # Prefect workflow definitions
├── model/                # Data models
├── util/                 # Utility functions
├── compose.yaml          # Docker Compose configuration
└── requirements.txt      # Python dependencies
```
## Implementation Details

The project generates synthetic data for an airline operations system, including:
- Pilots: Pilot information, license numbers, and qualifications
- Cabin Crew: Staff details and employee IDs
- Customers: Passenger information, frequent flyer status
- Airports: Airport codes, names, locations
- Airplanes: Aircraft models, registration numbers, fuel consumption
- Flights: Flight schedules, routes, status
- Flight Bookings: Passenger bookings and seat assignments
- Airline Reviews: Customer feedback and ratings
- Relational Database:
  - Uses SQLAlchemy ORM for data extraction
  - Implements incremental loading with change data capture
  - Handles complex joins between related tables
  - Supports batch processing for large datasets
- CSV Files:
  - Uses pandas for efficient CSV reading
  - Handles various data formats and encodings
  - Supports incremental loading from CSV updates
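A minimal sketch of the CSV side, assuming an illustrative reviews file with an `updated_at` column (the real files are produced by the `data` package and may be shaped differently):

```python
import io

import pandas as pd

# Illustrative in-memory CSV standing in for a file on disk.
sample_csv = io.StringIO(
    "review_id,rating,updated_at\n"
    "1,4,2024-01-05\n"
    "2,2,2024-06-20\n"
)


def read_new_reviews(source, since: str) -> pd.DataFrame:
    """Read a reviews CSV and keep only rows updated after the watermark."""
    df = pd.read_csv(source, parse_dates=["updated_at"])
    return df[df["updated_at"] > pd.Timestamp(since)]


new_rows = read_new_reviews(sample_csv, since="2024-03-01")
print(new_rows["review_id"].tolist())  # only the review changed after the watermark
```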
- Data Warehouse Loading:
  - Implements Type 2 Slowly Changing Dimensions (SCD)
  - Handles surrogate keys for dimension tables
  - Manages historical data tracking
  - Supports incremental updates
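The Type 2 SCD logic can be sketched in plain Python — in-memory dicts stand in for the PostgreSQL dimension table, and the column names are illustrative. When a tracked attribute changes, the current version is expired by setting its `end_date`, and a new version with a fresh surrogate key is appended:

```python
from datetime import date

HIGH_DATE = date(9999, 12, 31)  # sentinel end_date for the current version


def apply_scd2(dim_rows: list[dict], natural_key, new_attrs: dict,
               load_date: date) -> list[dict]:
    """Type 2 SCD: expire the current version and append a new one."""
    next_sk = max((r["sk"] for r in dim_rows), default=0) + 1
    for row in dim_rows:
        if row["nk"] == natural_key and row["end_date"] == HIGH_DATE:
            if all(row.get(k) == v for k, v in new_attrs.items()):
                return dim_rows  # attributes unchanged: nothing to do
            row["end_date"] = load_date  # expire the current version
            break
    dim_rows.append({"sk": next_sk, "nk": natural_key,
                     "start_date": load_date, "end_date": HIGH_DATE,
                     **new_attrs})
    return dim_rows


# A customer upgrades from silver to gold: history is preserved.
dim = [{"sk": 1, "nk": "C42", "tier": "silver",
        "start_date": date(2023, 1, 1), "end_date": HIGH_DATE}]
apply_scd2(dim, "C42", {"tier": "gold"}, date(2024, 6, 1))
print([(r["sk"], r["tier"], r["end_date"]) for r in dim])
```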
- Data Quality:
  - Validates data integrity
  - Handles missing values
  - Standardizes data formats
  - Enforces business rules
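A few checks of this kind sketched with pandas — the column names and the rules themselves are hypothetical, not the project's actual rule set:

```python
import pandas as pd


def validate_bookings(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows violating basic integrity/business rules and standardize formats."""
    df = df.dropna(subset=["customer_id", "flight_id"]).copy()  # required keys
    df = df[df["fare"] >= 0].copy()                             # business rule
    df["seat"] = df["seat"].str.upper().str.strip()             # standardize
    return df


raw = pd.DataFrame({
    "customer_id": [1, None, 3],
    "flight_id": [10, 11, 12],
    "fare": [199.0, 50.0, -5.0],   # negative fare violates a business rule
    "seat": [" 12a", "3c", "7F "],
})
print(validate_bookings(raw))  # only the first row survives all checks
```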
- Warehouse Operations:
  - Uses PostgreSQL for the data warehouse
  - Implements efficient bulk loading
  - Supports transaction management
  - Handles concurrent operations
- Incremental Loading:
  - Tracks changes using end_date timestamps
  - Supports partial updates
  - Maintains data consistency
  - Optimizes performance
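In outline, watermark-based incremental loading works like this (an in-memory sketch with an illustrative `updated_at` column; the project itself tracks changes via `end_date` timestamps persisted in the warehouse):

```python
from datetime import datetime


class IncrementalLoader:
    """Minimal watermark bookkeeping for incremental loads."""

    def __init__(self):
        self.watermark = datetime.min   # timestamp of the last loaded change
        self.target: list[dict] = []    # stand-in for a warehouse table

    def load(self, source_rows: list[dict]) -> int:
        """Load only rows changed since the last run. The watermark is
        advanced only after the load succeeds, so a failed run can be
        retried without losing data."""
        fresh = [r for r in source_rows if r["updated_at"] > self.watermark]
        self.target.extend(fresh)  # stand-in for a bulk INSERT/UPSERT
        if fresh:
            self.watermark = max(r["updated_at"] for r in fresh)
        return len(fresh)


loader = IncrementalLoader()
rows = [{"id": 1, "updated_at": datetime(2024, 1, 1)},
        {"id": 2, "updated_at": datetime(2024, 2, 1)}]
print(loader.load(rows))  # 2 -- first run loads everything
print(loader.load(rows))  # 0 -- second run sees no new changes
```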
The project implements a comprehensive star schema for airline analytics:
- DimDate:
  - Date attributes for temporal analysis
  - Supports various date hierarchies
  - Includes business day indicators
- DimAirport:
  - Airport details
  - Geographic information
  - Operational metrics
- DimAirplane:
  - Aircraft specifications
  - Maintenance information
  - Performance metrics
- DimPilot:
  - Pilot qualifications
  - Experience metrics
  - Performance indicators
- DimCustomer:
  - Customer demographics
  - Loyalty program status
  - Travel preferences
- DimFlight:
  - Flight details
  - Route information
  - Schedule data
- FactFlight:
  - Flight performance metrics
  - Delay information
  - Operational statistics
  - Links to date, airport, and airplane dimensions
- FactBooking:
  - Booking statistics
  - Revenue metrics
  - Seat utilization
  - Links to customer and flight dimensions
- FactReview:
  - Customer satisfaction metrics
  - Service ratings
  - Feedback analysis
  - Links to customer, flight, and date dimensions
- Warehouse schema:
  - Uses SQLAlchemy for the ORM
  - Implements proper indexing for performance
  - Supports referential integrity
  - Handles complex relationships
- Orchestration:
  - Uses Prefect for workflow orchestration
  - Modular design for maintainability
  - Error handling and logging
  - Monitoring and alerting
- Validation:
  - Input validation
  - Data type checking
  - Business rule enforcement
  - Error reporting
- Performance:
  - Batch processing
  - Parallel execution
  - Efficient SQL queries
  - Proper indexing
## Data Flow

- Data is generated and stored in the relational database and CSV files
- Prefect orchestrates the ETL process
- Data is transformed and loaded into the warehouse
- Star schemas are created for analytical queries
- Metabase connects to the warehouse for visualization
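The flow above, reduced to a plain-Python sketch — in the project these steps are Prefect tasks inside a flow (see `flows/`); the row shapes and the cleaning rule here are purely illustrative:

```python
def extract(sources: dict) -> list[dict]:
    """Pull rows from all configured sources (operational DB, CSV files)."""
    return [row for rows in sources.values() for row in rows]


def transform(rows: list[dict]) -> list[dict]:
    """Standardize values before loading (here: clamp negative delays to 0)."""
    return [{**r, "delay_minutes": max(0, r["delay_minutes"])} for r in rows]


def load(rows: list[dict], warehouse: list) -> None:
    """Append the transformed rows to the warehouse stand-in."""
    warehouse.extend(rows)


warehouse: list[dict] = []
sources = {
    "reldb": [{"flight": "BA1", "delay_minutes": 12}],
    "csv": [{"flight": "BA2", "delay_minutes": -3}],  # bad value to clean
}
load(transform(extract(sources)), warehouse)
print(warehouse)
```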
## Visualization

Metabase is used for data visualization:

- Access Metabase at http://localhost:3000
- Set up a new database connection to the warehouse
- Create dashboards and visualizations using the star schemas