This project implements a complete ETL (Extract, Transform, Load) pipeline for an airline's business intelligence system. It demonstrates extracting data from multiple sources (the airline's operational database and CSV files), transforming and loading it into a data warehouse, and building star schemas for analytical queries.
- Project Overview
- Architecture
- Prerequisites
- Installation
- Usage
- Project Structure
- Implementation Details
- Data Flow
- Visualization
- Development
## Project Overview

This project simulates a real-world business intelligence scenario in which data from multiple sources must be integrated and transformed for analytical purposes. The system:
- Generates synthetic data for a relational database and CSV files representing an airline's data stores
- Loads this data into a data warehouse
- Creates star schemas for analytical queries
- Handles orchestration through Prefect for scheduled incremental loads
- Sends push notifications on load completion
- Provides visualization capabilities through Metabase (the full Metabase dashboard can be found in images/metabase_dashboard.pdf)
## Architecture

The project uses a modern data stack:

- Data Sources:
  - Relational Database (PostgreSQL)
  - CSV Files
- ETL Orchestration: Prefect
- Data Warehouse: PostgreSQL
- Visualization: Metabase
## Prerequisites

- Docker and Docker Compose
- Python 3.12+
- Git
## Installation

- Clone the repository:

  ```bash
  git clone https://github.com/inconspicuoususername/airline-bi
  cd airline-bi
  ```

- Create and activate a virtual environment:

  ```bash
  python -m venv .venv
  source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Start the services using Docker Compose:

  ```bash
  docker compose up -d
  ```

  This will start:
  - PostgreSQL database
  - Prefect server and worker
  - Metabase

  The services are then available at:
  - Prefect UI: http://localhost:4200
  - Metabase: http://localhost:3000
## Usage

- Start the Prefect server, worker, and Metabase frontend:

  ```bash
  docker compose up -d
  ```

- Synthesize the database and CSV files:

  ```bash
  python3 -m data.synthesize_reldb
  python3 -m data.csv
  ```

- Run a full load:

  ```bash
  python3 -m index
  ```

- Deploy the flow for incremental loads:

  ```bash
  python3 -m flows.initial
  ```

- Either wait for the flow to be scheduled, or run it directly from the Prefect Server dashboard.
## Project Structure

```
.
├── data/                 # Data generation and storage
├── database/             # Database models and connections
├── etl/                  # ETL implementation
│   ├── warehouse.py      # Data warehouse operations
│   ├── star_schema.py    # Star schema transformations
│   └── utils.py          # Utility functions
├── flows/                # Prefect workflow definitions
├── model/                # Data models
├── util/                 # Utility functions
├── compose.yaml          # Docker Compose configuration
└── requirements.txt      # Python dependencies
```
## Implementation Details

The project generates synthetic data for an airline operations system, including:
- Pilots: Pilot information, license numbers, and qualifications
- Cabin Crew: Staff details and employee IDs
- Customers: Passenger information, frequent flyer status
- Airports: Airport codes, names, locations
- Airplanes: Aircraft models, registration numbers, fuel consumption
- Flights: Flight schedules, routes, status
- Flight Bookings: Passenger bookings and seat assignments
- Airline Reviews: Customer feedback and ratings
- Relational Database:
  - Uses SQLAlchemy ORM for data extraction
  - Implements incremental loading with change data capture
  - Handles complex joins between related tables
  - Supports batch processing for large datasets
- CSV Files:
  - Uses pandas for efficient CSV reading
  - Handles various data formats and encodings
  - Supports incremental loading from CSV updates
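A minimal sketch of the CSV side, assuming an illustrative reviews file with an `updated_at` column (the real files are produced by the `data` package and may be shaped differently):

```python
import io

import pandas as pd

# Illustrative in-memory CSV standing in for a file on disk.
sample_csv = io.StringIO(
    "review_id,rating,updated_at\n"
    "1,4,2024-01-05\n"
    "2,2,2024-06-20\n"
)


def read_new_reviews(source, since: str) -> pd.DataFrame:
    """Read a reviews CSV and keep only rows updated after the watermark."""
    df = pd.read_csv(source, parse_dates=["updated_at"])
    return df[df["updated_at"] > pd.Timestamp(since)]


new_rows = read_new_reviews(sample_csv, since="2024-03-01")
print(new_rows["review_id"].tolist())  # only the review changed after the watermark
```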
- Data Warehouse Loading:
  - Implements Type 2 Slowly Changing Dimensions (SCD)
  - Handles surrogate keys for dimension tables
  - Manages historical data tracking
  - Supports incremental updates
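The Type 2 SCD logic can be sketched in plain Python — in-memory dicts stand in for the PostgreSQL dimension table, and the column names are illustrative. When a tracked attribute changes, the current version is expired by setting its `end_date`, and a new version with a fresh surrogate key is appended:

```python
from datetime import date

HIGH_DATE = date(9999, 12, 31)  # sentinel end_date for the current version


def apply_scd2(dim_rows: list[dict], natural_key, new_attrs: dict,
               load_date: date) -> list[dict]:
    """Type 2 SCD: expire the current version and append a new one."""
    next_sk = max((r["sk"] for r in dim_rows), default=0) + 1
    for row in dim_rows:
        if row["nk"] == natural_key and row["end_date"] == HIGH_DATE:
            if all(row.get(k) == v for k, v in new_attrs.items()):
                return dim_rows  # attributes unchanged: nothing to do
            row["end_date"] = load_date  # expire the current version
            break
    dim_rows.append({"sk": next_sk, "nk": natural_key,
                     "start_date": load_date, "end_date": HIGH_DATE,
                     **new_attrs})
    return dim_rows


# A customer upgrades from silver to gold: history is preserved.
dim = [{"sk": 1, "nk": "C42", "tier": "silver",
        "start_date": date(2023, 1, 1), "end_date": HIGH_DATE}]
apply_scd2(dim, "C42", {"tier": "gold"}, date(2024, 6, 1))
print([(r["sk"], r["tier"], r["end_date"]) for r in dim])
```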
- Data Quality:
  - Validates data integrity
  - Handles missing values
  - Standardizes data formats
  - Enforces business rules
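A few checks of this kind sketched with pandas — the column names and the rules themselves are hypothetical, not the project's actual rule set:

```python
import pandas as pd


def validate_bookings(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows violating basic integrity/business rules and standardize formats."""
    df = df.dropna(subset=["customer_id", "flight_id"]).copy()  # required keys
    df = df[df["fare"] >= 0].copy()                             # business rule
    df["seat"] = df["seat"].str.upper().str.strip()             # standardize
    return df


raw = pd.DataFrame({
    "customer_id": [1, None, 3],
    "flight_id": [10, 11, 12],
    "fare": [199.0, 50.0, -5.0],   # negative fare violates a business rule
    "seat": [" 12a", "3c", "7F "],
})
print(validate_bookings(raw))  # only the first row survives all checks
```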
- Warehouse Operations:
  - Uses PostgreSQL for the data warehouse
  - Implements efficient bulk loading
  - Supports transaction management
  - Handles concurrent operations
- Incremental Loading:
  - Tracks changes using end_date timestamps
  - Supports partial updates
  - Maintains data consistency
  - Optimizes performance
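In outline, watermark-based incremental loading works like this (an in-memory sketch with an illustrative `updated_at` column; the project itself tracks changes via `end_date` timestamps persisted in the warehouse):

```python
from datetime import datetime


class IncrementalLoader:
    """Minimal watermark bookkeeping for incremental loads."""

    def __init__(self):
        self.watermark = datetime.min   # timestamp of the last loaded change
        self.target: list[dict] = []    # stand-in for a warehouse table

    def load(self, source_rows: list[dict]) -> int:
        """Load only rows changed since the last run. The watermark is
        advanced only after the load succeeds, so a failed run can be
        retried without losing data."""
        fresh = [r for r in source_rows if r["updated_at"] > self.watermark]
        self.target.extend(fresh)  # stand-in for a bulk INSERT/UPSERT
        if fresh:
            self.watermark = max(r["updated_at"] for r in fresh)
        return len(fresh)


loader = IncrementalLoader()
rows = [{"id": 1, "updated_at": datetime(2024, 1, 1)},
        {"id": 2, "updated_at": datetime(2024, 2, 1)}]
print(loader.load(rows))  # 2 -- first run loads everything
print(loader.load(rows))  # 0 -- second run sees no new changes
```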
The project implements a comprehensive star schema for airline analytics:
- DimDate:
  - Date attributes for temporal analysis
  - Supports various date hierarchies
  - Includes business day indicators
- DimAirport:
  - Airport details
  - Geographic information
  - Operational metrics
- DimAirplane:
  - Aircraft specifications
  - Maintenance information
  - Performance metrics
- DimPilot:
  - Pilot qualifications
  - Experience metrics
  - Performance indicators
- DimCustomer:
  - Customer demographics
  - Loyalty program status
  - Travel preferences
- DimFlight:
  - Flight details
  - Route information
  - Schedule data
- FactFlight:
  - Flight performance metrics
  - Delay information
  - Operational statistics
  - Links to date, airport, and airplane dimensions
- FactBooking:
  - Booking statistics
  - Revenue metrics
  - Seat utilization
  - Links to customer and flight dimensions
- FactReview:
  - Customer satisfaction metrics
  - Service ratings
  - Feedback analysis
  - Links to customer, flight, and date dimensions
- Warehouse schema:
  - Uses SQLAlchemy for the ORM
  - Implements proper indexing for performance
  - Supports referential integrity
  - Handles complex relationships
- Orchestration:
  - Uses Prefect for workflow orchestration
  - Modular design for maintainability
  - Error handling and logging
  - Monitoring and alerting
- Validation:
  - Input validation
  - Data type checking
  - Business rule enforcement
  - Error reporting
- Performance:
  - Batch processing
  - Parallel execution
  - Efficient SQL queries
  - Proper indexing
## Data Flow

- Data is generated and stored in the relational database and CSV files
- Prefect orchestrates the ETL process
- Data is transformed and loaded into the warehouse
- Star schemas are created for analytical queries
- Metabase connects to the warehouse for visualization
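The flow above, reduced to a plain-Python sketch — in the project these steps are Prefect tasks inside a flow (see `flows/`); the row shapes and the cleaning rule here are purely illustrative:

```python
def extract(sources: dict) -> list[dict]:
    """Pull rows from all configured sources (operational DB, CSV files)."""
    return [row for rows in sources.values() for row in rows]


def transform(rows: list[dict]) -> list[dict]:
    """Standardize values before loading (here: clamp negative delays to 0)."""
    return [{**r, "delay_minutes": max(0, r["delay_minutes"])} for r in rows]


def load(rows: list[dict], warehouse: list) -> None:
    """Append the transformed rows to the warehouse stand-in."""
    warehouse.extend(rows)


warehouse: list[dict] = []
sources = {
    "reldb": [{"flight": "BA1", "delay_minutes": 12}],
    "csv": [{"flight": "BA2", "delay_minutes": -3}],  # bad value to clean
}
load(transform(extract(sources)), warehouse)
print(warehouse)
```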
## Visualization

Metabase is used for data visualization:

- Access Metabase at http://localhost:3000
- Set up a new database connection to the warehouse
- Create dashboards and visualizations using the star schemas