A beginner-friendly ETL project that extracts GitHub repository data, transforms it with Pandas, and loads it into SQLite.
This pipeline follows a simple ETL flow:
- Extract repository data from the GitHub Search API
- Show the raw data in the terminal
- Transform and clean the dataset
- Load both raw and transformed data into SQLite
- Record each run in an audit table
GitHub API -> RAW DATA -> Transform -> TRANSFORMED DATA -> SQLite
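
The extract step can be sketched roughly as follows. This is a minimal illustration, not the code in etl_pipeline.py: the function names `flatten_item` and `fetch_repos`, the default query, and the output column names are assumptions, while the endpoint and the response field names (`stargazers_count`, `forks_count`, `open_issues_count`, `watchers_count`) come from the public GitHub Search API.

```python
import requests

# Public GitHub Search API endpoint for repositories.
SEARCH_URL = "https://api.github.com/search/repositories"


def flatten_item(item):
    """Keep only the fields of one search result that the pipeline uses."""
    owner = item.get("owner") or {}
    return {
        "id": item.get("id"),
        "repo": item.get("name"),
        "owner": owner.get("login"),
        "language": item.get("language"),
        "stars": item.get("stargazers_count"),
        "forks": item.get("forks_count"),
        "issues": item.get("open_issues_count"),
        "watchers": item.get("watchers_count"),
    }


def fetch_repos(query="language:python", per_page=10, timeout=10):
    """Call the search endpoint and return a list of flat row dicts."""
    response = requests.get(
        SEARCH_URL,
        params={"q": query, "sort": "stars", "order": "desc", "per_page": per_page},
        headers={"Accept": "application/vnd.github+json"},
        timeout=timeout,
    )
    response.raise_for_status()  # surface rate limits and other HTTP errors
    return [flatten_item(item) for item in response.json().get("items", [])]
```

Keeping the flattening separate from the HTTP call makes it possible to check the row shape against a saved response without touching the network.
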
The script in etl_pipeline.py performs these transformations:
- fills missing values for repo, owner, language, and numeric metrics
- converts numeric fields like stars, forks, issues, and watchers
- normalizes text fields such as `owner_lower` and `language_upper`
- creates derived columns like `popularity` and `activity`
- removes duplicate rows based on repository id
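
The transformations listed above could look something like this in Pandas. It is a sketch under stated assumptions: the input column names, the default values, and especially the formulas for `popularity` and `activity` are hypothetical, since the actual definitions live in etl_pipeline.py and config.py.

```python
import pandas as pd

# Assumed defaults; the real ones are configured in config.py.
TEXT_DEFAULTS = {"repo": "unknown", "owner": "unknown", "language": "unknown"}
NUMERIC_COLS = ["stars", "forks", "issues", "watchers"]


def transform(raw: pd.DataFrame) -> pd.DataFrame:
    """Clean a raw repository DataFrame and add derived columns."""
    df = raw.copy()

    # Fill missing text values with their defaults.
    for col, default in TEXT_DEFAULTS.items():
        df[col] = df[col].fillna(default)

    # Coerce metrics to integers; missing or unparseable values become 0.
    for col in NUMERIC_COLS:
        df[col] = pd.to_numeric(df[col], errors="coerce").fillna(0).astype(int)

    # Normalized text columns.
    df["owner_lower"] = df["owner"].str.lower()
    df["language_upper"] = df["language"].str.upper()

    # Derived columns (illustrative formulas, not the project's own).
    df["popularity"] = df["stars"] + df["forks"]
    df["activity"] = df["issues"] + df["watchers"]

    # Keep the first row seen for each repository id.
    return df.drop_duplicates(subset="id").reset_index(drop=True)
```
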
| Table | Purpose |
|---|---|
| `raw_table` | Stores extracted GitHub repository data |
| `staging_table` | Stores cleaned and transformed data |
| `etl_job_audit` | Stores ETL run metadata such as run ID, row counts, and status |
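
As an illustration of the audit table, the sketch below records one run using only the standard library `sqlite3` module. The column names, the `record_run` helper, and the `"SUCCESS"` status value are assumptions made for this example; the real schema is defined in etl_pipeline.py.

```python
import sqlite3
import uuid
from datetime import datetime, timezone


def record_run(conn, raw_rows, staged_rows, status):
    """Insert one audit row describing a pipeline run and return its run ID."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS etl_job_audit (
            run_id      TEXT PRIMARY KEY,
            finished_at TEXT,
            raw_rows    INTEGER,
            staged_rows INTEGER,
            status      TEXT
        )
        """
    )
    run_id = uuid.uuid4().hex  # hypothetical run ID format
    conn.execute(
        "INSERT INTO etl_job_audit VALUES (?, ?, ?, ?, ?)",
        (run_id, datetime.now(timezone.utc).isoformat(), raw_rows, staged_rows, status),
    )
    conn.commit()
    return run_id


# In-memory database so the example leaves no file behind.
conn = sqlite3.connect(":memory:")
run_id = record_run(conn, raw_rows=10, staged_rows=9, status="SUCCESS")
print(conn.execute("SELECT raw_rows, staged_rows, status FROM etl_job_audit").fetchall())
```

Writing the audit row after the load finishes means a failed run can be recorded with a different status and its partial row counts.
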

    ETL_pipeline/
    |-- config.py
    |-- etl_pipeline.py
    |-- requirements.txt
    |-- README.md
    |-- warehouse.db
    `-- venv/

- Python 3
- Pandas
- Requests
- Internet access for the GitHub API call
Open a terminal in your project folder first. For example:
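
    cd /path/to/ETL_pipeline

If the virtual environment is already set up, run the pipeline directly:

    ./venv/bin/python etl_pipeline.py

To set up from scratch, create and activate a virtual environment, install the dependencies, and then run the script:

    python3 -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt
    python etl_pipeline.py

When the pipeline runs successfully, you will see stages like:

- `[EXTRACT] Fetching repositories from GitHub API...`
- `RAW GITHUB DATA`
- `[TRANSFORM] Cleaning and transforming data...`
- `TRANSFORMED GITHUB DATA`
- `[LOAD] Loading data into SQLite...`
- `ETL SUMMARY`
- `Pipeline completed successfully.`
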
You can inspect the SQLite tables with:

    sqlite3 warehouse.db

Then run:

    SELECT COUNT(*) FROM raw_table;
    SELECT COUNT(*) FROM staging_table;
    SELECT * FROM etl_job_audit ORDER BY rowid DESC LIMIT 5;

You can change the pipeline settings in config.py, including:
- API URL
- request timeout
- SQLite database path
- table names
- default values used during transformation
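
For orientation, a config.py covering the settings above might look like the following. Every variable name and value here is a guess for illustration (apart from the table names and database file mentioned elsewhere in this README), so check the actual file before relying on any of them.

```python
# Hypothetical config.py layout; the variable names are assumptions.

# API URL
API_URL = "https://api.github.com/search/repositories"

# Request timeout, in seconds
REQUEST_TIMEOUT = 10

# SQLite database path
DB_PATH = "warehouse.db"

# Table names
RAW_TABLE = "raw_table"
STAGING_TABLE = "staging_table"
AUDIT_TABLE = "etl_job_audit"

# Default values used during transformation
DEFAULT_TEXT = "unknown"
DEFAULT_NUMBER = 0
```
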