This project implements a robust data engineering pipeline to scrape product data from IndiaMART, process it through a Medallion Architecture (Bronze → Silver → Gold), and load curated results into AWS Redshift Serverless for visualization in AWS QuickSight. The pipeline is orchestrated using Apache Airflow for seamless scheduling and execution.
```bash
# Create an isolated Python 3.11 environment
pyenv install 3.11.9
pyenv virtualenv 3.11.9 datascrape-env
pyenv activate datascrape-env

# Install project dependencies
pip install -r requirements.txt

# Initialise the Airflow metadata database and start a local instance
airflow db init
airflow standalone

# Deploy the project DAGs into Airflow's DAG folder
cp dags/*.py ~/airflow/dags/
```
The DAGs are scheduled to run automatically at the following times:
- `indiamart_category_dag`: daily at 00:00
- `product_scrapping_dag`: daily at 00:05
- `medallion_dags`: daily at 00:10
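The staggered run times above map to standard cron expressions when wiring up the DAG schedules. A minimal pure-Python sketch (the mapping mirrors the list above; the stagger presumably gives each upstream DAG time to write its output before the next one starts):

```python
# Daily cron expressions matching the staggered schedule above.
SCHEDULES = {
    "indiamart_category_dag": "0 0 * * *",   # 00:00
    "product_scrapping_dag": "5 0 * * *",    # 00:05
    "medallion_dags": "10 0 * * *",          # 00:10
}

def daily_cron(hour: int, minute: int) -> str:
    """Build a run-once-a-day cron expression for a wall-clock time."""
    return f"{minute} {hour} * * *"
```

Each expression can be passed as the schedule argument of the corresponding Airflow DAG definition.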
- Purpose: Scrapes category links from IndiaMART using Scrapy.
- Output: Stores results in `categories.json`
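Scrapy drives the crawl itself; the parse callback can delegate link extraction and persistence to plain functions. A sketch of that shape (the href pattern and the `category_url` field name are illustrative assumptions, not the real site markup):

```python
import json
import re

# Assumed pattern for category links on a listing page; the real spider
# would use Scrapy selectors tuned to IndiaMART's actual markup.
CATEGORY_LINK = re.compile(r'<a[^>]+href="(https://dir\.indiamart\.com/[^"]+)"')

def extract_category_links(html: str) -> list[str]:
    """Return de-duplicated category URLs found in a page's HTML."""
    seen, links = set(), []
    for url in CATEGORY_LINK.findall(html):
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links

def dump_categories(links: list[str], path: str = "categories.json") -> None:
    """Persist scraped links in the categories.json shape used downstream."""
    with open(path, "w") as f:
        json.dump([{"category_url": u} for u in links], f, indent=2)
```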
- Purpose: Scrapes product details from each category link using Scrapy.
- Output: Stores results in `products.json`
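The product spider is seeded from the output of the category DAG. A small sketch of reading `categories.json` to build the spider's start URLs (the `category_url` field name is an assumption about the file's layout):

```python
import json

# Sketch: turn the upstream categories.json into Scrapy start_urls.
def load_start_urls(path: str = "categories.json") -> list[str]:
    """Read category links scraped by the category DAG."""
    with open(path) as f:
        return [row["category_url"] for row in json.load(f)]
```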
Purpose: Processes data through the Medallion Architecture using Apache Spark 3.4.3.
Tasks:
- `bronze_task`: Ingests raw data.
- `silver_task`: Cleans and transforms data.
- `gold_task`: Aggregates and curates data for analytics.

Flow: `bronze_task >> silver_task >> gold_task`
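The three layers have this general shape. For brevity the sketch below uses plain Python over lists of dicts rather than Spark DataFrames, and every field name (`price`, `category`) is an assumption about the scraped schema:

```python
# Plain-Python sketch of the Medallion layers; the real tasks run on
# Apache Spark 3.4.3, and all field names here are assumptions.
def bronze(raw_records):
    """Bronze: ingest raw scraped rows as-is, tagging their source."""
    return [dict(r, _source="products.json") for r in raw_records]

def silver(bronze_rows):
    """Silver: drop rows missing a price and normalise it to float."""
    out = []
    for r in bronze_rows:
        price = r.get("price")
        if price in (None, ""):
            continue
        out.append(dict(r, price=float(str(price).replace(",", ""))))
    return out

def gold(silver_rows):
    """Gold: average price per category, ready for Redshift/QuickSight."""
    totals = {}
    for r in silver_rows:
        t = totals.setdefault(r.get("category", "unknown"), [0.0, 0])
        t[0] += r["price"]
        t[1] += 1
    return {cat: s / n for cat, (s, n) in totals.items()}
```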
Storage: Gold layer data is loaded into AWS Redshift Serverless for scalable querying.
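Loading into Redshift Serverless is typically done with a `COPY` from S3. A helper that builds such a statement (table, S3 path, and IAM role values are placeholder assumptions for your deployment):

```python
# Sketch: build the Redshift COPY statement that loads Gold-layer files
# from S3. All argument values are deployment-specific placeholders.
def build_copy_sql(table: str, s3_path: str, iam_role: str) -> str:
    """Return a COPY statement for Parquet files staged in S3."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "FORMAT AS PARQUET;"
    )
```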
Visualization: Dashboards are built in AWS QuickSight to provide insights into the curated Gold-layer data.