This project implements a robust data engineering pipeline to scrape product data from IndiaMART, process it through a Medallion Architecture (Bronze → Silver → Gold), and load curated results into AWS Redshift Serverless for visualization in AWS QuickSight. The pipeline is orchestrated using Apache Airflow for seamless scheduling and execution.
```bash
# Create an isolated Python 3.11 environment
pyenv install 3.11.9
pyenv virtualenv 3.11.9 datascrape-env
pyenv activate datascrape-env

# Install project dependencies
pip install -r requirements.txt

# Initialise the Airflow metadata database and start a local instance
airflow db init
airflow standalone

# Deploy the project DAGs into Airflow's DAG folder
cp dags/*.py ~/airflow/dags/
```
The DAGs are scheduled to run automatically at the following times:
- `indiamart_category_dag`: daily at 00:00
- `product_scrapping_dag`: daily at 00:05
- `medallion_dags`: daily at 00:10
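The staggered run times above map to standard cron expressions when wiring up the DAG schedules. A minimal pure-Python sketch (the mapping mirrors the list above; the stagger presumably gives each upstream DAG time to write its output before the next one starts):

```python
# Daily cron expressions matching the staggered schedule above.
SCHEDULES = {
    "indiamart_category_dag": "0 0 * * *",   # 00:00
    "product_scrapping_dag": "5 0 * * *",    # 00:05
    "medallion_dags": "10 0 * * *",          # 00:10
}

def daily_cron(hour: int, minute: int) -> str:
    """Build a run-once-a-day cron expression for a wall-clock time."""
    return f"{minute} {hour} * * *"
```

Each expression can be passed as the schedule argument of the corresponding Airflow DAG definition.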
- Purpose: Scrapes category links from IndiaMART using Scrapy.
- Output: Stores results in `categories.json`
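Scrapy drives the crawl itself; the parse callback can delegate link extraction and persistence to plain functions. A sketch of that shape (the href pattern and the `category_url` field name are illustrative assumptions, not the real site markup):

```python
import json
import re

# Assumed pattern for category links on a listing page; the real spider
# would use Scrapy selectors tuned to IndiaMART's actual markup.
CATEGORY_LINK = re.compile(r'<a[^>]+href="(https://dir\.indiamart\.com/[^"]+)"')

def extract_category_links(html: str) -> list[str]:
    """Return de-duplicated category URLs found in a page's HTML."""
    seen, links = set(), []
    for url in CATEGORY_LINK.findall(html):
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links

def dump_categories(links: list[str], path: str = "categories.json") -> None:
    """Persist scraped links in the categories.json shape used downstream."""
    with open(path, "w") as f:
        json.dump([{"category_url": u} for u in links], f, indent=2)
```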
- Purpose: Scrapes product details from each category link using Scrapy.
- Output: Stores results in `products.json`
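The product spider is seeded from the output of the category DAG. A small sketch of reading `categories.json` to build the spider's start URLs (the `category_url` field name is an assumption about the file's layout):

```python
import json

# Sketch: turn the upstream categories.json into Scrapy start_urls.
def load_start_urls(path: str = "categories.json") -> list[str]:
    """Read category links scraped by the category DAG."""
    with open(path) as f:
        return [row["category_url"] for row in json.load(f)]
```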
Purpose: Processes data through the Medallion Architecture using Apache Spark 3.4.3.
Tasks:
- `bronze_task`: Ingests raw data.
- `silver_task`: Cleans and transforms data.
- `gold_task`: Aggregates and curates data for analytics.

Flow: `bronze_task >> silver_task >> gold_task`
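The three layers have this general shape. For brevity the sketch below uses plain Python over lists of dicts rather than Spark DataFrames, and every field name (`price`, `category`) is an assumption about the scraped schema:

```python
# Plain-Python sketch of the Medallion layers; the real tasks run on
# Apache Spark 3.4.3, and all field names here are assumptions.
def bronze(raw_records):
    """Bronze: ingest raw scraped rows as-is, tagging their source."""
    return [dict(r, _source="products.json") for r in raw_records]

def silver(bronze_rows):
    """Silver: drop rows missing a price and normalise it to float."""
    out = []
    for r in bronze_rows:
        price = r.get("price")
        if price in (None, ""):
            continue
        out.append(dict(r, price=float(str(price).replace(",", ""))))
    return out

def gold(silver_rows):
    """Gold: average price per category, ready for Redshift/QuickSight."""
    totals = {}
    for r in silver_rows:
        t = totals.setdefault(r.get("category", "unknown"), [0.0, 0])
        t[0] += r["price"]
        t[1] += 1
    return {cat: s / n for cat, (s, n) in totals.items()}
```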
Storage: Gold layer data is loaded into AWS Redshift Serverless for scalable querying.
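Loading into Redshift Serverless is typically done with a `COPY` from S3. A helper that builds such a statement (table, S3 path, and IAM role values are placeholder assumptions for your deployment):

```python
# Sketch: build the Redshift COPY statement that loads Gold-layer files
# from S3. All argument values are deployment-specific placeholders.
def build_copy_sql(table: str, s3_path: str, iam_role: str) -> str:
    """Return a COPY statement for Parquet files staged in S3."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "FORMAT AS PARQUET;"
    )
```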
Visualization: Dashboards are built in AWS QuickSight to provide insights into the curated Gold-layer data.