🚲 Bike-PIM Agentic Scraper

Automating structured product intelligence for circular retail


🌍 Project Purpose

This project is a Proof of Concept (POC) supporting Decathlon’s Circular PIM (Product Information Management) initiative.

Its goal is to automatically transform unstructured cycling product listings from global marketplaces into validated, PIM-ready data, enabling repair, rental, and second-hand services at scale.

The system acts as a core intelligence layer for sustainable, circular business models.


🏗️ Architecture Overview

1. Data Ingestion & Scraping

  • Web scraping using Requests and BeautifulSoup
  • Text normalization to prepare raw listings for LLM processing
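The ingestion step can be pictured with a minimal sketch like the one below. It is an illustration only, not the repository's actual code: the function names `normalize_text` and `fetch_listing_text`, the choice of NFKC normalization, and the decision to drop `script`/`style` tags are all assumptions.

```python
import re
import unicodedata


def normalize_text(raw: str) -> str:
    """Fold Unicode variants (e.g. non-breaking spaces) and collapse whitespace."""
    text = unicodedata.normalize("NFKC", raw)
    return re.sub(r"\s+", " ", text).strip()


def fetch_listing_text(url: str) -> str:
    """Download a listing page and return its visible text, normalized.

    Hypothetical helper: the third-party imports are kept local so that
    normalize_text stays usable without network dependencies installed.
    """
    import requests
    from bs4 import BeautifulSoup

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail early on 4xx/5xx responses

    soup = BeautifulSoup(response.text, "html.parser")
    for tag in soup(["script", "style"]):  # remove non-visible content
        tag.decompose()
    return normalize_text(soup.get_text(separator=" "))
```

Keeping normalization in its own pure function makes it easy to unit-test without touching the network.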

2. Agentic Intelligence Layer

  • Local inference with Llama 3.2 via Ollama (GDPR-safe, offline)
  • Zero-shot extraction of product attributes:
    • Brand
    • Model
    • Condition
    • Price
    • Material
  • Strict JSON schema enforcement for deterministic, enterprise-ready output
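As a rough sketch of how zero-shot extraction against a local Ollama server might look, the example below uses Ollama's `/api/generate` endpoint with `format` set to `json`. The prompt wording, the helper names (`build_payload`, `parse_attributes`, `extract_attributes`), the lowercase key names, and the `temperature` setting are assumptions for illustration and may differ from what the project does.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
FIELDS = ("brand", "model", "condition", "price", "material")


def build_payload(listing_text: str, model: str = "llama3.2") -> dict:
    """Build a non-streaming request that asks the model for a JSON object."""
    prompt = (
        "Extract the following attributes from the bike listing below and "
        "answer with a single JSON object whose keys are exactly: "
        + ", ".join(FIELDS)
        + ". Use null for anything the listing does not state.\n\n"
        + "Listing:\n"
        + listing_text
    )
    return {
        "model": model,
        "prompt": prompt,
        "format": "json",  # ask Ollama to constrain the output to valid JSON
        "stream": False,
        "options": {"temperature": 0},  # assumed setting to reduce variation
    }


def parse_attributes(raw_response: str) -> dict:
    """Keep only the expected keys; missing ones become None."""
    data = json.loads(raw_response)
    if not isinstance(data, dict):
        raise ValueError("model did not return a JSON object")
    return {field: data.get(field) for field in FIELDS}


def extract_attributes(listing_text: str) -> dict:
    """Send the listing to the local model and return the parsed attributes."""
    body = json.dumps(build_payload(listing_text)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=120) as reply:
        envelope = json.load(reply)
    return parse_attributes(envelope["response"])  # generated text lives here
```

Note that `format: "json"` only guarantees syntactically valid JSON, so the extracted fields still need to pass through the validation layer.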

3. Data Integrity & Validation

  • Pydantic models enforcing types and business rules (e.g. circular pricing logic)
  • Automated Pytest suite to guarantee data quality and pipeline stability

🛠️ Tech Stack

  • Language: Python 3.11+
  • GenAI: Llama 3.2, Ollama, Prompt Engineering, Agentic Workflows
  • Web Scraping: Requests, BeautifulSoup
  • Data & Validation: Pydantic, SQL, Data Modeling
  • Quality & DevOps: Git, Pytest, GitHub Actions

🚀 Getting Started

Prerequisites

  • Python 3.11+
  • Ollama installed locally

Installation

git clone https://github.com/codes-by-sethu/bike-pim-scraper.git
cd bike-pim-scraper
pip install -r requirements.txt

Download LLM Weights

ollama pull llama3.2

Run the Pipeline

python main.py

🌱 Responsible & Sustainable AI

This project runs entirely on local LLM infrastructure, minimizing:

  • Data leakage risk
  • Cloud dependency
  • Energy consumption from large-scale API calls

It is built with the belief that technology should accelerate circularity while protecting our global playing field.


📌 Status

  • ✔ Modular, object-oriented architecture
  • ✔ Deterministic JSON output for PIM ingestion
  • ✔ Fully local, GDPR-compliant inference
  • ✔ Production-ready validation layer

Author: Sethulakshmi K B
Focus: Circular Retail · Agentic AI · Product Intelligence
