Irfanalidv

Organizations

@brainsfeed @re-sources-io

irfanalidv/README.md

👋 Hi, I'm Irfan

Data Scientist • AI Systems Architect • ML, NLP, LLMs & Scalable Data Engineering

I build production-grade AI systems that transform messy, multi-source data into real-time intelligence, automation, and decision-ready insights.

My work blends ML, NLP, LLMs, large-scale scraping, enrichment architectures, and agentic automation to design AI products that actually ship, not just experiments.

I specialize in:

  • πŸ•ΈοΈ Universal scraping systems across 50+ dynamic, anti-bot-protected websites
  • 🧩 Multi-source data enrichment engines integrating 20+ APIs with waterfall logic
  • 🧠 LLM-powered extraction & automation workflows
  • βš™οΈ High-uptime data pipelines & research intelligence platforms
  • πŸ“‘ Production APIs & real-time enrichment services

🚀 What I Do Best

  • AI Systems Architecture
    End-to-end workflows, LLM automation, intelligent extraction, decision systems

  • Universal Web Scraping
    Playwright/Selenium-based scrapers for dynamic JS, forms, pagination, bot protection

  • Multi-Source Data Enrichment
    Apollo, PDL, ContactOut, SimilarWeb, Enrich.so, custom API routing & fallback pipelines

  • LLM-Driven Data Workflows
    Classification, entity extraction, topic mapping, lead intelligence generation

  • Scalable Data Engineering
    FastAPI services, data validation, auto-retry systems, queue-based workflows

  • Product & Team Leadership
    Led global teams across India, Hong Kong, France & the US


🧠 Recent Role

Principal Data Scientist – AI & Scalable Data Engineering @ KurationAI

(Hong Kong, Remote)

At KurationAI, I built the foundational intelligence layer powering:

  • 🌐 A universal web scraper deployed across 50+ global sources
  • 🔗 A 20+ API enrichment engine with waterfall failovers, retries & key rotation
  • 🧠 LLM-based classification & extraction pipelines
  • 🔎 Similarity-search-driven lead intelligence datasets
  • ⚡ Production-grade FastAPI services for real-time enrichment

Stack: Python, Playwright, Selenium, FastAPI, LangChain, MongoDB, RSS aggregation, GPT/Claude/Perplexity APIs
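
The waterfall pattern mentioned above (ordered providers, retries, key rotation) can be sketched roughly as follows. This is a minimal illustration, not the KurationAI implementation: `KeyRing`, `enrich`, the provider call signature, and the field names are all hypothetical, and real provider clients would replace the plain callables.

```python
import itertools
from typing import Callable, Iterable, List, Tuple

# Hypothetical provider signature: (query, api_key) -> dict of fields.
# A provider is assumed to raise an exception when a call fails.
Provider = Callable[[str, str], dict]


class KeyRing:
    """Rotates through a provider's API keys, one key per call."""

    def __init__(self, keys: Iterable[str]):
        keys = list(keys)
        if not keys:
            raise ValueError("at least one API key is required")
        self._cycle = itertools.cycle(keys)

    def next_key(self) -> str:
        return next(self._cycle)


def enrich(
    query: str,
    providers: List[Tuple[str, Provider, KeyRing]],
    required: List[str],
    max_retries: int = 2,
) -> dict:
    """Fill the required fields by trying providers in priority order."""
    record: dict = {}
    for _name, call, keys in providers:
        missing = [field for field in required if field not in record]
        if not missing:
            break  # every field is filled, so skip the remaining providers
        for _attempt in range(max_retries + 1):
            try:
                result = call(query, keys.next_key())
            except Exception:
                continue  # retry the same provider with the next key
            for field in missing:
                if result.get(field) is not None:
                    record[field] = result[field]
            break  # this provider answered; fall through to the next one
    return record
```

Earlier providers take precedence here: a later provider only fills fields that are still missing, so the list order encodes the trust ranking between sources.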


🏢 Past Experience

Head of Data & Analytics @ Luminous Power Technologies

Built org-wide data strategy, BI platform, ML operations, and scalable pipelines.

Data Analytics & Automation @ Lynk

Optimized expert-matching using NLP, automation, search & scalable data workflows.

Head of Data & Analytics @ Brainsfeed

Built Infosphere, an NLP-powered research engine using 15+ extracted attributes.

Data Scientist @ RightCust Technologies

Customer segmentation, forecasting, sentiment analysis.

Developer Evangelist @ DevMetric

Built a university developer community; delivered technical workshops.

Data Visualization Developer @ Datavis Tech (SF)

Interactive visualization systems using D3.js, Node.js, MongoDB.


🎓 Education

  • M.Sc. Data Science & AI, IISER Tirupati (2025–2026)
  • International Exchange, ISEP Paris (Data Science & Big Data Analytics)
  • B.Tech Computer Science Engineering, Alliance University

🧰 Tech Stack

| Area | Tools |
|---|---|
| Languages | Python, R, SQL |
| ML/AI | LangChain, LangSmith, scikit-learn, LLM APIs |
| Scraping | Playwright, Selenium, Scrapy, PhantomBuster |
| APIs / Enrichment | Apollo, ContactOut, PDL, SimilarWeb, RSS |
| Cloud / DevOps | Azure, GCP, Docker, Azure DevOps |
| Data Engineering | FastAPI, REST APIs, MongoDB, PostgreSQL |
| Low/No-Code | Bubble.io, Airtable, Make.com, Zapier |
| Visualization | RStudio, Jupyter, Klipfolio, Power BI |

πŸ† Highlights

  • πŸ… Winners β€” Philips Digital Healthcare Conclave 2015
  • 🧠 Built intelligence platforms integrating 100+ data sources
  • πŸ“ Research in neural-symbolic topic evolution & text analytics
  • πŸ₯‡ Multiple Best Speaker awards

🔬 Featured Projects

  • ⚡ Universal AI Web Scraper: Dynamic JS, anti-bot, forms, pagination
  • 🔗 Multi-Source Enrichment Engine: 20+ APIs with smart fallback
  • 🔍 Infosphere: NLP-driven research engine with Algolia
  • 🚫 LLM-Powered Toxic Comment Classifier
  • 🤖 Automated Lead Intelligence Platform

➡️ Check pinned repositories for demos & code.


📊 Development Activity


🤝 Let's Connect


I'm always open to collaborating on AI systems, enrichment engines, LLM automation, scalable pipelines, or research intelligence tooling.
Let's build something impactful.

Pinned

  1. ragfallback (Public)

    A production-ready Python library that adds intelligent fallback mechanisms to RAG (Retrieval-Augmented Generation) systems, preventing silent failures and improving answer quality.

    Python

  2. AgentEnsemble (Public)

    AgentEnsemble is a simple, practical Python library for building and orchestrating AI agents. Perfect for real-world tasks like web search, research, document Q&A, and multi-agent collaboration.

    Python

  3. lingo-nlp-toolkit (Public)

    Advanced NLP Toolkit: Lightweight, Fast, and Transformer-Ready

    Python

  4. AskPandas (Public)

    AI-powered data engineering and analytics assistant for querying CSV data using natural language, locally and intelligently

    Python

  5. PyroChain (Public)

    PyroChain combines PyTorch's deep learning capabilities with LangChain's agentic AI to automate feature extraction from complex, multimodal data. AI agents collaborate to understand, process, and e…

    Python

  6. GoogleSearchR (Public)

    GoogleSearchR is an R package that provides functions to query Google and extract information from search results.

    R