feat: add AST code ingestion pipeline components and schemas #165

Open
himanshu1573 wants to merge 1 commit into kubeflow:main from himanshu1573:feature/code-ingestion-pipeline

@himanshu1573

Resolves #120

Description

This PR introduces the foundational AST Code Ingestion pipeline and the underlying Milvus Vector schemas designed for the Agentic RAG architecture. This implements the design proposed in #120 to enable precise, semantic code search across the Kubeflow repositories.

Key Additions:

  1. Production-Ready Milvus Schemas (backend/schemas/)
    • Introduces code_collection_schema.py and docs_collection_schema.py.
    • Uses chunk_id (a hash of the source URL/path + chunk index) as the primary key. This supports idempotent upserts and avoids the "entire collection drop" problem (issue #10, "Bug: Pipeline drops entire Milvus collection on every run") when updating indexes.
  2. Modular AST Pipeline Components (pipelines/code_ingestion/)
    • repo_cloner.py: Safely clones target repositories into the pipeline volume.
    • ast_parser.py: Parses Python source files into Abstract Syntax Trees, extracting function signatures, docstrings, and class definitions to preserve code context.
    • chunker.py & embedder.py: Slice AST nodes into token-aware windows and generate 384-dimensional embeddings using sentence-transformers/all-MiniLM-L6-v2.
    • loader.py: Connects to Milvus and performs robust upserts using the chunk_id primary key.
  3. Shared Pipeline Utilities (pipelines/shared/)
    • Connection pooling and embedding utilities to prevent per-component model reloading.
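
For reviewers, the chunk_id scheme and the AST extraction step can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code in ast_parser.py. The helper names make_chunk_id and extract_definitions, the record fields, the "path#index" key format, and the choice of SHA-256 are all assumptions made for the example.

```python
import ast
import hashlib


def make_chunk_id(source_path: str, chunk_index: int) -> str:
    """Return a deterministic ID for a chunk.

    Assumed key format: "<path>#<index>", hashed with SHA-256. The same
    path and index always produce the same ID, which is what makes
    upserts by primary key possible.
    """
    raw = f"{source_path}#{chunk_index}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def extract_definitions(source: str, source_path: str) -> list[dict]:
    """Collect functions and classes from Python source as flat records."""
    tree = ast.parse(source)
    wanted = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
    # ast.walk makes no ordering promise, so sort by source position to
    # keep chunk indexes stable between runs.
    nodes = sorted(
        (node for node in ast.walk(tree) if isinstance(node, wanted)),
        key=lambda node: (node.lineno, node.col_offset),
    )
    records = []
    for index, node in enumerate(nodes):
        if isinstance(node, ast.ClassDef):
            kind = "class"
            signature = f"class {node.name}"
        else:
            # Simplified signature: positional parameter names only.
            params = ", ".join(arg.arg for arg in node.args.args)
            kind = "function"
            signature = f"def {node.name}({params})"
        records.append(
            {
                "chunk_id": make_chunk_id(source_path, index),
                "kind": kind,
                "name": node.name,
                "signature": signature,
                "docstring": ast.get_docstring(node) or "",
                "start_line": node.lineno,
            }
        )
    return records
```

Because the ID depends only on the path and the chunk position, parsing an unchanged file twice yields identical keys, so a second load overwrites rows instead of duplicating them.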

Testing Done:

  • Successfully compiled full_pipeline.yaml using KFP v2 SDK.
  • Validated Milvus connection and upsert behavior against a local standalone Milvus instance, confirming that updates to existing files do not result in dropped collections.
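
The upsert property checked here can be illustrated without a Milvus server. The sketch below uses a hypothetical in-memory stand-in (InMemoryCollection is not a pymilvus class, and the row fields are invented for the example) to show the behavior under test: re-ingesting a changed file overwrites rows with matching chunk_id values and leaves every other row in place.

```python
class InMemoryCollection:
    """Hypothetical stand-in for a collection with chunk_id as primary key."""

    def __init__(self) -> None:
        self.rows: dict[str, dict] = {}

    def upsert(self, records: list[dict]) -> None:
        # Insert new keys, overwrite existing ones; nothing else is touched.
        for record in records:
            self.rows[record["chunk_id"]] = record


collection = InMemoryCollection()

# First pipeline run: chunks from two different files.
collection.upsert(
    [
        {"chunk_id": "id-file-a-0", "text": "def run(): ...  # v1"},
        {"chunk_id": "id-file-b-0", "text": "class Loader: ..."},
    ]
)

# Second run: only file A changed, so only its chunk is re-sent.
collection.upsert(
    [{"chunk_id": "id-file-a-0", "text": "def run(): ...  # v2"}]
)

row_count = len(collection.rows)
updated_text = collection.rows["id-file-a-0"]["text"]
untouched_text = collection.rows["id-file-b-0"]["text"]
```

After the second run the collection still holds two rows: the file A chunk carries the new text and the file B chunk is unchanged, which is the behavior a drop-and-recreate load cannot provide.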

Next Steps:

Following this merge, I will open follow-up PRs to bring in the agent/ routing core and the search_code_tool MCP server that consumes these collections.

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign franciscojavierarceo for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: Himanshu Prajapati <himanshuprajapati15072003@gmail.com>
@himanshu1573 force-pushed the feature/code-ingestion-pipeline branch from 5d52bea to a13ae3d on March 22, 2026 13:47

Successfully merging this pull request may close these issues.

[DESIGN PROPOSAL] : AST Code Ingestion Pipeline for Kubeflow Repositories
