Feat: Implement Code ingestion pipeline#162

Open
haroon0x wants to merge 4 commits into kubeflow:main from haroon0x:feat/code-ingestion-pipeline

@haroon0x haroon0x commented Mar 22, 2026

Summary

Fixes #120
This PR introduces the initial implementation of the code ingestion pipeline for the Kubeflow RAG system. The pipeline processes code repositories, starting with the kubeflow/manifests repository, to give the agentic retrieval system structured context about code and configuration.

Implementation Overview

Modular Directory Structure

The ingestion logic is now organized into two modules under the pipelines directory, separating documentation processing from code processing and following the linked repository structure:

  • pipelines/docs_ingestion/: Manages documentation scraping and indexing.
  • pipelines/code_ingestion/: Manages codebase parsing and structural indexing.

Component Architecture

The code ingestion pipeline consists of four primary stages:

  1. Repository Ingestion: Clones the target repository into an isolated temporary scratch directory.
  2. Structure-Aware Parsing: Uses specialized parsers in parsers.py to extract metadata from Kubernetes manifests, Kustomize configurations, Dockerfiles, and shell scripts.
  3. Vector Embedding: Generates 768-dimensional embeddings with the all-mpnet-base-v2 model.
  4. Milvus Storage: Indexes the generated vectors and their associated metadata into the docs_rag collection.
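
To illustrate the structure-aware parsing stage, here is a minimal sketch of how a file might be routed to a parser and how `kind` and `metadata.name` might be pulled from a manifest. It is not the code in parsers.py: the function names, the category labels, and the line-based extraction (used here to stay dependency-free instead of a real YAML parser) are all assumptions made for illustration.

```python
from pathlib import PurePosixPath


def classify_file(path: str) -> str:
    """Pick a coarse file category (hypothetical labels) from the file name."""
    name = PurePosixPath(path).name.lower()
    if name in ("kustomization.yaml", "kustomization.yml"):
        return "kustomize"
    if name == "dockerfile" or name.startswith("dockerfile."):
        return "dockerfile"
    if name.endswith(".sh"):
        return "shell"
    if name.endswith((".yaml", ".yml")):
        return "k8s_manifest"
    return "other"


def extract_manifest_metadata(text: str) -> dict:
    """Extract top-level `kind` and `metadata.name` from the first YAML document.

    Assumption: block-style YAML indented with spaces. A real parser would
    handle flow style, anchors, and multi-document files properly.
    """
    kind = None
    name = None
    in_metadata = False
    meta_indent = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        if stripped == "---" and (kind or name):
            break  # stop at the end of the first document
        indent = len(line) - len(line.lstrip(" "))
        if indent == 0:
            # A new top-level key closes any open `metadata:` block.
            in_metadata = stripped == "metadata:"
            meta_indent = None
            if stripped.startswith("kind:") and kind is None:
                kind = stripped.split(":", 1)[1].strip().strip("'\"")
            continue
        if in_metadata:
            if meta_indent is None:
                meta_indent = indent  # indentation of direct children
            if indent == meta_indent and stripped.startswith("name:") and name is None:
                name = stripped.split(":", 1)[1].strip().strip("'\"")
    return {"kind": kind, "name": name}
```

Matching `name:` only at the indentation of the direct children of `metadata:` keeps a nested label such as `labels.name` from being mistaken for the resource name.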

Security and Robustness

  • Process Isolation: Subprocesses run with explicit timeouts and restricted security boundaries, so a hung or failing command cannot stall the pipeline.
  • Automated Resource Management: All file operations use ephemeral directories created via tempfile.mkdtemp(), which are accessible only to the creating user.
  • Ingestion Stability: The storage component ensures that Milvus collections are initialized and indexed before the retrieval layer is activated.
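
The first two points can be sketched as follows. This is a sketch under stated assumptions rather than the PR's implementation: `scratch_dir`, `run_bounded`, and `clone_command` are hypothetical helper names, and the shallow-clone flags are one plausible choice that the PR description does not confirm.

```python
import contextlib
import shutil
import subprocess
import tempfile
from pathlib import Path


@contextlib.contextmanager
def scratch_dir(prefix: str = "code_ingest_"):
    """Yield an ephemeral directory and remove it afterwards (hypothetical helper)."""
    path = Path(tempfile.mkdtemp(prefix=prefix))
    try:
        yield path
    finally:
        # mkdtemp() does not clean up on its own, so do it explicitly.
        shutil.rmtree(path, ignore_errors=True)


def run_bounded(cmd: list, cwd, timeout_s: float) -> subprocess.CompletedProcess:
    """Run a command without a shell, with a hard timeout.

    Raises subprocess.TimeoutExpired if the command exceeds timeout_s and
    subprocess.CalledProcessError if it exits with a non-zero status.
    """
    return subprocess.run(
        cmd,
        cwd=str(cwd),
        check=True,
        timeout=timeout_s,
        capture_output=True,
        text=True,
    )


def clone_command(repo_url: str, dest) -> list:
    """Build an argument list for a shallow clone (assumed flags).

    Passing a list instead of a shell string avoids shell injection, and
    `--` stops git from reading the URL as an option.
    """
    return ["git", "clone", "--depth", "1", "--", repo_url, str(dest)]
```

A caller might combine them as `with scratch_dir() as d: run_bounded(clone_command(url, d / "repo"), cwd=d, timeout_s=300)`, where the 300-second limit is an arbitrary illustrative value.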

Verification and Testing Results

The implementation has been verified through a complete end-to-end execution targeting the latest kubeflow/manifests repository. The results of the verification run are as follows:

  • File Throughput: Successfully processed 1,102 distinct code and configuration files.
  • Semantic Chunking: Generated 2,305 hierarchical chunks with associated AST metadata.
  • Vector Accuracy: Verified the generation of 768-dimensional vectors compatible with the production embedding model.
  • Storage Verification: Confirmed successful data persistence in a local Milvus instance via port-forwarding.

Verification Execution Summary

State: Success
Repository: kubeflow/manifests
Total Files Found: 1102
Total Chunks Created: 2305
Embedding Model: all-mpnet-base-v2
Vector Storage: docs_rag (partition: code)
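
A dimension check like the one reported above could be enforced before anything is written to the vector store. The following is a minimal sketch, assuming chunks are dictionaries with hypothetical `file_path` and `content` keys. It only prepares and validates rows and does not call Milvus itself.

```python
EXPECTED_DIM = 768  # dimension stated in this PR for all-mpnet-base-v2


def build_rows(chunks, vectors, expected_dim=EXPECTED_DIM):
    """Pair chunks with their vectors, rejecting any vector of the wrong size.

    The row field names are illustrative assumptions, not the real
    docs_rag schema.
    """
    if len(chunks) != len(vectors):
        raise ValueError(
            f"got {len(chunks)} chunks but {len(vectors)} vectors"
        )
    rows = []
    for i, (chunk, vec) in enumerate(zip(chunks, vectors)):
        if len(vec) != expected_dim:
            raise ValueError(
                f"vector {i} has dimension {len(vec)}, expected {expected_dim}"
            )
        rows.append(
            {
                "file_path": chunk["file_path"],
                "content": chunk["content"],
                "vector": [float(x) for x in vec],
            }
        )
    return rows
```

Failing fast here surfaces a model or configuration mismatch as a clear error before the insert, instead of as a schema error from the database.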

Future Enhancements

  • Specialized Helm Template Parser: Implement a fuzzy parser to extract kind, name, and component from templates, ensuring Helm-defined resources are searchable beyond raw text.
  • AST-Aware Parsing (Tree-Sitter): Leverage Tree-Sitter for precise function boundary detection and multilingual support (Python, Go) with standardized structural metadata.
  • Cross-Repository Dependency Mapping: Build a graph of dependencies between related repos (e.g., manifests to operator code) to improve multi-repo context retrieval.
  • Dynamic File Citations: Implement automated GitHub line-number linking for every chunk to provide perfect traceability for the agent.
  • Hybrid Embedding Ensembles: Combine code-specific models (like CodeBERT) with general models to improve retrieval for both syntax and semantic queries.
  • Incremental Code Ingestion: Utilize git-diff logs to trigger targeted updates of changed files instead of full repository re-scans.
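
Two of these items are small enough to sketch. The snippet below is a rough illustration, not a committed design: `github_permalink` builds a line-anchored link in GitHub's `blob/<ref>/<path>#L<start>-L<end>` URL form, and `parse_name_status` splits the output of `git diff --name-status` into files to re-index and files to delete. Both function names are hypothetical.

```python
from urllib.parse import quote


def github_permalink(owner, repo, ref, path, start_line, end_line=None):
    """Build a GitHub link to a line or line range of a file at a given ref."""
    url = f"https://github.com/{owner}/{repo}/blob/{ref}/{quote(path)}#L{start_line}"
    if end_line is not None and end_line != start_line:
        url += f"-L{end_line}"
    return url


def parse_name_status(diff_output: str):
    """Split `git diff --name-status` output into (to_index, to_delete).

    Each line is a status code followed by tab-separated paths, for
    example "M\tpath" or "R100\told\tnew" for a rename.
    """
    to_index, to_delete = [], []
    for line in diff_output.splitlines():
        if not line.strip():
            continue
        parts = line.split("\t")
        status = parts[0]
        if status.startswith("R"):
            # Rename: drop chunks for the old path, index the new one.
            to_delete.append(parts[1])
            to_index.append(parts[2])
        elif status.startswith("C"):
            # Copy: the source is unchanged, only the new path needs indexing.
            to_index.append(parts[2])
        elif status == "D":
            to_delete.append(parts[1])
        else:
            # Added, modified, or type-changed files are re-indexed.
            to_index.append(parts[1])
    return to_index, to_delete
```

Pinning `ref` to a commit SHA rather than a branch name keeps the cited line numbers stable after the file changes upstream.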

Signed-off-by: haroon0x <haroonbmc0@gmail.com>
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign franciscojavierarceo for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@haroon0x haroon0x changed the title Feat: Add Code ingestion pipeline Feat: Implement Code ingestion pipeline Mar 22, 2026
Development

Successfully merging this pull request may close these issues.

[DESIGN PROPOSAL] : AST Code Ingestion Pipeline for Kubeflow Repositories