feat(pipelines): integrate GitHub issues ingestion into RAG pipeline #168

Open
shishir-kuet wants to merge 2 commits into kubeflow:main from shishir-kuet:feat/github-issues-ingestion

Conversation

@shishir-kuet

Summary

This PR adds a GitHub issues ingestion component to the Kubeflow-based RAG pipeline, enabling ingestion of issues and their comments as an additional knowledge source.

Changes

  • Added download_github_issues component to fetch issues and comments via the GitHub API
  • Integrated the component into github_rag_pipeline
  • Added a chunk_and_embed task for ingested issues data
  • Ensured task dependency (issues_chunk_task.after(issues_task)) for correct execution order

Scope

  • This PR focuses only on adding GitHub issues ingestion
  • No changes were made to existing chunking or embedding logic

Why

Including GitHub issues improves RAG answer quality by incorporating real-world discussions, troubleshooting context, and community knowledge.

Related Issue

Closes #9

Signed-off-by: Md. Mosaddek Ali <mosaddekshishir4113@gmail.com>
Copilot AI review requested due to automatic review settings March 22, 2026 19:38
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign franciscojavierarceo for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot AI left a comment

Pull request overview

Adds GitHub Issues (including comments) as an additional ingestion source in the Kubeflow-based RAG pipeline to improve retrieval quality using real-world troubleshooting discussions (closes #9).

Changes:

  • Updated chunk_and_embed to use a different RecursiveCharacterTextSplitter import path.
  • Added a download_github_issues task into github_rag_pipeline.
  • Added an issues chunk_and_embed task and a dependency to ensure it runs after issues download.
Comments suppressed due to low confidence (1)

pipelines/kubeflow-pipeline.py:449

  • issues_chunk_task produces embedded issue chunks, but its output is not consumed by store_milvus (only chunk_task.outputs["embedded_data"] is stored). This means the new issues ingestion work won’t actually be indexed in Milvus. Consider either merging docs+issues embedded datasets before storage, or adding storage logic that can insert both without dropping/recreating the collection.
    issues_chunk_task = chunk_and_embed(
    github_data=issues_task.outputs["issues_data"],
    repo_name="kubeflow-issues",
    base_url="https://github.com",
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
    )
    issues_chunk_task.after(issues_task)
    # Store in Milvus
    store_task = store_milvus(
        embedded_data=chunk_task.outputs["embedded_data"],
        milvus_host=milvus_host,
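
A minimal sketch of the first option (merging the docs and issues embedded datasets before storage). It assumes each embedded dataset is a JSON Lines file with one record per line, which this thread does not confirm; `merge_embedded_datasets` is a hypothetical helper, not part of the pipeline.

```python
import json
from typing import Iterable


def merge_embedded_datasets(inputs: Iterable[str], output: str) -> int:
    """Concatenate JSON Lines files into one file and return the record count.

    Assumption: every input is a JSON Lines file (one JSON record per line).
    """
    count = 0
    with open(output, "w", encoding="utf-8") as out:
        for path in inputs:
            with open(path, encoding="utf-8") as src:
                for line in src:
                    line = line.strip()
                    if not line:
                        continue  # skip blank lines between records
                    json.loads(line)  # fail early on a malformed record
                    out.write(line + "\n")
                    count += 1
    return count
```

A merge step like this would sit between the two `chunk_and_embed` tasks and `store_milvus`, so that a single store task receives both sources and the collection is written once.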

Comment on lines +423 to +429
issues_task = download_github_issues(
repos="kubeflow/kubeflow,kubeflow/pipelines",
labels="",
state="open",
max_issues_per_repo=50,
github_token=github_token
)
Copilot AI Mar 22, 2026

The repos list for download_github_issues is hard-coded to kubeflow/kubeflow,kubeflow/pipelines, which makes github_rag_pipeline less reusable and inconsistent with the existing repo_owner/repo_name parameters. Consider adding a pipeline parameter (e.g., issues_repos) with this as the default, or deriving it from existing inputs.
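
As a rough illustration of the parameterized form, the sketch below keeps the current repos as a default and validates the comma-separated value before use. `DEFAULT_ISSUES_REPOS` and `parse_repos` are hypothetical names, and the validation rules are an assumption rather than anything `download_github_issues` is known to do.

```python
from typing import List

# Default mirrors the value currently hard-coded in the pipeline.
DEFAULT_ISSUES_REPOS = "kubeflow/kubeflow,kubeflow/pipelines"


def parse_repos(repos: str = DEFAULT_ISSUES_REPOS) -> List[str]:
    """Split a comma-separated "owner/name" list, rejecting malformed entries."""
    parsed = []
    for entry in repos.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate empty items such as a trailing comma
        owner, sep, name = entry.partition("/")
        if not sep or not owner or not name or "/" in name:
            raise ValueError(f"expected 'owner/name', got {entry!r}")
        parsed.append(entry)
    return parsed
```

With a pipeline parameter such as `issues_repos` defaulting to the same string, callers could point the pipeline at other repositories without editing the pipeline definition.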

import torch
from sentence_transformers import SentenceTransformer
- from langchain.text_splitter import RecursiveCharacterTextSplitter
+ from langchain_text_splitter import RecursiveCharacterTextSplitter

Copilot AI Mar 22, 2026

chunk_and_embed now imports RecursiveCharacterTextSplitter from langchain_text_splitter, but the installed dependency in this component is langchain (see packages_to_install), and the repository requirements use langchain-text-splitters whose module name is typically langchain_text_splitters (plural). As written, this is likely to fail at runtime with ModuleNotFoundError. Align the import with the dependency you install (either revert to langchain.text_splitter when installing langchain, or install langchain-text-splitters and import from langchain_text_splitters).

Suggested change
- from langchain_text_splitter import RecursiveCharacterTextSplitter
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
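
One way to check which import path matches the installed packages before committing to one is to probe for the module without importing it. This is only a sketch: `first_available_module` is a hypothetical helper, and the candidate list simply contains the two module names mentioned in this comment.

```python
import importlib.util
from typing import Sequence

# Module names taken from the review comment above.
CANDIDATE_MODULES = ("langchain_text_splitters", "langchain.text_splitter")


def first_available_module(candidates: Sequence[str] = CANDIDATE_MODULES) -> str:
    """Return the first candidate module that can be located in this environment."""
    for name in candidates:
        try:
            if importlib.util.find_spec(name) is not None:
                return name
        except ModuleNotFoundError:
            # Raised for dotted names when the parent package is missing.
            continue
    raise ModuleNotFoundError(f"none of {list(candidates)} is installed")
```

Running this inside the component's base image (with the same `packages_to_install`) would show which import the component can actually resolve.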

issues_chunk_task = chunk_and_embed(
github_data=issues_task.outputs["issues_data"],
repo_name="kubeflow-issues",
base_url="https://github.com",

Copilot AI Mar 22, 2026

For issues, download_github_issues emits records with path like issues/{repo}/{number} and also includes the real url (html_url). However, chunk_and_embed builds citation_url as f"{base_url}/{file_data['path']}" for non-doc paths, so passing base_url="https://github.com" here will generate invalid citation links (e.g., https://github.com/issues/...). Use the issue url field when present (or pass a base URL that matches the emitted paths, though this is tricky with multiple repos).

Suggested change
- base_url="https://github.com",
+ base_url="",
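
The alternative of preferring the record's own `url` field could look roughly like the sketch below. `build_citation_url` is a hypothetical helper, and the record shape (a dict with `path` and an optional `url`) is assumed from the description above rather than taken from the actual `chunk_and_embed` code.

```python
from typing import Mapping


def build_citation_url(file_data: Mapping[str, str], base_url: str) -> str:
    """Prefer the record's own URL; otherwise join base_url and path."""
    url = file_data.get("url")
    if url:
        return url  # e.g. the issue's html_url from the GitHub API
    # Fallback: base_url + path, normalizing slashes at the join
    # (an addition over the plain f-string described above).
    return f"{base_url.rstrip('/')}/{file_data['path'].lstrip('/')}"
```

Records without a `url` field keep the existing base-URL-plus-path behavior, so docs citations would be unaffected under these assumptions.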

Author

Good catch, thanks!

You're right that using html_url from the GitHub API would produce more accurate citations for issues.

To keep this PR focused on integrating issues ingestion without modifying existing chunk_and_embed behavior, I’ve kept the current approach unchanged.

I’m happy to open a follow-up PR to update chunk_and_embed to prefer html_url when available.

Signed-off-by: Md. Mosaddek Ali <mosaddekshishir4113@gmail.com>
@shishir-kuet
Author

Thanks for the review! I've addressed the feedback:

  • Reverted the text splitter import to match existing dependencies
  • Parameterized the GitHub issues ingestion instead of using hardcoded values
  • Fixed indentation to align with the existing pipeline style
  • Moved unrelated changes (ZeroDivisionError fix) to a separate PR to keep this one focused

Please let me know if any further changes are needed.

Development

Successfully merging this pull request may close these issues.

Feat: Ingest GitHub Issue Comments into RAG pipeline

2 participants