feat(pipelines): integrate GitHub issues ingestion into RAG pipeline by shishir-kuet · Pull Request #168 · kubeflow/docs-agent

shishir-kuet · 2026-03-22T19:38:26Z

Summary

This PR adds a GitHub issues ingestion component to the Kubeflow-based RAG pipeline, enabling ingestion of issues and their comments as an additional knowledge source.

Changes

Added download_github_issues component to fetch issues and comments via the GitHub API
Integrated the component into github_rag_pipeline
Added a chunk_and_embed task for ingested issues data
Ensured task dependency (issues_chunk_task.after(issues_task)) for correct execution order
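
As a rough illustration of what the ingestion step has to produce, the sketch below flattens one issue and its comments into a single record. The `path` and `url` fields follow the record shape described in the review of this PR; the function name `issue_to_record` and the `content` field are assumptions made for this example, not the component's actual code. The input keys (`number`, `title`, `body`, `html_url`, `user.login`) are the ones the GitHub REST API returns for issues and comments.

```python
from typing import Any


def issue_to_record(
    repo: str,
    issue: dict[str, Any],
    comments: list[dict[str, Any]],
) -> dict[str, str]:
    """Flatten one GitHub issue and its comments into a single text record."""
    # Title first, then the issue body (the API returns null for an empty body).
    parts = [f"# {issue['title']}", issue.get("body") or ""]
    for comment in comments:
        author = (comment.get("user") or {}).get("login", "unknown")
        body = comment.get("body") or ""
        if body.strip():
            parts.append(f"Comment by {author}:\n{body}")
    return {
        # Hypothetical path layout, mirroring "issues/{repo}/{number}".
        "path": f"issues/{repo}/{issue['number']}",
        "content": "\n\n".join(p for p in parts if p.strip()),
        "url": issue["html_url"],
    }
```

Note that the GitHub issues endpoint also returns pull requests, marked by a `pull_request` key, so a real fetch loop would probably want to skip those entries before calling a helper like this.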

Scope

This PR focuses only on adding GitHub issues ingestion
No changes were made to existing chunking or embedding logic

Why

Including GitHub issues improves RAG answer quality by incorporating real-world discussions, troubleshooting context, and community knowledge.

Related Issue

Closes #9

Signed-off-by: Md. Mosaddek Ali <mosaddekshishir4113@gmail.com>

google-oss-prow · 2026-03-22T19:38:31Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign franciscojavierarceo for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot

Pull request overview

Adds GitHub Issues (including comments) as an additional ingestion source in the Kubeflow-based RAG pipeline to improve retrieval quality using real-world troubleshooting discussions (closes #9).

Changes:

Updated chunk_and_embed to use a different RecursiveCharacterTextSplitter import path.
Added a download_github_issues task into github_rag_pipeline.
Added an issues chunk_and_embed task and a dependency to ensure it runs after issues download.

Comments suppressed due to low confidence (1)

pipelines/kubeflow-pipeline.py:449

issues_chunk_task produces embedded issue chunks, but its output is not consumed by store_milvus (only chunk_task.outputs["embedded_data"] is stored). This means the new issues ingestion work won’t actually be indexed in Milvus. Consider either merging docs+issues embedded datasets before storage, or adding storage logic that can insert both without dropping/recreating the collection.

    issues_chunk_task = chunk_and_embed(
    github_data=issues_task.outputs["issues_data"],
    repo_name="kubeflow-issues",
    base_url="https://github.com",
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
    )
    issues_chunk_task.after(issues_task)
    # Store in Milvus
    store_task = store_milvus(
        embedded_data=chunk_task.outputs["embedded_data"],
        milvus_host=milvus_host,
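
One way to act on the first suggestion above (merging before storage) is to concatenate the two embedded datasets and pass the result to a single store_milvus call. The sketch below assumes each dataset can be loaded as a list of dicts; the helper name `merge_embedded` and the `source` tag are hypothetical and are not part of the pipeline in this PR.

```python
from typing import Any, Iterable


def merge_embedded(
    docs: Iterable[dict[str, Any]],
    issues: Iterable[dict[str, Any]],
) -> list[dict[str, Any]]:
    """Concatenate docs and issues records, tagging each with its origin."""
    merged = []
    for source, records in (("docs", docs), ("issues", issues)):
        for record in records:
            # Copy each record so the caller's data is not mutated.
            merged.append({**record, "source": source})
    return merged
```

Tagging the origin is optional; it is included here only because it would let retrieval results distinguish documentation chunks from issue chunks later.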


Copilot · 2026-03-22T19:40:41Z

pipelines/kubeflow-pipeline.py

+    issues_task = download_github_issues(
+    repos="kubeflow/kubeflow,kubeflow/pipelines",
+    labels="",
+    state="open",
+    max_issues_per_repo=50,
+    github_token=github_token
+    )


The repos list for download_github_issues is hard-coded to kubeflow/kubeflow,kubeflow/pipelines, which makes github_rag_pipeline less reusable and inconsistent with the existing repo_owner/repo_name parameters. Consider adding a pipeline parameter (e.g., issues_repos) with this as the default, or deriving it from existing inputs.
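
A minimal sketch of the parameterized form: the comma-separated string becomes a pipeline-level default that is validated and split into owner/name pairs. The names `DEFAULT_ISSUES_REPOS` and `parse_repos` are assumptions for illustration; only the default value itself comes from the diff above.

```python
DEFAULT_ISSUES_REPOS = "kubeflow/kubeflow,kubeflow/pipelines"


def parse_repos(repos: str = DEFAULT_ISSUES_REPOS) -> list[tuple[str, str]]:
    """Split "owner/name,owner/name" into (owner, name) pairs."""
    pairs = []
    for item in repos.split(","):
        item = item.strip()
        if not item:
            # Tolerate trailing commas and blank entries.
            continue
        owner, sep, name = item.partition("/")
        if not sep or not owner or not name or "/" in name:
            raise ValueError(f"expected 'owner/name', got {item!r}")
        pairs.append((owner, name))
    return pairs
```

Keeping the value as a single string matches how the `repos` argument is passed in the diff, so a caller could override it per run without changing the component signature.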

Copilot · 2026-03-22T19:40:41Z

pipelines/kubeflow-pipeline.py

    import torch
    from sentence_transformers import SentenceTransformer
-    from langchain.text_splitter import RecursiveCharacterTextSplitter
+    from langchain_text_splitter import RecursiveCharacterTextSplitter


chunk_and_embed now imports RecursiveCharacterTextSplitter from langchain_text_splitter, but the installed dependency in this component is langchain (see packages_to_install), and the repository requirements use langchain-text-splitters whose module name is typically langchain_text_splitters (plural). As written, this is likely to fail at runtime with ModuleNotFoundError. Align the import with the dependency you install (either revert to langchain.text_splitter when installing langchain, or install langchain-text-splitters and import from langchain_text_splitters).

Suggested change

-    from langchain_text_splitter import RecursiveCharacterTextSplitter
+    from langchain.text_splitter import RecursiveCharacterTextSplitter
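
A mismatch like this can be caught before a pipeline run by checking which candidate module name actually resolves in the component's environment. The following is a minimal sketch using only the standard library; the helper name `first_importable` is hypothetical.

```python
import importlib.util
from typing import Iterable, Optional


def first_importable(candidates: Iterable[str]) -> Optional[str]:
    """Return the first module name that can be located, or None."""
    for name in candidates:
        try:
            if importlib.util.find_spec(name) is not None:
                return name
        except (ImportError, ValueError):
            # A dotted name whose parent package is missing raises
            # ModuleNotFoundError instead of returning None.
            continue
    return None
```

Calling it with `["langchain_text_splitters", "langchain.text_splitter"]` inside the component's container would show which of the two import paths named in this review is usable there.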

Copilot · 2026-03-22T19:40:42Z

pipelines/kubeflow-pipeline.py

+    issues_chunk_task = chunk_and_embed(
+    github_data=issues_task.outputs["issues_data"],
+    repo_name="kubeflow-issues",
+    base_url="https://github.com",


For issues, download_github_issues emits records with path like issues/{repo}/{number} and also includes the real url (html_url). However, chunk_and_embed builds citation_url as f"{base_url}/{file_data['path']}" for non-doc paths, so passing base_url="https://github.com" here will generate invalid citation links (e.g., https://github.com/issues/...). Use the issue url field when present (or pass a base URL that matches the emitted paths, though this is tricky with multiple repos).

Suggested change

-    base_url="https://github.com",
+    base_url="",

shishir-kuet · 2026-03-22

Good catch, thanks!

You're right that using html_url from the GitHub API would produce more accurate citations for issues.

To keep this PR focused on integrating issues ingestion without modifying existing chunk_and_embed behavior, I’ve kept the current approach unchanged.

I’m happy to open a follow-up PR to update chunk_and_embed to prefer html_url when available.

Signed-off-by: Md. Mosaddek Ali <mosaddekshishir4113@gmail.com>
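
The follow-up discussed in this thread could look roughly like the sketch below, which prefers a record's own `url` and falls back to joining `base_url` and `path`. This is not the actual chunk_and_embed code; the function name `build_citation_url` is an assumption, and the fallback mirrors the `f"{base_url}/{file_data['path']}"` behavior described in the review.

```python
def build_citation_url(record: dict, base_url: str) -> str:
    """Prefer the record's own URL; otherwise join base_url and path."""
    url = record.get("url")
    if url:
        # Issue records carry the html_url from the GitHub API.
        return url
    # Strip slashes on both sides so the join never doubles them.
    return f"{base_url.rstrip('/')}/{record['path'].lstrip('/')}"
```

With this shape, documentation records without a `url` field keep their current citation links, while issue records link straight to the issue page regardless of the `base_url` passed in.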

shishir-kuet · 2026-03-22T19:50:18Z

Thanks for the review! I've addressed the feedback:

Reverted the text splitter import to match existing dependencies
Parameterized the GitHub issues ingestion instead of using hardcoded values
Fixed indentation to align with the existing pipeline style
Moved unrelated changes (ZeroDivisionError fix) to a separate PR to keep this one focused

Please let me know if any further changes are needed.

feat(pipelines): integrate GitHub issues ingestion into RAG pipeline

f9a1683

Signed-off-by: Md. Mosaddek Ali <mosaddekshishir4113@gmail.com>

Copilot AI review requested due to automatic review settings March 22, 2026 19:38

google-oss-prow bot requested a review from franciscojavierarceo March 22, 2026 19:38

google-oss-prow bot added the size/S label Mar 22, 2026

Copilot started reviewing on behalf of shishir-kuet March 22, 2026 19:38

Copilot AI reviewed Mar 22, 2026

fix: align issues ingestion with pipeline params and correct import

8c1227b

Signed-off-by: Md. Mosaddek Ali <mosaddekshishir4113@gmail.com>

shishir-kuet wants to merge 2 commits into kubeflow:main from shishir-kuet:feat/github-issues-ingestion

2 participants