feat(pipelines): integrate GitHub issues ingestion into RAG pipeline #168
shishir-kuet wants to merge 2 commits into kubeflow:main
Signed-off-by: Md. Mosaddek Ali <mosaddekshishir4113@gmail.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED. It needs approval from an approver in each of the affected files.
Pull request overview
Adds GitHub Issues (including comments) as an additional ingestion source in the Kubeflow-based RAG pipeline to improve retrieval quality using real-world troubleshooting discussions (closes #9).
Changes:
- Updated `chunk_and_embed` to use a different `RecursiveCharacterTextSplitter` import path.
- Added a `download_github_issues` task into `github_rag_pipeline`.
- Added an issues `chunk_and_embed` task and a dependency to ensure it runs after issues download.
Comments suppressed due to low confidence (1)
pipelines/kubeflow-pipeline.py:449
`issues_chunk_task` produces embedded issue chunks, but its output is not consumed by `store_milvus` (only `chunk_task.outputs["embedded_data"]` is stored). This means the new issues ingestion work won't actually be indexed in Milvus. Consider either merging the docs and issues embedded datasets before storage, or adding storage logic that can insert both without dropping/recreating the collection.
issues_chunk_task = chunk_and_embed(
    github_data=issues_task.outputs["issues_data"],
    repo_name="kubeflow-issues",
    base_url="https://github.com",
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
issues_chunk_task.after(issues_task)

# Store in Milvus
store_task = store_milvus(
    embedded_data=chunk_task.outputs["embedded_data"],
    milvus_host=milvus_host,
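
The first option in the comment above, merging the docs and issues datasets before storage, might look roughly like the sketch below. It assumes each embedded dataset can be read as a list of dict records; `merge_embedded_records` and the `chunk_id` key are hypothetical names, not taken from the pipeline.

```python
from typing import Dict, Iterable, List


def merge_embedded_records(
    *datasets: Iterable[Dict],
    key: str = "chunk_id",  # hypothetical unique-id field; adjust to the real schema
) -> List[Dict]:
    """Concatenate several embedded datasets, keeping the first record per key."""
    merged: List[Dict] = []
    seen = set()
    for dataset in datasets:
        for record in dataset:
            record_key = record.get(key)
            # Records without a key cannot be deduplicated, so keep them all.
            if record_key is not None:
                if record_key in seen:
                    continue
                seen.add(record_key)
            merged.append(record)
    return merged


docs = [{"chunk_id": "docs-1", "text": "install guide"}]
issues = [
    {"chunk_id": "issue-1", "text": "pod crash discussion"},
    {"chunk_id": "docs-1", "text": "duplicate of the docs chunk"},
]
combined = merge_embedded_records(docs, issues)
```

With a merge step like this, a single storage task could receive the combined records, so the collection would only need to be written once.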
issues_task = download_github_issues(
    repos="kubeflow/kubeflow,kubeflow/pipelines",
    labels="",
    state="open",
    max_issues_per_repo=50,
    github_token=github_token
)
The repos list for download_github_issues is hard-coded to kubeflow/kubeflow,kubeflow/pipelines, which makes github_rag_pipeline less reusable and inconsistent with the existing repo_owner/repo_name parameters. Consider adding a pipeline parameter (e.g., issues_repos) with this as the default, or deriving it from existing inputs.
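
A minimal sketch of what the suggestion could look like, assuming a new `issues_repos` parameter that falls back to the existing `repo_owner`/`repo_name` inputs and then to the current hard-coded value. `resolve_issues_repos` and `DEFAULT_ISSUES_REPOS` are hypothetical names introduced only for illustration. In a KFP pipeline the parameter values are only known at run time, so logic like this would likely have to live inside a component rather than in the pipeline function body.

```python
# Current hard-coded value from the PR, reused here as the default.
DEFAULT_ISSUES_REPOS = "kubeflow/kubeflow,kubeflow/pipelines"


def resolve_issues_repos(
    issues_repos: str = "",
    repo_owner: str = "",
    repo_name: str = "",
) -> str:
    """Pick the comma-separated repos string for the issues download step."""
    cleaned = [part.strip() for part in issues_repos.split(",") if part.strip()]
    if cleaned:
        for repo in cleaned:
            owner, sep, name = repo.partition("/")
            # Each entry must look like "owner/name" with exactly one slash.
            if not (owner and sep and name) or "/" in name:
                raise ValueError(f"expected 'owner/name', got {repo!r}")
        return ",".join(cleaned)
    if repo_owner and repo_name:
        # Derive a single repo from the existing pipeline inputs.
        return f"{repo_owner}/{repo_name}"
    return DEFAULT_ISSUES_REPOS
```

Calling `resolve_issues_repos()` with no arguments keeps today's behavior, while callers who pass their own list or only the existing owner/name pair get a matching issues source.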
pipelines/kubeflow-pipeline.py
Outdated
  import torch
  from sentence_transformers import SentenceTransformer
- from langchain.text_splitter import RecursiveCharacterTextSplitter
+ from langchain_text_splitter import RecursiveCharacterTextSplitter
chunk_and_embed now imports RecursiveCharacterTextSplitter from langchain_text_splitter, but the installed dependency in this component is langchain (see packages_to_install), and the repository requirements use langchain-text-splitters whose module name is typically langchain_text_splitters (plural). As written, this is likely to fail at runtime with ModuleNotFoundError. Align the import with the dependency you install (either revert to langchain.text_splitter when installing langchain, or install langchain-text-splitters and import from langchain_text_splitters).
- from langchain_text_splitter import RecursiveCharacterTextSplitter
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
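
The review asks for the import to match the installed package. A different, more defensive technique is to try the candidate module paths in order and use the first one that resolves. The sketch below shows that alternative; `import_first` is a hypothetical helper, and the demonstration uses standard-library modules so it runs without LangChain installed.

```python
import importlib
from typing import Any, Sequence


def import_first(candidates: Sequence[str], attr: str) -> Any:
    """Return `attr` from the first importable module in `candidates`."""
    for module_name in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:  # also covers ModuleNotFoundError
            continue
        if hasattr(module, attr):
            return getattr(module, attr)
    raise ImportError(f"none of {list(candidates)} provides {attr!r}")


# Demonstration with the standard library: the first name does not exist,
# so the helper falls through to `json`.
loads = import_first(["module_that_does_not_exist_xyz", "json"], "loads")

# In the component, the call might look like this (not executed here):
# RecursiveCharacterTextSplitter = import_first(
#     ["langchain_text_splitters", "langchain.text_splitter"],
#     "RecursiveCharacterTextSplitter",
# )
```

A fallback like this hides a mismatch instead of fixing it, so pinning one package in `packages_to_install` and importing from its module, as the review suggests, is probably the simpler choice.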
issues_chunk_task = chunk_and_embed(
    github_data=issues_task.outputs["issues_data"],
    repo_name="kubeflow-issues",
    base_url="https://github.com",
For issues, download_github_issues emits records with path like issues/{repo}/{number} and also includes the real url (html_url). However, chunk_and_embed builds citation_url as f"{base_url}/{file_data['path']}" for non-doc paths, so passing base_url="https://github.com" here will generate invalid citation links (e.g., https://github.com/issues/...). Use the issue url field when present (or pass a base URL that matches the emitted paths, though this is tricky with multiple repos).
-     base_url="https://github.com",
+     base_url="",
Good catch, thanks!
You're right that using html_url from the GitHub API would produce more accurate citations for issues.
To keep this PR focused on integrating issues ingestion without modifying existing chunk_and_embed behavior, I’ve kept the current approach unchanged.
I’m happy to open a follow-up PR to update chunk_and_embed to prefer html_url when available.
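
As a rough idea of what that follow-up could contain, the sketch below prefers a record's own `url` field and only falls back to joining `base_url` and `path`. `build_citation_url` is a hypothetical helper, and the record shape (a dict with `path` and an optional `url`) is an assumption based on the review comment, not the actual `chunk_and_embed` code.

```python
from typing import Dict


def build_citation_url(file_data: Dict[str, str], base_url: str) -> str:
    """Prefer the record's own URL; otherwise join base_url and path."""
    url = file_data.get("url")
    if url:
        # Issue records carry the real GitHub link, so use it directly.
        return url
    # Fallback: join base_url and path, normalizing slashes at the boundary.
    return f"{base_url.rstrip('/')}/{file_data['path'].lstrip('/')}"


# Illustrative records only; the values are made up for the example.
issue_record = {
    "path": "issues/kubeflow/pipelines/123",
    "url": "https://github.com/kubeflow/pipelines/issues/123",
}
doc_record = {"path": "docs/start.md"}

issue_link = build_citation_url(issue_record, "https://github.com")
doc_link = build_citation_url(doc_record, "https://example.org/")
```

With this ordering the `base_url` passed for issues no longer affects their citations, which sidesteps the multi-repo problem noted in the review.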
Signed-off-by: Md. Mosaddek Ali <mosaddekshishir4113@gmail.com>
Thanks for the review! I've addressed the feedback:
Please let me know if any further changes are needed.
Summary
This PR adds a GitHub issues ingestion component to the Kubeflow-based RAG pipeline, enabling ingestion of issues and their comments as an additional knowledge source.
Changes
- `download_github_issues` component to fetch issues and comments via the GitHub API
- Integration into `github_rag_pipeline`
- `chunk_and_embed` task for ingested issues data
- Task dependency (`issues_chunk_task.after(issues_task)`) for correct execution order
Scope
Why
Including GitHub issues improves RAG answer quality by incorporating real-world discussions, troubleshooting context, and community knowledge.
Related Issue
Closes #9