
fix(pipelines): guard against ZeroDivisionError in chunk_and_embed_incremental when chunk list is empty #164

Open
Kunal-Somani wants to merge 1 commit into kubeflow:main from Kunal-Somani:fix/issue-163-incremental-chunk-zero-division


@Kunal-Somani

Summary

Fixes #163

chunk_and_embed_incremental in pipelines/incremental-pipeline.py has the same ZeroDivisionError crash that was fixed for kubeflow-pipeline.py in PR #149 (issue #148). That fix did not cover the incremental pipeline.

Root Cause

The incremental pipeline applies the same aggressive regex cleaning as kubeflow-pipeline.py (Hugo frontmatter, HTML tags, URLs). A document can pass the < 50 char content guard and still produce zero chunks after splitting. The print statement on the next line then divides by len(chunks), which is 0, and the resulting ZeroDivisionError crashes the entire incremental KFP run.

Affected Line

print(f"File: {file_data['path']} -> {len(chunks)} chunks (avg: {sum(len(c) for c in chunks)/len(chunks):.0f} chars)")
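The failure can be reproduced outside the pipeline. The sketch below is a minimal illustration, not the pipeline code itself: `describe` is a hypothetical helper that wraps the same f-string expression as the affected line, and the `file_data` value is made up for the example.

```python
# Hypothetical stand-ins for the loop variables used in the pipeline.
file_data = {"path": "docs/example.md"}
empty_chunks = []  # what the splitter can return after aggressive cleaning


def describe(file_data, chunks):
    """Build the same summary string as the affected print statement."""
    return (
        f"File: {file_data['path']} -> {len(chunks)} chunks "
        f"(avg: {sum(len(c) for c in chunks)/len(chunks):.0f} chars)"
    )


# With at least one chunk the expression evaluates normally.
ok_summary = describe(file_data, ["abcd", "ab"])

# With an empty list, len(chunks) is 0 and the division raises.
try:
    describe(file_data, empty_chunks)
    raised = False
except ZeroDivisionError:
    raised = True
```

Because the exception is unhandled in the component, a single file that cleans down to nothing is enough to fail the run.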

Fix

Added the same empty-chunk guard already applied in PR #149:

if not chunks:
    print(f"Skipping file after chunking (no chunks produced): {file_data['path']}")
    continue
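To show where the guard sits relative to the division, here is a hedged, self-contained sketch of the loop shape. `split_text` and `summarize_files` are hypothetical stand-ins (the real component uses RecursiveCharacterTextSplitter and does more work per file); only the `if not chunks` guard mirrors the change in this PR.

```python
def split_text(text, chunk_size=20):
    """Hypothetical splitter: returns [] when nothing is left after stripping."""
    text = text.strip()
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def summarize_files(files):
    """Return (summaries, skipped) while never dividing by zero."""
    summaries, skipped = [], []
    for file_data in files:
        chunks = split_text(file_data["content"])

        # Guard from this PR: skip files that produced no chunks.
        if not chunks:
            print(f"Skipping file after chunking (no chunks produced): {file_data['path']}")
            skipped.append(file_data["path"])
            continue

        # Safe now: len(chunks) >= 1.
        avg = sum(len(c) for c in chunks) / len(chunks)
        print(f"File: {file_data['path']} -> {len(chunks)} chunks (avg: {avg:.0f} chars)")
        summaries.append((file_data["path"], len(chunks), avg))
    return summaries, skipped


files = [
    {"path": "a.md", "content": "   "},     # cleans down to nothing
    {"path": "b.md", "content": "x" * 30},  # splits into 20 + 10 chars
]
summaries, skipped = summarize_files(files)
```

Placing the guard immediately after splitting keeps every later use of `chunks` safe, and `continue` lets the remaining files in the batch proceed.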

Checklist

…cremental

The incremental pipeline has the same ZeroDivisionError crash as kubeflow-pipeline.py (fixed in PR kubeflow#149, issue kubeflow#148). When RecursiveCharacterTextSplitter returns an empty chunk list, the print statement divides by len(chunks) == 0, crashing the entire incremental KFP run.

Add the same empty-chunk guard already applied in PR kubeflow#149, consistent with the existing short-content guard pattern in both pipelines.

Fixes kubeflow#163

Signed-off-by: Kunal <kunal120222@gmail.com>
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign franciscojavierarceo for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


Development

Successfully merging this pull request may close these issues.

bug(pipelines): ZeroDivisionError in chunk_and_embed_incremental when text splitter returns empty chunks
