skills(unstructured-pdf): reframe around Databricks RAG-eval workflow #88

Open: jamesbroadhead wants to merge 4 commits into main from jb/reframe-unstructured-pdf


@jamesbroadhead
Contributor

Summary

Per Lennart's audit on #73, item #9: databricks-unstructured-pdf-generation reads as "not very Databricks-specific" because the headline is local HTML → PDF generation, with the Databricks workflow (UC volume + RAG-eval dataset) buried.

This reframes the skill to put the Databricks-specific value up front, without removing any content.

Changes

experimental/databricks-unstructured-pdf-generation/SKILL.md:

  • Frontmatter description now leads with "Build RAG / unstructured-document evaluation datasets on Databricks". PDF generation is positioned as a step, not the headline.
  • Body intro states explicitly that the Databricks-specific value is the workflow shape (UC volume layout, paired question files, hand-off to downstream ai_extract / ai_parse_document / mlflow.genai.evaluate()), not the HTML → PDF tooling itself.
  • One-line escape hatch added: "If you only need ad-hoc PDFs (no Databricks workflow), any HTML → PDF tool works directly — this skill exists for the synthetic-dataset-on-UC end-to-end shape, not as a general PDF generator."
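
To make the "workflow shape" concrete, here is a minimal sketch of how each PDF might be paired with a question file under a Unity Catalog volume. The catalog, schema, and volume names, the `pdfs/` and `questions/` subdirectories, and the JSON fields are all assumptions for illustration; they are not taken from SKILL.md.

```python
import json
from pathlib import PurePosixPath

# Hypothetical Unity Catalog volume root; replace with a real catalog/schema/volume.
VOLUME_ROOT = PurePosixPath("/Volumes/main/rag_eval/synthetic_docs")


def paired_paths(doc_id: str) -> dict[str, str]:
    """Return the volume paths for one PDF and its paired question file."""
    return {
        "pdf": str(VOLUME_ROOT / "pdfs" / f"{doc_id}.pdf"),
        "questions": str(VOLUME_ROOT / "questions" / f"{doc_id}.json"),
    }


def question_record(doc_id: str, questions: list[str]) -> str:
    """Serialize the test questions for one document as JSON."""
    return json.dumps({"doc_id": doc_id, "questions": questions}, indent=2)


# Example: one synthetic document and its retrieval test question.
paths = paired_paths("invoice_001")
record = question_record("invoice_001", ["What is the invoice total?"])
```

Sharing the same `doc_id` stem between the PDF and its question file keeps the pairing recoverable from the volume listing alone. The upload step itself is omitted here.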

Manifest regenerated to pick up the new description. No deletions; this is a framing change.

What this doesn't do

Two stronger alternatives in the audit are not implemented here:

  • Trim the local HTML → PDF tooling and link to an external tool. Would destroy useful content; the templates and parallel-conversion patterns are still valuable for users following the end-to-end workflow.
  • Rename to e.g. databricks-rag-test-data. Has cross-PR implications (the a-d-k tombstone PR databricks-solutions/ai-dev-kit#546 references the current name) and changes the install command for existing users.

If either is preferred over this lighter reframe, happy to open a follow-up.

Test plan

  • `python3 scripts/skills.py generate` runs clean.
  • `python3 scripts/skills.py validate` passes.
  • CI green.
  • Reviewer sign-off (@lennartkats-db raised this; @dustinvannoy-db cc'd).

This pull request and its description were written by Claude.

Per Lennart's audit on #73 ("this is almost not very Databricks-specific
at all?"): the skill's value is the synthetic-PDFs-on-UC-volume workflow
shape for RAG / unstructured-document retrieval evaluation, not the
HTML → PDF generation step itself (any local HTML → PDF tool works for
that — weasyprint, wkhtmltopdf, playwright pdf, plutoprint).

Reframe SKILL.md to put the Databricks-specific value up front:
- Frontmatter description now leads with "Build RAG /
  unstructured-document evaluation datasets on Databricks"; PDF
  generation is positioned as a step, not the headline.
- Body intro states explicitly that the Databricks-specific value is
  the workflow shape (UC volume layout, paired question files, hand-off
  to downstream `ai_extract` / `ai_parse_document` / mlflow.genai eval),
  not the local HTML → PDF tooling.
- Adds a one-line note: "if you only need ad-hoc PDFs, any local
  HTML → PDF tool works directly — this skill exists for the
  synthetic-dataset-on-UC end-to-end shape".

No content removed; this is a framing change so users (and reviewers)
can tell what the Databricks-specific value of the skill is at a glance.
Manifest regenerated to pick up the new description.

Co-authored-by: Isaac

@QuentinAmbard left a comment

This skill's main use case was creating datasets to quickly set up or demo KA or entity extraction (more than eval datasets). Let's just make sure this is clear in the skill so that it gets picked up for this use case too.

---
name: databricks-unstructured-pdf-generation
- description: "Generate PDF documents from HTML and upload to Unity Catalog volumes. Use for creating test PDFs, demo documents, reports, or evaluation datasets."
+ description: "Build RAG / unstructured-document evaluation datasets on Databricks: generate synthetic PDFs locally, upload to Unity Catalog volumes, and pair each document with test questions for retrieval evaluation."

A lot of our usage is generating datasets for demos, not only evaluation datasets. Could we keep something like this in the description instead:
Build RAG / unstructured-document evaluation datasets on Databricks and Generate PDF documents for demos having Knowledge Assistant: generate synthetic PDFs locally, upload to Unity Catalog volumes, and pair each document with test questions for retrieval evaluation.

Contributor Author

Done in 61d0d8b, used your wording.

---

- # PDF Generation from HTML
+ # Unstructured-Document Eval Datasets on Databricks

Unstructured-Document for Demos and Eval Datasets on Databricks

Contributor Author

Done in 61d0d8b, used your wording.

# Unstructured-Document Eval Datasets on Databricks

- Convert HTML content to PDF documents and upload them to Unity Catalog Volumes.
+ Workflow for producing **synthetic PDF documents + paired test questions** as a Unity Catalog-resident dataset for RAG / unstructured-document retrieval evaluation on Databricks. The PDF-generation step uses standard local HTML → PDF tooling; the Databricks-specific value is the workflow shape — UC volume layout, paired question files, and integration with downstream Databricks retrieval / `ai_extract` / `ai_parse_document` evaluation.

Workflow for producing synthetic PDF documents + paired test questions as a Unity Catalog-resident dataset for Demos and RAG / unstructured-document retrieval evaluation on Databricks. The PDF-generation step uses standard local HTML → PDF tooling; the Databricks-specific value is the workflow shape — UC volume layout, paired question files, and integration with downstream Databricks retrieval / ai_extract / ai_parse_document evaluation.

Contributor Author

Done in 61d0d8b, used your wording.

Comment thread manifest.json Outdated
"databricks-unstructured-pdf-generation": {
"version": "0.0.1",
- "description": "Generate PDF documents from HTML and upload to Unity Catalog volumes. Use for creating test PDFs, demo documents, reports, or evaluation datasets.",
+ "description": "Build RAG / unstructured-document evaluation datasets on Databricks: generate synthetic PDFs locally, upload to Unity Catalog volumes, and pair each document with test questions for retrieval evaluation.",

Build RAG / unstructured-document for demos and evaluation datasets on Databricks: generate synthetic PDFs locally, upload to Unity Catalog volumes, and pair each document with test questions for retrieval

Contributor Author

manifest.json is auto-regenerated from SKILL.md frontmatter via scripts/skills.py generate — the new synopsis carries the broadened wording. Verified in 61d0d8b.
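
For readers unfamiliar with this kind of regeneration step, the following is a minimal sketch of deriving a manifest entry from SKILL.md frontmatter. It is not the actual `scripts/skills.py`; the function names and the naive `key: value` parser are assumptions for illustration, and only the `version` / `description` entry shape is taken from the manifest hunk quoted in this thread.

```python
import json


def parse_frontmatter(text: str) -> dict[str, str]:
    """Naively parse `key: value` pairs between the leading `---` fences."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break  # closing fence: frontmatter ends here
        # Split on the first colon only, so colons inside the value survive.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip().strip('"')
    return fields


def manifest_entry(skill_md: str, version: str = "0.0.1") -> dict:
    """Build a manifest entry keyed by skill name (hypothetical helper)."""
    meta = parse_frontmatter(skill_md)
    return {meta["name"]: {"version": version, "description": meta["description"]}}


skill_md = '''---
name: databricks-unstructured-pdf-generation
description: "Build RAG / unstructured-document evaluation datasets on Databricks"
---
# Body
'''
entry = manifest_entry(skill_md)
print(json.dumps(entry, indent=2))
```

Because the manifest is derived from the frontmatter, review suggestions on `manifest.json` are best applied to SKILL.md and then regenerated, which is what the reply above describes.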

Per @QuentinAmbard's review: a lot of real usage is generating
synthetic PDFs for demos with Knowledge Assistant, not just eval
datasets. Reword the frontmatter description, H1, and intro
paragraph to name both surfaces explicitly. Manifest synopsis
regenerates from the frontmatter.

This PR was prepared by Claude.