skills(unstructured-pdf): reframe around Databricks RAG-eval workflow by jamesbroadhead · Pull Request #88 · databricks/databricks-agent-skills

jamesbroadhead · 2026-05-25T13:58:47Z

Summary

Per Lennart's audit on #73, item #9: databricks-unstructured-pdf-generation reads as "not very Databricks-specific" because the headline is local HTML → PDF generation, with the Databricks workflow (UC volume + RAG-eval dataset) buried.

This reframes the skill to put the Databricks-specific value up front, without removing any content.

Changes

experimental/databricks-unstructured-pdf-generation/SKILL.md:

Frontmatter description now leads with "Build RAG / unstructured-document evaluation datasets on Databricks". PDF generation is positioned as a step, not the headline.
Body intro states explicitly that the Databricks-specific value is the workflow shape (UC volume layout, paired question files, hand-off to downstream ai_extract / ai_parse_document / mlflow.genai.evaluate()), not the HTML → PDF tooling itself.
One-line escape hatch added: "If you only need ad-hoc PDFs (no Databricks workflow), any HTML → PDF tool works directly — this skill exists for the synthetic-dataset-on-UC end-to-end shape, not as a general PDF generator."

Manifest regenerated to pick up the new description. No deletions; this is a framing change.
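
As a rough illustration of the workflow shape described above, the sketch below pairs each generated PDF with a question file under a Unity Catalog volume path. The catalog, schema, and volume names, the `pdf/` and `questions/` subfolders, and the helper `plan_dataset` are all hypothetical, chosen for illustration; the skill's actual layout may differ. It only computes paths and payloads and does not upload anything.

```python
import json
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class DatasetEntry:
    pdf_path: str        # where the PDF would live in the UC volume
    questions_path: str  # paired question file for that PDF
    questions_json: str  # serialized test questions


def plan_dataset(volume_root: str, docs: dict[str, list[str]]) -> list[DatasetEntry]:
    """Map each document name to a PDF path plus a paired question file.

    The pdf/ and questions/ subfolders are an assumption made for this
    sketch, not the skill's documented structure.
    """
    root = PurePosixPath(volume_root)
    entries = []
    for name, questions in sorted(docs.items()):
        payload = {"document": f"{name}.pdf", "questions": questions}
        entries.append(
            DatasetEntry(
                pdf_path=str(root / "pdf" / f"{name}.pdf"),
                questions_path=str(root / "questions" / f"{name}.json"),
                questions_json=json.dumps(payload, sort_keys=True),
            )
        )
    return entries


# Hypothetical catalog/schema/volume names and sample question.
entries = plan_dataset(
    "/Volumes/main/demo/synthetic_docs",
    {"refund_policy": ["What is the refund window?"]},
)
print(entries[0].pdf_path)
print(entries[0].questions_path)
```

Keeping the question file keyed by the same document name is what lets a downstream evaluation step join each question back to its source PDF.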

What this doesn't do

Two stronger alternatives in the audit are not implemented here:

Trim the local HTML → PDF tooling and link to an external tool. Would destroy useful content; the templates and parallel-conversion patterns are still valuable for users following the end-to-end workflow.
Rename to e.g. databricks-rag-test-data. Has cross-PR implications (the a-d-k tombstone PR databricks-solutions/ai-dev-kit#546 references the current name) and changes the install command for existing users.

If either is preferred over this lighter reframe, happy to open a follow-up.

Test plan

python3 scripts/skills.py generate clean.
python3 scripts/skills.py validate passes.
CI green.
Reviewer sign-off (@lennartkats-db raised this; @dustinvannoy-db cc'd).

This pull request and its description were written by Claude.

Per Lennart's audit on #73 ("this is almost not very Databricks-specific at all?"): the skill's value is the synthetic-PDFs-on-UC-volume workflow shape for RAG / unstructured-document retrieval evaluation, not the HTML → PDF generation step itself (any local HTML → PDF tool works for that: weasyprint, wkhtmltopdf, playwright pdf, plutoprint).

Reframe SKILL.md to put the Databricks-specific value up front:

- Frontmatter description now leads with "Build RAG / unstructured-document evaluation datasets on Databricks"; PDF generation is positioned as a step, not the headline.
- Body intro states explicitly that the Databricks-specific value is the workflow shape (UC volume layout, paired question files, hand-off to downstream `ai_extract` / `ai_parse_document` / mlflow.genai eval), not the local HTML → PDF tooling.
- Adds a one-line note: "if you only need ad-hoc PDFs, any local HTML → PDF tool works directly; this skill exists for the synthetic-dataset-on-UC end-to-end shape".

No content removed; this is a framing change so users (and reviewers) can tell what the Databricks-specific value of the skill is at a glance.

Manifest regenerated to pick up the new description.

Co-authored-by: Isaac

QuentinAmbard

This skill's main use case was creating datasets to quickly set up or demo KA or entity extraction (more than eval datasets). Let's make sure this is clear in the skill so that it gets picked up for this use case too.

QuentinAmbard · 2026-05-27T10:06:57Z

 ---
 name: databricks-unstructured-pdf-generation
-description: "Generate PDF documents from HTML and upload to Unity Catalog volumes. Use for creating test PDFs, demo documents, reports, or evaluation datasets."
+description: "Build RAG / unstructured-document evaluation datasets on Databricks: generate synthetic PDFs locally, upload to Unity Catalog volumes, and pair each document with test questions for retrieval evaluation."


A lot of our usage is generating datasets for demos, not only evaluation datasets. Could we keep something like this in the description instead:
Build RAG / unstructured-document evaluation datasets on Databricks and generate PDF documents for demos with Knowledge Assistant: generate synthetic PDFs locally, upload to Unity Catalog volumes, and pair each document with test questions for retrieval evaluation.

Done in 61d0d8b, used your wording.

QuentinAmbard · 2026-05-27T10:07:16Z

 ---

-# PDF Generation from HTML
+# Unstructured-Document Eval Datasets on Databricks


Unstructured-Document for Demos and Eval Datasets on Databricks

Done in 61d0d8b, used your wording.

QuentinAmbard · 2026-05-27T10:08:12Z

+# Unstructured-Document Eval Datasets on Databricks

-Convert HTML content to PDF documents and upload them to Unity Catalog Volumes.
+Workflow for producing **synthetic PDF documents + paired test questions** as a Unity Catalog-resident dataset for RAG / unstructured-document retrieval evaluation on Databricks. The PDF-generation step uses standard local HTML → PDF tooling; the Databricks-specific value is the workflow shape — UC volume layout, paired question files, and integration with downstream Databricks retrieval / `ai_extract` / `ai_parse_document` evaluation.


Workflow for producing synthetic PDF documents + paired test questions as a Unity Catalog-resident dataset for Demos and RAG / unstructured-document retrieval evaluation on Databricks. The PDF-generation step uses standard local HTML → PDF tooling; the Databricks-specific value is the workflow shape — UC volume layout, paired question files, and integration with downstream Databricks retrieval / ai_extract / ai_parse_document evaluation.

Done in 61d0d8b, used your wording.

QuentinAmbard · 2026-05-27T10:08:45Z

    "databricks-unstructured-pdf-generation": {
      "version": "0.0.1",
-      "description": "Generate PDF documents from HTML and upload to Unity Catalog volumes. Use for creating test PDFs, demo documents, reports, or evaluation datasets.",
+      "description": "Build RAG / unstructured-document evaluation datasets on Databricks: generate synthetic PDFs locally, upload to Unity Catalog volumes, and pair each document with test questions for retrieval evaluation.",


Build RAG / unstructured-document for demos and evaluation datasets on Databricks: generate synthetic PDFs locally, upload to Unity Catalog volumes, and pair each document with test questions for retrieval

manifest.json is auto-regenerated from SKILL.md frontmatter via scripts/skills.py generate — the new synopsis carries the broadened wording. Verified in 61d0d8b.
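
To make the frontmatter-to-manifest relationship concrete, here is a minimal sketch of how a manifest entry could be derived from SKILL.md frontmatter. This is not the repository's `scripts/skills.py`; the helpers `parse_frontmatter` and `manifest_entry` are hypothetical, the parser only handles flat `key: value` lines, and the description in the sample is abbreviated.

```python
import json


def parse_frontmatter(text: str) -> dict[str, str]:
    """Parse flat `key: value` frontmatter delimited by `---` lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        # Split on the first colon only, so colons inside the value survive.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip().strip('"')
    return fields


def manifest_entry(skill_md: str, version: str) -> dict[str, dict[str, str]]:
    """Build one manifest entry keyed by the skill name (assumed shape)."""
    meta = parse_frontmatter(skill_md)
    return {meta["name"]: {"version": version, "description": meta["description"]}}


# Abbreviated sample frontmatter; the real description is longer.
skill_md = '''---
name: databricks-unstructured-pdf-generation
description: "Build RAG / unstructured-document evaluation datasets on Databricks: generate synthetic PDFs locally."
---
# Body
'''
print(json.dumps(manifest_entry(skill_md, "0.0.1"), indent=2))
```

Deriving the manifest from a single source like this is why a wording change in the frontmatter shows up in manifest.json without a manual edit.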

@QuentinAmbard

Per @QuentinAmbard's review: a lot of real usage is generating synthetic PDFs for demos with Knowledge Assistant, not just eval datasets. Reword the frontmatter description, H1, and intro paragraph to name both surfaces explicitly. Manifest synopsis regenerates from the frontmatter. This PR was prepared by Claude.

jamesbroadhead requested review from lennartkats-db and simonfaltum as code owners May 25, 2026 13:58

jamesbroadhead requested a review from lennartkats-db May 25, 2026 13:58

jamesbroadhead requested a review from dustinvannoy-db as a code owner May 25, 2026 13:58

jamesbroadhead requested a review from dustinvannoy-db May 25, 2026 13:58

jamesbroadhead requested a review from a team as a code owner May 25, 2026 13:58

Merge remote-tracking branch 'origin/main' into jb/reframe-unstructur…

d0668e9

…ed-pdf # Conflicts: # manifest.json

QuentinAmbard reviewed May 27, 2026

jamesbroadhead added 2 commits May 27, 2026 14:41

Merge remote-tracking branch 'origin/main' into jb/reframe-unstructur…

52f9d99

…ed-pdf # Conflicts: # manifest.json

jamesbroadhead requested a review from QuentinAmbard May 28, 2026 11:10

jamesbroadhead wants to merge 4 commits into main from jb/reframe-unstructured-pdf
