skills(unstructured-pdf): reframe around Databricks RAG-eval workflow #88
jamesbroadhead wants to merge 4 commits into
Per Lennart's audit on #73 ("this is almost not very Databricks-specific at all?"): the skill's value is the synthetic-PDFs-on-UC-volume workflow shape for RAG / unstructured-document retrieval evaluation, not the HTML → PDF generation step itself (any local HTML → PDF tool works for that: weasyprint, wkhtmltopdf, playwright pdf, plutoprint).

Reframe SKILL.md to put the Databricks-specific value up front:
- Frontmatter description now leads with "Build RAG / unstructured-document evaluation datasets on Databricks"; PDF generation is positioned as a step, not the headline.
- Body intro states explicitly that the Databricks-specific value is the workflow shape (UC volume layout, paired question files, hand-off to downstream `ai_extract` / `ai_parse_document` / mlflow.genai eval), not the local HTML → PDF tooling.
- Adds a one-line note: "if you only need ad-hoc PDFs, any local HTML → PDF tool works directly — this skill exists for the synthetic-dataset-on-UC end-to-end shape".

No content removed; this is a framing change so users (and reviewers) can tell what the Databricks-specific value of the skill is at a glance. Manifest regenerated to pick up the new description.

Co-authored-by: Isaac
…ed-pdf # Conflicts: # manifest.json
QuentinAmbard left a comment
This skill's main use case was creating datasets to quickly set up or demo KA (Knowledge Assistant) or entity extraction, more than building eval datasets. Let's make sure this is clear in the skill so that it gets picked up for this use case too.
  ---
  name: databricks-unstructured-pdf-generation
- description: "Generate PDF documents from HTML and upload to Unity Catalog volumes. Use for creating test PDFs, demo documents, reports, or evaluation datasets."
+ description: "Build RAG / unstructured-document evaluation datasets on Databricks: generate synthetic PDFs locally, upload to Unity Catalog volumes, and pair each document with test questions for retrieval evaluation."
A lot of our usage is generating datasets for demos, not only evaluation datasets. Could we keep something like this in the description instead:
Build RAG / unstructured-document evaluation datasets on Databricks and generate PDF documents for Knowledge Assistant demos: generate synthetic PDFs locally, upload to Unity Catalog volumes, and pair each document with test questions for retrieval evaluation.
  ---

- # PDF Generation from HTML
+ # Unstructured-Document Eval Datasets on Databricks
Unstructured-Document for Demos and Eval Datasets on Databricks
  # Unstructured-Document Eval Datasets on Databricks

- Convert HTML content to PDF documents and upload them to Unity Catalog Volumes.
+ Workflow for producing **synthetic PDF documents + paired test questions** as a Unity Catalog-resident dataset for RAG / unstructured-document retrieval evaluation on Databricks. The PDF-generation step uses standard local HTML → PDF tooling; the Databricks-specific value is the workflow shape — UC volume layout, paired question files, and integration with downstream Databricks retrieval / `ai_extract` / `ai_parse_document` evaluation.
Workflow for producing synthetic PDF documents + paired test questions as a Unity Catalog-resident dataset for Demos and RAG / unstructured-document retrieval evaluation on Databricks. The PDF-generation step uses standard local HTML → PDF tooling; the Databricks-specific value is the workflow shape — UC volume layout, paired question files, and integration with downstream Databricks retrieval / ai_extract / ai_parse_document evaluation.
  "databricks-unstructured-pdf-generation": {
  "version": "0.0.1",
- "description": "Generate PDF documents from HTML and upload to Unity Catalog volumes. Use for creating test PDFs, demo documents, reports, or evaluation datasets.",
+ "description": "Build RAG / unstructured-document evaluation datasets on Databricks: generate synthetic PDFs locally, upload to Unity Catalog volumes, and pair each document with test questions for retrieval evaluation.",
Build RAG / unstructured-document for demos and evaluation datasets on Databricks: generate synthetic PDFs locally, upload to Unity Catalog volumes, and pair each document with test questions for retrieval
manifest.json is auto-regenerated from SKILL.md frontmatter via scripts/skills.py generate — the new synopsis carries the broadened wording. Verified in 61d0d8b.
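To illustrate why editing the frontmatter is enough to change the manifest synopsis, here is a minimal sketch of a frontmatter-to-manifest step. It is not the actual `scripts/skills.py`: the helper names `parse_frontmatter` and `manifest_entry` are hypothetical, the parser only handles flat `key: value` lines, and the version is hard-coded to match the `0.0.1` shown in the diff.

```python
import json


def parse_frontmatter(text: str) -> dict:
    """Parse a flat `key: value` block delimited by '---' lines (hypothetical helper)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        # Split on the first colon only, so colons inside the value survive.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip().strip('"')
    return fields


def manifest_entry(skill_md: str, version: str = "0.0.1") -> dict:
    """Build one manifest entry keyed by the skill name (assumed manifest shape)."""
    meta = parse_frontmatter(skill_md)
    return {meta["name"]: {"version": version, "description": meta["description"]}}


SKILL_MD = '''---
name: databricks-unstructured-pdf-generation
description: "Build RAG / unstructured-document evaluation datasets on Databricks: generate synthetic PDFs locally, upload to Unity Catalog volumes, and pair each document with test questions for retrieval evaluation."
---

# Unstructured-Document Eval Datasets on Databricks
'''

entry = manifest_entry(SKILL_MD)
print(json.dumps(entry, indent=2))
```

Because the entry is derived entirely from the frontmatter, any rewording requested in review only needs to land in SKILL.md before regenerating.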
…ed-pdf # Conflicts: # manifest.json
Per @QuentinAmbard's review: a lot of real usage is generating synthetic PDFs for demos with Knowledge Assistant, not just eval datasets. Reword the frontmatter description, H1, and intro paragraph to name both surfaces explicitly. Manifest synopsis regenerates from the frontmatter. This PR was prepared by Claude.
Summary
Per Lennart's audit on #73, item #9:
`databricks-unstructured-pdf-generation` reads as "not very Databricks-specific" because the headline is local HTML → PDF generation, with the Databricks workflow (UC volume + RAG-eval dataset) buried.

This reframes the skill to put the Databricks-specific value up front, without removing any content.
Changes
experimental/databricks-unstructured-pdf-generation/SKILL.md:
- `description` now leads with "Build RAG / unstructured-document evaluation datasets on Databricks". PDF generation is positioned as a step, not the headline.
- Body intro states that the Databricks-specific value is the workflow shape (UC volume layout, paired question files, hand-off to downstream `ai_extract` / `ai_parse_document` / `mlflow.genai.evaluate()`), not the HTML → PDF tooling itself.

Manifest regenerated to pick up the new description. No deletions; this is a framing change.
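The "workflow shape" named above (UC volume layout plus paired question files) could look roughly like the following sketch. Everything specific here is an assumption for illustration: the `pdfs/` and `questions/` subfolders, the JSON schema of the question files, the `plan_dataset` helper, and the `main` / `rag_eval` / `synthetic_docs` catalog, schema, and volume names. The skill itself may use a different layout. The sketch only plans target paths and performs no upload.

```python
import json
from pathlib import PurePosixPath


def plan_dataset(volume_root: str, docs: dict[str, list[str]]) -> list[dict]:
    """Pair each synthetic PDF with a questions file under one volume root.

    `docs` maps a document stem to its list of test questions.
    The folder names and JSON shape are hypothetical.
    """
    root = PurePosixPath(volume_root)
    plan = []
    for stem, questions in sorted(docs.items()):
        plan.append({
            "pdf_path": str(root / "pdfs" / f"{stem}.pdf"),
            "questions_path": str(root / "questions" / f"{stem}.json"),
            "questions_json": json.dumps(
                {"document": f"{stem}.pdf", "questions": questions}
            ),
        })
    return plan


# Hypothetical catalog / schema / volume names.
plan = plan_dataset(
    "/Volumes/main/rag_eval/synthetic_docs",
    {
        "invoice_001": ["What is the invoice total?", "Who is the vendor?"],
        "contract_002": ["What is the termination notice period?"],
    },
)
for item in plan:
    print(item["pdf_path"], "<->", item["questions_path"])
```

Keeping the document stem identical across both folders means a downstream demo or evaluation job can join documents to their questions by filename alone.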
What this doesn't do
Two stronger alternatives in the audit are not implemented here:
`databricks-rag-test-data`. Has cross-PR implications (the a-d-k tombstone PR databricks-solutions/ai-dev-kit#546 references the current name) and changes the install command for existing users.

If either is preferred over this lighter reframe, happy to open a follow-up.
Test plan
- `python3 scripts/skills.py generate` clean.
- `python3 scripts/skills.py validate` passes.
- @lennartkats-db raised this; @dustinvannoy-db cc'd).

This pull request and its description were written by Claude.