diff --git a/databricks-skills/databricks-ai-functions/1-task-functions.md b/databricks-skills/databricks-ai-functions/1-task-functions.md index a94159ea..53afece7 100644 --- a/databricks-skills/databricks-ai-functions/1-task-functions.md +++ b/databricks-skills/databricks-ai-functions/1-task-functions.md @@ -1,6 +1,6 @@ # Task-Specific AI Functions — Full Reference -These functions require no model endpoint selection. They call pre-configured Foundation Model APIs optimized for each task. All require DBR 15.1+ (15.4 ML LTS for batch); `ai_parse_document` requires DBR 17.1+. +These functions require no model endpoint selection. They call pre-configured Foundation Model APIs optimized for each task. All require DBR 15.1+ (15.4 ML LTS for batch); `ai_parse_document` requires DBR 17.3+; `ai_prep_search` requires DBR 18.2+ (serverless env v3+). --- @@ -34,10 +34,13 @@ df.withColumn("sentiment", expr("ai_analyze_sentiment(review_text)")).display() - With descriptions: `'{"billing_error": "Payment, invoice, or refund issues", "product_defect": "Any malfunction or bug"}'` (descriptions up to 1000 chars each) - 2–500 labels, each 1–100 characters - `options`: optional MAP\<STRING, STRING\>: + - `version`: `"2.0"` (recommended) or `"1.0"` for backward compatibility - `instructions`: task context to improve accuracy (max 20,000 chars) - `multilabel`: `"true"` to return multiple matching labels (default `"false"`) -Returns VARIANT. Returns `NULL` if content is `NULL`. +Returns VARIANT `{"response": ["label", ...], "error_message": null}`. Returns `NULL` if content is `NULL`. + +**Constraints:** total input + labels context capped at **128,000 tokens**; not available on Databricks SQL Classic. 
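The label limits listed above can be checked on the client before a query is submitted. The following is a minimal sketch in plain Python, assuming labels arrive either as a list of names or as a name-to-description dict. `validate_labels` is a hypothetical helper, not part of any Databricks API, and it does not attempt to check the 128,000-token context cap.

```python
import json

# Limits quoted in this reference for ai_classify labels.
MIN_LABELS, MAX_LABELS = 2, 500
MAX_LABEL_CHARS = 100
MAX_DESCRIPTION_CHARS = 1000


def validate_labels(labels):
    """Return the JSON string to pass as the labels argument, or raise ValueError.

    Hypothetical helper: `labels` is a list of names or a dict of name -> description.
    """
    names = list(labels)  # for a dict this yields the keys
    if not MIN_LABELS <= len(names) <= MAX_LABELS:
        raise ValueError(f"expected 2-500 labels, got {len(names)}")
    for name in names:
        if not 1 <= len(name) <= MAX_LABEL_CHARS:
            raise ValueError(f"label {name!r} must be 1-100 characters")
    if isinstance(labels, dict):
        for name, description in labels.items():
            if len(description) > MAX_DESCRIPTION_CHARS:
                raise ValueError(f"description for {name!r} exceeds 1000 characters")
    return json.dumps(labels)


labels_json = validate_labels({
    "billing_error": "Payment, invoice, or refund issues",
    "product_defect": "Any malfunction or bug",
})
```

The returned string can then be embedded as the second argument of `ai_classify`; failing fast here avoids spending a batch run on a label set the function would reject.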
```sql -- simple labels @@ -91,12 +94,15 @@ df.withColumn( "line_items": {"type": "array", "items": {"type": "object", "properties": {...}}} } ``` - - Supported types: `string`, `integer`, `number`, `boolean`, `enum` - - Max 128 fields, 7 nesting levels, 500 enum values + - Supported types: `string`, `integer`, `number`, `boolean`, `enum`, `object` (with `properties`), `array` (with `items`) + - Max 128 fields, field names up to 150 chars, 7 nesting levels, 500 enum values, 128,000 token total context - `options`: optional MAP\<STRING, STRING\>: + - `version`: `"2.1"` (recommended) / `"2.0"` / `"1.0"` - `instructions`: task context to improve extraction quality (max 20,000 chars) + - `enableCitations`: `"true"` to attach `citation_ids` to each extracted field + - `enableConfidenceScores`: `"true"` to attach a per-field `confidence_score` (0–1) -Returns VARIANT `{"response": {...}, "error_message": null}`. Returns `NULL` if content is `NULL`. +Returns VARIANT `{"response": {...}, "error_message": null}`. With `enableCitations` or `enableConfidenceScores` enabled, each scalar field becomes an object `{"value": ..., "citation_ids": [...], "confidence_score": 0.x}` and a `metadata` block is added at the top level. Returns `NULL` if content is `NULL`. ```sql -- simple schema @@ -129,6 +135,32 @@ df = df.withColumn( df.display() ``` +### Version 2.1: citations and confidence scores + +Pass `'version', '2.1'` in the `options` MAP, together with `enableCitations` and/or `enableConfidenceScores`, to attach provenance and reliability metadata to each extracted field. Useful for review queues and downstream filtering by confidence. 
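To make the review-queue idea concrete, here is a minimal sketch in plain Python that splits one extraction result into accepted values and fields needing review. It assumes the result has already been converted to a Python dict with the `{"value", "citation_ids", "confidence_score"}` shape described in this section. `split_by_confidence`, the sample values, and the 0.8 threshold are hypothetical choices for illustration.

```python
def split_by_confidence(result, threshold=0.8):
    """Split an ai_extract-style result into accepted values and fields to review.

    Assumes each field under "response" is an object with "value" and
    "confidence_score"; a missing score is treated as low confidence.
    """
    accepted, needs_review = {}, []
    # "response" may be absent or None when the call reported an error.
    for field, payload in (result.get("response") or {}).items():
        score = payload.get("confidence_score")
        if score is not None and score >= threshold:
            accepted[field] = payload.get("value")
        else:
            needs_review.append(field)
    return accepted, needs_review


# Made-up sample in the documented shape, for illustration only.
sample = {
    "response": {
        "invoice_id": {"value": "INV-001", "citation_ids": [0], "confidence_score": 0.97},
        "total_amount": {"value": 1250.0, "citation_ids": [3], "confidence_score": 0.55},
    },
    "error_message": None,
}

accepted, needs_review = split_by_confidence(sample)
```

Treating a missing score as low confidence is a deliberately conservative choice: a field only bypasses review when the score is present and clears the threshold.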
+ +```sql +SELECT ai_extract( + document_text, + '["invoice_id", "vendor_name", "total_amount"]', + MAP( + 'version', '2.1', + 'enableCitations', 'true', + 'enableConfidenceScores', 'true' + ) +) AS extracted +FROM parsed_documents; + +-- Each scalar field is now an object: {value, citation_ids, confidence_score} +-- Access (assumes the result above was saved as table extracted_invoices): +SELECT + extracted:response:invoice_id:value::STRING AS invoice_id, + extracted:response:invoice_id:confidence_score::DOUBLE AS invoice_id_conf, + extracted:response:total_amount:value::DOUBLE AS total_amount, + extracted:metadata AS metadata +FROM extracted_invoices; +``` + --- ## `ai_fix_grammar` @@ -300,38 +332,44 @@ df.withColumn( **Docs:** https://docs.databricks.com/aws/en/sql/language-manual/functions/ai_parse_document -**Requires:** DBR 17.1+ +**Requires:** DBR 17.3+ (serverless env v3+ for VARIANT). Region-restricted — check feature availability. **Syntax:** `ai_parse_document(content [, options])` - `content`: BINARY — document content loaded from `read_files()` or `spark.read.format("binaryFile")` - `options`: MAP\<STRING, STRING\> (optional) — parsing configuration -**Supported formats:** PDF, JPG/JPEG, PNG, DOCX, PPTX +**Supported formats:** PDF, JPG/JPEG, PNG, TIFF/TIF, DOC/DOCX, PPT/PPTX -Returns a VARIANT with pages, elements (text paragraphs, tables, figures, headers, footers), bounding boxes, and error metadata. +Returns a VARIANT with pages, elements (text, tables, figures, titles, captions, section headers, page headers/footers, page numbers, footnotes), bounding boxes, confidence scores, and error metadata. 
**Options:** | Key | Values | Description | |-----|--------|-------------| | `version` | `'2.0'` | Output schema version | -| `imageOutputPath` | Volume path | Save rendered page images | -| `descriptionElementTypes` | `''`, `'figure'`, `'*'` | AI-generated descriptions (default: `'*'` for all) | +| `imageOutputPath` | Volume path | Save rendered page images to a UC Volume | +| `descriptionElementTypes` | `''`, `'figure'`, `'*'` | AI-generated descriptions (default: `'*'` for all). Set to `''` to disable and reduce cost. | +| `pageRange` | e.g. `'1,3,5-10'` | Restrict parsing to a subset of pages (1-indexed) | -**Output schema:** +**Output schema (v2.0):** ``` document -├── pages[] -- page id, image_uri +├── pages[] -- id, image_uri └── elements[] -- extracted content - ├── type -- "text", "table", "figure", etc. + ├── id -- per-element id + ├── type -- text | table | figure | title | caption | section_header + │ -- | page_header | page_footer | page_number | footnote ├── content -- extracted text - ├── bbox -- bounding box coordinates - └── description -- AI-generated description -metadata -- file info, schema version -error_status[] -- errors per page (if any) + ├── confidence -- DOUBLE 0–1 + ├── bbox -- [{coord:[...], page_id}] + └── description -- AI-generated description (figures/tables when enabled) +metadata -- id, version, file_metadata +error_status[] -- {error_message, page_id} per page (if any) ``` +**Limits:** max 500 pages per document, max 100 MB file size. 
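The v2.0 output schema above can be walked with ordinary dict access once a parsed result is available as a Python object. A minimal sketch, assuming the VARIANT has been converted to a dict with the field names shown in the schema. `body_text`, the set of element types to skip, and the sample document are hypothetical.

```python
# Element types that usually add noise to retrieval text (a judgment call, not a rule).
LAYOUT_TYPES = {"page_header", "page_footer", "page_number"}


def body_text(parsed, min_confidence=0.0):
    """Join the content of non-layout elements, in document order.

    Assumes `parsed` mirrors the v2.0 schema: parsed["document"]["elements"]
    is a list of dicts with "type", "content" and "confidence".
    """
    parts = []
    for element in parsed.get("document", {}).get("elements", []):
        if element.get("type") in LAYOUT_TYPES:
            continue
        content = (element.get("content") or "").strip()
        confidence = element.get("confidence")
        if confidence is None:
            confidence = 1.0  # keep elements that carry no confidence value
        if content and confidence >= min_confidence:
            parts.append(content)
    return "\n\n".join(parts)


# Made-up document in the documented shape, for illustration only.
sample = {
    "document": {
        "elements": [
            {"id": 0, "type": "page_header", "content": "ACME Corp", "confidence": 0.99},
            {"id": 1, "type": "title", "content": "Master Services Agreement", "confidence": 0.98},
            {"id": 2, "type": "text", "content": "This agreement is made between...", "confidence": 0.91},
            {"id": 3, "type": "page_number", "content": "1", "confidence": 0.99},
        ]
    }
}

text = body_text(sample)
```

Filtering on `type` rather than on content length keeps short but meaningful elements (a one-word title, for instance) while still dropping page furniture.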
+ ```sql -- Parse and extract text blocks SELECT @@ -353,6 +391,13 @@ SELECT ai_parse_document( ) ) AS parsed FROM read_files('/Volumes/catalog/schema/volume/invoices/', format => 'binaryFile'); + +-- Parse only specific pages (cheaper for large documents) +SELECT ai_parse_document( + content, + map('version', '2.0', 'pageRange', '1,3,5-10') +) AS parsed +FROM read_files('/Volumes/catalog/schema/volume/contracts/', format => 'binaryFile'); ``` ```python @@ -380,6 +425,106 @@ df.display() ``` **Limitations:** +- Max 500 pages per document, max 100 MB file size - Processing is slow for dense or low-resolution documents -- Suboptimal for non-Latin alphabets and digitally signed PDFs +- Suboptimal for non-Latin alphabets (e.g., Japanese, Korean in images) and digitally signed PDFs - Custom models not supported — always uses the built-in parsing model + +--- + +## `ai_prep_search` + +**Docs:** https://docs.databricks.com/aws/en/sql/language-manual/functions/ai_prep_search + +**Requires:** DBR **18.2+** (serverless env v3+ for VARIANT support). + +Takes the VARIANT output of `ai_parse_document` and returns RAG-ready chunks. The function performs: +1. **Semantic chunking** — splits document content into retrieval-sized chunks at natural boundaries (paragraphs, sections, tables). +2. **Context enrichment** — adds document title, section headers, page numbers, and captions to each chunk's embedding text so Vector Search can match on context, not just chunk content. + +Use this instead of hand-rolled `variant_get` + `explode` + `md5` chunking when feeding `ai_parse_document` output into Databricks Vector Search. 
+ +**Syntax:** `ai_prep_search(parsed [, options])` +- `parsed`: VARIANT — output from `ai_parse_document` +- `options`: optional MAP\<STRING, STRING\>: + - `version`: output schema version (major.minor; minor upgrades are backward-compatible) + +**Returns:** VARIANT with chunks ready for Vector Search: + +``` +chunks[] +├── chunk_id -- unique id (document_id + position) — use as PK +├── chunk_position -- ordinal within the document +├── chunk_to_retrieve -- raw chunk text (return this to the LLM) +└── chunk_to_embed -- context-enriched text (use this as the embedding source) +pages[] -- page index + image_uri (when imageOutputPath was set on ai_parse_document) +source_uri -- input document path +error_status -- per-page error info, if any +``` + +**End-to-end SQL — parse, prep, persist for Vector Search:** + +```sql +CREATE OR REPLACE TABLE catalog.schema.parsed_chunks AS +WITH parsed AS ( + SELECT + path AS source_path, + ai_parse_document(content) AS parsed + FROM read_files('/Volumes/catalog/schema/docs/', format => 'binaryFile') +), +prepped AS ( + SELECT + source_path, + ai_prep_search(parsed) AS prep + FROM parsed +), +chunks AS ( + SELECT + source_path, + explode(variant_get(prep, '$.chunks', 'ARRAY<VARIANT>')) AS chunk + FROM prepped +) +SELECT + variant_get(chunk, '$.chunk_id', 'STRING') AS chunk_id, + variant_get(chunk, '$.chunk_position', 'INT') AS chunk_position, + variant_get(chunk, '$.chunk_to_retrieve', 'STRING') AS chunk_to_retrieve, + variant_get(chunk, '$.chunk_to_embed', 'STRING') AS chunk_to_embed, + source_path, + current_timestamp() AS prepped_at +FROM chunks; + +-- Enable CDF so Vector Search Delta Sync picks up incremental changes +ALTER TABLE catalog.schema.parsed_chunks +SET TBLPROPERTIES (delta.enableChangeDataFeed = true); +``` + +**PySpark equivalent:** + +```python +from pyspark.sql.functions import expr, current_timestamp + +chunks_df = ( + spark.read.format("binaryFile") + .load("/Volumes/catalog/schema/docs/") + .withColumn("parsed", 
expr("ai_parse_document(content)")) + .withColumn("prep", expr("ai_prep_search(parsed)")) + .withColumn("chunk", expr("explode(variant_get(prep, '$.chunks', 'ARRAY<VARIANT>'))")) + .selectExpr( + "variant_get(chunk, '$.chunk_id', 'STRING') AS chunk_id", + "variant_get(chunk, '$.chunk_position', 'INT') AS chunk_position", + "variant_get(chunk, '$.chunk_to_retrieve', 'STRING') AS chunk_to_retrieve", + "variant_get(chunk, '$.chunk_to_embed', 'STRING') AS chunk_to_embed", + "path AS source_path", + ) + .withColumn("prepped_at", current_timestamp()) +) + +chunks_df.write.format("delta").mode("overwrite").saveAsTable("catalog.schema.parsed_chunks") +``` + +**Vector Search integration:** Point a Delta Sync index at this table with `chunk_to_embed` as the embedding source column and `chunk_id` as the primary key. The `chunk_to_retrieve` column is what you return to the LLM at query time. + +**Tips:** +- Pass `imageOutputPath` on the upstream `ai_parse_document` call if you want page image URIs available in the prep output for multimodal retrieval. +- Schema is versioned major.minor; minor upgrades are backward-compatible — pin `version` only if you need to lock schema across deployments. +- On DBR < 18.2, fall back to manual chunking via `variant_get` + `explode` on `ai_parse_document` output. diff --git a/databricks-skills/databricks-ai-functions/4-document-processing-pipeline.md b/databricks-skills/databricks-ai-functions/4-document-processing-pipeline.md index 37498f49..4ff06da7 100644 --- a/databricks-skills/databricks-ai-functions/4-document-processing-pipeline.md +++ b/databricks-skills/databricks-ai-functions/4-document-processing-pipeline.md @@ -13,6 +13,7 @@ When processing documents with AI Functions, apply this order of preference for | Stage | Preferred function | Use `ai_query` when... 
| |---|---|---| | Parse binary docs (PDF, DOCX, images) | `ai_parse_document` | Need image-level reasoning | +| Prepare parsed docs for Vector Search | `ai_prep_search` (DBR 18.2+) | Need a custom chunking strategy or DBR < 18.2 | | Extract fields from text (flat or nested) | `ai_extract` | Schema exceeds 128 fields or 7 nesting levels | | Classify document type or status | `ai_classify` | More than 20 categories | | Score item similarity / matching | `ai_similarity` | Need cross-document reasoning | @@ -263,37 +264,39 @@ def processing_errors(): --- -## Custom RAG Pipeline — Parse → Chunk → Index → Query +## Custom RAG Pipeline — Parse → Prep → Index → Query -When the goal is retrieval-augmented generation rather than field extraction, use this pipeline to parse documents, chunk them into a Delta table, and index with Vector Search. +When the goal is retrieval-augmented generation rather than field extraction, use this pipeline: `ai_parse_document` to read binary files, `ai_prep_search` to chunk and enrich, then a Vector Search Delta Sync index over the result. -### Step 1 — Parse and Chunk into a Delta Table +**Requires DBR 18.2+** (for `ai_prep_search`). On older runtimes, see the legacy manual-chunking fallback at the end of this section. -`ai_parse_document` returns a VARIANT. Use `variant_get` with an explicit `ARRAY<VARIANT>` cast before calling `explode`, since `explode()` does not accept raw VARIANT values. +### Step 1 — Parse and Prep into a Delta Table + +`ai_prep_search` takes the VARIANT output of `ai_parse_document` and returns RAG-ready chunks (`chunk_id`, `chunk_position`, `chunk_to_retrieve`, `chunk_to_embed`). The `chunk_to_embed` column is enriched with document title, section headers, page numbers, and captions — Vector Search will match on that context, not just chunk text. 
```sql CREATE OR REPLACE TABLE catalog.schema.parsed_chunks AS WITH parsed AS ( SELECT - path, - ai_parse_document(content) AS doc + path AS source_path, + ai_parse_document(content) AS parsed FROM read_files('/Volumes/catalog/schema/volume/docs/', format => 'binaryFile') ), -elements AS ( +prepped AS ( SELECT - path, - explode(variant_get(doc, '$.document.elements', 'ARRAY<VARIANT>')) AS element + source_path, + ai_prep_search(parsed) AS prep FROM parsed ) SELECT - md5(concat(path, variant_get(element, '$.content', 'STRING'))) AS chunk_id, - path AS source_path, - variant_get(element, '$.content', 'STRING') AS content, - variant_get(element, '$.type', 'STRING') AS element_type, - current_timestamp() AS parsed_at -FROM elements -WHERE variant_get(element, '$.content', 'STRING') IS NOT NULL - AND length(trim(variant_get(element, '$.content', 'STRING'))) > 10; + variant_get(chunk, '$.chunk_id', 'STRING') AS chunk_id, + variant_get(chunk, '$.chunk_position', 'INT') AS chunk_position, + variant_get(chunk, '$.chunk_to_retrieve', 'STRING') AS chunk_to_retrieve, + variant_get(chunk, '$.chunk_to_embed', 'STRING') AS chunk_to_embed, + source_path, + current_timestamp() AS prepped_at +FROM prepped +LATERAL VIEW explode(variant_get(prep, '$.chunks', 'ARRAY<VARIANT>')) c AS chunk; ``` ### Step 1a (Production) — Incremental Parsing with Structured Streaming @@ -310,7 +313,7 @@ from pyspark.sql.functions import col, current_timestamp, expr files_df = ( spark.readStream.format("binaryFile") - .option("pathGlobFilter", "*.{pdf,jpg,jpeg,png}") + .option("pathGlobFilter", "*.{pdf,jpg,jpeg,png,tif,tiff,docx,pptx}") .option("recursiveFileLookup", "true") .load("/Volumes/catalog/schema/volume/docs/") ) @@ -338,49 +341,46 @@ parsed_df = ( ) ``` -**Stage 2 — Extract text from parsed VARIANT (streaming):** +**Stage 2 — Prep chunks for Vector Search (streaming):** -Uses `transform()` to extract element content from the VARIANT array, and `try_cast` for safe access. 
Error rows are preserved but flagged. 
+`ai_prep_search` handles semantic chunking + context enrichment in one call. Skip rows that hit parse errors. ```python -from pyspark.sql.functions import col, concat_ws, expr, lit, when +from pyspark.sql.functions import col, expr, lit, when parsed_stream = spark.readStream.format("delta").table("catalog.schema.parsed_documents_raw") -text_df = ( +prepped_df = ( parsed_stream - .withColumn("text", - when( - expr("try_cast(parsed:error_status AS STRING)").isNotNull(), lit(None) - ).otherwise( - concat_ws("\n\n", expr(""" - transform( - try_cast(parsed:document:elements AS ARRAY<VARIANT>), - element -> try_cast(element:content AS STRING) - ) - """)) - ) + .filter(expr("try_cast(parsed:error_status AS STRING) IS NULL")) + .withColumn("prep", expr("ai_prep_search(parsed)")) + .withColumn("chunk", expr("explode(variant_get(prep, '$.chunks', 'ARRAY<VARIANT>'))")) + .selectExpr( + "variant_get(chunk, '$.chunk_id', 'STRING') AS chunk_id", + "variant_get(chunk, '$.chunk_position', 'INT') AS chunk_position", + "variant_get(chunk, '$.chunk_to_retrieve', 'STRING') AS chunk_to_retrieve", + "variant_get(chunk, '$.chunk_to_embed', 'STRING') AS chunk_to_embed", + "path AS source_path", + "parsed_at", ) - .withColumn("error_status", expr("try_cast(parsed:error_status AS STRING)")) - .select("path", "text", "error_status", "parsed_at") ) ( - text_df.writeStream.format("delta") + prepped_df.writeStream.format("delta") .outputMode("append") - .option("checkpointLocation", "/Volumes/catalog/schema/checkpoints/02_text") + .option("checkpointLocation", "/Volumes/catalog/schema/checkpoints/02_prep") .option("mergeSchema", "true") .trigger(availableNow=True) - .toTable("catalog.schema.parsed_documents_text") + .toTable("catalog.schema.parsed_chunks") ) ``` Key techniques: - **`repartition` by file hash** — parallelizes `ai_parse_document` across workers - **`trigger(availableNow=True)`** — processes all pending files then stops (batch-like) -- **Checkpoints** — exactly-once guarantee; no 
re-parsing on 
re-runs -- **`transform()` + `try_cast`** — safer than `explode` + `variant_get` for text extraction -- **Separate stages with independent checkpoints** — parse and text extraction can fail/retry independently +- **Checkpoints** — exactly-once guarantee; no re-parsing or re-prepping on re-runs +- **`ai_prep_search`** — handles semantic chunking + context enrichment; no manual `transform()` + length filters needed +- **Separate stages with independent checkpoints** — parse and prep can fail/retry independently ### Step 1b — Enable Change Data Feed @@ -393,16 +393,47 @@ SET TBLPROPERTIES (delta.enableChangeDataFeed = true); ### Step 2 — Create a Vector Search Index and Query It -Use the **[databricks-vector-search](../databricks-vector-search/SKILL.md)** skill to create a Delta Sync index on the chunked table and query it. Ensure CDF is enabled first (Step 1b above). +Use the **[databricks-vector-search](../databricks-vector-search/SKILL.md)** skill to create a Delta Sync index on `catalog.schema.parsed_chunks`: +- **Primary key:** `chunk_id` +- **Embedding source column:** `chunk_to_embed` (context-enriched text — do not embed `chunk_to_retrieve`) +- **Return column at query time:** `chunk_to_retrieve` (raw chunk text for the LLM) + +Ensure CDF is enabled first (Step 1b above). ### RAG-Specific Issues | Issue | Solution | |-------|----------| -| `explode()` fails with VARIANT | `explode()` requires ARRAY, not VARIANT. Use `variant_get(doc, '$.document.elements', 'ARRAY<VARIANT>')` to cast before exploding | -| Short/noisy chunks | Filter with `length(trim(...)) > 10` — parsing produces tiny fragments (page numbers, headers) that pollute the index | -| Re-parsing unchanged documents | Use Structured Streaming with checkpoints — see Step 1a above | -| Region not supported | US/EU regions only, or enable cross-geography routing | +| `ai_prep_search` not found | Requires DBR **18.2+** (serverless env v3+). 
Use the legacy manual-chunking fallback below on older runtimes. 
| +| Embedding the wrong column | Embed `chunk_to_embed` (enriched with doc title/headers/page), **not** `chunk_to_retrieve`. Return `chunk_to_retrieve` to the LLM. | +| `explode()` fails with VARIANT | `explode()` requires ARRAY, not VARIANT. Cast first: `explode(variant_get(prep, '$.chunks', 'ARRAY<VARIANT>'))` | +| Region not supported | `ai_parse_document` / `ai_prep_search` are region-restricted. Check feature availability or enable cross-geography routing. | + +### Legacy fallback — DBR < 18.2 (no `ai_prep_search`) + +If `ai_prep_search` is unavailable, fall back to manual chunking on `ai_parse_document` element output. Filter out short/noisy fragments (page numbers, headers) that pollute the index: + +```sql +CREATE OR REPLACE TABLE catalog.schema.parsed_chunks AS +WITH parsed AS ( + SELECT path, ai_parse_document(content) AS doc + FROM read_files('/Volumes/catalog/schema/volume/docs/', format => 'binaryFile') +), +elements AS ( + SELECT path, explode(variant_get(doc, '$.document.elements', 'ARRAY<VARIANT>')) AS element + FROM parsed +) +SELECT + md5(concat(path, variant_get(element, '$.content', 'STRING'))) AS chunk_id, + path AS source_path, + variant_get(element, '$.content', 'STRING') AS chunk_to_retrieve, + variant_get(element, '$.content', 'STRING') AS chunk_to_embed, -- no enrichment + variant_get(element, '$.type', 'STRING') AS element_type, + current_timestamp() AS parsed_at +FROM elements +WHERE variant_get(element, '$.content', 'STRING') IS NOT NULL + AND length(trim(variant_get(element, '$.content', 'STRING'))) > 10; +``` --- @@ -497,7 +528,7 @@ with mlflow.start_run(): ## Tips -1. **Parse → prep → enrich** — run `ai_parse_document` first. For RAG, pipe its VARIANT into `ai_prep_search` (DBR 18.2+) for chunking + context enrichment. 
For extraction, feed its text output to task-specific functions. Never pass raw binary to `ai_query`. 2. **Flat or nested fields → `ai_extract`; deeply nested JSON exceeding 7 levels → `ai_query`** — pass `MAP('version', '2.0')` and access results through `:response`. 3. **`failOnError => false` is mandatory in batch** — write errors to a sidecar `_errors` table rather than crashing the pipeline. 4. **Truncate before sending to `ai_query`** — use `LEFT(text, 6000)` or chunk long documents to stay within context window limits. diff --git a/databricks-skills/databricks-ai-functions/SKILL.md b/databricks-skills/databricks-ai-functions/SKILL.md index 19897d8a..2cbf2b70 100644 --- a/databricks-skills/databricks-ai-functions/SKILL.md +++ b/databricks-skills/databricks-ai-functions/SKILL.md @@ -1,6 +1,6 @@ --- name: databricks-ai-functions -description: "Use Databricks built-in AI Functions (ai_classify, ai_extract, ai_summarize, ai_mask, ai_translate, ai_fix_grammar, ai_gen, ai_analyze_sentiment, ai_similarity, ai_parse_document, ai_query, ai_forecast) to add AI capabilities directly to SQL and PySpark pipelines without managing model endpoints. Also covers document parsing and building custom RAG pipelines (parse → chunk → index → query)." +description: "Use Databricks built-in AI Functions (ai_classify, ai_extract, ai_summarize, ai_mask, ai_translate, ai_fix_grammar, ai_gen, ai_analyze_sentiment, ai_similarity, ai_parse_document, ai_prep_search, ai_query, ai_forecast) to add AI capabilities directly to SQL and PySpark pipelines without managing model endpoints. Also covers document parsing and building custom RAG pipelines (parse → prep_search → index → query)." 
--- # Databricks AI Functions @@ -16,7 +16,7 @@ There are three categories: | Category | Functions | Use when | |---|---|---| -| **Task-specific** | `ai_analyze_sentiment`, `ai_classify`, `ai_extract`, `ai_fix_grammar`, `ai_gen`, `ai_mask`, `ai_similarity`, `ai_summarize`, `ai_translate`, `ai_parse_document` | The task is well-defined — prefer these always | +| **Task-specific** | `ai_analyze_sentiment`, `ai_classify`, `ai_extract`, `ai_fix_grammar`, `ai_gen`, `ai_mask`, `ai_similarity`, `ai_summarize`, `ai_translate`, `ai_parse_document`, `ai_prep_search` | The task is well-defined — prefer these always | | **General-purpose** | `ai_query` | Complex nested JSON, custom endpoints, multimodal — **last resort only** | | **Table-valued** | `ai_forecast` | Time series forecasting | @@ -34,13 +34,15 @@ There are three categories: | Free-form generation | `ai_gen` | Need structured JSON output | | Semantic similarity | `ai_similarity` | Never | | PDF / document parsing | `ai_parse_document` | Need image-level reasoning | +| RAG chunk preparation (from `ai_parse_document`) | `ai_prep_search` (semantic chunking + context enrichment) | Need custom chunking strategy or DBR < 18.2 | | Complex JSON / reasoning | — | **This is the intended use case for `ai_query`** | ## Prerequisites - Databricks SQL warehouse (**not Classic**) or cluster with DBR **15.1+** - DBR **15.4 ML LTS** recommended for batch workloads -- DBR **17.1+** required for `ai_parse_document` +- DBR **17.3+** required for `ai_parse_document` +- DBR **18.2+** required for `ai_prep_search` (serverless requires environment version **3+** for VARIANT support) - `ai_forecast` requires a **Pro or Serverless** SQL warehouse - Workspace in a supported AWS/Azure region for batch AI inference - Models run under Apache 2.0 or LLAMA 3.3 Community License — customers are responsible for compliance @@ -176,7 +178,7 @@ FROM ai_forecast( ## Reference Files -- [1-task-functions.md](1-task-functions.md) — Full syntax, 
parameters, SQL + PySpark examples for all 9 task-specific functions (`ai_analyze_sentiment`, `ai_classify`, `ai_extract`, `ai_fix_grammar`, `ai_gen`, `ai_mask`, `ai_similarity`, `ai_summarize`, `ai_translate`) and `ai_parse_document` +- [1-task-functions.md](1-task-functions.md) — Full syntax, parameters, SQL + PySpark examples for the task-specific functions (`ai_analyze_sentiment`, `ai_classify`, `ai_extract`, `ai_fix_grammar`, `ai_gen`, `ai_mask`, `ai_similarity`, `ai_summarize`, `ai_translate`), plus `ai_parse_document` and `ai_prep_search` - [2-ai-query.md](2-ai-query.md) — `ai_query` complete reference: all parameters, structured output with `responseFormat`, multimodal `files =>`, UDF patterns, and error handling - [3-ai-forecast.md](3-ai-forecast.md) — `ai_forecast` parameters, single-metric, multi-group, multi-metric, and confidence interval patterns - [4-document-processing-pipeline.md](4-document-processing-pipeline.md) — End-to-end batch document processing pipeline using AI Functions in a Lakeflow Declarative Pipeline; includes `config.yml` centralization, function selection logic, custom RAG pipeline (parse → chunk → Vector Search), and DSPy/LangChain guidance for near-real-time variants @@ -185,7 +187,8 @@ FROM ai_forecast( | Issue | Solution | |---|---| -| `ai_parse_document` not found | Requires DBR **17.1+**. Check cluster runtime. | +| `ai_parse_document` not found | Requires DBR **17.3+**. Check cluster runtime. | +| `ai_prep_search` not found | Requires DBR **18.2+** (serverless env v3+). On older runtimes, fall back to manual chunking via `variant_get` + `explode` on `ai_parse_document` output. | | `ai_forecast` fails | Requires **Pro or Serverless** SQL warehouse — not available on Classic or Starter. | | All functions return NULL | Input column is NULL. Filter with `WHERE col IS NOT NULL` before calling. | | `ai_translate` fails for a language | Supported: English, German, French, Italian, Portuguese, Hindi, Spanish, Thai. 
Use `ai_query` with a multilingual model for others. |