A collection of examples demonstrating how to use Databricks AI Functions for LLM batch inference and document processing. Each example is self-contained, making it simple for Data Analysts and Data Engineers to adapt for their own use cases.
Databricks AI Functions are SQL-native functions that bring LLM capabilities directly into SQL queries and Spark pipelines, with no model endpoint setup required. Always prefer a task-specific function over `ai_query`.
| Function | Purpose | v2 Enhancements |
|---|---|---|
| `ai_classify` | Zero-shot classification | Accepts VARIANT input (parsed docs) and descriptive label maps (`{"label": "description"}`) instead of plain arrays |
| `ai_extract` | Structured field extraction | Accepts VARIANT input and typed JSON schemas with string, number, array, object, enum types; supports nested structures and field descriptions |
| `ai_parse_document` | OCR + layout-aware document parsing | v2.0 output schema nests content under `document:elements[]` with bounding boxes; supports PDF, DOCX, PPTX, JPG, PNG |
| `ai_analyze_sentiment` | Sentiment analysis | |
| `ai_summarize` | Text summarization | |
| `ai_translate` | Translation (8 languages) | |
| `ai_mask` | PII redaction | |
| `ai_similarity` | Semantic similarity scoring | |
| `ai_fix_grammar` | Grammar correction | |
| `ai_gen` | Free-form text generation | |
| Function | Purpose |
|---|---|
| `ai_query` | Custom prompts, nested JSON extraction, multimodal input, custom endpoints; use as a last resort when no task-specific function fits |
| `ai_forecast` | Time series forecasting (requires Pro or Serverless SQL warehouse) |
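As a rough illustration of preferring task-specific functions over `ai_query`, the sketch below assembles one SQL statement that calls several of them on a text column. The table `support_tickets`, the column `body`, the labels, and the helper name are all hypothetical. The calls use the array-based (v1) signatures, and the helper only builds the query string, which a notebook could then pass to `spark.sql`.

```python
def build_enrichment_query(table: str, text_col: str, labels: list[str]) -> str:
    """Return a SQL statement that enriches `text_col` with task-specific AI Functions."""
    # Double any single quotes so each label is a valid SQL string literal.
    label_list = ", ".join("'" + label.replace("'", "''") + "'" for label in labels)
    return (
        "SELECT\n"
        f"  {text_col},\n"
        f"  ai_analyze_sentiment({text_col}) AS sentiment,\n"
        f"  ai_summarize({text_col}, 30) AS summary,\n"
        f"  ai_translate({text_col}, 'en') AS translated_en,\n"
        f"  ai_classify({text_col}, ARRAY({label_list})) AS topic\n"
        f"FROM {table}"
    )


# Hypothetical table, column, and labels, used only for illustration.
query = build_enrichment_query("support_tickets", "body", ["billing", "outage", "other"])
print(query)
```

No prompt text or endpoint name appears anywhere in the statement, which is the main practical difference from an equivalent `ai_query` call.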
Use-case walkthroughs using v1 function signatures (`ai_query` with string prompts, `ai_classify` with ARRAY labels, `ai_summarize`, `ai_analyze_sentiment`):
- Insurance Call Center Analysis: sentiment, compliance scoring, intent extraction, next-best-action
- ML Feature Engineering: generate categorical features from text using `ai_classify` and `ai_query`
MLflow 3 evaluation notebook for assessing AI Function output quality.
Document intelligence pipelines showcasing v2 function signatures with Spark Declarative Pipelines. Demonstrates:
- `ai_parse_document` v2.0: VARIANT output with `document:elements[]` structure and bounding boxes
- `ai_classify` v2: VARIANT input + descriptive label maps (`{"invoice": "Commercial invoice with..."}`) for higher-accuracy classification
- `ai_extract` v2: VARIANT input + typed JSON schemas with nested arrays, enums, and field descriptions
Implements medallion architecture (bronze/silver/gold) for processing PDFs across 4 document types (invoices, bank statements, contracts, SEC filings). Includes batch and streaming variants in both Python and SQL.
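To make the v2 inputs more concrete, the sketch below writes out a descriptive label map for the four document types and a typed schema for invoices, then serializes both to JSON. Everything in it is an assumption made for illustration: the descriptions, the field names, and especially the schema layout (the `type`, `description`, `items`, `fields`, and `values` keys), which may not match what `ai_extract` actually expects. Check the function reference before reusing it.

```python
import json

# Label -> description pairs, following the {"label": "description"} shape.
# The descriptions are illustrative and not taken from the pipelines.
LABELS = {
    "invoice": "Bill for goods or services with line items and totals",
    "bank_statement": "Periodic account statement with transactions and balances",
    "contract": "Legal agreement between named parties with clauses and signatures",
    "sec_filing": "Regulatory filing submitted to the SEC",
}

# Hypothetical typed schema with a nested array, an enum, and field descriptions.
INVOICE_SCHEMA = {
    "invoice_number": {"type": "string", "description": "Identifier printed on the invoice"},
    "total": {"type": "number", "description": "Grand total including tax"},
    "currency": {"type": "enum", "values": ["USD", "EUR", "GBP"]},
    "line_items": {
        "type": "array",
        "items": {
            "type": "object",
            "fields": {
                "description": {"type": "string"},
                "amount": {"type": "number"},
            },
        },
    },
}

# The type names listed for typed schemas in the function table.
ALLOWED_TYPES = {"string", "number", "array", "object", "enum"}


def check_field(spec: dict) -> None:
    """Recursively verify that a field spec only uses the allowed type names."""
    if spec["type"] not in ALLOWED_TYPES:
        raise ValueError(f"unsupported type: {spec['type']}")
    if spec["type"] == "array":
        check_field(spec["items"])
    if spec["type"] == "object":
        for child in spec["fields"].values():
            check_field(child)


for field_spec in INVOICE_SCHEMA.values():
    check_field(field_spec)

# JSON strings that could be embedded in a query or stored as pipeline config.
labels_json = json.dumps(LABELS)
schema_json = json.dumps(INVOICE_SCHEMA)
```

Validating the schema locally before a pipeline run is a cheap way to catch typos in type names, whatever the final format turns out to be.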
Production-ready Databricks Asset Bundle workflows for unstructured document processing with Structured Streaming:
- Unstructured IE — Parse, extract, analyze, and export structured entities as JSONL
- Parse-Translate-Classify — Multi-lingual document segmentation and classification
- Knowledge Base — 9-stage pipeline with diagram extraction, chunking, and visual enrichment for RAG
- IE Selected Pages — Page-level classification for selective extraction from large documents
The DAB workflows currently use `ai_parse_document` v2.0 but rely on `ai_query` for classification and extraction tasks that v2 task-specific functions can now handle directly.
- `ie-selected-pages/03_classify_pages.py`: Replace `ai_query` (Yes/No string matching) with `ai_classify` v2. Eliminates custom prompt, string parsing, and label matching logic.
- `ie-selected-pages/04_extract_info.py`: Replace `ai_query` with `ai_extract` v2. Flat schema (`pws_id`, `sample_category`) is a textbook `ai_extract` use case.
- `unstructured-ie/03_extract_key_info.py`: Replace `ai_query` with `ai_extract` v2 for bond data extraction. Nested schemas are now supported in v2. Keep the agent_bricks path as a fallback.
- `knowledge-base/04_2_extract_key_info.py`: Same pattern as above for electronics datasheet extraction.
- `parse-translate-classify/02_translate_content.py`: Consider `ai_translate(content, 'en')` instead of `ai_query`. Simpler, but loses the custom "retain original formatting" prompt instruction.
- `knowledge-base/04_1` (diagram enrichment): Requires multimodal image input; no task-specific function available.
- `parse-translate-classify/03` (segmentation): Complex multi-step reasoning (segment + classify + transform in one pass); `ai_query` is the right tool.
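For context on what the `ai_classify` migration removes, a Yes/No `ai_query` step typically needs post-processing along these lines. This is a hypothetical sketch, not the code in `03_classify_pages.py`; the function name and the normalization rules are assumptions.

```python
import re
from typing import Optional


def parse_yes_no(response: Optional[str]) -> Optional[bool]:
    """Map a free-form LLM reply to True/False, or None when it is ambiguous."""
    if response is None:
        return None
    # Lowercase and take the first alphabetic word, so "Yes." and " NO, because" both parse.
    match = re.match(r"\s*([a-z]+)", response.lower())
    if match is None:
        return None
    word = match.group(1)
    if word == "yes":
        return True
    if word == "no":
        return False
    # Anything else (for example "Unclear") cannot be mapped to a label.
    return None


# Hypothetical model replies, including the awkward cases string matching has to cover.
replies = ["Yes", "no.", "  YES, it lists sample results", "Unclear", None]
print([parse_yes_no(r) for r in replies])
```

Every path that returns `None` is a row that needs a retry or a default value, which is the kind of string parsing and label matching the first migration item expects `ai_classify` to make unnecessary.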