
[DataLoader] Add DataFusion SQLGlot dialect for SQL transpilation #501

Open
robreeves wants to merge 13 commits into linkedin:main from robreeves:datafusion_dialect

Conversation

robreeves (Collaborator) commented Mar 13, 2026

Summary

Add a custom SQLGlot dialect for DataFusion and a to_datafusion_sql function that transpiles SQL from any supported source dialect to DataFusion SQL.

This is the first step in decoupling the TableTransformer API from DataFusion internals. Instead of returning a DataFusion DataFrame (leaking the execution engine to users), the TableTransformer will return a SQL string and its dialect. The data loader will then use SQLGlot to translate that SQL to DataFusion for execution. It will be used in #496.

We maintain the DataFusion dialect in-repo rather than contributing it upstream to SQLGlot because the SQLGlot maintainers don't have capacity to review more community dialects right now (source).

Context: #496 (comment)

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

DataFusion dialect (datafusion_sql.py): custom SQLGlot dialect with DataFusion-specific function mappings (e.g. SIZE → cardinality, ARRAY() → make_array, CURRENT_TIMESTAMP() → now()), type mappings (e.g. CHAR/TEXT → VARCHAR, BINARY → BYTEA), and identifier/normalization rules.

SQL translator (datafusion_sql.py): to_datafusion_sql(sql, source_dialect) accepts any supported source dialect (spark, postgres, mysql, etc.) and transpiles the SQL to DataFusion. When source_dialect is "datafusion", it returns the SQL unchanged. It validates source_dialect and raises a clear error listing all supported options when the dialect is not recognized.
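
The control flow described above (validate the dialect, short-circuit on "datafusion", reject multi-statement input) can be sketched roughly as follows. This is an illustration, not the code in this PR: the name to_datafusion_sql_sketch, the SUPPORTED_DIALECTS subset, the injected transpile callable, and fake_transpile are all hypothetical. In the PR the transpile step is done by SQLGlot; here it is stubbed so the sketch runs with only the standard library.

```python
from typing import Callable, List

# Hypothetical subset for illustration; the PR derives the real list
# from the dialects SQLGlot supports.
SUPPORTED_DIALECTS = frozenset({"datafusion", "mysql", "postgres", "spark"})

# A transpiler takes (sql, source_dialect) and returns one string per statement.
# In the PR this role is played by SQLGlot; here it is an injected stand-in.
Transpiler = Callable[[str, str], List[str]]


def to_datafusion_sql_sketch(sql: str, source_dialect: str, transpile: Transpiler) -> str:
    """Sketch of the validate / short-circuit / single-statement checks."""
    if source_dialect not in SUPPORTED_DIALECTS:
        supported = ", ".join(sorted(SUPPORTED_DIALECTS))
        raise ValueError(
            f"Unsupported source dialect {source_dialect!r}. "
            f"Supported dialects: {supported}"
        )

    # Already DataFusion SQL: return it untouched.
    if source_dialect == "datafusion":
        return sql

    statements = transpile(sql, source_dialect)
    if len(statements) != 1:
        # Include the parsed statements in the message to aid debugging.
        raise ValueError(
            f"Expected exactly one SQL statement, got {len(statements)}: {statements}"
        )
    return statements[0]


def fake_transpile(sql: str, dialect: str) -> List[str]:
    """Toy stand-in: splits on ';' and renames one function. Not a real transpiler."""
    parts = [part.strip() for part in sql.split(";") if part.strip()]
    return [part.replace("SIZE(", "cardinality(") for part in parts]
```

Calling to_datafusion_sql_sketch("SELECT SIZE(a) FROM t", "spark", fake_transpile) exercises the transpile path, while passing "datafusion" as the dialect returns the input string without invoking the transpiler at all. Injecting the transpiler is a choice made only for this sketch so the checks can be read and tested in isolation.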

Dependency: added sqlglot>=29.0.0.

Testing Done

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

Parametrized transpilation tests cover spark, mysql, postgres, and the datafusion identity case. Edge case tests cover unsupported dialects and multi-statement errors. An E2E test executes transpiled SQL against DataFusion and validates the output data.

make check  # All checks passed (ruff, mypy)
make test   # 19 dialect tests pass

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

This is the first PR. Follow-up PRs will integrate the translator into the TableTransformer API and data loader pipeline.

robreeves and others added 13 commits March 13, 2026 10:56
Adds a comprehensive DataFusion dialect as a SQLGlot plugin, enabling
transpilation from Spark SQL (and other dialects) to DataFusion SQL.
Includes parser function mappings, generator type/function transforms,
a SparkToDataFusionSQLTranslator helper, and 36 tests covering
translation, identity round-trips, type mappings, and execution.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rename SparkToDataFusionSQLTranslator to DataFusionSQLTranslator with a
required source_dialect parameter. Validates the dialect on construction
and provides a clear error listing all supported dialects.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Consolidate the translator function into datafusion_dialect.py and
remove the separate sql_translator.py file.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…usion_sql

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ages

Return SQL unchanged when source_dialect is already datafusion. Include
parsed statements in the multi-statement error for debugging.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace _spark_to_df helper with to_datafusion_sql calls. Remove
duplicate test cases between TestTranslator and other test classes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tion test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ized test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rized test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
robreeves marked this pull request as ready for review March 13, 2026 23:02