[DataLoader] Add DataFusion SQLGlot dialect for SQL transpilation by robreeves · Pull Request #501 · linkedin/openhouse

robreeves · 2026-03-13T19:22:21Z

Summary

Add a custom SQLGlot dialect for DataFusion and a to_datafusion_sql function that transpiles SQL from any supported source dialect to DataFusion SQL.

This is the first step in decoupling the TableTransformer API from DataFusion internals. Instead of returning a DataFusion DataFrame (which leaks the execution engine to users), the TableTransformer will return a SQL string and its dialect. The data loader will then use SQLGlot to translate that SQL to DataFusion SQL for execution. The translator will be used in #496.

We maintain the DataFusion dialect in-repo rather than contributing it upstream to SQLGlot because the SQLGlot maintainers don't have capacity to review more community dialects right now (source).

Context: #496 (comment)

Changes

DataFusion dialect (datafusion_sql.py): custom SQLGlot dialect with DataFusion-specific function mappings (e.g. SIZE → cardinality, ARRAY() → make_array, CURRENT_TIMESTAMP() → now()), type mappings (e.g. CHAR/TEXT → VARCHAR, BINARY → BYTEA), and identifier/normalization rules.

SQL translator (datafusion_sql.py): to_datafusion_sql(sql, source_dialect) accepts any supported source dialect (spark, postgres, mysql, etc.) and transpiles to DataFusion. When source_dialect is "datafusion" it returns the SQL unchanged. It validates the dialect and, if the dialect is unsupported, reports a clear error listing all supported options.

Dependency: added sqlglot>=29.0.0.
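The control flow described for to_datafusion_sql can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name to_datafusion_sql_sketch, the SUPPORTED tuple, the injected transpile callable, and the fake_transpile stub are all hypothetical, and the stub stands in for the SQLGlot call so the example runs without SQLGlot installed.

```python
from typing import Callable, List

# Assumption: a small illustrative subset of dialect names, not the PR's real list.
SUPPORTED = ("datafusion", "mysql", "postgres", "spark")


def to_datafusion_sql_sketch(
    sql: str,
    source_dialect: str,
    transpile: Callable[[str, str], List[str]],
) -> str:
    """Validate the dialect, pass DataFusion SQL through, and require one statement."""
    if source_dialect not in SUPPORTED:
        raise ValueError(
            f"Unsupported source dialect {source_dialect!r}. "
            f"Supported dialects: {', '.join(SUPPORTED)}"
        )
    if source_dialect == "datafusion":
        return sql  # already DataFusion SQL, so it is returned unchanged
    statements = transpile(sql, source_dialect)
    if len(statements) != 1:
        # Include the parsed statements in the message to help debugging.
        raise ValueError(
            f"Expected exactly one SQL statement, got {len(statements)}: {statements}"
        )
    return statements[0]


def fake_transpile(sql: str, dialect: str) -> List[str]:
    """Hypothetical stand-in for the transpiler: splits on ';' and applies one rename."""
    parts = [part.strip() for part in sql.split(";") if part.strip()]
    return [part.replace("SIZE(", "cardinality(") for part in parts]


translated = to_datafusion_sql_sketch("SELECT SIZE(xs) FROM t", "spark", fake_transpile)
passthrough = to_datafusion_sql_sketch("SELECT SIZE(xs) FROM t", "datafusion", fake_transpile)
```

Injecting the transpiler keeps the sketch independent of any SQLGlot version. In the PR itself the translation step is delegated to SQLGlot with the custom DataFusion dialect as the target.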

Testing Done

Manually Tested on local docker setup. Please include commands ran, and their output.
Added new tests for the changes made.
Updated existing tests to reflect the changes made.
No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
Some other form of testing like staging or soak time in production. Please explain.

Parametrized transpilation tests cover spark, mysql, postgres, and the datafusion identity case. Edge-case tests cover unsupported dialects and multi-statement errors. An end-to-end test executes transpiled SQL against DataFusion and validates the output data.

make check  # All checks passed (ruff, mypy)
make test   # 19 dialect tests pass

Additional Information

Breaking Changes
Deprecations
Large PR broken into smaller PRs, and PR plan linked in the description.

This is the first PR. Follow-up PRs will integrate the translator into the TableTransformer API and data loader pipeline.

Adds a comprehensive DataFusion dialect as a SQLGlot plugin, enabling transpilation from Spark SQL (and other dialects) to DataFusion SQL. Includes parser function mappings, generator type/function transforms, a SparkToDataFusionSQLTranslator helper, and 36 tests covering translation, identity round-trips, type mappings, and execution. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Rename SparkToDataFusionSQLTranslator to DataFusionSQLTranslator with a required source_dialect parameter. Validates the dialect on construction and provides a clear error listing all supported dialects. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

robreeves and others added 13 commits March 13, 2026 10:56

[DataLoader] Move translate_to_datafusion into datafusion_dialect module

0c9222f

Consolidate the translator function into datafusion_dialect.py and remove the separate sql_translator.py file. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

[DataLoader] Rename module to datafusion_sql and function to to_datafusion_sql

a50df1b

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

[DataLoader] Handle datafusion as noop dialect and improve error messages

e12bb4d

Return SQL unchanged when source_dialect is already datafusion. Include parsed statements in the multi-statement error for debugging. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

[DataLoader] Assert full output queries in dialect tests

6106a1c

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

[DataLoader] Use to_datafusion_sql in tests and deduplicate

b39308d

Replace _spark_to_df helper with to_datafusion_sql calls. Remove duplicate test cases between TestTranslator and other test classes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

[DataLoader] Consolidate tests: merge type mappings, single e2e execution test

91456b9

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

[DataLoader] Consolidate all transpilation tests into single parametrized test

af32561

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

[DataLoader] Inline SUPPORTED_SOURCE_DIALECTS into error message

88b0fc5

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

[DataLoader] Move identity round-trip into test_transpilation parametrized test

a4b131c

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

[DataLoader] Remove SPARK constant, inline string literals in tests

f1a9ea5

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

[DataLoader] Bump sqlglot minimum version to 29.0.0

0305583

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

robreeves marked this pull request as ready for review March 13, 2026 23:02

robreeves wants to merge 13 commits into linkedin:main from robreeves:datafusion_dialect
