[DataLoader] Add DataFusion SQLGlot dialect for SQL transpilation #501
Open
robreeves wants to merge 13 commits into linkedin:main
Adds a comprehensive DataFusion dialect as a SQLGlot plugin, enabling transpilation from Spark SQL (and other dialects) to DataFusion SQL. Includes parser function mappings, generator type/function transforms, a SparkToDataFusionSQLTranslator helper, and 36 tests covering translation, identity round-trips, type mappings, and execution. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rename SparkToDataFusionSQLTranslator to DataFusionSQLTranslator with a required source_dialect parameter. Validates the dialect on construction and provides a clear error listing all supported dialects. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Consolidate the translator function into datafusion_dialect.py and remove the separate sql_translator.py file. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…usion_sql Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ages Return SQL unchanged when source_dialect is already datafusion. Include parsed statements in the multi-statement error for debugging. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace _spark_to_df helper with to_datafusion_sql calls. Remove duplicate test cases between TestTranslator and other test classes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tion test Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ized test Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rized test Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Add a custom SQLGlot dialect for DataFusion and a to_datafusion_sql function that transpiles SQL from any supported source dialect to DataFusion SQL.
This is the first step in decoupling the TableTransformer API from DataFusion internals. Instead of returning a DataFusion DataFrame (leaking the execution engine to users), the TableTransformer will return a SQL string and its dialect. The data loader will then use SQLGlot to translate that SQL to DataFusion for execution. It will be used in #496.
We maintain the DataFusion dialect in-repo rather than contributing it upstream to SQLGlot because the SQLGlot maintainers don't have capacity to review more community dialects right now (source).
Context: #496 (comment)
Changes
- DataFusion dialect (datafusion_sql.py): custom SQLGlot dialect with DataFusion-specific function mappings (e.g. SIZE → cardinality, ARRAY() → make_array, CURRENT_TIMESTAMP() → now()), type mappings (e.g. CHAR/TEXT → VARCHAR, BINARY → BYTEA), and identifier/normalization rules.
- SQL translator (datafusion_sql.py): to_datafusion_sql(sql, source_dialect) accepts any supported source dialect (spark, postgres, mysql, etc.) and transpiles to DataFusion. When source_dialect is "datafusion" it returns the SQL unchanged. It validates the dialect, with a clear error listing all supported options.
- Dependency: added sqlglot>=29.0.0.
Testing Done
Parametrized transpilation tests cover spark, mysql, postgres, and datafusion identity. Edge-case tests cover unsupported dialects and multi-statement errors. An E2E test executes transpiled SQL against DataFusion and validates the output data.
Additional Information
This is the first PR. Follow-up PRs will integrate the translator into the TableTransformer API and data loader pipeline.