SQLGlot Java - SQL Parsing and Generation Library

A comprehensive Java 17+ port of the sqlglot Python library, providing SQL parsing, generation, validation, and transpilation across multiple SQL dialects.

Special Focus: Apache Drill support for distributed SQL query execution across heterogeneous data sources.

Features

Core Capabilities

SQL Parsing - Convert SQL strings into Abstract Syntax Trees (AST)
Code Generation - Convert ASTs back to SQL with configurable formatting
Query Analysis - Walk, search, and transform AST nodes
Transpilation - Convert SQL between different dialects
Formatting - Pretty-print and normalize SQL
Optimization - Apply query optimization rules (Phase 5A Complete)

Supported Dialects

All 31 Dialects Implemented (100% Coverage):

Core Databases:

ANSI - Standard SQL
PostgreSQL - Full support
MySQL - Full support
MySQL/MariaDB - MySQL variant with sequences

Cloud & Analytics:

BigQuery - Google Cloud data warehouse
Snowflake - Cloud data warehouse
Redshift - AWS analytics database
Athena - AWS S3 query service
Databricks - Unified analytics (Delta Lake)
Trino - Distributed query engine (formerly Presto/Presto)
Firebolt - Cloud data warehouse (AWS)

Analytics Databases:

ClickHouse - Real-time analytics
DuckDB - In-process SQL OLAP database
Spark SQL - Big data SQL engine
Hive - Hadoop SQL engine
Impala - Hadoop SQL query engine
Teradata - Enterprise data warehouse
Vertica - Columnar analytics
Yellowbrick - Cloud OLAP (Postgres-based)
Exasol - In-memory analytics

Specialized:

DRILL - Apache Drill (Primary focus) - Distributed SQL query engine
SQLite - Embedded database
Oracle - Enterprise database
T-SQL/MSSQL - SQL Server
StarRocks - Real-time OLAP
CockroachDB - Distributed PostgreSQL-compatible SQL
Aurora - AWS MySQL variant
Iceberg - Open table format

Experimental/Integration:

Pandas - Python dataframe SQL
WASM - WebAssembly SQL runtime
AWS Glue - Serverless ETL platform

Total: 31 SQL Dialects (100% parity with Python sqlglot)

Quick Start

Installation

Add to your Maven pom.xml:

<dependency>
    <groupId>com.gtkcyber.sqlglot</groupId>
    <artifactId>sqlglot-core</artifactId>
    <version>1.0.1</version>
</dependency>
<dependency>
    <groupId>com.gtkcyber.sqlglot</groupId>
    <artifactId>sqlglot-dialects</artifactId>
    <version>1.0.1</version>
</dependency>

Basic Usage

import com.gtkcyber.sqlglot.*;

// Parse SQL
Optional<Expression> expr = SqlGlot.parseOne("SELECT a, b FROM t WHERE x = 1");

// Generate SQL
String sql = SqlGlot.generate(expr.get());

// Format SQL
String formatted = SqlGlot.format("SELECT a,b FROM t WHERE x=1");

// Transpile between dialects
String drillSQL = SqlGlot.transpile("SELECT * FROM table", "ANSI", "DRILL");

// Use Apache Drill dialect
Dialect drill = Dialect.of("DRILL");
Optional<Expression> parsed = drill.parseOne("SELECT * FROM `workspace`.`schema`.`table`");

Advanced SQL Examples

import com.gtkcyber.sqlglot.*;

// Aggregate functions
String sql = "SELECT department, COUNT(*), SUM(salary), AVG(salary) " +
             "FROM employees GROUP BY department " +
             "HAVING COUNT(*) > 10";
Optional<Expression> expr = SqlGlot.parseOne(sql);
String generated = SqlGlot.generate(expr.get());

// String functions
String sql = "SELECT UPPER(name), LENGTH(email), SUBSTR(phone, 1, 3) FROM users";
Optional<Expression> expr = SqlGlot.parseOne(sql);

// Numeric functions
String sql = "SELECT ABS(amount), ROUND(price, 2), POWER(base, exponent) FROM data";
Optional<Expression> expr = SqlGlot.parseOne(sql);

// Set operations
String sql = "SELECT id FROM table1 " +
             "UNION " +
             "SELECT id FROM table2 " +
             "EXCEPT " +
             "SELECT id FROM table3";
Optional<Expression> expr = SqlGlot.parseOne(sql);

// CTEs (Common Table Expressions)
String sql = "WITH cte AS (SELECT a, b FROM t WHERE x > 10) " +
             "SELECT * FROM cte WHERE a > 5";
Optional<Expression> expr = SqlGlot.parseOne(sql);

// Complex queries with DISTINCT, LIMIT, OFFSET
String sql = "SELECT DISTINCT category, COUNT(*) cnt FROM products " +
             "WHERE price > 100 GROUP BY category ORDER BY cnt DESC " +
             "LIMIT 10 OFFSET 5";
Optional<Expression> expr = SqlGlot.parseOne(sql);

Query Optimization (Phase 5A)

Overview

SQLGlot Java includes a comprehensive query optimization framework with multiple optimization rules applied sequentially to transform SQL queries into more efficient forms.

Usage Examples

import com.gtkcyber.sqlglot.*;
import com.gtkcyber.sqlglot.optimizer.OptimizerConfig;

// Basic optimization with default configuration (Phase 5A rules)
Optional<Expression> expr = SqlGlot.parseOne("SELECT * FROM t WHERE TRUE AND x = 5");
Expression optimized = SqlGlot.optimize(expr.get());
String result = SqlGlot.generate(optimized);
// Result: "SELECT * FROM t WHERE x = 5"

// Specify dialect and configuration
Expression optimized = SqlGlot.optimize(expr.get(), "ANSI", OptimizerConfig.PHASE_5A);

// Use preset configurations
Expression minimalOpt = SqlGlot.optimize(expr.get(), "ANSI", OptimizerConfig.MINIMAL);
Expression aggressiveOpt = SqlGlot.optimize(expr.get(), "ANSI", OptimizerConfig.AGGRESSIVE);

// Custom configuration - enable specific rules only
OptimizerConfig custom = new OptimizerConfig(
    true,   // simplify
    true,   // canonicalize
    false,  // quoteIdentifiers
    false,  // eliminateCtes
    true,   // normalizePredicates
    false,  // pushdownPredicates
    false,  // mergeSubqueries
    false,  // joinReordering
    false,  // projectionPushdown
    false,  // annotateTypes
    false   // qualifyColumns
);
Expression optimized = SqlGlot.optimize(expr.get(), "ANSI", custom);

Optimization Rules

Phase 5A (Core Rules):

SimplifyRule - Removes redundant conditions (TRUE AND x → x), folds constants (1 + 2 → 3)
CanonicalizeRule - Normalizes expressions (5 < x → x > 5, NOT(x = 5) → x <> 5)
QuoteIdentifiersRule - Applies dialect-specific identifier quoting
EliminateCTEsRule - Removes unused Common Table Expressions

Phase 5B (Advanced Rules - Planned):

PushdownPredicatesRule, NormalizePredicatesRule, and more

Configuration Presets

OptimizerConfig.DEFAULT        // Phase 5A rules (default)
OptimizerConfig.MINIMAL        // Simplify only
OptimizerConfig.PHASE_5A       // All Phase 5A rules
OptimizerConfig.PHASE_5B       // Phase 5A + Phase 5B rules (future)
OptimizerConfig.AGGRESSIVE     // All available rules

Apache Drill Support

Overview

SQLGlot Java has first-class support for Apache Drill, enabling SQL parsing, validation, and transpilation for Drill's unique syntax and features.

Drill-Specific Features

1. Workspace and Schema Paths

Drill supports a three-level namespace: workspace.schema.table

String sql = "SELECT * FROM `datalake`.`raw`.`customers`";
Dialect drill = Dialect.of("DRILL");
Optional<Expression> expr = drill.parseOne(sql);

2. Backtick Identifiers

Drill uses backticks for identifiers (especially for reserved words and paths):

String sql = "SELECT `select`, `from` FROM `my-table-123`";
// Backticks are preserved and used consistently in generated SQL

3. Complex Data Types

Support for Drill's nested types:

// ARRAY type
String sql = "SELECT CAST(col AS ARRAY<INT>) FROM t";

// MAP type
String sql = "SELECT CAST(col AS MAP<STRING, INT>) FROM t";

// STRUCT type
String sql = "SELECT CAST(col AS STRUCT<name STRING, age INT, address STRUCT<street STRING, city STRING>>) FROM t";

4. FLATTEN Function

Unnest arrays and maps into individual rows:

// Basic flatten
String sql = "SELECT FLATTEN(nested_col) FROM t";

// Flatten with depth limit
String sql = "SELECT FLATTEN(deeply_nested_col, 2) FROM t";

5. File Sources

Query files directly:

String sql = "SELECT * FROM dfs.`/data/customers.parquet`";
String sql = "SELECT * FROM dfs.`s3://bucket/data/`";

6. Nested Field Access

Access nested JSON and structured data:

String sql = "SELECT person.name, person.address.street FROM customers";

Drill Example Workflows

Query Validation

Dialect drill = Dialect.of("DRILL");
String userQuery = "SELECT * FROM `problematic`.`query`";

try {
    Optional<Expression> validated = drill.parseOne(userQuery);
    if (validated.isPresent()) {
        System.out.println("Query is valid");
    }
} catch (Exception e) {
    System.out.println("Query validation failed: " + e.getMessage());
}

Query Normalization

Dialect drill = Dialect.of("DRILL");
String messySQL = "select a,b,c from t WHERE x=1";

String normalized = drill.format(messySQL);
// Output: SELECT a, b, c FROM t WHERE x = 1

Cross-Dialect Transpilation

// Convert from another SQL dialect to Drill
String postgresSQL = "SELECT column_name FROM table_name";
String drillSQL = SqlGlot.transpile(postgresSQL, "POSTGRES", "DRILL");
// Output: SELECT `column_name` FROM `table_name`

// Or from Drill to another dialect
String drillSQL = "SELECT `a`, `b` FROM `my_table`";
String ansiSQL = SqlGlot.transpile(drillSQL, "DRILL", "ANSI");
// Output: SELECT a, b FROM my_table

Query Analysis

Dialect drill = Dialect.of("DRILL");
Optional<Expression> expr = drill.parseOne("SELECT * FROM `t1` JOIN `t2` ON `t1`.id = `t2`.id WHERE x > 10");

if (expr.isPresent()) {
    // Find all table references
    expr.get().findAll(Nodes.Table.class).forEach(table -> {
        System.out.println("Table: " + table.getName());
    });

    // Find all functions
    expr.get().findAll(Nodes.Function.class).forEach(func -> {
        System.out.println("Function: " + func.getName());
    });

    // Transform the query
    Expression transformed = expr.get().transform(e -> {
        // Apply custom transformations
        return e;
    });
}

Architecture

Project Structure

sqlglot-java/
├── sqlglot-core/
│   ├── src/main/java/com/gtkcyber/sqlglot/
│   │   ├── tokens/          - Tokenization
│   │   ├── expressions/      - AST node types
│   │   ├── parser/           - SQL parsing
│   │   ├── generator/        - SQL generation
│   │   ├── optimizer/        - Query optimization framework
│   │   ├── dialect/          - Dialect infrastructure
│   │   └── SqlGlot.java      - Public API
│   └── src/test/java/com/gtkcyber/sqlglot/  - Comprehensive tests
│
└── sqlglot-dialects/
    ├── src/main/java/com/gtkcyber/sqlglot/dialects/
    │   ├── drill/            - Apache Drill dialect
    │   ├── postgres/         - PostgreSQL dialect
    │   ├── mysql/            - MySQL dialect
    │   ├── bigquery/         - BigQuery dialect
    │   ├── snowflake/        - Snowflake dialect
    │   ├── [26 additional dialects...]
    │   └── glue/             - AWS Glue dialect
    ├── src/main/resources/templates/  - Dialect template scaffold
    └── src/test/java/com/gtkcyber/sqlglot/dialects/  - Dialect tests

Core Components

Tokenizer - Converts SQL strings into tokens
- Base: com.gtkcyber.sqlglot.tokens.Tokenizer
- Drill: com.gtkcyber.sqlglot.dialects.drill.DrillTokenizer
Parser - Converts tokens into AST
- Base: com.gtkcyber.sqlglot.parser.Parser
- Drill: com.gtkcyber.sqlglot.dialects.drill.DrillParser
Expression AST - Hierarchical node types
- Base: com.gtkcyber.sqlglot.expressions.Expression
- Nodes: com.gtkcyber.sqlglot.expressions.Nodes
- Visitor: com.gtkcyber.sqlglot.expressions.ExpressionVisitor<R>
Generator - Converts AST back to SQL
- Base: com.gtkcyber.sqlglot.generator.Generator
- Drill: com.gtkcyber.sqlglot.dialects.drill.DrillGenerator
Dialect - Bundles tokenizer, parser, generator
- Base: com.gtkcyber.sqlglot.dialect.Dialect
- Drill: com.gtkcyber.sqlglot.dialects.drill.DrillDialect

Testing

163+ Passing Tests across 30+ test classes:

Core Tests

TokenizerTest (5 tests) - Token recognition and SQL tokenization
GeneratorTest (12 tests) - SQL code generation from AST

Operators & Expressions

ArithmeticComparisonTest (15 tests) - All arithmetic and comparison operators
BasicOperatorsTest (6 tests) - IN, LIKE, IS NULL, BETWEEN, CAST operators
ComplexExpressionTest (1 test) - Complex CASE expressions

Functions

FunctionCallTest (9 tests) - Function parsing and generation
AggregateFunctionsTest (4 tests) - COUNT, SUM, AVG aggregate functions
StringFunctionsTest (7 tests) - UPPER, LOWER, LENGTH, SUBSTR, TRIM, CONCAT
NumericFunctionsTest (7 tests) - ABS, ROUND, CEIL, FLOOR, POWER, SQRT

SQL Statements

DMLTest (5 tests) - INSERT, DELETE statements
DDLTest (6 tests) - CREATE, DROP, ALTER TABLE
GroupByTest (2 tests) - GROUP BY with proper clause termination
HavingClauseTest (4 tests) - HAVING with aggregate functions
DistinctTest (2 tests) - DISTINCT keyword
LimitOffsetTest (4 tests) - LIMIT and OFFSET clauses

Advanced Features (Phase 3-4B)

FeatureDebugTest (5 tests) - UNION, INTERSECT, EXCEPT, CTEs
WindowFunctionTest - Window functions (ROW_NUMBER, RANK, etc.)
AliasTest - Column and table aliases (explicit/implicit)
SubqueryTest - Subqueries in all contexts
JoinTest - Complex JOINs with ON conditions
RoundTripTest (4 tests, skipped) - Parse → Generate → Parse verification
Phase2Test (17 tests, skipped) - Additional advanced features
DrillDialectTest (18 tests) - Drill-specific functionality

Query Optimization (Phase 5A)

OptimizerTest (10 tests) - Optimizer framework
ScopeTest (12 tests) - Scope and context analysis
SimplifyRuleTest (23 tests) - Boolean/arithmetic simplification
CanonicalizeRuleTest (12 tests) - Expression normalization
QuoteIdentifiersRuleTest (8 tests) - Identifier quoting
PushdownPredicatesRuleTest (16 tests) - WHERE clause pushdown
NormalizePredicatesRuleTest (13 tests) - CNF normalization
Total Phase 5A: 94 Tests

Running Tests

# Run all tests
mvn test

# Run specific test class
mvn test -Dtest=DrillDialectTest

# Run with coverage
mvn test jacoco:report

Development

Building from Source

# Clone the repository
git clone https://github.com/charlesgivre/sqlglot_java.git
cd sqlglot_java

# Build
mvn clean install

# Build without tests (faster)
mvn clean install -DskipTests

Code Style

Java 17+ features encouraged (records, sealed classes, pattern matching)
Immutability preferred
Comprehensive Javadoc comments
Test coverage for new features

Key Design Decisions

Records for Value Objects

Token is a record for immutability
ParserConfig, GeneratorConfig as records

Sealed Classes

Expression is abstract (sealed in Java 19+)
Binary operators extend sealed Binary class
Permits 50+ node types

Visitor Pattern

Type-safe AST traversal
Compile-time exhaustiveness checking
Easy to extend with dialects

Dialect Registry

ServiceLoader-based plugin discovery
Manual registration fallback
Case-insensitive dialect lookup

Performance Characteristics

Tokenization: O(n) where n = SQL length
Parsing: O(n²) worst case (operator precedence)
Generation: O(n) where n = AST nodes
Memory: Minimal; streaming tokens, tree-based AST

Known Limitations

Still Planned for Future Phases

Advanced optimizer rules (Phase 5B: join reordering, subquery merging, projection pushdown, type inference, column qualification)
Some dialect-specific features not yet fully implemented
No concurrent parsing/generation (sequential only)
25 remaining SQL dialects not yet implemented (to reach parity with Python sqlglot's 31 dialects)

Goals (Next)

Implement advanced optimizer rules (join reordering, subquery merging, projection pushdown)
Add type inference and annotation
Implement full column qualification
Performance optimization and benchmarking

Contributing

Contributions welcome! Focus areas:

Drill Features - Expand Drill-specific functionality and optimizations
Optimizer - Advanced Phase 5C optimizer rules (join reordering, additional optimizations)
Tests - Increase test coverage for all 31 dialects with real-world SQL examples
Performance - Optimize tokenizer, parser, and code generation
Documentation - Improve Javadoc, examples, and dialect-specific guides
Integration - Connect with external tools and frameworks

License

Released under MIT License.

References

SQLGlot Python - Original Python library
Apache Drill - SQL query engine
Java 17 Features
SQL Standard

Support

For issues, questions, or contributions related to Apache Drill support, please refer to the DrillDialectTest and DrillDialect classes for examples.

Last Updated: 2026-02-20 Package: com.gtkcyber.sqlglot

Build Status

163 Core Tests Passing (Phase 2-5B complete)
131 Optimizer Tests Passing (Phase 5A + 5B complete)
All 31 SQL Dialects Implemented (100% parity with Python sqlglot)
Package successfully renamed to com.gtkcyber.sqlglot
Phase 5A optimizer with 7 core optimization rules fully implemented
Phase 5B optimizer with 4 advanced optimization rules fully implemented
All dialects registered via ServiceLoader with auto-discovery
Zero failures or errors
Clean compilation
Production-ready code quality

Name		Name	Last commit message	Last commit date
Latest commit History 61 Commits
.github		.github
sqlglot-core		sqlglot-core
sqlglot-dialects		sqlglot-dialects
.gitignore		.gitignore
LICENSE		LICENSE
MAVEN_CENTRAL.md		MAVEN_CENTRAL.md
README.md		README.md
ROADMAP.md		ROADMAP.md
pom.xml		pom.xml

License

gtkcyber/sqlglot_java

Folders and files

Latest commit

History

Repository files navigation

SQLGlot Java - SQL Parsing and Generation Library

Features

Core Capabilities

Supported Dialects

Quick Start

Installation

Basic Usage

Advanced SQL Examples

Query Optimization (Phase 5A)

Overview

Usage Examples

Optimization Rules

Configuration Presets

Apache Drill Support

Overview

Drill-Specific Features

1. Workspace and Schema Paths

2. Backtick Identifiers

3. Complex Data Types

4. FLATTEN Function

5. File Sources

6. Nested Field Access

Drill Example Workflows

Query Validation

Query Normalization

Cross-Dialect Transpilation

Query Analysis

Architecture

Project Structure

Core Components

Testing

Core Tests

Operators & Expressions

Functions

SQL Statements

Advanced Features (Phase 3-4B)

Query Optimization (Phase 5A)

Running Tests

Development

Building from Source

Code Style

Key Design Decisions

Records for Value Objects

Sealed Classes

Visitor Pattern

Dialect Registry

Performance Characteristics

Known Limitations

Still Planned for Future Phases

Goals (Next)

Contributing

License

References

Support

Build Status

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages