Skip to content

qualify_columns fails on correlated subqueries with unqualified column references #68

@goldmedal

Description

@goldmedal

Problem

qualify_columns uses a bottom-up transform_recursive traversal: inner SELECTs are qualified before outer SELECTs. When an inner scope encounters a column that belongs to the outer scope (a correlated reference), it raises UnknownColumn because the inner scope has no access to the outer scope's sources.

Example: correlated scalar subquery

SELECT t1_id, (SELECT AVG(val) FROM t2 WHERE t2_id = t1_id) AS avg_val FROM t1

t1_id in the inner WHERE clause refers to t1 (outer scope), but the inner scope only knows about t2. Result: UnknownColumn("t1_id").

Example: correlated EXISTS (TPC-H Q4 pattern)

SELECT o_orderpriority FROM orders
WHERE EXISTS (SELECT * FROM lineitem WHERE l_orderkey = o_orderkey)

o_orderkey in the inner WHERE refers to orders (outer scope). Result: UnknownColumn("o_orderkey").

Root cause

qualify_columns calls build_scope() independently for each SELECT it encounters during bottom-up traversal. Each scope is a fresh root scope with no parent context, so can_be_correlated is always false and there is no way to resolve columns against outer scopes.

Approaches considered

1. Schema-wide heuristic (column_exists_in_outer_schema_table)

Check if the unresolvable column exists in any schema table not in the current scope. Rejected: produces false negatives — a typo like typpo_col silently passes if any other table in the schema happens to have that column.

2. Error-and-retry at Subquery/Exists boundary

When transform_fn(Subquery) sees an UnknownColumn error from the inner SELECT, clear the error and re-qualify the inner SELECT with allow_partial=true. The outer scope's qualify_columns_in_expression then recurses into the subquery and either qualifies the correlated ref or re-raises the error.

Rejected: requires intercepting every Expression variant that can wrap a correlated subquery (Subquery, Exists, In, etc.); relies on the assumption that failed qualification doesn't partially mutate the AST; runs the full qualification pipeline twice per subquery scope.

3. Scope-tree-based top-down qualification (proper fix)

Refactor qualify_columns to use traverse_scope (or a custom top-down traversal) with parent scope context, similar to sqlglot's Python implementation. Inner scopes can resolve columns against parent scopes in a single pass.

Estimated cost: ~40-60 hours, ~700-1000 lines changed. Requires adding parent pointers to Scope (or arena-based indexing) and rewriting the main qualification loop to not use transform_recursive.

Test cases

These tests document the expected behavior once the issue is resolved:

#[test]
fn test_qualify_columns_correlated_scalar_subquery_with_unqualified_columns() {
    let expr =
        parse("SELECT t1_id, (SELECT AVG(val) FROM t2 WHERE t2_id = t1_id) AS avg_val FROM t1");

    let mut schema = MappingSchema::new();
    schema
        .add_table(
            "t1",
            &[("t1_id".to_string(), DataType::BigInt { length: None })],
            None,
        )
        .expect("schema setup");
    schema
        .add_table(
            "t2",
            &[
                ("t2_id".to_string(), DataType::BigInt { length: None }),
                ("val".to_string(), DataType::BigInt { length: None }),
            ],
            None,
        )
        .expect("schema setup");

    let result =
        qualify_columns(expr, &schema, &QualifyColumnsOptions::new()).expect("qualify");
    let sql = gen(&result);

    assert!(sql.contains("t1.t1_id"), "outer column should be qualified: {sql}");
    assert!(sql.contains("t2.t2_id"), "inner column should be qualified: {sql}");
    assert!(sql.contains("t2.val"), "inner column should be qualified: {sql}");
}

#[test]
fn test_qualify_columns_correlated_exists_subquery() {
    // TPC-H Q4 pattern
    let expr = parse(
        "SELECT o_orderpriority FROM orders \
         WHERE EXISTS (SELECT * FROM lineitem WHERE l_orderkey = o_orderkey)",
    );

    let mut schema = MappingSchema::new();
    schema
        .add_table(
            "orders",
            &[
                ("o_orderpriority".to_string(), DataType::Text),
                ("o_orderkey".to_string(), DataType::BigInt { length: None }),
            ],
            None,
        )
        .expect("schema setup");
    schema
        .add_table(
            "lineitem",
            &[("l_orderkey".to_string(), DataType::BigInt { length: None })],
            None,
        )
        .expect("schema setup");

    let result =
        qualify_columns(expr, &schema, &QualifyColumnsOptions::new()).expect("qualify");
    let sql = gen(&result);

    assert!(sql.contains("orders.o_orderpriority"), "outer column should be qualified: {sql}");
    assert!(sql.contains("lineitem.l_orderkey"), "inner column should be qualified: {sql}");
    assert!(sql.contains("orders.o_orderkey"), "correlated outer column should be qualified: {sql}");
}

Related Discussion

#67 (comment)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions