-
Notifications
You must be signed in to change notification settings - Fork 40
Description
Problem
qualify_columns uses a bottom-up transform_recursive traversal: inner SELECTs are qualified before outer SELECTs. When an inner scope encounters a column that belongs to the outer scope (a correlated reference), it raises UnknownColumn because the inner scope has no access to the outer scope's sources.
Example: correlated scalar subquery
SELECT t1_id, (SELECT AVG(val) FROM t2 WHERE t2_id = t1_id) AS avg_val FROM t1t1_id in the inner WHERE clause refers to t1 (outer scope), but the inner scope only knows about t2. Result: UnknownColumn("t1_id").
Example: correlated EXISTS (TPC-H Q4 pattern)
SELECT o_orderpriority FROM orders
WHERE EXISTS (SELECT * FROM lineitem WHERE l_orderkey = o_orderkey)o_orderkey in the inner WHERE refers to orders (outer scope). Result: UnknownColumn("o_orderkey").
Root cause
qualify_columns calls build_scope() independently for each SELECT it encounters during bottom-up traversal. Each scope is a fresh root scope with no parent context, so can_be_correlated is always false and there is no way to resolve columns against outer scopes.
Approaches considered
1. Schema-wide heuristic (column_exists_in_outer_schema_table)
Check if the unresolvable column exists in any schema table not in the current scope. Rejected: produces false negatives — a typo like typpo_col silently passes if any other table in the schema happens to have that column.
2. Error-and-retry at Subquery/Exists boundary
When transform_fn(Subquery) sees an UnknownColumn error from the inner SELECT, clear the error and re-qualify the inner SELECT with allow_partial=true. The outer scope's qualify_columns_in_expression then recurses into the subquery and either qualifies the correlated ref or re-raises the error.
Rejected: requires intercepting every Expression variant that can wrap a correlated subquery (Subquery, Exists, In, etc.); relies on the assumption that failed qualification doesn't partially mutate the AST; runs the full qualification pipeline twice per subquery scope.
3. Scope-tree-based top-down qualification (proper fix)
Refactor qualify_columns to use traverse_scope (or a custom top-down traversal) with parent scope context, similar to sqlglot's Python implementation. Inner scopes can resolve columns against parent scopes in a single pass.
Estimated cost: ~40-60 hours, ~700-1000 lines changed. Requires adding parent pointers to Scope (or arena-based indexing) and rewriting the main qualification loop to not use transform_recursive.
Test cases
These tests document the expected behavior once the issue is resolved:
#[test]
fn test_qualify_columns_correlated_scalar_subquery_with_unqualified_columns() {
let expr =
parse("SELECT t1_id, (SELECT AVG(val) FROM t2 WHERE t2_id = t1_id) AS avg_val FROM t1");
let mut schema = MappingSchema::new();
schema
.add_table(
"t1",
&[("t1_id".to_string(), DataType::BigInt { length: None })],
None,
)
.expect("schema setup");
schema
.add_table(
"t2",
&[
("t2_id".to_string(), DataType::BigInt { length: None }),
("val".to_string(), DataType::BigInt { length: None }),
],
None,
)
.expect("schema setup");
let result =
qualify_columns(expr, &schema, &QualifyColumnsOptions::new()).expect("qualify");
let sql = gen(&result);
assert!(sql.contains("t1.t1_id"), "outer column should be qualified: {sql}");
assert!(sql.contains("t2.t2_id"), "inner column should be qualified: {sql}");
assert!(sql.contains("t2.val"), "inner column should be qualified: {sql}");
}
#[test]
fn test_qualify_columns_correlated_exists_subquery() {
// TPC-H Q4 pattern
let expr = parse(
"SELECT o_orderpriority FROM orders \
WHERE EXISTS (SELECT * FROM lineitem WHERE l_orderkey = o_orderkey)",
);
let mut schema = MappingSchema::new();
schema
.add_table(
"orders",
&[
("o_orderpriority".to_string(), DataType::Text),
("o_orderkey".to_string(), DataType::BigInt { length: None }),
],
None,
)
.expect("schema setup");
schema
.add_table(
"lineitem",
&[("l_orderkey".to_string(), DataType::BigInt { length: None })],
None,
)
.expect("schema setup");
let result =
qualify_columns(expr, &schema, &QualifyColumnsOptions::new()).expect("qualify");
let sql = gen(&result);
assert!(sql.contains("orders.o_orderpriority"), "outer column should be qualified: {sql}");
assert!(sql.contains("lineitem.l_orderkey"), "inner column should be qualified: {sql}");
assert!(sql.contains("orders.o_orderkey"), "correlated outer column should be qualified: {sql}");
}