RFC-0025 Derived column#61

Open
ScrapCodes wants to merge 1 commit into prestodb:main from ScrapCodes:derived-column-spec
Conversation

@ScrapCodes
Contributor

@ScrapCodes ScrapCodes commented Apr 24, 2026

Currently WIP

@prestodb-ci prestodb-ci added the from:IBM PRs from IBM label Apr 24, 2026
@prestodb-ci prestodb-ci requested review from a team, BryanCutler and infvg and removed request for a team April 24, 2026 07:35
@ScrapCodes ScrapCodes marked this pull request as draft April 24, 2026 07:35
@ScrapCodes ScrapCodes force-pushed the derived-column-spec branch 4 times, most recently from 474cf06 to 221e8c0 Compare April 24, 2026 11:37
@ScrapCodes ScrapCodes force-pushed the derived-column-spec branch from 221e8c0 to 8690c4e Compare May 4, 2026 09:02
@ScrapCodes ScrapCodes changed the title [WIP] RFC-0025 Derived column RFC-0025 Derived column May 4, 2026
@ScrapCodes ScrapCodes marked this pull request as ready for review May 4, 2026 09:58
@prestodb-ci prestodb-ci requested review from a team, infvg and wanglinsong and removed request for a team May 4, 2026 09:58
@ScrapCodes ScrapCodes force-pushed the derived-column-spec branch 3 times, most recently from 3bdd1ef to 2448a60 Compare May 4, 2026 12:23
```json
"udfSpecList": [
  {
    "functionName": "_UDF_1",
    "params": [
```
Contributor


Why is params separated from arguments? Are you using any special Iceberg or Presto UDF format here?

Contributor Author

@ScrapCodes ScrapCodes May 5, 2026


@aditi-pandit , the way we can identify a UDF in Presto is:
catalog.schema.function_name and its param types.

Once the UDF is found, the next step in checking it for replacement is that its arguments should match too.

Also, once the Iceberg UDF spec is implemented, we should be able to uniquely identify and load a UDF by the same search, i.e. function name (id) and params.

Interestingly, their entire UDF metadata spec does not store a function name :)

@jja725 jja725 self-requested a review May 5, 2026 18:25
@ScrapCodes ScrapCodes force-pushed the derived-column-spec branch from 2448a60 to 8c6c4a4 Compare May 6, 2026 16:39

@jja725 jja725 left a comment


Agree that how writes would work is the main concern here, along with compatibility with all the engines.


1. Load the UDF from UDF spec provided in table properties.
2. Scan the filter predicate and look for the UDF
3. If the UDF defined in the spec is found, rewrite the filter predicate by replacing the UDF call with the derived column.
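The three steps above can be sketched over a toy expression tree. This is a minimal sketch: `ColumnRef`, `UdfCall`, `Comparison`, and `rewrite_predicate` are illustrative names, not Presto APIs, and the spec dict merely mirrors the shape of the `udfSpecList` JSON quoted earlier.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ColumnRef:
    name: str

@dataclass(frozen=True)
class UdfCall:
    function_name: str
    args: tuple

@dataclass(frozen=True)
class Comparison:
    op: str
    left: object
    right: object

def rewrite_predicate(node, spec):
    """Recursively replace UDF calls matching the spec with the derived column."""
    if isinstance(node, UdfCall) and node.function_name == spec["functionName"] \
            and [a.name for a in node.args] == spec["params"]:
        # step 3: swap the matching UDF call for the derived column reference
        return ColumnRef(spec["derivedColumn"])
    if isinstance(node, Comparison):
        # step 2: scan the filter predicate looking for the UDF
        return Comparison(node.op,
                          rewrite_predicate(node.left, spec),
                          rewrite_predicate(node.right, spec))
    return node

# step 1: the spec as loaded from table properties (illustrative shape)
spec = {"functionName": "lower", "params": ["c2"], "derivedColumn": "c2_derived"}
pred = Comparison("=", UdfCall("lower", (ColumnRef("c2"),)), "constant_value")
print(rewrite_predicate(pred, spec))
```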


feel like we can also rewrite the projection

Contributor Author

@ScrapCodes ScrapCodes May 7, 2026


Yes, we can rewrite projections too. However, the performance gain is negligible in most cases. There may be some cases where this would be helpful, e.g. when the UDF itself performs a very compute-intensive operation.
Will try to cover this, thanks for suggesting!


Alter table syntax
```sql
ALTER TABLE table_name {ADD | ALTER} COLUMN column_name data_type [DERIVED {[udf("column_name1")] | [SQL EXPRESSION]}]
```


do we need versioning for UDF?

@ScrapCodes ScrapCodes force-pushed the derived-column-spec branch from 8c6c4a4 to 03935be Compare May 7, 2026 16:11
@ScrapCodes
Contributor Author

@tdcmeehan has volunteered to be a co-author ! Yay!

@ScrapCodes
Contributor Author

Here is a PR https://github.com/prestodb/presto/pull/27746/changes with a proof of concept!

A compute engine like Presto can easily push down a filter predicate, e.g. `SELECT col1, col2 FROM T1 WHERE col1='constant_value'`; this allows pruning the number of
rows required for TableScan by applying the filter WHERE col1='constant_value' at the scan. This is not the case when a UDF is involved in the filter predicate. Take the
example `SELECT col1, col2 FROM T1 WHERE lower(col1)='constant_value'`. While optimizers can easily push down the filter predicate, it cannot be used
for filtering at TableScan, simply because the underlying system is unaware of the UDF lower. As a result, we end up scanning a large number of rows.
Contributor


More accurately, while we will push down the lower into the table scan and filter at the row level, as we do any other expression, what we can't do is use this expression with range (min/max) index values. Some examples include Iceberg manifest statistics and Parquet rowgroup statistics.
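The point about range statistics can be sketched with a toy zone-map model (a minimal sketch, not Presto or Iceberg internals): min/max values are kept on the base column, so applying them naively to a predicate over `lower(col)` is unsound, while a derived column carries min/max over the transformed values.

```python
def can_prune_equals(stats_min, stats_max, constant):
    """A row group can be skipped if the constant lies outside [min, max]."""
    return constant < stats_min or constant > stats_max

# Row group stats for the base column store_and_fwd_flag: values 'N'..'Y'.
base_min, base_max = "N", "Y"

# Predicate lower(store_and_fwd_flag) = 'n': 'n' sorts after 'Y' in byte order,
# so naive pruning against the base stats would wrongly skip a group that
# actually contains 'N' (whose lowercase form matches).
print(can_prune_equals(base_min, base_max, "n"))   # True, but unsound to apply

# With a derived column for lower(store_and_fwd_flag), the stats cover the
# transformed values 'n'..'y', so pruning decisions are sound again.
derived_min, derived_max = "n", "y"
print(can_prune_equals(derived_min, derived_max, "n"))  # False: group may match
```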

1. `lower('N')` is constant folded and pushed down along with the predicate `filterPredicate = (lower(store_and_fwd_flag)) <> (VARCHAR'n')` to the Table scan node, but
table scan cannot perform this filtering as the underlying connector does not know about the UDF lower or simply does not have the ability to evaluate it on lower and upper bounds.
Contributor


The UDF can be evaluated, what can't be evaluated is this expression over lower and upper bounds, since those statistics are calculated on the base values.


1. `lower('N')` is constant folded and pushed down along with the predicate `filterPredicate = (lower(store_and_fwd_flag)) <> (VARCHAR'n')` to the Table scan node, but
table scan cannot perform this filtering as the underlying connector does not know about the UDF lower or simply does not have the ability to evaluate it on lower and upper bounds.
In a JDBC-based connector, the underlying connector may be able to evaluate the UDF if there exists a UDF with the same name in the underlying database system. Since we do not know,
Contributor


I believe this is orthogonal to JDBC connector function pushdown.

Comment thread RFC-0025-derived-column-support.md
```sql
-> "format-version" = '3',
-> location = 's3://test-bucket/perf_test/test_table9',
-> "derived-columns" = Array['c2_derived'],
-> "derived-columns.spec.udf.json" = JSON '{
```
Contributor


I think the granularity is wrong here. Right now we're scoping this to UDFs. UDFs are functions, and what's implicit is that this should be a function call. A function call is just one type of expression. I think what should be stored here instead is an expression, because that would capture the use cases better. For example, the derived column might be an IF statement, which is not a UDF.

Contributor Author


Great point!

I have interpreted this in the following two possible ways:

  1. We store the expression as a string and do a string match for derived-column replacement.
  2. We store the expression as an expression tree of some sort, compute a canonical form of the user's query expressions, and then try to match them.

While I think you are alluding to the first more than the second, the result of both is the same: the user's expression gets swapped when it is "exactly the same". In the latter approach, there is a chance of matching semantically, and even when the user's expression occurs only as a subexpression, achieving a replacement is possible.

Let's review the UDF approach again:

  1. Can we create a UDF out of any user-provided expression? If yes, then we can have a UDF containing just an IF statement and set up a derived column on it.
  2. We can achieve the rewrite even if the UDF occurs as a subexpression. For example, if we have a derived column set up on lower(c2), but our query is explain SELECT c1, c2 FROM test_table2 WHERE (concat('A', lower(c2)) ) = 'Aa'; we can still achieve a correct rewrite:
```
presto:perf_test> explain SELECT c1, c2 FROM test_table2 WHERE (concat('A', lower(c2)) ) = 'Aa';
                                        Query Plan
---------------------------------------------------------------------------------------------------
 - Output[PlanNodeId 6][c1, c2] => [c1:bigint, c2:varchar]
         Estimates: {source: CostBasedSourceInfo, rows: ? (?), cpu: 714.00, memory: 0.00, network: ?}
     - RemoteStreamingExchange[PlanNodeId 219][GATHER - COLUMNAR] => [c1:bigint, c2:varchar]
             Estimates: {source: CostBasedSourceInfo, rows: ? (?), cpu: 714.00, memory: 0.00, network: ?}
         - ScanFilter[PlanNodeId 0,252][table = TableHandle {connectorId='iceberg', connectorHandle='test_table2$data@7072973623560046354', layout='Optional[test_table2$data@7072973623560046354]'}, filterPredicate = (concat(VARCHAR'A', c2_derived)) = (VARCHAR'Aa')] => [c2:varchar, c2_derived:varchar, c1:bigint]
                 Estimates: {source: CostBasedSourceInfo, rows: 3 (357B), cpu: 357.00, memory: 0.00, network: 0.00}/{source: CostBasedSourceInfo, rows: ? (?), cpu: 714.00, memory: 0.00, network: 0.00}
                 c2 := 2:c2:varchar (1:28)
                 c2_derived := 4:c2_derived:varchar
                 c1 := 1:c1:bigint (1:28)

(1 row)

Query 20260511_125853_00003_yt5q2, FINISHED, 1 node
http://127.0.0.1:8080/ui/query.html?20260511_125853_00003_yt5q2
Splits: 1 total, 1 done (100.00%)
[Latency: client-side: 75ms, server-side: 65ms] [0 rows, 0B] [0 rows/s, 0B/s]
```

Also, with string matching a user's query may not benefit even when the only differences are in whitespace. While it's fair to call that user error, the UDF approach currently proposed does not suffer from it.

The advantages of the expression-as-string matching approach are the following:

  1. We are consistent with RDBMSs.
  2. It's easy to implement.

If consistency with RDBMSs is a strict requirement, then we should definitely support expressions. A full-fledged expression-matching implementation will be a much more elaborate effort, and the risk of missing a canonical form is still higher than with the UDF approach; if we go with string matching, we lose out on many possible matches due to mere differences in naming, canonical form, or whitespace.
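The whitespace-sensitivity trade-off above can be sketched briefly. In this illustration Python's own `ast` module stands in for a SQL expression parser (it is not Presto's analyzer): raw string comparison fails on spacing differences, while comparing parsed canonical forms does not.

```python
import ast

stored = "lower(c2)"            # expression registered for the derived column
queried = "lower( c2 )"         # same expression, written with different spacing

# Raw string match fails on the spacing difference:
print(stored == queried)        # False

# Matching on the parsed canonical form succeeds:
canonical = lambda s: ast.dump(ast.parse(s, mode="eval"))
print(canonical(stored) == canonical(queried))   # True
```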

Contributor


@ScrapCodes I don't mean we should swap in full text strings. What we can do is treat derived column expressions as views, except scoped at the expression level, whereas views are scoped at the query level. We store the raw text in the connector, same as we do for views. During analysis, we parse the text and this feeds into the planner, which then will use this to create projections where appropriate when the plan is created. It won't be full text matched, it will be matched either at the AST level, or at the plan level (my preference is plan level).

The most compelling reason to go this route is that it's simply how it's defined in the SQL standard, and we typically defer to the standard when it weighs in on something.

```
<select sublist>   ::= <derived column> | <qualified asterisk>
<derived column>   ::= <value expression> [ <as clause> ]
<as clause>        ::= [ AS ] <column name>
```

Contributor Author


My preference is also matching via generating a plan; however, that requires us to write a query with SELECT and FILTER/PROJECT and then convert it into a plan node (running with the current user's permissions), because we cannot simply run, analyze, or plan a headless expression.

Once the Plan is generated we can extract the plan node for expression and try to do matching.

Also, we will need to account for the additional time required to compute the plan, which will contribute to the overall query running time.

Contributor


Agreed, but the planning time shouldn't be an issue.


### Other Computing Engines

#### Spark and Trino
Contributor


Many relational databases support this feature. We should refer to them. For example from MariaDB: https://mariadb.com/docs/server/reference/sql-statements/data-definition/create/generated-columns

Comment on lines +105 to +111
### Table properties

The following table properties are added.

1. `derived-columns` : List of columns that are derived columns.
2. `derived-columns.spec.udf.json`: Full JSON specification for each derived column :

Contributor


We do not want the user to manually specify a JSON blob to use the derived expression. Here is how it should work, very high level:

  1. We add parser support for GENERATED ALWAYS, see example here: http://mariadb.com/docs/server/reference/sql-statements/data-definition/create/generated-columns#syntax

  2. Parse and analyze the expression

  3. During CREATE TABLE, we store the metadata (the raw SQL expression) for the derived column, to be persisted by the connector

  4. The connector returns the metadata when it retrieves the table definition

  5. This is analyzed by the query which queries the table, and there is an optimizer rule to leverage it where appropriate (very simple, exact match of expression)
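Steps 3–5 above imply a connector-managed mapping from expression to derived column. A minimal sketch of that mapping follows; `DerivedColumnRegistry` and its methods are illustrative names, not an Iceberg connector API, and the crude text normalization stands in for comparing parsed expressions.

```python
class DerivedColumnRegistry:
    """Toy stand-in for connector-persisted derived-column metadata."""

    def __init__(self):
        self._by_expression = {}

    @staticmethod
    def _canonicalize(expr: str) -> str:
        # Crude normalization for the sketch: lowercase and strip whitespace.
        # A real implementation would compare parsed expressions, not text.
        return "".join(expr.lower().split())

    def register(self, expression: str, column_name: str):
        """Persisted at CREATE TABLE time (step 3)."""
        self._by_expression[self._canonicalize(expression)] = column_name

    def lookup(self, expression: str):
        """Used by the optimizer rule for an exact match (step 5)."""
        return self._by_expression.get(self._canonicalize(expression))

registry = DerivedColumnRegistry()
registry.register("lower(c2)", "c2_derived")
print(registry.lookup("LOWER( c2 )"))   # c2_derived
```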

Contributor Author


@tdcmeehan , we will still need a mapping between "user's expression" -> "derived_column_name". Plain strings cannot hold this info for a table with more than one derived column.

Contributor


Yes--the Iceberg connector can create and manage this mapping. It need not be user-managed.

