RFC-0025 Derived column#61

Open
ScrapCodes wants to merge 1 commit into prestodb:main from ScrapCodes:derived-column-spec
Conversation

@ScrapCodes
Contributor

@ScrapCodes ScrapCodes commented Apr 24, 2026

Currently WIP

@prestodb-ci prestodb-ci added the from:IBM PRs from IBM label Apr 24, 2026
@prestodb-ci prestodb-ci requested review from a team, BryanCutler and infvg and removed request for a team April 24, 2026 07:35
@ScrapCodes ScrapCodes marked this pull request as draft April 24, 2026 07:35
@ScrapCodes ScrapCodes force-pushed the derived-column-spec branch 4 times, most recently from 474cf06 to 221e8c0 Compare April 24, 2026 11:37
@ScrapCodes ScrapCodes force-pushed the derived-column-spec branch from 221e8c0 to 8690c4e Compare May 4, 2026 09:02
@ScrapCodes ScrapCodes changed the title [WIP] RFC-0025 Derived column RFC-0025 Derived column May 4, 2026
@ScrapCodes ScrapCodes marked this pull request as ready for review May 4, 2026 09:58
@prestodb-ci prestodb-ci requested review from a team, infvg and wanglinsong and removed request for a team May 4, 2026 09:58
@ScrapCodes ScrapCodes force-pushed the derived-column-spec branch 3 times, most recently from 3bdd1ef to 2448a60 Compare May 4, 2026 12:23
```json
"udfSpecList": [
  {
    "functionName": "_UDF_1",
    "params": [
```
Contributor


Why is params separated from arguments? Are you using any special Iceberg or Presto UDF format here?

Contributor Author

@ScrapCodes ScrapCodes May 5, 2026


@aditi-pandit , the way we can identify a UDF in Presto is:
catalog.schema.function_name and its param types.

Once the UDF is found, the next step in checking it for replacement is that its arguments should match too.

Also, once the Iceberg UDF spec is implemented, we should be able to uniquely identify and load a UDF by the same search, i.e. function name (id) and params.

Interestingly, their entire UDF metadata spec does not store a function name :)

@jja725 jja725 self-requested a review May 5, 2026 18:25
@ScrapCodes ScrapCodes force-pushed the derived-column-spec branch from 2448a60 to 8c6c4a4 Compare May 6, 2026 16:39

@jja725 jja725 left a comment


Agree that how writes would work is the main concern here, along with compatibility with all the engines.


1. Load the UDF from UDF spec provided in table properties.
2. Scan the filter predicate and look for the UDF
3. If the UDF defined in the spec is found, rewrite the filter predicate by replacing the UDF call with the derived column.
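The three steps above can be sketched over a toy expression tree. This is a minimal sketch: `ColumnRef`, `UdfCall`, `Comparison`, and `rewrite_predicate` are illustrative names, not Presto APIs, and the spec dict merely mirrors the shape of the `udfSpecList` JSON quoted earlier.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ColumnRef:
    name: str

@dataclass(frozen=True)
class UdfCall:
    function_name: str
    args: tuple

@dataclass(frozen=True)
class Comparison:
    op: str
    left: object
    right: object

def rewrite_predicate(node, spec):
    """Recursively replace UDF calls matching the spec with the derived column."""
    if isinstance(node, UdfCall) and node.function_name == spec["functionName"] \
            and [a.name for a in node.args] == spec["params"]:
        # step 3: swap the matching UDF call for the derived column reference
        return ColumnRef(spec["derivedColumn"])
    if isinstance(node, Comparison):
        # step 2: scan the filter predicate looking for the UDF
        return Comparison(node.op,
                          rewrite_predicate(node.left, spec),
                          rewrite_predicate(node.right, spec))
    return node

# step 1: the spec as loaded from table properties (illustrative shape)
spec = {"functionName": "lower", "params": ["c2"], "derivedColumn": "c2_derived"}
pred = Comparison("=", UdfCall("lower", (ColumnRef("c2"),)), "constant_value")
print(rewrite_predicate(pred, spec))
```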


feel like we can also rewrite the projection

Contributor Author

@ScrapCodes ScrapCodes May 7, 2026


Yes, we can rewrite projections too. However, the performance gain is negligible in most cases. There may be some cases where this would be helpful, e.g. when the UDF itself performs a very compute-intensive operation.
Will try to cover this, thanks for suggesting!


Alter table syntax
```sql
ALTER TABLE table_name {ADD | ALTER} COLUMN column_name data_type [DERIVED {[udf("column_name1")] | [SQL EXPRESSION]}]
```


do we need versioning for UDF?

@ScrapCodes ScrapCodes force-pushed the derived-column-spec branch from 8c6c4a4 to 03935be Compare May 7, 2026 16:11
@ScrapCodes
Contributor Author

@tdcmeehan has volunteered to be a co-author ! Yay!

@ScrapCodes
Contributor Author

Here is a PR https://github.com/prestodb/presto/pull/27746/changes with a proof of concept!

A compute engine like Presto can easily push down a filter predicate, e.g. `SELECT col1, col2 FROM T1 WHERE col1='constant_value'`; this allows pruning the number of
rows required for TableScan by applying the filter WHERE col1='constant_value' at the scan. This is not the case when a UDF is involved in the filter predicate. Take the
example `SELECT col1, col2 FROM T1 WHERE lower(col1)='constant_value'`. While optimizers can easily push down the filter predicate, it cannot be used
for filtering at TableScan, simply because the underlying system is unaware of the UDF lower. As a result, we end up scanning a large number of rows.
Contributor


More accurately, while we will push down the lower into the table scan and filter at the row level, as we do any other expression, what we can't do is use this expression with range (min/max) index values. Some examples include Iceberg manifest statistics and Parquet rowgroup statistics.
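The point about range statistics can be sketched with a toy zone-map model (a minimal sketch, not Presto or Iceberg internals): min/max values are kept on the base column, so applying them naively to a predicate over `lower(col)` is unsound, while a derived column carries min/max over the transformed values.

```python
def can_prune_equals(stats_min, stats_max, constant):
    """A row group can be skipped if the constant lies outside [min, max]."""
    return constant < stats_min or constant > stats_max

# Row group stats for the base column store_and_fwd_flag: values 'N'..'Y'.
base_min, base_max = "N", "Y"

# Predicate lower(store_and_fwd_flag) = 'n': 'n' sorts after 'Y' in byte order,
# so naive pruning against the base stats would wrongly skip a group that
# actually contains 'N' (whose lowercase form matches).
print(can_prune_equals(base_min, base_max, "n"))   # True, but unsound to apply

# With a derived column for lower(store_and_fwd_flag), the stats cover the
# transformed values 'n'..'y', so pruning decisions are sound again.
derived_min, derived_max = "n", "y"
print(can_prune_equals(derived_min, derived_max, "n"))  # False: group may match
```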

1. `lower('N')` is constant folded and pushed down along with the predicate `filterPredicate = (lower(store_and_fwd_flag)) <> (VARCHAR'n')` to the Table scan node, but
table scan cannot perform this filtering as the underlying connector does not know about the UDF lower or simply does not have the ability to evaluate it on lower and upper bounds.
Contributor


The UDF can be evaluated, what can't be evaluated is this expression over lower and upper bounds, since those statistics are calculated on the base values.


1. `lower('N')` is constant folded and pushed down along with the predicate `filterPredicate = (lower(store_and_fwd_flag)) <> (VARCHAR'n')` to the Table scan node, but
table scan cannot perform this filtering as the underlying connector does not know about the UDF lower or simply does not have the ability to evaluate it on lower and upper bounds.
In a JDBC-based connector, the underlying connector may be able to evaluate the UDF if there exists a UDF with the same name in the underlying database system. Since we do not know,
Contributor


I believe this is orthogonal to JDBC connector function pushdown.

Comment thread RFC-0025-derived-column-support.md
```sql
-> "format-version" = '3',
-> location = 's3://test-bucket/perf_test/test_table9',
-> "derived-columns" = Array['c2_derived'],
-> "derived-columns.spec.udf.json" = JSON '{
```
Contributor


I think the granularity is wrong here. Right now we're scoping this to UDFs. UDFs are functions, and what's implicit is that this should be a function call. A function call is just one type of expression. I think what should be stored here instead is an expression, because that would capture the use cases better. For example, the derived column might be an IF statement, which is not a UDF.

Contributor Author


Great point!

I have interpreted this in the following two possible ways:

  1. We store the expression as a string and do a string match for derived-column replacement.
  2. We store the expression as an expression tree of some sort, compute a canonical form of the user's query expressions, and then try to match them.

While I think you are alluding to the first more than the second, the result of both is the same: the user's expression gets swapped when it is "exactly the same". In the latter approach, there is a chance of matching semantically, and even when the user's expression occurs only as a subexpression, achieving a replacement is possible.

Let's review the UDF approach again:

  1. Can we create a UDF out of any user-provided expression? If yes, then we can have a UDF containing just an IF statement and set up a derived column on it.
  2. We can achieve the rewrite even if the UDF occurs as a subexpression. For example, if we have a derived column set up on lower(c2), but our query is explain SELECT c1, c2 FROM test_table2 WHERE (concat('A', lower(c2)) ) = 'Aa'; we can still achieve a correct rewrite:
```
presto:perf_test> explain SELECT c1, c2 FROM test_table2 WHERE (concat('A', lower(c2)) ) = 'Aa';
                                        Query Plan
---------------------------------------------------------------------------------------------------
 - Output[PlanNodeId 6][c1, c2] => [c1:bigint, c2:varchar]
         Estimates: {source: CostBasedSourceInfo, rows: ? (?), cpu: 714.00, memory: 0.00, network: ?}
     - RemoteStreamingExchange[PlanNodeId 219][GATHER - COLUMNAR] => [c1:bigint, c2:varchar]
             Estimates: {source: CostBasedSourceInfo, rows: ? (?), cpu: 714.00, memory: 0.00, network: ?}
         - ScanFilter[PlanNodeId 0,252][table = TableHandle {connectorId='iceberg', connectorHandle='test_table2$data@7072973623560046354', layout='Optional[test_table2$data@7072973623560046354]'}, filterPredicate = (concat(VARCHAR'A', c2_derived)) = (VARCHAR'Aa')] => [c2:varchar, c2_derived:varchar, c1:bigint]
                 Estimates: {source: CostBasedSourceInfo, rows: 3 (357B), cpu: 357.00, memory: 0.00, network: 0.00}/{source: CostBasedSourceInfo, rows: ? (?), cpu: 714.00, memory: 0.00, network: 0.00}
                 c2 := 2:c2:varchar (1:28)
                 c2_derived := 4:c2_derived:varchar
                 c1 := 1:c1:bigint (1:28)

(1 row)

Query 20260511_125853_00003_yt5q2, FINISHED, 1 node
http://127.0.0.1:8080/ui/query.html?20260511_125853_00003_yt5q2
Splits: 1 total, 1 done (100.00%)
[Latency: client-side: 75ms, server-side: 65ms] [0 rows, 0B] [0 rows/s, 0B/s]
```

Also, with string matching a user's query may not benefit even when the only differences are in whitespace. While it's fair to call that user error, the UDF approach currently proposed does not suffer from it.

The advantages of the expression-as-string matching approach are the following:

  1. We are consistent with RDBMSs.
  2. It's easy to implement.

If consistency with RDBMSs is a strict requirement, then we should definitely support expressions. A full-fledged expression-matching implementation will be a much more elaborate effort, and the risk of missing a canonical form is still higher than with the UDF approach; if we go with string matching, we lose out on many possible matches due to mere differences in naming, canonical form, or whitespace.
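The whitespace-sensitivity trade-off above can be sketched briefly. In this illustration Python's own `ast` module stands in for a SQL expression parser (it is not Presto's analyzer): raw string comparison fails on spacing differences, while comparing parsed canonical forms does not.

```python
import ast

stored = "lower(c2)"            # expression registered for the derived column
queried = "lower( c2 )"         # same expression, written with different spacing

# Raw string match fails on the spacing difference:
print(stored == queried)        # False

# Matching on the parsed canonical form succeeds:
canonical = lambda s: ast.dump(ast.parse(s, mode="eval"))
print(canonical(stored) == canonical(queried))   # True
```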

Contributor


@ScrapCodes I don't mean we should swap in full text strings. What we can do is treat derived column expressions as views, except scoped at the expression level, whereas views are scoped at the query level. We store the raw text in the connector, same as we do for views. During analysis, we parse the text and this feeds into the planner, which then will use this to create projections where appropriate when the plan is created. It won't be full text matched, it will be matched either at the AST level, or at the plan level (my preference is plan level).

The most compelling reason to go this route is that it's simply how it's defined in the SQL standard, and we typically defer to the standard when it weighs in on something.

```
<select sublist>   ::= <derived column> | <qualified asterisk>
<derived column>   ::= <value expression> [ <as clause> ]
<as clause>        ::= [ AS ] <column name>
```

Contributor Author


My preference is also matching via generating a plan; however, that requires us to write a query with SELECT and FILTER/PROJECT and then convert it into a plan node (running with the current user's permissions), because we cannot simply run, analyze, or plan a headless expression.

Once the Plan is generated we can extract the plan node for expression and try to do matching.

Also, we will need to account for the additional time required to compute the plan, which will contribute to the overall query running time.

Contributor


Agreed, but the planning time shouldn't be an issue.


### Other Computing Engines

#### Spark and Trino
Contributor


Many relational databases support this feature. We should refer to them. For example from MariaDB: https://mariadb.com/docs/server/reference/sql-statements/data-definition/create/generated-columns

Comment on lines +105 to +111
### Table properties

The following table properties are added.

1. `derived-columns` : List of columns that are derived columns.
2. `derived-columns.spec.udf.json`: Full JSON specification for each derived column :

Contributor


We do not want the user to manually specify a JSON blob to use the derived expression. Here is how it should work, very high level:

  1. We add parser support for GENERATED ALWAYS, see example here: http://mariadb.com/docs/server/reference/sql-statements/data-definition/create/generated-columns#syntax

  2. Parse and analyze the expression

  3. During CREATE TABLE, we store the metadata (the raw SQL expression) for the derived column, to be persisted by the connector

  4. The connector returns the metadata when it retrieves the table definition

  5. This is analyzed by the query which queries the table, and there is an optimizer rule to leverage it where appropriate (very simple, exact match of expression)
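Steps 3–5 above imply a connector-managed mapping from expression to derived column. A minimal sketch of that mapping follows; `DerivedColumnRegistry` and its methods are illustrative names, not an Iceberg connector API, and the crude text normalization stands in for comparing parsed expressions.

```python
class DerivedColumnRegistry:
    """Toy stand-in for connector-persisted derived-column metadata."""

    def __init__(self):
        self._by_expression = {}

    @staticmethod
    def _canonicalize(expr: str) -> str:
        # Crude normalization for the sketch: lowercase and strip whitespace.
        # A real implementation would compare parsed expressions, not text.
        return "".join(expr.lower().split())

    def register(self, expression: str, column_name: str):
        """Persisted at CREATE TABLE time (step 3)."""
        self._by_expression[self._canonicalize(expression)] = column_name

    def lookup(self, expression: str):
        """Used by the optimizer rule for an exact match (step 5)."""
        return self._by_expression.get(self._canonicalize(expression))

registry = DerivedColumnRegistry()
registry.register("lower(c2)", "c2_derived")
print(registry.lookup("LOWER( c2 )"))   # c2_derived
```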

Contributor Author


@tdcmeehan , we will still need a mapping between "user's expression" -> "derived_column_name". Plain strings cannot hold this info for a table with more than one derived column.

Contributor


Yes--the Iceberg connector can create and manage this mapping. It need not be user-managed.

