RFC-0025 Derived column #61
> ```
> "udfSpecList": [
>   {
>     "functionName": "_UDF_1",
>     "params": [
> ```
Why is `params` separated from arguments? Are you using any special Iceberg or Presto UDF format here?
@aditi-pandit, the way we can identify a UDF in Presto is by catalog.schema.function_name and its parameter types.
Once the UDF is found, the next step in checking it for replacement is that its arguments should match too.
Also, once the Iceberg UDF spec is implemented, we should be able to uniquely identify and load a UDF by the same search, i.e. function name (id) and params.
Interestingly, their entire UDF metadata spec does not store a function name :)
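To make the lookup described above concrete, here is a minimal sketch (toy registry and names are hypothetical, not the actual Presto resolution code) of identifying a UDF by its fully qualified name plus parameter types, so overloads resolve to distinct entries:

```python
# Hypothetical sketch: a UDF is identified by catalog.schema.function_name
# plus the types of its parameters; overloads map to different keys.

def udf_key(catalog, schema, name, param_types):
    """Build the lookup key for a UDF: qualified name + parameter types."""
    return (f"{catalog}.{schema}.{name}", tuple(param_types))

# Toy registry keyed the same way the spec lookup would be.
registry = {
    udf_key("hive", "default", "lower", ["varchar"]): "_UDF_1",
}

def find_udf(catalog, schema, name, param_types):
    """Return the spec id if the UDF (including this overload) is registered."""
    return registry.get(udf_key(catalog, schema, name, param_types))

print(find_udf("hive", "default", "lower", ["varchar"]))  # _UDF_1
print(find_udf("hive", "default", "lower", ["bigint"]))   # None (different overload)
```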
jja725 left a comment:
Agree that how writes would work is the main concern here, along with compatibility with all the engines.
> 1. Load the UDF from the UDF spec provided in table properties.
> 2. Scan the filter predicate and look for the UDF.
> 3. If the UDF defined in the spec is found, rewrite the filter predicate by replacing the UDF call with the derived column.
Feel like we can also rewrite the projection.
Yes, we can rewrite projections too. However, the performance gain is negligible in most cases. There may be some cases where this would be helpful, e.g. when the UDF itself does a very compute-intensive operation.
Will try to cover this, thanks for suggesting!
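The rewrite step discussed above (replace a matched UDF call with a reference to the derived column) can be sketched over a tiny expression tree. The node shapes (`"call"`, `"column"`, `"literal"`) are invented for illustration; this is not Presto's IR:

```python
# Hypothetical sketch of step 3: walk an expression tree and replace any
# call matching the UDF from the spec with a derived-column reference.
# Because the walk recurses into arguments, the UDF is rewritten even
# when it occurs as a subexpression of a larger predicate.

def rewrite(expr, udf_name, derived_col):
    if expr[0] == "call":
        _, name, args = expr
        args = [rewrite(a, udf_name, derived_col) for a in args]
        if name == udf_name:
            # Replace the matched UDF call with the derived column.
            return ("column", derived_col)
        return ("call", name, args)
    return expr  # columns and literals pass through unchanged

# WHERE lower(store_and_fwd_flag) <> 'n'
pred = ("call", "neq", [("call", "lower", [("column", "store_and_fwd_flag")]),
                        ("literal", "n")])
print(rewrite(pred, "lower", "c2_derived"))
# ('call', 'neq', [('column', 'c2_derived'), ('literal', 'n')])
```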
> Alter table syntax
> ```sql
> ALTER TABLE table_name {ADD | ALTER} COLUMN column_name data_type [DERIVED {[udf("column_name1")] | [SQL EXPRESSION]}]
> ```
@tdcmeehan has volunteered to be a co-author! Yay!
Here is a PR https://github.com/prestodb/presto/pull/27746/changes with a proof of concept!
> A compute engine like Presto can easily push down a filter predicate, e.g. `SELECT col1, col2 FROM table T1 WHERE col1='constant_value'`; this allows pruning the number of rows required for the TableScan by applying the filter `WHERE col1='constant_value'`. This is not true when a UDF is involved in the filter predicate; take the example `SELECT col1, col2 FROM table T1 WHERE lower(col1)='constant_value'`. While optimizers can easily push down the filter predicate, it cannot be used for filtering at the TableScan, simply because the underlying system is unaware of the UDF lower. As a result, we end up scanning a large number of rows.
More accurately, while we will push down the lower into the table scan and filter at the row level, as we do any other expression, what we can't do is use this expression with range (min/max) index values. Some examples include Iceberg manifest statistics and Parquet rowgroup statistics.
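The point about range statistics can be sketched with a toy row-group pruner (invented names; stats values are illustrative, not from any real file): min/max are kept on the base column values, so a plain equality predicate can prune, while the same check applied to a UDF-wrapped predicate would be incorrect.

```python
# Sketch of why min/max statistics help plain predicates but not
# UDF-wrapped ones: the stats describe the raw column values.

rowgroup_stats = {"min": "M", "max": "S"}  # stats on the raw column values

def can_skip_eq(stats, constant):
    """Prune the row group when the constant falls outside [min, max]."""
    return not (stats["min"] <= constant <= stats["max"])

# col = 'N': 'N' lies within ['M', 'S'], so the group must be read.
print(can_skip_eq(rowgroup_stats, "N"))  # False

# For lower(col) = 'n' the same check over raw stats would be wrong:
# 'n' sorts above 'S', suggesting a skip, yet the group may contain 'N'.
print(can_skip_eq(rowgroup_stats, "n"))  # True -- an incorrect skip
```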
> 1. `lower('N')` is constant folded and pushed down along with the predicate `filterPredicate = (lower(store_and_fwd_flag)) <> (VARCHAR'n')` to the table scan node, but the table scan cannot perform this filtering as the underlying connector does not know about the UDF lower or simply does not have the ability to evaluate it on lower and upper bounds.
The UDF can be evaluated, what can't be evaluated is this expression over lower and upper bounds, since those statistics are calculated on the base values.
> 1. `lower('N')` is constant folded and pushed down along with the predicate `filterPredicate = (lower(store_and_fwd_flag)) <> (VARCHAR'n')` to the table scan node, but the table scan cannot perform this filtering as the underlying connector does not know about the UDF lower or simply does not have the ability to evaluate it on lower and upper bounds. In a JDBC-based connector, the underlying connector may be able to evaluate the UDF if there exists a UDF with the same name in the underlying database system. Since we do not know,
I believe this is orthogonal to JDBC connector function pushdown.
> ```
> -> "format-version" = '3',
> -> location = 's3://test-bucket/perf_test/test_table9',
> -> "derived-columns" = Array['c2_derived'],
> -> "derived-columns.spec.udf.json" = JSON '{
> ```
I think the granularity is wrong here. Right now we're scoping this to UDFs. UDFs are functions, and what's implicit is that this should be a function call. A function call is just one type of expression. I think what should be stored here instead is an expression, because that would capture the use cases better. For example, the derived column might be an IF statement, which is not a UDF.
Great point!
I have interpreted this in the following two possible ways:
- We store the expression as a string and do a string match for derived column replacement.
- We store the expression as an expression tree of some sort, compute a canonical form of the user query's expressions, and then try to match them.

While I think you are alluding to the first point more than the second, the result of both is the same: the user's expressions get swapped when they are "exactly the same". In the latter approach, there is a chance of checking semantically, and achieving a replacement is possible even if the user's expression occurs only as a subexpression.
Let's review the UDF approach again:
- Can we create a UDF out of any user-provided expression? If yes, then we can have a UDF with just an IF statement and set up a derived column for it.
- We can achieve the rewrite even if the UDF occurs as a subexpression. For example, if we have a derived column set up on `lower(c2)` but our query is `SELECT c1, c2 FROM test_table2 WHERE (concat('A', lower(c2))) = 'Aa'`, we can still achieve the correct rewrite:
```
presto:perf_test> explain SELECT c1, c2 FROM test_table2 WHERE (concat('A', lower(c2)) ) = 'Aa';
                                       Query Plan
-------------------------------------------------------------------------------------------
- Output[PlanNodeId 6][c1, c2] => [c1:bigint, c2:varchar]
        Estimates: {source: CostBasedSourceInfo, rows: ? (?), cpu: 714.00, memory: 0.00, network: ?}
    - RemoteStreamingExchange[PlanNodeId 219][GATHER - COLUMNAR] => [c1:bigint, c2:varchar]
            Estimates: {source: CostBasedSourceInfo, rows: ? (?), cpu: 714.00, memory: 0.00, network: ?}
        - ScanFilter[PlanNodeId 0,252][table = TableHandle {connectorId='iceberg', connectorHandle='test_table2$data@7072973623560046354', layout='Optional[test_table2$data@7072973623560046354]'}, filterPredicate = (concat(VARCHAR'A', c2_derived)) = (VARCHAR'Aa')] => [c2:varchar, c2_derived:varchar, c1:bigint]
                Estimates: {source: CostBasedSourceInfo, rows: 3 (357B), cpu: 357.00, memory: 0.00, network: 0.00}/{source: CostBasedSourceInfo, rows: ? (?), cpu: 714.00, memory: 0.00, network: 0.00}
                c2 := 2:c2:varchar (1:28)
                c2_derived := 4:c2_derived:varchar
                c1 := 1:c1:bigint (1:28)
(1 row)

Query 20260511_125853_00003_yt5q2, FINISHED, 1 node
http://127.0.0.1:8080/ui/query.html?20260511_125853_00003_yt5q2
Splits: 1 total, 1 done (100.00%)
[Latency: client-side: 75ms, server-side: 65ms] [0 rows, 0B] [0 rows/s, 0B/s]
```
Also, with string matching a user's query may not benefit even if the only differences are in the spaces used while writing the expression. While it's fair to say that is user error, the UDF approach currently proposed does not suffer from it.
The advantages of expression and string matching are the following:
- We are consistent with RDBMSs!
- It's easy to implement.

If consistency with RDBMSs is a strict requirement, then we should definitely support expressions. Full-fledged expression matching will be a much more elaborate effort to implement, and the risk of missing a canonical form is still higher than with the UDF approach; if we go with string matching, we lose out on many possible matches due to mere differences in naming/canonical form/spaces.
@ScrapCodes I don't mean we should swap in full text strings. What we can do is treat derived column expressions as views, except scoped at the expression level, whereas views are scoped at the query level. We store the raw text in the connector, same as we do for views. During analysis, we parse the text and this feeds into the planner, which then will use this to create projections where appropriate when the plan is created. It won't be full text matched, it will be matched either at the AST level, or at the plan level (my preference is plan level).
The most compelling reason to go this route is that it's simply how it's defined in the SQL standard, and we typically defer to the standard when it weighs in on something.
```
<select sublist> ::= <derived column> | <qualified asterisk>
<derived column> ::= <value expression> [ <as clause> ]
<as clause> ::= [ AS ] <column name>
```
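The whitespace-insensitivity of AST-level matching (as opposed to full-text matching) can be sketched with Python's own parser standing in for a SQL expression parser; this is a toy illustration, not Presto code:

```python
# Two expression texts that differ only in whitespace parse to identical
# ASTs, so an AST-level match succeeds where raw string comparison fails.
import ast

def same_expr(a, b):
    """Compare two expressions structurally via their parsed ASTs."""
    return ast.dump(ast.parse(a, mode="eval")) == ast.dump(ast.parse(b, mode="eval"))

stored = "lower(c2)"       # expression stored for the derived column
query  = "lower( c2 )"     # same expression, different spacing

print(stored == query)           # False: raw text differs
print(same_expr(stored, query))  # True: ASTs are identical
```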
My preference is also matching via the generated plan; however, it will require us to write a query with SELECT and FILTER/PROJECT and then convert it into a plan node (running with the current user's permissions), because we cannot simply run/analyze/plan a headless expression.
Once the plan is generated, we can extract the plan node for the expression and try to do the matching.
Also, we will need to account for the additional time required to compute the plan, which will contribute to the overall query running time.
Agreed, but the planning time shouldn't be an issue.
> ### Other Computing Engines
>
> #### Spark and Trino
Many relational databases support this feature. We should refer to them. For example from MariaDB: https://mariadb.com/docs/server/reference/sql-statements/data-definition/create/generated-columns
> ### Table properties
>
> The following table properties are added.
>
> 1. `derived-columns`: List of columns that are derived columns.
> 2. `derived-columns.spec.udf.json`: Full JSON specification for each derived column.
We do not want the user to manually specify a JSON blob to use the derived expression. Here is how it should work, very high level:

- We add parser support for GENERATED ALWAYS, see example here: http://mariadb.com/docs/server/reference/sql-statements/data-definition/create/generated-columns#syntax
- Parse and analyze the expression
- During CREATE TABLE, we store metadata (the raw SQL expression) for the derived column to be persisted by the connector
- The connector returns the metadata when it retrieves the table definition
- This is analyzed by the query which queries the table, and there is an optimizer rule to leverage it where appropriate (very simple, exact match of expression)
@tdcmeehan, we will still need a mapping between "user's expression" -> "derived_column_name". Plain strings cannot hold this info for a table with more than one derived column.
Yes, the Iceberg connector can create and manage this mapping. It need not be user-managed.
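Such a connector-managed mapping could look like the sketch below (all names and the metadata shape are hypothetical; the real layout would be defined by the Iceberg connector):

```python
# Hypothetical sketch: the connector persists
# (expression text -> derived column name) alongside the table metadata,
# keeping multiple derived columns unambiguous without user management.

table_metadata = {
    "name": "test_table2",
    "derived_columns": {
        # expression text (as stored at CREATE TABLE time) -> column name
        "lower(c2)": "c2_derived",
        "upper(c3)": "c3_derived",
    },
}

def derived_column_for(metadata, expr_text):
    """Look up which derived column materializes a given expression."""
    return metadata["derived_columns"].get(expr_text)

print(derived_column_for(table_metadata, "lower(c2)"))  # c2_derived
print(derived_column_for(table_metadata, "lower(c9)"))  # None
```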
Currently WIP