Incremental GROUP BY _block_num and DISTINCT BY _block_num#1877
Incremental GROUP BY _block_num and DISTINCT BY _block_num#1877
GROUP BY _block_num and DISTINCT BY _block_num#1877Conversation
LNSD
left a comment
There was a problem hiding this comment.
Please, check my comments 🙂
crates/core/common/src/lib.rs
Outdated
| pub use datafusion::{arrow, parquet}; | ||
| pub use datasets_common::{block_num::BlockNum, block_range::BlockRange, end_block::EndBlock}; | ||
|
|
||
| pub mod block_num_udf; |
There was a problem hiding this comment.
Can we rename the module to just block_num? maybe we can move all the UDFs under common::udfs (e.g., common::udfs::evm::* or common::udfs::block_num)
f9ae9a5 to
ed06c86
Compare
There was a problem hiding this comment.
I meant to move common::block_num to common::udfs::block_num (from crates/core/common/src/block_num.rs to crates/core/common/src/udfs/block_num.rs)
|
The CI failure is related to the changes: https://github.com/edgeandnode/amp/actions/runs/22639732283/job/65612193391?pr=1877 |
e73514b to
3948177
Compare
|
@leoyvens, if this is still WIP, could you mark the PR as a draft? |
|
@LNSD yes CI is failing for legit reasons, marked as draft |
Thanks. The issue was that I was flooded with notifications to review the PR, and it wasn't ready. Using the "draft mode" can help better signal when someone wants the PR to be reviewed. |
64f28c7 to
a0ab4ad
Compare
|
@Theodus This now ready for review |
Signed-off-by: Leonardo Yvens <leoyvens@gmail.com>
Allow GROUP BY queries that include _block_num as a group key to work with incremental processing instead of being rejected. - Handle Aggregate in BlockNumPropagator by setting next_block_num_expr - Remove Aggregate from the unsupported-node error arm Signed-off-by: Leonardo Yvens <leoyvens@gmail.com>
Signed-off-by: Leo <leo@edgeandnode.com>
…gation Adds a `block_num()` sentinel UDF that lets users explicitly request the propagated `_block_num` value in projections and DISTINCT ON expressions, particularly in join contexts where the bare `_block_num` column would be ambiguous. Key changes: - Register BlockNumUdf in builtin_udfs() in session_state - BlockNumPropagator now replaces block_num() UDF with the correct greatest(left._block_num, right._block_num) expression from the join - forbid_underscore_prefixed_aliases enhanced to check all node types (not just Projection) and to reject bare _block_num in multi-table projections - incremental_op_kind uses expr_outputs_block_num for Aggregate/Distinct::On first-key checks, accepting post-propagation expressions derived from _block_num
… not input qualifiers
…rojection; update tests
…wildcard hint Signed-off-by: Leo <leo@edgeandnode.com>
a0ab4ad to
bdca7e3
Compare
bdca7e3 to
85f869c
Compare
This implements two related features:
DISTINCT ONandGROUP BY, when the key is_block_numblock_num()to more easily refer to the block number anywhere in the query.Our execution semantics essentially already support the special case where an aggregation is restricted to within a block, thanks to the assumption that data for a same block never spans more than a single microbatch. So executing the aggregation in isolation on each microbatch yields correct results. The necessary changes were around incremental query validation checks and block number propagation.