perf(join): build hash on the smaller side for radix inner joins#251
Merged
Conversation
Add ray_join_no_build_swap (bool knob) and ray_join_build_swaps (uint64_t counter) to src/ops/join.c, with extern declarations in src/ops/internal.h next to the existing ray_opt_no_group_pushdown pair. Add test/test_join_buildside.c: - jb_table1/jb_inner_join helpers that exercise the two-table C-API join graph shape (ray_const_table for each side, ray_scan per key, ray_join, ray_optimize, ray_execute) - jb_results_equal multiset comparator (sort-permutation + cell-by-cell) - test_jb_baseline_radix_inner: right=70536 rows (> RAY_PARALLEL_THRESHOLD), runs join twice with knob on/off; today both paths are identical so the diff passes trivially — pins the harness for Task 2 Register join_buildside_entries in test/main.c. No behavior change. Suite: 3433/3435 passed (2 pre-existing skips, 0 failed).
…eaks Replace file-scope jb_cmp_cols/jb_cmp_ncols + jb_row_cmp with a self-contained iterative bottom-up merge sort (jb_sort_rows) that threads cols/ncols as parameters — no globals, O(n log n). Use RAY_PARALLEL_THRESHOLD (visible via ops/ops.h) instead of bare literal in the fixture. All allocations in jb_results_equal freed on every exit path via a single goto cleanup label.
…ar-equal, no-match
…age unchanged all_match: left 50→1 (right=T+2000 unchanged; HT-grow + swap still fire) near_equal_no_swap: n T+5000→T+1000, key mod 1000→10000 (both sides equal and >threshold; swap still suppressed; output rows 4.4M→443k) many_to_many: left 2000→500 (right=T+5000 unchanged; m:n fanout + swap fire) join_buildside group wall time: 43.9s → 13.4s; suite 3441/3443 (2 skipped, 0 failed).
Add test/rfl/ops/join_buildside.rfl: 2000-row left × 70000-row right inner-join on modular keys (% 1000), producing 140000 output rows and a payload sum of 4899930000. Both assertions are order-insensitive (count and sum) because the radix path does not guarantee row order. The right-table size (>65536) triggers the radix path; left<right satisfies the swap condition, so the build hash is on the left side. Existing rfl/integration/joins and rfl/ops/join_branch_cov (chained path) are unchanged and green.
bench/join_buildside/main.c: 3-case perf gate (WIN/CONTROL/MANY-TO-MANY), 9 reps interleaved swap vs legacy, CLOCK_MONOTONIC around ray_execute, mechanism assert on ray_join_build_swaps counter. Sanitizer-free (nm | grep -ci asan = 0). Makefile target bench-join-buildside mirrors bench-idx-route flags. bench-join-buildside added to .gitignore. Raw results in bench/bottleneck/join_buildside_compare.md.
…on-scaling probe 15-rep re-run at load 1.70 (vs 4.8–5.9 in round 1). Adds HEAVY-DUP-WIN case (right=10M key i%1000 → 10000 dup/key, left=10K key i%1000) alongside existing WIN. Reports median+min per case. Appends ROUND 2 section to join_buildside_compare.md.
Update joins.md and architecture/pipeline.md to reflect that the radix-parallel inner-join path selects the build side at runtime using actual materialized row counts (smaller input = build side). Note LEFT/FULL/ANTI and the small-input (chained) path always build on the right. Note that inner-join output order on the radix path is partition- and thread-dependent and not guaranteed stable. Note that build-side selection helps most when the larger side has many rows per key and is neutral-to-small on near-unique keys. No stale "always build right" hard-claim was found in either doc; the prior text named no build side at all, so this is purely additive.
… cardinality parity
4 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Build the radix-parallel inner-join hash table on the smaller input (chosen by actual materialized row counts at exec time) instead of always on the right. Implemented as a build/probe role indirection inside the radix block; the radix worker functions are unchanged, output
(probe_idx, build_idx)pairs are remapped to logical(left_idx, right_idx)at consolidation. INNER joins only — LEFT/FULL/ANTI and the small-input chained path always build on the right (the chained path is byte-untouched; its left-index output order is a tested contract).Optimizer roadmap sub-project 4, piece 1 of 4 (the "join optimization" slice).
Perf (gate verdict: KEEP)
The win tracks large-side key duplication, not size — radix partitioning already makes per-partition builds cache-resident, so the benefit is avoiding an O(n×dup) build on a heavily-duplicated large side:
No regression on near-equal sizes (swap correctly doesn't fire). Full data + the duplication-scaling analysis in `bench/bottleneck/join_buildside_compare.md`.
Test Plan