Skip to content

perf(join): build hash on the smaller side for radix inner joins#251

Merged
singaraiona merged 12 commits into
masterfrom
perf/join-build-side
Jun 13, 2026
Merged

perf(join): build hash on the smaller side for radix inner joins#251
singaraiona merged 12 commits into
masterfrom
perf/join-build-side

Conversation

@singaraiona

Copy link
Copy Markdown
Collaborator

Summary

Build the radix-parallel inner-join hash table on the smaller input (chosen by actual materialized row counts at exec time) instead of always on the right. Implemented as a build/probe role indirection inside the radix block; the radix worker functions are unchanged, output (probe_idx, build_idx) pairs are remapped to logical (left_idx, right_idx) at consolidation. INNER joins only — LEFT/FULL/ANTI and the small-input chained path always build on the right (the chained path is byte-untouched; its left-index output order is a tested contract).

Optimizer roadmap sub-project 4, piece 1 of 4 (the "join optimization" slice).

Perf (gate verdict: KEEP)

The win tracks large-side key duplication, not size — radix partitioning already makes per-partition builds cache-resident, so the benefit is avoiding an O(n×dup) build on a heavily-duplicated large side:

  • ~10 rows/key (near-unique / PK joins): swap loses ~4% best-case
  • ~100 rows/key:
  • ~10,000 rows/key (fact×dimension): ~8× (9.8s → 1.2s)

No regression on near-equal sizes (swap correctly doesn't fire). Full data + the duplication-scaling analysis in `bench/bottleneck/join_buildside_compare.md`.

Test Plan

  • Differential C tests (`test_join_buildside.c`): swap-enabled vs knob-forced-no-swap, multiset equality (radix order isn't a contract) — baseline, swap-fires, m:n, no-match, all-match, null keys, multi-key, two no-swap controls
  • rfl end-to-end large inner join (order-insensitive)
  • Existing chained-path order-pinned joins.rfl stays byte-green
  • Full suite green under ASan+UBSan (3442/3444, 2 pre-existing skips)
  • Release build warning-free; bench binary sanitizer-free

Add ray_join_no_build_swap (bool knob) and ray_join_build_swaps (uint64_t
counter) to src/ops/join.c, with extern declarations in src/ops/internal.h
next to the existing ray_opt_no_group_pushdown pair.

Add test/test_join_buildside.c:
- jb_table1/jb_inner_join helpers that exercise the two-table C-API join
  graph shape (ray_const_table for each side, ray_scan per key, ray_join,
  ray_optimize, ray_execute)
- jb_results_equal multiset comparator (sort-permutation + cell-by-cell)
- test_jb_baseline_radix_inner: right=70536 rows (> RAY_PARALLEL_THRESHOLD),
  runs join twice with knob on/off; today both paths are identical so the
  diff passes trivially — pins the harness for Task 2

Register join_buildside_entries in test/main.c.  No behavior change.
Suite: 3433/3435 passed (2 pre-existing skips, 0 failed).
…eaks

Replace file-scope jb_cmp_cols/jb_cmp_ncols + jb_row_cmp with a
self-contained iterative bottom-up merge sort (jb_sort_rows) that
threads cols/ncols as parameters — no globals, O(n log n).

Use RAY_PARALLEL_THRESHOLD (visible via ops/ops.h) instead of bare
literal in the fixture.  All allocations in jb_results_equal freed on
every exit path via a single goto cleanup label.
…age unchanged

all_match: left 50→1 (right=T+2000 unchanged; HT-grow + swap still fire)
near_equal_no_swap: n T+5000→T+1000, key mod 1000→10000 (both sides equal
  and >threshold; swap still suppressed; output rows 4.4M→443k)
many_to_many: left 2000→500 (right=T+5000 unchanged; m:n fanout + swap fire)

join_buildside group wall time: 43.9s → 13.4s; suite 3441/3443 (2 skipped, 0 failed).
Add test/rfl/ops/join_buildside.rfl: 2000-row left × 70000-row right
inner-join on modular keys (% 1000), producing 140000 output rows and
a payload sum of 4899930000. Both assertions are order-insensitive
(count and sum) because the radix path does not guarantee row order.

The right-table size (>65536) triggers the radix path; left<right satisfies
the swap condition, so the build hash is on the left side. Existing
rfl/integration/joins and rfl/ops/join_branch_cov (chained path) are
unchanged and green.
bench/join_buildside/main.c: 3-case perf gate (WIN/CONTROL/MANY-TO-MANY),
9 reps interleaved swap vs legacy, CLOCK_MONOTONIC around ray_execute,
mechanism assert on ray_join_build_swaps counter.  Sanitizer-free
(nm | grep -ci asan = 0).  Makefile target bench-join-buildside mirrors
bench-idx-route flags.  bench-join-buildside added to .gitignore.
Raw results in bench/bottleneck/join_buildside_compare.md.
…on-scaling probe

15-rep re-run at load 1.70 (vs 4.8–5.9 in round 1).  Adds HEAVY-DUP-WIN case
(right=10M key i%1000 → 10000 dup/key, left=10K key i%1000) alongside existing WIN.
Reports median+min per case.  Appends ROUND 2 section to join_buildside_compare.md.
Update joins.md and architecture/pipeline.md to reflect that the
radix-parallel inner-join path selects the build side at runtime using
actual materialized row counts (smaller input = build side).

Note LEFT/FULL/ANTI and the small-input (chained) path always build on
the right. Note that inner-join output order on the radix path is
partition- and thread-dependent and not guaranteed stable. Note that
build-side selection helps most when the larger side has many rows per
key and is neutral-to-small on near-unique keys.

No stale "always build right" hard-claim was found in either doc; the
prior text named no build side at all, so this is purely additive.
@singaraiona singaraiona merged commit 1f16b92 into master Jun 13, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant