Replace variant sed blocks with override files for CPU/GPU Presto configs #334

Open

misiugodfrey wants to merge 2 commits into main from misiug/PrestoCpuConfigs

Replace variant sed blocks with override files for CPU/GPU Presto configs#334
misiugodfrey wants to merge 2 commits into
mainfrom
misiug/PrestoCpuConfigs

Conversation

@misiugodfrey
Contributor

@misiugodfrey misiugodfrey commented May 5, 2026

Summary

  • Replaces ad-hoc sed/echo post-processing in generate_presto_config.sh with a structured override file system: variant-specific files under docker/config/template/overrides/{cpu,gpu}/ are appended to the pbench-generated configs after rendering
  • Moves CPU-specific settings (exchange tuning, split sizing, pushdown filters, cudf.enabled=false) and GPU-specific settings (cudf.*, parquet.reader.*, hive.file-splittable=false, optimizer flags) out of the shared templates and into their respective override directories
  • Base templates now contain only the settings common to all variants; adding new per-variant config options no longer requires touching the generator script
  • Added new options observed to speed up presto-cpu benchmarking on 8-node 1k/3k runs.
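
The append-after-render mechanism described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: directory names such as `overrides/` and `generated/`, and the function name, are assumptions for the example.

```shell
# Sketch (assumed names): append per-variant override files onto the
# rendered configs. Variants without an override directory (e.g. java)
# are left untouched, so only the base template applies.
apply_overrides() {
  local variant="$1" overrides_root="$2" generated_dir="$3"
  local overrides_dir="${overrides_root}/${variant}"
  [[ -d "${overrides_dir}" ]] || return 0   # java: no overrides, base config only

  local src_file rel_path dest_file
  while IFS= read -r -d '' src_file; do
    rel_path="${src_file#"${overrides_dir}"/}"
    dest_file="${generated_dir}/${rel_path}"
    if [[ -f "${dest_file}" ]]; then        # only append onto files pbench rendered
      cat "${src_file}" >> "${dest_file}"
    fi
  done < <(find "${overrides_dir}" -type f -print0)
}
```

With this shape, adding a new per-variant setting means dropping a line into the matching file under `overrides/{cpu,gpu}/` rather than editing the generator script.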

Test plan

  • Run generate_presto_config.sh with VARIANT_TYPE=gpu — verify generated config files match those from previous benchmark runs.
  • Run with VARIANT_TYPE=cpu — verify generated config files match those from the updated POC benchmarks.
  • Run with VARIANT_TYPE=java — verify no regression (java has no override directory, so only the base template is used)

@misiugodfrey misiugodfrey requested a review from a team as a code owner May 5, 2026 19:41
@misiugodfrey misiugodfrey requested review from kjmph and paul-aiyedun May 5, 2026 19:44
# file compression/format isn't splittable to avoid read failures. TPCH Parquet
# test data commonly uses SNAPPY compression that isn't splittable at the file
# level here, hence this must be false.
hive.file-splittable=false
Contributor Author


We want this option on for presto-cpu, but off for gpu - moved to overrides.


# Optimizer flags
# New option in 0.282 to control inference of NOT NULL columns in joins (undocumented).
#optimizer.joins-not-null-inference-strategy=USE_FUNCTION_METADATA
Contributor Author


These options had previously been disabled, but appear to help in cpu-mode. Moved to overrides.


# Parquet read options
# Limit (in bytes) on total number of bytes to be returned per read, or 0 if there is no limit
parquet.reader.chunk-read-limit=0
Contributor Author


The parquet.reader* options are cudf-exclusive. Moved to overrides.

# overwritten to "false" in multi-node settings.
single-node-execution-enabled=true

# Enable cuDF (CPU mode will ignore this setting)
Contributor Author


All the cudf.* options are ignored by CPU workers. Moved to overrides.

hive.split-loader-concurrency=32
hive.pushdown-filter-enabled=true

hive.parquet.pushdown-filter-enabled=true
Contributor Author


The hive.parquet* options are only used by the coordinator. Right now our etc_common/catalog/hive paths get overwritten instead of appended to, so the options above this are duplicated in both the coordinator and worker paths. Not sure if we want to change that - but I think that's outside the scope of this PR.

# cuDF has no effect in CPU mode; disable to suppress startup warnings.
cudf.enabled=false

exchange.http-client.enable-connection-pool=true
Contributor Author


These new options were all found to help in benchmarking (and are recommended by IBM). They were not found to have much impact until we boosted the max-buffer-size on a busy cluster.

if [[ ${NUM_WORKERS} -gt 1 ]]; then
sed -i "s+single-node-execution-enabled.*+single-node-execution-enabled=false+g" ${coord_native_config}
sed -i "s+single-node-execution-enabled.*+single-node-execution-enabled=false+g" ${worker_native_config}
sed -i "s+single-node-execution-enabled.*+single-node-execution-enabled=false+g" ${coord_native_config} ${worker_native_config}
Contributor Author


Use the same sed multi-file syntax as is used below, collapsing the two per-file invocations into one.

sed -i "s|hive.metastore.catalog.dir=.*|hive.metastore.uri=${HIVE_METASTORE_URI}|" "${CONFIG_DIR}/etc_coordinator/catalog/hive.properties" "${CONFIG_DIR}/etc_worker/catalog/hive.properties"
fi

COORD_CONFIG="${CONFIG_DIR}/etc_coordinator/config_native.properties"
Contributor Author


Removed these sed commands in favour of the file overrides.

Contributor


It looks like we are picking between sed commands and file overrides to determine what config gets used. Reading this, it's a bit hard to follow where one should look to see the definitive config for a given mode.

Have we considered storing the configuration we want in a python dictionary then writing it to a config file? A nice property of this is that json files naturally map to python dictionaries and vice-versa, so it gives a concise way to store these representations on disk as well.

Contributor

@simoneves simoneves left a comment


LGTM

@misiugodfrey
Contributor Author

A quick note: because the overrides directory lives under templates, it gets parsed and duplicated by pbench, creating a duplicate folder in generated. We could move the folder to prevent this... but I think it may be useful in the future in case we want to use any pbench variables in the override files. I'm fine with going either way on this.

if [[ -f "${dest_file}" ]]; then
cat "${src_file}" >> "${dest_file}"
fi
done < <(find "${OVERRIDES_DIR}" -type f -print0)
Contributor


Ew. But OK.

Contributor Author


I'm open to an alternate syntax if we want to change it.

Contributor


Maybe at least do the find in advance into a variable with an explanatory name so that it's not so complex a one-liner
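
One way the suggestion could look (a sketch only; the function and variable names are hypothetical, not the PR's code): run the find up front into an array with an explanatory name, then iterate over it. Note `mapfile -d ''` needs bash 4.4+.

```shell
# Hypothetical refactor of the one-liner above: name the intermediate
# result so the loop body reads clearly.
append_override_files() {
  local overrides_dir="$1" config_dir="$2"

  # Collect all override files (NUL-delimited, so paths with spaces survive).
  local override_files=()
  mapfile -d '' override_files < <(find "${overrides_dir}" -type f -print0)

  local src_file dest_file
  for src_file in "${override_files[@]}"; do
    dest_file="${config_dir}/${src_file#"${overrides_dir}"/}"
    if [[ -f "${dest_file}" ]]; then
      cat "${src_file}" >> "${dest_file}"
    fi
  done
}
```

The behavior is the same as the process-substitution loop; the only change is that the file list gets a name before it is consumed.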
