Replace variant sed blocks with override files for CPU/GPU Presto configs #334

Open

misiugodfrey wants to merge 2 commits into main from misiug/PrestoCpuConfigs

Replace variant sed blocks with override files for CPU/GPU Presto configs#334
misiugodfrey wants to merge 2 commits into
mainfrom
misiug/PrestoCpuConfigs

Conversation

@misiugodfrey
Contributor

@misiugodfrey misiugodfrey commented May 5, 2026

Summary

  • Replaces ad-hoc sed/echo post-processing in generate_presto_config.sh with a structured override file system: variant-specific files under docker/config/template/overrides/{cpu,gpu}/ are appended to the pbench-generated configs after rendering
  • Moves CPU-specific settings (exchange tuning, split sizing, pushdown filters, cudf.enabled=false) and GPU-specific settings (cudf.*, parquet.reader.*, hive.file-splittable=false, optimizer flags) out of the shared templates and into their respective override directories
  • Base templates now contain only the settings common to all variants; adding new per-variant config options no longer requires touching the generator script
  • Added new options observed to speed up presto-cpu benchmarking on 8-node 1k/3k runs.
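
The append-after-render mechanism described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: directory names such as `overrides/` and `generated/`, and the function name, are assumptions for the example.

```shell
# Sketch (assumed names): append per-variant override files onto the
# rendered configs. Variants without an override directory (e.g. java)
# are left untouched, so only the base template applies.
apply_overrides() {
  local variant="$1" overrides_root="$2" generated_dir="$3"
  local overrides_dir="${overrides_root}/${variant}"
  [[ -d "${overrides_dir}" ]] || return 0   # java: no overrides, base config only

  local src_file rel_path dest_file
  while IFS= read -r -d '' src_file; do
    rel_path="${src_file#"${overrides_dir}"/}"
    dest_file="${generated_dir}/${rel_path}"
    if [[ -f "${dest_file}" ]]; then        # only append onto files pbench rendered
      cat "${src_file}" >> "${dest_file}"
    fi
  done < <(find "${overrides_dir}" -type f -print0)
}
```

With this shape, adding a new per-variant setting means dropping a line into the matching file under `overrides/{cpu,gpu}/` rather than editing the generator script.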

Test plan

  • Run generate_presto_config.sh with VARIANT_TYPE=gpu — verify generated config files match those from previous benchmark runs.
  • Run with VARIANT_TYPE=cpu — verify generated config files match those from the updated POC benchmarks.
  • Run with VARIANT_TYPE=java — verify no regression (java has no override directory, so only the base template is used)

@misiugodfrey misiugodfrey requested a review from a team as a code owner May 5, 2026 19:41
@misiugodfrey misiugodfrey requested review from kjmph and paul-aiyedun May 5, 2026 19:44
# file compression/format isn't splittable to avoid read failures. TPCH Parquet
# test data commonly uses SNAPPY compression that isn't splittable at the file
# level here, hence this must be false.
hive.file-splittable=false
Contributor Author


We want this option on for presto-cpu, but off for gpu - moved to overrides.


# Optimizer flags
# New option in 0.282 to control inference of NOT NULL columns in joins (undocumented).
#optimizer.joins-not-null-inference-strategy=USE_FUNCTION_METADATA
Contributor Author


These options had previously been disabled, but appear to help in cpu-mode. Moved to overrides.


# Parquet read options
# Limit (in bytes) on total number of bytes to be returned per read, or 0 if there is no limit
parquet.reader.chunk-read-limit=0
Contributor Author


The parquet.reader* options are cudf-exclusive. Moved to overrides.

# overwritten to "false" in multi-node settings.
single-node-execution-enabled=true

# Enable cuDF (CPU mode will ignore this setting)
Contributor Author


All the cudf.* options are ignored by CPU workers. Moved to overrides.

hive.split-loader-concurrency=32
hive.pushdown-filter-enabled=true

hive.parquet.pushdown-filter-enabled=true
Contributor Author


The hive.parquet* options are only used by the coordinator. Right now our etc_common/catalog/hive paths get overwritten instead of appended to, so the options above this are duplicated in both the coordinator and worker paths. Not sure if we want to change that - but I think that's outside the scope of this PR.

# cuDF has no effect in CPU mode; disable to suppress startup warnings.
cudf.enabled=false

exchange.http-client.enable-connection-pool=true
Contributor Author


These new options were all found to help in benchmarking (and are recommended by IBM). They were not found to have much impact until we boosted the max-buffer-size on a busy cluster.

if [[ ${NUM_WORKERS} -gt 1 ]]; then
sed -i "s+single-node-execution-enabled.*+single-node-execution-enabled=false+g" ${coord_native_config}
sed -i "s+single-node-execution-enabled.*+single-node-execution-enabled=false+g" ${worker_native_config}
sed -i "s+single-node-execution-enabled.*+single-node-execution-enabled=false+g" ${coord_native_config} ${worker_native_config}
Contributor Author


Use the same sed multi-file syntax as is used below, collapsing the two per-file invocations into one.

sed -i "s|hive.metastore.catalog.dir=.*|hive.metastore.uri=${HIVE_METASTORE_URI}|" "${CONFIG_DIR}/etc_coordinator/catalog/hive.properties" "${CONFIG_DIR}/etc_worker/catalog/hive.properties"
fi

COORD_CONFIG="${CONFIG_DIR}/etc_coordinator/config_native.properties"
Contributor Author


Removed these sed commands in favour of the file overrides.

Contributor


It looks like we are picking between sed commands and file overrides to determine what config gets used. Reading this, it's a bit hard to follow where one should look to see the definitive config for a given mode.

Have we considered storing the configuration we want in a python dictionary then writing it to a config file? A nice property of this is that json files naturally map to python dictionaries and vice-versa, so it gives a concise way to store these representations on disk as well.

Contributor

@simoneves simoneves left a comment


LGTM

@misiugodfrey
Contributor Author

A quick note: because the overrides directory lives under templates, it gets parsed and duplicated by pbench, creating a duplicate folder in generated. We could move the folder to prevent this... but I think it may be useful in the future in case we want to use any pbench variables in the override files. I'm fine with going either way on this.

if [[ -f "${dest_file}" ]]; then
cat "${src_file}" >> "${dest_file}"
fi
done < <(find "${OVERRIDES_DIR}" -type f -print0)
Contributor


Ew. But OK.

Contributor Author


I'm open to an alternate syntax if we want to change it.

Contributor


Maybe at least do the find in advance into a variable with an explanatory name so that it's not so complex a one-liner
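
One way the suggestion could look (a sketch only; the function and variable names are hypothetical, not the PR's code): run the find up front into an array with an explanatory name, then iterate over it. Note `mapfile -d ''` needs bash 4.4+.

```shell
# Hypothetical refactor of the one-liner above: name the intermediate
# result so the loop body reads clearly.
append_override_files() {
  local overrides_dir="$1" config_dir="$2"

  # Collect all override files (NUL-delimited, so paths with spaces survive).
  local override_files=()
  mapfile -d '' override_files < <(find "${overrides_dir}" -type f -print0)

  local src_file dest_file
  for src_file in "${override_files[@]}"; do
    dest_file="${config_dir}/${src_file#"${overrides_dir}"/}"
    if [[ -f "${dest_file}" ]]; then
      cat "${src_file}" >> "${dest_file}"
    fi
  done
}
```

The behavior is the same as the process-substitution loop; the only change is that the file list gets a name before it is consumed.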
