Replace variant sed blocks with override files for CPU/GPU Presto configs #334
misiugodfrey wants to merge 2 commits into
Conversation
| # file compression/format isn't splittable to avoid read failures. TPCH Parquet | ||
| # test data commonly uses SNAPPY compression that isn't splittable at the file | ||
| # level here, hence this must be false. | ||
| hive.file-splittable=false |
We want this option on for presto-cpu, but off for gpu - moved to overrides.
| # Optimizer flags | ||
| # New option in 0.282 to control inference of NOT NULL columns in joins (undocumented). | ||
| #optimizer.joins-not-null-inference-strategy=USE_FUNCTION_METADATA |
These options had previously been disabled, but appear to help in cpu-mode. Moved to overrides.
| # Parquet read options | ||
| # Limit (in bytes) on total number of bytes to be returned per read, or 0 if there is no limit | ||
| parquet.reader.chunk-read-limit=0 |
The parquet.reader* options are cudf-exclusive. Moved to overrides.
| # overwritten to "false" in multi-node settings. | ||
| single-node-execution-enabled=true | ||
| # Enable cuDF (CPU mode will ignore this setting) |
All the cudf.* options are ignored by CPU workers. Moved to overrides.
| hive.split-loader-concurrency=32 | ||
| hive.pushdown-filter-enabled=true | ||
| hive.parquet.pushdown-filter-enabled=true |
The hive.parquet* options are only used by the coordinator. Right now our etc_common/catalog/hive paths get overwritten instead of appended to, so the options above this are duplicated in both the coordinator and worker paths. Not sure if we want to change that - but I think that's outside the scope of this PR.
| # cuDF has no effect in CPU mode; disable to suppress startup warnings. | ||
| cudf.enabled=false | ||
| exchange.http-client.enable-connection-pool=true |
These new options were all found to help in benchmarking (and are recommended by IBM). They were not found to have much impact until we boosted the max-buffer-size on a busy cluster.
| if [[ ${NUM_WORKERS} -gt 1 ]]; then | ||
| sed -i "s+single-node-execution-enabled.*+single-node-execution-enabled=false+g" ${coord_native_config} | ||
| sed -i "s+single-node-execution-enabled.*+single-node-execution-enabled=false+g" ${worker_native_config} | ||
| sed -i "s+single-node-execution-enabled.*+single-node-execution-enabled=false+g" ${coord_native_config} ${worker_native_config} |
Use the same sed multi-file syntax as is used below.
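A minimal sketch of the multi-file form, assuming GNU sed (BSD/macOS sed needs `-i ''`). The helper name and the throwaway files standing in for the coordinator and worker configs are illustrative, not part of the PR; the point is only that one sed invocation can edit several files in place.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: turn single-node execution off in every file passed in.
# Assumes GNU sed, where -i takes no mandatory backup-suffix argument.
disable_single_node_execution() {
  sed -i "s+single-node-execution-enabled.*+single-node-execution-enabled=false+g" "$@"
}

# Throwaway files standing in for the coordinator and worker configs.
demo_dir="$(mktemp -d)"
coord_cfg="${demo_dir}/coord.properties"
worker_cfg="${demo_dir}/worker.properties"
echo "single-node-execution-enabled=true" > "${coord_cfg}"
echo "single-node-execution-enabled=true" > "${worker_cfg}"

# One invocation edits both files.
disable_single_node_execution "${coord_cfg}" "${worker_cfg}"
cat "${coord_cfg}" "${worker_cfg}"
```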
| sed -i "s|hive.metastore.catalog.dir=.*|hive.metastore.uri=${HIVE_METASTORE_URI}|" "${CONFIG_DIR}/etc_coordinator/catalog/hive.properties" "${CONFIG_DIR}/etc_worker/catalog/hive.properties" | ||
| fi | ||
| COORD_CONFIG="${CONFIG_DIR}/etc_coordinator/config_native.properties" |
Removed these sed commands in favour of the file overrides.
It looks like we are picking between sed commands and file overrides to determine which config gets used. Reading this, it's a bit hard to follow where one should look to find the definitive config for a given mode.
Have we considered storing the configuration we want in a Python dictionary and then writing it to a config file? A nice property of this is that JSON files map naturally to Python dictionaries and vice versa, so it also gives a concise way to store these representations on disk.
|
A quick note: Because the |
| if [[ -f "${dest_file}" ]]; then | ||
| cat "${src_file}" >> "${dest_file}" | ||
| fi | ||
| done < <(find "${OVERRIDES_DIR}" -type f -print0) |
I'm open to an alternate syntax if we want to change it.
Maybe at least run the find in advance and store the result in a variable with an explanatory name, so that it's not such a complex one-liner.
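One way this suggestion could look, as a sketch rather than the PR's actual code: collect the override files into a named array first, then loop over it. The function name `apply_overrides` and the rule for deriving `dest_file` (the override's path relative to the overrides directory, mirrored under the config directory) are assumptions, since the excerpt does not show how `dest_file` is computed. `mapfile -d` requires Bash 4.4 or newer.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Append each override file to the already-rendered config file at the
# same relative path.
# Assumption (not shown in the excerpt): paths under overrides_dir mirror
# paths under config_dir. Pass both directories without a trailing slash.
apply_overrides() {
  local overrides_dir="$1" config_dir="$2"
  local -a override_files=()
  local src_file dest_file

  # Collect the override files up front, NUL-delimited so unusual
  # filenames survive intact.
  mapfile -d '' -t override_files < <(find "${overrides_dir}" -type f -print0)

  for src_file in "${override_files[@]}"; do
    dest_file="${config_dir}/${src_file#"${overrides_dir}"/}"
    # Only append when the base template produced a matching file.
    if [[ -f "${dest_file}" ]]; then
      cat "${src_file}" >> "${dest_file}"
    fi
  done
}
```

Called as `apply_overrides "${OVERRIDES_DIR}" "${CONFIG_DIR}"`, this keeps the append behaviour from the diff while giving the file list a name that says what it holds.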
Summary
- Replace sed/echo post-processing in generate_presto_config.sh with a structured override file system: variant-specific files under docker/config/template/overrides/{cpu,gpu}/ are appended to the pbench-generated configs after rendering
- Move CPU-specific settings (cudf.enabled=false) and GPU-specific settings (cudf.*, parquet.reader.*, hive.file-splittable=false, optimizer flags) out of the shared templates and into their respective override directories

Test plan
- Run generate_presto_config.sh with VARIANT_TYPE=gpu: verify generated config files match those from previous benchmark runs.
- Run with VARIANT_TYPE=cpu: verify generated config files match those used in the updated POC benchmarks
- Run with VARIANT_TYPE=java: verify no regression (java has no override directory, so only the base template is used)
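The "verify generated config files match" steps could be checked with something like the sketch below. The function names and the baseline/generated directory layout are hypothetical; nothing here is part of the PR. Comments and blank lines are ignored and the remaining lines are sorted, so the comparison is about which settings are present rather than their order (drop the `sort` if the order of repeated keys matters).

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print a properties file without comments or blank lines, sorted, so that
# ordering and commentary do not show up as differences.
normalized_properties() {
  grep -Ev '^[[:space:]]*(#|$)' "$1" | sort || true
}

# Compare every file under baseline_dir with its counterpart under
# generated_dir. Returns non-zero if any file is missing or differs.
configs_match() {
  local baseline_dir="$1" generated_dir="$2"
  local status=0 baseline_file rel_path
  local -a baseline_files=()

  # mapfile -d requires Bash 4.4 or newer.
  mapfile -d '' -t baseline_files < <(find "${baseline_dir}" -type f -print0)

  for baseline_file in "${baseline_files[@]}"; do
    rel_path="${baseline_file#"${baseline_dir}"/}"
    if [[ ! -f "${generated_dir}/${rel_path}" ]]; then
      echo "missing: ${rel_path}"
      status=1
    elif ! diff <(normalized_properties "${baseline_file}") \
                <(normalized_properties "${generated_dir}/${rel_path}") > /dev/null; then
      echo "differs: ${rel_path}"
      status=1
    fi
  done
  return "${status}"
}
```

For example, `configs_match previous_run_configs generated_configs` (both directory names illustrative) prints one line per missing or differing file and exits non-zero if there are any.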