Add distinct max_len and max_tokens parameters #7
Conversation
- Add `--max_len` (default 8192) for truncating input text (instructions, completions)
- Add `--max_tokens` (default 32768) for limiting model generation output
- Separate these concepts, which were previously conflated
- Update defaults consistently across `generate.py` and `evaluate.py`
- Fix bug: `--result_folder` CLI arg was parsed but not passed to `CliArgs`
geoalgo
left a comment
Awesome, thanks for catching and fixing this. I have only one comment about the naming.
openjury/generate_and_evaluate.py
| " `[result_folder]/[evaluation_name]`.", | ||
| ) | ||
| parser.add_argument( | ||
| "--max_len", |
`max_len` and `max_tokens` do not convey what the parameters do; could you replace them with better names?
Perhaps `max_token_completion` and `max_token_judge` would be better?
Thanks for the feedback @geoalgo. I've been thinking about this more and realized that `max_token_completion` and `max_token_judge` don't fully capture what's happening, since truncation occurs at the character level, not based on tokens.
Here's what I would suggest:
- `--max_out_tokens_models`: max tokens models A/B can generate
- `--max_out_tokens_judge`: max tokens the judge can generate
- `--truncate_all_input_chars`: max chars to truncate all input text to (instructions before A/B, completions before the judge)
I considered splitting the last one into separate params (`--max_in_chars_models` for instructions and `--max_in_chars_judge` for completions), but I couldn't think of a practical use case where you'd want different truncation limits for each. I'd say the common scenarios are "both short" (to save costs) or "both long" (for a thorough eval).
Let me know if this naming works for you, or if you'd prefer something different.
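To make the char-vs-token distinction concrete, here is a minimal sketch (the `truncate_chars` helper and all variable names below are illustrative, not the actual openjury code):

```python
# Illustrative sketch only; `truncate_chars` and the names below are
# hypothetical, not the actual openjury implementation.

def truncate_chars(text: str, max_chars: int) -> str:
    """Character-level truncation applied to input text (instructions
    before A/B, completions before the judge)."""
    return text[:max_chars]

truncate_all_input_chars = 8192  # input budget, measured in characters
max_out_tokens_models = 32768    # output budget for models A/B, in tokens
max_out_tokens_judge = 32768     # output budget for the judge, in tokens

instruction = truncate_chars("some very long instruction ... " * 1000,
                             truncate_all_input_chars)

# The token limits are enforced by the model backend at generation time, e.g.:
# completion_a = model_a.generate(instruction, max_tokens=max_out_tokens_models)
# verdict = judge.generate(judge_prompt, max_tokens=max_out_tokens_judge)
```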
- Rename `--max_len` → `--truncate_all_input_chars` (truncates instructions before A/B, completions before judge)
- Split `--max_tokens` into:
  - `--max_out_tokens_models` (output limit for models A/B)
  - `--max_out_tokens_judge` (output limit for judge)
- Fix bug in `generate_base` where `max_len` was used instead of `max_tokens`
- Update function signatures in `generate.py` and `evaluate.py`
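As a rough sketch of that `generate_base` fix (parameter names as in the commit message; the function body here is a placeholder, not the real code):

```python
# Sketch of the fix only; the surrounding code is assumed, not real.
def generate_base(dataset, model, max_len=8192, max_tokens=32768):
    prompt = dataset  # placeholder for the real prompt construction
    # before: output = model.generate(prompt, max_tokens=max_len)   # bug
    # after: the generation limit is the token budget, not the char budget
    output = model.generate(prompt, max_tokens=max_tokens)
    return output
```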
Overview
As discussed with @geoalgo today, this PR adds CLI arguments for controlling input truncation and model generation limits:
- `--max_len` (default 8192): Maximum character length for truncating input text (instructions, completions) before sending it to models. Prevents exceeding context limits. This was previously hard-set to 200, which effectively led to judges noticing the cut-off completions and basing their decisions on that.
- `--max_tokens` (default 32768): Maximum number of tokens all models (A, B, and the judge) can generate in their responses (was previously hard-coded to 32k).
- Fixed a minor bug: `--result_folder` was parsed but never passed to the `CliArgs` dataclass.

The first two parameters were previously hard-coded with inconsistent values (200 and 4096) and conflated together. This PR separates them as distinct concepts.
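For reference, the new arguments look roughly like this (a sketch based on the snippet in the review above; the help strings and the `CliArgs` wiring are assumptions):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--max_len", type=int, default=8192,
    help="Max character length for truncating input text (instructions, "
         "completions) before sending it to models.",
)
parser.add_argument(
    "--max_tokens", type=int, default=32768,
    help="Max number of tokens models A, B, and the judge may generate.",
)
parser.add_argument(
    "--result_folder", type=str, default=None,
    help="Results are written to `[result_folder]/[evaluation_name]`.",
)
args = parser.parse_args()

# The bug fix: result_folder must actually be forwarded when building CliArgs,
# e.g. (other fields omitted):
# cli_args = CliArgs(..., result_folder=args.result_folder)
```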
Changes
- `generate_and_evaluate.py`: Added `--max_len` and `--max_tokens` CLI arguments; fixed `--result_folder` not being passed
- `generate.py`: Separated `max_len` (truncation) from `max_tokens` (generation) in `generate_instructions()` and `generate_base()`
- `evaluate.py`: Updated default `max_len` to 8192
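The separated parameters in `generate.py` then look roughly like this (the leading parameters are placeholders, not the real signatures):

```python
# Sketch of the parameter split; leading parameters are placeholders.

def generate_instructions(dataset, model, max_len: int = 8192, max_tokens: int = 32768):
    # max_len: character budget for truncating input text
    # max_tokens: token budget for the model's generated output
    ...

def generate_base(dataset, model, max_len: int = 8192, max_tokens: int = 32768):
    # same split; previously max_len was mistakenly reused as the
    # generation limit here (fixed in the follow-up commit)
    ...
```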
Usage

```bash
python -m openjury.generate_and_evaluate \
    --dataset alpaca-eval \
    --model_A ... \
    --model_B ... \
    --judge_model ... \
    --max_len 16384 \
    --max_tokens 8192
```