ds4-eval (fix): q13 provides wrong answer #233
Merged
The answer was outside of the claimed energy precision. The evaluation after the fix (with a smooth distribution over the tokens):

```
$ ./ds4-eval --temp 3.0 --min-p 0.25 --nothink
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 0.000s
ds4: cuda backend initialized for graph diagnostics
ds4-eval: context auto-sized to 16777 tokens (largest prompt=777 tokens, case=70, generation budget=16000)
ds4-eval: context buffers 479.38 MiB (ctx=16777, backend=cuda, prefill_chunk=2048, raw_kv_rows=2304, compressed_kv_rows=4196)
ds4-eval: 17/92 passed, 1 failed, runtime 00h:34m
  #  state    prompt   gen  total  given  correct  test
  1  PASSED      201   733    934  B      B        GPQA Diamond/recNu3MXkvWUzHZr9
  2  PASSED      149    87    236  C      C        SuperGPQA/001b51d76b4d422988f2c11f104a2c6c
  3  PASSED       81   574    655  70     70       AIME2025/aime2025-01
  4  PASSED      313   239    552  C      C        GPQA Diamond/recoiTJPGUmzAkief
  5  PASSED      272   177    449  J      J        SuperGPQA/b7e20eac98764fb0bf30e8366d951daa
  6  PASSED      146  1140   1286  468    468      AIME2025/aime2025-16
  7  PASSED      156   646    802  B      B        GPQA Diamond/rec4UqStf9WUVif1f
  8  PASSED      127    52    179  E      E        SuperGPQA/4a1d1780a93f4093b6fb7d3c314cbea8
  9  PASSED      633  4780   5413  588    588      AIME2025/aime2025-02
 10  PASSED      182   322    504  B      B        GPQA Diamond/recgI6tUQ7RLJRWGx
 11  PASSED      137    68    205  A      A        SuperGPQA/6082513c8dba4ec68aa68f1bf5854d09
 12  PASSED      165   747    912  16     16       AIME2025/aime2025-03
 13  PASSED      149   672    821  A      A        GPQA Diamond (modified)/recDytVnNYZe2HuUU
 14  PASSED      167    68    235  J      J        SuperGPQA/bebf1ed45ae14ad7b4f205f3909cb58a
 15  FAILED      305  4837   5142  86     82       AIME2025/aime2025-18
 16  PASSED      131   671    802  D      D        GPQA Diamond/recNFJjE5PPTqVJGv
 17  PASSED      175    67    242  I      I        SuperGPQA/7ca71b86327744b78e93185a45bc5cef
 18  PASSED      102  1199   1301  117    117      AIME2025/aime2025-04
 19  STOPPED     187    80    267  -      B        GPQA Diamond/rec2UlKqC6RFHdcro
 20  PENDING       0     0      0  -      E        SuperGPQA/d44b94f7749345a39a65f6312bda8764
 21  PENDING       0     0      0  -      106      AIME2025/aime2025-19
 22  PENDING       0     0      0  -      B        GPQA Diamond/recv7GsQg3f0fvB1f
 23  PENDING       0     0      0  -      B        SuperGPQA/febe406f44d74a40b50bb5b7c69d5dc1
```
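To cross-check the `passed`/`failed` summary line against the per-case rows, the table can be tallied by its `state` column. The following is a minimal sketch, not part of `ds4-eval`: the helper name `tally_states` is hypothetical, and it assumes each result row starts with a numeric index followed by the state. The sample rows are copied from the run above.

```python
from collections import Counter

def tally_states(table_text):
    """Count result rows per state (hypothetical helper, not part of ds4-eval)."""
    counts = Counter()
    for line in table_text.splitlines():
        fields = line.split()
        # A result row starts with its numeric index, followed by the state.
        if len(fields) >= 2 and fields[0].isdigit():
            counts[fields[1]] += 1
    return counts

# A few rows copied from the run output.
sample_rows = """\
13 PASSED 149 672 821 A A GPQA Diamond (modified)/recDytVnNYZe2HuUU
15 FAILED 305 4837 5142 86 82 AIME2025/aime2025-18
18 PASSED 102 1199 1301 117 117 AIME2025/aime2025-04
19 STOPPED 187 80 267 - B GPQA Diamond/rec2UlKqC6RFHdcro
20 PENDING 0 0 0 - E SuperGPQA/d44b94f7749345a39a65f6312bda8764
"""

counts = tally_states(sample_rows)
print(dict(counts))
```

Applied to all 23 listed rows, the same tally gives 17 `PASSED` and 1 `FAILED`, which matches the summary line of the run.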
Owner
Thanks, there are a lot of broken entries; I had to remove several when I started the eval thing.
The original answer was outside the claimed energy precision.
The model tries really hard to get that answer, which is impossible.
This PR changes the answer to the correct value.
The evaluation after the fix, using a configuration with a smooth distribution over the tokens, is the `ds4-eval` run shown above.
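The "smooth distribution over the tokens" presumably refers to the `--temp 3.0 --min-p 0.25` flags in the run: a high temperature flattens the token distribution, and min-p then drops tokens whose probability is below a fraction of the most likely token's probability. The sketch below illustrates that idea in plain Python. It is not ds4's sampler: the function names are hypothetical, and applying min-p after temperature scaling is an assumption, since implementations differ in the order of these steps.

```python
import math
import random

def min_p_probs(logits, temperature=3.0, min_p=0.25):
    """Temperature-scaled softmax followed by min-p filtering.

    Hypothetical illustration; the order (temperature first, then min-p)
    is an assumption and may differ from what ds4 actually does.
    """
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only tokens with probability >= min_p * (top token probability).
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    norm = sum(kept)
    return [p / norm for p in kept]

def sample_token(logits, temperature=3.0, min_p=0.25, rng=random):
    """Draw one token index from the filtered distribution."""
    probs = min_p_probs(logits, temperature, min_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [4.0, 3.0, 0.0, -6.0]  # made-up logits for four tokens
flat = min_p_probs(logits, temperature=3.0, min_p=0.25)
sharp = min_p_probs(logits, temperature=1.0, min_p=0.25)
print([round(p, 3) for p in flat])
print([round(p, 3) for p in sharp])
```

With these made-up logits, the higher temperature keeps three candidate tokens after filtering, while temperature 1.0 keeps only two, which is the sense in which the distribution becomes smoother.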