
Compiler statistics (--benchmark)#67

Merged
Fiwo735 merged 25 commits into main from code_stats
Apr 8, 2026

Conversation

Collaborator

@Fiwo735 Fiwo735 commented Mar 24, 2026

Addresses #52.

Simple compiler statistics measured over new test cases in tests/benchmark/, hidden behind the --benchmark flag. The terminal output is:

Passed 86/86 found test cases
Benchmark results:
matmul_sum: compilation time = 0.1 s, execution time = 31.3 us, binary size = 210 B

Current order of operations:

  1. Run tests as before, but skip tests in tests/benchmark/
  2. If benchmarking, run tests only in tests/benchmark/
  3. Measure statistics

The measured statistics are:

  1. Compilation time: /usr/bin/time is prepended to student_compiler(...), so it's always possible to check the run time by reading .c_compiler.stderr.log. Current limitation: /usr/bin/time is not very accurate; it only reports seconds to 2 decimal places (i.e., a precision of 0.01 s). A better solution would be to run student_compiler(...) in a loop and use time.perf_counter(), as any Python overhead would be completely insignificant over e.g. 10,000 repetitions.
  2. Execution time: /usr/bin/time is prepended to running spike, so it's always possible to check the run time by reading .simulation.stderr.log. The benchmark program driver is designed to include a loop with 100,000 repetitions, so this method is safe. I've added logic in the benchmark driver that hopefully prevents GCC from optimising out the loop body.
  3. Binary size: the sum of the .text, .data, and .rodata sections of the ELF file, read with riscv32-unknown-elf-size -A <elf_file.o>
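As a rough illustration of the binary size measurement, here is a minimal sketch of summing the relevant section sizes out of `riscv32-unknown-elf-size -A` output. The helper name and the exact row layout it assumes (one "section size addr" row per line, SysV format) are assumptions of this sketch, not the PR's actual code.

```python
# Hypothetical helper: sum the .text, .data and .rodata section sizes from
# the output of `riscv32-unknown-elf-size -A <elf_file.o>`.
MEASURED_SECTIONS = {".text", ".data", ".rodata"}

def binary_size(size_output: str) -> int:
    total = 0
    for line in size_output.splitlines():
        parts = line.split()
        # Data rows look like ".text  210  0": section name, size, address.
        if len(parts) >= 2 and parts[0] in MEASURED_SECTIONS:
            total += int(parts[1])
    return total
```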

The reasons for splitting the test run in two:

  • benchmark programs will be difficult, so we don't include them in the assessed tests
  • normally, we'd make --benchmark and --validate_test mutually exclusive, or skip gathering stats when validating tests. Currently, it's actually possible to combine these 2 flags - that's on purpose, so you can test the new logic without an implemented compiler. That's also the reason for the temporary try... except... block for reading the c_compiler compilation time, as with --validate_test such a file isn't created.
  • to address the current compilation time limitation explained above, we could pass a different compiler, which repeats the compilation X times to get an accurate measurement - I can implement that once you're happy with the current approach.
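The loop-plus-perf_counter idea mentioned above can be sketched as follows; `compile_once` stands in for a call to student_compiler(...) and is an assumption of this sketch.

```python
import time

# Repeat the compile callable and divide by the repetition count, so the
# 0.01 s granularity of /usr/bin/time no longer limits precision.
def measure_compile_time(compile_once, repetitions: int = 10_000) -> float:
    start = time.perf_counter()
    for _ in range(repetitions):
        compile_once()
    return (time.perf_counter() - start) / repetitions
```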

Note: /usr/bin/time would need to be added to the environment, see the updated Dockerfile.
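Reading the elapsed time back out of a .stderr.log could look like the sketch below. It assumes `/usr/bin/time -p` (POSIX format: "real 0.01" / "user ..." / "sys ..."); the default GNU time output format differs, so the flag and helper name are assumptions here.

```python
import re

# Parse the "real <seconds>" line written by `/usr/bin/time -p` to stderr.
def parse_elapsed_seconds(stderr_log: str) -> float:
    match = re.search(r"^real\s+(\d+(?:\.\d+)?)", stderr_log, re.MULTILINE)
    if match is None:
        raise ValueError("no 'real <seconds>' line found in time output")
    return float(match.group(1))
```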

With hopefully slightly more emphasis on extensions in the coming years, easy statistics tracking could motivate students to think about introducing optimisations into their design.

@Fiwo735 Fiwo735 requested a review from dwRchyngqxs March 24, 2026 13:29
@Fiwo735 Fiwo735 self-assigned this Mar 24, 2026
@dwRchyngqxs
Collaborator

I will review after #65 is merged.

Collaborator

@dwRchyngqxs dwRchyngqxs left a comment


I don't think this is measuring or reporting the right data.

Collaborator Author

Fiwo735 commented Mar 25, 2026

@dwRchyngqxs I see the point of going from raw averages to average differences/ratios compared to GCC. The advantage is a more meaningful statistic, but I worry that comparing against GCC might be a bit "scary" to students. I'd assume most students wouldn't go beyond one optimisation, e.g. better register allocation, so they'd see a very marginal change. I guess a solution would be showing stats with respect to different GCC optimisation levels?

My idea with averages over seen tests is that we treat these as our "benchmark", so while the test cases are heterogeneous, it'd still be meaningful to observe e.g. "smarter register allocation makes the code X% smaller". We could also measure the statistics across some selected (more complex) test cases, e.g. tests/programs/*, to avoid polluting the results (both raw averages and GCC-relative averages) with very simple test cases, which check only 1 tiny feature at a time. It seems to me that might be the best approach; we could even add/move 1 or 2 more complex test cases for that reason.

As for what we actually measure - could you summarise your suggestions? I'd like some orthogonal stats so that students can observe trade-offs like "the compiler takes X% longer to compile, but the code is Y% faster". Ideally the measurement would be simple to implement, so that we don't bloat the code base with a feature that's in early development.

Collaborator

dwRchyngqxs commented Mar 25, 2026

I worry that comparing against GCC might be a bit "scary" to students

If you're really worried about that, we can use absolute perf and store the best student perf each time we measure it, so that they try to improve their personal best.

I'd assume most students wouldn't go beyond one optimisation, e.g. better register allocation - so they'd see a very marginal change.

Marginal change relative to GCC is marginal absolute change:

change = abs(new_perf - old_perf)
relative_perf = perf / gcc_perf
relative_change = abs(new_perf / gcc_perf - old_perf / gcc_perf) = abs((new_perf - old_perf) / gcc_perf) = change / gcc_perf

So if we expect a marginal change in any case, what even is the point of measuring and showing perf?
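The identity above can be checked numerically: dividing by a GCC baseline only rescales the absolute change, so it cannot turn a marginal change into a visible one. The numbers below are made up for illustration.

```python
# relative_change(new, old, gcc) == abs(new - old) / gcc, term for term.
def relative_change(new_perf: float, old_perf: float, gcc_perf: float) -> float:
    return abs(new_perf / gcc_perf - old_perf / gcc_perf)

new_perf, old_perf, gcc_perf = 98.0, 100.0, 40.0
change = abs(new_perf - old_perf)
assert abs(relative_change(new_perf, old_perf, gcc_perf) - change / gcc_perf) < 1e-12
```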

I guess a solution would be showing stats wrt to different GCC optimisation levels?

I don't think the reference assembly is optimised. I was thinking about perf relative to the reference assembly.

My idea with averages over seen tests is that we treat these as our "benchmark", so while the test cases are heterogenous, it'd still be meaningful to observe e.g. "smarter register allocation makes the code X% smaller".

I see the point of benchmarking, but your code isn't achieving it. You can get 50% smaller code by no longer passing the more complicated tests; that's what I mean by heterogeneous - I think that wasn't clear, in retrospect.

We could also measure the statistics across some selected (more complex) test cases, e.g. tests/programs/* to avoid polluting results (both raw averages and GCC relative averages) with very simple test cases.

It is a good solution to the point I raised. It is also a good solution IMO because students shouldn't even look at perf before being able to pass 80% of the tests.

we could even add/move 1 or 2 more complex test cases for that reason.

We could even add 1 or 2 unmarked tests. Then have 10 runs for each test and get real estimates of the perf data.

As for what we actually measure - could you summarise what are your suggestions?

  1. Binary size: the sum of the sizes of the .text, .data, and .rodata sections from the object file generated using the assembly produced by the compiler; this way only the assembling pass of gcc interferes with the measurement.
  2. Compile time: wall-clock time of running build/c_compiler; a student using parallelism to get better perf is valid, even though test.py -m interferes atm.
  3. Run time: ideally the executed instruction count (spike should provide it); if not, the wall-clock time of spike pk.

Collaborator Author

Fiwo735 commented Mar 26, 2026

Thanks for your thoughts, they helped me reach a significantly more reasonable V2. Previous changes have been overwritten, so you can check out the overall diff under "Files changed" at the top. Don't mind the actual code style, as the code will be polished once the methods are agreed upon and the changes made in #68 are integrated.

Current order of operations:

  1. Run tests as before, but skip tests in tests/benchmark/
  2. If benchmarking, run tests only in tests/benchmark/
  3. Measure statistics

The measured statistics are:

  1. Compilation time: /usr/bin/time is prepended to student_compiler(...), so it's always possible to check the run time by reading .c_compiler.stderr.log. Current limitation: /usr/bin/time is not very accurate; it only reports seconds to 2 decimal places (i.e., a precision of 0.01 s). A better solution would be to run student_compiler(...) in a loop and use time.perf_counter(), as any Python overhead would be completely insignificant over e.g. 10,000 repetitions.
  2. Execution time: /usr/bin/time is prepended to running spike, so it's always possible to check the run time by reading .simulation.stderr.log. The benchmark program driver is designed to include a loop with 100,000 repetitions, so this method is safe. I've added logic in the benchmark driver that hopefully prevents GCC from optimising out the loop body.
  3. Binary size: the sum of the .text, .data, and .rodata sections of the ELF file, read with riscv32-unknown-elf-size -A <elf_file.o>

The reasons for splitting the test run in two:

  • benchmark programs will be difficult, so we don't include them in the assessed tests
  • normally, we'd make --benchmark and --validate_test mutually exclusive, or skip gathering stats when validating tests. Currently, it's actually possible to combine these 2 flags - that's on purpose, so you can test the new logic without an implemented compiler. That's also the reason for the temporary try... except... block for reading the c_compiler compilation time, as with --validate_test such a file isn't created.
  • to address the current compilation time limitation explained above, we could pass a different compiler, which repeats the compilation X times to get an accurate measurement - I can implement that once you're happy with the current approach.

Note: /usr/bin/time would need to be added to the environment, see the updated Dockerfile.

@Fiwo735 Fiwo735 changed the title Compiler statistics Compiler statistics (--benchmark) Mar 27, 2026
@Fiwo735 Fiwo735 mentioned this pull request Apr 1, 2026
Collaborator

@dwRchyngqxs dwRchyngqxs left a comment


I would add a specific function for running benchmarks instead of reusing run_component. I would put the benchmark tests outside the tests folder and not run benchmarks on the seen tests. I would have --benchmark imply --optimised.

Collaborator

dwRchyngqxs commented Apr 2, 2026

normally, we'd make --benchmark and --validate_test mutually exclusive, or skip gathering stats when validating tests. Currently, it's actually possible to combine these 2 flags - that's on purpose, so you can test the new logic without an implemented compiler. That's also the reason for the temporary try... except... block for reading the c_compiler compilation time, as with --validate_test such a file isn't created.

I would not make them mutually exclusive. Validating benchmark files is something we or the students want to do if benchmark files are ever added, and it's a way to get gcc's -O0 values to compare to.

Collaborator Author

Fiwo735 commented Apr 2, 2026

I would add a specific function for running benchmark instead of reusing run_component.

  1. My goal with benchmarking is to exploit our existing framework and be very lightweight. Hence, I reuse run_component(...) for the sole cost of always prepending /usr/bin/time, which shouldn't pose any issues.

I would put the benchmark test outside the tests folder and not run benchmark on the seen tests.

  1. See above + having benchmark/ follow the existing tests structure is also very lightweight, for the cost of 1 LoC to exclude it from normal runs. The additional benefit is that the seen tests can be treated as unit tests by students who already achieve a high pass rate and start working on some optimisations. In other words, they'd run with --benchmark and analyse their compiler stats while ensuring no core functionality has been broken.

I would have --benchmark imply --optimised.

  1. Makes sense - just to confirm, that's in order to build students' compiler with -O3, right?

I would not make them [--benchmark and --validate_test] mutually exclusive...

  1. Good points, I'll keep it as it is then.

  2. DISCLAIMER: The currently added benchmark program (matmul_sum) is just an example to work on test.py functionality. I will address what/how many .c benchmarks we should have in a separate PR (Compiled code statistics #52 will stay open for now).

Collaborator

dwRchyngqxs commented Apr 3, 2026

  1. I wouldn't worry about being lightweight in code additions when the logic differs non-trivially: we need to run the student compiler in a loop without keeping logs (redirect to DEVNULL) to get accurate measurements. Also, we should force --jobs 1 to get even more reliable data.
  2. If we exclude it from marking, I'm fine with it. I did not see the logic for excluding the benchmark folder from the seen tests; I might have missed it.
  3. Yes, if we measure compile time we should give students their best chance by compiling with -O3.
  4. Ok.

Collaborator Author

Fiwo735 commented Apr 3, 2026

  1. Two aspects:
    1. "Reliable data with --jobs 1" - that would certainly make sense if we decided to assess quantitatively instead of qualitatively; however, for now the point of extensions/benchmarking would be to let students do something challenging in a way they envision. I'd say it's reasonable to e.g. not enforce --jobs 1, because if the students report an unrealistic speed-up of their compiler, that's on them and we'd catch it during the oral. What do you think about this approach?
    2. "Running in a loop" - I have one more lightweight idea for that; I'll push it along with applying your suggestions soon.
  2. Yup, that's the idea, see here.
  3. Agreed, with the same catch as 1.i above - students only compare against their own (previous) work, so we could say again it's on them if they accidentally assume their compiler is so much faster when in reality they just compiled with -O3.

Collaborator

dwRchyngqxs commented Apr 3, 2026

  1. i. Sure, can we at least print a warning?
    ii. I'm looking forward to this idea.
  2. Nice.
  3. Ok, maybe it warrants a warning, maybe not; I'm fine either way.

Collaborator Author

Fiwo735 commented Apr 3, 2026

    1. Sure, I've added a warning. In the future, we should add a small benchmarking explanation (probably in extensions.md?) that talks about correct methodology, including thinking about --jobs and --optimise.
    2. Please see the updated logic for compiler repetitions. It's based on a value obtained from --benchmark along with a slight modification to student_compiler(...) to conditionally include a bash loop. It seems to me the method is very safe and accurate, while fully utilising the existing code (further exploiting partial(...)). What do you think?
  1. See 1.i + here specifically I've decided not to show a warning, as comparing compilation time (with and without extensions) using the same optimisation level is valid, so I'd say it should just be mentioned in the benchmarking explanation.

"Compilation repetitions" can appear to be a misnomer when it comes to 0 vs 1 repetitions, but the logic is intentional. The current implementation allows the following repetitions options in student_compiler(...):

  • 0: execute 1 compilation, don't measure time
  • 1: execute 1 compilation, measure time
  • N: execute N compilations, measure time
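The 0/1/N semantics above could be sketched as a shell command builder like the one below. The function and parameter names are illustrative (the PR's real logic lives inside student_compiler(...)), and quoting edge cases in the bash loop are glossed over.

```python
import shlex

# Build the shell command for a given repetition count:
#   0 -> one compilation, no timing
#   1 -> one compilation, timed
#   N -> N compilations inside a timed bash loop
def build_compile_command(compiler_cmd: list[str], repetitions: int) -> str:
    single = shlex.join(compiler_cmd)
    if repetitions == 0:
        return single
    if repetitions == 1:
        return f"/usr/bin/time {single}"
    return f"/usr/bin/time bash -c 'for _ in $(seq {repetitions}); do {single}; done'"
```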

Collaborator Author

Fiwo735 commented Apr 7, 2026

The first commit addresses some bugs (repetitions in measure_compile_time(...), test_file in collect_benchmark_data(...), status when verbosity is QUIET conflicting with Progress, the repetitions CLI flag not being optional). It also adds further simplifications (e.g. get_sanitizer_files_from_stem_parent(...), benchmarking printing "N/A").

The second commit simplifies the nested with clauses in run_tests(...) and the JUnitXML file handling (that was needed for a long time) to avoid a with clause around each run_tests(...) call. I've also decided to reuse some of the logic between the normal and benchmark modes - that saves 100+ LoC and should be easier to understand than before, thanks to your suggestions and improvements. I've experimented with partial(...) again, this time for run_make_rule(...) and run_tests(...) - I think logically it makes a lot of sense, but it doesn't actually save that much space, so please let me know what you think about that.

Collaborator

@dwRchyngqxs dwRchyngqxs left a comment


I'm okay with most of the reverts (even without your reasons, but I appreciate that you took the time to justify them anyway).

  • I like the changes to run_tests with clause.
  • We lost testing the compiler with -O1 on all tests to check that their optimisations are correct. Is there a way to restore that? (I think using exclude_dir to our advantage is going to be the way here.)
  • Running only one build optimisation mode is fine. But I think the script shouldn't measure or report time when not building with optimisations. I don't want the students to rely on this figure because they don't know how the test script works. I don't like giving footguns to students. See my comments.

run_tests,
jobs=args.jobs,
output_dir=output_dir,
report_path=args.report,
Collaborator


--report + --benchmark doesn't make sense at the moment. We're not running benchmarks in CI.

Collaborator Author


We might, though? I was thinking of adding test.py-dependent CI, which tests permutations of --optimise, --benchmark, --verbosity, etc.

Collaborator


You mean that would be triggered by student pushes? I don't think we can afford any kind of regular useful benchmark in CI. And even then, it would be for the minority of students who finished their compiler, so they can run it manually.

Collaborator Author


No, I mean such CI would be triggered by pushes with changes to test.py, so presumably triggered by us in the majority of cases. The idea is that with the growing functionality of test.py, even a simple CI which just calls the program with different parameters and checks that the execution was successful could help us spot some conflicts.

Collaborator

@dwRchyngqxs dwRchyngqxs Apr 8, 2026


In that case, I recommend we have an undocumented flag for compiler_path to give riscv32-unknown-elf-gcc. Validate tests is nice for validating files in tests/** but it's not nice for validating test.py. Also, we wouldn't need a report at the same time as benchmark; we will just look at the command line results for crashes and non-zero return statuses with --validate_tests.

Collaborator Author

Fiwo735 commented Apr 8, 2026

Thanks for the suggestions. I've addressed all of them with the commits above, including rerunning seen tests with -O1. As for students using potentially impactful options when measuring compile time, I've added a warning + it'll be mentioned in the docs.

At some point you also suggested a check for a 100% seen pass rate to allow benchmarking. At first I kept it, then changed it to an arbitrary ratio, but then I decided to remove it completely. My thinking can be seen in the comments above, but tl;dr: I think the docs should be enough given the qualitative nature of the extensions.

@dwRchyngqxs
Collaborator

At some point you've also suggested a check for 100% seen pass rate to allow benchmarking. At first I kept it, but changed it to an arbitrary ratio, but then I've decided to remove that completely.

I also changed my opinion on that. If students do not want a better grade but still want to have fun and learn something useful (as in, they would not have fun implementing yet another C language feature), they should be allowed to.

- cmd = [compiler_path, opt_flag, "-S", input_file, "-o", append_suffix_to_stem(output_stem, "s")]
+ cmd = [compiler_path, "-S", input_file, "-o", append_suffix_to_stem(output_stem, "s")]
+ if opt_flag is not None:
+     cmd.insert(1, opt_flag)
Collaborator

@dwRchyngqxs dwRchyngqxs Apr 8, 2026


opt_flag is a fine way to do it. Did you not like the base_cmd version, where instead of giving path + opt_flag we pass a command (like ["build/c_compiler", "-O1"] or ["riscv32-unknown-elf-gcc", "-pedantic", ..., "-O2"] if we want to test the script without a working compiler)?
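For comparison, a minimal sketch of the base_cmd variant being discussed: the caller passes the full base command (compiler plus its flags) instead of a path and a separate opt_flag. The helper name and flag lists here are illustrative, not the project's actual code.

```python
# base_cmd already carries the compiler path and any optimisation flags,
# so no conditional flag insertion is needed at the call site.
def make_compile_cmd(base_cmd: list[str], input_file: str, output_file: str) -> list[str]:
    return [*base_cmd, "-S", input_file, "-o", output_file]

student = make_compile_cmd(["build/c_compiler", "-O1"], "a.c", "a.s")
reference = make_compile_cmd(["riscv32-unknown-elf-gcc", "-O2"], "a.c", "a.s")
```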

Collaborator Author


Hmm, I've been using --validate_tests to test without a working compiler. That's for the benchmarking mode, as the normal mode can be tested with the provided compiler.

Collaborator


--validate_tests is nice for validating files in tests/** but it's not nice for validating test.py, because it uses special logic and leaves some parts of the code untested.

Collaborator

@dwRchyngqxs dwRchyngqxs left a comment


After some changes you said you will implement, you should be able to merge.

@Fiwo735 Fiwo735 merged commit f4e42c4 into main Apr 8, 2026
2 checks passed
@Fiwo735 Fiwo735 deleted the code_stats branch April 8, 2026 14:08
Fiwo735 added a commit that referenced this pull request Apr 9, 2026
* Compiler statistics (compile time, asm size and static instructions count)

* Compiler stats V2

* Changed --gather_stats to --benchmark

* Removed not used imports

* Compatibility with langproc-marking

* Jobs warning + executed instructions using ASM rdinstret

* New method for compiler repetitions

* Improved time logging

* Improved instructions log reading

* ISA flag explained (_zicntr)

* shlex.join instead of str join

* During benchmarking, run with and without optimisations enabled

* Moved rdinstret64 into benchmark.h

* My attempt

* Forgot a line

* Not making an empty folder

* Reduce indent, small changes

* Bugfixes + simplifications

* Further simplifications + normal and benchmark mode reuse logic again

* Removed sanitizer during timed execution

* Test again with -O1 when benchmarking

* Assessed disclaimer

* Applying suggestions

---------

Co-authored-by: Fiwo735 <fiwo725@gmail.com>
Co-authored-by: dwRchyngqxs <q.corradi22@imperial.ac.uk>
