
Compiler statistics (--benchmark)#67

Merged
Fiwo735 merged 25 commits into main from code_stats
Apr 8, 2026

Conversation

Collaborator

@Fiwo735 Fiwo735 commented Mar 24, 2026

Addresses #52.

Simple compiler statistics measured over new test cases in tests/benchmark/, hidden behind the --benchmark flag. The terminal output is:

Passed 86/86 found test cases
Benchmark results:
matmul_sum: compilation time = 0.1 s, execution time = 31.3 us, binary size = 210 B

Current order of operations:

  1. Run tests as before, but skip tests in tests/benchmark/
  2. If benchmarking, run tests only in tests/benchmark/
  3. Measure statistics

The measured statistics are:

  1. Compilation time: /usr/bin/time is prepended to student_compiler(...), so it's always possible to check the run time by reading .c_compiler.stderr.log. Current limitation: /usr/bin/time is not very accurate; it only reports seconds to 2 decimal places (i.e., a precision of 0.01 s). A better solution would be to run student_compiler(...) in a loop and use time.perf_counter(), as any Python overhead would be completely insignificant over e.g. 10,000 repetitions.
  2. Execution time: /usr/bin/time is prepended to running spike, so it's always possible to check the run time by reading .simulation.stderr.log. The benchmark program driver is designed to include a loop with 100,000 repetitions, so this method is safe. I've added logic in the benchmark driver that hopefully prevents GCC from optimising out the loop body.
  3. Binary size: the sum of the .text, .data, and .rodata sections of the ELF file, read with riscv32-unknown-elf-size -A <elf_file.o>
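As a rough illustration of the binary size measurement, here is a minimal sketch of summing the relevant section sizes out of `riscv32-unknown-elf-size -A` output. The helper name and the exact row layout it assumes (one "section size addr" row per line, SysV format) are assumptions of this sketch, not the PR's actual code.

```python
# Hypothetical helper: sum the .text, .data and .rodata section sizes from
# the output of `riscv32-unknown-elf-size -A <elf_file.o>`.
MEASURED_SECTIONS = {".text", ".data", ".rodata"}

def binary_size(size_output: str) -> int:
    total = 0
    for line in size_output.splitlines():
        parts = line.split()
        # Data rows look like ".text  210  0": section name, size, address.
        if len(parts) >= 2 and parts[0] in MEASURED_SECTIONS:
            total += int(parts[1])
    return total
```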

The reasons for splitting the test run in two:

  • benchmark programs will be difficult, so we don't include them in the assessed tests
  • normally, we'd make --benchmark and --validate_test mutually exclusive, or skip gathering stats when validating tests. Currently, it's actually possible to combine these 2 flags - that's on purpose, so you can test the new logic without an implemented compiler. That's also the reason for the temporary try... except... block for reading the c_compiler compilation time, as with --validate_test such a file isn't created.
  • to address the current compilation time limitation explained above, we could pass a different compiler, which repeats the compilation X times to get an accurate measurement - I can implement that once you're happy with the current approach.
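The loop-plus-perf_counter idea mentioned above can be sketched as follows; `compile_once` stands in for a call to student_compiler(...) and is an assumption of this sketch.

```python
import time

# Repeat the compile callable and divide by the repetition count, so the
# 0.01 s granularity of /usr/bin/time no longer limits precision.
def measure_compile_time(compile_once, repetitions: int = 10_000) -> float:
    start = time.perf_counter()
    for _ in range(repetitions):
        compile_once()
    return (time.perf_counter() - start) / repetitions
```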

Note: /usr/bin/time would need to be added to the environment, see the updated Dockerfile.
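Reading the elapsed time back out of a .stderr.log could look like the sketch below. It assumes `/usr/bin/time -p` (POSIX format: "real 0.01" / "user ..." / "sys ..."); the default GNU time output format differs, so the flag and helper name are assumptions here.

```python
import re

# Parse the "real <seconds>" line written by `/usr/bin/time -p` to stderr.
def parse_elapsed_seconds(stderr_log: str) -> float:
    match = re.search(r"^real\s+(\d+(?:\.\d+)?)", stderr_log, re.MULTILINE)
    if match is None:
        raise ValueError("no 'real <seconds>' line found in time output")
    return float(match.group(1))
```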

With hopefully slightly more emphasis on extensions in the coming years, easy statistics tracking could motivate students to think about introducing optimisations into their design.

@Fiwo735 Fiwo735 requested a review from dwRchyngqxs March 24, 2026 13:29
@Fiwo735 Fiwo735 self-assigned this Mar 24, 2026
@dwRchyngqxs
Collaborator

I will review after #65 is merged.

Collaborator

@dwRchyngqxs dwRchyngqxs left a comment


I don't think this is measuring or reporting the right data.

Collaborator Author

Fiwo735 commented Mar 25, 2026

@dwRchyngqxs I see the point of going from raw averages to average differences/ratios compared to GCC. The advantage is a more meaningful statistic, but I worry that comparing against GCC might be a bit "scary" to students. I'd assume most students wouldn't go beyond one optimisation, e.g. better register allocation, so they'd see a very marginal change. I guess a solution would be showing stats with respect to different GCC optimisation levels?

My idea with averages over seen tests is that we treat these as our "benchmark", so while the test cases are heterogeneous, it'd still be meaningful to observe e.g. "smarter register allocation makes the code X% smaller". We could also measure the statistics across some selected (more complex) test cases, e.g. tests/programs/*, to avoid polluting the results (both raw averages and GCC-relative averages) with very simple test cases, which check only 1 tiny feature at a time. It seems to me that might be the best approach; we could even add/move 1 or 2 more complex test cases for that reason.

As for what we actually measure - could you summarise your suggestions? I'd like some orthogonal stats so that students can observe trade-offs like "the compiler takes X% longer to compile, but the code is Y% faster". Ideally the measurement would be simple to implement, so that we don't bloat the code base with a feature that's in early development.

Collaborator

dwRchyngqxs commented Mar 25, 2026

I worry that comparing against GCC might be a bit "scary" to students

If you're really worried about that, we can use absolute perf and store the best student perf each time we measure it, so that they try to improve their personal best.

I'd assume most students wouldn't go beyond one optimisation, e.g. better register allocation - so they'd see a very marginal change.

Marginal change relative to GCC is marginal absolute change:

change = abs(new_perf - old_perf)
relative_perf = perf / gcc_perf
relative_change = abs(new_perf / gcc_perf - old_perf / gcc_perf) = abs((new_perf - old_perf) / gcc_perf) = change / gcc_perf

So if we expect a marginal change in any case, what even is the point of measuring and showing perf?
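The identity above can be checked numerically: dividing by a GCC baseline only rescales the absolute change, so it cannot turn a marginal change into a visible one. The numbers below are made up for illustration.

```python
# relative_change(new, old, gcc) == abs(new - old) / gcc, term for term.
def relative_change(new_perf: float, old_perf: float, gcc_perf: float) -> float:
    return abs(new_perf / gcc_perf - old_perf / gcc_perf)

new_perf, old_perf, gcc_perf = 98.0, 100.0, 40.0
change = abs(new_perf - old_perf)
assert abs(relative_change(new_perf, old_perf, gcc_perf) - change / gcc_perf) < 1e-12
```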

I guess a solution would be showing stats wrt to different GCC optimisation levels?

I don't think the reference assembly is optimised. I was thinking about perf relative to the reference assembly.

My idea with averages over seen tests is that we treat these as our "benchmark", so while the test cases are heterogenous, it'd still be meaningful to observe e.g. "smarter register allocation makes the code X% smaller".

I see the point of benchmarking, but your code isn't achieving it. You can get 50% smaller code by no longer passing the more complicated tests; that's what I mean by heterogeneous - I think that wasn't clear, in retrospect.

We could also measure the statistics across some selected (more complex) test cases, e.g. tests/programs/* to avoid polluting results (both raw averages and GCC relative averages) with very simple test cases.

It is a good solution to the point I raised. It is also a good solution IMO because students shouldn't even look at perf before being able to pass 80% of the tests.

we could even add/move 1 or 2 more complex test cases for that reason.

We could even add 1 or 2 unmarked tests. Then have 10 runs for each test and get real estimates of the perf data.

As for what we actually measure - could you summarise what are your suggestions?

  1. Binary size: the sum of the sizes of the .text, .data, and .rodata sections from the object file generated using the assembly produced by the compiler; this way only the assembling pass of gcc interferes with the measurement.
  2. Compile time: wall-clock time of running build/c_compiler; a student using parallelism to get better perf is valid, even though test.py -m interferes atm.
  3. Run time: ideally the executed instruction count (spike should provide it); if not, the wall-clock time of spike pk.

Collaborator Author

Fiwo735 commented Mar 26, 2026

Thanks for your thoughts, they helped me reach a significantly more reasonable V2. Previous changes have been overwritten, so you can check out the overall diff under "Files changed" at the top. Don't mind the actual code style, as the code will be polished once the methods are agreed upon and the changes made in #68 are integrated.

Current order of operations:

  1. Run tests as before, but skip tests in tests/benchmark/
  2. If benchmarking, run tests only in tests/benchmark/
  3. Measure statistics

The measured statistics are:

  1. Compilation time: /usr/bin/time is prepended to student_compiler(...), so it's always possible to check the run time by reading .c_compiler.stderr.log. Current limitation: /usr/bin/time is not very accurate; it only reports seconds to 2 decimal places (i.e., a precision of 0.01 s). A better solution would be to run student_compiler(...) in a loop and use time.perf_counter(), as any Python overhead would be completely insignificant over e.g. 10,000 repetitions.
  2. Execution time: /usr/bin/time is prepended to running spike, so it's always possible to check the run time by reading .simulation.stderr.log. The benchmark program driver is designed to include a loop with 100,000 repetitions, so this method is safe. I've added logic in the benchmark driver that hopefully prevents GCC from optimising out the loop body.
  3. Binary size: the sum of the .text, .data, and .rodata sections of the ELF file, read with riscv32-unknown-elf-size -A <elf_file.o>

The reasons for splitting the test run in two:

  • benchmark programs will be difficult, so we don't include them in the assessed tests
  • normally, we'd make --benchmark and --validate_test mutually exclusive, or skip gathering stats when validating tests. Currently, it's actually possible to combine these 2 flags - that's on purpose, so you can test the new logic without an implemented compiler. That's also the reason for the temporary try... except... block for reading the c_compiler compilation time, as with --validate_test such a file isn't created.
  • to address the current compilation time limitation explained above, we could pass a different compiler, which repeats the compilation X times to get an accurate measurement - I can implement that once you're happy with the current approach.

Note: /usr/bin/time would need to be added to the environment, see the updated Dockerfile.

@Fiwo735 Fiwo735 changed the title Compiler statistics Compiler statistics (--benchmark) Mar 27, 2026
@Fiwo735 Fiwo735 mentioned this pull request Apr 1, 2026
Collaborator

@dwRchyngqxs dwRchyngqxs left a comment


I would add a specific function for running benchmarks instead of reusing run_component. I would put the benchmark tests outside the tests folder and not run benchmarks on the seen tests. I would have --benchmark imply --optimised.

Collaborator

dwRchyngqxs commented Apr 2, 2026

normally, we'd make --benchmark and --validate_test mutually exclusive, or skip gathering stats when validating tests. Currently, it's actually possible to combine these 2 flags - that's on purpose, so you can test the new logic without an implemented compiler. That's also the reason for the temporary try... except... block for reading the c_compiler compilation time, as with --validate_test such a file isn't created.

I would not make them mutually exclusive. Validating benchmark files is something we or the students want to do if benchmark files are ever added, and it's a way to get gcc's -O0 values to compare to.

Collaborator Author

Fiwo735 commented Apr 2, 2026

I would add a specific function for running benchmark instead of reusing run_component.

  1. My goal with benchmarking is to exploit our existing framework and be very lightweight. Hence, I reuse run_component(...) for the sole cost of always prepending /usr/bin/time, which shouldn't pose any issues.

I would put the benchmark test outside the tests folder and not run benchmark on the seen tests.

  1. See above + having benchmark/ follow the existing tests structure is also very lightweight, for the cost of 1 LoC to exclude it from normal runs. The additional benefit is that the seen tests can be treated as unit tests by students who already achieve a high pass rate and start working on some optimisations. In other words, they'd run with --benchmark and analyse their compiler stats while ensuring no core functionality has been broken.

I would have --benchmark imply --optimised.

  1. Makes sense - just to confirm, that's in order to build students' compiler with -O3, right?

I would not make them [--benchmark and --validate_test] mutually exclusive...

  1. Good points, I'll keep it as it is then.

  2. DISCLAIMER: The currently added benchmark program (matmul_sum) is just an example to work on test.py functionality. I will address what/how many .c benchmarks we should have in a separate PR (Compiled code statistics #52 will stay open for now).

Collaborator

dwRchyngqxs commented Apr 3, 2026

  1. I wouldn't worry about being lightweight in code additions when the logic differs non-trivially: we need to run the student compiler in a loop without keeping logs (redirect to DEVNULL) to get accurate measurements. Also, we should force --jobs 1 to get even more reliable data.
  2. If we exclude it from marking, I'm fine with it. I did not see the logic for excluding the benchmark folder from the seen tests; I might have missed it.
  3. Yes, if we measure compile time we should give students their best chance by compiling with -O3.
  4. Ok.

Collaborator Author

Fiwo735 commented Apr 3, 2026

  1. Two aspects:
    1. "Reliable data with --jobs 1" - that would certainly make sense if we decided to assess quantitatively instead of qualitatively; however, for now the point of extensions/benchmarking would be to let students do something challenging in a way they envision. I'd say it's reasonable to e.g. not enforce --jobs 1, because if the students report an unrealistic speed-up of their compiler, that's on them and we'd catch it during the oral. What do you think about this approach?
    2. "Running in a loop" - I have one more lightweight idea for that; I'll push it along with applying your suggestions soon.
  2. Yup, that's the idea, see here.
  3. Agreed, with the same catch as 1.i above - students only compare against their own (previous) work, so we could say again it's on them if they accidentally assume their compiler is so much faster when in reality they just compiled with -O3.

Collaborator

dwRchyngqxs commented Apr 3, 2026

  1. i. Sure, can we at least print a warning?
    ii. I'm looking forward to this idea.
  2. Nice.
  3. Ok, maybe it warrants a warning, maybe not; I'm fine either way.

Collaborator Author

Fiwo735 commented Apr 3, 2026

    1. Sure, I've added a warning. In the future, we should add a small benchmarking explanation (probably in extensions.md?) that talks about correct methodology, including thinking about --jobs and --optimise.
    2. Please see the updated logic for compiler repetitions. It's based on a value obtained from --benchmark along with a slight modification to student_compiler(...) to conditionally include a bash loop. It seems to me the method is very safe and accurate, while fully utilising the existing code (further exploiting partial(...)). What do you think?
  1. See 1.i + here specifically I've decided not to show a warning, as comparing compilation time (with and without extensions) using the same optimisation level is valid, so I'd say it should just be mentioned in the benchmarking explanation.

"Compilation repetitions" can appear to be a misnomer when it comes to 0 vs 1 repetitions, but the logic is intentional. The current implementation allows the following repetitions options in student_compiler(...):

  • 0: execute 1 compilation, don't measure time
  • 1: execute 1 compilation, measure time
  • N: execute N compilations, measure time
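The 0/1/N semantics above could be sketched as a shell command builder like the one below. The function and parameter names are illustrative (the PR's real logic lives inside student_compiler(...)), and quoting edge cases in the bash loop are glossed over.

```python
import shlex

# Build the shell command for a given repetition count:
#   0 -> one compilation, no timing
#   1 -> one compilation, timed
#   N -> N compilations inside a timed bash loop
def build_compile_command(compiler_cmd: list[str], repetitions: int) -> str:
    single = shlex.join(compiler_cmd)
    if repetitions == 0:
        return single
    if repetitions == 1:
        return f"/usr/bin/time {single}"
    return f"/usr/bin/time bash -c 'for _ in $(seq {repetitions}); do {single}; done'"
```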

Collaborator Author

Fiwo735 commented Apr 7, 2026

The first commit addresses some bugs (repetitions in measure_compile_time(...), test_file in collect_benchmark_data(...), status when verbosity is QUIET conflicting with Progress, the repetitions CLI flag not being optional). It also adds further simplifications (e.g. get_sanitizer_files_from_stem_parent(...), benchmarking printing "N/A").

The second commit simplifies the nested with clauses in run_tests(...) and the JUnitXML file handling (that was needed for a long time) to avoid a with clause around each run_tests(...) call. I've also decided to reuse some of the logic between the normal and benchmark modes - that saves 100+ LoC and should be easier to understand than before, thanks to your suggestions and improvements. I've experimented with partial(...) again, this time for run_make_rule(...) and run_tests(...) - I think logically it makes a lot of sense, but it doesn't actually save that much space, so please let me know what you think about that.

Collaborator

@dwRchyngqxs dwRchyngqxs left a comment


I'm okay with most of the reverts (even without your reasons, but I appreciate that you took the time to justify them anyway).

  • I like the changes to run_tests with clause.
  • We lost testing the compiler with -O1 on all tests to check that their optimisations are correct. Is there a way to restore that? (I think using exclude_dir to our advantage is going to be the way here.)
  • Running only one build optimisation mode is fine. But I think the script shouldn't measure or report time when not building with optimisations. I don't want the students to rely on this figure because they don't know how the test script works. I don't like giving footguns to students. See my comments.

run_tests,
jobs=args.jobs,
output_dir=output_dir,
report_path=args.report,
Collaborator


--report + --benchmark doesn't make sense at the moment. We're not running benchmarks in CI.

Collaborator Author


We might, though? I was thinking of adding test.py-dependent CI, which tests permutations of --optimise, --benchmark, --verbosity, etc.

Collaborator


You mean that would be triggered by student pushes? I don't think we can afford any kind of regular useful benchmark in CI. And even then, it would be for the minority of students who finished their compiler, so they can run it manually.

Collaborator Author


No, I mean such CI would be triggered by pushes with changes to test.py, so presumably triggered by us in the majority of cases. The idea is that with the growing functionality of test.py, even a simple CI which just calls the program with different parameters and checks that the execution was successful could help us spot some conflicts.

Collaborator

@dwRchyngqxs dwRchyngqxs Apr 8, 2026


In that case, I recommend we have an undocumented flag for compiler_path to give riscv32-unknown-elf-gcc. Validate tests is nice for validating files in tests/** but it's not nice for validating test.py. Also, we wouldn't need a report at the same time as benchmark; we will just look at the command line results for crashes and non-zero return statuses with --validate_tests.

Collaborator Author

Fiwo735 commented Apr 8, 2026

Thanks for the suggestions. I've addressed all of them with the commits above, including rerunning seen tests with -O1. As for students using potentially impactful options when measuring compile time, I've added a warning + it'll be mentioned in the docs.

At some point you also suggested a check for a 100% seen pass rate to allow benchmarking. At first I kept it, then changed it to an arbitrary ratio, but then I decided to remove it completely. My thinking can be seen in the comments above, but tl;dr: I think the docs should be enough given the qualitative nature of the extensions.

@dwRchyngqxs
Collaborator

At some point you've also suggested a check for 100% seen pass rate to allow benchmarking. At first I kept it, but changed it to an arbitrary ratio, but then I've decided to remove that completely.

I also changed my opinion on that. If students do not want a better grade but still want to have fun and learn something useful (as in, they would not have fun implementing yet another C language feature), they should be allowed to.

- cmd = [compiler_path, opt_flag, "-S", input_file, "-o", append_suffix_to_stem(output_stem, "s")]
+ cmd = [compiler_path, "-S", input_file, "-o", append_suffix_to_stem(output_stem, "s")]
+ if opt_flag is not None:
+     cmd.insert(1, opt_flag)
Collaborator

@dwRchyngqxs dwRchyngqxs Apr 8, 2026


opt_flag is a fine way to do it. Did you not like the base_cmd version, where instead of giving path + opt_flag we pass a command (like ["build/c_compiler", "-O1"] or ["riscv32-unknown-elf-gcc", "-pedantic", ..., "-O2"] if we want to test the script without a working compiler)?
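For comparison, a minimal sketch of the base_cmd variant being discussed: the caller passes the full base command (compiler plus its flags) instead of a path and a separate opt_flag. The helper name and flag lists here are illustrative, not the project's actual code.

```python
# base_cmd already carries the compiler path and any optimisation flags,
# so no conditional flag insertion is needed at the call site.
def make_compile_cmd(base_cmd: list[str], input_file: str, output_file: str) -> list[str]:
    return [*base_cmd, "-S", input_file, "-o", output_file]

student = make_compile_cmd(["build/c_compiler", "-O1"], "a.c", "a.s")
reference = make_compile_cmd(["riscv32-unknown-elf-gcc", "-O2"], "a.c", "a.s")
```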

Collaborator Author


Hmm, I've been using --validate_tests to test without a working compiler. That's for the benchmarking mode, as the normal mode can be tested with the provided compiler.

Collaborator


--validate_tests is nice for validating files in tests/** but it's not nice for validating test.py, because it uses special logic and leaves some parts of the code untested.

Collaborator

@dwRchyngqxs dwRchyngqxs left a comment


After some changes you said you will implement, you should be able to merge.

@Fiwo735 Fiwo735 merged commit f4e42c4 into main Apr 8, 2026
2 checks passed
@Fiwo735 Fiwo735 deleted the code_stats branch April 8, 2026 14:08
Fiwo735 added a commit that referenced this pull request Apr 9, 2026
* Compiler statistics (compile time, asm size and static instructions count)

* Compiler stats V2

* Changed --gather_stats to --benchmark

* Removed not used imports

* Compatibility with langproc-marking

* Jobs warning + executed instructions using ASM rdinstret

* New method for compiler repetitions

* Improved time logging

* Improved instructions log reading

* ISA flag explained (_zicntr)

* shlex.join instead of str join

* During benchmarking, run with and without optimisations enabled

* Moved rdinstret64 into benchmark.h

* My attempt

* Forgot a line

* Not making an empty folder

* Reduce indent, small changes

* Bugfixes + simplifications

* Further simplifications + normal and benchmark mode reuse logic again

* Removed sanitizer during timed execution

* Test again with -O1 when benchmarking

* Assessed disclaimer

* Applying suggestions

---------

Co-authored-by: Fiwo735 <fiwo725@gmail.com>
Co-authored-by: dwRchyngqxs <q.corradi22@imperial.ac.uk>
