Conversation
I will review after #65 is merged.
dwRchyngqxs left a comment
I don't think this is measuring or reporting the right data.
@dwRchyngqxs I see the point of going from raw averages to average differences/ratios compared to GCC. The advantage is a more meaningful statistic, but I worry that comparing against GCC might be a bit "scary" to students. I'd assume most students wouldn't go beyond one optimisation, e.g. better register allocation, so they'd see a very marginal change. I guess a solution would be showing stats wrt different GCC optimisation levels? My idea with averages over seen tests is that we treat these as our "benchmark", so while the test cases are heterogeneous, it'd still be meaningful to observe e.g. "smarter register allocation makes the code X% smaller". We could also measure the statistics across some selected (more complex) test cases, e.g.

As for what we actually measure: could you summarise your suggestions? I'd like some orthogonal stats so that students can observe trade-offs like "compiler takes X% longer to compile, but the code is Y% faster". Ideally the measurement would be simple to implement, so that we don't bloat the code base with a feature that's in early development.
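For concreteness, the difference/ratio reporting being discussed could be sketched as below. This is only an illustration: `relative_stats` and the metric names are hypothetical, not part of the PR.

```python
def relative_stats(student, reference):
    """Percentage change of each student metric vs. a reference compiler.

    Negative values mean the student's result is smaller/faster than the
    reference; positive values mean it is larger/slower.
    """
    return {
        key: 100.0 * (student[key] - reference[key]) / reference[key]
        for key in reference
    }

# Hypothetical numbers: student code is 10% smaller but compiles 20% slower.
stats = relative_stats(
    {"code_size": 900, "compile_time": 1.2},
    {"code_size": 1000, "compile_time": 1.0},
)
```

This directly exposes the "X% smaller / Y% slower" trade-off phrasing without committing to any particular reference optimisation level.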
If you're really worried about that, we can use absolute perf and store the best student perf each time we measure it, so that they try to improve their personal best.
Marginal change relative to
I don't think the reference assembly is optimised. I was thinking about perf relative to the reference assembly.
I see the point of benchmarking, but your code isn't achieving it. You can get 50% smaller code by no longer passing the more complicated tests; that's what I mean by heterogeneous. I think that wasn't clear, in retrospect.
It is a good solution to the point I raised. It is also a good solution IMO because students shouldn't even look at perf before being able to pass 80% of the tests.
We could even add 1 or 2 unmarked tests, then have 10 runs for each test and get real estimates of the perf data.
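The "10 runs per test" idea could be aggregated along these lines. A minimal sketch, assuming timings are collected in seconds; `summarise_runs` is a hypothetical helper, not code from the PR.

```python
import statistics

def summarise_runs(timings):
    """Aggregate timings (seconds) from repeated runs of one benchmark
    test into more robust estimates than a single measurement."""
    return {
        "mean": statistics.mean(timings),
        "stdev": statistics.stdev(timings) if len(timings) > 1 else 0.0,
        "min": min(timings),  # least-noise estimate of the true cost
    }

# Example: 10 simulated runs of one benchmark test.
runs = [0.101, 0.099, 0.103, 0.098, 0.100, 0.102, 0.097, 0.101, 0.099, 0.100]
summary = summarise_runs(runs)
```

Reporting the standard deviation alongside the mean would also make it obvious when a measured "improvement" is within noise.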
Thanks for your thoughts, they helped me reach a significantly more reasonable V2. Previous changes have been overwritten, so you can check the overall diff under "Files changed" at the top. Don't mind the actual code style; the code will be polished once the methods are agreed upon and the changes made in #68 are integrated. Current order of operation:
The measured statistics are:
The reason for splitting tests running into 2:
Note:
dwRchyngqxs left a comment
I would add a specific function for running benchmarks instead of reusing `run_component`. I would put the benchmark tests outside the tests folder and not run the benchmark on the seen tests. I would have `--benchmark` imply `--optimised`.
I would not make them mutually exclusive. Validating benchmark files is something we or the students will want to do if benchmark files are ever added, and it's a way to get gcc's
Compilation repetitions can appear to be a misnomer when it comes to 0 vs 1 repetitions, but the logic is intentional. The current implementation allows for the following
First commit addresses some bugs. Second commit simplifies
I'm okay with most of the reverts (even without your reasons, but I appreciate that you took the time to justify them anyway).

- I like the changes to `run_tests`' `with` clause.
- We lost testing the compiler with `-O1` on all tests to check that their optimisations are correct. Is there a way to restore that? (I think using `exclude_dir` to our advantage is going to be the way here.)
- Running only one build optimisation mode is fine. But I think the script shouldn't measure or report time when not building with optimisations. I don't want the students to rely on this figure, because they don't know how the test script works. I don't like giving footguns to students. See my comments.
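For the `exclude_dir` idea, the filtering might look roughly like this. A sketch only, assuming tests are `.c` files discovered by walking the tests folder; `collect_tests` is illustrative and not the actual `test.py` code.

```python
from pathlib import Path

def collect_tests(tests_root, exclude_dir=None):
    """Collect .c test files under tests_root, optionally skipping one
    subdirectory (e.g. skip the benchmark folder during the seen-tests pass)."""
    root = Path(tests_root)
    excluded = (root / exclude_dir).resolve() if exclude_dir else None
    tests = []
    for path in sorted(root.rglob("*.c")):
        # Skip anything that lives inside the excluded subdirectory.
        if excluded and excluded in path.resolve().parents:
            continue
        tests.append(path)
    return tests
```

Running the normal `-O1` pass with `exclude_dir="benchmark"` and the benchmark pass on only that directory would restore optimisation-correctness coverage without double-running benchmarks.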
```python
run_tests,
jobs=args.jobs,
output_dir=output_dir,
report_path=args.report,
```
`--report` + `--benchmark` doesn't make sense at the moment. We're not running benchmarks in CI.
We might, though? I was thinking of adding `test.py`-dependent CI, which tests permutations of `--optimise`, `--benchmark`, `--verbosity`, etc.
You mean that it would be triggered by student pushes? I don't think we can afford any kind of regular, useful benchmark in CI. And even then it would be for the minority of students who finished their compiler, so they can run it manually.
No, I mean such CI would be triggered by pushes with changes to `test.py`, so presumably triggered by us in the majority of cases. The idea is that, with the growing functionality of `test.py`, even a simple CI which just calls the program with different parameters and checks whether the execution was successful could help us spot some conflicts.
In that case I recommend we have an undocumented flag for `compiler_path` to give `riscv32-unknown-elf-gcc`. `--validate_tests` is nice to validate files in `tests/**`, but it's not nice to validate `test.py`. Also, we wouldn't need a report at the same time as a benchmark; we will just look at the command-line results for crashes and non-zero return statuses with `--validate_tests`.
Thanks for the suggestions. I've addressed all of them with the commits above, including rerunning seen tests with

At some point you've also suggested a check for a 100% seen pass rate to allow benchmarking. At first I kept it, then changed it to an arbitrary ratio, but then I decided to remove it completely. My thinking can be seen in the comments above, but tl;dr: I think docs should be enough, given the qualitative nature of the extensions.
I also changed my opinion on that. If students do not want a better grade but still want to have fun and learn something useful (as in, they would not have fun implementing yet another C language feature), they should be allowed to.
```diff
-cmd = [compiler_path, opt_flag, "-S", input_file, "-o", append_suffix_to_stem(output_stem, "s")]
+cmd = [compiler_path, "-S", input_file, "-o", append_suffix_to_stem(output_stem, "s")]
+if opt_flag is not None:
+    cmd.insert(1, opt_flag)
```
`opt_flag` is a fine way to do it. Did you not like the `base_cmd` version, where instead of giving path + `opt_flag` we pass a command (like `["build/c_compiler", "-O1"]`, or `["riscv32-unknown-elf-gcc", "-pedantic", ..., "-O2"]` if we want to test the script without a working compiler)?
Hmm, I've been using `--validate_tests` to test without a working compiler. That's for the benchmarking mode, as the normal mode can be tested with the provided compiler.
`--validate_tests` is nice to validate files in `tests/**`, but it's not nice to validate `test.py`, because it uses special logic and leaves some parts of the code untested.
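The `base_cmd` variant under discussion could be sketched like this. Illustrative only: the real `append_suffix_to_stem` in the PR may behave differently, so it is stubbed here.

```python
def append_suffix_to_stem(stem, suffix):
    # Stub of the helper used in the diff above: "out" -> "out.s".
    return f"{stem}.{suffix}"

def build_compile_cmd(base_cmd, input_file, output_stem):
    """Build the compile command from a full base command, so callers can
    pass either the student compiler or a reference toolchain, e.g.
      ["build/c_compiler", "-O1"]
      ["riscv32-unknown-elf-gcc", "-pedantic", "-O2"]
    """
    return [*base_cmd, "-S", input_file,
            "-o", append_suffix_to_stem(output_stem, "s")]

cmd = build_compile_cmd(["build/c_compiler", "-O1"], "t.c", "out")
```

This keeps all flag handling at the call site, so no `opt_flag is not None` special case is needed inside the command builder.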
dwRchyngqxs left a comment
After the changes you said you will implement, you should be able to merge.
* Compiler statistics (compile time, asm size and static instructions count)
* Compiler stats V2
* Changed `--gather_stats` to `--benchmark`
* Removed not used imports
* Compatibility with langproc-marking
* Jobs warning + executed instructions using ASM `rdinstret`
* New method for compiler repetitions
* Improved time logging
* Improved instructions log reading
* ISA flag explained (`_zicntr`)
* `shlex.join` instead of `str` join
* During benchmarking, run with and without optimisations enabled
* Moved `rdinstret64` into `benchmark.h`
* My attempt
* Forgot a line
* Not making an empty folder
* Reduce indent, small changes
* Bugfixes + simplifications
* Further simplifications + normal and benchmark mode reuse logic again
* Removed sanitizer during timed execution
* Test again with `-O1` when benchmarking
* Assessed disclaimer
* Applying suggestions

---------

Co-authored-by: Fiwo735 <fiwo725@gmail.com>
Co-authored-by: dwRchyngqxs <q.corradi22@imperial.ac.uk>
Addresses #52.
Simple compiler statistics measured over new test cases in `tests/benchmark/`, hidden behind the `--benchmark` flag. The terminal output is:

Current order of operation:
- `tests/benchmark/`
- `tests/benchmark/`

The measured statistics are:
- `/usr/bin/time` is prepended to `student_compiler(...)`, so it's always possible to check the run time by reading `.c_compiler.stderr.log`. Current limitation: `/usr/bin/time` is not very accurate and only reports seconds to 2 decimal places (i.e., a precision of 0.01 s). A better solution would be to run `student_compiler(...)` in a loop and use `time.perf_counter()`, as any Python overhead would be completely insignificant for e.g. 10,000 repetitions.
- `/usr/bin/time` is prepended to running `spike`, so it's always possible to check the run time by reading `.simulation.stderr.log`. The benchmark program driver is designed to include a loop with 100,000 repetitions, so this method is safe. I've added logic in the benchmark driver which hopefully prevents GCC from optimising out the loop body.
- The sum of the `.text` + `.data` + `.rodata` sections of the ELF file, read with `riscv32-unknown-elf-size -A <elf_file.o>`.

The reason for splitting tests running into 2:
`--benchmark` and `--validate_test` mutually exclusive, or skip gathering stats in case of validating tests. Currently, it's actually possible to combine these 2 flags; that's on purpose, so you can test the new logic without an implemented compiler. That's also the reason for the temporary `try ... except ...` block for reading the `c_compiler` compilation time, as with `--validate_test` such a file isn't created.
`/usr/bin/time` would need to be added to the environment; see the updated `Dockerfile`.

With hopefully slightly more emphasis on extensions in the coming years, easy statistics tracking could motivate students to think about introducing optimisations into their design.
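The `time.perf_counter()` alternative mentioned under the measured statistics could look roughly like this. A sketch under the assumption that the compile step is wrapped in a callable; `time_repeated` and `compile_once` are hypothetical names, not code from the PR.

```python
import time

def time_repeated(compile_once, repetitions=10_000):
    """Average wall-clock seconds per call of `compile_once`, measured with
    time.perf_counter(); the per-iteration Python loop overhead becomes
    negligible at this repetition count."""
    start = time.perf_counter()
    for _ in range(repetitions):
        compile_once()
    return (time.perf_counter() - start) / repetitions
```

Compared with parsing `/usr/bin/time` output from a stderr log, this gives sub-millisecond resolution and avoids the 0.01 s quantisation mentioned above.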