Run a benchmark of versions 1 and 2 with the following command:
```
cargo b 1 2 -r 8
```

The `-r` flag specifies the number of repetitions.
Run `cargo b -h` for more information on the available flags.
The benchmark interleaves the runs of each version to avoid any bias from execution order, and outputs the minimum, average, and maximum execution time of each version.
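The harness itself isn't reproduced here, but the interleaving idea looks roughly like this (a minimal sketch; the binary paths and function names are hypothetical, not the actual harness code):

```rust
use std::process::Command;
use std::time::{Duration, Instant};

// Every repetition runs all versions once, round-robin, so no version
// systematically benefits from running first (warm caches, etc.).
fn bench(versions: &[&str], reps: usize) {
    let mut times = vec![Vec::new(); versions.len()];
    for _ in 0..reps {
        for (i, v) in versions.iter().enumerate() {
            let start = Instant::now();
            Command::new(format!("target/release/v{v}")) // hypothetical path
                .status()
                .expect("failed to run version");
            times[i].push(start.elapsed());
        }
    }
    for (v, ts) in versions.iter().zip(&times) {
        let min = ts.iter().min().unwrap();
        let max = ts.iter().max().unwrap();
        let avg: Duration = ts.iter().sum::<Duration>() / ts.len() as u32;
        println!("v{v}: min {min:?} avg {avg:?} max {max:?}");
    }
}
```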
v1 is a basic implementation that uses rayon for parallelism and makes heavy use of std in the hot code.
It usually runs in about 7.5 seconds on my machine when given 8 cores.
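A hypothetical sketch of that shape (the struct and helper names are made up for illustration, not the repo's actual code): split the input on newlines in parallel, fold each rayon-assigned chunk into a thread-local map, then merge the maps.

```rust
use rayon::prelude::*;
use std::collections::HashMap;

struct Stats { min: f32, max: f32, sum: f64, count: u64 }

fn process(input: &[u8]) -> HashMap<String, Stats> {
    input
        .par_split(|&b| b == b'\n')
        .filter(|line| !line.is_empty())
        // Fold each chunk into a local map using plain std parsing.
        .fold(HashMap::new, |mut map: HashMap<String, Stats>, line| {
            let sep = line.iter().position(|&b| b == b';').unwrap();
            let name = std::str::from_utf8(&line[..sep]).unwrap().to_owned();
            let val: f32 = std::str::from_utf8(&line[sep + 1..]).unwrap().parse().unwrap();
            let s = map.entry(name).or_insert(Stats { min: f32::MAX, max: f32::MIN, sum: 0.0, count: 0 });
            s.min = s.min.min(val);
            s.max = s.max.max(val);
            s.sum += f64::from(val);
            s.count += 1;
            map
        })
        // Merge the per-thread maps.
        .reduce(HashMap::new, |mut a, b| {
            for (name, s) in b {
                let t = a.entry(name).or_insert(Stats { min: f32::MAX, max: f32::MIN, sum: 0.0, count: 0 });
                t.min = t.min.min(s.min);
                t.max = t.max.max(s.max);
                t.sum += s.sum;
                t.count += s.count;
            }
            a
        })
}
```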
v2 replaces the iteration-based parsing with custom parser logic, including some bit manipulation in `find_delimiter_long` to find the semicolon efficiently.
The custom parsing logic exploits many properties of the input that the std version cannot take advantage of.
This version has by far the greatest performance improvement over its predecessor, with almost a 2x speedup on most runs.
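The exact code isn't reproduced here, but the underlying SWAR trick is a classic: treat 8 bytes as one `u64`, zero out the bytes equal to `b';'` with XOR, then flag the zero bytes with the usual subtract-and-mask test. A hypothetical sketch:

```rust
// Sketch of the kind of trick used in find_delimiter_long, not the
// repo's actual code. XOR with a word of b';' bytes zeroes matching
// bytes; the zero-byte test sets the top bit of each zero byte, and
// trailing_zeros locates the first match (little-endian load).
fn find_semicolon(chunk: &[u8; 8]) -> Option<usize> {
    let word = u64::from_le_bytes(*chunk);
    let x = word ^ u64::from_le_bytes([b';'; 8]);
    let hit = x.wrapping_sub(0x0101_0101_0101_0101) & !x & 0x8080_8080_8080_8080;
    (hit != 0).then(|| (hit.trailing_zeros() / 8) as usize)
}
```

This examines 8 bytes per step instead of one, which is where most of the win over a byte-by-byte scan comes from.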
The previous version used a HashMap to map station names to indices and a separate Vec to store the station data.
v3 combines these into a single HashMap, which gave a small (~5%) but consistent speedup.
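A sketch of the combined layout (the field and function names are made up for illustration):

```rust
use std::collections::HashMap;

struct Station { min: i32, max: i32, sum: i64, count: u64 }

// With the record stored directly in the map, a single lookup both
// finds and updates a station, instead of a name-to-index lookup
// followed by an index into a separate Vec.
fn record(map: &mut HashMap<Vec<u8>, Station>, name: &[u8], value: i32) {
    let s = map.entry(name.to_vec()).or_insert(Station {
        min: i32::MAX, max: i32::MIN, sum: 0, count: 0,
    });
    s.min = s.min.min(value);
    s.max = s.max.max(value);
    s.sum += i64::from(value);
    s.count += 1;
}
```

(The `name.to_vec()` allocation on every call is only there to keep the sketch short; a real hot path would avoid it.)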
In v2, hashing is done separately from the HashMap lookup.
This is not a problem in itself, except that the HashMap then hashes the already-computed hash a second time.
To fix this, we use the nohash-hasher crate so that the HashMap uses the hash directly.
This gives a tiny (~3%) but consistent speedup.
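A sketch of the idea, assuming the precomputed `u64` hash is used as the map key (the hash function here is FNV-1a as a stand-in for whatever the real code uses):

```rust
use nohash_hasher::BuildNoHashHasher;
use std::collections::HashMap;

// NoHashHasher passes integer keys through unchanged, so the map no
// longer hashes the already-computed hash a second time.
type StationMap<V> = HashMap<u64, V, BuildNoHashHasher<u64>>;

// FNV-1a, a stand-in for the actual hash function.
fn hash_name(name: &[u8]) -> u64 {
    let mut h = 0xcbf2_9ce4_8422_2325u64;
    for &b in name {
        h = (h ^ u64::from(b)).wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

fn demo() {
    let mut map: StationMap<u32> = StationMap::default();
    *map.entry(hash_name(b"Hamburg")).or_insert(0) += 1;
}
```

Note that keying the map by the hash alone assumes collisions don't occur on the given input (or are handled elsewhere).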
This is a version of v3 with all `assert!`s turned into `debug_assert!`s.
It is reliably faster, but the difference is small enough that I prefer to keep the asserts in the code for now.
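For reference, the only difference is that `debug_assert!` compiles away when debug assertions are off, as in a default release build, so the check costs nothing in the benchmarked binary (a hypothetical example):

```rust
// assert! always runs; debug_assert! is compiled out of release
// builds, removing the branch from the hot loop.
fn parse_digit(b: u8) -> u32 {
    debug_assert!(b.is_ascii_digit(), "unexpected byte: {b}");
    u32::from(b - b'0')
}
```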
This is an attempt at batching the parsing within each thread to get some SIMD action or to overlap some memory fetches.
I don't understand either of those well enough to be sure this is the right way to go about it.
It performs significantly worse than v3 on my machine, though the flame graph shows some signs that less time is spent on cache misses, just not enough to make up for the overhead of the batching.
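For what it's worth, the shape of the experiment was roughly this (a simplified sketch, not the actual code): do the same phase of work for several lines before moving on to the next phase, so the per-line operations are independent and their cache misses can overlap.

```rust
// Find all delimiters first, then parse all values, so each phase is
// a run of independent work the CPU can overlap (and maybe vectorize).
fn process_batch(lines: [&[u8]; 4]) -> [i32; 4] {
    let mut seps = [0usize; 4];
    for i in 0..4 {
        seps[i] = lines[i].iter().position(|&b| b == b';').unwrap();
    }
    let mut vals = [0i32; 4];
    for i in 0..4 {
        vals[i] = parse_value(&lines[i][seps[i] + 1..]);
    }
    vals
}

// Trivial stand-in parser: "12.3" -> 123, "-4.5" -> -45.
fn parse_value(bytes: &[u8]) -> i32 {
    let (neg, digits) = match bytes.split_first() {
        Some((&b'-', rest)) => (true, rest),
        _ => (false, bytes),
    };
    let v = digits
        .iter()
        .filter(|&&b| b != b'.')
        .fold(0i32, |acc, &b| acc * 10 + i32::from(b - b'0'));
    if neg { -v } else { v }
}
```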
I tried to parse the values with bit manipulation, but this was slower than the previous implementation.
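For illustration only, such parses typically lean on ASCII digits being `0x30..=0x39`, so `& 0x0F` extracts a digit's value without a subtraction; a hypothetical sketch of that shape, not the code that was actually tried:

```rust
// Assumes readings are "d.d" or "dd.d" with an optional leading '-';
// the returned value is the reading times ten.
fn parse_value_bits(bytes: &[u8]) -> i32 {
    let (sign, d) = match bytes[0] {
        b'-' => (-1, &bytes[1..]),
        _ => (1, bytes),
    };
    let v = if d.len() == 3 {
        i32::from(d[0] & 0x0F) * 10 + i32::from(d[2] & 0x0F) // "d.d"
    } else {
        i32::from(d[0] & 0x0F) * 100 + i32::from(d[1] & 0x0F) * 10 + i32::from(d[3] & 0x0F) // "dd.d"
    };
    sign * v
}
```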