Improve Taylor1 performance#415
Conversation
|
Nice work @PerezHz! This could be a great improvement for packages upstream. I'm curious, will this PR also affect the performance of |
|
@LuEdRaMo in principle yes, this should help improve performance for nested |
|
Some copy-pasteable benchmarks: using TaylorSeries, BenchmarkTools
const TS = TaylorSeries
ord = 30
a = Taylor1(rand(ord + 1), ord)
b = Taylor1(rand(ord + 1), ord)
c = zero(a)
aux = zero(a)
function test_add_kloop!(ord, a, b, c)
@inbounds for k in 0:ord
TS.add!(c, a, b, k)
end
return c
end
function test_muladd_kloop!(ord, a, b, c)
fill!(c.coeffs, 0.0)
@inbounds for k in 0:ord
TS.muladd!(c, a, b, k)
end
return c
end
function test_pow_kloop!(ord, a, aux, c, r)
@inbounds for k in 0:ord
TS.pow!(c, a, aux, r, k)
end
return c
end
@btime test_add_kloop!($ord, $a, $b, $c);
@btime TS.add!($c, $a, $b);
@btime TS.mul!($c, $a, $b);
@btime test_muladd_kloop!($ord, $a, $b, $c);
@btime test_pow_kloop!($ord, $a, $aux, $c, 1.5);
@btime $a + $b;
@btime $a * $b;
@btime $a ^ 1.5;This PR on my laptop: julia> @btime test_add_kloop!($ord, $a, $b, $c);
9.926 ns (0 allocations: 0 bytes)
julia> @btime TS.add!($c, $a, $b); # doesn't exist in master
4.791 ns (0 allocations: 0 bytes)
julia> @btime TS.mul!($c, $a, $b);
74.632 ns (0 allocations: 0 bytes)
julia> @btime test_muladd_kloop!($ord, $a, $b, $c);
80.489 ns (0 allocations: 0 bytes)
julia> @btime test_pow_kloop!($ord, $a, $aux, $c, 1.5);
278.347 ns (0 allocations: 0 bytes)
julia> @btime $a + $b;
19.475 ns (1 allocation: 272 bytes)
julia> @btime $a * $b;
86.589 ns (1 allocation: 272 bytes)
julia> @btime $a ^ 1.5;
265.975 ns (2 allocations: 544 bytes)
julia> @btime test_add_kloop!($ord, $a, $b, $c);
10.135 ns (0 allocations: 0 bytes)
julia> # @btime TS.add!($c, $a, $b); # does not exist on master
julia> @btime TS.mul!($c, $a, $b);
[ Warning: mul! is defined in LinearAlgebra and is not public in TaylorSeries
185.098 ns (0 allocations: 0 bytes)
julia> @btime test_muladd_kloop!($ord, $a, $b, $c);
182.570 ns (0 allocations: 0 bytes)
julia> @btime test_pow_kloop!($ord, $a, $aux, $c, 1.5);
396.557 ns (0 allocations: 0 bytes)
julia> @btime $a + $b;
22.401 ns (1 allocation: 272 bytes)
julia> @btime $a * $b;
193.639 ns (1 allocation: 272 bytes)
julia> @btime $a ^ 1.5;
443.182 ns (3 allocations: 816 bytes) |
|
I took a closer look at nested Taylor1 operations and found a couple of places where the all-orders-at-once versions of add! and mul! for non-nested Taylor1s can benefit the nested case. So I added some changes in the last commit which should help improve performance. |
|
I tested this branch using NEOs' impact monitoring benchmark (which mostly relies on # master
BenchmarkTools.Trial: 10 samples with 1 evaluation per sample.
Range (min … max): 5.914 s … 10.634 s ┊ GC (min … max): 0.28% … 0.59%
Time (median): 6.564 s ┊ GC (median): 0.34%
Time (mean ± σ): 7.534 s ± 1.832 s ┊ GC (mean ± σ): 0.36% ± 0.19%
█
█▁▁▁▁▇▁▇▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▁▁▁▇ ▁
5.91 s Histogram: frequency by time 10.6 s <
Memory estimate: 143.05 MiB, allocs estimate: 1834386.
# this PR
BenchmarkTools.Trial: 10 samples with 1 evaluation per sample.
Range (min … max): 3.946 s … 4.228 s ┊ GC (min … max): 0.41% … 0.41%
Time (median): 3.998 s ┊ GC (median): 0.32%
Time (mean ± σ): 4.046 s ± 106.174 ms ┊ GC (mean ± σ): 0.35% ± 0.06%
███ █ █ █ █ █ █ █
███▁▁▁█▁▁█▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁█ ▁
3.95 s Histogram: frequency by time 4.23 s <
Memory estimate: 139.81 MiB, allocs estimate: 1818534. |
|
@LuEdRaMo many thanks for checking this, sounds great! Do you find differences which are larger than expected numerical noise? |
…nctions, arithmetic helpers, zero, one, identity, scalar sqrt!, nested differentiation/integration and some evaluation methods (Horner)
|
Fixes #411 |
|
Also, extended this concept to |
|
I'm also running downstream tests (TaylorIntegration, NEOs, PlanetaryEphemeris) using this branch |
|
@lbenet this is now ready for review. Worth noting that downstream tests for TaylorIntegration, PlanetaryEphemeris and NEOs are passing with this branch |
Thanks @PerezHz, in the next few days I'll be sending some comments (I'm at a conference this week). Here is a very general comment (from skimming Also, we should avoid code repetition; makes life simpler to maintain... |
|
@lbenet the idea of the new "full-loop" That said, I agree on both points: we should make this consistent with how |
I see that the direct method (
I am already half-way through a first review, so let me finish it, and then you try condensing/simplifying the code. |
I meant this for
Alright, looking forward to receiving feedback to improve this PR! |
lbenet
left a comment
There was a problem hiding this comment.
First, I think this PR is great, finding a way to avoid some unnecessary allocations. Yet, I think some of the methods introduced define differing (code) paths that could change the behaviour in comparisson with the taylorized functions, which is not advisable. Also, I think some methods are too specialized, and we should specialize the inner methods (such as add!(a, b, c, k)).
I've thus left some comments which are mostly suggestions. Their goal is to avoid repeated code, and make sure that the taylorized functions yield the same.
| b_coeffs = b.coeffs | ||
| @inbounds for i in eachindex(v_coeffs) | ||
| v_coeffs[i] = a_coeffs[i] + b_coeffs[i] | ||
| end |
There was a problem hiding this comment.
Note: This is not a suggestion for replacement (even as it appears as such), it is simply an illustration of my general comment here:
| end | |
| for k in eachindex(a) | |
| add!(v, a, b, k) | |
| end |
The idea is to use in the inner method add!(v, a, b, k) the kind of performance optimisations you are introducing in this PR.
I don't know if this change would provide the same performance improvements, but it is worth checking it. The key point is that the same operations would be implemented when operating directly with the optimized (@taylorized) methods, or directly with the Taylor1 objects.
There was a problem hiding this comment.
Thank you @lbenet for this suggestion. I double-checked this, and it does look like adding the optimizations at the add!(v, a, b, k) level gives essentially the same performance, with a difference ~0.5 ns, which is already around the noise level (this probably means that somehow the Julia compiler manages to inline away the loops and generates optimized code very similar to what the previous approach did). So I modified add!(v, a, b, k) and subst!(v, a, b, k) to use operate directly on the .coeffs.
Thanks for the clarification; it is good point, and I agree in the usufulness of the method. Yet, consider that we may have |
Co-authored-by: Luis Benet <lbenet@users.noreply.github.com>
…dentity! in call site
|
Updated benchmarks on my laptop: julia> @btime test_add_kloop!($ord, $a, $b, $c); # dropped from 9.926 ns vs last run of this PR, 10.135 ns on `master`
4.458 ns (0 allocations: 0 bytes)
julia> @btime TS.add!($c, $a, $b);
4.416 ns (0 allocations: 0 bytes)
julia> @btime TS.mul!($c, $a, $b);
[ Warning: mul! is defined in LinearAlgebra and is not public in TaylorSeries
74.554 ns (0 allocations: 0 bytes)
julia> @btime test_muladd_kloop!($ord, $a, $b, $c);
81.006 ns (0 allocations: 0 bytes)
julia> @btime test_pow_kloop!($ord, $a, $aux, $c, 1.5);
279.394 ns (0 allocations: 0 bytes)
julia> @btime $a + $b;
19.204 ns (1 allocation: 272 bytes)
julia> @btime $a * $b;
87.022 ns (1 allocation: 272 bytes)
julia> @btime $a ^ 1.5;
267.590 ns (2 allocations: 544 bytes) |
|
@lbenet thanks a lot for your suggestions and comments; I think I managed to go through all of them; so this is ready for you to have another look |
|
Thanks a lot! I'll have another look, everything seems promising, but probably will do it tomorrow. Yet, I've just checked (locally) if tests for |
|
@lbenet thanks for checking this; yes fixing #411 causes an errored test to pass in TI; I have opened PerezHz/TaylorIntegration.jl#243 which tests TI with this TS branch (plus further performance improvements on TI side); will update it once this is merged. |
|
I am ok with this PR as it is, but fear how it affects TaylorIntegration.jl and/or TaylorModels.jl. I see that TaylorIntegration (using PerezHz/TaylorIntegration.jl#243) now has passing tests, but also lots of other things, which seem to me a bit unrelated. I'll check how TaylorModels behaves... |
|
I'll go ahead and merge this. As far as I've checked everything is fine with TaylorIntegration.jl and TaylorModels.jl |
|
Thanks a lot @PerezHz for this! |
* Test TS PR 415 * Remove test that no longer errors * Add regression test * Import test from #243 * Fix test * Update ci.yml; bump patch version fix TS compat
This PR adds some improvements to
Taylor1{Float64}performance:src/arithmetic.jl: add direct coefficient-vector paths for scalaradd!,subst!,mul!,muladd!,mul_scalar!, plus scalar+and-. Turns out there is a small overhead by accessinga.coeffslots of times in the innermost loops, which can build up. Also these methods now use a local accumulator for each coefficient, then store once; this avoids repeatedc[k]loads/stores inside the order traversing loop.src/power.jl: cached scalarTaylor1power metadata (findfirst(a),findlast(a),a[l0], andorder(a)) instead of recomputing per coefficient. Also, switch to direct coefficient access inpow!, and faster scalarsqr!.src/identity.jl: clean up a small inference inidentity!, which was returning an assigned scalar in one method instead ofnothing.test/mutatingfuncts.jl: added regression checks for full-series scalar mutating methods.nowarnkwarg invariables!, so that tests are free of warnings.Some quick, local, preliminary benchmarks on order 30 /100
Taylor1{Float64}:mul!: ~205 ns / 2419 ns before, now ~85 ns / 1105 ns
a*b: ~211 ns / 2481 ns before, now ~103 ns / 1100 ns
pow! coefficient loop: ~438 ns / 3961 ns before, now ~321 ns / 2560 ns
a^1.5: ~474 ns / 3995 ns before, now ~298 ns / 2235 ns
new full-series add!: (there's no before), now ~3.5 ns / 9.3 ns, allocation-free
TODO: add more systematic benchmarksEDIT: See below for systematic benchmarks