Skip to content

Improve Taylor1 performance#415

Merged
lbenet merged 19 commits into
JuliaDiff:masterfrom
PerezHz:jp/some-perf-fixes
Jun 13, 2026
Merged

Improve Taylor1 performance#415
lbenet merged 19 commits into
JuliaDiff:masterfrom
PerezHz:jp/some-perf-fixes

Conversation

@PerezHz

@PerezHz PerezHz commented May 29, 2026

Copy link
Copy Markdown
Contributor

This PR adds some improvements to Taylor1{Float64} performance:

  • src/arithmetic.jl: add direct coefficient-vector paths for scalar add!, subst!, mul!, muladd!, mul_scalar!, plus scalar + and -. Turns out there is a small overhead by accessing a.coeffs lots of times in the innermost loops, which can build up. Also these methods now use a local accumulator for each coefficient, then store once; this avoids repeated c[k] loads/stores inside the order traversing loop.
  • src/power.jl: cached scalar Taylor1 power metadata (findfirst(a), findlast(a), a[l0], and order(a)) instead of recomputing per coefficient. Also, switch to direct coefficient access in pow!, and faster scalar sqr!.
  • src/identity.jl: clean up a small inference in identity!, which was returning an assigned scalar in one method instead of nothing.
  • test/mutatingfuncts.jl: added regression checks for full-series scalar mutating methods.
  • Add a nowarn kwarg in variables!, so that tests are free of warnings.

Some quick, local, preliminary benchmarks on order 30 /100 Taylor1{Float64}:

mul!: ~205 ns / 2419 ns before, now ~85 ns / 1105 ns
a*b: ~211 ns / 2481 ns before, now ~103 ns / 1100 ns
pow! coefficient loop: ~438 ns / 3961 ns before, now ~321 ns / 2560 ns
a^1.5: ~474 ns / 3995 ns before, now ~298 ns / 2235 ns
new full-series add!: (there's no before), now ~3.5 ns / 9.3 ns, allocation-free

TODO: add more systematic benchmarks
EDIT: See below for systematic benchmarks

@coveralls

coveralls commented May 29, 2026

Copy link
Copy Markdown

Coverage Status

Coverage is 89.326%PerezHz:jp/some-perf-fixes into JuliaDiff:master. No base build found for JuliaDiff:master.

@LuEdRaMo

Copy link
Copy Markdown
Contributor

Nice work @PerezHz! This could be a great improvement for packages upstream.

I'm curious, will this PR also affect the performance of Taylor1{Taylor1{T}} operations?

@PerezHz

PerezHz commented May 30, 2026

Copy link
Copy Markdown
Contributor Author

@LuEdRaMo in principle yes, this should help improve performance for nested Taylor1 since it touches the Taylor1{Float64} kernels which the nested operations use. This is still a proof of concept; I'd like to add some benchmarks to get a better sense of the potential speedups here.

@PerezHz

PerezHz commented May 31, 2026

Copy link
Copy Markdown
Contributor Author

Some copy-pasteable benchmarks:

using TaylorSeries, BenchmarkTools
const TS = TaylorSeries

ord = 30
a = Taylor1(rand(ord + 1), ord)
b = Taylor1(rand(ord + 1), ord)
c = zero(a)
aux = zero(a)

function test_add_kloop!(ord, a, b, c)
    @inbounds for k in 0:ord
        TS.add!(c, a, b, k)
    end
    return c
end
function test_muladd_kloop!(ord, a, b, c)
    fill!(c.coeffs, 0.0)
    @inbounds for k in 0:ord
        TS.muladd!(c, a, b, k)
    end
    return c
end
function test_pow_kloop!(ord, a, aux, c, r)
    @inbounds for k in 0:ord
        TS.pow!(c, a, aux, r, k)
    end
    return c
end

@btime test_add_kloop!($ord, $a, $b, $c);
@btime TS.add!($c, $a, $b);
@btime TS.mul!($c, $a, $b);
@btime test_muladd_kloop!($ord, $a, $b, $c);
@btime test_pow_kloop!($ord, $a, $aux, $c, 1.5);
@btime $a + $b;
@btime $a * $b;
@btime $a ^ 1.5;

This PR on my laptop:

julia> @btime test_add_kloop!($ord, $a, $b, $c);
  9.926 ns (0 allocations: 0 bytes)

julia> @btime TS.add!($c, $a, $b); # doesn't exist in master
  4.791 ns (0 allocations: 0 bytes)

julia> @btime TS.mul!($c, $a, $b);
  74.632 ns (0 allocations: 0 bytes)

julia> @btime test_muladd_kloop!($ord, $a, $b, $c);
  80.489 ns (0 allocations: 0 bytes)

julia> @btime test_pow_kloop!($ord, $a, $aux, $c, 1.5);
  278.347 ns (0 allocations: 0 bytes)

julia> @btime $a + $b;
  19.475 ns (1 allocation: 272 bytes)

julia> @btime $a * $b;
  86.589 ns (1 allocation: 272 bytes)

julia> @btime $a ^ 1.5;
  265.975 ns (2 allocations: 544 bytes)

master on my laptop:

julia> @btime test_add_kloop!($ord, $a, $b, $c);
  10.135 ns (0 allocations: 0 bytes)

julia> # @btime TS.add!($c, $a, $b); # does not exist on master

julia> @btime TS.mul!($c, $a, $b);
[ Warning: mul! is defined in LinearAlgebra and is not public in TaylorSeries
  185.098 ns (0 allocations: 0 bytes)

julia> @btime test_muladd_kloop!($ord, $a, $b, $c);
  182.570 ns (0 allocations: 0 bytes)

julia> @btime test_pow_kloop!($ord, $a, $aux, $c, 1.5);
  396.557 ns (0 allocations: 0 bytes)

julia> @btime $a + $b;
  22.401 ns (1 allocation: 272 bytes)

julia> @btime $a * $b;
  193.639 ns (1 allocation: 272 bytes)

julia> @btime $a ^ 1.5;
  443.182 ns (3 allocations: 816 bytes)

@PerezHz

PerezHz commented May 31, 2026

Copy link
Copy Markdown
Contributor Author

I took a closer look at nested Taylor1 operations and found a couple of places where the all-orders-at-once versions of add! and mul! for non-nested Taylor1s can benefit the nested case. So I added some changes in the last commit which should help improve performance.

@LuEdRaMo

LuEdRaMo commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

I tested this branch using NEOs' impact monitoring benchmark (which mostly relies on Taylor1{Taylor1{T}} integrations) and these are the results:

# master

BenchmarkTools.Trial: 10 samples with 1 evaluation per sample.
 Range (min  max):  5.914 s  10.634 s  ┊ GC (min  max): 0.28%  0.59%
 Time  (median):     6.564 s             ┊ GC (median):    0.34%
 Time  (mean ± σ):   7.534 s ±  1.832 s  ┊ GC (mean ± σ):  0.36% ± 0.19%

  █                                                         
  █▁▁▁▁▇▁▇▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▁▁▁▇ ▁
  5.91 s        Histogram: frequency by time        10.6 s <

 Memory estimate: 143.05 MiB, allocs estimate: 1834386.

# this PR

BenchmarkTools.Trial: 10 samples with 1 evaluation per sample.
 Range (min  max):  3.946 s     4.228 s  ┊ GC (min  max): 0.41%  0.41%
 Time  (median):     3.998 s               ┊ GC (median):    0.32%
 Time  (mean ± σ):   4.046 s ± 106.174 ms  ┊ GC (mean ± σ):  0.35% ± 0.06%

  ███   █  █ █               █     █                    █  █  
  ███▁▁▁█▁▁█▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁█ ▁
  3.95 s         Histogram: frequency by time         4.23 s <

 Memory estimate: 139.81 MiB, allocs estimate: 1818534.

@PerezHz

PerezHz commented Jun 3, 2026

Copy link
Copy Markdown
Contributor Author

@LuEdRaMo many thanks for checking this, sounds great! Do you find differences which are larger than expected numerical noise?

PerezHz added 2 commits June 6, 2026 16:14
…nctions, arithmetic helpers, zero, one, identity, scalar sqrt!, nested differentiation/integration and some evaluation methods (Horner)
@PerezHz

PerezHz commented Jun 6, 2026

Copy link
Copy Markdown
Contributor Author

Fixes #411

@PerezHz

PerezHz commented Jun 6, 2026

Copy link
Copy Markdown
Contributor Author

Also, extended this concept to Taylor1{Float64} and Taylor1{Taylor1{Float64}} methods: elementary functions, division helpers (div_scalar!), differentiation and integration, evaluation methods, as well as zero!, one!, identity! and sqrt!

@PerezHz

PerezHz commented Jun 6, 2026

Copy link
Copy Markdown
Contributor Author

I'm also running downstream tests (TaylorIntegration, NEOs, PlanetaryEphemeris) using this branch

@PerezHz PerezHz changed the title WIP: improve Taylor1 performance Improve Taylor1 performance Jun 6, 2026
@PerezHz PerezHz marked this pull request as ready for review June 6, 2026 23:04
@PerezHz

PerezHz commented Jun 6, 2026

Copy link
Copy Markdown
Contributor Author

@lbenet this is now ready for review. Worth noting that downstream tests for TaylorIntegration, PlanetaryEphemeris and NEOs are passing with this branch

@lbenet

lbenet commented Jun 8, 2026

Copy link
Copy Markdown
Member

@lbenet this is now ready for review. Worth noting that downstream tests for TaylorIntegration, PlanetaryEphemeris and NEOs are passing with this branch

Thanks @PerezHz, in the next few days I'll be sending some comments (I'm at a conference this week).

Here is a very general comment (from skimming src/arithmetic.jl): While I like very much the approach and its benefits (e.g., add!(a, b, c)) I find it subtle that @taylorize will use a different method (i.e., add!(a, b, c, k)). While the example is perhaps not the best, my point is to use the new approach at the level of add!(a, b, c, k), and also reuse that function as much as possible, e.g. in order to compute a+b, use add!(a, b, c) which loops over k and uses add!(a, b, c, k).

Also, we should avoid code repetition; makes life simpler to maintain...

@PerezHz

PerezHz commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

@lbenet the idea of the new "full-loop" add!(a, b, c) method for Taylor1{Float64} is mainly to be used internally by add!(a, b, c, k) method for Taylor1{TaylorN{Float64}}, which makes addition for the latter type faster, since it does not have lots of calling overhead going back and forth for addition of nested-type variables (same goes for subst!, while for mul!, div! and pow! things are different since they do "heavier" computations).

That said, I agree on both points: we should make this consistent with how @taylorize is doing things, to avoid divergence between taylorized and non-taylorized computations, and I also agree that we should avoid code duplication where possible (I mainly focused on performance for this PR, but happy to look for places where code can be condensed and simplified). If you agree I can try to take a look into possibly condensing/simplifying code before you take a deeper look into the changes so far, what do you think?

@lbenet

lbenet commented Jun 9, 2026

Copy link
Copy Markdown
Member

@lbenet the idea of the new "full-loop" add!(a, b, c) method for Taylor1{Float64} is mainly to be used internally by add!(a, b, c, k) method for Taylor1{TaylorN{Float64}}, which makes addition for the latter type faster, since it does not have lots of calling overhead going back and forth for addition of nested-type variables (same goes for subst!, while for mul!, div! and pow! things are different since they do "heavier" computations).

I see that the direct method (add!(a, b, c)) is faster, by avoiding the allocations that a[k] causes in favor of a.coeffs[k+1] (including the "views" a_coeffs). But I am not sure that I follow your argument, since it is not clear to me how add!(a, b, c, k) will use ``add!(a, b, c)`...

That said, I agree on both points: we should make this consistent with how @taylorize is doing things, to avoid divergence between taylorized and non-taylorized computations, and I also agree that we should avoid code duplication where possible (I mainly focused on performance for this PR, but happy to look for places where code can be condensed and simplified). If you agree I can try to take a look into possibly condensing/simplifying code before you take a deeper look into the changes so far, what do you think?

I am already half-way through a first review, so let me finish it, and then you try condensing/simplifying the code.

@PerezHz

PerezHz commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

But I am not sure that I follow your argument, since it is not clear to me how add!(a, b, c, k) will use ``add!(a, b, c)`...

I meant this for Taylor1{Taylor1{Float64}}. For example, here the add! method to update the k-th coefficient of a Taylor1{Taylor1{Float64}} is using the "full-loop" add!(a, b, c) to update the inner (or nested) coefficients in a single call.

I am already half-way through a first review, so let me finish it, and then you try condensing/simplifying the code.

Alright, looking forward to receiving feedback to improve this PR!

@lbenet lbenet left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First, I think this PR is great, finding a way to avoid some unnecessary allocations. Yet, I think some of the methods introduced define differing (code) paths that could change the behaviour in comparisson with the taylorized functions, which is not advisable. Also, I think some methods are too specialized, and we should specialize the inner methods (such as add!(a, b, c, k)).
I've thus left some comments which are mostly suggestions. Their goal is to avoid repeated code, and make sure that the taylorized functions yield the same.

Comment thread src/arithmetic.jl Outdated
b_coeffs = b.coeffs
@inbounds for i in eachindex(v_coeffs)
v_coeffs[i] = a_coeffs[i] + b_coeffs[i]
end

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: This is not a suggestion for replacement (even as it appears as such), it is simply an illustration of my general comment here:

Suggested change
end
for k in eachindex(a)
add!(v, a, b, k)
end

The idea is to use in the inner method add!(v, a, b, k) the kind of performance optimisations you are introducing in this PR.

I don't know if this change would provide the same performance improvements, but it is worth checking it. The key point is that the same operations would be implemented when operating directly with the optimized (@taylorized) methods, or directly with the Taylor1 objects.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @lbenet for this suggestion. I double-checked this, and it does look like adding the optimizations at the add!(v, a, b, k) level gives essentially the same performance, with a difference ~0.5 ns, which is already around the noise level (this probably means that somehow the Julia compiler manages to inline away the loops and generates optimized code very similar to what the previous approach did). So I modified add!(v, a, b, k) and subst!(v, a, b, k) to use operate directly on the .coeffs.

Comment thread src/arithmetic.jl Outdated
Comment thread src/arithmetic.jl
Comment thread src/arithmetic.jl Outdated
Comment thread src/arithmetic.jl
Comment thread src/identity.jl
Comment thread src/power.jl Outdated
Comment thread src/power.jl Outdated
Comment thread src/TaylorSeries.jl
Comment thread test/manyvariables.jl
@lbenet

lbenet commented Jun 9, 2026

Copy link
Copy Markdown
Member

I meant this for Taylor1{Taylor1{Float64}}. For example, here the add! method to update the k-th coefficient of a Taylor1{Taylor1{Float64}} is using the "full-loop" add!(a, b, c) to update the inner (or nested) coefficients in a single call.

Thanks for the clarification; it is good point, and I agree in the usufulness of the method. Yet, consider that we may have Taylor1{Taylor{Taylor1{<:NumberNotSeries}}} which would not use those methods.

PerezHz and others added 3 commits June 9, 2026 22:08
@PerezHz

PerezHz commented Jun 11, 2026

Copy link
Copy Markdown
Contributor Author

Updated benchmarks on my laptop:

julia> @btime test_add_kloop!($ord, $a, $b, $c); # dropped from 9.926 ns vs last run of this PR, 10.135 ns on `master`
  4.458 ns (0 allocations: 0 bytes)

julia> @btime TS.add!($c, $a, $b);
  4.416 ns (0 allocations: 0 bytes)

julia> @btime TS.mul!($c, $a, $b);
[ Warning: mul! is defined in LinearAlgebra and is not public in TaylorSeries
  74.554 ns (0 allocations: 0 bytes)

julia> @btime test_muladd_kloop!($ord, $a, $b, $c);
  81.006 ns (0 allocations: 0 bytes)

julia> @btime test_pow_kloop!($ord, $a, $aux, $c, 1.5);
  279.394 ns (0 allocations: 0 bytes)

julia> @btime $a + $b;
  19.204 ns (1 allocation: 272 bytes)

julia> @btime $a * $b;
  87.022 ns (1 allocation: 272 bytes)

julia> @btime $a ^ 1.5;
  267.590 ns (2 allocations: 544 bytes)

@PerezHz

PerezHz commented Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

@lbenet thanks a lot for your suggestions and comments; I think I managed to go through all of them; so this is ready for you to have another look

@lbenet

lbenet commented Jun 12, 2026

Copy link
Copy Markdown
Member

Thanks a lot! I'll have another look, everything seems promising, but probably will do it tomorrow. Yet, I've just checked (locally) if tests for TaylorIntegration pass, and there are two failing tests (in taylorize.jl, for kepler1! defined in line 2093 or around it). The first test that fails is because taylorizing yields no problems (so line 2107 yields no logs), and then parse_eqs is true (line 2116). It would be worth checking if the produced series are the expected ones...

@PerezHz

PerezHz commented Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

@lbenet thanks for checking this; yes fixing #411 causes an errored test to pass in TI; I have opened PerezHz/TaylorIntegration.jl#243 which tests TI with this TS branch (plus further performance improvements on TI side); will update it once this is merged.

@lbenet

lbenet commented Jun 12, 2026

Copy link
Copy Markdown
Member

I am ok with this PR as it is, but fear how it affects TaylorIntegration.jl and/or TaylorModels.jl. I see that TaylorIntegration (using PerezHz/TaylorIntegration.jl#243) now has passing tests, but also lots of other things, which seem to me a bit unrelated. I'll check how TaylorModels behaves...

@lbenet

lbenet commented Jun 13, 2026

Copy link
Copy Markdown
Member

I'll go ahead and merge this. As far as I've checked everything is fine with TaylorIntegration.jl and TaylorModels.jl

@lbenet

lbenet commented Jun 13, 2026

Copy link
Copy Markdown
Member

Thanks a lot @PerezHz for this!

@lbenet lbenet merged commit 7a8ccbd into JuliaDiff:master Jun 13, 2026
13 checks passed
PerezHz added a commit to PerezHz/TaylorIntegration.jl that referenced this pull request Jun 13, 2026
* Test TS PR 415

* Remove test that no longer errors

* Add regression test

* Import test from #243

* Fix test

* Update ci.yml; bump patch version fix TS compat
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants