Improve Taylor1 performance by PerezHz · Pull Request #415 · JuliaDiff/TaylorSeries.jl

PerezHz · 2026-05-29T22:01:43Z

This PR adds some improvements to Taylor1{Float64} performance:

src/arithmetic.jl: add direct coefficient-vector paths for scalar add!, subst!, mul!, muladd!, mul_scalar!, plus scalar + and -. Turns out there is a small overhead by accessing a.coeffs lots of times in the innermost loops, which can build up. Also these methods now use a local accumulator for each coefficient, then store once; this avoids repeated c[k] loads/stores inside the order traversing loop.
src/power.jl: cached scalar Taylor1 power metadata (findfirst(a), findlast(a), a[l0], and order(a)) instead of recomputing per coefficient. Also, switch to direct coefficient access in pow!, and faster scalar sqr!.
src/identity.jl: clean up a small inference in identity!, which was returning an assigned scalar in one method instead of nothing.
test/mutatingfuncts.jl: added regression checks for full-series scalar mutating methods.
Add a nowarn kwarg in variables!, so that tests are free of warnings.

Some quick, local, preliminary benchmarks on order 30 /100 Taylor1{Float64}:

mul!: ~205 ns / 2419 ns before, now ~85 ns / 1105 ns
a*b: ~211 ns / 2481 ns before, now ~103 ns / 1100 ns
pow! coefficient loop: ~438 ns / 3961 ns before, now ~321 ns / 2560 ns
a^1.5: ~474 ns / 3995 ns before, now ~298 ns / 2235 ns
new full-series add!: (there's no before), now ~3.5 ns / 9.3 ns, allocation-free

~~TODO: add more systematic benchmarks~~
EDIT: See below for systematic benchmarks

…nd use it in tests

coveralls · 2026-05-29T22:06:28Z

Coverage is 89.326% — PerezHz:jp/some-perf-fixes into JuliaDiff:master. No base build found for JuliaDiff:master.

LuEdRaMo · 2026-05-30T15:16:31Z

Nice work @PerezHz! This could be a great improvement for packages upstream.

I'm curious, will this PR also affect the performance of Taylor1{Taylor1{T}} operations?

PerezHz · 2026-05-30T20:50:42Z

@LuEdRaMo in principle yes, this should help improve performance for nested Taylor1 since it touches the Taylor1{Float64} kernels which the nested operations use. This is still a proof of concept; I'd like to add some benchmarks to get a better sense of the potential speedups here.

PerezHz · 2026-05-31T02:09:58Z

Some copy-pasteable benchmarks:

using TaylorSeries, BenchmarkTools
const TS = TaylorSeries

ord = 30
a = Taylor1(rand(ord + 1), ord)
b = Taylor1(rand(ord + 1), ord)
c = zero(a)
aux = zero(a)

function test_add_kloop!(ord, a, b, c)
    @inbounds for k in 0:ord
        TS.add!(c, a, b, k)
    end
    return c
end
function test_muladd_kloop!(ord, a, b, c)
    fill!(c.coeffs, 0.0)
    @inbounds for k in 0:ord
        TS.muladd!(c, a, b, k)
    end
    return c
end
function test_pow_kloop!(ord, a, aux, c, r)
    @inbounds for k in 0:ord
        TS.pow!(c, a, aux, r, k)
    end
    return c
end

@btime test_add_kloop!($ord, $a, $b, $c);
@btime TS.add!($c, $a, $b);
@btime TS.mul!($c, $a, $b);
@btime test_muladd_kloop!($ord, $a, $b, $c);
@btime test_pow_kloop!($ord, $a, $aux, $c, 1.5);
@btime $a + $b;
@btime $a * $b;
@btime $a ^ 1.5;

This PR on my laptop:

julia> @btime test_add_kloop!($ord, $a, $b, $c);
  9.926 ns (0 allocations: 0 bytes)

julia> @btime TS.add!($c, $a, $b); # doesn't exist in master
  4.791 ns (0 allocations: 0 bytes)

julia> @btime TS.mul!($c, $a, $b);
  74.632 ns (0 allocations: 0 bytes)

julia> @btime test_muladd_kloop!($ord, $a, $b, $c);
  80.489 ns (0 allocations: 0 bytes)

julia> @btime test_pow_kloop!($ord, $a, $aux, $c, 1.5);
  278.347 ns (0 allocations: 0 bytes)

julia> @btime $a + $b;
  19.475 ns (1 allocation: 272 bytes)

julia> @btime $a * $b;
  86.589 ns (1 allocation: 272 bytes)

julia> @btime $a ^ 1.5;
  265.975 ns (2 allocations: 544 bytes)

master on my laptop:

julia> @btime test_add_kloop!($ord, $a, $b, $c);
  10.135 ns (0 allocations: 0 bytes)

julia> # @btime TS.add!($c, $a, $b); # does not exist on master

julia> @btime TS.mul!($c, $a, $b);
[ Warning: mul! is defined in LinearAlgebra and is not public in TaylorSeries
  185.098 ns (0 allocations: 0 bytes)

julia> @btime test_muladd_kloop!($ord, $a, $b, $c);
  182.570 ns (0 allocations: 0 bytes)

julia> @btime test_pow_kloop!($ord, $a, $aux, $c, 1.5);
  396.557 ns (0 allocations: 0 bytes)

julia> @btime $a + $b;
  22.401 ns (1 allocation: 272 bytes)

julia> @btime $a * $b;
  193.639 ns (1 allocation: 272 bytes)

julia> @btime $a ^ 1.5;
  443.182 ns (3 allocations: 816 bytes)

PerezHz · 2026-05-31T03:49:51Z

I took a closer look at nested Taylor1 operations and found a couple of places where the all-orders-at-once versions of add! and mul! for non-nested Taylor1s can benefit the nested case. So I added some changes in the last commit which should help improve performance.

LuEdRaMo · 2026-06-02T18:26:14Z

I tested this branch using NEOs' impact monitoring benchmark (which mostly relies on Taylor1{Taylor1{T}} integrations) and these are the results:

# master

BenchmarkTools.Trial: 10 samples with 1 evaluation per sample.
 Range (min … max):  5.914 s … 10.634 s  ┊ GC (min … max): 0.28% … 0.59%
 Time  (median):     6.564 s             ┊ GC (median):    0.34%
 Time  (mean ± σ):   7.534 s ±  1.832 s  ┊ GC (mean ± σ):  0.36% ± 0.19%

  █                                                         
  █▁▁▁▁▇▁▇▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▁▁▁▇ ▁
  5.91 s        Histogram: frequency by time        10.6 s <

 Memory estimate: 143.05 MiB, allocs estimate: 1834386.

# this PR

BenchmarkTools.Trial: 10 samples with 1 evaluation per sample.
 Range (min … max):  3.946 s …    4.228 s  ┊ GC (min … max): 0.41% … 0.41%
 Time  (median):     3.998 s               ┊ GC (median):    0.32%
 Time  (mean ± σ):   4.046 s ± 106.174 ms  ┊ GC (mean ± σ):  0.35% ± 0.06%

  ███   █  █ █               █     █                    █  █  
  ███▁▁▁█▁▁█▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁█ ▁
  3.95 s         Histogram: frequency by time         4.23 s <

 Memory estimate: 139.81 MiB, allocs estimate: 1818534.

PerezHz · 2026-06-03T20:39:36Z

@LuEdRaMo many thanks for checking this, sounds great! Do you find differences which are larger than expected numerical noise?

…nctions, arithmetic helpers, zero, one, identity, scalar sqrt!, nested differentiation/integration and some evaluation methods (Horner)

PerezHz · 2026-06-06T20:44:27Z

Fixes #411

PerezHz · 2026-06-06T20:49:36Z

Also, extended this concept to Taylor1{Float64} and Taylor1{Taylor1{Float64}} methods: elementary functions, division helpers (div_scalar!), differentiation and integration, evaluation methods, as well as zero!, one!, identity! and sqrt!

PerezHz · 2026-06-06T20:49:58Z

I'm also running downstream tests (TaylorIntegration, NEOs, PlanetaryEphemeris) using this branch

PerezHz · 2026-06-06T23:08:06Z

@lbenet this is now ready for review. Worth noting that downstream tests for TaylorIntegration, PlanetaryEphemeris and NEOs are passing with this branch

lbenet · 2026-06-08T00:52:23Z

@lbenet this is now ready for review. Worth noting that downstream tests for TaylorIntegration, PlanetaryEphemeris and NEOs are passing with this branch

Thanks @PerezHz, in the next few days I'll be sending some comments (I'm at a conference this week).

Here is a very general comment (from skimming src/arithmetic.jl): While I like very much the approach and its benefits (e.g., add!(a, b, c)) I find it subtle that @taylorize will use a different method (i.e., add!(a, b, c, k)). While the example is perhaps not the best, my point is to use the new approach at the level of add!(a, b, c, k), and also reuse that function as much as possible, e.g. in order to compute a+b, use add!(a, b, c) which loops over k and uses add!(a, b, c, k).

Also, we should avoid code repetition; makes life simpler to maintain...

PerezHz · 2026-06-08T22:19:56Z

@lbenet the idea of the new "full-loop" add!(a, b, c) method for Taylor1{Float64} is mainly to be used internally by add!(a, b, c, k) method for Taylor1{TaylorN{Float64}}, which makes addition for the latter type faster, since it does not have lots of calling overhead going back and forth for addition of nested-type variables (same goes for subst!, while for mul!, div! and pow! things are different since they do "heavier" computations).

That said, I agree on both points: we should make this consistent with how @taylorize is doing things, to avoid divergence between taylorized and non-taylorized computations, and I also agree that we should avoid code duplication where possible (I mainly focused on performance for this PR, but happy to look for places where code can be condensed and simplified). If you agree I can try to take a look into possibly condensing/simplifying code before you take a deeper look into the changes so far, what do you think?

lbenet · 2026-06-09T00:04:43Z

@lbenet the idea of the new "full-loop" add!(a, b, c) method for Taylor1{Float64} is mainly to be used internally by add!(a, b, c, k) method for Taylor1{TaylorN{Float64}}, which makes addition for the latter type faster, since it does not have lots of calling overhead going back and forth for addition of nested-type variables (same goes for subst!, while for mul!, div! and pow! things are different since they do "heavier" computations).

I see that the direct method (add!(a, b, c)) is faster, by avoiding the allocations that a[k] causes in favor of a.coeffs[k+1] (including the "views" a_coeffs). But I am not sure that I follow your argument, since it is not clear to me how add!(a, b, c, k) will use ``add!(a, b, c)`...

That said, I agree on both points: we should make this consistent with how @taylorize is doing things, to avoid divergence between taylorized and non-taylorized computations, and I also agree that we should avoid code duplication where possible (I mainly focused on performance for this PR, but happy to look for places where code can be condensed and simplified). If you agree I can try to take a look into possibly condensing/simplifying code before you take a deeper look into the changes so far, what do you think?

I am already half-way through a first review, so let me finish it, and then you try condensing/simplifying the code.

PerezHz · 2026-06-09T01:37:24Z

But I am not sure that I follow your argument, since it is not clear to me how add!(a, b, c, k) will use ``add!(a, b, c)`...

I meant this for Taylor1{Taylor1{Float64}}. For example, here the add! method to update the k-th coefficient of a Taylor1{Taylor1{Float64}} is using the "full-loop" add!(a, b, c) to update the inner (or nested) coefficients in a single call.

I am already half-way through a first review, so let me finish it, and then you try condensing/simplifying the code.

Alright, looking forward to receiving feedback to improve this PR!

lbenet

First, I think this PR is great, finding a way to avoid some unnecessary allocations. Yet, I think some of the methods introduced define differing (code) paths that could change the behaviour in comparisson with the taylorized functions, which is not advisable. Also, I think some methods are too specialized, and we should specialize the inner methods (such as add!(a, b, c, k)).
I've thus left some comments which are mostly suggestions. Their goal is to avoid repeated code, and make sure that the taylorized functions yield the same.

lbenet · 2026-06-09T04:52:50Z

+    b_coeffs = b.coeffs
+    @inbounds for i in eachindex(v_coeffs)
+        v_coeffs[i] = a_coeffs[i] + b_coeffs[i]
+    end


Note: This is not a suggestion for replacement (even as it appears as such), it is simply an illustration of my general comment here:

Suggested change

end

for k in eachindex(a)

add!(v, a, b, k)

end

The idea is to use in the inner method add!(v, a, b, k) the kind of performance optimisations you are introducing in this PR.

I don't know if this change would provide the same performance improvements, but it is worth checking it. The key point is that the same operations would be implemented when operating directly with the optimized (@taylorized) methods, or directly with the Taylor1 objects.

Thank you @lbenet for this suggestion. I double-checked this, and it does look like adding the optimizations at the add!(v, a, b, k) level gives essentially the same performance, with a difference ~0.5 ns, which is already around the noise level (this probably means that somehow the Julia compiler manages to inline away the loops and generates optimized code very similar to what the previous approach did). So I modified add!(v, a, b, k) and subst!(v, a, b, k) to use operate directly on the .coeffs.

lbenet · 2026-06-09T05:16:26Z

I meant this for Taylor1{Taylor1{Float64}}. For example, here the add! method to update the k-th coefficient of a Taylor1{Taylor1{Float64}} is using the "full-loop" add!(a, b, c) to update the inner (or nested) coefficients in a single call.

Thanks for the clarification; it is good point, and I agree in the usufulness of the method. Yet, consider that we may have Taylor1{Taylor{Taylor1{<:NumberNotSeries}}} which would not use those methods.

Co-authored-by: Luis Benet <lbenet@users.noreply.github.com>

…dentity! in call site

PerezHz · 2026-06-11T21:40:45Z

Updated benchmarks on my laptop:

julia> @btime test_add_kloop!($ord, $a, $b, $c); # dropped from 9.926 ns vs last run of this PR, 10.135 ns on `master`
  4.458 ns (0 allocations: 0 bytes)

julia> @btime TS.add!($c, $a, $b);
  4.416 ns (0 allocations: 0 bytes)

julia> @btime TS.mul!($c, $a, $b);
[ Warning: mul! is defined in LinearAlgebra and is not public in TaylorSeries
  74.554 ns (0 allocations: 0 bytes)

julia> @btime test_muladd_kloop!($ord, $a, $b, $c);
  81.006 ns (0 allocations: 0 bytes)

julia> @btime test_pow_kloop!($ord, $a, $aux, $c, 1.5);
  279.394 ns (0 allocations: 0 bytes)

julia> @btime $a + $b;
  19.204 ns (1 allocation: 272 bytes)

julia> @btime $a * $b;
  87.022 ns (1 allocation: 272 bytes)

julia> @btime $a ^ 1.5;
  267.590 ns (2 allocations: 544 bytes)

PerezHz · 2026-06-12T01:17:39Z

@lbenet thanks a lot for your suggestions and comments; I think I managed to go through all of them; so this is ready for you to have another look

lbenet · 2026-06-12T01:41:42Z

Thanks a lot! I'll have another look, everything seems promising, but probably will do it tomorrow. Yet, I've just checked (locally) if tests for TaylorIntegration pass, and there are two failing tests (in taylorize.jl, for kepler1! defined in line 2093 or around it). The first test that fails is because taylorizing yields no problems (so line 2107 yields no logs), and then parse_eqs is true (line 2116). It would be worth checking if the produced series are the expected ones...

PerezHz · 2026-06-12T03:05:34Z

@lbenet thanks for checking this; yes fixing #411 causes an errored test to pass in TI; I have opened PerezHz/TaylorIntegration.jl#243 which tests TI with this TS branch (plus further performance improvements on TI side); will update it once this is merged.

lbenet · 2026-06-12T18:23:27Z

I am ok with this PR as it is, but fear how it affects TaylorIntegration.jl and/or TaylorModels.jl. I see that TaylorIntegration (using PerezHz/TaylorIntegration.jl#243) now has passing tests, but also lots of other things, which seem to me a bit unrelated. I'll check how TaylorModels behaves...

lbenet · 2026-06-13T18:40:20Z

I'll go ahead and merge this. As far as I've checked everything is fine with TaylorIntegration.jl and TaylorModels.jl

lbenet · 2026-06-13T18:40:52Z

Thanks a lot @PerezHz for this!

* Test TS PR 415 * Remove test that no longer errors * Add regression test * Import test from #243 * Fix test * Update ci.yml; bump patch version fix TS compat

PerezHz added 2 commits May 29, 2026 17:44

Some performance fixes for Taylor1; add nowarn kwarg for variables! a…

8951e74

…nd use it in tests

Bump patch version

eb911aa

Improvements for nested Taylor1s

85a0ec1

PerezHz added 2 commits June 6, 2026 16:14

Switch to direct coeffs access for mutating methods of: elementary fu…

d0de251

…nctions, arithmetic helpers, zero, one, identity, scalar sqrt!, nested differentiation/integration and some evaluation methods (Horner)

Fix JuliaDiff#411

d515ac5

PerezHz changed the title ~~WIP: improve Taylor1 performance~~ Improve Taylor1 performance Jun 6, 2026

Fix docs build

7e546d2

PerezHz marked this pull request as ready for review June 6, 2026 23:04

Try a similar approach for Taylor1{TaylorN{Float64}}

63f90ce

PerezHz mentioned this pull request Jun 7, 2026

Miscellaneous performance improvements PerezHz/TaylorIntegration.jl#243

Open

lbenet reviewed Jun 9, 2026

View reviewed changes

PerezHz and others added 3 commits June 9, 2026 22:08

Apply suggestion from @lbenet

21bc234

Co-authored-by: Luis Benet <lbenet@users.noreply.github.com>

Remove methods (follow-up to last commit)

bc7c4ea

Move add! subst! methods

1593d3f

PerezHz added 5 commits June 11, 2026 00:03

Add missing method (fix JuliaDiff#411)

15abfd8

Add test to catch warning from variables! with nowarn=false

2fe3eb0

Rename _pow_taylor1_cached! -> _pow_cached!

791c660

Inline identity! for scalar coeffs case, and switch from copyto! to i…

d26e622

…dentity! in call site

Apply suggestion by @lbenet

9b6d327

PerezHz mentioned this pull request Jun 11, 2026

WIP: performance improvements for evaluation methods #416

Draft

PerezHz added 3 commits June 11, 2026 15:45

Apply suggeston by @lbenet

f2794f8

Remove redundant methods; edit existing methods appropriately

1b747aa

Move +,add!;-,subst! dispatches into Taylor1 metaprogramming block

b3a000a

Use constant_term! internally

f97c601

PerezHz mentioned this pull request Jun 13, 2026

Minimal updates due to JuliaDiff/TaylorSeries.jl#415 PerezHz/TaylorIntegration.jl#245

Merged

lbenet merged commit 7a8ccbd into JuliaDiff:master Jun 13, 2026
13 checks passed

-    end
+    for k in eachindex(a)
+        add!(v, a, b, k)
+    end

Conversation

PerezHz commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coveralls commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

LuEdRaMo commented May 30, 2026

Uh oh!

PerezHz commented May 30, 2026

Uh oh!

PerezHz commented May 31, 2026

Uh oh!

PerezHz commented May 31, 2026

Uh oh!

LuEdRaMo commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

PerezHz commented Jun 3, 2026

Uh oh!

PerezHz commented Jun 6, 2026

Uh oh!

PerezHz commented Jun 6, 2026

Uh oh!

PerezHz commented Jun 6, 2026

Uh oh!

PerezHz commented Jun 6, 2026

Uh oh!

lbenet commented Jun 8, 2026

Uh oh!

PerezHz commented Jun 8, 2026

Uh oh!

lbenet commented Jun 9, 2026

Uh oh!

PerezHz commented Jun 9, 2026

Uh oh!

lbenet left a comment

Choose a reason for hiding this comment

Uh oh!

lbenet Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

PerezHz Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

lbenet commented Jun 9, 2026

Uh oh!

PerezHz commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

PerezHz commented Jun 12, 2026

Uh oh!

lbenet commented Jun 12, 2026

Uh oh!

PerezHz commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lbenet commented Jun 12, 2026

Uh oh!

lbenet commented Jun 13, 2026

Uh oh!

lbenet commented Jun 13, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

PerezHz commented May 29, 2026 •

edited

Loading

coveralls commented May 29, 2026 •

edited

Loading

LuEdRaMo commented Jun 2, 2026 •

edited

Loading

PerezHz commented Jun 11, 2026 •

edited

Loading

PerezHz commented Jun 12, 2026 •

edited

Loading