A Swift macro that transforms arithmetic expressions to use relaxed floating-point operations from swift-numerics.
The #relaxed macro rewrites binary arithmetic operators to use Relaxed.sum and Relaxed.product, enabling more aggressive compiler optimizations. Relaxed operations allow the compiler to reorder and reassociate floating-point operations, which can improve performance but may produce slightly different results because floating-point addition and multiplication are not associative.
Add the package to the dependencies in your Package.swift:
dependencies: [
    .package(url: "https://github.com/loonatick-src/Relaxed.git", from: "0.1.0")
]
Then add it as a dependency to your target:
.target(
    name: "YourTarget",
    dependencies: ["Relaxed"]
)
Import the module and wrap floating-point arithmetic expressions with the #relaxed macro:
import Relaxed
let a: Double = 1.0
let b: Double = 2.0
let c: Double = 3.0
let x1 = #relaxed(a + b * c)
// Expands to: Relaxed.sum(a, Relaxed.product(b, c))
let x2 = #relaxed(a * b / c)
// Expands to: Relaxed.product(a, b / c)
let x3 = #relaxed(sin(a + b * c))
// Expands to: sin(Relaxed.sum(a, Relaxed.product(b, c)))
Standard IEEE 754 floating-point arithmetic rounds every operation individually, so the compiler must preserve the order of operations as written, which prevents certain optimizations. Relaxed operations tell the compiler it's okay to:
- Reorder additions and multiplications
- Reassociate nested operations
- Use fused multiply-add (FMA) instructions
This can lead to significant performance improvements in numerical code, especially in tight loops and vector operations.
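The price of this freedom is that regrouping can change the rounded result, because floating-point addition is not associative. A minimal illustration in plain Swift (standard library only, no Relaxed import; the values are chosen purely for demonstration):

```swift
// Floating-point addition is not associative: the grouping decides
// where rounding happens.
let a = 1e16
let b = -1e16
let c = 1.0

// a + b is exactly 0, so adding c afterwards gives 1.
let leftToRight = (a + b) + c

// b + c is not representable as a Double and rounds back to -1e16,
// so the final sum is 0.
let regrouped = a + (b + c)

print(leftToRight, regrouped)  // prints "1.0 0.0"
```

A compiler that is allowed to reassociate may pick either grouping, which is why relaxed and strict code can disagree in the last few digits.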
Consider the following example.
import Relaxed
public func f1(_ a: Float, _ b: Float, _ c: Float) -> Float {
    #relaxed(a * b + c)
}
public func f2(_ a: Float, _ b: Float, _ c: Float) -> Float {
    a * b + c
}
This is the generated code on an ARMv8-A machine when the fmadd instruction is available.
<_$s14RelaxedExample2f2yS2f_S2ftF>:
fmul s0, s0, s1
fadd s0, s0, s2
ret
<_$s14RelaxedExample2f1yS2f_S2ftF>:
fmadd s0, s0, s1, s2
ret
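The compiler cannot emit fmadd for f2 on its own because a fused multiply-add rounds once, while a separate multiply and add round twice, so the two can return different values. The sketch below shows this with the standard library's addingProduct(_:_:), which is a fused operation; the input values are hand-picked for illustration and are not taken from the project.

```swift
// Both constants are exactly representable as Float.
let a: Float = 1 + 0x1p-12
let c: Float = -(1 + 0x1p-11)

// Separate multiply and add: the exact product 1 + 2^-11 + 2^-24 is
// rounded to 1 + 2^-11 before the addition, so the sum cancels to 0.
let unfused = a * a + c

// Fused multiply-add: the product is kept exact and only the final
// result is rounded, so the 2^-24 term survives.
let fused = c.addingProduct(a, a)

print(unfused, fused)
```

Here unfused is 0 and fused is 2^-24. Neither is wrong in isolation; they simply follow different rounding rules, and relaxed operations permit the compiler to choose the fused form.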
Consider a more involved example. We have two implementations of the same function; the only difference between them is that one wraps the floating-point arithmetic operations in a #relaxed macro invocation.
@inlinable
func saxpyFold(_ x: [Float], _ y: [Float], _ a: Float) -> Float {
    precondition(x.count == y.count)
    var result: Float = 0
    for i in x.indices {
        result += a * x[i] + y[i]
    }
    return result
}
/// Same function, but uses `#relaxed`
@inlinable
func saxpyFoldRelaxed(_ x: [Float], _ y: [Float], _ a: Float) -> Float {
    precondition(x.count == y.count)
    var result: Float = 0
    for i in x.indices {
        #relaxed(result += a * x[i] + y[i])
    }
    return result
}
Benchmark result (M3 Max MacBook Pro):
saxpyFold:
Mean: 7.835 ± 0.744 ms
Result: 641.4162
saxpyFoldRelaxed:
Mean: 0.994 ± 0.017 ms
Result: 641.35
That is roughly a 7.9x speedup. For this particular randomly generated input the two results agree to about four significant digits (641.4162 vs. 641.35). Why does this happen? Consider the primary loop within the generated code for both functions.
;; saxpyFold
loop:
ldp q2, q3, [x10, #-0x10]
fmul.4s v2, v2, v0[0] ;; v2 = a * x[i:i+4]
fmul.4s v3, v3, v0[0] ;; v3 = a * x[i+4:i+8] (loop unrolling)
ldp q4, q5, [x11, #-0x10]
fadd.4s v2, v2, v4 ;; v2 = v2 + y[i:i+4]
mov s4, v2[3]
mov s6, v2[2]
mov s7, v2[1]
fadd.4s v3, v3, v5 ;; v3 = v3 + y[i+4:i+8]
mov s5, v3[3]
mov s16, v3[2]
mov s17, v3[1]
fadd s1, s1, s2 ;; horizontal add (hadd) v2 and v3 (∵ reassociation forbidden)
fadd s1, s1, s7
fadd s1, s1, s6
fadd s1, s1, s4
fadd s1, s1, s3
fadd s1, s1, s17
fadd s1, s1, s16
fadd s1, s1, s5
add x10, x10, #0x20
add x11, x11, #0x20
subs x12, x12, #0x8
b.ne loop
The SIMD instructions fmul.4s and fadd.4s can be seen in the codegen of saxpyFold (i.e. without relaxed operations), but immediately afterwards the loop performs eight individual scalar additions (horizontal adds) into s1. Each of these additions depends on the result of the previous one, which introduces data hazards and stalls the CPU pipeline.
See also a simulated execution of equivalent x86-64 machine code at uica.uops.info for a deeper analysis (link). Click "Run!", and then "Open Trace"
Now, consider the codegen on using relaxed operations.
;; saxpyFoldRelaxed
vloop:
ldp q4, q5, [x10, #-0x10]
ldp q6, q7, [x11, #-0x10]
fmla.4s v6, v1, v4 ;; v6 = a * x[i:i+4] + y[i:i+4]
fmla.4s v7, v1, v5 ;; v7 = a * x[i+4:i+8] + y[i+4:i+8]
fadd.4s v2, v2, v6 ;; a1 += v6 (vector register accumulator)
fadd.4s v3, v3, v7 ;; a2 += v7 (vector register accumulator)
add x10, x10, #0x20
add x11, x11, #0x20
subs x12, x12, #0x8
b.ne vloop
;; result = hadd(a1) + hadd(a2) after the loop
Two important differences from the previous codegen:
- The loop is fully vectorized; there are no scalar operations (for horizontal adds) in the loop itself.
- Fused multiply-add (FMA) instructions (fmla.4s) are used instead of individual multiply (fmul.4s) and add (fadd.4s) instructions.
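To make the vectorized loop easier to follow, here is a hand-written Swift approximation of what it computes: two four-lane accumulators that are updated independently and reduced to a scalar only after the loop. This is a sketch for illustration, not the output of the macro; the name saxpyFoldManual is hypothetical, and the scalar tail loop is an assumption about how leftover elements would be handled.

```swift
// Hand-reassociated version of saxpyFold using two SIMD accumulators,
// mirroring the structure of the relaxed codegen.
func saxpyFoldManual(_ x: [Float], _ y: [Float], _ a: Float) -> Float {
    precondition(x.count == y.count)
    var acc1 = SIMD4<Float>(repeating: 0)  // lanes for elements i ..< i + 4
    var acc2 = SIMD4<Float>(repeating: 0)  // lanes for elements i + 4 ..< i + 8
    var i = 0
    while i + 8 <= x.count {
        let x1 = SIMD4<Float>(x[i..<(i + 4)])
        let y1 = SIMD4<Float>(y[i..<(i + 4)])
        let x2 = SIMD4<Float>(x[(i + 4)..<(i + 8)])
        let y2 = SIMD4<Float>(y[(i + 4)..<(i + 8)])
        acc1 += a * x1 + y1  // no dependency on acc2
        acc2 += a * x2 + y2  // no dependency on acc1
        i += 8
    }
    // Horizontal add happens once, after the loop.
    var result = (acc1 + acc2).sum()
    // Scalar tail for the remaining (fewer than 8) elements.
    while i < x.count {
        result += a * x[i] + y[i]
        i += 1
    }
    return result
}
```

This version adds the elements in a different order than saxpyFold, so a strict compiler is not allowed to derive it automatically. Relaxed operations grant exactly that permission.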
See also the equivalent x86-64 simulation: link.
Execution trace of implementation without relaxed operations:
Execution trace of implementation using relaxed operations:

| Operator | Transformation |
|---|---|
| + | Relaxed.sum(a, b) |
| - | Relaxed.sum(a, -b) |
| * | Relaxed.product(a, b) |
| others | Preserved as-is |
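The table can be read as a recursive rewrite over an expression tree. The following is a toy model of that rule in plain Swift, not the macro's actual implementation (Swift macros operate on SwiftSyntax nodes). The Expr type and the relax, render and renderOperand functions are hypothetical names introduced only for this sketch.

```swift
// A tiny expression tree, just rich enough to model the table above.
indirect enum Expr {
    case name(String)
    case negate(Expr)
    case binary(String, Expr, Expr)
    case call(String, [Expr])
}

// Rewrites +, - and * bottom-up; every other operator is preserved.
func relax(_ e: Expr) -> Expr {
    switch e {
    case .name:
        return e
    case .negate(let inner):
        return .negate(relax(inner))
    case .call(let function, let arguments):
        return .call(function, arguments.map(relax))
    case .binary(let op, let l, let r):
        let lhs = relax(l)
        let rhs = relax(r)
        switch op {
        case "+": return .call("Relaxed.sum", [lhs, rhs])
        case "-": return .call("Relaxed.sum", [lhs, .negate(rhs)])
        case "*": return .call("Relaxed.product", [lhs, rhs])
        default: return .binary(op, lhs, rhs)
        }
    }
}

// Renders a tree back to source-like text.
func render(_ e: Expr) -> String {
    switch e {
    case .name(let n):
        return n
    case .negate(let inner):
        return "-" + renderOperand(inner)
    case .binary(let op, let l, let r):
        return renderOperand(l) + " " + op + " " + renderOperand(r)
    case .call(let function, let arguments):
        let list = arguments.map(render).joined(separator: ", ")
        return function + "(" + list + ")"
    }
}

// Parenthesizes nested binary expressions so grouping stays visible.
func renderOperand(_ e: Expr) -> String {
    if case .binary = e { return "(" + render(e) + ")" }
    return render(e)
}

let example = Expr.binary("+", .name("a"), .binary("*", .name("b"), .name("c")))
print(render(relax(example)))  // prints "Relaxed.sum(a, Relaxed.product(b, c))"
```

Because the rewrite recurses into call arguments, arithmetic nested inside a function call is transformed too, while the call itself is left untouched.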
// Addition
#relaxed(a + b)
// Expands to: Relaxed.sum(a, b)
// Subtraction
#relaxed(a - b)
// Expands to: Relaxed.sum(a, -b)
// Multiplication
#relaxed(a * b)
// Expands to: Relaxed.product(a, b)
// Division (not transformed)
#relaxed(a / b)
// Expands to: a / b
// Mixed operations
#relaxed(a + b * c)
// Expands to: Relaxed.sum(a, Relaxed.product(b, c))
// Parenthesized expressions
#relaxed((a + b) * (c + d))
// Expands to: Relaxed.product(Relaxed.sum(a, b), Relaxed.sum(c, d))
// Complex expressions
#relaxed(a * b + c * d)
// Expands to: Relaxed.sum(Relaxed.product(a, b), Relaxed.product(c, d))
Arithmetic expressions inside function calls are also transformed:
#relaxed(sin(a + b * c))
// Expands to: sin(Relaxed.sum(a, Relaxed.product(b, c)))
#relaxed(f(a + b, c * d))
// Expands to: f(Relaxed.sum(a, b), Relaxed.product(c, d))
Evaluate the sanity and feasibility of and/or implement the following.
- Support for +=, -=, *=
- Rewrite references to operators, e.g.
  - array.reduce(0, +) to array.reduce(0, Relaxed.sum)
  - zip(xs, ys).map(*).reduce(0, +) to zip(xs, ys).map(Relaxed.product).reduce(0, Relaxed.sum)
3-Clause BSD License