Relaxed

A Swift macro that transforms arithmetic expressions to use relaxed floating-point operations from swift-numerics.

Overview

The #relaxed macro rewrites binary arithmetic operators to use Relaxed.sum and Relaxed.product, enabling more aggressive compiler optimizations. Relaxed operations allow the compiler to reorder and reassociate floating-point operations, which can improve performance but may produce slightly different results, because floating-point addition and multiplication are not associative.
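To see why regrouping can change a result, here is a minimal standalone sketch in plain Swift. It does not use this package; the variable names are illustrative only. It evaluates the same three-term sum with two different groupings, which is the kind of freedom relaxed operations give the compiler.

```swift
let a = 0.1, b = 0.2, c = 0.3

// The grouping that strict left-to-right evaluation of `a + b + c` requires.
let leftToRight = (a + b) + c

// A different grouping, which reassociation would allow.
let regrouped = a + (b + c)

// Each grouping rounds different intermediate values, so the results differ.
print(leftToRight == regrouped) // false for these inputs
print(leftToRight, regrouped)
```

The difference is one unit in the last place here, but it can grow when many terms are accumulated.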

Usage

Add Relaxed to the dependencies in your Package.swift:

dependencies: [
    .package(url: "https://github.com/loonatick-src/Relaxed.git", from: "0.1.0")
]

Then add it as a dependency to your target:

.target(
    name: "YourTarget",
    dependencies: ["Relaxed"]
)
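For orientation, the two fragments above might fit into a complete manifest roughly as follows. This is a sketch under assumptions: the package name "YourPackage" is a placeholder, and the library product is assumed to be named "Relaxed" to match the import used in this README. Swift macros require tools version 5.9 or later; depending on the deployment targets the package declares, a platforms entry may also be needed.

```swift
// swift-tools-version: 5.9
import PackageDescription

let package = Package(
    name: "YourPackage", // placeholder name
    dependencies: [
        .package(url: "https://github.com/loonatick-src/Relaxed.git", from: "0.1.0")
    ],
    targets: [
        .target(
            name: "YourTarget",
            dependencies: [
                // Assumption: the library product is called "Relaxed".
                .product(name: "Relaxed", package: "Relaxed")
            ]
        )
    ]
)
```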

Import the module and wrap floating-point arithmetic expressions in the #relaxed macro:

import Relaxed

let a: Double = 1.0
let b: Double = 2.0
let c: Double = 3.0

let x1 = #relaxed(a + b * c)
// Expands to: Relaxed.sum(a, Relaxed.product(b, c))
let x2 = #relaxed(a * b / c)
// Expands to: Relaxed.product(a, b) / c   (* and / are left-associative, so this is (a * b) / c)
let x3 = #relaxed(sin(a + b * c))
// Expands to: sin(Relaxed.sum(a, Relaxed.product(b, c)))

Why Use Relaxed Operations?

Standard IEEE 754 floating-point arithmetic requires strict ordering of operations, which can prevent certain compiler optimizations. Relaxed operations tell the compiler it's okay to:

Reorder additions and multiplications
Reassociate nested operations
Use fused multiply-add (FMA) instructions

This can lead to significant performance improvements in numerical code, especially in tight loops and vector operations.

Consider the following example.

import Relaxed

public func f1(_ a: Float, _ b: Float, _ c: Float) -> Float {
    #relaxed(a * b + c)
}

public func f2(_ a: Float, _ b: Float, _ c: Float) -> Float {
    a * b + c
}

This is the generated code on an ARMv8-A machine, where the fmadd instruction is available:

<_$s14RelaxedExample2f2yS2f_S2ftF>:
fmul    s0, s0, s1
fadd    s0, s0, s2
ret

<_$s14RelaxedExample2f1yS2f_S2ftF>:
fmadd   s0, s0, s1, s2
ret
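The two listings are not numerically identical: a fused multiply-add rounds once, while a separate multiply and add rounds twice. The following standalone sketch shows the effect with the standard library's addingProduct, which computes a fused multiply-add. The input values are chosen for illustration so that the intermediate product falls exactly between two representable Float values.

```swift
let a: Float = 1 + 0x1p-12
let c: Float = -(1 + 0x1p-11)

// The exact product a * a is 1 + 2^-11 + 2^-24, which Float cannot represent.

// Separate multiply and add: the product is rounded to 1 + 2^-11 before the add.
let unfused = a * a + c

// Fused multiply-add: c + a * a is computed exactly and rounded only once.
let fused = c.addingProduct(a, a)

print(unfused, fused)
```

Here the unfused result is exactly zero while the fused result keeps the 2^-24 term, so f1 and f2 above can return different values for the same arguments.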

Consider a more involved example. We have two implementations of the same function; the only difference between them is that one wraps the floating-point arithmetic in a #relaxed macro invocation.

@inlinable
func saxpyFold(_ x: [Float], _ y: [Float], _ a: Float) -> Float {
    precondition(x.count == y.count)
    var result: Float = 0
    for i in x.indices {
        result += a * x[i] + y[i]
    }
    return result
}

/// Same function, but uses `#relaxed`
@inlinable
func saxpyFoldRelaxed(_ x: [Float], _ y: [Float], _ a: Float) -> Float {
    precondition(x.count == y.count)
    var result: Float = 0
    for i in x.indices {
        result = #relaxed(result + a * x[i] + y[i])
    }
    return result
}

Benchmark results (M3 Max MacBook Pro):

saxpyFold:
  Mean:   7.835 ± 0.744 ms
  Result: 641.4162

saxpyFoldRelaxed:
  Mean:   0.994 ± 0.017 ms
  Result: 641.35

That is a ~7.9x speedup, with the two results agreeing to within about 0.07 (a relative difference of roughly 1e-4) for this particular randomly generated input. Why does this happen? Consider the primary loop in the generated code for each function.

;; saxpyFold
loop:
ldp     q2, q3, [x10, #-0x10]
fmul.4s v2, v2, v0[0]         ;; v2 = a * x[i:i+4]
fmul.4s v3, v3, v0[0]         ;; v3 = a * x[i+4:i+8]  (loop unrolling)
ldp     q4, q5, [x11, #-0x10]
fadd.4s v2, v2, v4            ;; v2 = v2 + y[i:i+4]
mov     s4, v2[3]
mov     s6, v2[2]
mov     s7, v2[1]
fadd.4s v3, v3, v5            ;; v3 = v3 + y[i+4:i+8]
mov     s5, v3[3]
mov     s16, v3[2]
mov     s17, v3[1]
fadd    s1, s1, s2            ;; horizontal add (hadd) v2 and v3 (∵ reassociation forbidden)
fadd    s1, s1, s7
fadd    s1, s1, s6
fadd    s1, s1, s4
fadd    s1, s1, s3
fadd    s1, s1, s17
fadd    s1, s1, s16
fadd    s1, s1, s5
add     x10, x10, #0x20
add     x11, x11, #0x20
subs    x12, x12, #0x8
b.ne    loop

The SIMD instructions fmul.4s and fadd.4s appear in the codegen of saxpyFold (i.e. without relaxed operations), but they are immediately followed by eight scalar additions into s1 (horizontal adds). Each of these additions depends on the result of the previous one, which serializes them and stalls the CPU pipeline.

See also a simulated execution of equivalent x86-64 machine code at uica.uops.info for a deeper analysis (link). Click "Run!", and then "Open Trace"

Now consider the codegen when using relaxed operations.

;; saxpyFoldRelaxed
vloop:
ldp     q4, q5, [x10, #-0x10]
ldp     q6, q7, [x11, #-0x10]
fmla.4s v6, v1, v4            ;; v6 = a * x[i:i+4] + y[i:i+4]
fmla.4s v7, v1, v5            ;; v7 = a * x[i+4:i+8] + y[i+4:i+8]
fadd.4s v2, v2, v6            ;; a1 += v6    (a1: vector register accumulator)
fadd.4s v3, v3, v7            ;; a2 += v7    (a2: vector register accumulator)
add     x10, x10, #0x20
add     x11, x11, #0x20
subs    x12, x12, #0x8
b.ne    vloop

;; result = hadd(a1) + hadd(a2) after the loop

There are two important differences from the previous codegen:

The loop is fully vectorized; there are no scalar operations (for horizontal adds) in the loop itself.
Fused multiply-add (FMA) instructions (fmla.4s) are used instead of separate multiply (fmul.4s) and add (fadd.4s) instructions.
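The reassociation behind the first point can be modeled by hand in plain Swift. The sketch below is not what the macro generates; laneFold is a hypothetical helper that keeps four independent partial sums (standing in for the lanes of one vector register) and combines them after the loop, using addingProduct for the fused multiply-add.

```swift
/// Strict left-to-right fold: a single accumulator, so every addition
/// depends on the previous one.
func strictFold(_ x: [Float], _ y: [Float], _ a: Float) -> Float {
    precondition(x.count == y.count)
    var result: Float = 0
    for i in x.indices {
        result += a * x[i] + y[i]
    }
    return result
}

/// Hand-written model of the reassociated loop: four independent partial
/// sums, combined once after the loop.
func laneFold(_ x: [Float], _ y: [Float], _ a: Float) -> Float {
    precondition(x.count == y.count)
    var lanes: [Float] = [0, 0, 0, 0]
    for i in x.indices {
        // y[i] + a * x[i], computed with a single rounding (FMA).
        lanes[i % 4] += y[i].addingProduct(a, x[i])
    }
    // Horizontal add, performed once instead of on every iteration.
    return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3])
}
```

Because the partial sums are independent, additions into different lanes can proceed in parallel. The terms are also summed in a different order than in strictFold, which is why the two benchmark results above differ slightly. The actual codegen uses two vector accumulators (eight lanes); four lanes just keep the sketch short.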

See also the equivalent x86-64 simulation: link.

[Figure: execution trace of the implementation without relaxed operations]
[Figure: execution trace of the implementation using relaxed operations]

Supported Operators

| Operator | Transformation |
| --- | --- |
| `+` | `Relaxed.sum(a, b)` |
| `-` | `Relaxed.sum(a, -b)` |
| `*` | `Relaxed.product(a, b)` |
| others | Preserved as-is |
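The table can be read as a recursive rewrite over an expression tree. The following toy model is only an illustration of that idea: the Expr type and the relaxedSource function are hypothetical and are not part of this package, whose actual implementation operates on Swift syntax trees. The sketch renders the rewritten expression as a string.

```swift
/// A tiny expression tree: names, binary operators, and function calls.
indirect enum Expr {
    case name(String)
    case binary(String, Expr, Expr) // operator, left operand, right operand
    case call(String, [Expr])       // function name, arguments
}

/// Renders `expr` with +, - and * rewritten as in the table above.
func relaxedSource(_ expr: Expr) -> String {
    switch expr {
    case .name(let identifier):
        return identifier
    case .binary(let op, let lhs, let rhs):
        let l = relaxedSource(lhs)
        let r = relaxedSource(rhs)
        switch op {
        case "+":
            return "Relaxed.sum(\(l), \(r))"
        case "-":
            // Subtraction becomes addition of the negated right operand.
            if case .name = rhs { return "Relaxed.sum(\(l), -\(r))" }
            return "Relaxed.sum(\(l), -(\(r)))"
        case "*":
            return "Relaxed.product(\(l), \(r))"
        default:
            // Any other operator is preserved; only its operands are rewritten.
            return "\(l) \(op) \(r)"
        }
    case .call(let function, let arguments):
        let rendered = arguments.map(relaxedSource).joined(separator: ", ")
        return "\(function)(\(rendered))"
    }
}

let example = Expr.call("sin", [.binary("+", .name("a"), .binary("*", .name("b"), .name("c")))])
print(relaxedSource(example))
// prints: sin(Relaxed.sum(a, Relaxed.product(b, c)))
```

For brevity the toy does not re-insert parentheses around nested operators that are preserved as-is, so it is not a faithful printer for every input.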

Examples

Basic Operations

// Addition
#relaxed(a + b)
// Expands to: Relaxed.sum(a, b)

// Subtraction
#relaxed(a - b)
// Expands to: Relaxed.sum(a, -b)

// Multiplication
#relaxed(a * b)
// Expands to: Relaxed.product(a, b)

// Division (not transformed)
#relaxed(a / b)
// Expands to: a / b

Nested Expressions

// Mixed operations
#relaxed(a + b * c)
// Expands to: Relaxed.sum(a, Relaxed.product(b, c))

// Parenthesized expressions
#relaxed((a + b) * (c + d))
// Expands to: Relaxed.product(Relaxed.sum(a, b), Relaxed.sum(c, d))

// Complex expressions
#relaxed(a * b + c * d)
// Expands to: Relaxed.sum(Relaxed.product(a, b), Relaxed.product(c, d))

Function Calls

Arithmetic expressions inside function calls are also transformed:

#relaxed(sin(a + b * c))
// Expands to: sin(Relaxed.sum(a, Relaxed.product(b, c)))

#relaxed(f(a + b, c * d))
// Expands to: f(Relaxed.sum(a, b), Relaxed.product(c, d))

Wishlist/Near-Term Future Work

Evaluate the feasibility of the following and, where it makes sense, implement them:

Support for +=, -=, *=
Rewrite references to operators, e.g.
- array.reduce(0, +) to array.reduce(0, Relaxed.sum)
- zip(xs, ys).map(*).reduce(0, +) to zip(xs, ys).map(Relaxed.product).reduce(0, Relaxed.sum)

License

3-Clause BSD License
