loonatick-src/RelaxedMacro

Relaxed

A Swift macro that transforms arithmetic expressions to use relaxed floating-point operations from swift-numerics.

Overview

The #relaxed macro rewrites binary arithmetic operators to use Relaxed.sum and Relaxed.product, enabling more aggressive compiler optimizations. Relaxed operations allow the compiler to reorder and reassociate floating-point operations, which can improve performance. Because floating-point addition and multiplication are not associative, the results may differ slightly from those of the strictly ordered code.

Usage

Add RelaxedMacros to the dependencies in your Package.swift:

dependencies: [
    .package(url: "https://github.com/loonatick-src/Relaxed.git", from: "0.1.0")
]

Then add it as a dependency to your target:

.target(
    name: "YourTarget",
    dependencies: ["Relaxed"]
)

Import the module and wrap floating-point arithmetic expressions with the #relaxed macro:

import Relaxed

let a: Double = 1.0
let b: Double = 2.0
let c: Double = 3.0

let x1 = #relaxed(a + b * c)
// Expands to: Relaxed.sum(a, Relaxed.product(b, c))
let x2 = #relaxed(a * b / c)
// Expands to: Relaxed.product(a, b / c)
let x3 = #relaxed(sin(a + b * c))
// Expands to: sin(Relaxed.sum(a, Relaxed.product(b, c)))

Why Use Relaxed Operations?

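Standard IEEE 754 floating-point arithmetic rounds after every operation, so changing the order or grouping of operations can change the result. The compiler must therefore evaluate ordinary floating-point expressions exactly as written, which prevents certain optimizations. Relaxed operations tell the compiler it's okay to: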

  • Reorder additions and multiplications
  • Reassociate nested operations
  • Use fused multiply-add (FMA) instructions

This can lead to significant performance improvements in numerical code, especially in tight loops and vector operations.
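To see why reassociation has to be opted into, here is a minimal sketch that uses only the Swift standard library (no macro and no swift-numerics). The constants are chosen for illustration and are not from the benchmark in this README; they make the rounding effect easy to see.

```swift
// Illustrative values: near 1e17 neighbouring Doubles are 16 apart,
// so adding 1.0 to -1e17 rounds straight back to -1e17.
let a: Double = 1e17
let b: Double = -1e17
let c: Double = 1.0

// Source order: a and b cancel exactly, then c is added.
let leftToRight = (a + b) + c    // 1.0

// Reassociated: c is absorbed by b before the cancellation happens.
let reassociated = a + (b + c)   // 0.0

print(leftToRight, reassociated)
```

A compiler that must preserve the first result cannot regroup the additions. Relaxed operations give it permission to pick either grouping.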

Consider the following example.

import Relaxed

public func f1(_ a: Float, _ b: Float, _ c: Float) -> Float {
    #relaxed(a * b + c)
}

public func f2(_ a: Float, _ b: Float, _ c: Float) -> Float {
    a * b + c
}

This is the generated code on an ARMv8-A machine, where the fmadd instruction is available.

<_$s14RelaxedExample2f2yS2f_S2ftF>:
fmul    s0, s0, s1
fadd    s0, s0, s2
ret

<_$s14RelaxedExample2f1yS2f_S2ftF>:
fmadd   s0, s0, s1, s2
ret
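Fusing the multiply and the add also changes rounding: the fused form rounds once, while the separate form rounds the product first. The sketch below models this with the standard library's addingProduct(_:_:), which performs a fused multiply-add. The function names unfused and fused are hypothetical stand-ins for f2 and f1, and the inputs are chosen for illustration.

```swift
// Hypothetical stand-in for f2: multiply, round, then add, round.
func unfused(_ a: Float, _ b: Float, _ c: Float) -> Float {
    a * b + c
}

// Hypothetical stand-in for f1: c + a * b with a single rounding.
func fused(_ a: Float, _ b: Float, _ c: Float) -> Float {
    c.addingProduct(a, b)
}

// x * x is exactly 1 + 2^-12 + 2^-26, which does not fit in a Float,
// so the separately rounded product becomes 1 + 2^-12.
let x: Float = 1 + 0x1p-13
let c: Float = -(1 + 0x1p-12)

let separate = unfused(x, x, c)   // 0.0: the 2^-26 term was rounded away
let single = fused(x, x, c)       // 2^-26: the exact product was used

print(separate, single)
```

Neither answer is wrong under its own rules, which is why the compiler only emits fmadd when told that either result is acceptable.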

Consider a more involved example. We have two implementations of the same function; the only difference is that one wraps the floating-point arithmetic in a #relaxed macro invocation.

@inlinable
func saxpyFold(_ x: [Float], _ y: [Float], _ a: Float) -> Float {
    precondition(x.count == y.count)
    var result: Float = 0
    for i in x.indices {
        result += a * x[i] + y[i]
    }
    return result
}

/// Same function, but uses `#relaxed`
@inlinable
func saxpyFoldRelaxed(_ x: [Float], _ y: [Float], _ a: Float) -> Float {
    precondition(x.count == y.count)
    var result: Float = 0
    for i in x.indices {
        #relaxed(result += a * x[i] + y[i])
    }
    return result
}

Benchmark result (M3 Max MacBook Pro):

saxpyFold:
  Mean:   7.835 ± 0.744 ms
  Result: 641.4162

saxpyFoldRelaxed:
  Mean:   0.994 ± 0.017 ms
  Result: 641.35

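That is a speedup of about 7.9x. For this particular randomly generated input the two results differ by about 0.07 (a relative difference of roughly 1e-4), because the relaxed version orders and rounds the operations differently. Why is it so much faster? Consider the primary loop in the generated code for each function.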

;; saxpyFold
loop:
ldp     q2, q3, [x10, #-0x10]
fmul.4s v2, v2, v0[0]         ;; v2 = a * x[i:i+4]
fmul.4s v3, v3, v0[0]         ;; v3 = a * x[i+4:i+8]  (loop unrolling)
ldp     q4, q5, [x11, #-0x10]
fadd.4s v2, v2, v4            ;; v2 = v2 + y[i:i+4]
mov     s4, v2[3]
mov     s6, v2[2]
mov     s7, v2[1]
fadd.4s v3, v3, v5            ;; v3 = v3 + y[i+4:i+8]
mov     s5, v3[3]
mov     s16, v3[2]
mov     s17, v3[1]
fadd    s1, s1, s2            ;; horizontal add (hadd) v2 and v3 (∵ reassociation forbidden)
fadd    s1, s1, s7
fadd    s1, s1, s6
fadd    s1, s1, s4
fadd    s1, s1, s3
fadd    s1, s1, s17
fadd    s1, s1, s16
fadd    s1, s1, s5
add     x10, x10, #0x20
add     x11, x11, #0x20
subs    x12, x12, #0x8
b.ne    loop

The codegen of saxpyFold (i.e. without relaxed operations) does use the SIMD instructions fmul.4s and fadd.4s, but it then extracts each lane and adds it to the accumulator s1 with a scalar fadd. Because reassociation is forbidden, these eight additions per iteration form a serial dependency chain through s1: each one must wait for the previous one to finish, which stalls the CPU pipeline.

See also a simulated execution of equivalent x86-64 machine code at uica.uops.info for a deeper analysis (link). Click "Run!" and then "Open Trace".

Now consider the codegen when using relaxed operations.

;; saxpyFoldRelaxed
vloop:
ldp     q4, q5, [x10, #-0x10]
ldp     q6, q7, [x11, #-0x10]
fmla.4s v6, v1, v4            ;; v6 = a * x[i:i+4] + y[i:i+4]
fmla.4s v7, v1, v5            ;; v7 = a * x[i+4:i+8] + y[i+4:i+8]
fadd.4s v2, v2, v6            ;; a1 += v6    (a1: 4-lane vector register accumulator)
fadd.4s v3, v3, v7            ;; a2 += v7    (a2: second 4-lane vector register accumulator)
add     x10, x10, #0x20
add     x11, x11, #0x20
subs    x12, x12, #0x8
b.ne    vloop

;; result = hadd(a1) + hadd(a2) after the loop

Two important differences from the previous codegen.

  1. The loop is fully vectorized: there are no scalar operations (for horizontal adds) in the loop itself.
  2. Fused multiply-add (FMA) instructions (fmla.4s) are used instead of separate multiply (fmul.4s) and add (fadd.4s) instructions.
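The two loops can be modelled in plain scalar Swift to make the reassociation visible. This is a sketch, not what the compiler literally emits: saxpyFoldEightLanes is a hypothetical name, it uses the standard library's addingProduct(_:_:) to stand in for fmla, and the order in which it combines the partial sums after the loop is an assumption.

```swift
/// Strict version from above: one accumulator, additions in source order.
func saxpyFold(_ x: [Float], _ y: [Float], _ a: Float) -> Float {
    precondition(x.count == y.count)
    var result: Float = 0
    for i in x.indices {
        result += a * x[i] + y[i]
    }
    return result
}

/// Hypothetical scalar model of the relaxed loop: eight independent partial
/// sums (two 4-lane vector accumulators), combined only after the loop.
func saxpyFoldEightLanes(_ x: [Float], _ y: [Float], _ a: Float) -> Float {
    precondition(x.count == y.count)
    let lanes = 8
    var partial = [Float](repeating: 0, count: lanes)
    let vectorizedCount = x.count - x.count % lanes
    var i = 0
    while i < vectorizedCount {
        for lane in 0..<lanes {
            // Fused multiply-add: y + a * x with a single rounding.
            partial[lane] += y[i + lane].addingProduct(a, x[i + lane])
        }
        i += lanes
    }
    // Horizontal add of the partial sums (combination order is assumed).
    var result = partial.reduce(0, +)
    // Remaining elements that do not fill a full group of eight.
    while i < x.count {
        result += y[i].addingProduct(a, x[i])
        i += 1
    }
    return result
}

// Small integer inputs keep every intermediate exact, so both orders agree.
let xs = (1...20).map { Float($0) }
let ys = [Float](repeating: 1, count: 20)
let strictSum = saxpyFold(xs, ys, 2)
let lanedSum = saxpyFoldEightLanes(xs, ys, 2)
print(strictSum, lanedSum)
```

Each partial sum depends only on its own previous value, so the eight additions in an iteration are independent of one another. With inputs that are not exactly representable, the two functions can return slightly different values, as in the benchmark above.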

See also the equivalent x86-64 simulation: link.

Execution trace of implementation without relaxed operations: image

Execution trace of implementation using relaxed operations: image

Supported Operators

Operator    Transformation
+           Relaxed.sum(a, b)
-           Relaxed.sum(a, -b)
*           Relaxed.product(a, b)
others      Preserved as-is

Examples

Basic Operations

// Addition
#relaxed(a + b)
// Expands to: Relaxed.sum(a, b)

// Subtraction
#relaxed(a - b)
// Expands to: Relaxed.sum(a, -b)

// Multiplication
#relaxed(a * b)
// Expands to: Relaxed.product(a, b)

// Division (not transformed)
#relaxed(a / b)
// Expands to: a / b

Nested Expressions

// Mixed operations
#relaxed(a + b * c)
// Expands to: Relaxed.sum(a, Relaxed.product(b, c))

// Parenthesized expressions
#relaxed((a + b) * (c + d))
// Expands to: Relaxed.product(Relaxed.sum(a, b), Relaxed.sum(c, d))

// Complex expressions
#relaxed(a * b + c * d)
// Expands to: Relaxed.sum(Relaxed.product(a, b), Relaxed.product(c, d))

Function Calls

Arithmetic expressions inside function calls are also transformed:

#relaxed(sin(a + b * c))
// Expands to: sin(Relaxed.sum(a, Relaxed.product(b, c)))

#relaxed(f(a + b, c * d))
// Expands to: f(Relaxed.sum(a, b), Relaxed.product(c, d))
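The rewriting rules above can be summarized as a small recursive function over an expression tree. This is a toy model for illustration only: the Expr type and relaxedSource function are hypothetical, and the sketch does not reflect how the macro is actually implemented.

```swift
// Hypothetical miniature expression tree; not the macro's real representation.
indirect enum Expr {
    case name(String)
    case binary(op: String, lhs: Expr, rhs: Expr)
    case call(function: String, arguments: [Expr])
}

/// Renders an expression as source text, applying the table of supported operators.
/// Simplification: operators that are preserved are printed without extra
/// parentheses, so nested preserved operators may lose their grouping.
func relaxedSource(_ expr: Expr) -> String {
    switch expr {
    case .name(let identifier):
        return identifier
    case .binary(let op, let lhs, let rhs):
        let l = relaxedSource(lhs)
        let r = relaxedSource(rhs)
        switch op {
        case "+": return "Relaxed.sum(\(l), \(r))"
        case "-": return "Relaxed.sum(\(l), -\(r))"
        case "*": return "Relaxed.product(\(l), \(r))"
        default:  return "\(l) \(op) \(r)"   // e.g. division is preserved as-is
        }
    case .call(let function, let arguments):
        // Arguments are rewritten recursively; the call itself is kept.
        let args = arguments.map(relaxedSource).joined(separator: ", ")
        return "\(function)(\(args))"
    }
}

// a + b * c, already grouped by operator precedence.
let sumOfProduct = Expr.binary(
    op: "+",
    lhs: .name("a"),
    rhs: .binary(op: "*", lhs: .name("b"), rhs: .name("c"))
)
let inner = relaxedSource(sumOfProduct)
let wrapped = relaxedSource(.call(function: "sin", arguments: [sumOfProduct]))
let division = relaxedSource(.binary(op: "/", lhs: .name("a"), rhs: .name("b")))
print(inner)
print(wrapped)
print(division)
```

The recursion is what lets arithmetic nested inside function-call arguments be transformed while the call itself is left untouched.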

Wishlist/Near-Term Future Work

Evaluate the soundness and feasibility of the following and, where it makes sense, implement them.

  1. Support for +=, -=, *=
  2. Rewrite references to operators, e.g.
    • array.reduce(0, +) to array.reduce(0, Relaxed.sum)
    • zip(xs, ys).map(*).reduce(0, +) to zip(xs, ys).map(Relaxed.product).reduce(0, Relaxed.sum)

License

3-Clause BSD License

About

A Swift macro for rewriting floating-point arithmetic expressions to use relaxed arithmetic operators from swift-numerics (https://github.com/apple/swift-numerics/).
