Skip to content

dataparallel-swift/swift-multiarray

Repository files navigation

swift-multiarray

An array that stores elements in unboxed struct-of-array style. This can provide better data locality and enable efficient (automatic) vectorization.

Example

In order to store values in unboxed struct-of-array style, array elements must be members of the included Generic protocol. This currently supports regular non-recursive product types over primitive values. Sum types could be supported as well, but we'll defer that until we have good uses cases to properly explore that space (see comments in the code).

Consider the following datatype:

@Generic
struct Vec3<Element> {
    let x, y, z: Element
}

The required instance is will eventually be provided by the @Generic macro, which generates something like:

extension Vec3: Generic where Element: Generic {
    typealias RawRepresentation = Product<Element.RawRepresentation, Product<Element.RawRepresentation, Element.RawRepresentation>>
    var rawRepresentation: RawRepresentation {
        Product(self.x.rawRepresentation, Product(self.y.rawRepresentation, self.z.rawRepresentation))
    }
    
    init(from rep: Product<Element.RawRepresentation, Product<Element.RawRepresentation, Element.RawRepresentation>>) {
        self = .init(
            x: .init(from: rep._0),
            y: .init(from: rep._1._0),
            z: .init(from: rep._1._1)
        )
    }
}

Here Product is a simple pair type, because Swift does not allow us to extend regular tuples (,). Boo.

For convenience we provide synonyms for tuples from 2 to 16 elements (T2, T3...), so that you do not have to do the binary nesting yourself.

As you can see, this is a straightforward translation over the structure of the datatype into an isomorphic representation using (nested) pairs. As far as the compiler is concerned, the in-memory layout of Vec3<Float> and Vec3<Float>.RawRepresentation is identical, so in practice (i.e. with sufficient inlining) this representation change should be a no-op.

Similarly, the following works exactly as you would expect:

@Generic
struct Zone {
    let id: Int
    let position: Vec3<Float>
}

// Generates...
extension Zone: Generic {
    typealias RawRepresentation = Product<Int.RawRepresentation, Vec3<Float>.RawRepresentation>
    var rawRepresentation: RawRepresentation {
        Product(self.id.rawRepresentation, self.position.rawRepresentation)
    }
    
    init(from rep: RawRepresentation) {
        self = Zone(id: rep._0, position: .init(from: rep._1))
    }
}

Once the protocol instance is defined, you can use it in the usual way and have the compiler automatically transform access to and from the underlying storage representation.

For example, suppose we have the following function to move the position of a Zone by a given x-, y-, and z-offset:

extension Zone {
    public func move(dx: Float = 0, dy: Float = 0, dz: Float = 0) -> Zone {
        Zone(id: self.id,
             position: Vec3(x: self.position.x + dx,
                            y: self.position.y + dy,
                            z: self.position.z + dz))
    }
}

In a regular Swift Array the fields of our Zone structure will be stored contiguously in memory as (on a 64-bit system):

                            1        1        2        2        2        3
 0        4        8        2        6        0        4        8        2
+--------+--------+--------+--------+--------+--------+--------+--------+--------+
| id0             | x0     | y0     | z0     | <pad>  | id1             | x1     | ...
+--------+--------+--------+--------+--------+--------+--------+--------+--------+

Notice that due to alignment restricts, an extra 4 bytes padding must be added between each array element, and wasting 16.6% of our available memory bandwidth.

If we map our move(dx: 1) function over this array and inspect the generated code, we'll see that the core of the loop looks like this (annotated):

10:                                               ; preds = %26, %5
  %11 = phi i64 [ %.pre, %5 ], [ %23, %26 ]                                 ; array capacity
  %12 = phi i64 [ 0, %5 ], [ %28, %26 ]                                     ; loop counter
  %13 = phi ptr [ %6, %5 ], [ %27, %26 ]                                    ; array storage
  %14 = mul nuw nsw i64 %12, 24                                             ; calculate offset to start of this element
  %15 = getelementptr inbounds i8, ptr %7, i64 %14                          ; get pointer to this element
  %16 = load i64, ptr %15, align 1                                          ; load .id
  %.position = getelementptr inbounds i8, ptr %15, i64 8                    ; get pointer to .x
  %17 = load <2 x float>, ptr %.position, align 1                           ; load .x and .y
  %.position.z = getelementptr inbounds i8, ptr %15, i64 16                 ; get pointer to .z
  %18 = load float, ptr %.position.z, align 1                               ; load .z
  %19 = fadd <2 x float> %17, <float 1.000000e+00, float 0.000000e+00>      ; compute new .x and .y
  %20 = fadd float %18, 0.000000e+00                                        ; compute new .z
  store ptr %13, ptr %1, align 8
  %._storage3._capacityAndFlags = getelementptr inbounds i8, ptr %13, i64 24
  %21 = load i64, ptr %._storage3._capacityAndFlags, align 8
  %22 = lshr i64 %21, 1
  %23 = add nuw nsw i64 %11, 1
  %.not = icmp ugt i64 %22, %11
  br i1 %.not, label %26, label %24, !prof !11

The swift compiler does a good job to turn the individual load of x and y into a single vectorised load, but is otherwise hamstrung by the underlying data layout and unable to optimise the code further.

Using MultiArray, each of the individual fields of the structure are stored in their own individual memory regions:

+--------+--------+--------+--------+--------+--------+
| id0             | id1             | id2             | ...
+--------+--------+--------+--------+--------+--------+

+--------+--------+--------+--------+--------+--------+
| x0     | x1     | x2     | x3     | x4     | x5     | ...
+--------+--------+--------+--------+--------+--------+

+--------+--------+--------+--------+--------+--------+
| y0     | y1     | y2     | y3     | y4     | y5     | ...
+--------+--------+--------+--------+--------+--------+

+--------+--------+--------+--------+--------+--------+
| z0     | z1     | z2     | z3     | z4     | z5     | ...
+--------+--------+--------+--------+--------+--------+

And the core loop of the generated code now looks like this:

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %25 = getelementptr inbounds %TSi, ptr %19, i64 %index
  %wide.load = load <4 x i64>, ptr %25, align 8, !alias.scope !13
  %26 = getelementptr inbounds %TSf, ptr %20, i64 %index
  %wide.load100 = load <4 x float>, ptr %26, align 4, !alias.scope !16
  %27 = getelementptr inbounds %TSf, ptr %21, i64 %index
  %wide.load101 = load <4 x float>, ptr %27, align 4, !alias.scope !18
  %28 = getelementptr inbounds %TSf, ptr %22, i64 %index
  %wide.load102 = load <4 x float>, ptr %28, align 4, !alias.scope !20
  %29 = fadd <4 x float> %wide.load100, <float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00>
  %30 = fadd <4 x float> %wide.load101, zeroinitializer
  %31 = fadd <4 x float> %wide.load102, zeroinitializer
  %32 = getelementptr inbounds %TSi, ptr %8, i64 %index
  store <4 x i64> %wide.load, ptr %32, align 8, !alias.scope !22, !noalias !24
  %33 = getelementptr inbounds %TSf, ptr %10, i64 %index
  store <4 x float> %29, ptr %33, align 4, !alias.scope !28, !noalias !29
  %34 = getelementptr inbounds %TSf, ptr %11, i64 %index
  store <4 x float> %30, ptr %34, align 4, !alias.scope !30, !noalias !31
  %35 = getelementptr inbounds %TSf, ptr %12, i64 %index
  store <4 x float> %31, ptr %35, align 4, !alias.scope !32, !noalias !33
  %index.next = add nuw i64 %index, 4
  %36 = icmp eq i64 %index.next, %n.vec
  br i1 %36, label %middle.block, label %vector.body, !llvm.loop !34

Immediately we can see that this loop has been 4-wide vectorized: each field is read 4-elements at a time, and computations are done on 4-wide SIMD vectors (M4 Max).

(A very observant viewer might also note that this loop branches directly to the top of the basic block, compared to the previous version which did not. Indeed that version requires a "cleanup" step after each iteration to update the stored size of the array and grow its capacity if necessary, but that's out of scope for this discussion).

Benchmarking the above shows a 20% reduction in memory usage and (up to) ~ 2x performance improvement (which is not bad for a simple memory bound operation).

TODO

  • Implement @Generic macro

  • Support for sum datatypes (i.e. enums). There are different ways this could be achieved, and the best choice may depend on the individual application, so punting this until we have a real use case for it.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages