Swift-to-GPU

Lift Swift code to parallel CUDA kernels.

Adding it to your project

Add to your Package.swift:

    dependencies: [
        .package(url: "https://github.com/dataparallel-swift/swift-to-gpu.git", from: "1.0.0", traits: ["PTX"])
    ]

If you do not have an NVIDIA GPU, you can build against the (currently sequential) CPU backend by supplying the "CPU" trait instead.

Important

You must supply either the PTX or CPU trait, otherwise the package will fail to compile.

Adding it to your code

In a module containing one or more loops that you wish to hoist to the GPU:

import SwiftToGPU

and replace for loops with the provided parallel_for construct. Example:

func nondeterministicIndex(of target: Float, in array: [Float]) -> Int?
{
    var index : Int? = nil
    parallel_for(iterations: array.count) { i in
        if array[i] == target {
            index = i
        }
    }.sync()        // wait for the GPU to finish before proceeding
    return index
}

The parallel_for function takes a closure that receives an index in the range 0..<iterations. The loop body uses that index to compute its result, closing over any captured variables. In principle, all of the loop iterations execute concurrently in a data-parallel fashion, and so they must all be independent of one another. The example above is therefore non-deterministic: if the target value occurs at multiple positions in the array, the function may return a different index each time it is called.

Note that any data to be filled-in must be pre-allocated at the (maximum) size required before entering the parallel section: avoid using inherently sequential operations such as Array.append(). (You should (almost) never be using that anyway: figure out what the requirements of your program are instead!)
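
One way to make the search above deterministic while keeping every iteration independent is to give each iteration its own pre-allocated output slot and combine the results afterwards. The following is a minimal sketch under stated assumptions, not the package's recommended approach: the Pending type and the sequential parallel_for at the top are hypothetical stand-ins that only mirror the call shape used above, so that the snippet compiles without the package or a GPU. Whether the real transformation supports writing into a captured array in this way is also an assumption.

```swift
// Hypothetical stand-ins, NOT the SwiftToGPU implementation. They mimic the
// call shape `parallel_for(iterations:) { i in ... }.sync()` sequentially.
// Remove them and `import SwiftToGPU` to try this against the real package.
struct Pending {
    func sync() {}  // nothing to wait for in the sequential stand-in
}

func parallel_for(iterations: Int, _ body: (Int) -> Void) -> Pending {
    for i in 0..<iterations { body(i) }
    return Pending()
}

func deterministicIndex(of target: Float, in array: [Float]) -> Int? {
    // Pre-allocate one output slot per iteration before the parallel section.
    var matches = [Bool](repeating: false, count: array.count)
    parallel_for(iterations: array.count) { i in
        // Each iteration writes only to its own slot, so no iteration
        // depends on, or races with, any other.
        matches[i] = array[i] == target
    }.sync()
    // Sequential step on the host: always pick the smallest matching index.
    return matches.firstIndex(of: true)
}
```

The trade-off is an extra array of flags and a sequential scan at the end, in exchange for a result that does not depend on the order in which iterations happen to run.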

API

A number of the usual data-parallel array operators are available from the Prelude.swift module. It is expected that this will grow (and change) rapidly as the project continues. Prefer to use these combinators rather than raw parallel_for loops whenever possible.

Building

When building against the PTX backend, you will need to compile your project with a Swift toolchain that includes the swift-to-ptx compiler transformation, such as the one available from:

https://github.com/dataparallel-swift/swift

Note that the transformation is only enabled when compiling with optimisations (either compile in release mode, or enable optimisations for the specific target that uses SwiftToGPU).

There are two ways to set up the development environment:

Container

This is a self-contained development environment that uses the same Docker image used by CI for local development. Beyond the benefits of dev/prod parity, this approach may be easier to set up. No additional toolchain or dependencies are necessary when using the container.

  1. Install Podman for your operating system:

Warning

Make sure to follow the installation instructions carefully. If you do not have a working Podman Machine, executing the commands below won't be possible.

  2. Verify that Podman is installed and running correctly:

    podman --version
    
  3. Once this is done, the container can be started with the podman run command. For example, to launch an interactive container:

    # -v makes the current directory available in the container
    # -w sets the default working directory
    podman run --rm -it \
      -v "$PWD:/$(basename "$PWD")" \
      -w "/$(basename "$PWD")" \
      ghcr.io/dataparallel-swift/swift:latest \
      /bin/bash
    

    It can be simpler to run just the one command you want, for example:

    podman run -v ... ghcr.io/dataparallel-swift/swift:latest \
      swift build -c release
    

    Building executables with the --static-swift-stdlib option is useful when copying the resulting executable to a remote executor.

    If you want to run GPU-accelerated code inside of a container, you will need to set up and configure the NVIDIA container toolkit.

Native

Building the compiler is as usual, with the addition that LLVM must be built with the NVPTX backend, e.g.:

./utils/build-script --llvm-targets-to-build "AArch64;NVPTX" ...

The swift-docker repository includes an example container that can be used to build a GPU-enabled toolchain, e.g.:

./utils/build-script --preset=buildbot_linux,gpu,no_test install_destdir=... installable_package=...

Command line options

There are several command line options that can be used to control the behaviour of the transformation.

  • --swift-to-ptx-verbose[=BOOL] Use verbose output (false).

  • --swift-to-ptx-keep-intermediate-files[=BOOL] Keep intermediate files (false). Use this to see what code was generated.

  • --swift-to-ptx-ptxas-path=PATH Path to the ptxas executable. Defaults to "/usr/local/cuda/bin/ptxas".

  • --swift-to-ptx-target-gpu=STRING Generate code for this specific GPU architecture. Defaults to "sm_87" (NVIDIA Jetson Orin).

  • --swift-to-ptx-target-attr=STRING Target specific attributes to add during compilation. Defaults to "+ptx81" (NVIDIA Jetson Orin).

  • --swift-to-ptx-allow-fp-arcp[=BOOL] Allow floating-point division to be treated as multiplication by a reciprocal (true).

  • --swift-to-ptx-allow-fp-contract[=BOOL] Allow floating-point contraction, e.g. fusing a multiply followed by an addition into a fused multiply-add (true).

  • --swift-to-ptx-allow-fp-afn[=BOOL] Allow substitution of approximate calculation for functions, e.g. sin, log, sqrt, etc. (true).

  • --swift-to-ptx-allow-fp-reassoc[=BOOL] Allow re-association transformations for floating-point operations (true).

  • --swift-to-ptx-device-debug[=BOOL] Include debug information in device code (false). Requires compiling the Swift module with debug information as well.

Note that these options must be passed through to the LLVM phase of compilation. For example, you can add them to your Package.swift as:

    swiftSettings: [
        .unsafeFlags([
            "-Xllvm", "--swift-to-ptx-verbose"
        ])
    ]
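
To show how the pieces above fit together, here is a sketch of a complete manifest. It is illustrative only: the package and target name "MyKernels" is hypothetical, and the product name "SwiftToGPU" is an assumption based on the module name used in the import statement. Package traits need a tools version of 6.1 or later, and each LLVM option needs its own preceding "-Xllvm".

```swift
// swift-tools-version: 6.1
import PackageDescription

let package = Package(
    name: "MyKernels",  // hypothetical package name
    dependencies: [
        .package(
            url: "https://github.com/dataparallel-swift/swift-to-gpu.git",
            from: "1.0.0",
            traits: ["PTX"]  // or ["CPU"] when no NVIDIA GPU is available
        )
    ],
    targets: [
        .executableTarget(
            name: "MyKernels",  // hypothetical target name
            dependencies: [
                // Assumption: the product is named after the imported module.
                .product(name: "SwiftToGPU", package: "swift-to-gpu")
            ],
            swiftSettings: [
                .unsafeFlags([
                    // Every option is passed as its own "-Xllvm", "<option>" pair.
                    "-Xllvm", "--swift-to-ptx-verbose",
                    "-Xllvm", "--swift-to-ptx-keep-intermediate-files",
                    "-Xllvm", "--swift-to-ptx-target-gpu=sm_87",
                ])
            ]
        )
    ]
)
```

Remember that the transformation only runs with optimisations enabled, so a manifest like this would be built with swift build -c release.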

Benchmarks

The following benchmarks were conducted on an NVIDIA Jetson Orin (Arm Cortex-A78AE v8.2 CPU with 8 cores @ 2 GHz, Ampere SM 8.7 GPU with 1024 cores in 8 SMs @ 918 MHz) in MAXN power mode.

How to interpret the results

The results are shown as box-and-whisker plots, which concisely describe the statistical properties at each data point. The box represents the interquartile range, the span containing the middle 50% of the samples, with the solid line in the box marking the median value and the dashed line the average. The whiskers extending from the box represent the minimum and maximum values observed. From this, we can visually estimate the degree of dispersion and skewness of the data.
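
For readers who want to relate the plot elements to raw timings, the sketch below computes the same quantities from a list of samples. It is not the code used to produce the plots: the BoxSummary type and both function names are hypothetical, and the quartiles use linear interpolation between neighbouring samples, which is only one of several common conventions.

```swift
// Hypothetical helper type: the values a box-and-whisker plot draws.
struct BoxSummary {
    let minimum: Double        // lower whisker
    let lowerQuartile: Double  // bottom of the box
    let median: Double         // solid line in the box
    let upperQuartile: Double  // top of the box
    let maximum: Double        // upper whisker
    let mean: Double           // dashed line in the box
}

// Quantile of an already sorted, non-empty array, using linear interpolation
// between the two nearest samples (an assumed convention).
func quantile(_ sorted: [Double], _ q: Double) -> Double {
    let position = q * Double(sorted.count - 1)
    let lower = Int(position.rounded(.down))
    let upper = Int(position.rounded(.up))
    let fraction = position - Double(lower)
    return sorted[lower] + fraction * (sorted[upper] - sorted[lower])
}

func summarise(_ samples: [Double]) -> BoxSummary? {
    guard !samples.isEmpty else { return nil }
    let sorted = samples.sorted()
    return BoxSummary(
        minimum: sorted[0],
        lowerQuartile: quantile(sorted, 0.25),
        median: quantile(sorted, 0.5),
        upperQuartile: quantile(sorted, 0.75),
        maximum: sorted[sorted.count - 1],
        mean: sorted.reduce(0, +) / Double(sorted.count)
    )
}
```

A median that sits well away from the mean, or a box that sits off-centre between the whiskers, indicates skewed timings.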

Along with a realisation in Swift-to-PTX, each benchmark contains a number of comparative implementations:

  • Implementations on the CPU: a "regular" implementation and an "optimised" implementation (labelled "unsafe" due to the use of a so-named initialiser function). These bounding lines are not meant to be authoritative: it may be possible to squeeze more out of the optimised implementation, and an unoptimised implementation can be made infinitely worse. Their purpose is to indicate the data size at which it becomes worthwhile to move a computation from the CPU to the GPU.

  • An implementation in raw CUDA (called from Swift, but this should be close enough to a raw CUDA/C++ implementation). This gives an indication of the overheads incurred by implementing GPU kernels in Swift compared to CUDA/C++. As the project progresses, we aim to close this gap.

This benchmark implements the classic Level-1 BLAS routine SAXPY, which multiplies a vector by a scalar constant and adds it to another vector; i.e. $z_i = \alpha \cdot x_i + y_i$. This represents a workload with a high bytes/flops ratio, and so it is dominated overall by the cost of data transfer.
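
As a point of reference, the formula can be written on the CPU in two styles. These are sketches and not the benchmark's actual sources: the function names are hypothetical, and the assumption that the "unsafe" variant refers to Array(unsafeUninitializedCapacity:initializingWith:) is a guess based on the description above.

```swift
// "Regular" style: pair up the elements and map over them.
func saxpyRegular(alpha: Float, x: [Float], y: [Float]) -> [Float] {
    precondition(x.count == y.count, "vectors must have equal length")
    return zip(x, y).map { alpha * $0 + $1 }
}

// "Unsafe" style: allocate the result once, uninitialised, and write each
// element exactly once, avoiding a zero-fill or any incremental growth.
func saxpyUnsafe(alpha: Float, x: [Float], y: [Float]) -> [Float] {
    precondition(x.count == y.count, "vectors must have equal length")
    let n = x.count
    return [Float](unsafeUninitializedCapacity: n) { buffer, initializedCount in
        for i in 0..<n {
            buffer.initializeElement(at: i, to: alpha * x[i] + y[i])
        }
        // Tell the array how many elements were actually initialised.
        initializedCount = n
    }
}
```

Each output element costs one multiply and one add but touches three floats of memory (two reads and one write), which is why the cost of moving data dominates.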

This benchmark implements the Black-Scholes options pricing model. This represents a workload that does a reasonable amount of computation for each byte transferred.
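
To give a feel for the amount of arithmetic involved per option, the sketch below evaluates the textbook closed-form Black-Scholes price of a European call and put on a non-dividend-paying asset. Which variant the benchmark actually computes is not stated here, so treat this as an assumption; the type and function names are hypothetical.

```swift
import Foundation

// Standard normal cumulative distribution function, expressed through the
// complementary error function.
func normalCDF(_ x: Double) -> Double {
    return 0.5 * erfc(-x / sqrt(2.0))
}

// Hypothetical result type holding both prices.
struct OptionPrice {
    let call: Double
    let put: Double
}

// Textbook closed form for European options without dividends.
func blackScholes(spot s: Double, strike k: Double, rate r: Double,
                  volatility sigma: Double, years t: Double) -> OptionPrice {
    let d1 = (log(s / k) + (r + 0.5 * sigma * sigma) * t) / (sigma * sqrt(t))
    let d2 = d1 - sigma * sqrt(t)
    let discountedStrike = k * exp(-r * t)
    let call = s * normalCDF(d1) - discountedStrike * normalCDF(d2)
    let put = discountedStrike * normalCDF(-d2) - s * normalCDF(-d1)
    return OptionPrice(call: call, put: put)
}
```

Each option reads five numbers and writes two, yet needs a logarithm, a square root, an exponential and several error-function evaluations, so there is far more computation per byte than in SAXPY.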

Limitations

  • All code to be lifted to the device must be present in a single compilation unit passed to the LLVM compiler. Typically this can be achieved by putting all of the code for the GPU kernel into a single .swift file, and/or by sprinkling @alwaysEmitIntoClient onto any functions that you want to call from the GPU. You will also need to make sure that any generic functions can be completely specialised at the call site. Still, the swift-to-ptx transformation pass does not always succeed, so improving this compilation model is an important milestone on the roadmap.

  • The --enable-testing flag, which is added automatically by swift test, changes how optimisations are performed, which may in turn cause the pass to fail in cases that succeed in the regular compilation mode.

  • --enable-code-coverage is currently not supported.

TODO

  • Improve the compilation model
  • Integration with Swift structured concurrency
  • Integration with debugging / profiling tools
  • Leverage Swift language safety features in kernel code
  • A mechanism for automatic kernel fusion
  • ...
