Lift Swift code to parallel CUDA kernels.
Add to your Package.swift:
dependencies: [
.package(url:"https://github.com/dataparallel-swift/swift-to-gpu.git", from: "1.0.0", traits: ["PTX"])
]
If you do not have an NVIDIA GPU, you can instead build against the (currently sequential) CPU backend by supplying the "CPU" trait.
Important
You must supply either the PTX or CPU trait, otherwise the package will fail to compile.
In a module containing one or more loops that you wish to hoist to the GPU:
import SwiftToGPU
and replace for loops with the provided parallel_for construct. Example:
func nondeterministicIndex(of target: Float, in array: [Float]) -> Int?
{
    var index : Int? = nil
    parallel_for(iterations: array.count) { i in
        if array[i] == target {
            index = i
        }
    }.sync() // wait for the GPU to finish before proceeding
    return index
}
The parallel_for function takes a closure that is given an index in the range
0..<iterations; the loop body uses this index to compute a result, closing over
any captured variables. In principle, all of the loop iterations execute
concurrently in a data-parallel fashion, and so must be independent of one
another. The example above is therefore non-deterministic: if the target value
occurs at multiple positions in the array, the function may return a different
index each time it is called.
Note that any data to be filled in must be pre-allocated at the (maximum) size required before entering the parallel section; avoid inherently sequential operations such as Array.append(). (You should almost never be using that anyway: figure out what the requirements of your program are instead!)
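To make the independence and pre-allocation points concrete, here is a minimal sketch of a deterministic variant of the search above. So that it runs without the package, it defines a sequential stand-in for parallel_for (the StandInHandle type and the shim are hypothetical, not the SwiftToGPU implementation); whether the real backend accepts exactly this capture pattern is an assumption.

```swift
// Sequential stand-in so this sketch runs without the package. It is NOT the
// SwiftToGPU implementation; with the real package, delete this shim and
// `import SwiftToGPU` instead.
struct StandInHandle {
    func sync() {}
}

@discardableResult
func parallel_for(iterations: Int, _ body: (Int) -> Void) -> StandInHandle {
    for i in 0..<iterations { body(i) }
    return StandInHandle()
}

func firstIndex(of target: Float, in array: [Float]) -> Int? {
    // Pre-allocate one slot per iteration before entering the parallel section.
    var matches = [Bool](repeating: false, count: array.count)

    // Each iteration writes only to its own slot, so iterations stay independent.
    parallel_for(iterations: array.count) { i in
        matches[i] = array[i] == target
    }.sync()

    // Resolve the answer sequentially on the host: always the lowest matching index.
    return matches.firstIndex(of: true)
}

let result = firstIndex(of: 2.0, in: [1.0, 2.0, 3.0, 2.0])
print(result as Any)
```

Because no iteration reads or writes another iteration's slot, the order in which iterations run no longer affects the result. The final scan is sequential here for simplicity.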
A number of the usual data-parallel array operators are available from the
Prelude.swift module. It is expected that
this will grow (and change) rapidly as the project continues. Prefer to use
these combinators rather than raw parallel_for loops whenever possible.
When building against the PTX backend, you will need to compile your project with a Swift toolchain that includes the swift-to-ptx compiler transformation, e.g. the one available from:
https://github.com/dataparallel-swift/swift
Note that the transformation is only enabled when compiling with optimisations (either compile in release mode, or enable optimisations for the specific target that uses SwiftToGPU).
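As a rough sketch of the second option, a manifest might enable optimisations for just the target that uses SwiftToGPU. The package and target name "MyKernels" is hypothetical, the product name "SwiftToGPU" is assumed to match the module name, and the tools version assumes a SwiftPM release with package traits support.

```swift
// swift-tools-version: 6.1
import PackageDescription

let package = Package(
    name: "MyKernels", // hypothetical package name
    dependencies: [
        .package(
            url: "https://github.com/dataparallel-swift/swift-to-gpu.git",
            from: "1.0.0",
            traits: ["PTX"]
        )
    ],
    targets: [
        .executableTarget(
            name: "MyKernels", // hypothetical target name
            dependencies: [
                // Assumption: the library product is named after the module.
                .product(name: "SwiftToGPU", package: "swift-to-gpu")
            ],
            swiftSettings: [
                // Request optimisation for this target even in debug builds.
                .unsafeFlags(["-O"])
            ]
        )
    ]
)
```

Note that SwiftPM does not allow packages using unsafeFlags to be consumed as versioned dependencies, so building with swift build -c release may be the simpler choice.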
There are two ways to set up the development environment:
This is a self-contained development environment that uses the same Docker image used by CI for local development. Beyond the benefits of dev/prod parity, this approach may be easier to set up. No additional toolchain or dependencies are necessary when using the container.
- Install Podman for your operating system:
  Warning
  Make sure to follow the installation instructions carefully. If you do not have a working Podman Machine, executing the commands below won't be possible.
- Verify that Podman is installed and running correctly:
  podman --version
- Once this is done, the container can be started with the podman run command. For example, to launch an interactive container:
  # -v makes the current directory available in the container
  # -w sets the default working directory
  podman run --rm -it \
      -v $PWD:/$(basename $PWD) \
      -w /$(basename $PWD) \
      ghcr.io/dataparallel-swift/swift:latest \
      /bin/bash
  but it can be simpler to just run the one command you want, for example:
  podman run -v ... ghcr.io/dataparallel-swift/swift:latest \
      swift build -c release
Building executables with the --static-swift-stdlib option is useful when copying the resulting executable to a remote executor.
If you want to run GPU-accelerated code inside of a container, you will need to set up and configure the NVIDIA container toolkit.
Building the compiler is as usual, with the addition that LLVM must be built with the NVPTX backend, e.g.:
./utils/build-script --llvm-targets-to-build "AArch64;NVPTX" ...
swift-docker includes an example container that can be used to build a GPU-enabled toolchain, e.g.:
./utils/build-script --preset=buildbot_linux,gpu,no_test install_destdir=... installable_package=...
There are several command line options that can be used to control the behaviour of the transformation.
- --swift-to-ptx-verbose[=BOOL]
  Use verbose output (false).
- --swift-to-ptx-keep-intermediate-files[=BOOL]
  Keep intermediate files (false). Use this to see what code was generated.
- --swift-to-ptx-ptxas-path=PATH
  Path to the ptxas executable. Defaults to "/usr/local/cuda/bin/ptxas".
- --swift-to-ptx-target-gpu=STRING
  Generate code for this specific GPU architecture. Defaults to "sm_87" (NVIDIA Jetson Orin).
- --swift-to-ptx-target-attr=STRING
  Target specific attributes to add during compilation. Defaults to "+ptx81" (NVIDIA Jetson Orin).
- --swift-to-ptx-allow-fp-arcp[=BOOL]
  Allow floating-point division to be treated as multiplication by a reciprocal (true).
- --swift-to-ptx-allow-fp-contract[=BOOL]
  Allow floating-point contraction, e.g. fusing a multiply followed by an addition into a fused multiply-add (true).
- --swift-to-ptx-allow-fp-afn[=BOOL]
  Allow substitution of approximate calculation for functions, e.g. sin, log, sqrt, etc. (true).
- --swift-to-ptx-allow-fp-reassoc[=BOOL]
  Allow re-association transformations for floating-point operations (true).
- --swift-to-ptx-device-debug[=BOOL]
  Include debug information in device code (false). Requires compiling the Swift module with debug information as well.
Note that these options must be passed through to the LLVM phase of compilation.
For example, you can add them to your Package.swift as:
swiftSettings: [
    .unsafeFlags([
        "-Xllvm", "--swift-to-ptx-verbose"
    ])
]
The following benchmarks were conducted on an NVIDIA Jetson Orin (ARM A78AE v8.2 CPU with 8 cores @ 2 GHz, Ampere SM 8.7 GPU with 1024 cores in 8 SMs @ 918 MHz) in MAXN power mode.
The following results are shown using a box-and-whiskers plot to concisely describe the statistical properties at each data point. The box represents the interquartile range, the span in which 50% of the samples were collected, with the solid line in the box marking the median value and the dashed line the average. The whiskers extending from the box represent the minimum and maximum values observed. From this, we can visually estimate the degree of dispersion and skewness of the data.
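For readers who want to reproduce these summary statistics from their own timing samples, the following is a minimal sketch. The quantile convention (linear interpolation between order statistics) is an assumption, as plotting tools differ, and the sample values are made up for illustration; none of the names below come from the benchmark suite.

```swift
// Summary statistics matching the elements of a box-and-whiskers plot.
struct BoxPlotSummary {
    let minimum: Double        // lower whisker
    let lowerQuartile: Double  // bottom of the box
    let median: Double         // solid line in the box
    let upperQuartile: Double  // top of the box
    let maximum: Double        // upper whisker
    let mean: Double           // dashed line
}

// Quantile by linear interpolation between the two nearest order statistics.
// This is one common convention (an assumption here); other tools may differ.
func quantile(_ sorted: [Double], _ q: Double) -> Double {
    let position = q * Double(sorted.count - 1)
    let lower = Int(position.rounded(.down))
    let upper = min(lower + 1, sorted.count - 1)
    let fraction = position - Double(lower)
    return sorted[lower] + fraction * (sorted[upper] - sorted[lower])
}

func summarise(_ samples: [Double]) -> BoxPlotSummary? {
    guard !samples.isEmpty else { return nil }
    let sorted = samples.sorted()
    return BoxPlotSummary(
        minimum: sorted[0],
        lowerQuartile: quantile(sorted, 0.25),
        median: quantile(sorted, 0.5),
        upperQuartile: quantile(sorted, 0.75),
        maximum: sorted[sorted.count - 1],
        mean: sorted.reduce(0, +) / Double(sorted.count)
    )
}

// Hypothetical timings in milliseconds, made up for illustration only.
let timings: [Double] = [1.0, 2.0, 3.0, 4.0, 10.0]
if let summary = summarise(timings) {
    print("box: \(summary.lowerQuartile)...\(summary.upperQuartile)")
    print("median: \(summary.median), mean: \(summary.mean)")
    print("whiskers: \(summary.minimum)...\(summary.maximum)")
}
```

In this made-up sample the mean lies above the median, which is the kind of skew the dashed and solid lines make visible.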
Along with a realisation in Swift-to-PTX, each benchmark contains a number of comparative implementations:
- Implementations on the CPU: a "regular" implementation and an "optimised" implementation (labelled "unsafe" due to the use of a so-named initialiser function). These bounding lines are not meant to be authoritative: it may be possible to squeeze more out of the optimised implementation, and an unoptimised implementation can be made infinitely worse, but the purpose is to give an indication of the data size at which it becomes worthwhile to move a computation from the CPU to the GPU.
- An implementation in raw CUDA (called from Swift, but this should be close enough to a raw CUDA/C++ implementation). This gives an indication of the overheads incurred implementing GPU kernels in Swift compared to CUDA/C++. As the project progresses, we aim to close this gap.
This benchmark implements the classic Level-1 BLAS routine SAXPY,
which multiplies a vector by a scalar constant and adds it to another vector;
i.e. $y = \alpha x + y$.
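As a point of reference, SAXPY updates each element as y[i] = alpha * x[i] + y[i]. The following is a sequential sketch in plain Swift, not the benchmark's actual source; the function name and signature are illustrative only.

```swift
// Sequential reference for SAXPY: y[i] = alpha * x[i] + y[i].
func saxpy(alpha: Float, x: [Float], y: inout [Float]) {
    precondition(x.count == y.count, "x and y must have the same length")
    // Each element depends only on its own index, so the iterations are independent.
    for i in 0..<y.count {
        y[i] = alpha * x[i] + y[i]
    }
}

var y: [Float] = [10, 20, 30]
saxpy(alpha: 2, x: [1, 2, 3], y: &y)
print(y) // [12.0, 24.0, 36.0]
```

Each element requires only one multiply and one add per two values loaded, so the routine does very little arithmetic relative to the data it moves.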
This benchmark implements the Black-Scholes options pricing model. This represents a workload that does a reasonable amount of computation for each byte transferred.
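To give a sense of the per-element work involved, here is a sketch of the standard Black-Scholes formulas for a European call and put without dividends. It is written in plain Swift with Double precision and is not the benchmark's actual kernel; the function and parameter names are illustrative only.

```swift
import Foundation

// Standard normal cumulative distribution function,
// expressed through the complementary error function.
func normalCDF(_ x: Double) -> Double {
    return 0.5 * erfc(-x / (2.0).squareRoot())
}

// European option prices under the Black-Scholes model (no dividends).
func blackScholes(spot: Double, strike: Double, rate: Double,
                  volatility: Double, years: Double) -> (call: Double, put: Double) {
    let sqrtT = years.squareRoot()
    let numerator = log(spot / strike) + (rate + 0.5 * volatility * volatility) * years
    let d1 = numerator / (volatility * sqrtT)
    let d2 = d1 - volatility * sqrtT
    let discountedStrike = strike * exp(-rate * years)
    let call = spot * normalCDF(d1) - discountedStrike * normalCDF(d2)
    let put = discountedStrike * normalCDF(-d2) - spot * normalCDF(-d1)
    return (call, put)
}

// Hypothetical inputs chosen for illustration only.
let price = blackScholes(spot: 100, strike: 100, rate: 0.05, volatility: 0.2, years: 1)
print(price.call, price.put)
```

Each option is priced independently from five scalar inputs using a logarithm, an exponential, a square root and several error-function evaluations, which is why the workload is compute-heavy relative to the data it reads.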
- All code to be lifted to the device must be present in a single compilation unit passed to the LLVM compiler. Typically this can be achieved by putting all of the code for the GPU kernel into a single .swift file, and/or by sprinkling @alwaysEmitIntoClient onto any functions that you want to call from the GPU. You will also need to make sure that any generic functions can be completely specialised at the call site. Still, the swift-to-ptx transformation pass does not always succeed, so improving this compilation model is an important milestone on the roadmap.
- The --enable-testing flag, which is added automatically by swift test, changes how optimisations are performed, which may in turn cause the pass to fail in cases that succeed in the regular compilation mode.
- --enable-code-coverage is currently not supported.
- Improve the compilation model
- Integration with Swift structured concurrency
- Integration with debugging / profiling tools
- Leverage Swift language safety features in kernel code
- A mechanism for automatic kernel fusion
- ...