swift-llama

Run LLMs on Apple devices with Swift. Wraps llama.cpp with a native Swift API.

Load a GGUF model, stream tokens, parse tool calls, all from Swift. The inference runs on an actor so you get Swift 6 concurrency safety without thinking about it. Metal GPU acceleration works out of the box on Apple Silicon.

No dependencies beyond llama.cpp itself (shipped as a pre-built xcframework).

graph TD
    App[Your Swift App] --> Actor[LlamaActor<br/>async streaming]
    App --> Templates[ChatTemplate<br/>.gemma .llama3 .mistral .chatML]
    App --> Parser[ToolCallParser<br/>streaming detection]
    App --> Downloader[ModelDownloader<br/>HuggingFace + SHA256]

    Actor --> Sampling[LlamaSampling<br/>temp / top-p / top-k]
    Sampling --> Core[LlamaModel + LlamaContext]
    Core --> XCF[llama.cpp xcframework<br/>Metal GPU]

    style App fill:#1a1a2e,stroke:#e94560,color:#fff
    style Actor fill:#16213e,stroke:#0f3460,color:#fff
    style Templates fill:#16213e,stroke:#0f3460,color:#fff
    style Parser fill:#16213e,stroke:#0f3460,color:#fff
    style Downloader fill:#16213e,stroke:#0f3460,color:#fff
    style Sampling fill:#0f3460,stroke:#533483,color:#fff
    style Core fill:#0f3460,stroke:#533483,color:#fff
    style XCF fill:#533483,stroke:#e94560,color:#fff

Quick start

import SwiftLlama

let model = try LlamaModel(path: "path/to/model.gguf", config: .init(gpuLayers: .all))
let llama = try LlamaActor(model: model)

for try await chunk in llama.chat(messages: [.user("Hello!")], template: .gemma) {
    switch chunk {
    case .text(let token): print(token, terminator: "")
    case .toolCall(let call): print("Tool: \(call.name)(\(call.arguments))")
    }
}

Installation

Add to your Package.swift:

dependencies: [
    .package(url: "https://github.com/profclaw/swift-llama", from: "0.1.0"),
]

Then add "SwiftLlama" to your target dependencies.

What's in the box

Actor-isolated inference with AsyncThrowingStream for token streaming
Chat templates for Gemma, Llama 3, Mistral, and ChatML (or write your own)
A streaming tool call parser that detects function calls as tokens arrive
Model downloader that grabs GGUFs from HuggingFace with progress and SHA256 checks
A catalog of popular models with recommended configs
Configurable sampling (temperature, top-p, top-k, repeat penalty, or just .greedy)
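
The first item above pairs an actor with AsyncThrowingStream. As a rough illustration of that pattern (not SwiftLlama's actual implementation), the sketch below uses a hypothetical `EchoEngine` actor that "generates" the words of its prompt one at a time. The names `EchoEngine`, `generate`, and `collectTokens` are invented for this example.

```swift
// Hypothetical stand-in for an inference actor; not SwiftLlama's real implementation.
actor EchoEngine {
    private var generated = 0   // actor-isolated, so concurrent callers cannot race on it

    var totalGenerated: Int { generated }

    // Pretends to decode one token; a real engine would call into llama.cpp here.
    private func nextToken(from words: [String], at index: Int) -> String {
        generated += 1
        return words[index]
    }

    // nonisolated so callers can obtain the stream without awaiting the actor.
    nonisolated func generate(_ prompt: String) -> AsyncThrowingStream<String, Error> {
        let words = prompt.split(separator: " ").map(String.init)
        return AsyncThrowingStream<String, Error> { continuation in
            let task = Task {
                for index in words.indices {
                    if Task.isCancelled { break }
                    continuation.yield(await self.nextToken(from: words, at: index))
                }
                continuation.finish()
            }
            // Stop producing tokens when the consumer stops iterating.
            continuation.onTermination = { _ in task.cancel() }
        }
    }
}

// Drains the stream into an array, mirroring a `for try await` loop in app code.
func collectTokens(_ prompt: String) async throws -> [String] {
    let engine = EchoEngine()
    var tokens: [String] = []
    for try await token in engine.generate(prompt) { tokens.append(token) }
    return tokens
}
```

All mutable state lives inside the actor, and the stream only carries `Sendable` values out of it, which is what keeps the pattern compatible with Swift 6 strict concurrency checking.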

Usage

Loading a model

let model = try LlamaModel(
    path: "/path/to/model.gguf",
    config: ModelConfig(
        contextSize: 8192,
        gpuLayers: .all,    // .all, .none, .count(20)
        threads: 8
    )
)

Streaming chat

let llama = try LlamaActor(model: model, params: .balanced)

let stream = llama.chat(
    messages: [
        .system("You are a helpful assistant."),
        .user("What is Swift concurrency?")
    ],
    template: .gemma
)

for try await chunk in stream {
    switch chunk {
    case .text(let token): print(token, terminator: "")
    case .toolCall(let call): await handleTool(call)
    }
}
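
To give a feel for how tool calls can be detected while tokens are still arriving, here is a minimal sketch of a buffering parser. It assumes calls are wrapped in `<tool_call>` and `</tool_call>` tags, which is an assumption for illustration only; the tag format SwiftLlama's ToolCallParser expects may differ, and `TagParser` and `ParsedChunk` are hypothetical names.

```swift
import Foundation

enum ParsedChunk: Equatable {
    case text(String)
    case toolCall(String)   // raw payload found between the tags
}

// Hypothetical sketch; the tag format is an assumption, not SwiftLlama's.
struct TagParser {
    let open = "<tool_call>"
    let close = "</tool_call>"
    private var buffer = ""
    private var insideCall = false

    init() {}

    // Feed one streamed token; returns whatever can be emitted safely so far.
    mutating func feed(_ token: String) -> [ParsedChunk] {
        buffer += token
        var out: [ParsedChunk] = []
        while true {
            if insideCall {
                guard let end = buffer.range(of: close) else { break }
                out.append(.toolCall(String(buffer[..<end.lowerBound])))
                buffer = String(buffer[end.upperBound...])
                insideCall = false
            } else if let start = buffer.range(of: open) {
                if start.lowerBound > buffer.startIndex {
                    out.append(.text(String(buffer[..<start.lowerBound])))
                }
                buffer = String(buffer[start.upperBound...])
                insideCall = true
            } else {
                // Hold back a trailing fragment that might be the start of the tag.
                let hold = heldBackLength()
                let safe = String(buffer.dropLast(hold))
                if !safe.isEmpty { out.append(.text(safe)) }
                buffer = String(buffer.suffix(hold))
                break
            }
        }
        return out
    }

    // Call once the stream ends to release held-back text.
    // An unterminated tool call is dropped in this sketch.
    mutating func finish() -> [ParsedChunk] {
        defer { buffer = ""; insideCall = false }
        return (insideCall || buffer.isEmpty) ? [] : [.text(buffer)]
    }

    // Length of the longest buffer suffix that is a proper prefix of the open tag.
    private func heldBackLength() -> Int {
        let maxLen = min(buffer.count, open.count - 1)
        guard maxLen > 0 else { return 0 }
        for len in stride(from: maxLen, through: 1, by: -1) {
            if open.hasPrefix(String(buffer.suffix(len))) { return len }
        }
        return 0
    }
}
```

The hold-back step is the key design choice: a token such as `<tool` is not printed as text until the following tokens show whether it begins a tag.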

Chat templates

// Pick a built-in format
llama.chat(messages: messages, template: .gemma)
llama.chat(messages: messages, template: .llama3)
llama.chat(messages: messages, template: .mistral)
llama.chat(messages: messages, template: .chatML)

// Or bring your own
struct MyTemplate: ChatTemplateProtocol {
    let stopTokens = ["<|end|>"]
    func format(_ messages: [ChatMessage]) -> String { /* ... */ }
}
llama.chat(messages: messages, template: .custom(MyTemplate()))
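
For a sense of what a `format` body might do, the standalone sketch below renders messages in the ChatML convention. `Role`, `Message`, and `ChatMLSketch` are hypothetical types defined here so the example runs on its own; SwiftLlama's `ChatMessage` may expose its fields differently.

```swift
// Hypothetical message types for illustration; SwiftLlama's ChatMessage may differ.
enum Role: String { case system, user, assistant }

struct Message {
    let role: Role
    let content: String
}

// Renders a conversation using ChatML's <|im_start|> / <|im_end|> markers.
struct ChatMLSketch {
    let stopTokens = ["<|im_end|>"]

    func format(_ messages: [Message]) -> String {
        var prompt = ""
        for message in messages {
            prompt += "<|im_start|>\(message.role.rawValue)\n\(message.content)<|im_end|>\n"
        }
        // Leave the assistant turn open so the model writes the reply.
        prompt += "<|im_start|>assistant\n"
        return prompt
    }
}
```

The stop token matches the end-of-turn marker, so generation halts once the model closes its own turn.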

Sampling

// Presets
// Pick one preset (distinct names so the snippet compiles in a single scope)
let greedy = try LlamaActor(model: model, params: .greedy)
let creative = try LlamaActor(model: model, params: .creative)
let balanced = try LlamaActor(model: model, params: .balanced)

// Or tune it yourself
let params = SamplingParams(temperature: 0.8, topP: 0.95, topK: 50, maxTokens: 4096)
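
To show roughly what temperature, top-k, and top-p do to a token distribution, here is a small self-contained sketch. It is a generic illustration of these techniques, not the code SwiftLlama or llama.cpp runs; all function names are hypothetical, and it assumes a non-empty distribution, a temperature above zero, and `topK` of at least 1.

```swift
import Foundation

// Converts raw logits to probabilities; T < 1 sharpens, T > 1 flattens. Assumes T > 0.
func softmax(_ logits: [Double], temperature: Double) -> [Double] {
    let scaled = logits.map { $0 / temperature }
    let maxLogit = scaled.max() ?? 0   // subtract the max for numerical stability
    let exps = scaled.map { exp($0 - maxLogit) }
    let sum = exps.reduce(0, +)
    return exps.map { $0 / sum }
}

// Greedy decoding: always take the most likely token.
func greedy(_ probabilities: [Double]) -> Int? {
    probabilities.indices.max { probabilities[$0] < probabilities[$1] }
}

// Token indices that survive top-k, then top-p (nucleus) filtering,
// ordered from most to least likely.
func candidates(probabilities: [Double], topK: Int, topP: Double) -> [Int] {
    let ranked = probabilities.indices.sorted { probabilities[$0] > probabilities[$1] }
    var kept: [Int] = []
    var cumulative = 0.0
    for index in ranked.prefix(topK) {
        kept.append(index)
        cumulative += probabilities[index]
        if cumulative >= topP { break }   // smallest set whose mass reaches topP
    }
    return kept
}

// Draws one index from the kept candidates, weighted by probability.
// Assumes `kept` is non-empty and its probabilities sum to more than zero.
func sample(probabilities: [Double], from kept: [Int]) -> Int {
    let total = kept.reduce(0.0) { $0 + probabilities[$1] }
    var threshold = Double.random(in: 0..<total)
    for index in kept {
        threshold -= probabilities[index]
        if threshold < 0 { return index }
    }
    return kept[kept.count - 1]
}
```

With probabilities `[0.5, 0.3, 0.15, 0.05]`, `topK: 3`, and `topP: 0.7`, only the first two tokens remain candidates, because together they already cover 0.8 of the probability mass.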

Downloading models

let downloader = ModelDownloader()

for await event in downloader.download(model: ModelCatalog.recommended[0], to: modelsDir) {
    switch event {
    case .progress(let percent, _, _): print("\(Int(percent))%")
    case .verifying: print("Verifying checksum...")
    case .completed(let url): print("Done: \(url.path)")
    case .failed(let error): print("Error: \(error)")
    }
}

// See what's available
let gemmaModels = ModelCatalog.models(for: .gemma)
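
The checksum step can be pictured with CryptoKit. The sketch below is an assumption about how such a check could be written, not ModelDownloader's actual code; `sha256Hex` and `matchesChecksum` are hypothetical helpers.

```swift
import Foundation
import CryptoKit

// Lowercase hex SHA-256 digest of in-memory bytes.
func sha256Hex(of data: Data) -> String {
    SHA256.hash(data: data).map { String(format: "%02x", $0) }.joined()
}

// Hashes a file in chunks so a multi-gigabyte model never sits fully in memory.
func sha256Hex(ofFileAt url: URL, chunkSize: Int = 1 << 20) throws -> String {
    let handle = try FileHandle(forReadingFrom: url)
    defer { try? handle.close() }
    var hasher = SHA256()
    while let chunk = try handle.read(upToCount: chunkSize), !chunk.isEmpty {
        hasher.update(data: chunk)
    }
    return hasher.finalize().map { String(format: "%02x", $0) }.joined()
}

// Compares the file's digest with an expected hex string, ignoring case.
func matchesChecksum(fileAt url: URL, expected: String) throws -> Bool {
    try sha256Hex(ofFileAt: url) == expected.lowercased()
}
```

Hashing in fixed-size chunks matters here because the catalog's models are several gigabytes each.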

Supported models

Model	Family	Size	Template
Gemma 4 E2B	Gemma	2.9 GB	`.gemma`
Gemma 4 E4B	Gemma	5.4 GB	`.gemma`
Llama 3.2 3B	Llama	2.0 GB	`.llama3`
Mistral 7B v0.3	Mistral	4.4 GB	`.mistral`
Phi-3.5 Mini	Phi	2.4 GB	`.chatML`
Qwen 2.5 3B	Qwen	2.1 GB	`.chatML`

Any GGUF model works. The catalog is there so you don't have to hunt for HuggingFace URLs.

Requirements

macOS 14+ / iOS 17+ / visionOS 1+
Swift 6.0+
Apple Silicon recommended (Intel works, just slower)

License

MIT. See LICENSE.

Who made this

ProfClaw. We're building ProfClaw Studio, a native macOS AI assistant that uses this package for local inference.
