Run LLMs on Apple devices with Swift. Wraps llama.cpp with a native Swift API.
Load a GGUF model, stream tokens, parse tool calls, all from Swift. The inference runs on an actor so you get Swift 6 concurrency safety without thinking about it. Metal GPU acceleration works out of the box on Apple Silicon.
No dependencies beyond llama.cpp itself (shipped as a pre-built xcframework).
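To make the actor-plus-stream idea concrete, here is a minimal sketch of the general pattern in plain Swift. MockGenerator, generate(), and collect(_:) are hypothetical names used only for illustration. They are not part of SwiftLlama's API, and the real LlamaActor may be structured differently.

```swift
// Hypothetical stand-in for a model-backed actor; not SwiftLlama's real types.
actor MockGenerator {
    private let tokens: [String]

    init(tokens: [String]) {
        self.tokens = tokens
    }

    // Actor-isolated: access to the token state is serialized by the actor.
    private func token(at index: Int) -> String? {
        index < tokens.count ? tokens[index] : nil
    }

    // nonisolated so callers can obtain the stream without `await`.
    nonisolated func generate() -> AsyncThrowingStream<String, Error> {
        AsyncThrowingStream { continuation in
            let task = Task {
                do {
                    var index = 0
                    while let token = await self.token(at: index) {
                        try Task.checkCancellation()
                        continuation.yield(token)
                        index += 1
                    }
                    continuation.finish()
                } catch {
                    continuation.finish(throwing: error)
                }
            }
            // Stop producing tokens when the consumer stops iterating.
            continuation.onTermination = { _ in task.cancel() }
        }
    }
}

// Consumes the stream and joins the tokens into one string.
func collect(_ stream: AsyncThrowingStream<String, Error>) async throws -> String {
    var text = ""
    for try await token in stream {
        text += token
    }
    return text
}
```

The consumer only ever sees an async sequence, so it can print tokens as they arrive or cancel early by breaking out of the loop.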
graph TD
App[Your Swift App] --> Actor[LlamaActor<br/>async streaming]
App --> Templates[ChatTemplate<br/>.gemma .llama3 .mistral .chatML]
App --> Parser[ToolCallParser<br/>streaming detection]
App --> Downloader[ModelDownloader<br/>HuggingFace + SHA256]
Actor --> Sampling[LlamaSampling<br/>temp / top-p / top-k]
Sampling --> Core[LlamaModel + LlamaContext]
Core --> XCF[llama.cpp xcframework<br/>Metal GPU]
style App fill:#1a1a2e,stroke:#e94560,color:#fff
style Actor fill:#16213e,stroke:#0f3460,color:#fff
style Templates fill:#16213e,stroke:#0f3460,color:#fff
style Parser fill:#16213e,stroke:#0f3460,color:#fff
style Downloader fill:#16213e,stroke:#0f3460,color:#fff
style Sampling fill:#0f3460,stroke:#533483,color:#fff
style Core fill:#0f3460,stroke:#533483,color:#fff
style XCF fill:#533483,stroke:#e94560,color:#fff
import SwiftLlama
let model = try LlamaModel(path: "path/to/model.gguf", config: .init(gpuLayers: .all))
let llama = try LlamaActor(model: model)
for try await chunk in llama.chat(messages: [.user("Hello!")], template: .gemma) {
    switch chunk {
    case .text(let token): print(token, terminator: "")
    case .toolCall(let call): print("Tool: \(call.name)(\(call.arguments))")
    }
}

Add to your Package.swift:
dependencies: [
    .package(url: "https://github.com/profclaw/swift-llama", from: "0.1.0"),
]

Then add "SwiftLlama" to your target dependencies.
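For context, a complete manifest might look roughly like the sketch below. The package and target name MyApp is a placeholder, the platform list mirrors the stated requirements, and the product reference assumes the product is called SwiftLlama and the package identity follows the repository name. Check the package's own Package.swift if resolution fails.

```swift
// swift-tools-version: 6.0
import PackageDescription

let package = Package(
    name: "MyApp", // hypothetical package name
    platforms: [.macOS(.v14), .iOS(.v17), .visionOS(.v1)],
    dependencies: [
        .package(url: "https://github.com/profclaw/swift-llama", from: "0.1.0"),
    ],
    targets: [
        .executableTarget(
            name: "MyApp", // hypothetical target name
            dependencies: [
                // Assumption: product "SwiftLlama", package identity "swift-llama".
                .product(name: "SwiftLlama", package: "swift-llama"),
            ]
        ),
    ]
)
```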
- Actor-isolated inference with AsyncThrowingStream for token streaming
- Chat templates for Gemma, Llama 3, Mistral, and ChatML (or write your own)
- A streaming tool call parser that detects function calls as tokens arrive
- Model downloader that grabs GGUFs from HuggingFace with progress and SHA256 checks
- A catalog of popular models with recommended configs
- Configurable sampling (temperature, top-p, top-k, repeat penalty, or just .greedy)
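If the sampling knobs in the last bullet are unfamiliar, the sketch below shows one common formulation of temperature, top-k, and top-p filtering over raw logits. It is an illustration only: candidateDistribution and greedyToken are hypothetical helpers, and the order in which llama.cpp or SwiftLlama applies these steps may differ.

```swift
import Foundation

// Hypothetical helper: returns the renormalized candidates left after
// temperature scaling, top-k truncation, and top-p (nucleus) truncation.
func candidateDistribution(
    logits: [Double],
    temperature: Double,
    topK: Int,
    topP: Double
) -> [(token: Int, probability: Double)] {
    precondition(!logits.isEmpty && temperature > 0 && topK > 0)

    // Temperature: divide logits before softmax; lower values sharpen the distribution.
    let scaled = logits.map { $0 / temperature }
    let maxLogit = scaled.max()!
    let exps = scaled.map { exp($0 - maxLogit) } // subtract the max for numerical stability
    let total = exps.reduce(0, +)

    // Rank token indices by descending probability.
    let ranked = exps.enumerated()
        .map { (token: $0.offset, probability: $0.element / total) }
        .sorted { $0.probability > $1.probability }

    // Top-k, then top-p: keep the smallest prefix of the k most likely tokens
    // whose cumulative probability reaches topP.
    var kept: [(token: Int, probability: Double)] = []
    var cumulative = 0.0
    for candidate in ranked.prefix(topK) {
        kept.append(candidate)
        cumulative += candidate.probability
        if cumulative >= topP { break }
    }

    // Renormalize so the surviving probabilities sum to 1.
    let keptTotal = kept.reduce(0) { $0 + $1.probability }
    return kept.map { (token: $0.token, probability: $0.probability / keptTotal) }
}

// Greedy decoding is the degenerate case: always pick the most likely token.
func greedyToken(logits: [Double]) -> Int {
    logits.indices.max { logits[$0] < logits[$1] }!
}
```

A sampler would then draw one token at random from the returned distribution. Greedy decoding skips the draw, which makes output deterministic.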
let model = try LlamaModel(
    path: "/path/to/model.gguf",
    config: ModelConfig(
        contextSize: 8192,
        gpuLayers: .all, // .all, .none, .count(20)
        threads: 8
    )
)

let llama = try LlamaActor(model: model, params: .balanced)
let stream = llama.chat(
    messages: [
        .system("You are a helpful assistant."),
        .user("What is Swift concurrency?")
    ],
    template: .gemma
)
for try await chunk in stream {
    switch chunk {
    case .text(let token): print(token, terminator: "")
    case .toolCall(let call): await handleTool(call)
    }
}

// Pick a built-in format
llama.chat(messages: messages, template: .gemma)
llama.chat(messages: messages, template: .llama3)
llama.chat(messages: messages, template: .mistral)
llama.chat(messages: messages, template: .chatML)
// Or bring your own
struct MyTemplate: ChatTemplateProtocol {
    let stopTokens = ["<|end|>"]
    func format(_ messages: [ChatMessage]) -> String { /* ... */ }
}
llama.chat(messages: messages, template: .custom(MyTemplate()))

// Presets
let llama = try LlamaActor(model: model, params: .greedy)
let llama = try LlamaActor(model: model, params: .creative)
let llama = try LlamaActor(model: model, params: .balanced)
// Or tune it yourself
let params = SamplingParams(temperature: 0.8, topP: 0.95, topK: 50, maxTokens: 4096)

let downloader = ModelDownloader()
for await event in downloader.download(model: ModelCatalog.recommended[0], to: modelsDir) {
    switch event {
    case .progress(let percent, _, _): print("\(Int(percent))%")
    case .verifying: print("Verifying checksum...")
    case .completed(let url): print("Done: \(url.path)")
    case .failed(let error): print("Error: \(error)")
    }
}
// See what's available
let gemmaModels = ModelCatalog.models(for: .gemma)

| Model | Family | Size | Template |
|---|---|---|---|
| Gemma 4 E2B | Gemma | 2.9 GB | .gemma |
| Gemma 4 E4B | Gemma | 5.4 GB | .gemma |
| Llama 3.2 3B | Llama | 2.0 GB | .llama3 |
| Mistral 7B v0.3 | Mistral | 4.4 GB | .mistral |
| Phi-3.5 Mini | Phi | 2.4 GB | .chatML |
| Qwen 2.5 3B | Qwen | 2.1 GB | .chatML |
Any GGUF model works. The catalog is there so you don't have to hunt for HuggingFace URLs.
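If you point the library at a file from outside the catalog, a quick sanity check before loading can save a confusing error. GGUF files begin with the four ASCII bytes "GGUF", so a sketch like the following can reject obviously wrong files. hasGGUFMagic and looksLikeGGUF are hypothetical helpers, not part of SwiftLlama, and a passing check does not guarantee the model will load.

```swift
import Foundation

// GGUF files start with the four ASCII bytes "GGUF".
let ggufMagic: [UInt8] = [0x47, 0x47, 0x55, 0x46]

// Hypothetical helper: true when the first four bytes match the GGUF magic.
func hasGGUFMagic(_ header: Data) -> Bool {
    header.count >= ggufMagic.count
        && Array(header.prefix(ggufMagic.count)) == ggufMagic
}

// Reads only the first four bytes, so large model files are not loaded into memory.
func looksLikeGGUF(at url: URL) -> Bool {
    guard let handle = try? FileHandle(forReadingFrom: url) else { return false }
    defer { try? handle.close() }
    guard let header = try? handle.read(upToCount: ggufMagic.count) else { return false }
    return hasGGUFMagic(header)
}
```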
- macOS 14+ / iOS 17+ / visionOS 1+
- Swift 6.0+
- Apple Silicon recommended (Intel works, just slower)
Released under the MIT License. See LICENSE.
Built by ProfClaw. We're building ProfClaw Studio, a native macOS AI assistant that uses this package for local inference.