swift-llama

Run LLMs on Apple devices with Swift. Wraps llama.cpp with a native Swift API.

Load a GGUF model, stream tokens, and parse tool calls, all from Swift. Inference runs on an actor, so access to the model is serialized and you get Swift 6 concurrency safety without writing any locking yourself. Metal GPU acceleration works out of the box on Apple Silicon.

No dependencies beyond llama.cpp itself (shipped as a pre-built xcframework).

graph TD
    App[Your Swift App] --> Actor[LlamaActor<br/>async streaming]
    App --> Templates[ChatTemplate<br/>.gemma .llama3 .mistral .chatML]
    App --> Parser[ToolCallParser<br/>streaming detection]
    App --> Downloader[ModelDownloader<br/>HuggingFace + SHA256]

    Actor --> Sampling[LlamaSampling<br/>temp / top-p / top-k]
    Sampling --> Core[LlamaModel + LlamaContext]
    Core --> XCF[llama.cpp xcframework<br/>Metal GPU]

    style App fill:#1a1a2e,stroke:#e94560,color:#fff
    style Actor fill:#16213e,stroke:#0f3460,color:#fff
    style Templates fill:#16213e,stroke:#0f3460,color:#fff
    style Parser fill:#16213e,stroke:#0f3460,color:#fff
    style Downloader fill:#16213e,stroke:#0f3460,color:#fff
    style Sampling fill:#0f3460,stroke:#533483,color:#fff
    style Core fill:#0f3460,stroke:#533483,color:#fff
    style XCF fill:#533483,stroke:#e94560,color:#fff

Quick start

import SwiftLlama

let model = try LlamaModel(path: "path/to/model.gguf", config: .init(gpuLayers: .all))
let llama = try LlamaActor(model: model)

for try await chunk in llama.chat(messages: [.user("Hello!")], template: .gemma) {
    switch chunk {
    case .text(let token): print(token, terminator: "")
    case .toolCall(let call): print("Tool: \(call.name)(\(call.arguments))")
    }
}

Installation

Add to your Package.swift:

dependencies: [
    .package(url: "https://github.com/profclaw/swift-llama", from: "0.1.0"),
]

Then add "SwiftLlama" to your target dependencies.

What's in the box

  • Actor-isolated inference with AsyncThrowingStream for token streaming
  • Chat templates for Gemma, Llama 3, Mistral, and ChatML (or write your own)
  • A streaming tool call parser that detects function calls as tokens arrive
  • Model downloader that grabs GGUFs from HuggingFace with progress and SHA256 checks
  • A catalog of popular models with recommended configs
  • Configurable sampling (temperature, top-p, top-k, repeat penalty, or just .greedy)
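
To make the streaming tool call parser in the list above more concrete, here is a minimal, standalone sketch of the general technique. It does not use or reproduce the package's ToolCallParser. The `<tool_call>...</tool_call>` tag format and the `Chunk` and `StreamingToolCallParser` names are assumptions made for illustration. The idea is to buffer incoming tokens, hold back any suffix that might be the start of a tag, and emit text or a tool call payload as soon as either is unambiguous.

```swift
import Foundation

// Hypothetical output type for this sketch (not the package's own chunk type).
enum Chunk: Equatable {
    case text(String)
    case toolCall(String)   // raw payload found between the tags
}

// Assumes tool calls are wrapped in <tool_call>...</tool_call>.
struct StreamingToolCallParser {
    private let openTag = "<tool_call>"
    private let closeTag = "</tool_call>"
    private var buffer = ""

    init() {}

    // Length of the longest buffer suffix that is a proper prefix of openTag.
    private func heldBackCount() -> Int {
        let longest = min(buffer.count, openTag.count - 1)
        for n in stride(from: longest, to: 0, by: -1) {
            if buffer.hasSuffix(String(openTag.prefix(n))) { return n }
        }
        return 0
    }

    // Feed one streamed token; returns whatever can be emitted so far.
    mutating func feed(_ token: String) -> [Chunk] {
        buffer += token
        var output: [Chunk] = []
        while true {
            guard let openRange = buffer.range(of: openTag) else {
                // No complete opening tag: emit text, keep a possible partial tag.
                let emitEnd = buffer.index(buffer.endIndex, offsetBy: -heldBackCount())
                let text = String(buffer[..<emitEnd])
                if !text.isEmpty { output.append(.text(text)) }
                buffer = String(buffer[emitEnd...])
                return output
            }
            // Emit any plain text that precedes the opening tag.
            let before = String(buffer[..<openRange.lowerBound])
            if !before.isEmpty { output.append(.text(before)) }
            guard let closeRange = buffer.range(
                of: closeTag, range: openRange.upperBound..<buffer.endIndex
            ) else {
                // Tool call still incomplete: wait for more tokens.
                buffer = String(buffer[openRange.lowerBound...])
                return output
            }
            output.append(.toolCall(String(buffer[openRange.upperBound..<closeRange.lowerBound])))
            buffer = String(buffer[closeRange.upperBound...])
        }
    }
}
```

A real parser would also decode the payload (for example as JSON) and flush any held-back text when generation ends. Both steps are left out here to keep the sketch short.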

Usage

Loading a model

let model = try LlamaModel(
    path: "/path/to/model.gguf",
    config: ModelConfig(
        contextSize: 8192,
        gpuLayers: .all,    // .all, .none, .count(20)
        threads: 8
    )
)

Streaming chat

let llama = try LlamaActor(model: model, params: .balanced)

let stream = llama.chat(
    messages: [
        .system("You are a helpful assistant."),
        .user("What is Swift concurrency?")
    ],
    template: .gemma
)

for try await chunk in stream {
    switch chunk {
    case .text(let token): print(token, terminator: "")
    case .toolCall(let call): await handleTool(call)
    }
}
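
For readers curious how an actor can hand out a token stream, the following standalone sketch shows one common pattern: a `nonisolated` method builds an `AsyncThrowingStream` and a child task hops onto the actor for each token. `TokenSource` and its methods are hypothetical stand-ins, and this is not a description of how LlamaActor is implemented.

```swift
// Hypothetical stand-in for an inference actor; it replays a fixed token list.
actor TokenSource {
    private let tokens: [String]

    init(tokens: [String]) { self.tokens = tokens }

    // Actor-isolated: calls are serialized, so state access is data-race free.
    func token(at index: Int) -> String? {
        index < tokens.count ? tokens[index] : nil
    }

    // nonisolated, so callers can obtain the stream without `await`.
    nonisolated func stream() -> AsyncThrowingStream<String, Error> {
        AsyncThrowingStream { continuation in
            let task = Task {
                do {
                    var index = 0
                    while let token = await self.token(at: index) {
                        try Task.checkCancellation()
                        continuation.yield(token)
                        index += 1
                    }
                    continuation.finish()
                } catch {
                    continuation.finish(throwing: error)
                }
            }
            // Stop producing tokens if the consumer stops iterating.
            continuation.onTermination = { _ in task.cancel() }
        }
    }
}
```

Consuming it looks the same as the loop above: `for try await token in TokenSource(tokens: ["Hel", "lo"]).stream() { ... }`. Breaking out of the loop cancels the producing task through `onTermination`.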

Chat templates

// Pick a built-in format
llama.chat(messages: messages, template: .gemma)
llama.chat(messages: messages, template: .llama3)
llama.chat(messages: messages, template: .mistral)
llama.chat(messages: messages, template: .chatML)

// Or bring your own
struct MyTemplate: ChatTemplateProtocol {
    let stopTokens = ["<|end|>"]
    func format(_ messages: [ChatMessage]) -> String { /* ... */ }
}
llama.chat(messages: messages, template: .custom(MyTemplate()))
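
As an illustration of what a `format` body might produce, here is a standalone sketch that renders messages in the ChatML layout. `Message` and `ChatMLTemplate` are local stand-ins defined only for this example. They are not the package's ChatMessage or ChatTemplateProtocol, whose exact shapes are not shown in this README.

```swift
// Local stand-in types for this sketch only.
struct Message {
    enum Role: String { case system, user, assistant }
    let role: Role
    let content: String
}

struct ChatMLTemplate {
    // Generation should stop when the model closes its turn.
    let stopTokens = ["<|im_end|>"]

    func format(_ messages: [Message]) -> String {
        var prompt = ""
        for message in messages {
            // Each turn is wrapped as <|im_start|>role\ncontent<|im_end|>\n
            prompt += "<|im_start|>\(message.role.rawValue)\n\(message.content)<|im_end|>\n"
        }
        // Open an assistant turn so the model writes the reply next.
        prompt += "<|im_start|>assistant\n"
        return prompt
    }
}
```

Adapting this to a real custom template should mostly be a matter of conforming to ChatTemplateProtocol and reading the role and content from ChatMessage, assuming it exposes them.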

Sampling

// Presets
let greedyLlama = try LlamaActor(model: model, params: .greedy)
let creativeLlama = try LlamaActor(model: model, params: .creative)
let balancedLlama = try LlamaActor(model: model, params: .balanced)

// Or tune it yourself
let params = SamplingParams(temperature: 0.8, topP: 0.95, topK: 50, maxTokens: 4096)
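
To show what temperature, top-k, and top-p do to a token distribution, here is a small standalone sketch in plain Swift. The function names and the order of the steps (temperature, then top-k, then top-p) are assumptions for illustration. The package delegates sampling to llama.cpp, which may order or implement these steps differently.

```swift
import Foundation

typealias Candidate = (index: Int, probability: Double)

// Temperature-scaled softmax, then top-k and top-p truncation, then renormalization.
func filterCandidates(logits: [Double], temperature: Double,
                      topK: Int, topP: Double) -> [Candidate] {
    precondition(!logits.isEmpty && temperature > 0 && topK > 0)
    // Lower temperature sharpens the distribution; higher flattens it.
    let scaled = logits.map { $0 / temperature }
    let maxLogit = scaled.max()!
    let exps = scaled.map { exp($0 - maxLogit) }   // subtract max for stability
    let total = exps.reduce(0, +)
    let sorted: [Candidate] = exps.enumerated()
        .map { (index: $0.offset, probability: $0.element / total) }
        .sorted { $0.probability > $1.probability }

    // top-k keeps the k most likely tokens; top-p then keeps the smallest
    // prefix of those whose cumulative probability reaches topP.
    var kept: [Candidate] = []
    var cumulative = 0.0
    for candidate in sorted.prefix(topK) {
        kept.append(candidate)
        cumulative += candidate.probability
        if cumulative >= topP { break }
    }
    // Renormalize so the surviving probabilities sum to 1.
    return kept.map { (index: $0.index, probability: $0.probability / cumulative) }
}

// Draws one token index from the filtered distribution.
func draw(from candidates: [Candidate]) -> Int {
    precondition(!candidates.isEmpty)
    var remaining = Double.random(in: 0..<1)
    for candidate in candidates {
        remaining -= candidate.probability
        if remaining < 0 { return candidate.index }
    }
    return candidates[candidates.count - 1].index
}
```

Greedy decoding corresponds to `topK: 1`, where the only surviving candidate is the most likely token and `draw` always returns it.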

Downloading models

let downloader = ModelDownloader()

for await event in downloader.download(model: ModelCatalog.recommended[0], to: modelsDir) {
    switch event {
    case .progress(let percent, _, _): print("\(Int(percent))%")
    case .verifying: print("Verifying checksum...")
    case .completed(let url): print("Done: \(url.path)")
    case .failed(let error): print("Error: \(error)")
    }
}

// See what's available
let gemmaModels = ModelCatalog.models(for: .gemma)

Supported models

| Model | Family | Size | Template |
|---|---|---|---|
| Gemma 4 E2B | Gemma | 2.9 GB | .gemma |
| Gemma 4 E4B | Gemma | 5.4 GB | .gemma |
| Llama 3.2 3B | Llama | 2.0 GB | .llama3 |
| Mistral 7B v0.3 | Mistral | 4.4 GB | .mistral |
| Phi-3.5 Mini | Phi | 2.4 GB | .chatML |
| Qwen 2.5 3B | Qwen | 2.1 GB | .chatML |

Any GGUF model works. The catalog is there so you don't have to hunt for HuggingFace URLs.

Requirements

  • macOS 14+ / iOS 17+ / visionOS 1+
  • Swift 6.0+
  • Apple Silicon recommended (Intel works, just slower)

License

MIT. See LICENSE.

Who made this

ProfClaw. We're building ProfClaw Studio, a native macOS AI assistant that uses this package for local inference.