38 changes: 38 additions & 0 deletions CHANGELOG.md
@@ -59,6 +59,44 @@ Versioning follows [Semantic Versioning](https://semver.org/).
`macmlx serve` and `macmlx run`). Drop into Claude Desktop's
`claude_desktop_config.json` as
`{ "mcpServers": { "macmlx": { "command": "macmlx", "args": ["mcp", "serve"] } } }`.
- **VLM UI + Persistence + HTTP** (v0.4.1, part 3 of 3). Lights up
the user-facing surfaces for vision-language models. Closes the
v0.4.1 work begun in PRs #33 (Foundation) and #34 (Engine).
- **Chat input image picker.** New paperclip button in the chat
input opens SwiftUI's `.fileImporter` (image UTTypes only:
jpeg / png / webp / gif / heic / bmp), populating a horizontal
thumbnail strip above the text field. Click the × on a thumbnail
to drop it. The button is disabled when the loaded model isn't
a VLM, with an explanatory tooltip ("Load a vision-capable model
(Qwen-VL, Gemma-3, SmolVLM, …) to attach images"). Image-only
messages (no text) are now valid sends on a VLM.
- **Inline thumbnails on chat bubbles.** `ChatMessageView` renders
a 96pt LazyVGrid of attached images above the text bubble for
any turn that has images. Click a thumbnail to open the file in
Preview via `NSWorkspace`.
- **Conversation persistence.** `StoredMessage.images` round-trips
through `ConversationStore`. On save, every external image URL
is copied into `<conversations>/<conv-uuid>/images/` and the
stored URL is rewritten to point there — chats survive the user
moving the picked file. On `delete(id:)`, the per-conversation
directory is torn down so images don't leak. Pre-v0.4.1
conversations decode unchanged (missing key → empty array).
- **OpenAI multimodal HTTP.** `/v1/chat/completions` now accepts
OpenAI's `content` array shape:
```json
{"role":"user","content":[
{"type":"text","text":"What's this?"},
{"type":"image_url","image_url":{"url":"data:image/png;base64,…"}}
]}
```
Plain-string `content` continues to work: the decoder tries
string first, then falls through to `[Part]`. Base64 data URLs
decode to tmpfile-backed `ImageAttachment` values, capped at
10 MB per image and 4 images per message; parts over either cap
are silently dropped. `http(s)://` and `file://` URLs are not
fetched (defence-in-depth on a localhost-bound server). Ollama's
`/api/chat` / `/api/generate` stay text-only because Ollama's
wire format uses a separate top-level `images: [base64]` field;
revisit in a follow-up.
- **VLM Engine** (v0.4.1, part 2 of 3). MLXSwiftEngine now branches
on `model.format` to load text-only models through
`MLXLLM.LLMModelFactory` and vision-language models through
95 changes: 93 additions & 2 deletions MacMLXCore/Sources/MacMLXCore/Managers/ConversationStore.swift
@@ -26,19 +26,42 @@ public struct StoredMessage: Codable, Hashable, Identifiable, Sendable {
public var content: String
public let timestamp: Date
public var tokenCount: Int?
/// Image attachments tied to this turn. Empty for text-only — the
/// common case. URLs point into
/// `<conversations>/<conv-uuid>/images/...` once the conversation
/// has been saved (see `ConversationStore.save(_:)`).
/// Backwards-compatible: pre-v0.4.1 JSON without an `images` key
/// decodes with an empty array.
public var images: [ImageAttachment]

public init(
id: UUID = UUID(),
role: MessageRole,
content: String,
timestamp: Date = Date(),
- tokenCount: Int? = nil
+ tokenCount: Int? = nil,
+ images: [ImageAttachment] = []
) {
self.id = id
self.role = role
self.content = content
self.timestamp = timestamp
self.tokenCount = tokenCount
self.images = images
}

private enum CodingKeys: String, CodingKey {
case id, role, content, timestamp, tokenCount, images
}

public init(from decoder: Decoder) throws {
let c = try decoder.container(keyedBy: CodingKeys.self)
self.id = try c.decode(UUID.self, forKey: .id)
self.role = try c.decode(MessageRole.self, forKey: .role)
self.content = try c.decode(String.self, forKey: .content)
self.timestamp = try c.decode(Date.self, forKey: .timestamp)
self.tokenCount = try c.decodeIfPresent(Int.self, forKey: .tokenCount)
self.images = try c.decodeIfPresent([ImageAttachment].self, forKey: .images) ?? []
}
}
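
The backwards-compatible decoding above can be sketched in isolation. This is a minimal illustration, not the PR's code: `SketchAttachment` and `SketchMessage` are hypothetical stand-ins for `ImageAttachment` and `StoredMessage`, reduced to the fields needed to show that a missing `images` key decodes to an empty array while a populated one round-trips.

```swift
import Foundation

// Hypothetical stand-in for the PR's `ImageAttachment` (assumed shape).
struct SketchAttachment: Codable {
    var fileURL: URL
    var mimeType: String
}

// Hypothetical stand-in for `StoredMessage`, trimmed to two fields.
struct SketchMessage: Codable {
    var content: String
    var images: [SketchAttachment]

    private enum CodingKeys: String, CodingKey {
        case content, images
    }

    init(content: String, images: [SketchAttachment] = []) {
        self.content = content
        self.images = images
    }

    init(from decoder: Decoder) throws {
        let c = try decoder.container(keyedBy: CodingKeys.self)
        content = try c.decode(String.self, forKey: .content)
        // `decodeIfPresent` returns nil for a missing key; map that to [].
        images = try c.decodeIfPresent([SketchAttachment].self, forKey: .images) ?? []
    }
}

// Legacy JSON written before the `images` key existed.
let legacyJSON = Data(#"{"content":"hello"}"#.utf8)
let legacy = try JSONDecoder().decode(SketchMessage.self, from: legacyJSON)

// A message with one attachment survives an encode/decode round trip.
let current = SketchMessage(
    content: "look",
    images: [SketchAttachment(fileURL: URL(fileURLWithPath: "/tmp/a.png"),
                              mimeType: "image/png")]
)
let roundTripped = try JSONDecoder().decode(
    SketchMessage.self, from: JSONEncoder().encode(current))
```

Only the decoder needs to be written by hand; the synthesized `encode(to:)` always emits the `images` key, so files written after the change carry it even when the array is empty.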

@@ -109,6 +132,16 @@ public actor ConversationStore {
/// Persist `conversation` to disk atomically. Creates the directory if
/// missing. Bumps `updatedAt` to "now" before writing.
///
/// Image attachments referenced by any message get copied (best-
/// effort) into `<directory>/<conv-uuid>/images/<image-uuid>.<ext>`
/// the first time we see them, so the saved JSON URLs are stable
/// across user moves of the picked file. Already-internal URLs
/// (already pointing at the conversation's images dir) are left
/// in place. A copy failure logs to stderr and falls through —
/// the conversation still saves with the original URL, which the
/// reader will tolerate (image just won't load if the source
/// disappears).
///
/// Uses `JSONCoding.precisionEncoder` so rapid saves produce distinct
/// `updatedAt` values for `list()` sort stability. Decoder accepts
/// pre-v0.3 ISO-8601-string files for backward compat.
@@ -120,11 +153,49 @@ public actor ConversationStore {
if copy.title == "New Chat" {
copy.title = copy.derivedTitle
}
copy.messages = copy.messages.map { internaliseImages(of: $0, conversationID: copy.id) }

let data = try JSONCoding.precisionEncoder().encode(copy)
let url = fileURL(for: copy.id)
try data.write(to: url, options: .atomic)
}

/// Copy any image attachments that live outside the conversation's
/// own `images/` directory into it, and rewrite the URLs. Idempotent:
/// images already inside the conversation directory are kept verbatim.
private func internaliseImages(
of message: StoredMessage,
conversationID: UUID
) -> StoredMessage {
guard !message.images.isEmpty else { return message }
let imagesDir = imagesDirectory(for: conversationID)

let updated: [ImageAttachment] = message.images.map { att in
// Already internal? — leave it alone.
if att.fileURL.path.hasPrefix(imagesDir.path) {
return att
}
// Try to copy. On any error, fall through to the original
// attachment so the save still succeeds.
do {
try fileManager.createDirectory(at: imagesDir, withIntermediateDirectories: true)
let ext = att.fileURL.pathExtension.isEmpty ? "img" : att.fileURL.pathExtension
let dest = imagesDir.appending(
path: "\(UUID().uuidString).\(ext)", directoryHint: .notDirectory)
try fileManager.copyItem(at: att.fileURL, to: dest)
return ImageAttachment(fileURL: dest, mimeType: att.mimeType)
} catch {
FileHandle.standardError.write(Data(
"[ConversationStore] image copy failed for \(att.fileURL.path): \(error)\n".utf8
))
return att
}
}
var out = message
out.images = updated
return out
}
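
The copy-then-rewrite step can be tried outside the actor. The sketch below is an assumption-laden illustration: `internalise(_:into:)` is a hypothetical free function, not part of `ConversationStore`, and it compares against the directory path plus a trailing slash, which is slightly stricter than the plain `hasPrefix` check in the diff (a sibling directory such as `images-old` would not count as internal).

```swift
import Foundation

/// Hypothetical helper: copy `source` into `dir` unless it already lives there.
/// On any failure the original URL is returned, mirroring the best-effort policy.
func internalise(_ source: URL, into dir: URL,
                 fileManager: FileManager = .default) -> URL {
    let dirPath = dir.standardizedFileURL.path
    let prefix = dirPath.hasSuffix("/") ? dirPath : dirPath + "/"
    // Already inside the target directory: keep the URL verbatim.
    if source.standardizedFileURL.path.hasPrefix(prefix) {
        return source
    }
    do {
        try fileManager.createDirectory(at: dir, withIntermediateDirectories: true)
        let ext = source.pathExtension.isEmpty ? "img" : source.pathExtension
        let dest = dir.appendingPathComponent("\(UUID().uuidString).\(ext)")
        try fileManager.copyItem(at: source, to: dest)
        return dest
    } catch {
        // Best effort: the caller still gets a usable (external) URL.
        return source
    }
}

// Usage: a throwaway directory tree under the system temp dir.
let root = FileManager.default.temporaryDirectory
    .appendingPathComponent("sketch-\(UUID().uuidString)", isDirectory: true)
let imagesDir = root.appendingPathComponent("images", isDirectory: true)
try FileManager.default.createDirectory(at: root, withIntermediateDirectories: true)

let picked = root.appendingPathComponent("picked.png")
try Data([0x89, 0x50, 0x4E, 0x47]).write(to: picked)

let first = internalise(picked, into: imagesDir)   // copied into images/
let second = internalise(first, into: imagesDir)   // already internal, unchanged
```

Calling the helper a second time on its own result is what makes repeated saves cheap: no new copy is made and the stored URL stays stable.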

/// Return the most-recently-updated conversation, or nil if the store
/// is empty. Corrupt files are skipped (they don't block other loads).
public func loadLatest() async throws -> Conversation? {
@@ -170,10 +241,16 @@ public actor ConversationStore {
return loaded.sorted { $0.updatedAt > $1.updatedAt }
}

- /// Remove a conversation from disk. Idempotent: no error if missing.
+ /// Remove a conversation from disk along with any internalised
+ /// image attachments. Idempotent: no error if missing.
public func delete(id: UUID) async throws {
let url = fileURL(for: id)
try? fileManager.removeItem(at: url)
// Also remove the per-conversation directory if it exists
// (images live under it; pre-v0.4.1 conversations didn't
// create one and this no-ops cleanly).
let convDir = conversationDirectory(for: id)
try? fileManager.removeItem(at: convDir)
}

// MARK: - Private
@@ -182,6 +259,20 @@ public actor ConversationStore {
directory.appending(path: "\(id.uuidString).json", directoryHint: .notDirectory)
}

/// Per-conversation directory holding sidecar resources (images,
/// future audio attachments). Created on demand in
/// `internaliseImages(of:conversationID:)` and torn down by
/// `delete(id:)`.
private func conversationDirectory(for id: UUID) -> URL {
directory.appending(path: id.uuidString, directoryHint: .isDirectory)
}

/// Path used for image attachments of a given conversation.
private func imagesDirectory(for id: UUID) -> URL {
conversationDirectory(for: id)
.appending(path: "images", directoryHint: .isDirectory)
}

private func ensureDirectory() throws {
if !fileManager.fileExists(atPath: directory.path) {
try fileManager.createDirectory(at: directory, withIntermediateDirectories: true)
124 changes: 120 additions & 4 deletions MacMLXCore/Sources/MacMLXCore/Server/HummingbirdServer.swift
@@ -8,10 +8,15 @@ import ServiceLifecycle
// MARK: - OpenAI-compatible request/response types

/// OpenAI-compatible chat completion request body.
///
/// `Message.content` accepts either a plain string (text-only chat) or
/// an OpenAI multimodal content array of `{type, text|image_url}` parts
/// (v0.4.1+ — VLM models can read images this way). The decoder tries
/// string first, falls back to `[Part]`. See `MultimodalContent` below.
private struct ChatCompletionRequest: Decodable, Sendable {
struct Message: Decodable, Sendable {
let role: String
- let content: String
+ let content: MultimodalContent
}

let model: String
@@ -22,6 +27,111 @@ private struct ChatCompletionRequest: Decodable, Sendable {
let max_tokens: Int?
}

/// OpenAI multimodal content payload. Either a plain string (text-only
/// — backwards compat with every existing client) or an array of typed
/// parts (`text` and `image_url`). The decoder tries the string form
/// first; on failure it falls through to an array of parts so we don't
/// reject older clients that send a bare string.
private enum MultimodalContent: Decodable, Sendable {
case string(String)
case parts([Part])

struct Part: Decodable, Sendable {
let type: String // "text" or "image_url"
let text: String?
let image_url: ImageURL?
}
struct ImageURL: Decodable, Sendable {
/// A `data:image/...;base64,XXXX` URL is the only form we
/// currently decode (see `extractImages()`). `http(s)://` and
/// `file://` URLs parse but are never fetched: `extractImages()`
/// drops them for defence-in-depth even though the server is
/// localhost-bound.
let url: String
}

init(from decoder: Decoder) throws {
let container = try decoder.singleValueContainer()
if let s = try? container.decode(String.self) {
self = .string(s)
return
}
let parts = try container.decode([Part].self)
self = .parts(parts)
}

/// Concatenated text view — what the model sees as the prompt
/// content for this turn. Image parts are ignored (their bytes
/// flow into the engine separately via `extractImages()`).
var text: String {
switch self {
case .string(let s):
return s
case .parts(let parts):
return parts.compactMap { $0.type == "text" ? $0.text : nil }
.joined(separator: "\n")
}
}

/// Decode any base64 data URLs into `ImageAttachment` values backed
/// by tmpfile copies. Caps:
/// - 4 images per message (further parts silently dropped)
/// - 10 MB per image (oversized parts silently dropped)
/// - data URL only: `http(s)://` and `file://` are not fetched
func extractImages() -> [ImageAttachment] {
guard case .parts(let parts) = self else { return [] }
var out: [ImageAttachment] = []
for part in parts {
guard part.type == "image_url",
let urlStr = part.image_url?.url,
let attachment = MultimodalContent.decodeDataURL(urlStr)
else { continue }
out.append(attachment)
if out.count >= 4 { break }
}
return out
}

/// Best-effort base64 data-URL → on-disk image. Returns nil on
/// any malformed input or unsupported MIME so callers can simply
/// drop the part.
private static func decodeDataURL(_ urlStr: String) -> ImageAttachment? {
guard urlStr.hasPrefix("data:") else { return nil }
let body = urlStr.dropFirst("data:".count)
let split = body.split(separator: ",", maxSplits: 1, omittingEmptySubsequences: false)
guard split.count == 2 else { return nil }
let header = String(split[0]) // e.g. "image/png;base64"
let payload = String(split[1])
guard header.hasSuffix(";base64") else { return nil }
let mime = String(header.dropLast(";base64".count))

let ext: String
switch mime.lowercased() {
case "image/jpeg", "image/jpg": ext = "jpg"
case "image/png": ext = "png"
case "image/webp": ext = "webp"
case "image/gif": ext = "gif"
case "image/heic": ext = "heic"
case "image/bmp": ext = "bmp"
default: return nil
}

guard let bytes = Data(base64Encoded: payload, options: .ignoreUnknownCharacters) else {
return nil
}
// 10 MB per image cap.
if bytes.count > 10 * 1024 * 1024 { return nil }

let tmp = FileManager.default.temporaryDirectory
.appendingPathComponent("macmlx-http-img-\(UUID().uuidString).\(ext)")
do {
try bytes.write(to: tmp)
return ImageAttachment(fileURL: tmp, mimeType: mime)
} catch {
return nil
}
}
}
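
The data-URL handling above can be exercised without touching the filesystem. The following is a sketch under stated assumptions: `parseDataURL(_:maxBytes:)` is a hypothetical pure function that follows the same header/payload split and size cap as `decodeDataURL`, but returns the MIME type and bytes instead of writing a tmpfile, and it omits the MIME allow-list for brevity.

```swift
import Foundation

/// Hypothetical pure parser: splits a base64 data URL into MIME type and bytes.
/// Returns nil for non-data URLs, non-base64 payloads, or oversized images.
func parseDataURL(_ s: String,
                  maxBytes: Int = 10 * 1024 * 1024) -> (mime: String, bytes: Data)? {
    guard s.hasPrefix("data:") else { return nil }
    let body = s.dropFirst("data:".count)
    // Split once on the first comma: "<mime>;base64" | "<payload>".
    let split = body.split(separator: ",", maxSplits: 1,
                           omittingEmptySubsequences: false)
    guard split.count == 2 else { return nil }
    let header = String(split[0])
    guard header.hasSuffix(";base64") else { return nil }
    let mime = String(header.dropLast(";base64".count)).lowercased()
    guard let bytes = Data(base64Encoded: String(split[1]),
                           options: .ignoreUnknownCharacters),
          bytes.count <= maxBytes
    else { return nil }
    return (mime, bytes)
}

// "aGVsbG8=" is the base64 encoding of the five ASCII bytes "hello".
let ok = parseDataURL("data:image/png;base64,aGVsbG8=")
// Remote URLs are never fetched; they simply fail to parse as data URLs.
let remote = parseDataURL("https://example.com/cat.png")
// A cap smaller than the payload rejects the image.
let tooBig = parseDataURL("data:image/png;base64,aGVsbG8=", maxBytes: 4)
```

Keeping the parse step separate from the tmpfile write makes the caps and the rejection of `http(s)://` and `file://` inputs straightforward to unit-test.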

/// Request body for `/x/models/load`.
private struct LoadModelRequest: Decodable, Sendable {
let model_path: String
@@ -568,13 +678,19 @@ public actor HummingbirdServer {
// property, so leaving it in both places produces a duplicate
// system turn, which Qwen3 / Gemma / other strict Jinja chat
// templates reject with a TemplateException.
- let systemPrompt = chatReq.messages.first(where: { $0.role == "system" })?.content
+ let systemPrompt = chatReq.messages.first(where: { $0.role == "system" })?.content.text

// Map the rest (user / assistant), dropping unknown roles and
- // the now-separated system turns.
+ // the now-separated system turns. Multimodal `content` arrays
+ // are split here: text parts → `content`, image_url data URLs
+ // → `images` via `extractImages()`. See MultimodalContent.
let messages: [ChatMessage] = chatReq.messages.compactMap { msg in
guard let role = MessageRole(rawValue: msg.role), role != .system else { return nil }
- return ChatMessage(role: role, content: msg.content)
+ return ChatMessage(
+ role: role,
+ content: msg.content.text,
+ images: msg.content.extractImages()
+ )
}
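
To see how one request body yields both a text prompt and a list of image URLs, the string-or-array decoding can be reproduced in a few lines. This is an illustrative sketch only: `SketchContent` and `SketchTurn` are hypothetical reductions of `MultimodalContent` and `ChatCompletionRequest.Message`, and `imageURLs` returns raw URL strings rather than decoded `ImageAttachment` values.

```swift
import Foundation

// Hypothetical reduction of the PR's `MultimodalContent`.
enum SketchContent: Decodable {
    case string(String)
    case parts([Part])

    struct Part: Decodable {
        let type: String        // "text" or "image_url"
        let text: String?
        let image_url: ImageURL?
    }
    struct ImageURL: Decodable { let url: String }

    init(from decoder: Decoder) throws {
        let container = try decoder.singleValueContainer()
        // Try the plain-string form first, then the typed-parts array.
        if let s = try? container.decode(String.self) {
            self = .string(s)
        } else {
            self = .parts(try container.decode([Part].self))
        }
    }

    /// Text parts joined with newlines; image parts are skipped.
    var text: String {
        switch self {
        case .string(let s):
            return s
        case .parts(let parts):
            return parts.compactMap { $0.type == "text" ? $0.text : nil }
                .joined(separator: "\n")
        }
    }

    /// Raw URL strings of every `image_url` part, in order.
    var imageURLs: [String] {
        guard case .parts(let parts) = self else { return [] }
        return parts.compactMap { $0.type == "image_url" ? $0.image_url?.url : nil }
    }
}

// Hypothetical reduction of one request message.
struct SketchTurn: Decodable {
    let role: String
    let content: SketchContent
}

let plain = try JSONDecoder().decode(
    SketchTurn.self, from: Data(#"{"role":"user","content":"hi"}"#.utf8))

let multi = try JSONDecoder().decode(SketchTurn.self, from: Data(#"""
{"role":"user","content":[
  {"type":"text","text":"What's this?"},
  {"type":"image_url","image_url":{"url":"data:image/png;base64,aGVsbG8="}}
]}
"""#.utf8))
```

Both shapes end up behind the same `text` accessor, which is why the system-prompt lookup and the message mapping only needed `.content` changed to `.content.text`.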

let params = GenerationParameters(