Merged
10 changes: 8 additions & 2 deletions README.md
@@ -21,7 +21,7 @@ An Apple-first Swift Package family for local document search and semantic retri

SwiftlyFetch is the umbrella product direction for a small family of Apple-first local search packages. The product goal is simple: hand the system a local corpus and get back a real search engine, with conventional search and semantic retrieval both living under one coherent Swift-native story. In practical terms, SwiftlyFetch is the family for "drop in a corpus, get back local search," with `FetchKit` covering conventional full-document search and `RAGKit` covering semantic retrieval over the same broader corpus model.

Today, the package exposes shipped semantic retrieval work through `RAGCore` and `RAGKit`, plus the first conventional-search foundation through `FetchCore` and `FetchKit`. `FetchCore` now owns the portable conventional-search vocabulary, the durable document-record model, and the indexing-changeset boundary. That record model carries first-class typed lifecycle and source fields like `kind`, `language`, `createdAt`, `updatedAt`, `sourceURI`, and `lastIndexedAt`, while leaving the freeform metadata bag string-based. `FetchCore` also distinguishes between the durable stored record, the lean search-facing document view, and the richer index-facing payload used by the sync boundary. `FetchKitLibrary` now supports a default in-memory construction path, and `FetchKit` includes a Core Data-backed `FetchDocumentStore`, a persisted pending-sync queue, and the first thin macOS SearchKit-backed index.
Today, the package exposes shipped semantic retrieval work through `RAGCore` and `RAGKit`, plus the first conventional-search foundation through `FetchCore` and `FetchKit`. `FetchCore` now owns the portable conventional-search vocabulary, the durable document-record model, and the indexing-changeset boundary. That record model carries first-class typed lifecycle and source fields like `kind`, `language`, `createdAt`, `updatedAt`, `sourceURI`, and `lastIndexedAt`, while leaving the freeform metadata bag string-based. `FetchCore` also distinguishes between the durable stored record, the lean search-facing document view, and the richer index-facing payload used by the sync boundary. `FetchKitLibrary` now supports a default in-memory construction path, and `FetchKit` includes a Core Data-backed `FetchDocumentStore`, a persisted pending-sync queue, and the first thin macOS SearchKit-backed index. Conventional-search results also carry field evidence through `matchedFields` and `snippetField`, so UI code can tell whether a result matched title text, body text, or both.
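
As a rough illustration of the record shape described above, the sketch below models typed lifecycle and source fields next to a string-only metadata bag. `SketchDocumentRecord`, its property types, and the `needsIndexing` helper are hypothetical stand-ins invented for this example; they are not the actual `FetchCore` declarations.

```swift
import Foundation

// Hypothetical stand-in for a durable document record.
// Field names follow the README; the types are assumptions.
struct SketchDocumentRecord {
    let id: String
    var title: String
    var body: String

    // Typed lifecycle and source fields.
    var kind: String?
    var language: String?
    var createdAt: Date?
    var updatedAt: Date?
    var sourceURI: URL?
    var lastIndexedAt: Date?

    // Freeform metadata stays string-to-string.
    var metadata: [String: String] = [:]

    // Illustrative helper (an assumption, not a documented API):
    // a record needs indexing if it was never indexed, or if it
    // was updated after the last indexing pass.
    var needsIndexing: Bool {
        guard let indexed = lastIndexedAt else { return true }
        guard let updated = updatedAt else { return false }
        return updated > indexed
    }
}
```

Keeping dates and URLs typed lets helpers like this compare values directly, while the metadata bag stays simple to persist.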

The intended family split is:

@@ -103,8 +103,13 @@ try await library.addDocument(
)

let results = try await library.search("apple guide")
let firstResult = results.first
let matchedFields = firstResult?.matchedFields
let snippetField = firstResult?.snippetField
```

`matchedFields` identifies every indexed field that contributed to a search result. `snippetField` identifies the field used to build the returned snippet. Title-only hits intentionally keep the title as the snippet source, so simple result lists still have an immediate explanation for why the result appeared, while richer UIs can render title evidence differently from body evidence.
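
A minimal sketch of how UI code might use this evidence, assuming `matchedFields` behaves like a set of field identifiers and `snippetField` is a single optional field. `SketchField`, `SketchResult`, and both helper functions are hypothetical names for illustration; the real `FetchKit` types may be shaped differently.

```swift
// Hypothetical stand-ins for the result-evidence types.
enum SketchField: Hashable {
    case title
    case body
}

struct SketchResult {
    let snippet: String
    let matchedFields: Set<SketchField>
    let snippetField: SketchField?
}

// Builds a short label explaining which fields contributed to a hit.
func evidenceLabel(for result: SketchResult) -> String {
    let matchedTitle = result.matchedFields.contains(.title)
    let matchedBody = result.matchedFields.contains(.body)
    switch (matchedTitle, matchedBody) {
    case (true, true): return "Matched title and body"
    case (true, false): return "Matched title"
    case (false, true): return "Matched body"
    case (false, false): return "No field evidence"
    }
}

// True only when the snippet text came from the body, so a richer UI
// can avoid presenting a title snippet as body evidence.
func snippetIsBodyEvidence(_ result: SketchResult) -> Bool {
    result.snippetField == .body
}
```

A simple list can ignore both helpers and render `snippet` directly; a richer view might show the label as a badge and style title snippets differently.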

On macOS, the persistent conventional-search surface is now also shaped around one library storage location instead of separate store and index URLs:

```swift
@@ -139,6 +144,7 @@ Current defaults:
- markdown images keep alt text primary in chunk text while recording image references as chunk metadata, and whitelisted HTML blocks currently cover `img` plus `details` / `summary`
- markdown fallback is selective: ordinary supported prose still chunks normally, but policy-rejected markdown like unsupported raw-HTML-only or reference-definition-only content does not fall back through the plain paragraph chunker
- conventional search now uses modest field-aware ranking, prefers title hits over body-only hits when both are relevant, and builds query-aware snippets with multi-term highlights instead of a single fixed-width first-term window
- conventional-search results report `matchedFields` and `snippetField`, keeping title-only snippets visible while letting consumers distinguish title evidence from body evidence
- `makeContext(...)` suppresses redundant same-document chunk text, groups annotated output by document, and skips annotated sections that only have room for labels

Supported today:
@@ -149,7 +155,7 @@ Supported today:
- use `FetchKitLibrary()` with a default in-memory backend or inject custom `FetchDocumentStore` and `FetchIndex` implementations explicitly
- use a real Core Data-backed `FetchDocumentStore` in `FetchKit` with the first thin macOS SearchKit index backend
- persist and retry pending index-sync work through `FetchKitLibrary.pendingIndexSyncs()` and `retryPendingIndexSyncs(...)`
- return conventional-search results with query-aware snippets and field-aware ranking across title and body matches
- return conventional-search results with query-aware snippets, field-aware ranking, matched-field metadata, and snippet-source metadata across title and body matches
- narrow retrieval with typed metadata filters
- preserve meaningful markdown structure for retrieval, including heading paths, list semantics, quote-heavy documents, code-heavy documents, short section breaks, images, and a narrow raw-HTML whitelist
- turn ranked search results into plain or annotated context text for downstream UI or model consumers
5 changes: 3 additions & 2 deletions ROADMAP.md
@@ -182,8 +182,9 @@ In Progress

- [x] Refine ranking behavior for conventional search so the first SearchKit backend feels less like a raw index adapter and more like a library product.
- [x] Improve snippet behavior and result presentation without bloating `FetchCore` into a larger query or rendering DSL.
- [ ] Audit real-corpus result quality now that field-aware ranking, phrase weighting, truncation cues, and multi-term snippets are in place.
- [ ] Decide whether title-only hits should suppress body snippets or use a different presentation policy in the public facade.
- [x] Add the first checked-in fixture corpus and cover title/body result-evidence behavior across the default in-memory path and the macOS SearchKit-backed path.
- [x] Decide that title-only hits should keep title snippets while exposing `matchedFields` and `snippetField` so consumers can distinguish title evidence from body evidence.
- [ ] Audit broader real-corpus result quality now that field-aware ranking, phrase weighting, truncation cues, multi-term snippets, and field-evidence metadata are in place.
- [ ] Keep the persistent `FetchKitLibrary` construction and search API surface under review as real callers exercise the current design.
- [ ] Explore an opt-in extended snippet surface that can use idle time to precompute short document summaries for larger records, with Apple's [`FoundationModels`](https://developer.apple.com/documentation/foundationmodels) or another local summarization path as the first candidate instead of making foreground full-text search wait on summarization.

5 changes: 4 additions & 1 deletion docs/maintainers/fetchkit-product-plan.md
@@ -83,6 +83,8 @@ Current status:
- because the Search Kit tests now finish quickly and pass reliably, the repo now runs them in ordinary local validation and the default GitHub macOS CI lane instead of keeping them behind a local-only opt-in gate
- the persistent `FetchKitLibrary` construction path is now intentionally caller-shaped around one storage location, with an Application Support default plus a direct directory override, instead of asking app code to assemble separate Core Data and Search Kit URLs itself
- the first refinement pass on conventional-search result quality is now in place: SearchKit scores are normalized per field, title hits get a modest weight bump, cross-field matches accumulate instead of collapsing to the single best field, and snippets now highlight multiple query terms instead of showing only the first term in a fixed-width window
- title-only hits intentionally keep a title snippet, and `FetchSearchResult` now reports `matchedFields` plus `snippetField` so consumers can distinguish title evidence from body evidence without losing the simple "why did this result appear?" explanation
- the first checked-in fixture corpus now covers both the default in-memory index path and the macOS SearchKit-backed path, using a tiny attributed Project Gutenberg sample from Hugging Face instead of making CI download a live dataset
- the CI investigation on GitHub-hosted macOS found that the Core Data-backed store path could abort under Swift Testing with `Incorrect actor executor assumption`, even after global test parallelism was disabled
- that investigation surfaced two store-shape fixes worth keeping regardless of the runner: the durable Core Data store should use a private-queue background context instead of `viewContext`, and it should use Core Data's async `perform` API directly instead of manually bridging context work through checked continuations
- the Core Data-backed store coverage now lives on XCTest rather than Swift Testing so the package keeps the newer test surface where it is stable while reserving the older runner for framework-heavy Core Data verification
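
The ranking refinement described in the status list above (per-field normalization, a modest title weight, and accumulation across fields) can be sketched roughly as follows. This is an illustration under stated assumptions, not the shipped implementation: `SketchFieldScores`, max-based normalization, and the `1.25` title weight are all hypothetical choices made for the example.

```swift
// Hypothetical raw scores for one query, keyed by document ID.
struct SketchFieldScores {
    var title: [String: Double]
    var body: [String: Double]
}

// Scales one field's scores into 0...1 by dividing by that field's
// maximum (an assumed normalization). Fields with no positive
// score contribute nothing.
func normalize(_ scores: [String: Double]) -> [String: Double] {
    guard let maxScore = scores.values.max(), maxScore > 0 else { return [:] }
    return scores.mapValues { $0 / maxScore }
}

// Sums normalized field scores per document, with a modest title
// bump, so cross-field matches accumulate instead of collapsing
// to the single best field.
func combinedScores(_ raw: SketchFieldScores, titleWeight: Double = 1.25) -> [String: Double] {
    var combined: [String: Double] = [:]
    for (id, score) in normalize(raw.title) {
        combined[id, default: 0] += score * titleWeight
    }
    for (id, score) in normalize(raw.body) {
        combined[id, default: 0] += score
    }
    return combined
}
```

With this shape, a document that matches in both title and body can outrank a document with only the strongest title match, which is the accumulation behavior the status note describes.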
@@ -160,7 +162,8 @@ The next work is refinement, not first architecture:

- keep the persistent `FetchKitLibrary` surface polished as real callers exercise it
- keep the SearchKit-backed path inside ordinary validation unless a future framework regression forces it back out
- decide whether the current ranking and snippet heuristics are already enough for ordinary callers or whether real corpora show a need for another refinement pass
- use broader fixture corpora or real app corpora to decide whether the current ranking, snippet, and result-evidence heuristics are already enough for ordinary callers
- explore opt-in extended snippets later as background summary metadata for larger documents, not as work that foreground full-text search has to perform before returning results

## First Core Data Entity Shape

19 changes: 15 additions & 4 deletions docs/maintainers/fixture-corpus.md
@@ -4,7 +4,7 @@

This note records the first checked-in fixture corpus used for `FetchKit` conventional-search quality tests.

The job of this fixture is deliberately narrow: give the default `FetchKitLibrary` tests enough title/body variety to characterize ranking and snippet behavior without making local or hosted CI download a dataset.
The job of this fixture is deliberately narrow: give the default `FetchKitLibrary` and macOS SearchKit tests enough title/body variety to characterize ranking, snippet, and result-evidence behavior without making local or hosted CI download a dataset.

## Current Fixture Source

@@ -20,6 +20,17 @@ Why this source fits the first pass:

The fixture records live in `Tests/FetchKitTests/Fixtures/GutenbergMiniCorpus.swift`. Each record carries dataset, config, split, row, and Gutenberg ID metadata so the sample remains attributable and replaceable.

## Result Evidence Policy

The first fixture pass settled the title-only snippet policy for the current public surface:

- keep title snippets for title-only hits
- report all contributing fields through `FetchSearchResult.matchedFields`
- report the field used for the returned snippet through `FetchSearchResult.snippetField`
- cover the same title/body expectations in both the default in-memory index path and the macOS SearchKit-backed path

In practical terms, simple result lists can keep rendering a snippet for every explained hit, while richer consumers can avoid treating a title snippet as body evidence.
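
The policy above can be sketched from the producing side as a small decision function. `EvidenceField`, `Evidence`, and `makeEvidence` are hypothetical names for this example, and preferring the body as the snippet source when both fields match is an assumption; the note only settles the title-only case.

```swift
// Hypothetical stand-ins for the evidence reported on a result.
enum EvidenceField: Hashable {
    case title
    case body
}

struct Evidence {
    let matchedFields: Set<EvidenceField>
    let snippetField: EvidenceField?
}

// Derives result evidence from per-field match flags.
func makeEvidence(titleMatched: Bool, bodyMatched: Bool) -> Evidence {
    // Report every contributing field.
    var matched: Set<EvidenceField> = []
    if titleMatched { matched.insert(.title) }
    if bodyMatched { matched.insert(.body) }

    let snippetField: EvidenceField?
    if bodyMatched {
        // Assumption: when the body matched, build the snippet from it.
        snippetField = .body
    } else if titleMatched {
        // Settled policy: title-only hits keep a title snippet.
        snippetField = .title
    } else {
        snippetField = nil
    }
    return Evidence(matchedFields: matched, snippetField: snippetField)
}
```

A fixture test in this style could assert the same expectations against both the in-memory path and the SearchKit-backed path.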

## Hugging Face Dependency Boundary

Do not add a Hugging Face Swift dependency for the default fixture lane yet. The current checked-in fixture keeps CI deterministic and avoids adding a network, token, cache, or package-resolution requirement to ordinary tests.
@@ -42,8 +53,8 @@ Hugging Face documents dataset parquet discovery through the Dataset Viewer serv

## Next Use

Use this fixture to settle the remaining Milestone 4 questions:
Use this fixture to keep the settled Milestone 4 result-evidence behavior honest while broader quality work continues:

- whether the current ranking and snippet heuristics are enough for ordinary app callers
- whether title-only hits should keep using title snippets, suppress snippets, or grow a different presentation policy in the public facade
- whether the first fixture corpus should also cover the macOS SearchKit-backed path directly, or whether the existing SearchKit tests plus the default-library corpus tests are enough for now
- whether a larger fixture corpus exposes ranking or snippet gaps that the mini corpus cannot show
- whether future extended snippets should be backed by precomputed summaries for larger documents rather than by foreground search-time work