Configure stop words for milli search index #15

@monneyboi

Problem

The get_collection_terms tool returns common words such as "the", "and", and "from", which say nothing about a collection's topics:

- the (15967 docs)
- and (12878 docs)
- from (11044 docs)
- for (9652 docs)

Solution

Configure stop words in milli using settings.set_stop_words() in open_index().
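
To make the intended outcome concrete, here is a minimal, self-contained sketch of the effect a stop word set should have on the term list. It is only an illustration: `drop_stop_words` is a hypothetical helper, not part of milli, and the real fix would exclude these words at indexing time rather than filtering afterwards.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper (not a milli API): drop every term found in
/// `stop_words` from a list of (term, document count) pairs.
fn drop_stop_words(
    terms: &[(&str, u64)],
    stop_words: &BTreeSet<String>,
) -> Vec<(String, u64)> {
    terms
        .iter()
        // Compare case-insensitively by lowercasing the term first.
        .filter(|(term, _)| !stop_words.contains(&term.to_lowercase()))
        .map(|(term, count)| (term.to_string(), *count))
        .collect()
}
```

With a stop word set containing "the", "and", "from", and "for", all four entries listed under Problem would be dropped, leaving only terms that describe the collection.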

Considerations

  • Milli's stop word API takes a single global set, not per-language lists.
  • Options:
    1. Multilingual stop word set - Combine stop words from common languages (EN, ES, FR, DE, etc.) into one set
    2. Per-collection language setting - Let users specify the language(s) when creating a collection
    3. Auto-detect - Detect the languages present in the documents and build a combined list
  • Changing the stop word configuration requires re-indexing existing documents.
  • Milli uses charabia for multi-language tokenization, but stop words are configured separately.
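
Option 1 could look roughly like the sketch below, which merges per-language lists into one global set. The word lists are tiny illustrative samples, not real stop word lists, and `combined_stop_words` is a hypothetical helper name. It returns a `BTreeSet<String>` on the assumption that this is the type `set_stop_words()` expects; that should be checked against the settings.rs file listed under References.

```rust
use std::collections::BTreeSet;

// Tiny illustrative samples only; real lists would be much longer.
const EN: &[&str] = &["the", "and", "from", "for"];
const ES: &[&str] = &["el", "la", "de", "y"];
const FR: &[&str] = &["le", "la", "de", "et"];

/// Hypothetical helper: merge per-language lists into a single
/// lowercased, deduplicated set suitable for one global setting.
fn combined_stop_words(lists: &[&[&str]]) -> BTreeSet<String> {
    lists
        .iter()
        .flat_map(|list| list.iter())
        // Normalize so "The" and "the " collapse into one entry.
        .map(|word| word.trim().to_lowercase())
        .filter(|word| !word.is_empty())
        .collect()
}
```

Calling `combined_stop_words(&[EN, ES, FR])` yields one set in which words shared between languages (here "la" and "de") appear only once. A word that is a stop word in one language but meaningful in another would still be dropped everywhere, which is the main trade-off of this option.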

References

  • Milli stop words API: ~/.cargo/git/checkouts/meilisearch-77f25ebad2a1f1e5/aee74f4/crates/milli/src/update/settings.rs
  • The stop-words crate could supply standard per-language lists
