Problem
The get_collection_terms tool returns common words like "the", "and", and "from", which aren't useful for understanding a collection's topics:
- the (15967 docs)
- and (12878 docs)
- from (11044 docs)
- for (9652 docs)
Solution
Configure stop words in milli using settings.set_stop_words() in open_index().
Considerations
- Milli's stop word API takes a single global set, not per-language lists
- Options:
  - Multilingual stop word set - Combine stop words from common languages (EN, ES, FR, DE, etc.)
  - Per-collection language setting - Let users specify language(s) when creating a collection
  - Auto-detect - Detect languages in documents and build a combined list
- Requires re-indexing existing documents after a configuration change
- Milli uses charabia for multi-language tokenization, but stop words are configured separately
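A rough sketch of the first option (one combined multilingual set), using only the standard library. The word lists are tiny illustrative samples, and `combined_stop_words` is a hypothetical helper, not part of milli. It assumes `settings.set_stop_words()` accepts a `BTreeSet<String>`, which should be checked against the settings.rs file listed under References:

```rust
use std::collections::BTreeSet;

// Tiny illustrative samples only; real lists would come from a maintained
// source such as the stop-words crate mentioned under References.
const EN: &[&str] = &["the", "and", "from", "for"];
const ES: &[&str] = &["el", "la", "de", "y"];
const FR: &[&str] = &["le", "la", "de", "et"];
const DE: &[&str] = &["der", "die", "und", "von"];

/// Hypothetical helper: merge per-language lists into one deduplicated,
/// lowercased set, since milli takes a single global stop word set.
fn combined_stop_words(lists: &[&[&str]]) -> BTreeSet<String> {
    lists
        .iter()
        .flat_map(|list| list.iter())
        .map(|word| word.to_lowercase())
        .collect()
}

fn main() {
    let stop_words = combined_stop_words(&[EN, ES, FR, DE]);
    // In open_index(), this set would then be handed to milli, e.g.
    // settings.set_stop_words(stop_words). The parameter type is an
    // assumption here, not verified against the pinned milli revision.
    println!("{} stop words in the combined set", stop_words.len());
}
```

Words shared between languages (such as "la" and "de" above) collapse into one entry. A combined set also removes words that are stop words in one language but meaningful in another, for example German "die" versus the English verb, which is one argument for the per-collection option.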
References
- Milli stop words API: ~/.cargo/git/checkouts/meilisearch-77f25ebad2a1f1e5/aee74f4/crates/milli/src/update/settings.rs
- Could use the stop-words crate for standard lists