Improved Ollama integration with async + retry support#274
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: de89bf4318
`mesa_llm/memory/st_lt_memory.py` (outdated)

```python
if self.short_term_memory:
    entry = self.short_term_memory.popleft()
    if entry not in evicted:
        evicted.append(entry)
```
**Avoid consolidating when consolidation is disabled**

When consolidation is disabled (`consolidation_capacity=0`, normalized to `None`), this overflow branch still appends the removed entry to `evicted`, which causes `process_step()` to call `_update_long_term_memory()` and issue an LLM summarization anyway. That changes the documented semantics of "discard oldest without consolidation," and it introduces unexpected API calls, cost, and long-term memory mutations for users who intentionally turned consolidation off.
```python
target_tokens = current_tokens * 0.75  # Target 75% of max

for salience, tokens, entry in entries_with_salience:
    if tokens_freed >= (current_tokens - target_tokens):
```
**Enforce token target from `max_tokens`, not current usage**

This computes `target_tokens` as `current_tokens * 0.75`, so one pruning pass removes only 25% of the current footprint rather than driving memory under the configured `max_tokens` threshold. If memory is far over the limit (for example, 3000 tokens with `max_tokens=1000`), a single pass can still leave it well above budget, so the advertised token cap is not actually enforced and prompt size can remain over context constraints.
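A sketch of the suggested fix, deriving the target from the configured budget instead of current usage. Names (`entries_with_salience` as `(salience, tokens, entry)` tuples sorted lowest salience first, the 75% settle point) are assumptions carried over from the snippet above:

```python
def prune_to_budget(entries_with_salience, current_tokens, max_tokens):
    """Illustrative pruning pass; argument shapes are assumed from the review.

    entries_with_salience: iterable of (salience, tokens, entry) tuples,
    lowest salience first, so the least salient entries go first.
    """
    # Derive the target from the configured cap, not from current usage,
    # so a single pass drives memory under max_tokens even when far over it.
    target_tokens = max_tokens * 0.75  # settle at 75% of the cap
    to_free = current_tokens - target_tokens

    tokens_freed = 0.0
    pruned = []
    for salience, tokens, entry in entries_with_salience:
        if tokens_freed >= to_free:
            break
        pruned.append(entry)
        tokens_freed += tokens
    return pruned
```

In the 3000-token example with `max_tokens=1000`, this frees down to roughly 750 tokens in one pass, whereas the `current_tokens * 0.75` formulation would stop at 2250.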
Fixes #214 - Memory Decay & Salience Pruning in STLTMemory
Related to #178 - Expose Token Usage via LiteLLM
Related to #200 - Critical Performance Issues