longmemeval#316
Closed
xyf2020 wants to merge 12 commits into
…ate injection
- Add evaluation/longmemeval/ with run.py, config.yaml, and test scripts
- Add reme/config/longmemeval.yaml for evaluation-specific model config
- Add tool_defaults mechanism to as_agent_wrapper for injecting default
tool kwargs (uses setdefault so LLM-provided values take priority)
- Pass tool_defaults={'daily_write': {'date': day}} in auto_memory to
ensure notes always use the correct historical date
- Add timestamp interpolation (_interpolate_timestamps) in auto_memory
for filling missing created_at fields via linear interpolation
- Evaluation pipeline: ingest sessions -> dream -> search -> answer -> judge
- Uses qwen3.6-flash for memory, qwen3.7-max for answer/judge
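The two mechanisms described above (default tool kwargs merged with `setdefault`, and linear interpolation of missing `created_at` values) can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the function names `apply_tool_defaults` and `interpolate_timestamps`, their signatures, and the choice to copy the nearest known value into leading or trailing gaps are all assumptions.

```python
from datetime import datetime
from typing import Any, Optional


def apply_tool_defaults(
    tool_name: str,
    llm_kwargs: dict[str, Any],
    tool_defaults: dict[str, dict[str, Any]],
) -> dict[str, Any]:
    """Merge per-tool defaults into kwargs; LLM-provided values win (hypothetical helper)."""
    merged = dict(llm_kwargs)  # copy so the caller's dict is not mutated
    for key, value in tool_defaults.get(tool_name, {}).items():
        merged.setdefault(key, value)  # only fills keys the LLM left out
    return merged


def interpolate_timestamps(stamps: list[Optional[datetime]]) -> list[Optional[datetime]]:
    """Fill None entries by linear interpolation over list position (hypothetical helper).

    Assumption: gaps before the first or after the last known timestamp
    copy the nearest known value, since there is nothing to interpolate between.
    """
    known = [i for i, t in enumerate(stamps) if t is not None]
    if not known:
        return list(stamps)  # nothing to anchor on; leave untouched
    filled = list(stamps)
    for i, current in enumerate(stamps):
        if current is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = stamps[right]
        elif right is None:
            filled[i] = stamps[left]
        else:
            fraction = (i - left) / (right - left)
            filled[i] = stamps[left] + (stamps[right] - stamps[left]) * fraction
    return filled
```

With `tool_defaults={'daily_write': {'date': day}}`, a call where the model omits `date` receives the historical day, while a call where the model supplies its own `date` is left unchanged.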
…eval runner
- Replace async execution with synchronous + multiprocessing for parallel item evaluation
- Add filter_future_sessions option to only ingest sessions <= question date
- Add question_types filtering in config
- Add result summary with binary accuracy and avg score
- Update config defaults (oracle variant, 50 items, 32 workers)
- Minor code style fixes in agent_wrapper and auto_memory
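A rough sketch of how the runner changes above might fit together: dropping sessions dated after the question, fanning items out over a process pool, and summarizing results. Everything here is an assumption for illustration, including the `Session` type, the `correct`/`score` result keys, and the placeholder `evaluate_item`; the actual `run.py` may be structured differently.

```python
from dataclasses import dataclass
from datetime import date
from multiprocessing import Pool


@dataclass
class Session:
    """Hypothetical session record; the real data model is not shown in the PR."""
    session_id: str
    day: date


def filter_future_sessions(sessions: list[Session], question_date: date) -> list[Session]:
    """Keep only sessions dated on or before the question date."""
    return [s for s in sessions if s.day <= question_date]


def summarize(results: list[dict]) -> dict:
    """Binary accuracy and average score, assuming each result has 'correct' and 'score'."""
    count = len(results)
    if count == 0:
        return {"count": 0, "binary_accuracy": 0.0, "avg_score": 0.0}
    correct = sum(1 for r in results if r["correct"])
    total_score = sum(r["score"] for r in results)
    return {
        "count": count,
        "binary_accuracy": correct / count,
        "avg_score": total_score / count,
    }


def evaluate_item(item: dict) -> dict:
    """Placeholder worker (hypothetical); a real one would ingest, search, answer, judge."""
    return {"correct": False, "score": 0.0}


def run_parallel(items: list[dict], workers: int = 32) -> list[dict]:
    """Evaluate items synchronously inside each worker process."""
    with Pool(processes=workers) as pool:
        return pool.map(evaluate_item, items)
```

The worker is a module-level function so that it can be pickled by `multiprocessing`; callers on platforms that spawn processes would invoke `run_parallel` under an `if __name__ == "__main__":` guard.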
- Add BenchQueryStep using agent_wrapper with search job tool
- Replace manual search+LLM answer in run.py with bench_query_job
- Remove unused answer LLM config from longmemeval.yaml
- Register benchmark step module in steps/__init__.py
- Add _extract_date_from_path to extract validated YYYY-MM-DD from chunk paths
- Add start_date/end_date filtering in _matches_search_filter
- Implement progressive recall in FaissLocalFileStore.vector_search
- Promote start_date/end_date from context to search_filter in SearchStep
- Add start_date/end_date parameters to search job in default.yaml
- Add unit tests for date filter functionality
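The commit does not spell out what "progressive recall" means. One plausible reading, sketched below purely as an assumption, is that a filtered vector search keeps widening its candidate pool until enough results survive the filter. The function `progressive_recall`, its parameters, and the doubling strategy are hypothetical and are not taken from `FaissLocalFileStore`.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def progressive_recall(
    fetch: Callable[[int], list[T]],
    keep: Callable[[T], bool],
    top_k: int,
    max_fetch: int,
    growth: int = 2,
) -> list[T]:
    """Widen the candidate pool until top_k items pass the filter.

    fetch(n) is assumed to return up to n candidates in ranked order;
    keep(item) is the post-search filter (for example a date range check).
    """
    n = top_k
    while True:
        candidates = fetch(n)
        hits = [c for c in candidates if keep(c)]
        exhausted = len(candidates) < n  # the index had fewer than n items
        if len(hits) >= top_k or exhausted or n >= max_fetch:
            return hits[:top_k]
        n = min(n * growth, max_fetch)
```

The point of such a loop is that a restrictive filter applied after a fixed-size nearest-neighbour query can return far fewer than `top_k` results; growing the query size trades extra search cost for recall, with `max_fetch` as the upper bound.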
Address three code-review comments on the time_filter search feature:
1. Validate/normalize start_date and end_date before string comparison.
_matches_search_filter does lexicographic comparison against path_date
(always canonical YYYY-MM-DD). Raw caller values like '2026-2-28' or
'abc' would produce silently wrong results. Now SearchStep normalizes
valid dates via extract_daily_date (with strptime fallback for
non-zero-padded input) and drops invalid dates from the filter, emitting a
logger.warning.
2. Clarify behavior for paths without embedded dates.
Added optional strict_date_filter parameter (default False). When True
and at least one date bound is active, chunks whose path yields no date
(e.g. digest/personal/topic.md) are excluded. When False (default),
the existing behavior is preserved — dateless paths pass through.
3. Harden _extract_date_from_path against non-standard suffixes.
Previously parts[1].split('.')[0] accepted '2026-05-18.anything' as a
valid date. Now only exact 'YYYY-MM-DD' (dir) and 'YYYY-MM-DD.md'
(day-index) forms are accepted.
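The three review fixes above can be illustrated with a small sketch. It is not the PR's implementation: the helper names `normalize_date`, `extract_date_from_path`, and `matches_date_filter` are hypothetical, the `daily/` prefix in the usage example is an assumed layout, and the sketch assumes the date sits in the second path component, as the `parts[1]` mention suggests.

```python
import logging
import re
from datetime import date, datetime
from typing import Optional

logger = logging.getLogger(__name__)
_ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def normalize_date(value: str) -> Optional[str]:
    """Return canonical YYYY-MM-DD, or None (with a warning) if the value is invalid."""
    try:
        # strptime accepts non-zero-padded input such as '2026-2-28'
        return datetime.strptime(value, "%Y-%m-%d").date().isoformat()
    except ValueError:
        logger.warning("Ignoring invalid date bound: %r", value)
        return None


def extract_date_from_path(path: str) -> Optional[str]:
    """Accept only 'YYYY-MM-DD' or 'YYYY-MM-DD.md' as the second path component."""
    parts = path.split("/")
    if len(parts) < 2:
        return None
    token = parts[1]
    if token.endswith(".md"):
        token = token[:-3]
    if not _ISO_DATE.fullmatch(token):
        return None  # rejects forms like '2026-05-18.anything'
    try:
        date.fromisoformat(token)  # rejects impossible dates such as 2026-02-30
    except ValueError:
        return None
    return token


def matches_date_filter(
    path: str,
    start_date: Optional[str] = None,
    end_date: Optional[str] = None,
    strict_date_filter: bool = False,
) -> bool:
    """Compare canonical date strings lexicographically; bounds must be normalized first."""
    if start_date is None and end_date is None:
        return True
    path_date = extract_date_from_path(path)
    if path_date is None:
        return not strict_date_filter  # dateless paths pass unless strict
    if start_date is not None and path_date < start_date:
        return False
    if end_date is not None and path_date > end_date:
        return False
    return True
```

Lexicographic comparison is only safe because both sides are canonical: `'2026-2-28' < '2026-02-01'` is false as a string comparison even though February 28 is later, which is why normalization happens before the filter is applied.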
# Conflicts: # .gitignore
No description provided.