Engine update: Firecrawl Research Index (Tier-1 papers + citation grounding) + model-registry sovereignty#3
Open
Dicoangelo wants to merge 8 commits into
Both queries targeted flattened SQLite-style columns (light_topic, instinct_gut_signal, etc.) but the live Postgres cognitive_events table stores those fields inside JSONB columns (light_layer, instinct_layer). Result: 'column "light_topic" does not exist' on every call. Rewrote both queries to extract from JSONB with aliases that preserve existing row-access names. Verified against live ucw_cognitive DB.
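The JSONB rewrite described above can be sketched roughly as follows. This is an illustration, not the query from the commit: only the table name, the JSONB columns (light_layer, instinct_layer), and the aliases (light_topic, instinct_gut_signal) come from the message. The JSONB key names "topic" and "gut_signal" and the helper jsonb_select are assumptions.

```python
# Hypothetical mapping: flattened alias -> (JSONB column, key inside it).
# The keys "topic" and "gut_signal" are assumed for illustration.
JSONB_FIELDS = {
    "light_topic": ("light_layer", "topic"),
    "instinct_gut_signal": ("instinct_layer", "gut_signal"),
}


def jsonb_select(table: str, fields: dict[str, tuple[str, str]]) -> str:
    """Build a SELECT that extracts JSONB keys as text and aliases them
    back to the old flattened names, so row["light_topic"] keeps working.

    Only pass trusted, hard-coded identifiers: they are interpolated
    directly into the SQL text.
    """
    columns = ",\n    ".join(
        f"{column}->>'{key}' AS {alias}"
        for alias, (column, key) in fields.items()
    )
    return f"SELECT\n    {columns}\nFROM {table}"


QUERY = jsonb_select("cognitive_events", JSONB_FIELDS)
```

The Postgres ->> operator returns the field as text, which is what keeps the aliased columns drop-in compatible with code that previously read flat text columns.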
Sweep sonnet-4-6 -> sonnet-5 in coherence_engine extractors, and rewire cpb/llm_client.py to load Claude model IDs and per-token costs from ~/.claude/config/pricing.json instead of inline literals. Hardcoded values kept only as offline fallback mirroring the current Claude 5 family. Kills the recurring manual model-id sweep on each release.
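A minimal sketch of the registry-with-fallback pattern this commit describes, under stated assumptions: the pricing.json schema shown here (alias -> model_id plus per-token costs), the function names load_pricing and resolve_model, and the zero-valued costs are all hypothetical placeholders, not the actual file format or real pricing.

```python
import json
from pathlib import Path

# Offline fallback. The entry shape and the zero costs are placeholders
# for illustration only; they are not real pricing.
FALLBACK_PRICING = {
    "sonnet": {
        "model_id": "sonnet-5",
        "input_per_token": 0.0,
        "output_per_token": 0.0,
    },
}

DEFAULT_PATH = Path.home() / ".claude" / "config" / "pricing.json"


def load_pricing(path: Path = DEFAULT_PATH) -> dict:
    """Return the registry from disk, or a copy of the fallback if the
    file is missing, unreadable, or not a non-empty JSON object."""
    try:
        data = json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return dict(FALLBACK_PRICING)
    if isinstance(data, dict) and data:
        return data
    return dict(FALLBACK_PRICING)


def resolve_model(alias: str, registry: dict) -> str:
    """Map a tier alias such as "sonnet" to a concrete model id."""
    entry = registry.get(alias) or FALLBACK_PRICING.get(alias)
    if entry is None:
        raise KeyError(f"unknown model alias: {alias}")
    return entry["model_id"]
```

With this shape, a new model release only requires editing the JSON file; the inline literals are consulted solely when the file cannot be read.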
Wire Firecrawl's research-specific paper index into TieredSearchLayer as the primary Tier-1 source, running alongside the raw arXiv client as fallback. Semantic relevance scoring, canonical/source id extraction, and approximate publish dates parsed from the arXiv YYMM id encoding feed the existing time-decay + tier-weight scoring. Papers deduped by arXiv id across providers. Keyless-capable; honors FIRECRAWL_API_KEY for higher rate limits. Docs: https://docs.firecrawl.dev/features/research
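The date parsing and cross-provider dedup mentioned above could look something like this. It is a sketch, not the code in cpb/search_layer.py: the function names and the "arxiv_id" dictionary key are assumptions. The id formats themselves (new-style YYMM.NNNNN, old-style archive/YYMMNNN) are the standard arXiv schemes.

```python
import re
from datetime import date
from typing import Iterable, Optional

# New-style ids: YYMM.NNNN or YYMM.NNNNN (April 2007 onward).
# Old-style ids: archive/YYMMNNN, e.g. hep-th/9901001.
_NEW_ID = re.compile(r"(?<!\d)(\d{2})(\d{2})\.\d{4,5}(?:v\d+)?$")
_OLD_ID = re.compile(r"[a-z\-]+(?:\.[A-Z]{2})?/(\d{2})(\d{2})\d{3}(?:v\d+)?$")


def approx_publish_date(arxiv_id: str) -> Optional[date]:
    """Return the first day of the submission month encoded in the id,
    or None if the id does not look like an arXiv id."""
    text = arxiv_id.strip()
    new = _NEW_ID.search(text)
    old = None if new else _OLD_ID.search(text)
    match = new or old
    if match is None:
        return None
    yy, mm = int(match.group(1)), int(match.group(2))
    if not 1 <= mm <= 12:
        return None
    # Old-style ids span 1991-2007; new-style ids are all in the 2000s.
    year = 1900 + yy if (old and yy >= 91) else 2000 + yy
    return date(year, mm, 1)


def canonical_id(arxiv_id: str) -> str:
    """Strip an optional "arxiv:" prefix and a version suffix like "v5"."""
    text = arxiv_id.strip()
    if text.lower().startswith("arxiv:"):
        text = text[len("arxiv:"):]
    return re.sub(r"v\d+$", "", text)


def dedupe_by_arxiv_id(papers: Iterable[dict]) -> list[dict]:
    """Keep the first paper seen for each arXiv id, so providers listed
    earlier (e.g. the primary source) win over later ones."""
    seen: set[str] = set()
    unique = []
    for paper in papers:
        key = canonical_id(paper["arxiv_id"])  # hypothetical field name
        if key not in seen:
            seen.add(key)
            unique.append(paper)
    return unique
```

The date is only accurate to the month, which is why the commit calls it approximate; that resolution is usually enough for a time-decay weight.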
Adds read_paper_passages() to TieredSearchLayer, backing citation-grounding checks: pull the top full-text passages in a cited paper that address a specific claim before trusting the citation. Accepts canonical paperId or source ids (e.g. arxiv:1706.03762). Firecrawl-backed, keyless-capable.
Adds opt-in ground_citations() to the verification pipeline: for each arXiv citation in a response, pull the cited paper's passages via Firecrawl read-paper and confirm the paper is real + retrievable, attaching the top passage as evidence. Exposed via verify(ground_citations=True) and surfaced on VerificationResult (citations_grounded, grounding_evidence). Default off so the core pipeline stays hermetic; 17/17 cpb tests unchanged.
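The grounding flow above can be sketched with the passage reader injected as a plain callable, so the example needs no network access. Everything here is an assumption for illustration: the reader signature, the GroundingReport type, and the dict shapes of citations_grounded and grounding_evidence. The real fields live on VerificationResult and may be typed differently. Passing the whole response as the claim is also a simplification.

```python
import re
from dataclasses import dataclass, field
from typing import Callable

# Matches new-style arXiv citations such as "arXiv:1706.03762"
# or "arxiv:2301.00001v2".
ARXIV_CITATION = re.compile(r"arxiv:\s*(\d{4}\.\d{4,5})(?:v\d+)?", re.IGNORECASE)

# Hypothetical reader signature: (source_id, claim) -> list of passages.
PassageReader = Callable[[str, str], list[str]]


@dataclass
class GroundingReport:
    citations_grounded: dict[str, bool] = field(default_factory=dict)
    grounding_evidence: dict[str, str] = field(default_factory=dict)


def ground_citations(response: str, read_passages: PassageReader) -> GroundingReport:
    """Mark each cited arXiv paper as grounded if the reader returns at
    least one passage, and keep the top passage as evidence."""
    report = GroundingReport()
    for match in ARXIV_CITATION.finditer(response):
        source_id = f"arxiv:{match.group(1)}"
        if source_id in report.citations_grounded:
            continue  # already checked this paper
        try:
            passages = read_passages(source_id, response)
        except Exception:
            passages = []  # treat a retrieval failure as "not grounded"
        report.citations_grounded[source_id] = bool(passages)
        if passages:
            report.grounding_evidence[source_id] = passages[0]
    return report
```

Injecting the reader mirrors the opt-in design: the default pipeline never calls it, and tests can substitute a stub that returns canned passages.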
Summary
Updates the ResearchGravity search + LLM engine. Two themes: wire Firecrawl's new Research Index in as the primary paper source, and stop hardcoding Claude model IDs.
Firecrawl Research Index (cpb/search_layer.py)
- _search_firecrawl_papers: semantically ranked paper search as the primary Tier-1 source, running alongside the raw arXiv client as fallback. Deduped by arXiv id across providers. Publish dates parsed from the arXiv YYMM id encoding feed the existing time-decay + tier-weight scoring.
- read_paper_passages: pulls the top full-text passages in a cited paper that address a specific claim, backing citation-grounding checks (verify a paper actually contains a method/result before trusting the citation).
- Keyless-capable; honors FIRECRAWL_API_KEY for higher rate limits.
Model-registry sovereignty (cpb/llm_client.py)
- Loads Claude model IDs and per-token costs from ~/.claude/config/pricing.json instead of inline literals; hardcoded values kept only as offline fallback. Kills the recurring manual model-id sweep on each release.
- Swept sonnet-4-6 -> sonnet-5 in the coherence extractors; fixed 6 ruff F541s.
A/B (Firecrawl vs raw arXiv, same queries)
Comparable count + latency; Firecrawl's semantic ranking returns precisely on-topic papers where arXiv keyword search returns broad/tangential surveys. Note: the pre-existing arXiv Tier-1 path returns nothing unless the optional arxiv dep is installed in the runtime env. It was not, so Firecrawl is effectively the first functioning paper source in the day-to-day build.
Testing
Follow-ups
- Install arxiv in the MCP runtime env (declared optional in cpb/requirements.txt) to revive the fallback.
- cpb/router.py:340-343 still returns bare tier aliases; map through the registry when routing is next touched.