Engine update: Firecrawl Research Index (Tier-1 papers + citation grounding) + model-registry sovereignty #3

Open
Dicoangelo wants to merge 8 commits into main from fix/ucw-timeline-jsonb-drift


@Dicoangelo

Owner

Summary

Updates the ResearchGravity search + LLM engine. Two themes: wire Firecrawl's new Research Index in as the primary paper source, and stop hardcoding Claude model IDs.

Firecrawl Research Index (cpb/search_layer.py)

  • _search_firecrawl_papers: semantically ranked paper search as the primary Tier-1 source, running alongside the raw arXiv client as fallback. Results are deduped by arXiv id across providers. Approximate publish dates, parsed from the YYMM prefix of the arXiv id, feed the existing time-decay + tier-weight scoring.
  • read_paper_passages: pulls the top full-text passages in a cited paper that address a specific claim, backing citation-grounding checks (verify that a paper actually contains a method/result before trusting the citation).
  • Keyless-capable; honors FIRECRAWL_API_KEY for higher rate limits.
  • Docs: https://docs.firecrawl.dev/features/research
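The date parsing and cross-provider dedup described above can be sketched roughly as follows. This is an illustrative sketch, not the code in cpb/search_layer.py: the helper names and the `arxiv_id` result key are hypothetical, and only new-style arXiv ids (YYMM.NNNN or YYMM.NNNNN, in use since April 2007) are handled.

```python
import re
from datetime import date
from typing import Dict, Iterable, List, Optional

# New-style arXiv ids: YYMM.NNNN or YYMM.NNNNN, with an optional "arxiv:" prefix
# and an optional version suffix such as "v5".
_NEW_STYLE_ID = re.compile(r"^(?:arxiv:)?(\d{2})(\d{2})\.(\d{4,5})(?:v\d+)?$", re.IGNORECASE)


def normalize_arxiv_id(raw: str) -> Optional[str]:
    """Return the bare id without prefix or version, or None if it is not a new-style id."""
    match = _NEW_STYLE_ID.match(raw.strip())
    if match is None:
        return None
    yy, mm, number = match.groups()
    if not 1 <= int(mm) <= 12:
        return None
    return f"{yy}{mm}.{number}"


def approx_publish_date(raw: str) -> Optional[date]:
    """Approximate the publish date as the first day of the month encoded in the id."""
    normalized = normalize_arxiv_id(raw)
    if normalized is None:
        return None
    return date(2000 + int(normalized[:2]), int(normalized[2:4]), 1)


def dedupe_by_arxiv_id(results: Iterable[Dict]) -> List[Dict]:
    """Keep the first result per arXiv id; results without a parseable id are kept as-is.

    The "arxiv_id" key is an assumption made for this sketch.
    """
    seen = set()
    unique = []
    for result in results:
        key = normalize_arxiv_id(result.get("arxiv_id") or "")
        if key is not None:
            if key in seen:
                continue
            seen.add(key)
        unique.append(result)
    return unique
```

Because the first occurrence wins, listing Firecrawl results ahead of the raw arXiv results would keep the semantically ranked copy of a paper returned by both providers.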

Model-registry sovereignty (cpb/llm_client.py)

  • Reads Claude model IDs + per-token costs from ~/.claude/config/pricing.json instead of inline literals; hardcoded values kept only as offline fallback. Kills the recurring manual model-id sweep on each release.
  • Swept sonnet-4-6 -> sonnet-5 in the coherence extractors; fixed 6 ruff F541s.
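A registry loader with an offline fallback might look roughly like the sketch below. The JSON schema (a top-level `models` object keyed by tier alias), the function name, and the placeholder fallback values are all assumptions made for illustration; the actual layout of pricing.json and the real fallback values in cpb/llm_client.py are not shown in this PR description.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional

DEFAULT_PRICING_PATH = Path.home() / ".claude" / "config" / "pricing.json"

# Placeholder fallback only: the ids and costs here are NOT real values.
FALLBACK_REGISTRY: Dict[str, Dict[str, Any]] = {
    "sonnet": {"id": "placeholder-sonnet-id", "input_per_token": 0.0, "output_per_token": 0.0},
}


def load_model_registry(
    path: Path = DEFAULT_PRICING_PATH,
    fallback: Optional[Dict[str, Dict[str, Any]]] = None,
) -> Dict[str, Dict[str, Any]]:
    """Load model ids and per-token costs from disk, falling back when unavailable.

    Assumed schema: {"models": {"<alias>": {"id": ..., "input_per_token": ..., ...}}}
    """
    fallback = FALLBACK_REGISTRY if fallback is None else fallback
    try:
        data = json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        # Missing, unreadable, or malformed file: use the hardcoded values.
        return dict(fallback)
    models = data.get("models") if isinstance(data, dict) else None
    if not isinstance(models, dict) or not models:
        return dict(fallback)
    return models
```

Returning the fallback on any read or parse failure keeps the client usable offline, at the cost of silently using stale values; logging the fallback path would make that visible.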

A/B (Firecrawl vs raw arXiv, same queries)

Result counts and latency are comparable. Firecrawl's semantic ranking returns precisely on-topic papers where arXiv keyword search returns broad or tangential surveys. Note: the pre-existing arXiv Tier-1 path returns nothing unless the optional arxiv dep is installed in the runtime env. It was not installed, so Firecrawl is effectively the first functioning paper source in the day-to-day build.

Testing

  • 17/17 cpb tests pass; ruff clean on touched files.
  • Live-verified: paper search, dedup, and passage reading against the real API.

Follow-ups

  • Install arxiv in the MCP runtime env (declared optional in cpb/requirements.txt) to revive the fallback.
  • cpb/router.py:340-343 still returns bare tier aliases; map through the registry when routing is next touched.

Both queries targeted flattened SQLite-style columns (light_topic,
instinct_gut_signal, etc.) but the live Postgres cognitive_events table
stores those fields inside JSONB columns (light_layer, instinct_layer).
Result: 'column "light_topic" does not exist' on every call.

Rewrote both queries to extract from JSONB with aliases that preserve
existing row-access names. Verified against live ucw_cognitive DB.
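The shape of that rewrite can be sketched as below: each flattened name is selected out of its JSONB column with the `->>` operator and aliased back to the old column name, so code that reads rows by name keeps working. The JSONB keys (`topic`, `gut_signal`), the helper name, and the `LIMIT` clause are assumptions made for illustration; the real queries are not shown here.

```python
from typing import Dict, Tuple

# alias -> (JSONB column, key inside the JSONB document). Keys are assumed.
JSONB_FIELDS: Dict[str, Tuple[str, str]] = {
    "light_topic": ("light_layer", "topic"),
    "instinct_gut_signal": ("instinct_layer", "gut_signal"),
}


def build_timeline_query(fields: Dict[str, Tuple[str, str]] = JSONB_FIELDS) -> str:
    """Build a SELECT that extracts JSONB keys as text under the old flat names.

    Identifiers are interpolated directly, so `fields` must only ever contain
    trusted constants, never user input. The row limit stays a bind parameter.
    """
    select_list = ",\n    ".join(
        f"{column}->>'{key}' AS {alias}" for alias, (column, key) in fields.items()
    )
    return f"SELECT\n    {select_list}\nFROM cognitive_events\nLIMIT %s"


if __name__ == "__main__":
    print(build_timeline_query())
```

With a driver such as psycopg, the resulting string would be executed with the limit passed as a parameter, for example `cursor.execute(build_timeline_query(), (50,))`.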
Sweep sonnet-4-6 -> sonnet-5 in coherence_engine extractors, and rewire
cpb/llm_client.py to load Claude model IDs and per-token costs from
~/.claude/config/pricing.json instead of inline literals. Hardcoded values
kept only as offline fallback mirroring the current Claude 5 family. Kills
the recurring manual model-id sweep on each release.

Wire Firecrawl's research-specific paper index into TieredSearchLayer as the
primary Tier-1 source, running alongside the raw arXiv client as fallback.
Semantic relevance scoring, canonical/source id extraction, and approximate
publish dates parsed from the arXiv YYMM id encoding feed the existing
time-decay + tier-weight scoring. Papers deduped by arXiv id across providers.
Keyless-capable; honors FIRECRAWL_API_KEY for higher rate limits.

Docs: https://docs.firecrawl.dev/features/research

Adds read_paper_passages() to TieredSearchLayer, backing citation-grounding
checks: pull the top full-text passages in a cited paper that address a
specific claim before trusting the citation. Accepts canonical paperId or
source ids (e.g. arxiv:1706.03762). Firecrawl-backed, keyless-capable.

Adds opt-in ground_citations() to the verification pipeline: for each arXiv
citation in a response, pull the cited paper's passages via Firecrawl
read-paper and confirm the paper is real + retrievable, attaching the top
passage as evidence. Exposed via verify(ground_citations=True) and surfaced
on VerificationResult (citations_grounded, grounding_evidence). Default off
so the core pipeline stays hermetic; 17/17 cpb tests unchanged.
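
The grounding step described above could be sketched as follows. This is not the pipeline's actual code: the `GroundingReport` class, the function signature, and the injected `read_passages` callable (standing in for the Firecrawl-backed passage reader) are hypothetical, chosen so the sketch runs without network access.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Matches "arXiv:1706.03762" and "arxiv.org/abs/1706.03762v5"; the version is ignored.
_ARXIV_CITATION = re.compile(
    r"(?:arxiv:\s*|arxiv\.org/abs/)(\d{4}\.\d{4,5})", re.IGNORECASE
)


def extract_arxiv_ids(text: str) -> List[str]:
    """Return unique new-style arXiv ids in order of first appearance."""
    ids: List[str] = []
    for match in _ARXIV_CITATION.finditer(text):
        if match.group(1) not in ids:
            ids.append(match.group(1))
    return ids


@dataclass
class GroundingReport:
    """Hypothetical container; field names mirror those mentioned in the PR."""
    citations_grounded: Dict[str, bool] = field(default_factory=dict)
    grounding_evidence: Dict[str, str] = field(default_factory=dict)


def ground_citations(
    response_text: str,
    claim: str,
    read_passages: Callable[[str, str], List[str]],
) -> GroundingReport:
    """Mark each cited paper as grounded if the reader returns at least one passage."""
    report = GroundingReport()
    for arxiv_id in extract_arxiv_ids(response_text):
        try:
            passages = read_passages(f"arxiv:{arxiv_id}", claim)
        except Exception:
            # A failed lookup counts as "not grounded" rather than failing verification.
            passages = []
        report.citations_grounded[arxiv_id] = bool(passages)
        if passages:
            report.grounding_evidence[arxiv_id] = passages[0]
    return report
```

Passing the reader in as a callable keeps the default pipeline hermetic: tests can supply a stub, and only the opt-in path needs the live API.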