Engine update: Firecrawl Research Index (Tier-1 papers + citation grounding) + model-registry sovereignty by Dicoangelo · Pull Request #3 · Dicoangelo/ResearchGravity

Dicoangelo · 2026-07-01T04:22:50Z

Summary

Updates the ResearchGravity search + LLM engine. Two themes: wire Firecrawl's new Research Index in as the primary paper source, and stop hardcoding Claude model IDs.

Firecrawl Research Index (`cpb/search_layer.py`)

- `_search_firecrawl_papers`: semantically ranked paper search, now the primary Tier-1 source; the raw arXiv client runs alongside it as fallback. Results are deduped by arXiv id across providers. Approximate publish dates, parsed from the YYMM prefix of the arXiv id, feed the existing time-decay + tier-weight scoring.
- `read_paper_passages`: pulls the top full-text passages in a cited paper that address a specific claim, backing citation-grounding checks (verify that a paper actually contains a method or result before trusting the citation).
- Keyless-capable; honors `FIRECRAWL_API_KEY` for higher rate limits.
- Docs: https://docs.firecrawl.dev/features/research
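The date parsing and cross-provider dedup described above could look roughly like the sketch below. It is not the code in `cpb/search_layer.py`: the helper names and the `arxiv_id` dictionary key are assumptions made for illustration, and only new-style arXiv ids (YYMM.NNNNN) are handled.

```python
import re
from datetime import date
from typing import Dict, Iterable, List, Optional

# New-style arXiv ids: YYMM.NNNN or YYMM.NNNNN, optional "arxiv:" prefix and version suffix.
_ARXIV_ID = re.compile(r"^(?:arxiv:)?(\d{2})(\d{2})\.(\d{4,5})(?:v\d+)?$", re.IGNORECASE)


def normalize_arxiv_id(raw: str) -> Optional[str]:
    """Strip prefix and version so the same paper compares equal across providers."""
    m = _ARXIV_ID.match(raw.strip())
    return f"{m.group(1)}{m.group(2)}.{m.group(3)}" if m else None


def approx_publish_date(raw: str) -> Optional[date]:
    """Approximate a publish date (first of the month) from the YYMM id prefix."""
    m = _ARXIV_ID.match(raw.strip())
    if m is None:
        return None
    year, month = 2000 + int(m.group(1)), int(m.group(2))
    if not 1 <= month <= 12:
        return None
    return date(year, month, 1)


def dedupe_by_arxiv_id(papers: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first paper seen for each arXiv id; keep papers without a usable id."""
    seen = set()
    unique = []
    for paper in papers:
        key = normalize_arxiv_id(paper.get("arxiv_id") or "")
        if key is not None:
            if key in seen:
                continue
            seen.add(key)
        unique.append(paper)
    return unique
```

Because the first occurrence wins, listing the primary provider's results ahead of the fallback's keeps the primary copy of any paper both return.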

Model-registry sovereignty (`cpb/llm_client.py`)

- Reads Claude model IDs and per-token costs from `~/.claude/config/pricing.json` instead of inline literals; hardcoded values are kept only as an offline fallback. This removes the recurring manual model-id sweep on each release.
- Swept sonnet-4-6 -> sonnet-5 in the coherence extractors; fixed 6 ruff F541 violations.
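A minimal sketch of the registry-with-fallback pattern follows. The layout of `pricing.json` is not shown in this PR, so the `models` key, the per-entry field names, and the placeholder fallback values are all assumptions; the function names are hypothetical as well.

```python
import json
from pathlib import Path
from typing import Any, Dict, Union

DEFAULT_PRICING_PATH = Path.home() / ".claude" / "config" / "pricing.json"

# Hypothetical offline fallback. The id and costs are placeholders, not real values.
FALLBACK_REGISTRY: Dict[str, Dict[str, Any]] = {
    "sonnet": {
        "model_id": "example-sonnet-id",
        "input_cost_per_token": 0.0,
        "output_cost_per_token": 0.0,
    },
}


def load_model_registry(path: Union[str, Path] = DEFAULT_PRICING_PATH) -> Dict[str, Dict[str, Any]]:
    """Load the registry from disk; fall back to literals if it is missing or malformed."""
    try:
        data = json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, ValueError):  # unreadable file, bad encoding, or invalid JSON
        return dict(FALLBACK_REGISTRY)
    models = data.get("models") if isinstance(data, dict) else None  # assumed schema
    if not isinstance(models, dict) or not models:
        return dict(FALLBACK_REGISTRY)
    return models


def resolve_model_id(alias: str, registry: Dict[str, Dict[str, Any]]) -> str:
    """Map a tier alias such as "sonnet" to a concrete model id."""
    entry = registry.get(alias) or FALLBACK_REGISTRY.get(alias)
    if entry is None:
        raise KeyError(f"unknown model alias: {alias}")
    return entry["model_id"]
```

Keeping the fallback in one dictionary means a release only touches the registry file, and the literals matter only when that file cannot be read.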

A/B (Firecrawl vs raw arXiv, same queries)

Result counts and latency were comparable. Firecrawl's semantic ranking returned on-topic papers where arXiv keyword search returned broad or tangential surveys. Note: the pre-existing arXiv Tier-1 path returns nothing unless the optional arxiv dependency is installed in the runtime env. It was not installed, so Firecrawl is effectively the first functioning paper source in the day-to-day build.

Testing

- 17/17 cpb tests pass; ruff clean on touched files.
- Live-verified: paper search, dedup, and passage reading against the real API.

Follow-ups

- Install arxiv in the MCP runtime env (declared optional in cpb/requirements.txt) to revive the fallback.
- cpb/router.py:340-343 still returns bare tier aliases; map through the registry when routing is next touched.

Both queries targeted flattened SQLite-style columns (light_topic, instinct_gut_signal, etc.), but the live Postgres cognitive_events table stores those fields inside JSONB columns (light_layer, instinct_layer). Result: 'column "light_topic" does not exist' on every call. Rewrote both queries to extract from JSONB, with aliases that preserve the existing row-access names. Verified against the live ucw_cognitive DB.
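The aliasing approach can be sketched as follows. The JSONB key names (`topic`, `gut_signal`) are guesses derived from the old column names, and `build_select` is a hypothetical helper; the actual queries in this PR may differ. It relies on the PostgreSQL `->>` operator, which extracts a JSONB field as text.

```python
from typing import Dict, Tuple

# Legacy flat column name -> (JSONB column, key inside it). The keys are assumptions.
JSONB_FIELDS: Dict[str, Tuple[str, str]] = {
    "light_topic": ("light_layer", "topic"),
    "instinct_gut_signal": ("instinct_layer", "gut_signal"),
}


def build_select(table: str, fields: Dict[str, Tuple[str, str]]) -> str:
    """Build a SELECT that reads JSONB keys and aliases them to the legacy names.

    Identifiers are interpolated directly, so they must be trusted constants,
    never user input.
    """
    columns = ",\n    ".join(
        f"{layer}->>'{key}' AS {alias}" for alias, (layer, key) in fields.items()
    )
    return f"SELECT\n    {columns}\nFROM {table}"


query = build_select("cognitive_events", JSONB_FIELDS)
```

Since each expression is aliased back to the old column name, code that reads rows by name (for example `row["light_topic"]`) does not need to change.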

… per registry

Sweep sonnet-4-6 -> sonnet-5 in coherence_engine extractors, and rewire cpb/llm_client.py to load Claude model IDs and per-token costs from ~/.claude/config/pricing.json instead of inline literals. Hardcoded values kept only as offline fallback mirroring the current Claude 5 family. Kills the recurring manual model-id sweep on each release.

Wire Firecrawl's research-specific paper index into TieredSearchLayer as the primary Tier-1 source, running alongside the raw arXiv client as fallback. Semantic relevance scoring, canonical/source id extraction, and approximate publish dates parsed from the arXiv YYMM id encoding feed the existing time-decay + tier-weight scoring. Papers deduped by arXiv id across providers. Keyless-capable; honors FIRECRAWL_API_KEY for higher rate limits. Docs: https://docs.firecrawl.dev/features/research
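To illustrate how a relevance score, a publish date, and a tier weight might combine, here is a hedged sketch. The existing scoring function is not shown in this PR, so the exponential half-life form, the 365-day default, and the tier weights below are all assumptions rather than ResearchGravity's real values.

```python
from datetime import date

# Hypothetical tier weights; the real values live in the existing scoring code.
TIER_WEIGHTS = {1: 1.0, 2: 0.7, 3: 0.4}


def time_decay(published: date, today: date, half_life_days: float = 365.0) -> float:
    """Exponential decay: the factor halves every `half_life_days` of age."""
    age_days = max((today - published).days, 0)  # future dates count as age 0
    return 0.5 ** (age_days / half_life_days)


def score_paper(relevance: float, published: date, tier: int, today: date) -> float:
    """Combine semantic relevance with recency and source tier."""
    return relevance * TIER_WEIGHTS[tier] * time_decay(published, today)
```

With dates approximated from the YYMM id prefix, the decay is accurate only to about a month, which is small relative to a half-life measured in months or years.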

Adds read_paper_passages() to TieredSearchLayer, backing citation-grounding checks: pull the top full-text passages in a cited paper that address a specific claim before trusting the citation. Accepts canonical paperId or source ids (e.g. arxiv:1706.03762). Firecrawl-backed, keyless-capable.

chatgpt-codex-connector · 2026-07-01T04:22:55Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

Adds opt-in ground_citations() to the verification pipeline: for each arXiv citation in a response, pull the cited paper's passages via Firecrawl read-paper and confirm the paper is real + retrievable, attaching the top passage as evidence. Exposed via verify(ground_citations=True) and surfaced on VerificationResult (citations_grounded, grounding_evidence). Default off so the core pipeline stays hermetic; 17/17 cpb tests unchanged.
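The grounding pass described in this commit can be sketched as below. The passage reader is injected as a plain callable so the example runs without network access; its signature, the citation regex, and the dictionary types of the two result fields are assumptions, not the actual `cpb` API.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Matches citations written like "arXiv:1706.03762" or "arxiv:1706.03762v5".
ARXIV_CITATION = re.compile(r"\barxiv:\s*(\d{4}\.\d{4,5})(?:v\d+)?", re.IGNORECASE)


@dataclass
class GroundingResult:
    # Field names follow the PR text; the dict types are assumed.
    citations_grounded: Dict[str, bool] = field(default_factory=dict)
    grounding_evidence: Dict[str, str] = field(default_factory=dict)


def ground_citations(
    response_text: str,
    read_passages: Callable[[str, str], List[str]],  # (paper id, claim) -> passages
) -> GroundingResult:
    """Mark each cited arXiv paper as grounded if passages can be retrieved for it."""
    result = GroundingResult()
    # dict.fromkeys removes repeated citations while preserving order.
    for arxiv_id in dict.fromkeys(ARXIV_CITATION.findall(response_text)):
        try:
            passages = read_passages(f"arxiv:{arxiv_id}", response_text)
        except Exception:
            passages = []  # a failed lookup counts as ungrounded, not as a crash
        result.citations_grounded[arxiv_id] = bool(passages)
        if passages:
            result.grounding_evidence[arxiv_id] = passages[0]
    return result
```

Passing the reader in as a parameter is one way to keep the default pipeline hermetic: tests can supply a stub, and only the opt-in path needs the real API.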

Dicoangelo added 7 commits June 18, 2026 20:31

chore: ignore local AI context files

e7625e3

chore(model-ids): sweep deprecated claude-opus-4-6 -> claude-opus-4-8…

4eddae6

… per registry

style: remove 6 extraneous f-string prefixes (ruff F541)

e64dfb0

Dicoangelo wants to merge 8 commits into main from fix/ucw-timeline-jsonb-drift
