feat(scoring): one-shot JSONL scoring + P1 prose fix (#75) #81

Merged: bayrem merged 9 commits into `main` from `feat/75-scoring-simplification` on May 19, 2026

@bjridicodes (Collaborator) commented May 19, 2026
Summary

What changed

| File | Change |
|---|---|
| `agent/nodes/analyze_jobs.py` | Read from JSONL, one code path, write scored JSONL |
| `providers/scoring/llm_scorer.py` | SystemMessage + prose fast-fail + clean retry + 1000-char cap |
| `providers/scoring/hybrid_scorer.py` | Deleted |
| `providers/scoring/static_scorer.py` | Deleted |
| `providers/scoring/profile_store.py` | Deleted |
| `tests/test_analyze_jobs.py` | Updated for new cap, message index, SystemMessage assertion, prose tests |
| `tests/test_hybrid_scorer.py` | Deleted |
| `tests/test_static_scorer.py` | Deleted |
| `tests/test_profile_store.py` | Deleted |

Test plan

  • 205 tests pass (pytest tests/ -v)
  • ruff: no issues
  • mypy: no issues (61 source files)
  • Manual: run `infisical run --env=dev -- python scripts/test_node.py analyze_jobs` with a populated `query/jobs_found.jsonl` to confirm that the JSONL is read, the jobs are scored, and `jobs_scored.jsonl` is written

🤖 Generated with Claude Code

bjridicodes and others added 9 commits May 19, 2026 18:53
…ve call (closes #79)

Replace N keyword queries with one directive LLM call that carries full
context: all target positions, all locations, and company ATS hints. Strict
anti-hallucination rules forbid the LLM from generating URLs from memory or
training data. Capped at 30 results per run.

URL validation now only drops network-unreachable domains (DNS/connection
failure). ATS platforms return HTTP 200 for any path regardless of whether
the job exists, so status codes were not a reliable hallucination signal.
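
A minimal sketch of the "drop only network-unreachable domains" rule described above. It is not the code from this commit: `is_reachable` and `drop_unreachable` are hypothetical names, and only the DNS half is shown (a connection attempt is left out to keep the example offline-testable). The resolver is injectable so the check can be exercised without network access.

```python
import socket
from typing import Callable
from urllib.parse import urlparse


def is_reachable(url: str, resolve: Callable[[str], object] = socket.gethostbyname) -> bool:
    """Return False only when the host cannot be resolved.

    HTTP status codes are deliberately ignored, since ATS platforms may
    answer 200 for any path.
    """
    host = urlparse(url).hostname
    if not host:
        return False  # not a usable absolute URL
    try:
        resolve(host)
    except OSError:  # socket.gaierror is a subclass of OSError
        return False
    return True


def drop_unreachable(urls: list[str], resolve: Callable[[str], object] = socket.gethostbyname) -> list[str]:
    """Keep the URLs whose host resolves; order is preserved."""
    return [u for u in urls if is_reachable(u, resolve)]
```

Passing a fake resolver in tests avoids depending on live DNS.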

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
After the LLM directive call returns URL candidates, run Tavily extract
on every URL. URLs where Tavily returns no content are dropped — they are
hallucinated, stale, or unreachable. URLs that pass have their description
replaced with the real posting content (up to 2000 chars).

LLM now asked for max_results+20 candidates so Tavily filtering doesn't
leave us short of the 30-result target. Removed unreliable HEAD-based URL
validation — Tavily content extraction is the definitive signal.

Degrades gracefully: if TAVILY_API_KEY is not set, Tavily step is skipped
and LLM output is returned as-is.
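
The filtering step above can be sketched roughly as follows. This is an illustration under assumptions, not the merged implementation: `validate_candidates` is a hypothetical name, and `extract` stands in for whatever wrapper calls Tavily (it is assumed to return the page text, or `None`/empty when nothing was extracted). Only `TAVILY_API_KEY` and the 2000-char cap come from the commit message.

```python
import os
from typing import Callable, Optional

MAX_DESCRIPTION_CHARS = 2000  # cap stated in the commit message


def validate_candidates(
    candidates: list[dict],
    extract: Callable[[str], Optional[str]],  # hypothetical extract wrapper
    api_key: Optional[str] = None,
) -> list[dict]:
    """Keep candidates whose URL yields content and swap in the real text."""
    if api_key is None:
        api_key = os.environ.get("TAVILY_API_KEY")
    if not api_key:
        # Graceful degradation: without a key the LLM output is returned as-is.
        return candidates

    validated = []
    for job in candidates:
        content = extract(job["url"])
        if not content:
            continue  # hallucinated, stale, or unreachable
        validated.append({**job, "description": content[:MAX_DESCRIPTION_CHARS]})
    return validated
```

Because some candidates are expected to be dropped here, the caller asks the LLM for more URLs than it ultimately needs.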

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…inct modules

web_search.py: returns URL candidates only ({url, source, found_in_snippet}).
  LLM now returns a URL-only JSON payload — no fabricated descriptions.

url_validator.py (new): Tavily extract validates URLs, drops hallucinated
  or unreachable ones (16/26 dropped in live test), builds job dicts from
  real extracted content + URL-pattern metadata.

search_jobs.py: calls both steps explicitly — search then validate — with
  separate log lines for each. Fixed config path bug (_get_positions and
  locations were reading from wrong key).

Live result: 26 LLM candidates → 10 Tavily-validated → 8 after semantic
dedup. All 8 jobs carry 2000 chars of real extracted posting content.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…Tavily extract

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…zation header

The file had two docstrings concatenated without a closing triple-quote, causing
a syntax error that failed ruff/mypy. Also had duplicate search() and extract()
method definitions from the merge. Moved api_key from request body to
Authorization Bearer header (addresses GitHub Advanced Security flag).
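
For reference, the header change amounts to something like the sketch below. The function name, endpoint URL, and payload keys are placeholders rather than the provider's actual values; the point is only that the key travels in the `Authorization` header and no longer appears in the JSON body.

```python
import json
import urllib.request


def build_request(url: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Build a POST request that carries the key as a Bearer token."""
    body = json.dumps(payload).encode("utf-8")  # the key is not part of the body
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Keeping the secret out of the body also keeps it out of any request-payload logging.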

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…75)

Three changes land together:

1. **Fix P1 (#75)**: `llm_scorer` now prepends a `SystemMessage` before the
   scoring payload so the Claude CLI treats it as a task, not a conversation.
   Added a prose fast-fail (`_is_prose`) that detects non-JSON output by its
   first character, bypassing the 120s timeout entirely. The retry sends a
   clean format-only prompt instead of echoing the broken prose back.

2. **JSONL-based scoring**: `analyze_jobs` reads jobs from
   `query/jobs_found.jsonl` (the checkpoint written by `aggregate_jobs`)
   rather than from LangGraph state. Scored output is written to
   `query/jobs_scored.jsonl`. This makes the scoring step independently
   runnable and the checkpoint file the single source of truth between search
   and scoring.

3. **Remove hybrid/static modes**: `hybrid_scorer.py`, `static_scorer.py`,
   and `profile_store.py` are deleted. `analyze_jobs` now has one code path:
   one LLM call for all jobs via `score_jobs_batch`. The mode-switching
   branches, the profile bootstrap loop, and the borderline escalation logic
   are gone. Issues #13 and #14 were closed as superseded.

   `cv_cache.py` is retained — CV compression is still needed and is
   independent of scoring mode.
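
The first-character check from item 1 might look roughly like this. It is a sketch, not the body of `_is_prose` in `llm_scorer.py`: `is_prose` and `parse_scores` are illustrative names, and the rule assumed here is that a valid reply must start with `[` or `{` after leading whitespace. Under that assumption a reply wrapped in a markdown code fence would also count as prose.

```python
import json
from typing import Any, Optional


def is_prose(reply: str) -> bool:
    """True when the reply cannot be a JSON array or object, judged by its first character."""
    stripped = reply.lstrip()
    return not stripped or stripped[0] not in "[{"


def parse_scores(reply: str) -> Optional[Any]:
    """Return the parsed JSON, or None for prose so the caller can retry at once."""
    if is_prose(reply):
        return None
    return json.loads(reply)
```

A caller that receives `None` can send the format-only retry immediately rather than attempting to parse or repair the prose.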

Description cap in the scoring prompt increased from 600 to 1000 chars to use
Tavily's richer extracted content (now up to 2000 chars per job).
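
A rough sketch of the checkpoint round-trip and the prompt-side cap, assuming one JSON object per line and a `description` field on each job. The helper names (`read_jsonl`, `write_jsonl`, `prompt_view`) are hypothetical and not taken from `analyze_jobs.py`; the cap truncates only the copy sent to the LLM, so the stored description stays intact.

```python
import json
from pathlib import Path

DESCRIPTION_CAP = 1000  # prompt-side cap from this commit (was 600)


def read_jsonl(path: Path) -> list[dict]:
    """Load one JSON object per non-empty line."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def write_jsonl(path: Path, rows: list[dict]) -> None:
    """Write one JSON object per line, overwriting any existing file."""
    with path.open("w", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row, ensure_ascii=False) + "\n")


def prompt_view(job: dict) -> dict:
    """Copy of a job with its description truncated for the scoring prompt."""
    return {**job, "description": job.get("description", "")[:DESCRIPTION_CAP]}
```

With helpers like these, the scoring step needs only the input file on disk, which is what makes it runnable independently of the graph state.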

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@bayrem merged commit a12d620 into main on May 19, 2026
9 checks passed
Development

Successfully merging this pull request may close these issues.

bug(scoring): Claude CLI returns conversational prose instead of JSON — batches skipped