6 changes: 3 additions & 3 deletions .github/copilot-instructions.md
@@ -37,9 +37,9 @@ source .venv/bin/activate && pytest tests/ -x -q && ruff check --fix . && ruff f
| `app.py` | Streamlit UI: CV upload → profile → search → evaluate → display |
| `cv_parser.py` | Extract text from PDF/DOCX/MD/TXT |
| `llm.py` | Gemini API wrapper with retry/backoff |
-| `search_agent.py` | Generate search queries (LLM) + orchestrate search |
-| `search_provider.py` | `SearchProvider` protocol + `get_provider()` factory |
-| `bundesagentur.py` | Bundesagentur für Arbeit API provider (default) |
+| `search_api/search_agent.py` | Generate search queries (LLM) + orchestrate search |
+| `search_api/search_provider.py` | `SearchProvider` protocol + `get_provider()` factory |
+| `search_api/bundesagentur.py` | Bundesagentur für Arbeit API provider (default) |
| `evaluator_agent.py` | Score jobs against profile (LLM) + career summary |
| `models.py` | All Pydantic schemas (`CandidateProfile`, `JobListing`, etc.) |
| `cache.py` | JSON file cache in `.immermatch_cache/` |
3 changes: 2 additions & 1 deletion .github/prompts/pr-review.prompt.md
@@ -2,7 +2,7 @@ Fetch and address review comments from the most recent PR on the current branch.

## Execution policy

-- Run all `gh` commands (or equivalent GitHub MCP calls) immediately without asking for confirmation.
+- Run all `gh` commands (or equivalent GitHub MCP calls) immediately without asking for confirmation. Prefer MCP calls for efficiency when possible.
- Do **not** start code edits until after presenting a full comment assessment and getting explicit user confirmation.

## Workflow
@@ -17,6 +17,7 @@ Fetch and address review comments from the most recent PR on the current branch.
```
3. **List all comments first (no edits yet):**
- Produce a complete checklist of every review comment.
+- Make it look like a pretty table or bulleted list for easy reading.
- For each item include:
- **Assessment:** valid / duplicate / not applicable
- **Suggestion:** exact fix you plan to apply (or why you will skip)
4 changes: 2 additions & 2 deletions .github/prompts/write-tests.prompt.md
@@ -6,8 +6,8 @@ When writing tests for a module in `immermatch/`:
- Gemini: `@patch("immermatch.<module>.call_gemini")`
- Supabase: `@patch("immermatch.db.get_admin_client")`
- Resend: `@patch("immermatch.emailer.resend")`
-- SerpApi: `@patch("immermatch.serpapi_provider.serpapi_search")`
-- Bundesagentur: `@patch("immermatch.bundesagentur.requests.get")`
+- SerpApi: `@patch("immermatch.search_api.serpapi_provider.GoogleSearch.get_dict")`
+- Bundesagentur: `@patch("immermatch.search_api.bundesagentur.httpx.Client.get")`
4. **Use shared fixtures** from `tests/conftest.py`:
- `sample_profile` — `CandidateProfile` with work history
- `sample_job` — `JobListing` with apply options
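
The patch targets above follow the usual `unittest.mock` pattern of replacing the HTTP call at the place where it is looked up. A minimal, self-contained sketch of that pattern is shown here. `FakeClient`, `fetch_jobs`, the URL, and the `q`/`jobs` field names are hypothetical stand-ins rather than code from `immermatch`, chosen so the example runs without `httpx` installed.

```python
from unittest.mock import patch


class FakeClient:
    """Hypothetical stand-in for an HTTP client such as httpx.Client."""

    def get(self, url, params=None):
        raise RuntimeError("real network access is not allowed in tests")


def fetch_jobs(client, keyword):
    """Hypothetical provider helper: one GET request, returns the job list."""
    response = client.get("https://example.invalid/jobs", params={"q": keyword})
    return response.json().get("jobs", [])


def test_fetch_jobs_uses_mocked_get():
    # Patch the method on the class so every instance sees the mock.
    with patch.object(FakeClient, "get") as mock_get:
        mock_get.return_value.json.return_value = {"jobs": [{"title": "Python Developer"}]}
        jobs = fetch_jobs(FakeClient(), "python")

    # The helper returned the canned payload and made exactly one request.
    assert jobs == [{"title": "Python Developer"}]
    mock_get.assert_called_once_with("https://example.invalid/jobs", params={"q": "python"})
    return jobs
```

Patching at class level means the test does not need access to the client instance that the code under test creates internally, which is presumably why the targets above name `httpx.Client.get` rather than a module-level function.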
20 changes: 10 additions & 10 deletions AGENTS.md
@@ -14,7 +14,7 @@ This document defines the persona, context, and instruction sets for the AI agen
**Input:** Raw text extracted from a CV (PDF, DOCX, Markdown, or plain text).
**Output:** A structured JSON summary of the candidate.

-**System Prompt:** *(source of truth: `immermatch/search_agent.py:PROFILER_SYSTEM_PROMPT`)*
+**System Prompt:** *(source of truth: `immermatch/search_api/search_agent.py:PROFILER_SYSTEM_PROMPT`)*
> You are an expert technical recruiter with deep knowledge of European job markets.
> You will be given the raw text of a candidate's CV. Extract a comprehensive profile.
>
@@ -73,7 +73,7 @@ The system prompt is selected based on the active **SearchProvider**:

Used when `provider.name == "Bundesagentur für Arbeit"`. Generates keyword-only queries (no location tokens) because the BA API has a dedicated `wo` parameter for location filtering.

-**System Prompt:** *(source of truth: `immermatch/search_agent.py:BA_HEADHUNTER_SYSTEM_PROMPT`)*
+**System Prompt:** *(source of truth: `immermatch/search_api/search_agent.py:BA_HEADHUNTER_SYSTEM_PROMPT`)*
> You are a Search Specialist generating keyword queries for the German Federal Employment Agency job search API (Bundesagentur für Arbeit).
>
> Based on the candidate's profile, generate distinct keyword queries to find relevant job openings. The API searches across German job listings and handles location filtering separately.
@@ -94,7 +94,7 @@ Used when `provider.name == "Bundesagentur für Arbeit"`. Generates keyword-only

Used when `provider.name != "Bundesagentur für Arbeit"` (e.g., SerpApiProvider for non-German markets). Generates location-enriched queries optimised for Google Jobs.
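
The two selection rules (BA provider versus any other provider) can be sketched as a single comparison on `provider.name`. This is an illustrative sketch only: `select_system_prompt` and `StubProvider` are hypothetical names, and the prompt constants are placeholder strings standing in for the real texts in `search_agent.py`.

```python
from dataclasses import dataclass

# Placeholder strings; the real prompt texts live in search_agent.py.
BA_HEADHUNTER_SYSTEM_PROMPT = "<keyword-only prompt for the BA API>"
HEADHUNTER_SYSTEM_PROMPT = "<location-enriched prompt for Google Jobs>"

BA_PROVIDER_NAME = "Bundesagentur für Arbeit"


@dataclass
class StubProvider:
    """Hypothetical provider object exposing only the `name` attribute."""

    name: str


def select_system_prompt(provider) -> str:
    """Return the BA prompt for the BA provider, otherwise the generic one."""
    if provider.name == BA_PROVIDER_NAME:
        return BA_HEADHUNTER_SYSTEM_PROMPT
    return HEADHUNTER_SYSTEM_PROMPT
```

Because the comparison is an exact string match, any provider whose `name` differs from the BA name, including a combined provider, would fall through to the location-enriched prompt in this sketch.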

-**System Prompt:** *(source of truth: `immermatch/search_agent.py:HEADHUNTER_SYSTEM_PROMPT`)*
+**System Prompt:** *(source of truth: `immermatch/search_api/search_agent.py:HEADHUNTER_SYSTEM_PROMPT`)*
> You are a Search Specialist. Based on the candidate's profile and location, generate 20 distinct search queries to find relevant job openings.
>
> IMPORTANT: Keep queries SHORT and SIMPLE (1-3 words). Google Jobs works best with simple, broad queries.
@@ -109,7 +109,7 @@ Used when `provider.name != "Bundesagentur für Arbeit"` (e.g., SerpApiProvider

**Search Provider Architecture:**

-The search pipeline uses a pluggable `SearchProvider` protocol (defined in `search_provider.py`):
+The search pipeline uses a pluggable `SearchProvider` protocol (defined in `immermatch/search_api/search_provider.py`):

```python
class SearchProvider(Protocol):
@@ -231,7 +231,7 @@ SERPAPI_PARAMS = {

### Blocked Job Portals (SerpApi only)

-Jobs from the following portals are discarded during search result parsing (see `immermatch/serpapi_provider.py:BLOCKED_PORTALS`):
+Jobs from the following portals are discarded during search result parsing (see `immermatch/search_api/serpapi_provider.py:BLOCKED_PORTALS`):

> bebee, trabajo, jooble, adzuna, jobrapido, neuvoo, mitula, trovit, jobomas, jobijoba, talent, jobatus, jobsora, studysmarter, jobilize, learn4good, grabjobs, jobtensor, zycto, terra.do, jobzmall, simplyhired
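
A blocklist like this is typically applied as a filter while parsing results. The sketch below illustrates one way to do that. The matching rule (case-insensitive substring match on a `source` field) and the `is_blocked` and `filter_apply_options` helpers are assumptions for illustration, not the logic in `serpapi_provider.py`, and only a subset of the portals is included.

```python
# Subset of the portal list above, for illustration only.
BLOCKED_PORTALS = {"bebee", "jooble", "adzuna", "jobrapido", "simplyhired"}


def is_blocked(source: str) -> bool:
    """Assumed rule: blocked if any portal name occurs in the source (case-insensitive)."""
    lowered = source.lower()
    return any(portal in lowered for portal in BLOCKED_PORTALS)


def filter_apply_options(options: list[dict]) -> list[dict]:
    """Keep only apply options whose `source` is not a blocked portal."""
    return [opt for opt in options if not is_blocked(opt.get("source", ""))]


options = [
    {"source": "Company Careers", "url": "https://example.invalid/careers/1"},
    {"source": "Jooble", "url": "https://example.invalid/jooble/1"},
]
kept = filter_apply_options(options)  # only the "Company Careers" entry remains
```

Substring matching is simple but coarse: a short entry such as `talent` would also match unrelated sources that contain that word, which is one way valid listings can be dropped.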

@@ -504,8 +504,8 @@ Schema setup: run `python setup_db.py` to check tables and print migration SQL.
|---|---|---|
| `test_llm.py` (12 tests) | `llm.py` | `parse_json()` (8 cases: raw, fenced, embedded, nested, errors) + `call_gemini()` retry logic (4 cases: success, ServerError retry, 429 retry, non-429 immediate raise) |
| `test_evaluator_agent.py` (8 tests) | `evaluator_agent.py` | `evaluate_job()` (4 cases: happy path, API error fallback, parse error fallback, non-dict fallback) + `evaluate_all_jobs()` (3 cases: sorted output, progress callback, empty list) + `generate_summary()` (2 cases: score distribution in prompt, missing skills in prompt) |
-| `test_search_agent.py` (35 tests) | `search_agent.py` | `_is_remote_only()` (remote tokens, non-remote) + `_infer_gl()` (known locations, unknown default, remote returns None, case insensitive) + `_localise_query()` (city names, country names, case insensitive, multiple cities) + `_parse_job_results()` (valid, blocked portals, mixed, empty, no-apply-links) + `search_all_queries()` (provider delegation, dedup, early stopping, callbacks, default provider) + `generate_search_queries()` prompt selection (BA vs SerpApi) + `TestLlmJsonRecovery` (profile_candidate and generate_search_queries retry/recovery) |
-| `test_bundesagentur.py` (22 tests) | `bundesagentur.py` | `_build_ba_link()`, `_parse_location()`, `_parse_search_results()`, `_parse_listing()`, `BundesagenturProvider.search()` (basic merge, pagination, HTTP errors, empty results, detail fetch failures), `SearchProvider` protocol conformance |
+| `test_search_agent.py` (35 tests) | `search_api/search_agent.py` | `_is_remote_only()` (remote tokens, non-remote) + `_infer_gl()` (known locations, unknown default, remote returns None, case insensitive) + `_localise_query()` (city names, country names, case insensitive, multiple cities) + `_parse_job_results()` (valid, blocked portals, mixed, empty, no-apply-links) + `search_all_queries()` (provider delegation, dedup, early stopping, callbacks, default provider) + `generate_search_queries()` prompt selection (BA vs SerpApi) + `TestLlmJsonRecovery` (profile_candidate and generate_search_queries retry/recovery) |
+| `test_bundesagentur.py` (22 tests) | `search_api/bundesagentur.py` | `_build_ba_link()`, `_parse_location()`, `_parse_search_results()`, `_parse_listing()`, `BundesagenturProvider.search()` (basic merge, pagination, HTTP errors, empty results, detail fetch failures), `SearchProvider` protocol conformance |
| `test_cache.py` (17 tests) | `cache.py` | All cache operations: profile, queries, jobs (merge/dedup), evaluations, unevaluated job filtering |
| `test_cv_parser.py` (6 tests) | `cv_parser.py` | `_clean_text()` + `extract_text()` for .txt/.md, error cases |
| `test_models.py` (23 tests) | `models.py` | All Pydantic models: validation, defaults, round-trip serialization |
@@ -517,7 +517,7 @@ Schema setup: run `python setup_db.py` to check tables and print migration SQL.
| `test_integration.py` (11 tests) | Full pipeline | End-to-end: CV text → profile → queries → search → evaluate → summary, all services mocked |
| `test_pages_unsubscribe.py` (6 tests) | `pages/unsubscribe.py` | Unsubscribe page logic: token validation, DB deactivation, error states (AppTest) |
| `test_pages_verify.py` (7 tests) | `pages/verify.py` | DOI verification page: token confirmation, welcome email, expiry setting, error states (AppTest) |
-| `test_search_provider.py` (2 tests) | `search_provider.py` | Provider helpers: `parse_provider_query()`, combined provider behavior |
+| `test_search_provider.py` (2 tests) | `search_api/search_provider.py` | Provider helpers: `parse_provider_query()`, combined provider behavior |

### Testing conventions
- All external services (Gemini API, SerpAPI, Supabase) are mocked — no API keys needed to run tests
@@ -590,7 +590,7 @@ make clean # remove caches and build artifacts

The recommended workflow for implementing tasks/issues:

-1. **Pick the next unchecked task** from `ROADMAP.md`
+1. **Pick the next unchecked task** from `docs/strategy/ROADMAP.md`
2. **Plan the implementation** in Copilot Chat — describe the task, ask for a plan, review it
3. **Implement via Copilot Chat** (agent mode) — let the agent write code, create files, and run tests. It will implement → test → fix in a loop.
4. **Review the diff locally** — check changed files, run the Streamlit app once if needed
@@ -607,7 +607,7 @@ The recommended workflow for implementing tasks/issues:
```bash
gh pr merge --squash --delete-branch
```
-9. **Mark the task as done** in `ROADMAP.md` (change `- [ ]` to `- [x]`)
+9. **Mark the task as done** in `docs/strategy/ROADMAP.md` (change `- [ ]` to `- [x]`)

### Tool allocation (token efficiency)

8 changes: 4 additions & 4 deletions CLAUDE.md
@@ -35,10 +35,10 @@ source .venv/bin/activate && pytest tests/ -x -q && ruff check --fix . && ruff f
| `app.py` | Streamlit UI: CV upload → profile → search → evaluate → display |
| `cv_parser.py` | Extract text from PDF/DOCX/MD/TXT |
| `llm.py` | Gemini API wrapper with retry/backoff |
-| `search_agent.py` | Generate search queries (LLM) + orchestrate search |
-| `search_provider.py` | `SearchProvider` protocol + `get_provider()` factory |
-| `bundesagentur.py` | Bundesagentur für Arbeit job search API provider |
-| `serpapi_provider.py` | Google Jobs via SerpApi provider (future non-DE markets) |
+| `search_api/search_agent.py` | Generate search queries (LLM) + orchestrate search |
+| `search_api/search_provider.py` | `SearchProvider` protocol + `get_provider()` factory |
+| `search_api/bundesagentur.py` | Bundesagentur für Arbeit job search API provider |
+| `search_api/serpapi_provider.py` | Google Jobs via SerpApi provider (future non-DE markets) |
| `evaluator_agent.py` | Score jobs against candidate profile (LLM) + career summary |
| `models.py` | All Pydantic schemas (`CandidateProfile`, `JobListing`, etc.) |
| `cache.py` | JSON file cache in `.immermatch_cache/` |
14 changes: 12 additions & 2 deletions README.md
@@ -65,7 +65,7 @@ Jobs are fetched from Google Jobs via SerpApi, deduplicated, and scored in paral

## Bundesagentur Provider Tuning

-The Bundesagentur provider in `immermatch/bundesagentur.py` supports a configurable detail-fetch strategy:
+The Bundesagentur provider in `immermatch/search_api/bundesagentur.py` supports a configurable detail-fetch strategy:

- `api_then_html` (default): first tries `/pc/v4/jobdetails/{refnr}`, then falls back to scraping the public job-detail page if needed
- `api_only`: uses only the API detail endpoint
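
The fallback behaviour of the two strategies visible in this hunk can be sketched as follows. This is a hedged illustration, not the provider's code: `fetch_details` and the injected `fetch_api` / `fetch_html` callables are hypothetical, "detail not available" is assumed to be signalled by `None`, and any further strategies the README may list are not covered.

```python
from typing import Callable, Optional

# A fetcher takes a reference number and returns job details, or None on failure.
Fetcher = Callable[[str], Optional[dict]]


def fetch_details(
    refnr: str,
    strategy: str,
    fetch_api: Fetcher,
    fetch_html: Fetcher,
) -> Optional[dict]:
    """Resolve job details for `refnr` according to the chosen strategy."""
    if strategy not in ("api_then_html", "api_only"):
        raise ValueError(f"unsupported strategy in this sketch: {strategy!r}")

    # Both strategies try the API detail endpoint first.
    details = fetch_api(refnr)

    # Only api_then_html falls back to scraping the public detail page.
    if details is None and strategy == "api_then_html":
        details = fetch_html(refnr)
    return details
```

Passing the fetchers in as callables keeps the strategy logic testable without network access: a test can supply a `fetch_api` that returns `None` and check whether `fetch_html` is consulted.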
@@ -152,7 +152,11 @@ immermatch/
app.py # Streamlit web UI
llm.py # Gemini client and retry logic
cv_parser.py # CV text extraction (PDF/DOCX/MD/TXT)
-search_agent.py # Profile extraction and job search
+search_api/
+search_agent.py # Profile extraction and job search orchestration
+search_provider.py # Provider abstraction + routing/factory
+bundesagentur.py # Bundesagentur für Arbeit provider
+serpapi_provider.py # SerpApi provider
evaluator_agent.py # Job scoring and career summary
models.py # Pydantic data models
cache.py # JSON-based result caching
@@ -165,6 +169,12 @@ immermatch/
privacy.py # Privacy policy
daily_task.py # Daily digest cron job (GitHub Actions)
setup_db.py # Database schema checker / migration helper
+docs/
+strategy/
+ROADMAP.md # Product roadmap and priorities
+search-api/
+AGENT.md # Search API decision log + specialist guidance
+Improving Job Search API Results.md # Search quality research
tests/ # tests (all mocked)
```

2 changes: 1 addition & 1 deletion daily_task.py
@@ -45,7 +45,7 @@
from immermatch.evaluator_agent import evaluate_all_jobs
from immermatch.llm import create_client
from immermatch.models import CandidateProfile, EvaluatedJob, JobListing
-from immermatch.search_agent import search_all_queries
+from immermatch.search_api.search_agent import search_all_queries

logging.basicConfig(
level=logging.INFO,
34 changes: 34 additions & 0 deletions docs/search-api/AGENT.md
@@ -0,0 +1,34 @@
# Search API Specialist Agent

## Mission
Maintain and improve search quality, freshness, and provider reliability for Immermatch job discovery.

## Canonical Code Scope
- `immermatch/search_api/search_provider.py`
- `immermatch/search_api/search_agent.py`
- `immermatch/search_api/serpapi_provider.py`
- `immermatch/search_api/bundesagentur.py`

## Current Architecture Decisions
- Default provider is Bundesagentur für Arbeit (verified German listings).
- SerpApi provider is optional and enabled only when `SERPAPI_KEY` is set.
- Combined provider mode merges BA + SerpApi when SerpApi is configured.
- Search orchestration deduplicates by `title|company_name|location`.
- Provider quotas in combined mode enforce source diversity (`_MIN_JOBS_PER_PROVIDER`).
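
The deduplication rule above can be sketched as building one string key per job and keeping the first job seen for each key. This is a minimal sketch under assumptions: jobs are represented as plain dicts rather than `JobListing` models, and the lowercasing and whitespace stripping are illustrative normalisation choices that the actual orchestration code may or may not apply.

```python
def dedup_key(job: dict) -> str:
    """Build the `title|company_name|location` key (normalisation is an assumption)."""
    fields = ("title", "company_name", "location")
    return "|".join(str(job.get(field, "")).strip().lower() for field in fields)


def deduplicate(jobs: list[dict]) -> list[dict]:
    """Keep the first job seen for each key, preserving input order."""
    seen: set[str] = set()
    unique: list[dict] = []
    for job in jobs:
        key = dedup_key(job)
        if key not in seen:
            seen.add(key)
            unique.append(job)
    return unique


jobs = [
    {"title": "Data Engineer", "company_name": "Acme GmbH", "location": "Berlin"},
    {"title": "data engineer ", "company_name": "ACME GmbH", "location": "Berlin"},
    {"title": "Data Engineer", "company_name": "Acme GmbH", "location": "Munich"},
]
unique_jobs = deduplicate(jobs)  # the second entry collapses into the first
```

Keeping the first occurrence means that, in combined mode, whichever provider's results are merged first wins for a duplicated listing.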

## Known Tradeoffs
- BA gives higher listing trust; SerpApi increases breadth at higher noise risk.
- Portal blocklist removes common low-quality aggregators but may drop occasional valid listings.
- Temporal freshness currently relies on provider recency filters; no URL HEAD-validation pipeline yet.

## Research Inputs
- `docs/search-api/Improving Job Search API Results.md`

## Decision Log Template
Use this format for each change:
- Date:
- Decision:
- Context:
- Alternatives considered:
- Impact:
- Follow-up tasks:
28 changes: 28 additions & 0 deletions docs/strategy/AGENT.md
@@ -0,0 +1,28 @@
# Strategy Specialist Agent

## Mission
Translate product goals into an executable roadmap balancing launch speed, user value, and monetization.

## Canonical Strategy Docs
- `docs/strategy/ROADMAP.md`
- Additional market/positioning analyses in `docs/strategy/`

## Planning Principles
- Prefer small, validated increments over broad speculative work.
- Prioritize reliability, GDPR compliance, and job quality before growth features.
- Gate paid-tier complexity (Stripe, webhooks, infra migration) behind demand signals.

## Current Priority Lens
1. Search relevance and listing quality
2. UX conversion improvements (profile edits/preferences)
3. Digest reliability and anti-abuse hardening
4. Monetization readiness

## Decision Log Template
Use this format for each strategic update:
- Date:
- Hypothesis:
- Evidence:
- Decision:
- KPI impact expected:
- Revisit date:
File renamed without changes.
4 changes: 2 additions & 2 deletions immermatch/app.py
@@ -46,12 +46,12 @@
from immermatch.evaluator_agent import evaluate_job, generate_summary # noqa: E402
from immermatch.llm import create_client # noqa: E402
from immermatch.models import CandidateProfile, EvaluatedJob, JobListing # noqa: E402
-from immermatch.search_agent import ( # noqa: E402
+from immermatch.search_api.search_agent import ( # noqa: E402
generate_search_queries,
profile_candidate,
search_all_queries,
)
-from immermatch.search_provider import ( # noqa: E402
+from immermatch.search_api.search_provider import ( # noqa: E402
get_provider,
get_provider_fingerprint,
parse_provider_query, # noqa: E402
30 changes: 30 additions & 0 deletions immermatch/search_api/__init__.py
@@ -0,0 +1,30 @@
"""Search API domain package.

Canonical location for search provider implementations and orchestration.
"""

from .bundesagentur import BundesagenturProvider
from .search_agent import (
BA_HEADHUNTER_SYSTEM_PROMPT,
HEADHUNTER_SYSTEM_PROMPT,
PROFILER_SYSTEM_PROMPT,
generate_search_queries,
profile_candidate,
search_all_queries,
)
from .search_provider import CombinedSearchProvider, SearchProvider, get_provider
from .serpapi_provider import SerpApiProvider

__all__ = [
"BA_HEADHUNTER_SYSTEM_PROMPT",
"BundesagenturProvider",
"CombinedSearchProvider",
"HEADHUNTER_SYSTEM_PROMPT",
"PROFILER_SYSTEM_PROMPT",
"SearchProvider",
"SerpApiProvider",
"generate_search_queries",
"get_provider",
"profile_candidate",
"search_all_queries",
]
immermatch/bundesagentur.py → immermatch/search_api/bundesagentur.py
@@ -24,7 +24,7 @@

import httpx

-from .models import ApplyOption, JobListing
+from ..models import ApplyOption, JobListing

logger = logging.getLogger(__name__)

@@ -227,7 +227,7 @@ def _parse_search_results(data: dict) -> list[dict]:
class BundesagenturProvider:
"""Job-search provider backed by the Bundesagentur für Arbeit API.

-Satisfies the :class:`~immermatch.search_provider.SearchProvider` protocol.
+Satisfies the :class:`~immermatch.search_api.search_provider.SearchProvider` protocol.
"""

name: str = "Bundesagentur für Arbeit"
immermatch/search_agent.py → immermatch/search_api/search_agent.py
@@ -1,8 +1,8 @@
"""Search Agent module - Generates optimized job search queries using LLM.

The SerpApi-specific helpers (``_infer_gl``, ``_localise_query``, etc.) live
-in :mod:`immermatch.serpapi_provider` and are re-exported here for backward
-compatibility.
+in :mod:`immermatch.search_api.serpapi_provider` and are re-exported here for
+backward compatibility.
"""

from __future__ import annotations
@@ -15,8 +15,8 @@
from google import genai
from pydantic import ValidationError

-from .llm import call_gemini, parse_json
-from .models import CandidateProfile, JobListing
+from ..llm import call_gemini, parse_json
+from ..models import CandidateProfile, JobListing
from .search_provider import (
CombinedSearchProvider,
SearchProvider,
@@ -205,7 +205,7 @@ def generate_search_queries(
) -> list[str]:
"""Generate optimized job search queries based on candidate profile.

-When a :class:`~immermatch.bundesagentur.BundesagenturProvider` is active
+When a :class:`~immermatch.search_api.bundesagentur.BundesagenturProvider` is active
the prompt asks the LLM for short keyword-only queries (no location
tokens). For SerpApi / Google Jobs the prompt includes location-enrichment
strategies.
immermatch/search_provider.py → immermatch/search_api/search_provider.py
@@ -12,7 +12,7 @@
import os
from typing import Protocol, runtime_checkable

-from .models import JobListing
+from ..models import JobListing

logger = logging.getLogger(__name__)
