Synthdocs ships with MistralBackend and OpenAIBackend, but you can implement your own backend to use any LLM provider.
Subclass the `LLMBackend` abstract base class from `synthdocs.llm.base` and implement its `generate()` method:
```python
from synthdocs.llm.base import LLMBackend
from pydantic import BaseModel


class MyBackend(LLMBackend):
    def __init__(self, model: str = "my-model", temperature: float = 0.2) -> None:
        super().__init__(model=model, temperature=temperature)
        # Initialize your client here

    def generate(
        self,
        prompt: str,
        response_model: type[BaseModel] | None = None,
        temperature: float | None = None,
    ) -> str | BaseModel:
        effective_temp = temperature if temperature is not None else self.temperature
        if response_model is not None:
            # Structured output mode: return a validated Pydantic model instance.
            # Use your provider's structured output feature, or parse JSON manually.
            raw_json = self._call_llm_json_mode(prompt, effective_temp)
            return response_model.model_validate_json(raw_json)
        # Text mode: return the raw string
        return self._call_llm_text_mode(prompt, effective_temp)
```

Here `_call_llm_json_mode` and `_call_llm_text_mode` are placeholders for whatever calls your provider's client exposes. Your `generate()` implementation must honor both modes:

- Text mode (`response_model=None`): return a plain `str`
- Structured mode (`response_model` provided): return a validated Pydantic model instance, not raw JSON text
For structured output, you can either:

- Use your provider's native structured output / JSON schema feature (preferred)
- Request JSON and parse it with `response_model.model_validate_json(raw_json)`; a sketch of this fallback follows below
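If your provider has no native structured output mode, a common pattern for the JSON fallback is to embed the response model's JSON schema in the prompt and validate whatever comes back. The helper below is only a sketch under that assumption: `generate_structured`, its prompt wording, and the retry count are illustrative choices, not part of the synthdocs API; the only synthdocs-relevant piece is the Pydantic `response_model` it validates against.

```python
import json

from pydantic import BaseModel, ValidationError


def generate_structured(
    call_text_mode,                     # your provider's plain-text completion callable
    prompt: str,
    response_model: type[BaseModel],
    max_attempts: int = 2,
) -> BaseModel:
    """Ask the model for JSON matching response_model's schema, then validate it."""
    schema = json.dumps(response_model.model_json_schema(), indent=2)
    json_prompt = (
        f"{prompt}\n\n"
        "Respond with a single JSON object that conforms to this JSON Schema, "
        f"and nothing else:\n{schema}"
    )
    last_error: ValidationError | None = None
    for _ in range(max_attempts):
        raw = call_text_mode(json_prompt).strip()
        try:
            # Depending on the provider you may also need to strip markdown
            # code fences from the reply before validating.
            return response_model.model_validate_json(raw)
        except ValidationError as exc:
            last_error = exc
    raise ValueError(
        f"LLM did not return valid JSON for {response_model.__name__}"
    ) from last_error
```

In `MyBackend.generate()`, the structured branch could then delegate to this helper instead of the `_call_llm_json_mode` + `model_validate_json` pair shown above, e.g. `generate_structured(lambda p: self._call_llm_text_mode(p, effective_temp), prompt, response_model)`.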
Pass your backend to the generation functions:
```python
from pathlib import Path

from synthdocs import generate_document, generate_case_batch

# Single document
result = generate_document(case_input, backend=MyBackend())

# Batch generation
results = generate_case_batch(
    template=my_template,
    count=5,
    backend=MyBackend(model="my-large-model", temperature=0.3),
    output_dir=Path("./output"),
)
```

If you want to use your custom backend as the judge for fact-location evaluation, you have two options.
The first option is to sanity-check your backend against the bundled labeled dataset:
```python
from synthdocs.eval.judge_benchmarks import (
    load_fact_locations_judge_benchmark,
    run_fact_locations_judge_benchmark,
)

items = load_fact_locations_judge_benchmark()  # bundled JSONL
result = run_fact_locations_judge_benchmark(
    backend=MyBackend(),
    items=items,
    tightness_threshold=3,
)

print(f"Entailment accuracy: {result['summary']['entailed_accuracy']:.3f}")
print(f"Span pass agreement: {result['summary']['pass_agreement']:.3f}")
```

This runs your backend through the same judge prompts used in evaluation and compares its output against human-labeled expected outputs. Look for:
- Entailment accuracy > 0.9 (does your model correctly identify when facts are present?)
- Span pass agreement (does the judge's pass/fail verdict, i.e. entailed AND span minimality score >= threshold, agree with the labels?). This is a proxy for whether extracted spans are reasonably minimal and usable as citations. Warning: we are still testing this metric.
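Once the benchmark has run, you can turn those targets into a quick programmatic gate. The snippet below is a sketch, not part of synthdocs: it relies only on the `result["summary"]` keys printed above, and the 0.8 pass-agreement floor is an arbitrary placeholder to tune for your own judge.

```python
def check_judge_benchmark(
    result: dict,
    min_entailed: float = 0.9,      # target from the guidance above
    min_agreement: float = 0.8,     # placeholder; tune for your judge
) -> bool:
    """Return True if the custom judge backend meets the benchmark targets."""
    summary = result["summary"]
    ok = (
        summary["entailed_accuracy"] >= min_entailed
        and summary["pass_agreement"] >= min_agreement
    )
    status = "OK" if ok else "BELOW TARGET"
    print(
        f"{status}: entailed_accuracy={summary['entailed_accuracy']:.3f}, "
        f"pass_agreement={summary['pass_agreement']:.3f}"
    )
    return ok


if not check_judge_benchmark(result):
    raise SystemExit("Custom judge backend did not meet benchmark targets")
```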
The second option is to use your backend in the full evaluation pipeline:
```python
from pathlib import Path

from synthdocs.eval.fact_location import (
    FactLocationJudgeConfig,
    run_fact_location_batch_eval,
)

judge_config = FactLocationJudgeConfig(
    backend=MyBackend(temperature=0.1),
    model="my-model",
    temperature=0.1,
    context_chars=120,
    enabled=True,
)

summary = run_fact_location_batch_eval(
    target=Path("output/"),
    run_id="my-run",
    judge_config=judge_config,
)
```

Note: the CLI (`synthdocs eval fact-locations`) currently supports only the OpenAI and Mistral backends. For custom backends, use the Python API directly.
For implementation examples, see:
- `src/synthdocs/llm/openai.py`: uses OpenAI's `responses.parse()` for structured output
- `src/synthdocs/llm/mistral.py`: uses `response_format: json_schema` for structured output