
Agent Instructions for Inspect Eval Convertor

This file provides instructions for AI coding assistants working on this repository.

Project Overview

This repository converts custom LLM evaluation formats into Inspect AI's canonical .eval log format, using the Task framework rather than manual EvalLog construction.

Critical Principles

  1. ALWAYS use task.py - never create convert.py files
  2. Use the task_main() helper - never construct EvalLog objects manually
  3. Store messages in metadata - use sample.metadata["messages"], not sample.messages
  4. Output naming - the output file is always the input filename with its extension replaced by .eval (e.g. input.json becomes input.eval), as illustrated below
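
A minimal sketch of principles 3 and 4, using the inspect_ai Sample class and pathlib. The message dicts and the "score" metadata key are illustrative assumptions; the actual fields depend on your source format and on what your scorer reads.

from pathlib import Path
from inspect_ai.dataset import Sample

input_path = Path("input.json")
output_path = input_path.with_suffix(".eval")  # input.json -> input.eval

sample = Sample(
    input="What is 2+2?",
    metadata={
        # pre-recorded conversation, replayed by the solver
        "messages": [
            {"role": "user", "content": "What is 2+2?"},
            {"role": "assistant", "content": "4"},
        ],
        # "score" is an assumed key name; use whatever your scorer expects
        "score": 1.0,
    },
)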

When Creating a New Converter

  1. Copy from example: Start with examples/simple_chat/task.py as template
  2. Read documentation: Check docs/INDEX.md and docs/CONVERSION_GUIDE.md
  3. Study similar examples: Pick the closest match to your format
  4. Create task.py: Follow the required structure (see .cursor/rules/001-core-patterns.mdc)
  5. Test immediately: Run uv run python task.py input.json and validate output

Task.py Structure

Every task.py MUST have:

from pathlib import Path

from inspect_ai import Task, task
from inspect_ai.scorer import Score, mean, scorer
from inspect_ai.solver import solver
from inspect_convertor.utils import task_main

@solver
def replay_solve():
    async def solve(state, generate):
        # Replay the pre-recorded messages stored in sample metadata
        state.messages = state.metadata["messages"]
        return state
    return solve

@scorer(metrics=[mean()])
def score_scorer():
    async def score(state, target):
        # Read the pre-recorded score from sample metadata
        return Score(value=state.metadata["score"])
    return score

@task
def my_task(input_path: Path, model_name: str, **kwargs) -> Task:
    # Parse input_path, create Sample objects with metadata,
    # and return a Task wired to the solver and scorer above
    ...

if __name__ == "__main__":
    task_main(my_task, get_model_name)
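
Note: @solver and @scorer are factories that return an async callable; that closure pattern is standard Inspect AI. The bodies above are only a sketch (the "score" metadata key is an assumption); see examples/*/task.py for complete, working implementations.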

Files to Reference

  • Examples: examples/*/task.py - Study these patterns
  • Documentation: docs/CONVERSION_GUIDE.md - Step-by-step guide
  • Utilities: src/inspect_convertor/utils.py - task_main() helper
  • Troubleshooting: docs/TROUBLESHOOTING.md - Common issues

Never Do

  • ❌ Create convert.py files
  • ❌ Use EvalLog, EvalSample, EvalSpec directly
  • ❌ Use deprecated safe_convert_message() or safe_extract_score()
  • ❌ Use ConversionContext or create_conversion_context()
  • ❌ Create output.eval files (use input.eval)

Always Do

  • ✅ Use @task decorator
  • ✅ Use task_main() from inspect_convertor.utils
  • ✅ Store messages in sample.metadata["messages"]
  • ✅ Create ModelEvents in metadata for tools/branching
  • ✅ Run make test after changes
  • ✅ Validate output with inspect-convert-validate

Example Workflow

# 1. Study example
cat examples/simple_chat/task.py

# 2. Create new task.py based on pattern
# (follow .cursor/rules/001-core-patterns.mdc)

# 3. Install dependencies (if needed)
uv pip install -e .

# 4. Test it
uv run python examples/my_format/task.py examples/my_format/input.json

# 5. Validate
inspect-convert-validate examples/my_format/input.eval

# 6. Run all tests
make test

Getting Help

  • Check .cursor/rules/ for detailed patterns
  • Read docs/CONVERSION_GUIDE.md for complete examples
  • See docs/TROUBLESHOOTING.md for error solutions
  • Look at examples/ for working implementations