Getting Started with Valence

Valence tests AI systems to ensure they work correctly and safely. It's like automated quality control for chatbots, AI agents, and other AI applications.

What Valence Does

Say you built an AI assistant that helps people find online courses. You want to verify it:

Finds relevant courses when asked
Doesn't recommend inappropriate content
Gives consistent answers to similar questions
Handles typos and different phrasings correctly

Valence automates this testing process instead of checking responses manually.

Basic Concepts

Seeds

Test questions you want to check:

[
  {"id": "search-1", "prompt": "Find leadership courses"},
  {"id": "math-1", "prompt": "What's 15 + 25?"},
  {"id": "filter-1", "prompt": "Show cybersecurity training under $100"}
]

Detectors

Rules that check if responses are good or bad:

Keyword detectors look for specific words:

- type: keyword
  keywords: ["inappropriate", "harmful"]  # Flag these words

Math validators check calculations:

- type: validator
  validator_name: sum_equals
  expected: from_seed  # Check against correct answer

LLM judges use AI to evaluate complex responses:

- type: llm_judge
  judge_prompt: "Is this response helpful and accurate? Score 0.0 for good, 1.0 for bad."

Mutations

When a test fails, Valence creates variations to find related problems:

Original: "Find leadership courses" Mutations:

"Find leadership courses. Keep response under 2 sentences."
"You are an expert. Find leadership courses."
"Find top leadership courses quickly"

This helps discover if the AI fails in similar ways.

Quick Start

1. Install Valence

pip install valence-evals

2. Create test questions (seeds.json)

[
  {"id": "test-1", "prompt": "Find Python courses"},
  {"id": "test-2", "prompt": "What's 10 + 15?"}
]

3. Create detection rules (packs/basic.yaml)

id: basic-tests
version: "1.0.0"
severity: medium
detectors:
  - type: keyword
    category: safety
    keywords: ["error", "failed", "unavailable"]

4. Run evaluation

valence run --model stub --seeds seeds.json --packs packs/ --out results/

5. View results

valence report --in results/ --out results/report.html
open results/report.html

Understanding Results

Pass/Fail

Pass: Response looks good to all detectors
Fail: At least one detector flagged the response
Error: AI system couldn't generate a response

Mutation Tree

When a test fails, you'll see a tree showing:

test-1 (FAIL)
├── test-1.c1 (PASS) - "Plain English version"  
├── test-1.c2 (FAIL) - "Short response version"
└── test-1.c3 (PASS) - "Expert role version"

This shows which variations of the failing prompt also fail.

Common Patterns

Testing a Chatbot

# Check for helpful responses
- type: llm_judge
  judge_prompt: |
    Does this response answer the user's question helpfully?
    Question: {original_prompt}
    Answer: {response}
    Score 0.0 for helpful, 1.0 for unhelpful.

# Check for safety
- type: keyword
  keywords: ["harmful", "dangerous", "illegal"]

Testing Math Skills

// Seed with expected answer
{"id": "math-1", "prompt": "What's 12 + 8?", "label": {"answer": 20}}

# Validate calculation
- type: validator
  validator_name: sum_equals
  expected: from_seed

Testing Search Quality

# Check search results format
- type: regex_set
  patterns: ["\\d+\\.\\s+.+"]  # "1. Result title"

# Check result relevance  
- type: llm_judge
  judge_prompt: |
    Are these search results relevant to "{original_prompt}"?
    Results: {response}
    Score 0.0 for relevant, 1.0 for irrelevant.

Real LLM Testing

To test with actual AI models instead of the stub:

1. Set API keys

export OPENAI_API_KEY="your-key"
export ANTHROPIC_API_KEY="your-key"

2. Run with real model

valence run --model openai:gpt-4o --seeds seeds.json --packs packs/ --out results/

3. Enable smart mutations

valence run --model openai:gpt-4o --llm-mutations --seeds seeds.json --packs packs/ --out results/

Tips

Start Simple

Begin with stub model to test your detectors
Use basic keyword/regex detectors first
Add LLM judges for complex evaluation
Test with real models once setup works

Cost Control

Use --max-gens 1 to limit mutations during testing
Start with cheaper models (gpt-4o-mini, claude-3-haiku)
Use deterministic mutations by default (no --llm-mutations)

Good Seeds

Focus on edge cases that might break your AI
Include examples of both good and bad scenarios
Keep prompts simple - mutations will create complexity

Effective Detectors

Combine multiple detector types for thorough checking
Use keywords for obvious problems
Use LLM judges for nuanced evaluation
Test your detectors with known good/bad examples first

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Getting Started with Valence

What Valence Does

Basic Concepts

Seeds

Detectors

Mutations

Quick Start

1. Install Valence

2. Create test questions (seeds.json)

3. Create detection rules (packs/basic.yaml)

4. Run evaluation

5. View results

Understanding Results

Pass/Fail

Mutation Tree

Common Patterns

Testing a Chatbot

Testing Math Skills

Testing Search Quality

Real LLM Testing

1. Set API keys

2. Run with real model

3. Enable smart mutations

Tips

Start Simple

Cost Control

Good Seeds

Effective Detectors

FilesExpand file tree

BEGINNER_GUIDE.md

Latest commit

History

BEGINNER_GUIDE.md

File metadata and controls

Getting Started with Valence

What Valence Does

Basic Concepts

Seeds

Detectors

Mutations

Quick Start

1. Install Valence

2. Create test questions (seeds.json)

3. Create detection rules (packs/basic.yaml)

4. Run evaluation

5. View results

Understanding Results

Pass/Fail

Mutation Tree

Common Patterns

Testing a Chatbot

Testing Math Skills

Testing Search Quality

Real LLM Testing

1. Set API keys

2. Run with real model

3. Enable smart mutations

Tips

Start Simple

Cost Control

Good Seeds

Effective Detectors