Commit d1691e6

Added: Support for multiple scores from scorers

1 parent aa1e7f8 commit d1691e6

10 files changed: 789 additions & 125 deletions


README.md

Lines changed: 28 additions & 1 deletion
@@ -259,6 +259,8 @@ Braintrust::Eval.run(
 )
 ```
 
+See [eval.rb](./examples/eval.rb) for a full example.
+
 ### Datasets
 
 Use test cases from a Braintrust dataset:
@@ -287,6 +289,8 @@ Braintrust::Eval.run(
 )
 ```
 
+See [dataset.rb](./examples/eval/dataset.rb) for a full example.
+
 ### Scorers
 
 Use scoring functions defined in Braintrust:
@@ -315,6 +319,8 @@ Braintrust::Eval.run(
 )
 ```
 
+See [remote_functions.rb](./examples/eval/remote_functions.rb) for a full example.
+
 #### Scorer metadata
 
 Scorers can return a Hash with `:score` and `:metadata` to attach structured context to the score. The metadata is logged on the scorer's span and visible in the Braintrust UI for debugging and filtering:
@@ -332,6 +338,27 @@ end
 
 See [scorer_metadata.rb](./examples/eval/scorer_metadata.rb) for a full example.
 
+#### Multiple scores from one scorer
+
+When several scores can be computed together (e.g. in one LLM call), you can return an `Array` of score `Hash`es instead of a single value. Each metric appears as a separate score column in the Braintrust UI:
+
+```ruby
+Braintrust::Scorer.new("summary_quality") do |output:, expected:|
+  words = output.downcase.split
+  key_terms = expected[:key_terms]
+  covered = key_terms.count { |t| words.include?(t) }
+
+  [
+    {name: "coverage", score: covered.to_f / key_terms.size, metadata: {missing: key_terms - words}},
+    {name: "conciseness", score: words.size <= expected[:max_words] ? 1.0 : 0.0}
+  ]
+end
+```
+
+`name` and `score` are required; `metadata` is optional.
+
+See [multi_score.rb](./examples/eval/multi_score.rb) for a full example.
+
 #### Trace scoring
 
 Scorers can access the full evaluation trace (all spans generated by the task) by declaring a `trace:` keyword parameter. This is useful for inspecting intermediate LLM calls, validating tool usage, or checking the message thread:
@@ -361,7 +388,7 @@ Braintrust::Eval.run(
 )
 ```
 
-See examples: [eval.rb](./examples/eval.rb), [dataset.rb](./examples/eval/dataset.rb), [remote_functions.rb](./examples/eval/remote_functions.rb), [trace_scoring.rb](./examples/eval/trace_scoring.rb)
+See [trace_scoring.rb](./examples/eval/trace_scoring.rb) for a full example.
 
 ### Dev Server

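The README's multi-score return shape can be exercised without the SDK. A minimal sketch, inlining the scorer block as a plain lambda (the inputs are hypothetical; no Braintrust dependency is needed to see the shape of the result):

```ruby
# Plain-Ruby model of a multi-score scorer: the block returns an Array
# of {name:, score:, metadata:} hashes instead of a single value.
summary_quality = lambda do |output:, expected:|
  words = output.downcase.split
  key_terms = expected[:key_terms]
  covered = key_terms.count { |t| words.include?(t) }

  [
    {name: "coverage", score: covered.to_f / key_terms.size,
     metadata: {missing: key_terms - words}},
    {name: "conciseness", score: words.size <= expected[:max_words] ? 1.0 : 0.0}
  ]
end

scores = summary_quality.call(
  output: "the sky is blue",
  expected: {key_terms: %w[sky blue clouds white], max_words: 10}
)
# Two of four key terms appear (coverage 0.5); four words fit the limit of ten.
```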
examples/eval/multi_score.rb

Lines changed: 132 additions & 0 deletions
@@ -0,0 +1,132 @@

```ruby
#!/usr/bin/env ruby
# frozen_string_literal: true

require "bundler/setup"
require "braintrust"
require "opentelemetry/sdk"

# Example: Multi-Score Scorers
#
# A scorer can return an Array of score hashes to emit multiple named metrics
# from a single scorer call. Each hash must have a :name and :score key; an
# optional :metadata key attaches structured context to that metric.
#
# This is useful when several dimensions of quality (e.g. correctness,
# completeness, format) can be computed together — sharing one inference call
# or one pass over the output — rather than running separate scorers.
#
# Two patterns are shown:
#
# 1. Block-based (Braintrust::Scorer.new):
#    Pass a block that returns an Array. Good for concise, one-off scorers.
#
# 2. Class-based (include Braintrust::Scorer):
#    Define a class with a #call method. Good for reusable scorers that
#    share helper logic across multiple metrics.
#
# Usage:
#   bundle exec ruby examples/eval/multi_score.rb

Braintrust.init

# ---------------------------------------------------------------------------
# Task: summarise a list of facts
# ---------------------------------------------------------------------------
FACTS = {
  "The sky is blue and clouds are white." => {
    key_terms: %w[sky blue clouds white],
    max_words: 10
  },
  "Ruby was created by Matz in 1995." => {
    key_terms: %w[ruby matz 1995],
    max_words: 8
  },
  "The Pacific Ocean is the largest ocean on Earth." => {
    key_terms: %w[pacific largest ocean earth],
    max_words: 10
  }
}

# Simulated summariser (replace with a real LLM call in production)
def summarise(text)
  # Naive: drop words over the limit and lowercase
  text.split.first(8).join(" ").downcase
end

# ---------------------------------------------------------------------------
# Pattern 1: block-based multi-score scorer
#
# Returns three metrics in one pass:
#   - coverage: fraction of key terms present in the summary
#   - conciseness: 1.0 if under the word limit, else 0.0
#   - lowercase: 1.0 if the summary is fully lowercased
# ---------------------------------------------------------------------------
summary_quality = Braintrust::Scorer.new("summary_quality") do |output:, expected:|
  words = output.to_s.downcase.split
  key_terms = expected[:key_terms]
  max_words = expected[:max_words]

  covered = key_terms.count { |t| words.include?(t) }
  coverage_score = key_terms.empty? ? 1.0 : covered.to_f / key_terms.size

  [
    {
      name: "coverage",
      score: coverage_score,
      metadata: {covered: covered, total: key_terms.size, missing: key_terms - words}
    },
    {
      name: "conciseness",
      score: (words.size <= max_words) ? 1.0 : 0.0,
      metadata: {word_count: words.size, limit: max_words}
    },
    {
      name: "lowercase",
      score: (output.to_s == output.to_s.downcase) ? 1.0 : 0.0
    }
  ]
end

# ---------------------------------------------------------------------------
# Pattern 2: class-based multi-score scorer
#
# Include Braintrust::Scorer and define #call. The class name is used as the
# scorer name by default; override #name to customise it.
#
# Returns two metrics:
#   - ends_with_period: checks punctuation
#   - no_first_person: checks for avoided first-person pronouns
# ---------------------------------------------------------------------------
class StyleChecker
  include Braintrust::Scorer

  FIRST_PERSON = %w[i me my myself we us our].freeze

  def call(output:, **)
    text = output.to_s
    words = text.downcase.split(/\W+/)
    fp_words = words & FIRST_PERSON

    [
      {
        name: "ends_with_period",
        score: text.strip.end_with?(".") ? 1.0 : 0.0
      },
      {
        name: "no_first_person",
        score: fp_words.empty? ? 1.0 : 0.0,
        metadata: {found: fp_words}
      }
    ]
  end
end

Braintrust::Eval.run(
  project: "ruby-sdk-examples",
  experiment: "multi-score-example",
  cases: FACTS.map { |text, expected| {input: text, expected: expected} },
  task: ->(input:) { summarise(input) },
  scorers: [summary_quality, StyleChecker.new]
)

OpenTelemetry.tracer_provider.shutdown
```

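The StyleChecker metrics above can be sanity-checked in isolation. A small sketch that runs the same logic as a plain method, with the `Braintrust::Scorer` mixin omitted (so there is no SDK dependency):

```ruby
# Standalone check of the StyleChecker logic: word list intersection for
# first-person detection, string check for trailing punctuation.
FIRST_PERSON = %w[i me my myself we us our].freeze

def style_scores(output)
  text = output.to_s
  words = text.downcase.split(/\W+/)
  fp_words = words & FIRST_PERSON

  [
    {name: "ends_with_period", score: text.strip.end_with?(".") ? 1.0 : 0.0},
    {name: "no_first_person", score: fp_words.empty? ? 1.0 : 0.0,
     metadata: {found: fp_words}}
  ]
end

scores = style_scores("I summarised the facts.")
# ends_with_period passes (1.0); "i" triggers no_first_person (0.0).
```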
lib/braintrust/eval/runner.rb

Lines changed: 18 additions & 43 deletions
@@ -111,9 +111,8 @@ def run_eval_case(case_context, errors)
       case_context.trace = build_trace(eval_span)
 
       # Run scorers
-      case_scores = nil
       begin
-        case_scores = run_scorers(case_context)
+        run_scorers(case_context)
       rescue => e
         # Error already recorded on score span, set eval span status
         eval_span.status = OpenTelemetry::Trace::Status.error(e.message)
@@ -123,7 +122,7 @@ def run_eval_case(case_context, errors)
       # Set output after task completes
       set_json_attr(eval_span, "braintrust.output_json", {output: case_context.output})
 
-      report_progress(eval_span, case_context, data: case_context.output, scores: case_scores || {})
+      report_progress(eval_span, case_context, data: case_context.output)
     end
   ensure
     eval_span&.finish
@@ -157,7 +156,6 @@ def run_task(case_context)
     # Run scorers with OpenTelemetry tracing.
     # Creates one span per scorer, each a direct child of the current (eval) span.
     # @param case_context [CaseContext] The per-case context (output must be populated)
-    # @return [Hash] Scores hash { scorer_name => score_value }
     def run_scorers(case_context)
       scorer_kwargs = {
         input: case_context.input,
@@ -173,47 +171,41 @@ def run_scorers(case_context)
         metadata: case_context.metadata || {}
       }
 
-      scores = {}
       scorer_error = nil
       eval_context.scorers.each do |scorer|
-        run_scorer(scorer, scorer_kwargs, scorer_input, scores)
+        collect_scores(run_scorer(scorer, scorer_kwargs, scorer_input))
       rescue => e
         scorer_error ||= e
       end
 
       raise scorer_error if scorer_error
-
-      scores
     end
 
     # Run a single scorer inside its own span.
     # @param scorer [Scorer] The scorer to run
     # @param scorer_kwargs [Hash] Keyword arguments for the scorer
    # @param scorer_input [Hash] Input to log on the span
-    # @param scores [Hash] Accumulator for score results
-    def run_scorer(scorer, scorer_kwargs, scorer_input, scores)
+    # @return [Array<Hash>] Raw score results from the scorer
+    def run_scorer(scorer, scorer_kwargs, scorer_input)
       tracer.in_span(scorer.name) do |score_span|
         score_span.set_attribute("braintrust.parent", eval_context.parent_span_attr) if eval_context.parent_span_attr
         set_json_attr(score_span, "braintrust.span_attributes", build_scorer_span_attributes(scorer.name))
         set_json_attr(score_span, "braintrust.input_json", scorer_input)
 
-        raw_result = scorer.call(**scorer_kwargs)
-        normalized = normalize_score_result(raw_result, scorer.name)
+        score_results = scorer.call(**scorer_kwargs)
 
-        score_name = normalized[:name]
-        scores[score_name] = normalized[:score]
+        scorer_scores = {}
+        scorer_metadata = {}
+        score_results.each do |s|
+          scorer_scores[s[:name]] = s[:score]
+          scorer_metadata[s[:name]] = s[:metadata] if s[:metadata].is_a?(Hash)
+        end
 
-        scorer_scores = {score_name => normalized[:score]}
         set_json_attr(score_span, "braintrust.output_json", scorer_scores)
         set_json_attr(score_span, "braintrust.scores", scorer_scores)
+        set_json_attr(score_span, "braintrust.metadata", scorer_metadata) unless scorer_metadata.empty?
 
-        # Set scorer metadata on its span
-        if normalized[:metadata].is_a?(Hash)
-          set_json_attr(score_span, "braintrust.metadata", normalized[:metadata])
-        end
-
-        # Collect raw score for summary (thread-safe)
-        collect_score(score_name, normalized[:score])
+        score_results
      rescue => e
        record_span_error(score_span, e, "ScorerError")
        raise
@@ -302,28 +294,11 @@ def set_json_attr(span, key, value)
       span.set_attribute(key, JSON.dump(value))
     end
 
-    # Collect a single score value for summary calculation
-    # @param name [String] Scorer name
-    # @param value [Object] Score value (only Numeric values are collected)
-    def collect_score(name, value)
-      return unless value.is_a?(Numeric)
-
+    # Collect score results into the summary accumulator (thread-safe).
+    # @param score_results [Array<Hash>] Score results from a scorer
+    def collect_scores(score_results)
       @score_mutex.synchronize do
-        (@scores[name] ||= []) << value
-      end
-    end
-
-    # Normalize a scorer return value into its component parts.
-    # Scorers may return a raw Numeric or a Hash with :score, :metadata, and :name keys.
-    # @param result [Object] Raw scorer return value
-    # @param default_name [String] Scorer name to use if not overridden
-    # @return [Hash] Normalized hash with :score, :metadata, :name keys
-    def normalize_score_result(result, default_name)
-      if result.is_a?(Hash)
-        result[:name] ||= default_name
-        result
-      else
-        {score: result, metadata: nil, name: default_name}
+        score_results.each { |s| (@scores[s[:name]] ||= []) << s[:score] }
       end
     end
   end

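The `collect_scores` change in the runner amounts to a per-metric accumulator: each score name maps to an array of values, appended under a mutex and averaged for the run summary. A standalone model of that pattern (names mirror the diff, but this is a sketch, not the SDK code itself):

```ruby
# Per-metric accumulation: one array of values per score name, guarded by
# a mutex so scorers running on multiple threads can report safely.
scores = {}
mutex = Mutex.new

collect_scores = lambda do |score_results|
  mutex.synchronize do
    score_results.each { |s| (scores[s[:name]] ||= []) << s[:score] }
  end
end

# Two cases' worth of results from one multi-score scorer:
collect_scores.call([{name: "coverage", score: 0.5}, {name: "conciseness", score: 1.0}])
collect_scores.call([{name: "coverage", score: 1.0}, {name: "conciseness", score: 1.0}])

# Average each metric for the summary.
summary = scores.transform_values { |vs| vs.sum / vs.size }
# coverage averages to 0.75, conciseness to 1.0.
```

Keying the accumulator by the `:name` inside each result (rather than by scorer) is what lets one scorer contribute several summary columns.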