Research eval frameworks and define quality criteria for AI skill outputs. Coordinate with UXD on their evals research to date before recommending an approach.
Acceptance Criteria:
- Converse with UXD about their evals research: what they've explored, what worked, what didn't
- Reference the Confluence "Skill Evaluation" page in the UIE space for prior team thinking
- Research existing eval frameworks (e.g., promptfoo, Braintrust, custom harness) and recommend one that fits our skill architecture
- Define what "good output" means for at least 3 existing skills (e.g., pf-unit-test-generator, pf-compliance-checker, pf-design-mode)
- Document the recommended framework, criteria definitions, and eval patterns so implementation stories can execute against them
- Deliver findings as a written summary (Google Doc or Confluence page)
Jira Issue: PF-4231