Adding Agent Benchmarking by djriffle · Pull Request #2 · OpenTechBio/Olaf

djriffle · 2025-04-24T14:36:16Z

Added Agent Benchmarking
This pull request introduces a benchmarking framework for evaluating AI-generated code in the context of single-cell transcriptomics data analysis. It includes the implementation of a new evaluation script, dataset metadata, and supporting infrastructure, as well as updates to documentation and configuration files.

New Features and Functionality:

Evaluation Script: Added Evaluator.py, which provides functionality to evaluate AI-generated code using OpenAI's API. It includes helper functions for formatting conversations, sending evaluation requests, and processing datasets. The script supports interactive usage and integrates with the dotenv library for API key management.
Dataset Metadata: Introduced a new dataset metadata file, spatial_transcriptomics_in_mouse_puck_191109_14.json, which includes details such as citation, dataset ID, and cell count. This file is part of the benchmarking datasets.

Configuration and Setup:

Environment Configuration Script: Added create_benchmark_env.sh, a script to securely prompt for and save the OpenAI API key into a .env file. It ensures proper file permissions for security.
.gitignore Updates: Updated .gitignore to exclude .env, __pycache__/, .DS_store, and outputs/ to prevent sensitive or unnecessary files from being tracked.
Requirements File: Added requirements.txt with dependencies such as openai, rich, docker, and cellxgene-census to support the benchmarking framework.

Documentation:

Comprehensive README: Added a detailed README.md file outlining the purpose, setup, and usage of the benchmarking framework. It includes instructions for dataset management, sandbox setup, and running the evaluation process.

djriffle and others added 9 commits April 22, 2025 22:29

Started Adding Basic One Shot Support

4e44602

Docker fixes and sample prompts

c4219ac

Git API fixes

52cfb46

working on adding session memory

e29a124

switched to fast API

36ea7ff

logging fix

61633a7

added evaluator

c20c684

added prompt evolver

4b0f98b

updated README

7be1fe5

djriffle merged commit 026b2aa into main Apr 24, 2025
2 checks passed

djriffle deleted the AgentBenchmarkingWithMemory branch June 2, 2025 01:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding Agent Benchmarking#2

Adding Agent Benchmarking#2
djriffle merged 9 commits intomainfrom
AgentBenchmarkingWithMemory

djriffle commented Apr 24, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

djriffle commented Apr 24, 2025

New Features and Functionality:

Configuration and Setup:

Documentation:

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant