Skip to content

Adding Agent Benchmarking#2

Merged
djriffle merged 9 commits intomainfrom
AgentBenchmarkingWithMemory
Apr 24, 2025
Merged

Adding Agent Benchmarking#2
djriffle merged 9 commits intomainfrom
AgentBenchmarkingWithMemory

Conversation

@djriffle
Copy link
Member

Added Agent Benchmarking
This pull request introduces a benchmarking framework for evaluating AI-generated code in the context of single-cell transcriptomics data analysis. It includes the implementation of a new evaluation script, dataset metadata, and supporting infrastructure, as well as updates to documentation and configuration files.

New Features and Functionality:

  • Evaluation Script: Added Evaluator.py, which provides functionality to evaluate AI-generated code using OpenAI's API. It includes helper functions for formatting conversations, sending evaluation requests, and processing datasets. The script supports interactive usage and integrates with the dotenv library for API key management.

  • Dataset Metadata: Introduced a new dataset metadata file, spatial_transcriptomics_in_mouse_puck_191109_14.json, which includes details such as citation, dataset ID, and cell count. This file is part of the benchmarking datasets.

Configuration and Setup:

  • Environment Configuration Script: Added create_benchmark_env.sh, a script to securely prompt for and save the OpenAI API key into a .env file. It ensures proper file permissions for security.

  • .gitignore Updates: Updated .gitignore to exclude .env, __pycache__/, .DS_store, and outputs/ to prevent sensitive or unnecessary files from being tracked.

  • Requirements File: Added requirements.txt with dependencies such as openai, rich, docker, and cellxgene-census to support the benchmarking framework.

Documentation:

  • Comprehensive README: Added a detailed README.md file outlining the purpose, setup, and usage of the benchmarking framework. It includes instructions for dataset management, sandbox setup, and running the evaluation process.

@djriffle djriffle merged commit 026b2aa into main Apr 24, 2025
2 checks passed
@djriffle djriffle deleted the AgentBenchmarkingWithMemory branch June 2, 2025 01:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant