AInsteinBench

AInsteinBench is a benchmark for evaluating the capabilities of AI agents in solving scientific computing (a.k.a. SciCo) problems. It currently supports coding questions in the Einstein Toolkit and Multi-SWE-bench formats.

Prerequisites

  • Python 3.8+
  • Docker
  • Required packages: pip install -r requirements.txt
Optional (for Multi-SWE-bench type questions)
# install multi-swe-bench
git clone https://github.com/multi-swe-bench/multi-swe-bench.git && cd multi-swe-bench && make install
# the docker images for the questions are available upon request
Optional (for Einstein Toolkit type questions)
cd curation/et
# install Einstein Toolkit following https://einsteintoolkit.org/download.html
curl -kLO https://raw.githubusercontent.com/gridaphobe/CRL/ET_2025_05/GetComponents
chmod a+x GetComponents
./GetComponents --shallow https://bitbucket.org/einsteintoolkit/manifest/raw/ET_2025_05/einsteintoolkit.th
# Then the curation/et/Cactus folder should be populated with the Einstein Toolkit code and tests.
# the ADMConstraints thorn is outdated but still used in the docker image; get it with
git clone https://bitbucket.org/einsteintoolkit/einsteinanalysis.git
cp -r einsteinanalysis/ADMConstraints .
# prepare the docker environment
docker pull rynge/einsteintoolkit

Quick Start

As a quick start, you can prepare the docker environments and evaluate the Einstein Toolkit questions with:

python scripts/extract_dockerhub_images.py
bash pull_images.sh --sample
python evaluate_questions.py --questions data/questions/et_converted.jsonl --answers data/answers

Results will be saved to evaluation_results.json.
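
To inspect the results programmatically, a minimal sketch is shown below. The per-question layout (a mapping from question_id to a result dict with a "score" field) is an assumption; check your own evaluation_results.json for the actual structure.

import json

# Load the summary written by evaluate_questions.py (or the path passed via --output).
with open("evaluation_results.json") as f:
    results = json.load(f)

# ASSUMPTION: the file maps question_id -> result dict with a "score" field;
# inspect your own output to confirm the actual layout.
for question_id, result in results.items():
    print(question_id, result.get("score"))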

Evaluation

Docker Setup

Before running evaluation, pull the required images once:

# this may take a few minutes or hours depending on your network; make sure you have 50GB+ of free storage
bash pull_images.sh --all

Alternatively, you can pull only the images required for the questions you want to test in the data/questions folder:

# extract the image names; they will be written to `sample_images.txt`
python scripts/extract_dockerhub_images.py
# pull only the sample images
bash pull_images.sh --sample

You can also specify the Docker Hub username by adding --username <username> to the command.
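
If you prefer to drive the pulls from Python rather than pull_images.sh, the sketch below reads sample_images.txt and pulls each image. The one-image-reference-per-line format of sample_images.txt is an assumption.

import subprocess
from pathlib import Path

# Read the image list produced by scripts/extract_dockerhub_images.py.
# ASSUMPTION: sample_images.txt contains one Docker image reference per line.
images = [line.strip() for line in Path("sample_images.txt").read_text().splitlines() if line.strip()]

for image in images:
    # Equivalent to what pull_images.sh does for each entry.
    subprocess.run(["docker", "pull", image], check=True)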

Prepared Docker Images

All evaluations in AInsteinBench are executed inside pre-built Docker containers to ensure environment consistency, reproducibility, and isolation across different scientific codebases.

We pre-build and publish all required Docker images on Docker Hub for both supported question types:

Einstein Toolkit questions

Images contain a fully configured Einstein Toolkit environment, including the required thorns, build system, and test dependencies.

Multi-SWE-bench questions

Images correspond to specific repositories and pull requests (e.g., PySCF, AMReX), with the codebase, dependencies, and test harness pre-installed.

Questions/Answers Preparation

Prepare the questions and answers in the required format. The answer filename should match the corresponding question_id. For Einstein Toolkit questions, the answer should be a single file of C/C++/Fortran code. For Multi-SWE-bench questions, the answer should be a patch file.

As a reference, once you have downloaded the questions into the data/questions folder, you can extract the reference answers by running python scripts/extract_answers.py. The answers will be written to the data/answers folder.
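
For orientation, the sketch below checks that each question in a questions file has a matching answer file in the answers directory. The filename convention (answer files starting with the question_id, with a type-dependent extension) follows the description above, but the exact extensions are assumptions.

import json
from pathlib import Path

questions_file = Path("data/questions/msb_converted.jsonl")
answers_dir = Path("data/answers")

with questions_file.open() as f:
    question_ids = [json.loads(line)["question_id"] for line in f if line.strip()]

# ASSUMPTION: each answer file starts with the question_id; the exact
# extension (.patch, .F, .c, ...) depends on the question type.
for qid in question_ids:
    matches = list(answers_dir.glob(f"{qid}*"))
    status = "ok" if matches else "MISSING"
    print(f"{qid}: {status}")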

As an example, the reference answer to the question `MSB_pyscf_pyscf_pr2373` is:
diff --git a/pyscf/tdscf/rhf.py b/pyscf/tdscf/rhf.py
index b1b680b69f..d2cb63e086 100644
--- a/pyscf/tdscf/rhf.py
+++ b/pyscf/tdscf/rhf.py
@@ -530,16 +530,20 @@ def _charge_center(mol):
     return numpy.einsum('z,zr->r', charges, coords)/charges.sum()
 
 def _contract_multipole(tdobj, ints, hermi=True, xy=None):
+    '''ints is the integral tensor of a spin-independent operator'''
     if xy is None: xy = tdobj.xy
+    nstates = len(xy)
+    pol_shape = ints.shape[:-2]
+    nao = ints.shape[-1]
+
+    if not tdobj.singlet:
+        return numpy.zeros((nstates,) + pol_shape)
+
     mo_coeff = tdobj._scf.mo_coeff
     mo_occ = tdobj._scf.mo_occ
     orbo = mo_coeff[:,mo_occ==2]
     orbv = mo_coeff[:,mo_occ==0]
 
-    nstates = len(xy)
-    pol_shape = ints.shape[:-2]
-    nao = ints.shape[-1]
-
     #Incompatible to old numpy version
     #ints = numpy.einsum('...pq,pi,qj->...ij', ints, orbo.conj(), orbv)
     ints = lib.einsum('xpq,pi,qj->xij', ints.reshape(-1,nao,nao), orbo.conj(), orbv)

You can use your favorite agent to work on the questions and produce answers. We provide a minimal working agent in scripts/run_agent.py, which you can run with:

python scripts/run_agent.py \
  --question-file data/questions/msb_converted.jsonl \
  --api-key $OPENAI_API_KEY \
  --output-dir outputs/
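
Internally, such a minimal agent can be as simple as the rough loop sketched below: read each question, ask a model for a patch, and write an answer file named after the question_id. This is not the implementation of scripts/run_agent.py; the prompt construction, model name, and output filename extension are assumptions for illustration.

import json
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY from the environment
output_dir = Path("outputs")
output_dir.mkdir(parents=True, exist_ok=True)

with open("data/questions/msb_converted.jsonl") as f:
    for line in f:
        question = json.loads(line)
        # ASSUMPTION: the prompt is built from the issue title/body; the real
        # scripts/run_agent.py may use richer context and tool calls.
        prompt = (
            f"{question['content'].get('issue_title', '')}\n\n"
            f"{question['content'].get('issue_body', '')}\n\n"
            "Return a unified diff that fixes the issue."
        )
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        answer = response.choices[0].message.content
        # The answer filename follows the question_id, as required by the evaluator.
        (output_dir / f"{question['question_id']}.patch").write_text(answer)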

Run Evaluations

Evaluate answers by running python evaluate_questions.py --questions <questions_file> --answers <answers_dir> with your questions file and answers directory. For example, to evaluate the example questions in data/questions against the reference answers in data/answers, run:

# Evaluate Einstein Toolkit questions
python evaluate_questions.py \
    --questions data/questions/et_converted.jsonl \
    --answers data/answers \
    --output data/eval/et_eval.json

# Evaluate Multi-SWE-bench questions
python evaluate_questions.py \
    --questions data/questions/msb_converted.jsonl \
    --answers data/answers \
    --output data/eval/msb_eval.json

Repository Structure

AInsteinBench/
├── evaluate_questions.py          # Unified evaluator 
├── ainsteinbench/
│   ├── question.py                # Question class
│   └── utils/                     # Utility modules
├── curation/
│   ├── et/                        # ET data curation
│   │   ├── ET_evaluator.py        # Original ET evaluator
│   │   └── config_server.json     # ET configuration
│   └── msb/                       # MSB data curation
│       ├── log_parser.py          # log parser
│       ├── AMReXCodes/            # AMReX questions
│       └── pyscf/                 # PySCF questions
├── scripts/
├── data/
│   ├── questions/                 # Unified format questions
│   ├── answers/                   # Reference answers
│   ├── demo/                      # Demo questions (non-runnable)
│   └── raw/                       # Raw datasets
└── tests/
    ├── test_question.py           # Question class tests
    └── test_config.py             # Config tests

Data Curation

Question Format

Both question types share a common structure with the top-level fields "content", "environment", "answer", "test", and "scoring_config". Currently, we support the following question types (a small sanity-check sketch follows the two format examples below):

EinsteinToolkit Question

The format should follow:

{
  "question_id": "ET_ADMConstraints_InitSymBound",
  "question_type": "einstein_toolkit",
  "description": "Implement symmetry boundary initialization for ADM Constraints",
  "content": {
    "thorn_name": "EinsteinAnalysis/ADMConstraints",
    "src_filename": "InitSymBound.F",
    "interface": "...",
    "schedule": "...",
    "param": "...",
    "configuration": "...",
    "context": "...",
    "combined_doc_context": "..."
  },
  "environment": {
    "env_type": "docker",
    "docker_image": "rynge/einsteintoolkit",
    "working_directory": "/opt/Cactus",
    "setup_commands": ["..."]
  },
  "answer": {
    "src_code": "/* C/C++/Fortran implementation */"
  },
  "test": {
    "test_names": ["test1", "test2"],
    "benchmark_base": "Cactus/arrangements",
    "comparison_method": "numerical_tolerance"
  },
  "scoring_config": {
    "build_points": 40,
    "test_run_points": 10,
    "test_accuracy_points": 50,
    "tolerances": {
      "rtol": 1e-6,
      "atol": 1e-12
    }
  }
}

For a full example, see data/questions/et_converted.jsonl.

Multi-SWE-bench Question

The format should follow:

{
  "question_id": "MSB_AMReX_pr1234",
  "question_type": "multi_swe_bench",
  "description": "Fix boundary condition bug in AMReX Fortran interface",
  "content": {
    "org": "AMReX-Codes",
    "repo": "amrex",
    "pr_number": 1234,
    "issue_title": "...",
    "issue_body": "...",
    "base_commit": "abc123",
    "resolved_issues": [...],
    "commits": [...]
  },
  "environment": {
    "env_type": "docker",
    "docker_image": "mswebench/amrex-codes_m_amrex:pr-1234",
    "working_directory": "/testbed",
    "needs_build": true
  },
  "answer": {
    "fix_patch": "diff --git ..."
  },
  "test": {
    "test_patch": "diff --git ...",
    "pass_criteria": "all_tests_pass"
  },
  "scoring_config": {
    "scoring_method": "binary",
    "resolve_points": 100
  }
}

For a full example, see data/questions/msb_converted.jsonl.
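
As a quick sanity check on questions in either format, the sketch below loads a questions file and verifies that the shared top-level fields described above are present; the input filename is just an example.

import json

REQUIRED_FIELDS = {"question_id", "question_type", "description", "content",
                   "environment", "answer", "test", "scoring_config"}

with open("data/questions/et_converted.jsonl") as f:
    for line_no, line in enumerate(f, start=1):
        if not line.strip():
            continue
        question = json.loads(line)
        missing = REQUIRED_FIELDS - question.keys()
        if missing:
            print(f"line {line_no} ({question.get('question_id')}): missing {sorted(missing)}")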

If you have questions in the original Multi-SWE-bench or Einstein Toolkit format, we provide scripts to convert them to the AInsteinBench question format:

# Convert Einstein Toolkit dataset 
python3 scripts/convert_et_to_unified.py \
  --source local  --input data/raw/et_dataset.jsonl \
  --output data/questions/et_questions.jsonl \
  --thorn-to-tests curation/et/thorn_to_tests.txt

# Convert Multi-SWE-bench dataset
python3 scripts/convert_msb_to_unified.py \
  --input data/raw/msb_dataset.jsonl \
  --output data/questions/msb_questions.jsonl

Configuration

Einstein Toolkit

Edit the relevant part of curation/et/config.json:

{
  "docker": {
    "image_name": "rynge/einsteintoolkit",
    "working_directory": "/opt/Cactus",
    "build_timeout": 600,
    "test_timeout": 1200
  },
  "evaluation": {
    "build_points": 40,
    "test_run_points": 10,
    "test_accuracy_points": 50,
    "tolerances": {
      "rtol": 1e-6,
      "atol": 1e-12
    }
  }
}
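
For completeness, here is a small sketch for adjusting this configuration from Python, e.g. doubling the timeouts on a slow machine; the path and key names follow the snippet above.

import json
from pathlib import Path

config_path = Path("curation/et/config.json")
config = json.loads(config_path.read_text())

# Double the build and test timeouts, keeping everything else unchanged.
config["docker"]["build_timeout"] *= 2
config["docker"]["test_timeout"] *= 2

config_path.write_text(json.dumps(config, indent=2))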

Multi-SWE-bench

To add support for a new repository in Multi-SWE-bench, add a log parser module with a parse_log method to curation/msb/NewRepo/:

from multi_swe_bench.harness.instance import TestResult

class NewRepoInstance:
    def parse_log(self, log: str) -> TestResult:
        # Parse test results from log
        return TestResult(
            passed_count=...,
            failed_count=...,
            passed_tests=set([...]),
            failed_tests=set([...])
        )
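
As a hedged illustration of the elided parsing logic, the sketch below assumes a hypothetical log format where each test line ends with PASSED or FAILED; a real repository needs a parser matched to its own test harness output, and TestResult may accept additional fields beyond those shown here.

import re
from multi_swe_bench.harness.instance import TestResult

class NewRepoInstance:
    # ASSUMPTION: lines look like "path/to/test_file.py::test_name PASSED".
    _LINE_RE = re.compile(r"^(?P<name>\S+)\s+(?P<status>PASSED|FAILED)\s*$", re.MULTILINE)

    def parse_log(self, log: str) -> TestResult:
        passed, failed = set(), set()
        for match in self._LINE_RE.finditer(log):
            if match.group("status") == "PASSED":
                passed.add(match.group("name"))
            else:
                failed.add(match.group("name"))
        return TestResult(
            passed_count=len(passed),
            failed_count=len(failed),
            passed_tests=passed,
            failed_tests=failed,
        )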
