AInsteinBench is a benchmark for evaluating the capabilities of AI agents in solving scientific computing (a.k.a. scico) problems. It currently supports coding questions in the Einstein Toolkit and Multi-SWE-bench formats.
- Python 3.8+
- Docker
- Required packages:
pip install -r requirements.txt
Optional (for multi-swe-bench type questions)
# install multi-swe-bench
git clone https://github.com/multi-swe-bench/multi-swe-bench.git && cd multi-swe-bench && make install
# the docker images for the questions are available upon request
Optional (for Einstein Toolkit type questions)
cd curation/et
# install Einstein Toolkit following https://einsteintoolkit.org/download.html
curl -kLO https://raw.githubusercontent.com/gridaphobe/CRL/ET_2025_05/GetComponents
chmod a+x GetComponents
./GetComponents --shallow https://bitbucket.org/einsteintoolkit/manifest/raw/ET_2025_05/einsteintoolkit.th
# Then the curation/et/Cactus folder should be populated with the Einstein Toolkit code and tests.
# the ADMConstraints thorn is outdated but is the one used in the Docker image; get it by
git clone https://bitbucket.org/einsteintoolkit/einsteinanalysis.git
cp -r einsteinanalysis/ADMConstraints .
# prepare the docker environment
docker pull rynge/einsteintoolkit
As a quick start, you can prepare the Docker environments and evaluate the Einstein Toolkit questions by:
python scripts/extract_dockerhub_images.py
bash pull_images.sh --sample
python evaluate_questions.py --questions data/questions/et_converted.jsonl --answers data/answers
The results will be saved to evaluation_results.json.
Before running evaluation, pull the required images once:
# this may take from a few minutes to a few hours depending on your network; make sure you have 50GB+ of free storage
bash pull_images.sh --all
You can also pull only the images required for the specific questions you want to test in the data/questions folder:
# extract the image names; they will be written to `sample_images.txt`
python scripts/extract_dockerhub_images.py
# pull only the sample images
bash pull_images.sh --sample
You can also specify the Docker Hub username by adding --username <username> to the command.
Prepared Docker Images
All evaluations in AInsteinBench are executed inside pre-built Docker containers to ensure environment consistency, reproducibility, and isolation across different scientific codebases.
We pre-build and publish all required Docker images on Docker Hub for both supported question types:
- Einstein Toolkit questions: images contain a fully configured Einstein Toolkit environment, including the required thorns, build system, and test dependencies.
- Multi-SWE-bench questions: images correspond to specific repositories and pull requests (e.g., PySCF, AMReX), with the codebase, dependencies, and test harness pre-installed.
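If you prefer to drive the image pulls from Python instead of pull_images.sh, the minimal sketch below does the same job; it assumes that sample_images.txt (written by scripts/extract_dockerhub_images.py) lists one image name per line, which is an assumption about the file layout.

import subprocess
from pathlib import Path

# Sketch: pull every image listed in sample_images.txt.
# Assumes one image name per line; adjust if the actual layout differs.
for image in Path("sample_images.txt").read_text().splitlines():
    image = image.strip()
    if image:
        subprocess.run(["docker", "pull", image], check=True)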
Prepare the questions and answers in the required format. The answer filename should match the corresponding "question_id". For Einstein Toolkit type questions, the answer should be a single C/C++/Fortran source file. For Multi-SWE-bench type questions, the answer should be a patch file.
As a reference, once you have downloaded the questions into the data/questions folder, you can extract the reference answers by running python scripts/extract_answers.py. The answers will be written to the data/answers folder.
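Conceptually, this extraction just writes each question's reference answer to a file named after its question_id. The sketch below illustrates the idea; the .patch and .F extensions are illustrative assumptions, and the actual scripts/extract_answers.py may behave differently.

import json
from pathlib import Path

answers_dir = Path("data/answers")
answers_dir.mkdir(parents=True, exist_ok=True)

for questions_file in ["data/questions/et_converted.jsonl",
                       "data/questions/msb_converted.jsonl"]:
    for line in Path(questions_file).read_text().splitlines():
        q = json.loads(line)
        if q["question_type"] == "multi_swe_bench":
            # Multi-SWE-bench answers are patch files
            out = answers_dir / f"{q['question_id']}.patch"
            out.write_text(q["answer"]["fix_patch"])
        else:
            # Einstein Toolkit answers are a single source file
            out = answers_dir / f"{q['question_id']}.F"
            out.write_text(q["answer"]["src_code"])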
As an example, the reference answer to the question `MSB_pyscf_pyscf_pr2373` is the following patch:
diff --git a/pyscf/tdscf/rhf.py b/pyscf/tdscf/rhf.py
index b1b680b69f..d2cb63e086 100644
--- a/pyscf/tdscf/rhf.py
+++ b/pyscf/tdscf/rhf.py
@@ -530,16 +530,20 @@ def _charge_center(mol):
return numpy.einsum('z,zr->r', charges, coords)/charges.sum()
def _contract_multipole(tdobj, ints, hermi=True, xy=None):
+ '''ints is the integral tensor of a spin-independent operator'''
if xy is None: xy = tdobj.xy
+ nstates = len(xy)
+ pol_shape = ints.shape[:-2]
+ nao = ints.shape[-1]
+
+ if not tdobj.singlet:
+ return numpy.zeros((nstates,) + pol_shape)
+
mo_coeff = tdobj._scf.mo_coeff
mo_occ = tdobj._scf.mo_occ
orbo = mo_coeff[:,mo_occ==2]
orbv = mo_coeff[:,mo_occ==0]
- nstates = len(xy)
- pol_shape = ints.shape[:-2]
- nao = ints.shape[-1]
-
#Incompatible to old numpy version
#ints = numpy.einsum('...pq,pi,qj->...ij', ints, orbo.conj(), orbv)
ints = lib.einsum('xpq,pi,qj->xij', ints.reshape(-1,nao,nao), orbo.conj(), orbv)
You can use your favorite agent to work on the questions and produce answers. We provide a minimal working agent in scripts/run_agent.py. You can run it by:
python scripts/run_agent.py \
--question-file data/questions/msb_converted.jsonl \
--api-key $OPENAI_API_KEY \
--output-dir outputs/
Evaluate the answers by running python evaluate_questions.py --questions <questions_file> --answers <answers_dir> with your questions file and answers directory. For example, to evaluate the example questions in data/questions and the reference answers in data/answers, run:
# Evaluate Einstein Toolkit questions
python evaluate_questions.py \
--questions data/questions/et_converted.jsonl \
--answers data/answers \
--output data/eval/et_eval.json
# Evaluate Multi-SWE-bench questions
python evaluate_questions.py \
--questions data/questions/msb_converted.jsonl \
--answers data/answers \
--output data/eval/msb_eval.json
AInsteinBench/
├── evaluate_questions.py # Unified evaluator
├── ainsteinbench/
│ ├── question.py # Question class
│ └── utils/ # Utility modules
├── curation/
│ ├── et/ # ET data curation
│ │ ├── ET_evaluator.py # Original ET evaluator
│ │ └── config_server.json # ET configuration
│ └── msb/ # MSB data curation
│ ├── log_parser.py # log parser
│ ├── AMReXCodes/ # AMReX questions
│ └── pyscf/ # PySCF questions
├── scripts/
├── data/
│ ├── questions/ # Unified format questions
│ ├── answers/ # Reference answers
│ ├── demo/ # Demo questions (non-runnable)
│ └── raw/ # Raw datasets
└── tests/
├── test_question.py # Question class tests
└── test_config.py # Config tests
Both question types share a common structure with the top-level fields "content", "environment", "answer", "test", and "scoring_config". Currently, we support the following two question types.
For Einstein Toolkit questions, the format should follow:
{
"question_id": "ET_ADMConstraints_InitSymBound",
"question_type": "einstein_toolkit",
"description": "Implement symmetry boundary initialization for ADM Constraints",
"content": {
"thorn_name": "EinsteinAnalysis/ADMConstraints",
"src_filename": "InitSymBound.F",
"interface": "...",
"schedule": "...",
"param": "...",
"configuration": "...",
"context": "...",
"combined_doc_context": "..."
},
"environment": {
"env_type": "docker",
"docker_image": "rynge/einsteintoolkit",
"working_directory": "/opt/Cactus",
"setup_commands": ["..."]
},
"answer": {
"src_code": "/* C/C++/Fortran implementation */"
},
"test": {
"test_names": ["test1", "test2"],
"benchmark_base": "Cactus/arrangements",
"comparison_method": "numerical_tolerance"
},
"scoring_config": {
"build_points": 40,
"test_run_points": 10,
"test_accuracy_points": 50,
"tolerances": {
"rtol": 1e-6,
"atol": 1e-12
}
}
}
For a full example, see data/questions/et_converted.jsonl.
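The question files are plain JSONL, so they can be inspected with the standard library alone. The sketch below is only illustrative and does not use the Question class in ainsteinbench/question.py; it prints a few fields of each Einstein Toolkit question.

import json
from pathlib import Path

# Print a few fields of each Einstein Toolkit question.
for line in Path("data/questions/et_converted.jsonl").read_text().splitlines():
    q = json.loads(line)
    print(q["question_id"], q["content"]["thorn_name"], q["content"]["src_filename"])
    print("  docker image:", q["environment"]["docker_image"])
    print("  tests:", q["test"]["test_names"])
    print("  tolerances:", q["scoring_config"]["tolerances"])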
For Multi-SWE-bench questions, the format should follow:
{
"question_id": "MSB_AMReX_pr1234",
"question_type": "multi_swe_bench",
"description": "Fix boundary condition bug in AMReX Fortran interface",
"content": {
"org": "AMReX-Codes",
"repo": "amrex",
"pr_number": 1234,
"issue_title": "...",
"issue_body": "...",
"base_commit": "abc123",
"resolved_issues": [...],
"commits": [...]
},
"environment": {
"env_type": "docker",
"docker_image": "mswebench/amrex-codes_m_amrex:pr-1234",
"working_directory": "/testbed",
"needs_build": true
},
"answer": {
"fix_patch": "diff --git ..."
},
"test": {
"test_patch": "diff --git ...",
"pass_criteria": "all_tests_pass"
},
"scoring_config": {
"scoring_method": "binary",
"resolve_points": 100
}
}
For a full example, see data/questions/msb_converted.jsonl.
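Because both question types share the same top-level structure, a lightweight sanity check can catch malformed question files before evaluation. The sketch below is not part of the repository and only checks the common top-level fields.

import json
from pathlib import Path

REQUIRED_FIELDS = {"question_id", "question_type", "content",
                   "environment", "answer", "test", "scoring_config"}

def check_questions(path):
    """Report questions that are missing a required top-level field."""
    for i, line in enumerate(Path(path).read_text().splitlines(), start=1):
        q = json.loads(line)
        missing = REQUIRED_FIELDS - q.keys()
        if missing:
            print(f"line {i} ({q.get('question_id', '?')}): missing {sorted(missing)}")

check_questions("data/questions/msb_converted.jsonl")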
If you have questions in the original Multi-SWE-bench or Einstein Toolkit format, we provide handy scripts for converting them to the AInsteinBench question format.
# Convert Einstein Toolkit dataset
python3 scripts/convert_et_to_unified.py \
--source local --input data/raw/et_dataset.jsonl \
--output data/questions/et_questions.jsonl \
--thorn-to-tests curation/et/thorn_to_tests.txt
# Convert Multi-SWE-bench dataset
python3 scripts/convert_msb_to_unified.py \
--input data/raw/msb_dataset.jsonl \
--output data/questions/msb_questions.jsonl
To adjust the Einstein Toolkit evaluation settings (Docker image, timeouts, and scoring), edit the relevant part of curation/et/config.json:
{
"docker": {
"image_name": "rynge/einsteintoolkit",
"working_directory": "/opt/Cactus",
"build_timeout": 600,
"test_timeout": 1200
},
"evaluation": {
"build_points": 40,
"test_run_points": 10,
"test_accuracy_points": 50,
"tolerances": {
"rtol": 1e-6,
"atol": 1e-12
}
}
}
To add support for a new repository in Multi-SWE-bench, add a log parser module with a parse_log method under curation/msb/NewRepo/:
from multi_swe_bench.harness.instance import TestResult

class NewRepoInstance:
    def parse_log(self, log: str) -> TestResult:
        # Extract per-test results from the raw test log. The "PASS:"/"FAIL:"
        # prefixes below are only placeholders; adapt them to the output
        # format of the new repository's test runner.
        passed_tests = set()
        failed_tests = set()
        for line in log.splitlines():
            if line.startswith("PASS: "):
                passed_tests.add(line[len("PASS: "):].strip())
            elif line.startswith("FAIL: "):
                failed_tests.add(line[len("FAIL: "):].strip())
        return TestResult(
            passed_count=len(passed_tests),
            failed_count=len(failed_tests),
            passed_tests=passed_tests,
            failed_tests=failed_tests,
        )