AInsteinBench is a benchmark for evaluating the capabilities of AI agents in solving scientific computing (a.k.a. scico) problems. It currently supports coding questions in the Einstein Toolkit and Multi-SWE-bench formats.
- Python 3.8+
- Docker
- Required packages:
pip install -r requirements.txt
Optional (for multi-swe-bench type questions)
# install multi-swe-bench
git clone https://github.com/multi-swe-bench/multi-swe-bench.git && cd multi-swe-bench && make install
# the docker images for the questions are available upon request
Optional (for Einstein Toolkit type questions)
cd curation/et
# install Einstein Toolkit following https://einsteintoolkit.org/download.html
curl -kLO https://raw.githubusercontent.com/gridaphobe/CRL/ET_2025_05/GetComponents
chmod a+x GetComponents
./GetComponents --shallow https://bitbucket.org/einsteintoolkit/manifest/raw/ET_2025_05/einsteintoolkit.th
# Then the curation/et/Cactus folder should be populated with the Einstein Toolkit code and tests.
# the ADMConstraints thorn is outdated but is the one used in the Docker image; get it by
git clone https://bitbucket.org/einsteintoolkit/einsteinanalysis.git
cp -r einsteinanalysis/ADMConstraints .
# prepare the docker environment
docker pull rynge/einsteintoolkit
As a quick start, you can prepare the Docker environments and evaluate the Einstein Toolkit questions by:
python scripts/extract_dockerhub_images.py
bash pull_images.sh --sample
python evaluate_questions.py --questions data/questions/et_converted.jsonl --answers data/answers
The results will be saved to evaluation_results.json.
Before running evaluation, pull the required images once:
# this may take from a few minutes to a few hours depending on your network; make sure you have 50GB+ of free storage
bash pull_images.sh --all
You can also pull only the images required for the specific questions you want to test in the data/questions folder:
# extract the image names; they will be written to `sample_images.txt`
python scripts/extract_dockerhub_images.py
# pull only the sample images
bash pull_images.sh --sample
You can also specify the Docker Hub username by adding --username <username> to the command.
Prepared Docker Images
All evaluations in AInsteinBench are executed inside pre-built Docker containers to ensure environment consistency, reproducibility, and isolation across different scientific codebases.
We pre-build and publish all required Docker images on Docker Hub for both supported question types:
- Einstein Toolkit questions: images contain a fully configured Einstein Toolkit environment, including the required thorns, build system, and test dependencies.
- Multi-SWE-bench questions: images correspond to specific repositories and pull requests (e.g., PySCF, AMReX), with the codebase, dependencies, and test harness pre-installed.
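If you prefer to drive the image pulls from Python instead of pull_images.sh, the minimal sketch below does the same job; it assumes that sample_images.txt (written by scripts/extract_dockerhub_images.py) lists one image name per line, which is an assumption about the file layout.

import subprocess
from pathlib import Path

# Sketch: pull every image listed in sample_images.txt.
# Assumes one image name per line; adjust if the actual layout differs.
for image in Path("sample_images.txt").read_text().splitlines():
    image = image.strip()
    if image:
        subprocess.run(["docker", "pull", image], check=True)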
Prepare the questions and answers in the required format. The answer filename should match the corresponding "question_id". For Einstein Toolkit type questions, the answer should be a single C/C++/Fortran source file. For Multi-SWE-bench type questions, the answer should be a patch file.
As a reference, once you have downloaded the questions into the data/questions folder, you can extract the reference answers by running python scripts/extract_answers.py. The answers will be written to the data/answers folder.
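Conceptually, this extraction just writes each question's reference answer to a file named after its question_id. The sketch below illustrates the idea; the .patch and .F extensions are illustrative assumptions, and the actual scripts/extract_answers.py may behave differently.

import json
from pathlib import Path

answers_dir = Path("data/answers")
answers_dir.mkdir(parents=True, exist_ok=True)

for questions_file in ["data/questions/et_converted.jsonl",
                       "data/questions/msb_converted.jsonl"]:
    for line in Path(questions_file).read_text().splitlines():
        q = json.loads(line)
        if q["question_type"] == "multi_swe_bench":
            # Multi-SWE-bench answers are patch files
            out = answers_dir / f"{q['question_id']}.patch"
            out.write_text(q["answer"]["fix_patch"])
        else:
            # Einstein Toolkit answers are a single source file
            out = answers_dir / f"{q['question_id']}.F"
            out.write_text(q["answer"]["src_code"])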
As an example, the reference answer to the question `MSB_pyscf_pyscf_pr2373` is the following patch:
diff --git a/pyscf/tdscf/rhf.py b/pyscf/tdscf/rhf.py
index b1b680b69f..d2cb63e086 100644
--- a/pyscf/tdscf/rhf.py
+++ b/pyscf/tdscf/rhf.py
@@ -530,16 +530,20 @@ def _charge_center(mol):
return numpy.einsum('z,zr->r', charges, coords)/charges.sum()
def _contract_multipole(tdobj, ints, hermi=True, xy=None):
+ '''ints is the integral tensor of a spin-independent operator'''
if xy is None: xy = tdobj.xy
+ nstates = len(xy)
+ pol_shape = ints.shape[:-2]
+ nao = ints.shape[-1]
+
+ if not tdobj.singlet:
+ return numpy.zeros((nstates,) + pol_shape)
+
mo_coeff = tdobj._scf.mo_coeff
mo_occ = tdobj._scf.mo_occ
orbo = mo_coeff[:,mo_occ==2]
orbv = mo_coeff[:,mo_occ==0]
- nstates = len(xy)
- pol_shape = ints.shape[:-2]
- nao = ints.shape[-1]
-
#Incompatible to old numpy version
#ints = numpy.einsum('...pq,pi,qj->...ij', ints, orbo.conj(), orbv)
ints = lib.einsum('xpq,pi,qj->xij', ints.reshape(-1,nao,nao), orbo.conj(), orbv)
You can use your favorite agent to work on the questions and produce answers. We provide a minimal working agent in scripts/run_agent.py. You can run it by:
python scripts/run_agent.py \
--question-file data/questions/msb_converted.jsonl \
--api-key $OPENAI_API_KEY \
--output-dir outputs/
Evaluate the answers by running python evaluate_questions.py --questions <questions_file> --answers <answers_dir> with your questions file and answers directory. For example, to evaluate the example questions in data/questions and the reference answers in data/answers, run:
# Evaluate Einstein Toolkit questions
python evaluate_questions.py \
--questions data/questions/et_converted.jsonl \
--answers data/answers \
--output data/eval/et_eval.json
# Evaluate Multi-SWE-bench questions
python evaluate_questions.py \
--questions data/questions/msb_converted.jsonl \
--answers data/answers \
--output data/eval/msb_eval.json
AInsteinBench/
├── evaluate_questions.py # Unified evaluator
├── ainsteinbench/
│ ├── question.py # Question class
│ └── utils/ # Utility modules
├── curation/
│ ├── et/ # ET data curation
│ │ ├── ET_evaluator.py # Original ET evaluator
│ │ └── config_server.json # ET configuration
│ └── msb/ # MSB data curation
│ ├── log_parser.py # log parser
│ ├── AMReXCodes/ # AMReX questions
│ └── pyscf/ # PySCF questions
├── scripts/
├── data/
│ ├── questions/ # Unified format questions
│ ├── answers/ # Reference answers
│ ├── demo/ # Demo questions (non-runnable)
│ └── raw/ # Raw datasets
└── tests/
├── test_question.py # Question class tests
└── test_config.py # Config tests
Both question types share a common structure with the top-level fields "content", "environment", "answer", "test", and "scoring_config". Currently, we support the following two question types.
For Einstein Toolkit questions, the format should follow:
{
"question_id": "ET_ADMConstraints_InitSymBound",
"question_type": "einstein_toolkit",
"description": "Implement symmetry boundary initialization for ADM Constraints",
"content": {
"thorn_name": "EinsteinAnalysis/ADMConstraints",
"src_filename": "InitSymBound.F",
"interface": "...",
"schedule": "...",
"param": "...",
"configuration": "...",
"context": "...",
"combined_doc_context": "..."
},
"environment": {
"env_type": "docker",
"docker_image": "rynge/einsteintoolkit",
"working_directory": "/opt/Cactus",
"setup_commands": ["..."]
},
"answer": {
"src_code": "/* C/C++/Fortran implementation */"
},
"test": {
"test_names": ["test1", "test2"],
"benchmark_base": "Cactus/arrangements",
"comparison_method": "numerical_tolerance"
},
"scoring_config": {
"build_points": 40,
"test_run_points": 10,
"test_accuracy_points": 50,
"tolerances": {
"rtol": 1e-6,
"atol": 1e-12
}
}
}
For a full example, see data/questions/et_converted.jsonl.
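The question files are plain JSONL, so they can be inspected with the standard library alone. The sketch below is only illustrative and does not use the Question class in ainsteinbench/question.py; it prints a few fields of each Einstein Toolkit question.

import json
from pathlib import Path

# Print a few fields of each Einstein Toolkit question.
for line in Path("data/questions/et_converted.jsonl").read_text().splitlines():
    q = json.loads(line)
    print(q["question_id"], q["content"]["thorn_name"], q["content"]["src_filename"])
    print("  docker image:", q["environment"]["docker_image"])
    print("  tests:", q["test"]["test_names"])
    print("  tolerances:", q["scoring_config"]["tolerances"])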
For Multi-SWE-bench questions, the format should follow:
{
"question_id": "MSB_AMReX_pr1234",
"question_type": "multi_swe_bench",
"description": "Fix boundary condition bug in AMReX Fortran interface",
"content": {
"org": "AMReX-Codes",
"repo": "amrex",
"pr_number": 1234,
"issue_title": "...",
"issue_body": "...",
"base_commit": "abc123",
"resolved_issues": [...],
"commits": [...]
},
"environment": {
"env_type": "docker",
"docker_image": "mswebench/amrex-codes_m_amrex:pr-1234",
"working_directory": "/testbed",
"needs_build": true
},
"answer": {
"fix_patch": "diff --git ..."
},
"test": {
"test_patch": "diff --git ...",
"pass_criteria": "all_tests_pass"
},
"scoring_config": {
"scoring_method": "binary",
"resolve_points": 100
}
}
For a full example, see data/questions/msb_converted.jsonl.
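Because both question types share the same top-level structure, a lightweight sanity check can catch malformed question files before evaluation. The sketch below is not part of the repository and only checks the common top-level fields.

import json
from pathlib import Path

REQUIRED_FIELDS = {"question_id", "question_type", "content",
                   "environment", "answer", "test", "scoring_config"}

def check_questions(path):
    """Report questions that are missing a required top-level field."""
    for i, line in enumerate(Path(path).read_text().splitlines(), start=1):
        q = json.loads(line)
        missing = REQUIRED_FIELDS - q.keys()
        if missing:
            print(f"line {i} ({q.get('question_id', '?')}): missing {sorted(missing)}")

check_questions("data/questions/msb_converted.jsonl")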
If you have questions in the original Multi-SWE-bench or Einstein Toolkit format, we provide handy scripts for converting them to the AInsteinBench question format.
# Convert Einstein Toolkit dataset
python3 scripts/convert_et_to_unified.py \
--source local --input data/raw/et_dataset.jsonl \
--output data/questions/et_questions.jsonl \
--thorn-to-tests curation/et/thorn_to_tests.txt
# Convert Multi-SWE-bench dataset
python3 scripts/convert_msb_to_unified.py \
--input data/raw/msb_dataset.jsonl \
--output data/questions/msb_questions.jsonl
To adjust the Einstein Toolkit evaluation settings (Docker image, timeouts, and scoring), edit the relevant part of curation/et/config.json:
{
"docker": {
"image_name": "rynge/einsteintoolkit",
"working_directory": "/opt/Cactus",
"build_timeout": 600,
"test_timeout": 1200
},
"evaluation": {
"build_points": 40,
"test_run_points": 10,
"test_accuracy_points": 50,
"tolerances": {
"rtol": 1e-6,
"atol": 1e-12
}
}
}
To add support for a new repository in Multi-SWE-bench, add a log parser module with a parse_log method under curation/msb/NewRepo/:
from multi_swe_bench.harness.instance import TestResult

class NewRepoInstance:
    def parse_log(self, log: str) -> TestResult:
        # Extract per-test results from the raw test log. The "PASS:"/"FAIL:"
        # prefixes below are only placeholders; adapt them to the output
        # format of the new repository's test runner.
        passed_tests = set()
        failed_tests = set()
        for line in log.splitlines():
            if line.startswith("PASS: "):
                passed_tests.add(line[len("PASS: "):].strip())
            elif line.startswith("FAIL: "):
                failed_tests.add(line[len("FAIL: "):].strip())
        return TestResult(
            passed_count=len(passed_tests),
            failed_count=len(failed_tests),
            passed_tests=passed_tests,
            failed_tests=failed_tests,
        )