OpenGameEval is an evaluation framework for testing LLMs on Roblox game development tasks. This repository contains open-sourced evaluation scripts and tools for running automated assessments in the Roblox Studio environment.
The LLM Leaderboard (LLM_LEADERBOARD.md) summarizes benchmark results and progress for all evaluated Large Language Models in this repository.
You'll need a Roblox account. If you don't have one, create a free account at roblox.com.
To interact with the OpenGameEval API, you need to create an OpenCloud API key:
- Navigate to the Creator Hub and log in. Make sure you are viewing as a user, not a group.
- Go to All tools (or OpenCloud) > API Keys.
- Create a new key with:
  - Access Permissions: studio-evaluations
  - Operations: create
- Set an expiration date (recommended: 90 days).
- Save and copy the generated key; it will be used as <OPEN_GAME_EVAL_API_KEY> in the following commands.
git clone https://github.com/Roblox/open-game-eval.git
cd open-game-eval
The project uses uv for dependency management. Install dependencies:
# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
# Or with Homebrew
brew install uv
# Or with pip
pip install uv
Important: You must provide your own LLM credentials (--llm-name and --llm-api-key) to run evaluations.
You may save your API keys in a file named .env. See .env.example for a sample.
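For reference, a minimal `.env` might look like the sketch below. The exact variable names the script reads are defined in `.env.example`, so treat this as an illustration; the names here are simply the environment variables mentioned elsewhere in this README.

```
# Sketch of a .env file; check .env.example for the authoritative variable names
OPEN_GAME_EVAL_API_KEY=<your-open-game-eval-api-key>
LLM_API_KEY=<your-llm-api-key>
```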
# Set envvar
export OPEN_GAME_EVAL_API_KEY=<your-open-game-eval-api-key>
export ANTHROPIC_API_KEY=<your-anthropic-api-key>
# Pass in OpenGameEval API key and LLM API key (required)
uv run invoke_eval.py --files "Evals/001_make_cars_faster.lua" \
--api-key $OPEN_GAME_EVAL_API_KEY \
--llm-name "claude" \
--llm-api-key $ANTHROPIC_API_KEY
The output shows the status as "Submitted" along with a URL; open that URL while logged in to the Roblox account that owns the API key to check the status of the eval.
Evals/001_make_cars_faster.lua : Submitted - https://apis.roblox.com/open-eval-api/v1/eval-records/b7647585-5e1f-46b5-a8be-797539b65cc5
An eval commonly takes 3-4 minutes to run and gather results. The script polls for results every 10 seconds and prints a status update every 30 seconds.
Once the eval completes, the script reports whether the run succeeded. The default timeout is 10 minutes.
Evals/001_make_cars_faster.lua : Success
Success rate: 100.00% (1/1)
After the eval completes, a result object is returned as part of the HTTP response. It is accessible through https://apis.roblox.com/open-eval-api/v1/eval-records/{jobId}
An eval is considered a pass only if all checks pass.
"results": [
{
"mode": "[EDIT]",
"result": {
"passes": 1,
"fails": 0,
"checks": 1,
"warning": "",
"error": "",
"interruptions": []
}
}
],
- passes: Number of checks passed.
- fails: Number of checks failed.
- checks: Total number of checks; equal to passes + fails.
- warning: Warnings received when running the eval.
- error: Errors received when running the eval.
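If you script against the record endpoint directly, the sketch below shows one way to read these fields with Python and `requests` and apply the all-checks-must-pass rule. The exact nesting of the response outside the "results" array shown above is an assumption; verify it against a real response.

```python
# Minimal sketch: fetch an eval record and decide pass/fail from its results.
# Assumes the response contains the "results" array shown above.
import os
import requests

API_BASE = "https://apis.roblox.com/open-eval-api/v1"

def eval_passed(job_id: str) -> bool:
    resp = requests.get(
        f"{API_BASE}/eval-records/{job_id}",
        headers={"x-api-key": os.environ["OPEN_GAME_EVAL_API_KEY"]},
        timeout=30,
    )
    resp.raise_for_status()
    record = resp.json()
    results = record.get("results", [])  # assumed top-level location of the results array
    # An eval passes only if every check in every mode passes.
    return bool(results) and all(entry["result"]["fails"] == 0 for entry in results)
```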
# Set envvar
export OPEN_GAME_EVAL_API_KEY=<your-open-game-eval-api-key>
export ANTHROPIC_API_KEY=<your-anthropic-api-key>
# Run all evaluations
uv run invoke_eval.py --files "Evals/*.lua" --api-key $OPEN_GAME_EVAL_API_KEY --llm-name "claude" --llm-api-key $ANTHROPIC_API_KEY
# Run specific pattern
uv run invoke_eval.py --files "Evals/0*_*.lua" --api-key $OPEN_GAME_EVAL_API_KEY --llm-name "claude" --llm-api-key $ANTHROPIC_API_KEY
# Run with concurrency limit
uv run invoke_eval.py --files "Evals/*.lua" --max-concurrent 5 --api-key $OPEN_GAME_EVAL_API_KEY --llm-name "claude" --llm-api-key $ANTHROPIC_API_KEY
Please make sure the LLM API key you pass is the correct key for the chosen model provider.
# Set envvar
export OPEN_GAME_EVAL_API_KEY=<your-open-game-eval-api-key>
export GEMINI_API_KEY=<your-gemini-api-key>
export ANTHROPIC_API_KEY=<your-anthropic-api-key>
export OPENAI_API_KEY=<your-openai-api-key>
# With Gemini
uv run invoke_eval.py --files "Evals/001_make_cars_faster.lua" \
--api-key $OPEN_GAME_EVAL_API_KEY \
--llm-name "gemini" \
--llm-model-version "gemini-2.5-pro" \
--llm-api-key $GEMINI_API_KEY
# With Claude
uv run invoke_eval.py --files "Evals/001_make_cars_faster.lua" \
--api-key $OPEN_GAME_EVAL_API_KEY \
--llm-name "claude" \
--llm-model-version "claude-sonnet-4-5-20250929" \
--llm-api-key $ANTHROPIC_API_KEY
# With OpenAI
uv run invoke_eval.py --files "Evals/001_make_cars_faster.lua" \
--api-key $OPEN_GAME_EVAL_API_KEY \
--llm-name "openai" \
--llm-model-version "gpt-5" \
--llm-api-key $OPENAI_API_KEY
Usage: uv run invoke_eval.py [OPTIONS]
Required Options:
--files TEXT [TEXT ...] Lua files to evaluate (supports wildcards)
--api-key TEXT Open Cloud API key with the studio-evaluation permission (or set the OPEN_GAME_EVAL_API_KEY env var)
Required if running evals through an LLM (not using reference mode):
--llm-name TEXT Name of provider: claude | gemini | openai (REQUIRED)
--llm-api-key TEXT LLM API key (REQUIRED, or set LLM_API_KEY env var)
Optional:
--llm-model-version TEXT LLM model version, e.g. claude-4-sonnet-20250514
--llm-url TEXT LLM endpoint URL. Not yet supported; pass a placeholder string.
--max-concurrent INTEGER Maximum concurrent evaluations
--use-reference-mode Use reference mode for evaluation. This skips LLM and uses reference code for debugging eval contributions.
--verbose-headers Output HTTP request and response headers for debugging
Note: --llm-name and --llm-api-key are required to ensure evaluations use your own LLM API key. The only exception is --use-reference-mode, which doesn't call an LLM.
Available model-versions:
- For Gemini models (provider-name: "gemini")
  - gemini-2.5-pro
- For Claude models (provider-name: "claude")
  - claude-4-sonnet-20250514
  - claude-sonnet-4-5-20250929
  - claude-haiku-4-5-20251001
- For OpenAI models (provider-name: "openai")
  - gpt-5
  - gpt-5-mini
To ensure the stability of the public API, we implement rate limiting. Exceeding these limits results in a 429 Too Many Requests status code.
Endpoint: POST /open-eval-api/v1/eval
| Limit Type | Rate | Time Window |
|---|---|---|
| Per API Key | 50 requests | Per hour |
| Per API Key | 100 requests | Per day |
| Per IP Address | 100 requests | Per day |
Endpoint: GET /open-eval-api/v1/eval-records/{jobId}
| Limit Type | Rate | Time Window |
|---|---|---|
| Per API Key | 60 requests | Per minute |
| Per IP Address | 60 requests | Per minute |
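If you poll the record endpoint from your own scripts, it helps to back off on 429 responses rather than retrying immediately. Below is a minimal Python sketch; the retry and backoff parameters are arbitrary choices, not values required by the API.

```python
# Minimal sketch: poll an eval record while backing off on 429 Too Many Requests.
# Backoff parameters are arbitrary; tune them to stay under the limits above.
import os
import time
import requests

API_BASE = "https://apis.roblox.com/open-eval-api/v1"
HEADERS = {"x-api-key": os.environ["OPEN_GAME_EVAL_API_KEY"]}

def get_record_with_backoff(job_id: str, max_attempts: int = 8) -> dict:
    delay = 2.0
    for _ in range(max_attempts):
        resp = requests.get(f"{API_BASE}/eval-records/{job_id}",
                            headers=HEADERS, timeout=30)
        if resp.status_code == 429:
            time.sleep(delay)             # rate limited: wait and retry
            delay = min(delay * 2, 60.0)  # exponential backoff, capped at 60s
            continue
        resp.raise_for_status()
        return resp.json()
    raise RuntimeError("Still rate limited after retries")
```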
- LLM Name/API Key Required: You must provide `--llm-name` and `--llm-api-key` (or set `LLM_API_KEY` in `.env`). Evaluations run with your own LLM credentials.
- API Key Not Found: Ensure your Open Game Eval API key is set in the `.env` file or passed via `--api-key`. See `.env.example` for an example.
- Permission Denied: Verify your API key has the proper scope (`studio-evaluation:create`).
- Timeout Errors: Evaluations have a 10-minute timeout.
- File Not Found: Check file paths and ensure the evaluation files exist.
- SSL certificate verify failed: Find `Install Certificates.command` in Finder and execute it. (See details and other solutions.)
- No output from Lua: If the eval failed with the error `Error occurred, no output from Lua`, incorrect LLM info was passed in. Double-check that `llm-api-key` is correct and that `llm-model-version` is one of the available versions listed above.
Base URL: https://apis.roblox.com/open-eval-api/v1
curl -X POST 'https://apis.roblox.com/open-eval-api/v1/eval' \
--header 'Content-Type: application/json' \
--header "x-api-key: $OPEN_GAME_EVAL_API_KEY" \
--data "$(jq -n --rawfile script Evals/001_make_cars_faster.lua '{
name: "make_cars_faster",
description: "Evaluation on making cars faster",
input_script: $script,
custom_llm_info: {
name: "provider-name",          # provider only: claude | gemini | openai
api_key: "your-provider-api-key",
model_version: "model-version", # see the available model versions listed above
url: "dummy_url_not_effective"
}
}')"

curl 'https://apis.roblox.com/open-eval-api/v1/eval-records/{job_id}' \
--header "x-api-key: $OPEN_GAME_EVAL_API_KEY"QUEUED: Job is waiting to be processedPENDING: Job is being processedCOMPLETED: Job finished successfullyFAILED: Job failed
Each evaluation file follows this structure:
local eval: BaseEval = {
scenario_name = "001_make_cars_faster", -- Name of the eval
prompt = {
{
{
role = "user",
content = "Make the cars of this game 2x faster", -- User prompt
}
}
},
place = "racing.rbxl", --Name of placefile used. Currently only supports Roblox templates.
}
-- Set up necessary changes to the placefile before evaluation
eval.setup = function()
-- Apply the necessary setup to the placefile, including selection
end
-- Reference function (optional, used when running evals with use-reference-mode)
eval.reference = function()
-- Expected-behavior implementation. Reference implementations are intentionally left blank in this set for the purpose of evaluation.
end
-- Validation function
eval.check_scene = function()
-- Checks for edit mode
end
eval.check_game = function()
-- Checks for play mode
end
return eval
This repository contains open-source evaluation scripts. To contribute:
- Fork the repository
- Create evaluation scripts following the established format
- Test your evaluations thoroughly
- Submit a pull request with clear documentation
This project is part of Roblox's open-source initiative. Please refer to the repository's license file for details.
- Contact the Roblox team for API access and permissions