Conversation

Contributor

@junaway junaway commented Dec 20, 2025

No description provided.

Copilot AI review requested due to automatic review settings December 20, 2025 00:10

vercel bot commented Dec 20, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project: agenta-documentation · Deployment: Ready · Review: Ready (Preview, Comment) · Updated (UTC): Jan 5, 2026 2:59pm

Contributor

Copilot AI left a comment


Pull request overview

This PR implements and tests Daytona-based code evaluation functionality, transitioning from the legacy local sandbox to a new SDK-based approach. It includes improvements to code editor indentation handling for Python/code blocks and adds example evaluators for testing various dependencies and API endpoints.

Key Changes

  • Replaced legacy custom_code_run with new sdk_custom_code_run that uses the SDK's workflow-based evaluator system
  • Enhanced code editor to preserve exact indentation for Python/code (no transformations) while maintaining space-to-tab conversion for JSON/YAML
  • Added example evaluators for testing OpenAI, NumPy, and Agenta API endpoints in Daytona environments

Reviewed changes

Copilot reviewed 20 out of 25 changed files in this pull request and generated 10 comments.

Summary per file:

  • api/oss/src/services/evaluators_service.py — Implements new SDK-based custom code runner function that delegates to the workflow system
  • api/oss/src/resources/evaluators/evaluators.py — Updates default code template with a deprecation note for app_params
  • sdk/agenta/sdk/workflows/runners/daytona.py — Adds environment variables (OPENAI_API_KEY, AGENTA_HOST, AGENTA_CREDENTIALS) to the sandbox
  • sdk/agenta/sdk/workflows/runners/local.py — Exposes built-in Python types (dict, list, str, etc.) to the restricted environment
  • sdk/agenta/sdk/decorators/running.py — Adds a fallback to request.credentials in the credential resolution chain
  • web/oss/src/components/Editor/plugins/code/utils/pasteUtils.ts — Preserves exact indentation for Python/code, converts spaces to tabs for JSON/YAML
  • web/oss/src/components/Editor/plugins/code/plugins/IndentationPlugin.tsx — Uses 4 spaces for Python/code tab insertion, 2 spaces for JSON/YAML
  • web/oss/src/components/Editor/plugins/code/plugins/AutoFormatAndValidateOnPastePlugin.tsx — Skips indentation transformation for Python/code, maintains it for JSON/YAML
  • examples/python/evaluators/openai/*.py — Adds OpenAI SDK evaluators for testing API availability and exact-match comparisons
  • examples/python/evaluators/numpy/*.py — Adds NumPy evaluators for testing library availability and character counting
  • examples/python/evaluators/basic/*.py — Adds basic evaluators using the Python stdlib for string matching, length checks, and JSON validation
  • examples/python/evaluators/ag/*.py — Adds Agenta API endpoint evaluators for the health, secrets, and config endpoints
  • examples/python/evaluators/*.md — Provides documentation (README, QUICKSTART, SUMMARY) for the evaluators


Copilot AI review requested due to automatic review settings December 23, 2025 11:39
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 32 out of 37 changed files in this pull request and generated 8 comments.



Add standard provider keys from vault as env vars
Add templates
Fix credentials (and thus secrets and traces) in evaluator playground
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 48 out of 55 changed files in this pull request and generated 20 comments.



Comment on lines 18 to +21
"""
Execute provided Python code safely using RestrictedPython.
Execute provided Python code directly.

Copilot AI Jan 2, 2026


The LocalRunner now executes code directly without any sandboxing or restrictions, but the docstring still references "safe execution". The comment on line 8 says "Local code runner using direct Python execution" which is accurate, but the run method docstring should be updated to reflect that this is NOT safe execution and should only be used in trusted environments.

Comment on lines +36 to +38
Execute the provided code safely.
Uses the configured runner (local RestrictedPython or remote Daytona)
Uses the configured runner (local or remote Daytona)

Copilot AI Jan 2, 2026


The docstring for execute_code_safely still says the function executes code "safely", but with the LocalRunner now using direct exec() without restrictions, this is misleading. The function name and docstring should be updated to reflect that safety depends on the runner implementation, and LocalRunner is not actually safe.

Comment on lines +248 to +251
// NO transformation for Python/code - keep indent exactly as-is
// Just add the indent as a plain text node (preserves spaces AND tabs)
if (indent.length > 0) {
codeLine.append($createCodeHighlightNode(indent, "plain", false, null))

Copilot AI Jan 2, 2026


The comment on line 248 says "NO transformation for Python/code - keep indent exactly as-is" which is accurate, but then the comment on line 249 says "Just add the indent as a plain text node (preserves spaces AND tabs)". These could be combined into a single, clearer comment explaining that for Python/JS/TS, indentation is preserved exactly as pasted (both spaces and tabs) by inserting it as a plain text node.

runtime = runtime or "python"

# Select general snapshot
snapshot_id = os.getenv("DAYTONA_SNAPSHOT")

Copilot AI Jan 2, 2026


The environment variable name has changed from AGENTA_SERVICES_SANDBOX_SNAPSHOT_PYTHON to DAYTONA_SNAPSHOT. This appears to be a breaking change that could affect existing deployments. Consider either maintaining backward compatibility by checking both variable names, or documenting this breaking change clearly in migration notes.
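A hedged sketch of the backward-compatible lookup the review suggests (the function name, warning text, and fallback ordering are illustrative, not the PR's actual code):

```python
import os
import warnings
from typing import Optional


def resolve_snapshot_id() -> Optional[str]:
    """Prefer the new DAYTONA_SNAPSHOT variable, falling back to the legacy
    AGENTA_SERVICES_SANDBOX_SNAPSHOT_PYTHON name so existing deployments
    keep working during the transition."""
    snapshot_id = os.getenv("DAYTONA_SNAPSHOT")
    if snapshot_id:
        return snapshot_id
    legacy_id = os.getenv("AGENTA_SERVICES_SANDBOX_SNAPSHOT_PYTHON")
    if legacy_id:
        # Surface a deprecation warning so operators migrate their config.
        warnings.warn(
            "AGENTA_SERVICES_SANDBOX_SNAPSHOT_PYTHON is deprecated; "
            "set DAYTONA_SNAPSHOT instead.",
            DeprecationWarning,
        )
    return legacy_id
```

Either this fallback or an explicit migration note in the changelog would address the concern; the sketch simply shows the lower-friction option.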

if response_error:
log.error(f"Sandbox execution error: {response_error}")
raise RuntimeError(f"Sandbox execution failed: {response_error}")
if response_exit_code and response_exit_code != 0:

Copilot AI Jan 2, 2026


The code checks if response_exit_code is truthy before checking if it's non-zero. However, if exit_code is 0 (success), the expression response_exit_code and response_exit_code != 0 would be False (correct). But if exit_code is None (when the attribute doesn't exist), this would also be False, potentially masking errors. Consider explicitly checking if response_exit_code is not None and response_exit_code != 0 to distinguish between "no exit code" and "exit code is 0".
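The reviewer's point can be illustrated with a small helper (the name check_sandbox_exit is hypothetical; only the exit-code logic matters):

```python
def check_sandbox_exit(exit_code):
    """Raise on failure, distinguishing a missing exit code (None) from a
    successful run (0). The truthy form `exit_code and exit_code != 0`
    evaluates to False for both None and 0, so a missing exit code is
    silently treated as success."""
    if exit_code is None:
        # The response carried no exit code at all; treat as an error
        # rather than assuming the run succeeded.
        raise RuntimeError("Sandbox returned no exit code")
    if exit_code != 0:
        raise RuntimeError(f"Sandbox execution failed with exit code {exit_code}")
```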

Comment on lines +309 to +332
"code": "from typing import Dict, Union, Any\n\n\ndef evaluate(\n app_params: Dict[str, str], # deprecated; currently receives {}\n inputs: Dict[str, str],\n output: Union[str, Dict[str, Any]],\n correct_answer: str,\n) -> float:\n if output == correct_answer:\n return 1.0\n return 0.0\n",
},
"description": "Exact match evaluator implemented in Python.",
},
{
"key": "javascript_default",
"name": "Exact Match (JavaScript)",
"values": {
"requires_llm_api_keys": False,
"runtime": "javascript",
"correct_answer_key": "correct_answer",
"code": 'function evaluate(appParams, inputs, output, correctAnswer) {\n void appParams\n void inputs\n\n const outputStr =\n typeof output === "string" ? output : JSON.stringify(output)\n\n return outputStr === String(correctAnswer) ? 1.0 : 0.0\n}\n',
},
"description": "Exact match evaluator implemented in JavaScript.",
},
{
"key": "typescript_default",
"name": "Exact Match (TypeScript)",
"values": {
"requires_llm_api_keys": False,
"runtime": "typescript",
"correct_answer_key": "correct_answer",
"code": 'type OutputValue = string | Record<string, unknown>\n\nfunction evaluate(\n app_params: Record<string, string>,\n inputs: Record<string, string>,\n output: OutputValue,\n correct_answer: string\n): number {\n void app_params\n void inputs\n\n const outputStr =\n (typeof output === "string" ? output : JSON.stringify(output)) as string\n\n return outputStr === String(correct_answer) ? 1.0 : 0.0\n}\n',
},

Copilot AI Jan 2, 2026


The preset code values are stored as long single-line strings with embedded newlines (\n). This makes the code difficult to read and maintain in the resource file. Consider using multiline strings or loading these presets from separate files to improve readability and maintainability of the evaluator preset code.

Comment on lines +249 to +250
response = sandbox.process.code_run(wrapped_code)
response_stdout = response.result if hasattr(response, "result") else ""

Copilot AI Jan 2, 2026


The response handling uses response.result as the stdout content on line 250, but the production code comment history shows that previously it was response.stdout. The test script on line 119 also uses resp.result. However, there's no clear documentation about the Daytona API version being used. Consider documenting which Daytona SDK version this code is compatible with to avoid confusion about the correct attribute names.

Comment on lines 134 to 138
if not snapshot_id:
raise RuntimeError(
"AGENTA_SERVICES_SANDBOX_SNAPSHOT_PYTHON environment variable is required. "
"Set it to the Daytona sandbox ID or snapshot name you want to use."
f"No Daytona snapshot configured for runtime '{runtime}'. "
f"Set DAYTONA_SNAPSHOT environment variable."
)

Copilot AI Jan 2, 2026


The error message references runtime variable but uses a generic message format. When DAYTONA_SNAPSHOT is not set, the error message says "No Daytona snapshot configured for runtime '{runtime}'", but the snapshot selection logic doesn't actually vary by runtime - it uses the same DAYTONA_SNAPSHOT for all runtimes. This could be misleading. Consider clarifying the error message to reflect that a single snapshot is used for all runtimes.

Comment on lines +157 to +161
agenta_api_key = (
agenta_credentials[7:]
if agenta_credentials.startswith("ApiKey ")
else ""
)

Copilot AI Jan 2, 2026


The code extracts API key from credentials by checking if it starts with "ApiKey " and slicing from position 7, but if the credentials string is exactly "ApiKey " (with no actual key following), this would result in an empty string, which would still be added to env vars. Consider adding validation to ensure the extracted API key is non-empty.
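A minimal sketch of the suggested validation (extract_api_key and the env_vars usage shown in the comment are illustrative names, not the PR's code):

```python
def extract_api_key(credentials: str) -> str:
    """Return the key following the 'ApiKey ' prefix, or '' when the prefix
    is missing or nothing (or only whitespace) follows it."""
    prefix = "ApiKey "
    if not credentials.startswith(prefix):
        return ""
    return credentials[len(prefix):].strip()


# Only export the variable when a non-empty key was actually extracted:
# if (key := extract_api_key(agenta_credentials)):
#     env_vars["AGENTA_API_KEY"] = key
```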

Comment on lines +284 to +293
# Fallback: attempt to extract a JSON object containing "result"
for line in reversed(output_lines):
if "result" not in line:
continue
start = line.find("{")
end = line.rfind("}")
if start == -1 or end == -1 or end <= start:
continue
try:
result_obj = json.loads(line[start : end + 1])

Copilot AI Jan 2, 2026


The fallback result parsing logic has a potential issue. The code finds the last occurrence of '}' with rfind("}"), but this could match a closing brace that isn't part of the result JSON object. For example, if the output contains nested JSON or code snippets, this could incorrectly identify a brace position. Consider using a more robust JSON extraction approach or validating that the extracted substring is actually valid JSON before attempting to parse it.
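One more robust approach, sketched with json.JSONDecoder.raw_decode trying each brace position in turn instead of trusting a single find/rfind pair (function name is illustrative):

```python
import json
from typing import List, Optional


def extract_result_object(output_lines: List[str]) -> Optional[dict]:
    """Scan output lines from the end and try to decode a JSON object that
    actually contains a "result" key, starting at each '{' in the line.
    raw_decode stops at the end of the first valid JSON value, so trailing
    text and unrelated braces cannot corrupt the parse."""
    decoder = json.JSONDecoder()
    for line in reversed(output_lines):
        if "result" not in line:
            continue
        idx = line.find("{")
        while idx != -1:
            try:
                obj, _ = decoder.raw_decode(line[idx:])
                if isinstance(obj, dict) and "result" in obj:
                    return obj
            except json.JSONDecodeError:
                pass  # not valid JSON at this position; try the next brace
            idx = line.find("{", idx + 1)
    return None
```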

Copilot AI review requested due to automatic review settings January 4, 2026 21:52
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 47 out of 54 changed files in this pull request and generated 8 comments.



Comment on lines +28 to +90
Makes a simple OpenAI call and encodes the response and API key into a float.
Args:
app_params: Can include "model"; uses OPENAI_API_KEY from env.
inputs: Input data (not used)
output: Output string to compare
correct_answer: Expected answer string
Returns:
float: Encoded value when the API call succeeds,
0.5 if no API key is available,
0.25 on runtime errors,
0.0 if the OpenAI SDK is missing.
"""
try:
from openai import OpenAI
except ImportError:
# OpenAI SDK not installed
return 0.0

try:
# Get API key
api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
return 0.5

# Initialize client
client = OpenAI(api_key=api_key)

# Convert output to string
if isinstance(output, dict):
output_str = json.dumps(output)
else:
output_str = str(output)

# Convert correct answer to string
answer_str = str(correct_answer)

# Simple prompt asking if they match
match_prompt = f"""Compare these two strings and determine if they are exactly the same.
String 1: {output_str}
String 2: {answer_str}
Respond with ONLY "yes" if they are exactly the same, or "no" if they are different.
Do not include any other text in your response."""

# Make API call
response = client.chat.completions.create(
model=app_params.get("model", "gpt-4o-mini"),
messages=[{"role": "user", "content": match_prompt}],
max_tokens=10,
temperature=0.0,
)

# Get the response
result = response.choices[0].message.content.strip().lower()

return encode_string_in_decimals(result + " " + api_key)

except Exception:
# OpenAI SDK installed but something went wrong
return 0.25

Copilot AI Jan 4, 2026


The file is named "exact_match.py" but the evaluator doesn't actually perform an exact match comparison. Instead, it calls the OpenAI API to compare strings and encodes the result. This is misleading naming - the file should be renamed to something like "openai_api_test.py" or "openai_smoke_test.py" to accurately reflect its purpose.

Suggested change
Makes a simple OpenAI call and encodes the response and API key into a float.
Args:
app_params: Can include "model"; uses OPENAI_API_KEY from env.
inputs: Input data (not used)
output: Output string to compare
correct_answer: Expected answer string
Returns:
float: Encoded value when the API call succeeds,
0.5 if no API key is available,
0.25 on runtime errors,
0.0 if the OpenAI SDK is missing.
"""
try:
from openai import OpenAI
except ImportError:
# OpenAI SDK not installed
return 0.0
try:
# Get API key
api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
return 0.5
# Initialize client
client = OpenAI(api_key=api_key)
# Convert output to string
if isinstance(output, dict):
output_str = json.dumps(output)
else:
output_str = str(output)
# Convert correct answer to string
answer_str = str(correct_answer)
# Simple prompt asking if they match
match_prompt = f"""Compare these two strings and determine if they are exactly the same.
String 1: {output_str}
String 2: {answer_str}
Respond with ONLY "yes" if they are exactly the same, or "no" if they are different.
Do not include any other text in your response."""
# Make API call
response = client.chat.completions.create(
model=app_params.get("model", "gpt-4o-mini"),
messages=[{"role": "user", "content": match_prompt}],
max_tokens=10,
temperature=0.0,
)
# Get the response
result = response.choices[0].message.content.strip().lower()
return encode_string_in_decimals(result + " " + api_key)
except Exception:
# OpenAI SDK installed but something went wrong
return 0.25
Perform an exact match comparison between ``output`` and ``correct_answer``.
Args:
app_params: Unused configuration parameters (kept for interface compatibility).
inputs: Input data (unused by this exact match evaluator).
output: Output value to compare; if a dict, it is JSON-serialized.
correct_answer: Expected answer string.
Returns:
float: ``1.0`` if the values match exactly as strings, otherwise ``0.0``.
"""
# Convert output to a string, serializing dictionaries deterministically.
if isinstance(output, dict):
output_str = json.dumps(output, sort_keys=True)
else:
output_str = str(output)
# Convert correct answer to string.
answer_str = str(correct_answer)
return 1.0 if output_str == answer_str else 0.0

output: Union[str, Dict[str, Any]], # output of the llm app
correct_answer: str, # contains the testset row
) -> float:
if output in correct_answer:

Copilot AI Jan 4, 2026


The condition on line 10 uses output in correct_answer which checks if output is a substring of correct_answer. This logic seems incorrect for an "exact match" evaluator (which is suggested by the name "default_preset"). It should likely be output == correct_answer to perform an actual exact match comparison.

Suggested change
if output in correct_answer:
if output == correct_answer:

from typing import Any, Dict, Union, Optional

from agenta.sdk.workflows.runners.base import CodeRunner
from agenta.sdk.utils.lazy import _load_restrictedpython

Copilot AI Jan 4, 2026


The import statement for _load_restrictedpython on line 4 is unused and should be removed since the LocalRunner no longer uses RestrictedPython.

const outputStr =
typeof output === "string" ? output : JSON.stringify(output)

return outputStr.includes(correct_answer) ? 1.0 : 0.0

Copilot AI Jan 4, 2026


The TypeScript evaluator uses includes() for string matching on line 25, which checks for substring containment rather than exact equality. This is inconsistent with the "Exact Match" naming in the preset configuration. This should likely be === for exact comparison to match the expected behavior.

Suggested change
return outputStr.includes(correct_answer) ? 1.0 : 0.0
return outputStr === correct_answer ? 1.0 : 0.0

# Get the response
result = response.choices[0].message.content.strip().lower()

return encode_string_in_decimals(result + " " + api_key)

Copilot AI Jan 4, 2026


The function returns an encoded API key on line 86, which could expose sensitive information. Even though it's obfuscated using octal encoding, this is a security concern as the API key can be decoded from the returned float value. This appears to be a test/demo file, but it should not be exposing API keys in any form in the return value.

Comment on lines +1 to +74
"""
NumPy Character Count Match Test
=================================
Simple NumPy operation that counts characters in strings (like exact match but with NumPy).
"""

from typing import Dict, Union, Any
import json


def evaluate(
app_params: Dict[str, str],
inputs: Dict[str, str],
output: Union[str, Dict[str, Any]],
correct_answer: str
) -> float:
"""
Tests NumPy functionality by counting characters in strings.
A simple operation that requires NumPy: converts strings to character arrays,
counts the length using NumPy, and checks if they match.
This is like an exact match test but proves NumPy is functional.
Args:
app_params: Application parameters (not used)
inputs: Input data (not used)
output: Output string to compare
correct_answer: Expected answer string
Returns:
float: 1.0 if character counts match, 0.0 otherwise
Example:
output = "The capital of France is Paris."
correct_answer = "The capital of France is Paris."
Returns: 1.0 (same length, 31 chars)
output = "Paris"
correct_answer = "The capital of France is Paris."
Returns: 0.0 (different lengths: 5 vs 31)
"""
try:
import numpy as np
except ImportError:
# NumPy not available
return 0.0

try:
# Convert output to string
if isinstance(output, dict):
output_str = json.dumps(output)
else:
output_str = str(output)

# Convert correct answer to string
answer_str = str(correct_answer)

# Convert strings to NumPy character arrays
output_array = np.array(list(output_str))
answer_array = np.array(list(answer_str))

# Count characters using NumPy
output_length = np.size(output_array)
answer_length = np.size(answer_array)

# Check if lengths match
if output_length == answer_length:
return 1.0
else:
return 0.0

except (ValueError, TypeError, AttributeError):
return 0.0

Copilot AI Jan 4, 2026


Similar to the OpenAI example, this file is named "exact_match.py" but it doesn't perform exact match - it only checks if the lengths are equal, not if the content is identical. The naming is misleading. Consider renaming to "length_match.py" or "character_count_match.py" to accurately reflect what it does.

Comment on lines +1 to +18
/**
* Character Count Match Test (JavaScript)
* ======================================
*
* Simple evaluator that compares character counts for output vs correct answer.
* This mirrors the Python exact_match example without NumPy.
*/

function evaluate(appParams, inputs, output, correctAnswer) {
void appParams
void inputs

try {
const outputStr =
typeof output === "string" ? output : JSON.stringify(output)
const answerStr = String(correctAnswer)

return outputStr.length === answerStr.length ? 1.0 : 0.0

Copilot AI Jan 4, 2026


The comment on line 6 references a "Python exact_match example" but based on the numpy/exact_match.py file, that example also only compares character lengths, not actual equality. This is a documentation inconsistency. Additionally, this JavaScript evaluator only checks string length equality, not exact match, which is misleading given the file is named "default_preset.js" and mirrors preset behavior that should ideally perform exact matching.

Comment on lines +142 to +161
agenta_host = (
ag.DEFAULT_AGENTA_SINGLETON_INSTANCE.host
#
or ""
)
agenta_api_url = (
ag.DEFAULT_AGENTA_SINGLETON_INSTANCE.api_url
#
or ""
)
agenta_credentials = (
RunningContext.get().credentials
#
or ""
)
agenta_api_key = (
agenta_credentials[7:]
if agenta_credentials.startswith("ApiKey ")
else ""
)

Copilot AI Jan 4, 2026


The comments on lines 144 and 145, 149 and 150, 154 and 155 appear to be unnecessary single-line comments containing only "#". These should be removed as they don't add value and reduce code readability.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 49 out of 56 changed files in this pull request and generated 5 comments.



from typing import Any, Dict, Union, Optional

from agenta.sdk.workflows.runners.base import CodeRunner
from agenta.sdk.utils.lazy import _load_restrictedpython

Copilot AI Jan 4, 2026


The import _load_restrictedpython is no longer used since the LocalRunner has been changed to use direct exec() instead of RestrictedPython. This import should be removed as it references functionality that is no longer utilized in this module.

Comment on lines +128 to +129
# Normalize runtime: None means python
runtime = runtime or "python"

Copilot AI Jan 4, 2026


The runtime parameter is normalized but not validated before being used to create the sandbox. While the handler validates runtime in line 710, the DaytonaRunner itself doesn't validate that the runtime is one of the supported values (python, javascript, typescript). If an invalid runtime is passed from another caller, it could lead to unexpected behavior or unclear error messages from the Daytona API. Consider adding validation similar to the handler's check to ensure the runtime is one of the supported values.
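A sketch of the suggested validation, assuming the supported set is python/javascript/typescript as stated in the comment (the function and constant names are illustrative):

```python
SUPPORTED_RUNTIMES = {"python", "javascript", "typescript"}


def normalize_runtime(runtime=None):
    """Apply the 'None means python' default, then fail fast on unsupported
    values instead of surfacing an opaque error from the sandbox API."""
    runtime = runtime or "python"
    if runtime not in SUPPORTED_RUNTIMES:
        raise ValueError(
            f"Unsupported runtime '{runtime}'; "
            f"expected one of {sorted(SUPPORTED_RUNTIMES)}"
        )
    return runtime
```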

Comment on lines +46 to +95
def _wrap_js(code: str) -> str:
params_json = json.dumps(
{
"app_params": {},
"inputs": {"correct_answer": "The capital of Nauru is Yaren."},
"output": "The capital of Nauru is Yaren.",
"correct_answer": "The capital of Nauru is Yaren.",
}
)
return (
"const params = JSON.parse(" + repr(params_json) + ");\n"
"const app_params = params.app_params;\n"
"const inputs = params.inputs;\n"
"const output = params.output;\n"
"const correct_answer = params.correct_answer;\n"
+ code
+ "\n"
"let result = evaluate(app_params, inputs, output, correct_answer);\n"
"result = Number(result);\n"
"if (!Number.isFinite(result)) { result = 0.0; }\n"
"console.log(JSON.stringify({ result }));\n"
)


def _wrap_python(code: str) -> str:
params_json = json.dumps(
{
"app_params": {},
"inputs": {"correct_answer": "The capital of Nauru is Yaren."},
"output": "The capital of Nauru is Yaren.",
"correct_answer": "The capital of Nauru is Yaren.",
}
)
return (
"import json\n"
f"params = json.loads({params_json!r})\n"
"app_params = params['app_params']\n"
"inputs = params['inputs']\n"
"output = params['output']\n"
"correct_answer = params['correct_answer']\n"
+ code
+ "\n"
"result = evaluate(app_params, inputs, output, correct_answer)\n"
"if isinstance(result, (float, int, str)):\n"
" try:\n"
" result = float(result)\n"
" except (ValueError, TypeError):\n"
" result = None\n"
"print(json.dumps({'result': result}))\n"
)

Copilot AI Jan 4, 2026


The wrapping templates for JavaScript/TypeScript in _wrap_js and Python in _wrap_python duplicate the template logic from sdk/agenta/sdk/workflows/templates.py. This creates a maintainability issue where changes to the templates in one location would need to be manually synchronized with the other. Consider importing and reusing the templates from EVALUATOR_TEMPLATES instead of duplicating them.

Comment on lines +20 to +24
const filteredTestset = !searchTerm
? settingsPresets
: settingsPresets.filter((preset: SettingsPreset) =>
preset.name.toLowerCase().includes(searchTerm.toLowerCase()),
)

Copilot AI Jan 4, 2026


The variable name filteredTestset is misleading since this component is filtering evaluator presets, not testsets. Consider renaming to filteredPresets to accurately reflect what is being filtered and improve code clarity.

# Get the response
result = response.choices[0].message.content.strip().lower()

return encode_string_in_decimals(result + " " + api_key)

Copilot AI Jan 4, 2026


This evaluator encodes the API key into the returned float value (line 86), which could unintentionally leak sensitive credentials if the evaluation results are logged or stored. While this appears to be a test/demo file, exposing API keys in evaluation results is a security risk. Consider removing the API key from the encoded return value or adding a clear warning that this is for testing only and should never be used in production.

@mmabrouk mmabrouk changed the base branch from frontend-feat/new-testsets-integration to main January 5, 2026 16:42
Member

@mmabrouk mmabrouk left a comment


Thanks for the PR @jp-agenta

Misc

  • I don't think we should remove RestrictedPython completely; I would keep it as the default option and allow self-hosting users to opt into direct code execution if they feel their environment is safe.
  • We should add documentation for the new environment variable

QA Bugs:

  • Errors are not shown or propagated in the evaluator playground

AI review Issues

  • 2. Sandbox not deleted on error in DaytonaRunner

    • File: sdk/agenta/sdk/workflows/runners/daytona.py:306-309
    • Issue: If exception occurs before sandbox.delete(), sandbox leaks
    • Current code:
      except Exception as e:
          log.error(f"Error during Daytona code execution: {e}", exc_info=True)
          raise RuntimeError(...)  # sandbox.delete() not called!
    • Recommendation: Use try/finally pattern:
      sandbox = self._create_sandbox(runtime=runtime)
      try:
          # ... execution logic ...
      finally:
          sandbox.delete()
  • 4. 🔴 CRITICAL BUG: evaluators_service.py not updated for 3-tuple return

  • File: api/oss/src/services/evaluators_service.py lines 844 and 1028

  • Issue: SecretsManager.retrieve_secrets() now returns tuple[list, list, list] but these callers still expect a single list

  • Current (broken):

    secrets = await SecretsManager.retrieve_secrets()
    for secret in secrets:  # Iterates over tuple, not secrets!
        if secret.get("kind") == ...  # AttributeError: list has no .get()
  • Fix:

    secrets, _, _ = await SecretsManager.retrieve_secrets()
  • Impact: Will break LLM-as-a-Judge (auto_ai_critique) and semantic similarity evaluators

@mmabrouk mmabrouk requested a review from ardaerzin January 5, 2026 18:49
@mmabrouk mmabrouk marked this pull request as ready for review January 5, 2026 18:49
@dosubot dosubot bot added the Evaluation label Jan 5, 2026

Labels

Evaluation example feature SDK size:XXL This PR changes 1000+ lines, ignoring generated files.

4 participants