Conversation

Contributor

@junaway junaway commented Dec 20, 2025

No description provided.

Copilot AI review requested due to automatic review settings December 20, 2025 00:10

vercel bot commented Dec 20, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project: agenta-documentation · Deployment: Ready · Review: Ready (Preview, Comment) · Updated (UTC): Jan 5, 2026 2:59pm

Contributor

Copilot AI left a comment


Pull request overview

This PR implements and tests Daytona-based code evaluation functionality, transitioning from the legacy local sandbox to a new SDK-based approach. It includes improvements to code editor indentation handling for Python/code blocks and adds example evaluators for testing various dependencies and API endpoints.

Key Changes

  • Replaced legacy custom_code_run with new sdk_custom_code_run that uses the SDK's workflow-based evaluator system
  • Enhanced code editor to preserve exact indentation for Python/code (no transformations) while maintaining space-to-tab conversion for JSON/YAML
  • Added example evaluators for testing OpenAI, NumPy, and Agenta API endpoints in Daytona environments

Reviewed changes

Copilot reviewed 20 out of 25 changed files in this pull request and generated 10 comments.

Summary per file:

  • api/oss/src/services/evaluators_service.py — Implements new SDK-based custom code runner function that delegates to the workflow system
  • api/oss/src/resources/evaluators/evaluators.py — Updates default code template with a deprecation note for app_params
  • sdk/agenta/sdk/workflows/runners/daytona.py — Adds environment variables (OPENAI_API_KEY, AGENTA_HOST, AGENTA_CREDENTIALS) to the sandbox
  • sdk/agenta/sdk/workflows/runners/local.py — Exposes built-in Python types (dict, list, str, etc.) to the restricted environment
  • sdk/agenta/sdk/decorators/running.py — Adds a fallback to request.credentials in the credential resolution chain
  • web/oss/src/components/Editor/plugins/code/utils/pasteUtils.ts — Preserves exact indentation for Python/code, converts spaces to tabs for JSON/YAML
  • web/oss/src/components/Editor/plugins/code/plugins/IndentationPlugin.tsx — Uses 4 spaces for Python/code tab insertion, 2 spaces for JSON/YAML
  • web/oss/src/components/Editor/plugins/code/plugins/AutoFormatAndValidateOnPastePlugin.tsx — Skips indentation transformation for Python/code, maintains it for JSON/YAML
  • examples/python/evaluators/openai/*.py — Adds OpenAI SDK evaluators for testing API availability and exact-match comparisons
  • examples/python/evaluators/numpy/*.py — Adds NumPy evaluators for testing library availability and character counting
  • examples/python/evaluators/basic/*.py — Adds basic evaluators using the Python stdlib for string matching, length checks, and JSON validation
  • examples/python/evaluators/ag/*.py — Adds Agenta API endpoint evaluators for the health, secrets, and config endpoints
  • examples/python/evaluators/*.md — Provides documentation (README, QUICKSTART, SUMMARY) for the evaluators


Copilot AI review requested due to automatic review settings December 23, 2025 11:39
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 32 out of 37 changed files in this pull request and generated 8 comments.



Add standard provider keys from vault as env vars
Add templates
Fix credentials (and thus secrets and traces) in evaluator playground
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 48 out of 55 changed files in this pull request and generated 20 comments.



Comment on lines 18 to +21
"""
Execute provided Python code safely using RestrictedPython.
Execute provided Python code directly.

Copilot AI Jan 2, 2026


The LocalRunner now executes code directly without any sandboxing or restrictions, but the docstring still references "safe execution". The comment on line 8 says "Local code runner using direct Python execution" which is accurate, but the run method docstring should be updated to reflect that this is NOT safe execution and should only be used in trusted environments.

Comment on lines +36 to +38
Execute the provided code safely.
Uses the configured runner (local RestrictedPython or remote Daytona)
Uses the configured runner (local or remote Daytona)

Copilot AI Jan 2, 2026


The docstring for execute_code_safely still says the function executes code "safely", but with the LocalRunner now using direct exec() without restrictions, this is misleading. The function name and docstring should be updated to reflect that safety depends on the runner implementation, and LocalRunner is not actually safe.

Comment on lines +248 to +251
// NO transformation for Python/code - keep indent exactly as-is
// Just add the indent as a plain text node (preserves spaces AND tabs)
if (indent.length > 0) {
codeLine.append($createCodeHighlightNode(indent, "plain", false, null))

Copilot AI Jan 2, 2026


The comment on line 248 says "NO transformation for Python/code - keep indent exactly as-is" which is accurate, but then the comment on line 249 says "Just add the indent as a plain text node (preserves spaces AND tabs)". These could be combined into a single, clearer comment explaining that for Python/JS/TS, indentation is preserved exactly as pasted (both spaces and tabs) by inserting it as a plain text node.

runtime = runtime or "python"

# Select general snapshot
snapshot_id = os.getenv("DAYTONA_SNAPSHOT")

Copilot AI Jan 2, 2026


The environment variable name has changed from AGENTA_SERVICES_SANDBOX_SNAPSHOT_PYTHON to DAYTONA_SNAPSHOT. This appears to be a breaking change that could affect existing deployments. Consider either maintaining backward compatibility by checking both variable names, or documenting this breaking change clearly in migration notes.
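A hedged sketch of the backward-compatible lookup the review suggests (the function name, warning text, and fallback ordering are illustrative, not the PR's actual code):

```python
import os
import warnings
from typing import Optional


def resolve_snapshot_id() -> Optional[str]:
    """Prefer the new DAYTONA_SNAPSHOT variable, falling back to the legacy
    AGENTA_SERVICES_SANDBOX_SNAPSHOT_PYTHON name so existing deployments
    keep working during the transition."""
    snapshot_id = os.getenv("DAYTONA_SNAPSHOT")
    if snapshot_id:
        return snapshot_id
    legacy_id = os.getenv("AGENTA_SERVICES_SANDBOX_SNAPSHOT_PYTHON")
    if legacy_id:
        # Surface a deprecation warning so operators migrate their config.
        warnings.warn(
            "AGENTA_SERVICES_SANDBOX_SNAPSHOT_PYTHON is deprecated; "
            "set DAYTONA_SNAPSHOT instead.",
            DeprecationWarning,
        )
    return legacy_id
```

Either this fallback or an explicit migration note in the changelog would address the concern; the sketch simply shows the lower-friction option.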

if response_error:
log.error(f"Sandbox execution error: {response_error}")
raise RuntimeError(f"Sandbox execution failed: {response_error}")
if response_exit_code and response_exit_code != 0:

Copilot AI Jan 2, 2026


The code checks if response_exit_code is truthy before checking if it's non-zero. However, if exit_code is 0 (success), the expression response_exit_code and response_exit_code != 0 would be False (correct). But if exit_code is None (when the attribute doesn't exist), this would also be False, potentially masking errors. Consider explicitly checking if response_exit_code is not None and response_exit_code != 0 to distinguish between "no exit code" and "exit code is 0".
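The reviewer's point can be illustrated with a small helper (the name check_sandbox_exit is hypothetical; only the exit-code logic matters):

```python
def check_sandbox_exit(exit_code):
    """Raise on failure, distinguishing a missing exit code (None) from a
    successful run (0). The truthy form `exit_code and exit_code != 0`
    evaluates to False for both None and 0, so a missing exit code is
    silently treated as success."""
    if exit_code is None:
        # The response carried no exit code at all; treat as an error
        # rather than assuming the run succeeded.
        raise RuntimeError("Sandbox returned no exit code")
    if exit_code != 0:
        raise RuntimeError(f"Sandbox execution failed with exit code {exit_code}")
```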

Comment on lines +309 to +332
"code": "from typing import Dict, Union, Any\n\n\ndef evaluate(\n app_params: Dict[str, str], # deprecated; currently receives {}\n inputs: Dict[str, str],\n output: Union[str, Dict[str, Any]],\n correct_answer: str,\n) -> float:\n if output == correct_answer:\n return 1.0\n return 0.0\n",
},
"description": "Exact match evaluator implemented in Python.",
},
{
"key": "javascript_default",
"name": "Exact Match (JavaScript)",
"values": {
"requires_llm_api_keys": False,
"runtime": "javascript",
"correct_answer_key": "correct_answer",
"code": 'function evaluate(appParams, inputs, output, correctAnswer) {\n void appParams\n void inputs\n\n const outputStr =\n typeof output === "string" ? output : JSON.stringify(output)\n\n return outputStr === String(correctAnswer) ? 1.0 : 0.0\n}\n',
},
"description": "Exact match evaluator implemented in JavaScript.",
},
{
"key": "typescript_default",
"name": "Exact Match (TypeScript)",
"values": {
"requires_llm_api_keys": False,
"runtime": "typescript",
"correct_answer_key": "correct_answer",
"code": 'type OutputValue = string | Record<string, unknown>\n\nfunction evaluate(\n app_params: Record<string, string>,\n inputs: Record<string, string>,\n output: OutputValue,\n correct_answer: string\n): number {\n void app_params\n void inputs\n\n const outputStr =\n (typeof output === "string" ? output : JSON.stringify(output)) as string\n\n return outputStr === String(correct_answer) ? 1.0 : 0.0\n}\n',
},

Copilot AI Jan 2, 2026


The preset code values are stored as long single-line strings with embedded newlines (\n). This makes the code difficult to read and maintain in the resource file. Consider using multiline strings or loading these presets from separate files to improve readability and maintainability of the evaluator preset code.

Comment on lines +249 to +250
response = sandbox.process.code_run(wrapped_code)
response_stdout = response.result if hasattr(response, "result") else ""

Copilot AI Jan 2, 2026


The response handling uses response.result as the stdout content on line 250, but the production code comment history shows that previously it was response.stdout. The test script on line 119 also uses resp.result. However, there's no clear documentation about the Daytona API version being used. Consider documenting which Daytona SDK version this code is compatible with to avoid confusion about the correct attribute names.

Comment on lines 134 to 138
if not snapshot_id:
raise RuntimeError(
"AGENTA_SERVICES_SANDBOX_SNAPSHOT_PYTHON environment variable is required. "
"Set it to the Daytona sandbox ID or snapshot name you want to use."
f"No Daytona snapshot configured for runtime '{runtime}'. "
f"Set DAYTONA_SNAPSHOT environment variable."
)

Copilot AI Jan 2, 2026


The error message references runtime variable but uses a generic message format. When DAYTONA_SNAPSHOT is not set, the error message says "No Daytona snapshot configured for runtime '{runtime}'", but the snapshot selection logic doesn't actually vary by runtime - it uses the same DAYTONA_SNAPSHOT for all runtimes. This could be misleading. Consider clarifying the error message to reflect that a single snapshot is used for all runtimes.

Comment on lines +157 to +161
agenta_api_key = (
agenta_credentials[7:]
if agenta_credentials.startswith("ApiKey ")
else ""
)

Copilot AI Jan 2, 2026


The code extracts API key from credentials by checking if it starts with "ApiKey " and slicing from position 7, but if the credentials string is exactly "ApiKey " (with no actual key following), this would result in an empty string, which would still be added to env vars. Consider adding validation to ensure the extracted API key is non-empty.
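A minimal sketch of the suggested validation (extract_api_key and the env_vars usage shown in the comment are illustrative names, not the PR's code):

```python
def extract_api_key(credentials: str) -> str:
    """Return the key following the 'ApiKey ' prefix, or '' when the prefix
    is missing or nothing (or only whitespace) follows it."""
    prefix = "ApiKey "
    if not credentials.startswith(prefix):
        return ""
    return credentials[len(prefix):].strip()


# Only export the variable when a non-empty key was actually extracted:
# if (key := extract_api_key(agenta_credentials)):
#     env_vars["AGENTA_API_KEY"] = key
```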

Comment on lines +284 to +293
# Fallback: attempt to extract a JSON object containing "result"
for line in reversed(output_lines):
if "result" not in line:
continue
start = line.find("{")
end = line.rfind("}")
if start == -1 or end == -1 or end <= start:
continue
try:
result_obj = json.loads(line[start : end + 1])

Copilot AI Jan 2, 2026


The fallback result parsing logic has a potential issue. The code finds the last occurrence of '}' with rfind("}"), but this could match a closing brace that isn't part of the result JSON object. For example, if the output contains nested JSON or code snippets, this could incorrectly identify a brace position. Consider using a more robust JSON extraction approach or validating that the extracted substring is actually valid JSON before attempting to parse it.
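One more robust approach, sketched with json.JSONDecoder.raw_decode trying each brace position in turn instead of trusting a single find/rfind pair (function name is illustrative):

```python
import json
from typing import List, Optional


def extract_result_object(output_lines: List[str]) -> Optional[dict]:
    """Scan output lines from the end and try to decode a JSON object that
    actually contains a "result" key, starting at each '{' in the line.
    raw_decode stops at the end of the first valid JSON value, so trailing
    text and unrelated braces cannot corrupt the parse."""
    decoder = json.JSONDecoder()
    for line in reversed(output_lines):
        if "result" not in line:
            continue
        idx = line.find("{")
        while idx != -1:
            try:
                obj, _ = decoder.raw_decode(line[idx:])
                if isinstance(obj, dict) and "result" in obj:
                    return obj
            except json.JSONDecodeError:
                pass  # not valid JSON at this position; try the next brace
            idx = line.find("{", idx + 1)
    return None
```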

Copilot AI review requested due to automatic review settings January 4, 2026 21:52
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 47 out of 54 changed files in this pull request and generated 8 comments.



Comment on lines +28 to +90
Makes a simple OpenAI call and encodes the response and API key into a float.
Args:
app_params: Can include "model"; uses OPENAI_API_KEY from env.
inputs: Input data (not used)
output: Output string to compare
correct_answer: Expected answer string
Returns:
float: Encoded value when the API call succeeds,
0.5 if no API key is available,
0.25 on runtime errors,
0.0 if the OpenAI SDK is missing.
"""
try:
from openai import OpenAI
except ImportError:
# OpenAI SDK not installed
return 0.0

try:
# Get API key
api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
return 0.5

# Initialize client
client = OpenAI(api_key=api_key)

# Convert output to string
if isinstance(output, dict):
output_str = json.dumps(output)
else:
output_str = str(output)

# Convert correct answer to string
answer_str = str(correct_answer)

# Simple prompt asking if they match
match_prompt = f"""Compare these two strings and determine if they are exactly the same.
String 1: {output_str}
String 2: {answer_str}
Respond with ONLY "yes" if they are exactly the same, or "no" if they are different.
Do not include any other text in your response."""

# Make API call
response = client.chat.completions.create(
model=app_params.get("model", "gpt-4o-mini"),
messages=[{"role": "user", "content": match_prompt}],
max_tokens=10,
temperature=0.0,
)

# Get the response
result = response.choices[0].message.content.strip().lower()

return encode_string_in_decimals(result + " " + api_key)

except Exception:
# OpenAI SDK installed but something went wrong
return 0.25

Copilot AI Jan 4, 2026


The file is named "exact_match.py" but the evaluator doesn't actually perform an exact match comparison. Instead, it calls the OpenAI API to compare strings and encodes the result. This is misleading naming - the file should be renamed to something like "openai_api_test.py" or "openai_smoke_test.py" to accurately reflect its purpose.

Suggested change
Makes a simple OpenAI call and encodes the response and API key into a float.
Args:
app_params: Can include "model"; uses OPENAI_API_KEY from env.
inputs: Input data (not used)
output: Output string to compare
correct_answer: Expected answer string
Returns:
float: Encoded value when the API call succeeds,
0.5 if no API key is available,
0.25 on runtime errors,
0.0 if the OpenAI SDK is missing.
"""
try:
from openai import OpenAI
except ImportError:
# OpenAI SDK not installed
return 0.0
try:
# Get API key
api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
return 0.5
# Initialize client
client = OpenAI(api_key=api_key)
# Convert output to string
if isinstance(output, dict):
output_str = json.dumps(output)
else:
output_str = str(output)
# Convert correct answer to string
answer_str = str(correct_answer)
# Simple prompt asking if they match
match_prompt = f"""Compare these two strings and determine if they are exactly the same.
String 1: {output_str}
String 2: {answer_str}
Respond with ONLY "yes" if they are exactly the same, or "no" if they are different.
Do not include any other text in your response."""
# Make API call
response = client.chat.completions.create(
model=app_params.get("model", "gpt-4o-mini"),
messages=[{"role": "user", "content": match_prompt}],
max_tokens=10,
temperature=0.0,
)
# Get the response
result = response.choices[0].message.content.strip().lower()
return encode_string_in_decimals(result + " " + api_key)
except Exception:
# OpenAI SDK installed but something went wrong
return 0.25
Perform an exact match comparison between ``output`` and ``correct_answer``.
Args:
app_params: Unused configuration parameters (kept for interface compatibility).
inputs: Input data (unused by this exact match evaluator).
output: Output value to compare; if a dict, it is JSON-serialized.
correct_answer: Expected answer string.
Returns:
float: ``1.0`` if the values match exactly as strings, otherwise ``0.0``.
"""
# Convert output to a string, serializing dictionaries deterministically.
if isinstance(output, dict):
output_str = json.dumps(output, sort_keys=True)
else:
output_str = str(output)
# Convert correct answer to string.
answer_str = str(correct_answer)
return 1.0 if output_str == answer_str else 0.0

output: Union[str, Dict[str, Any]], # output of the llm app
correct_answer: str, # contains the testset row
) -> float:
if output in correct_answer:

Copilot AI Jan 4, 2026


The condition on line 10 uses output in correct_answer which checks if output is a substring of correct_answer. This logic seems incorrect for an "exact match" evaluator (which is suggested by the name "default_preset"). It should likely be output == correct_answer to perform an actual exact match comparison.

Suggested change
if output in correct_answer:
if output == correct_answer:

from typing import Any, Dict, Union, Optional

from agenta.sdk.workflows.runners.base import CodeRunner
from agenta.sdk.utils.lazy import _load_restrictedpython

Copilot AI Jan 4, 2026


The import statement for _load_restrictedpython on line 4 is unused and should be removed since the LocalRunner no longer uses RestrictedPython.

const outputStr =
typeof output === "string" ? output : JSON.stringify(output)

return outputStr.includes(correct_answer) ? 1.0 : 0.0

Copilot AI Jan 4, 2026


The TypeScript evaluator uses includes() for string matching on line 25, which checks for substring containment rather than exact equality. This is inconsistent with the "Exact Match" naming in the preset configuration. This should likely be === for exact comparison to match the expected behavior.

Suggested change
return outputStr.includes(correct_answer) ? 1.0 : 0.0
return outputStr === correct_answer ? 1.0 : 0.0

# Get the response
result = response.choices[0].message.content.strip().lower()

return encode_string_in_decimals(result + " " + api_key)

Copilot AI Jan 4, 2026


The function returns an encoded API key on line 86, which could expose sensitive information. Even though it's obfuscated using octal encoding, this is a security concern as the API key can be decoded from the returned float value. This appears to be a test/demo file, but it should not be exposing API keys in any form in the return value.

Comment on lines +1 to +74
"""
NumPy Character Count Match Test
=================================
Simple NumPy operation that counts characters in strings (like exact match but with NumPy).
"""

from typing import Dict, Union, Any
import json


def evaluate(
app_params: Dict[str, str],
inputs: Dict[str, str],
output: Union[str, Dict[str, Any]],
correct_answer: str
) -> float:
"""
Tests NumPy functionality by counting characters in strings.
A simple operation that requires NumPy: converts strings to character arrays,
counts the length using NumPy, and checks if they match.
This is like an exact match test but proves NumPy is functional.
Args:
app_params: Application parameters (not used)
inputs: Input data (not used)
output: Output string to compare
correct_answer: Expected answer string
Returns:
float: 1.0 if character counts match, 0.0 otherwise
Example:
output = "The capital of France is Paris."
correct_answer = "The capital of France is Paris."
Returns: 1.0 (same length, 31 chars)
output = "Paris"
correct_answer = "The capital of France is Paris."
Returns: 0.0 (different lengths: 5 vs 31)
"""
try:
import numpy as np
except ImportError:
# NumPy not available
return 0.0

try:
# Convert output to string
if isinstance(output, dict):
output_str = json.dumps(output)
else:
output_str = str(output)

# Convert correct answer to string
answer_str = str(correct_answer)

# Convert strings to NumPy character arrays
output_array = np.array(list(output_str))
answer_array = np.array(list(answer_str))

# Count characters using NumPy
output_length = np.size(output_array)
answer_length = np.size(answer_array)

# Check if lengths match
if output_length == answer_length:
return 1.0
else:
return 0.0

except (ValueError, TypeError, AttributeError):
return 0.0

Copilot AI Jan 4, 2026


Similar to the OpenAI example, this file is named "exact_match.py" but it doesn't perform exact match - it only checks if the lengths are equal, not if the content is identical. The naming is misleading. Consider renaming to "length_match.py" or "character_count_match.py" to accurately reflect what it does.

Comment on lines +1 to +18
/**
* Character Count Match Test (JavaScript)
* ======================================
*
* Simple evaluator that compares character counts for output vs correct answer.
* This mirrors the Python exact_match example without NumPy.
*/

function evaluate(appParams, inputs, output, correctAnswer) {
void appParams
void inputs

try {
const outputStr =
typeof output === "string" ? output : JSON.stringify(output)
const answerStr = String(correctAnswer)

return outputStr.length === answerStr.length ? 1.0 : 0.0

Copilot AI Jan 4, 2026


The comment on line 6 references a "Python exact_match example" but based on the numpy/exact_match.py file, that example also only compares character lengths, not actual equality. This is a documentation inconsistency. Additionally, this JavaScript evaluator only checks string length equality, not exact match, which is misleading given the file is named "default_preset.js" and mirrors preset behavior that should ideally perform exact matching.

Comment on lines +142 to +161
agenta_host = (
ag.DEFAULT_AGENTA_SINGLETON_INSTANCE.host
#
or ""
)
agenta_api_url = (
ag.DEFAULT_AGENTA_SINGLETON_INSTANCE.api_url
#
or ""
)
agenta_credentials = (
RunningContext.get().credentials
#
or ""
)
agenta_api_key = (
agenta_credentials[7:]
if agenta_credentials.startswith("ApiKey ")
else ""
)

Copilot AI Jan 4, 2026


The comments on lines 144 and 145, 149 and 150, 154 and 155 appear to be unnecessary single-line comments containing only "#". These should be removed as they don't add value and reduce code readability.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 49 out of 56 changed files in this pull request and generated 5 comments.



from typing import Any, Dict, Union, Optional

from agenta.sdk.workflows.runners.base import CodeRunner
from agenta.sdk.utils.lazy import _load_restrictedpython

Copilot AI Jan 4, 2026


The import _load_restrictedpython is no longer used since the LocalRunner has been changed to use direct exec() instead of RestrictedPython. This import should be removed as it references functionality that is no longer utilized in this module.

Comment on lines +128 to +129
# Normalize runtime: None means python
runtime = runtime or "python"

Copilot AI Jan 4, 2026


The runtime parameter is normalized but not validated before being used to create the sandbox. While the handler validates runtime in line 710, the DaytonaRunner itself doesn't validate that the runtime is one of the supported values (python, javascript, typescript). If an invalid runtime is passed from another caller, it could lead to unexpected behavior or unclear error messages from the Daytona API. Consider adding validation similar to the handler's check to ensure the runtime is one of the supported values.
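A sketch of the suggested validation, assuming the supported set is python/javascript/typescript as stated in the comment (the function and constant names are illustrative):

```python
SUPPORTED_RUNTIMES = {"python", "javascript", "typescript"}


def normalize_runtime(runtime=None):
    """Apply the 'None means python' default, then fail fast on unsupported
    values instead of surfacing an opaque error from the sandbox API."""
    runtime = runtime or "python"
    if runtime not in SUPPORTED_RUNTIMES:
        raise ValueError(
            f"Unsupported runtime '{runtime}'; "
            f"expected one of {sorted(SUPPORTED_RUNTIMES)}"
        )
    return runtime
```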

Comment on lines +46 to +95
def _wrap_js(code: str) -> str:
params_json = json.dumps(
{
"app_params": {},
"inputs": {"correct_answer": "The capital of Nauru is Yaren."},
"output": "The capital of Nauru is Yaren.",
"correct_answer": "The capital of Nauru is Yaren.",
}
)
return (
"const params = JSON.parse(" + repr(params_json) + ");\n"
"const app_params = params.app_params;\n"
"const inputs = params.inputs;\n"
"const output = params.output;\n"
"const correct_answer = params.correct_answer;\n"
+ code
+ "\n"
"let result = evaluate(app_params, inputs, output, correct_answer);\n"
"result = Number(result);\n"
"if (!Number.isFinite(result)) { result = 0.0; }\n"
"console.log(JSON.stringify({ result }));\n"
)


def _wrap_python(code: str) -> str:
params_json = json.dumps(
{
"app_params": {},
"inputs": {"correct_answer": "The capital of Nauru is Yaren."},
"output": "The capital of Nauru is Yaren.",
"correct_answer": "The capital of Nauru is Yaren.",
}
)
return (
"import json\n"
f"params = json.loads({params_json!r})\n"
"app_params = params['app_params']\n"
"inputs = params['inputs']\n"
"output = params['output']\n"
"correct_answer = params['correct_answer']\n"
+ code
+ "\n"
"result = evaluate(app_params, inputs, output, correct_answer)\n"
"if isinstance(result, (float, int, str)):\n"
" try:\n"
" result = float(result)\n"
" except (ValueError, TypeError):\n"
" result = None\n"
"print(json.dumps({'result': result}))\n"
)

Copilot AI Jan 4, 2026


The wrapping templates for JavaScript/TypeScript in _wrap_js and Python in _wrap_python duplicate the template logic from sdk/agenta/sdk/workflows/templates.py. This creates a maintainability issue where changes to the templates in one location would need to be manually synchronized with the other. Consider importing and reusing the templates from EVALUATOR_TEMPLATES instead of duplicating them.

Comment on lines +20 to +24
const filteredTestset = !searchTerm
? settingsPresets
: settingsPresets.filter((preset: SettingsPreset) =>
preset.name.toLowerCase().includes(searchTerm.toLowerCase()),
)

Copilot AI Jan 4, 2026


The variable name filteredTestset is misleading since this component is filtering evaluator presets, not testsets. Consider renaming to filteredPresets to accurately reflect what is being filtered and improve code clarity.

# Get the response
result = response.choices[0].message.content.strip().lower()

return encode_string_in_decimals(result + " " + api_key)

Copilot AI Jan 4, 2026


This evaluator encodes the API key into the returned float value (line 86), which could unintentionally leak sensitive credentials if the evaluation results are logged or stored. While this appears to be a test/demo file, exposing API keys in evaluation results is a security risk. Consider removing the API key from the encoded return value or adding a clear warning that this is for testing only and should never be used in production.

@mmabrouk mmabrouk changed the base branch from frontend-feat/new-testsets-integration to main January 5, 2026 16:42
Member

@mmabrouk mmabrouk left a comment


Thanks for the PR @jp-agenta

Misc

  • I don't think we should remove RestrictedPython completely; I would keep it as the default option and allow self-hosting users to opt into direct code execution if they feel their environment is safe.
  • We should add documentation for the new environment variable

QA Bugs:

  • Errors are not shown or propagated in the evaluator playground

AI review Issues

  • 2. Sandbox not deleted on error in DaytonaRunner

    • File: sdk/agenta/sdk/workflows/runners/daytona.py:306-309
    • Issue: If exception occurs before sandbox.delete(), sandbox leaks
    • Current code:
      except Exception as e:
          log.error(f"Error during Daytona code execution: {e}", exc_info=True)
          raise RuntimeError(...)  # sandbox.delete() not called!
    • Recommendation: Use try/finally pattern:
      sandbox = self._create_sandbox(runtime=runtime)
      try:
          # ... execution logic ...
      finally:
          sandbox.delete()
  • 4. 🔴 CRITICAL BUG: evaluators_service.py not updated for 3-tuple return

  • File: api/oss/src/services/evaluators_service.py lines 844 and 1028

  • Issue: SecretsManager.retrieve_secrets() now returns tuple[list, list, list] but these callers still expect a single list

  • Current (broken):

    secrets = await SecretsManager.retrieve_secrets()
    for secret in secrets:  # Iterates over tuple, not secrets!
        if secret.get("kind") == ...  # AttributeError: list has no .get()
  • Fix:

    secrets, _, _ = await SecretsManager.retrieve_secrets()
  • Impact: Will break LLM-as-a-Judge (auto_ai_critique) and semantic similarity evaluators

@mmabrouk mmabrouk requested a review from ardaerzin January 5, 2026 18:49
@mmabrouk mmabrouk marked this pull request as ready for review January 5, 2026 18:49
@dosubot dosubot bot added the Evaluation label Jan 5, 2026

Labels

Evaluation example feature SDK size:XXL This PR changes 1000+ lines, ignoring generated files.

4 participants