[feat] Add DaytonaRunner for code evaluators
#3258
base: main
Conversation
Pull request overview
This PR implements and tests Daytona-based code evaluation functionality, transitioning from the legacy local sandbox to a new SDK-based approach. It includes improvements to code editor indentation handling for Python/code blocks and adds example evaluators for testing various dependencies and API endpoints.
Key Changes
- Replaced legacy `custom_code_run` with a new `sdk_custom_code_run` that uses the SDK's workflow-based evaluator system
- Enhanced the code editor to preserve exact indentation for Python/code (no transformations) while maintaining space-to-tab conversion for JSON/YAML
- Added example evaluators for testing OpenAI, NumPy, and Agenta API endpoints in Daytona environments
Reviewed changes
Copilot reviewed 20 out of 25 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| `api/oss/src/services/evaluators_service.py` | Implements the new SDK-based custom code runner function that delegates to the workflow system |
| `api/oss/src/resources/evaluators/evaluators.py` | Updates the default code template with a deprecation note for `app_params` |
| `sdk/agenta/sdk/workflows/runners/daytona.py` | Adds environment variables (`OPENAI_API_KEY`, `AGENTA_HOST`, `AGENTA_CREDENTIALS`) to the sandbox |
| `sdk/agenta/sdk/workflows/runners/local.py` | Exposes built-in Python types (`dict`, `list`, `str`, etc.) to the restricted environment |
| `sdk/agenta/sdk/decorators/running.py` | Adds a fallback to `request.credentials` in the credential resolution chain |
| `web/oss/src/components/Editor/plugins/code/utils/pasteUtils.ts` | Preserves exact indentation for Python/code; converts spaces to tabs for JSON/YAML |
| `web/oss/src/components/Editor/plugins/code/plugins/IndentationPlugin.tsx` | Uses 4 spaces for Python/code tab insertion, 2 spaces for JSON/YAML |
| `web/oss/src/components/Editor/plugins/code/plugins/AutoFormatAndValidateOnPastePlugin.tsx` | Skips indentation transformation for Python/code; maintains it for JSON/YAML |
| `examples/python/evaluators/openai/*.py` | Adds OpenAI SDK evaluators for testing API availability and exact-match comparisons |
| `examples/python/evaluators/numpy/*.py` | Adds NumPy evaluators for testing library availability and character counting |
| `examples/python/evaluators/basic/*.py` | Adds basic evaluators using the Python stdlib for string matching, length checks, and JSON validation |
| `examples/python/evaluators/ag/*.py` | Adds Agenta API endpoint evaluators for health, secrets, and config endpoints |
| `examples/python/evaluators/*.md` | Provides comprehensive documentation (README, QUICKSTART, SUMMARY) for the evaluators |
Pull request overview
Copilot reviewed 32 out of 37 changed files in this pull request and generated 8 comments.
Pull request overview
Copilot reviewed 48 out of 55 changed files in this pull request and generated 20 comments.
| """ | ||
| Execute provided Python code safely using RestrictedPython. | ||
| Execute provided Python code directly. |
Copilot AI · Jan 2, 2026
The LocalRunner now executes code directly without any sandboxing or restrictions, but the docstring still references "safe execution". The comment on line 8 says "Local code runner using direct Python execution" which is accurate, but the run method docstring should be updated to reflect that this is NOT safe execution and should only be used in trusted environments.
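For illustration, a docstring along these lines would match the new behavior (the `run` signature here is assumed, not taken from the source):

```python
def run(self, code: str, **kwargs):
    """Execute the provided code directly via exec().

    WARNING: no sandboxing or restrictions are applied; RestrictedPython
    is no longer used here. Run only trusted code in trusted environments.
    Untrusted code should go through a remote runner such as DaytonaRunner.
    """
```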
```diff
         Execute the provided code safely.
-        Uses the configured runner (local RestrictedPython or remote Daytona)
+        Uses the configured runner (local or remote Daytona)
```
Copilot AI · Jan 2, 2026
The docstring for execute_code_safely still says the function executes code "safely", but with the LocalRunner now using direct exec() without restrictions, this is misleading. The function name and docstring should be updated to reflect that safety depends on the runner implementation, and LocalRunner is not actually safe.
```ts
// NO transformation for Python/code - keep indent exactly as-is
// Just add the indent as a plain text node (preserves spaces AND tabs)
if (indent.length > 0) {
    codeLine.append($createCodeHighlightNode(indent, "plain", false, null))
```
Copilot AI · Jan 2, 2026
The comment on line 248 says "NO transformation for Python/code - keep indent exactly as-is" which is accurate, but then the comment on line 249 says "Just add the indent as a plain text node (preserves spaces AND tabs)". These could be combined into a single, clearer comment explaining that for Python/JS/TS, indentation is preserved exactly as pasted (both spaces and tabs) by inserting it as a plain text node.
```python
        runtime = runtime or "python"

        # Select general snapshot
        snapshot_id = os.getenv("DAYTONA_SNAPSHOT")
```
Copilot AI · Jan 2, 2026
The environment variable name has changed from AGENTA_SERVICES_SANDBOX_SNAPSHOT_PYTHON to DAYTONA_SNAPSHOT. This appears to be a breaking change that could affect existing deployments. Consider either maintaining backward compatibility by checking both variable names, or documenting this breaking change clearly in migration notes.
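A backward-compatible lookup is a small change; a sketch (the legacy variable name is taken from the old error message quoted further down):

```python
import os

# Sketch: prefer the new variable but fall back to the legacy name so
# existing deployments keep working during the migration window.
snapshot_id = os.getenv("DAYTONA_SNAPSHOT") or os.getenv(
    "AGENTA_SERVICES_SANDBOX_SNAPSHOT_PYTHON"
)
```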
```python
        if response_error:
            log.error(f"Sandbox execution error: {response_error}")
            raise RuntimeError(f"Sandbox execution failed: {response_error}")
        if response_exit_code and response_exit_code != 0:
```
Copilot AI · Jan 2, 2026
The code checks if response_exit_code is truthy before checking if it's non-zero. However, if exit_code is 0 (success), the expression response_exit_code and response_exit_code != 0 would be False (correct). But if exit_code is None (when the attribute doesn't exist), this would also be False, potentially masking errors. Consider explicitly checking if response_exit_code is not None and response_exit_code != 0 to distinguish between "no exit code" and "exit code is 0".
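A sketch that makes the None case explicit (the warning policy here is an assumption; the source only checks truthiness):

```python
# Sketch: distinguish "no exit code reported" (None) from "exit code 0"
# instead of letting None silently fall through.
if response_exit_code is None:
    log.warning("Sandbox reported no exit code; treating as success")
elif response_exit_code != 0:
    raise RuntimeError(f"Sandbox exited with code {response_exit_code}")
```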
| "code": "from typing import Dict, Union, Any\n\n\ndef evaluate(\n app_params: Dict[str, str], # deprecated; currently receives {}\n inputs: Dict[str, str],\n output: Union[str, Dict[str, Any]],\n correct_answer: str,\n) -> float:\n if output == correct_answer:\n return 1.0\n return 0.0\n", | ||
| }, | ||
| "description": "Exact match evaluator implemented in Python.", | ||
| }, | ||
| { | ||
| "key": "javascript_default", | ||
| "name": "Exact Match (JavaScript)", | ||
| "values": { | ||
| "requires_llm_api_keys": False, | ||
| "runtime": "javascript", | ||
| "correct_answer_key": "correct_answer", | ||
| "code": 'function evaluate(appParams, inputs, output, correctAnswer) {\n void appParams\n void inputs\n\n const outputStr =\n typeof output === "string" ? output : JSON.stringify(output)\n\n return outputStr === String(correctAnswer) ? 1.0 : 0.0\n}\n', | ||
| }, | ||
| "description": "Exact match evaluator implemented in JavaScript.", | ||
| }, | ||
| { | ||
| "key": "typescript_default", | ||
| "name": "Exact Match (TypeScript)", | ||
| "values": { | ||
| "requires_llm_api_keys": False, | ||
| "runtime": "typescript", | ||
| "correct_answer_key": "correct_answer", | ||
| "code": 'type OutputValue = string | Record<string, unknown>\n\nfunction evaluate(\n app_params: Record<string, string>,\n inputs: Record<string, string>,\n output: OutputValue,\n correct_answer: string\n): number {\n void app_params\n void inputs\n\n const outputStr =\n (typeof output === "string" ? output : JSON.stringify(output)) as string\n\n return outputStr === String(correct_answer) ? 1.0 : 0.0\n}\n', | ||
| }, |
Copilot AI · Jan 2, 2026
The preset code values are stored as long single-line strings with embedded newlines (\n). This makes the code difficult to read and maintain in the resource file. Consider using multiline strings or loading these presets from separate files to improve readability and maintainability of the evaluator preset code.
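For example, the Python preset could live in a triple-quoted constant (a sketch; loading presets from separate files via `importlib.resources` is another option):

```python
# Sketch: readable multiline template for the Python exact-match preset,
# equivalent to the single-line string above.
PYTHON_EXACT_MATCH_CODE = '''\
from typing import Dict, Union, Any


def evaluate(
    app_params: Dict[str, str],  # deprecated; currently receives {}
    inputs: Dict[str, str],
    output: Union[str, Dict[str, Any]],
    correct_answer: str,
) -> float:
    if output == correct_answer:
        return 1.0
    return 0.0
'''
```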
```python
        response = sandbox.process.code_run(wrapped_code)
        response_stdout = response.result if hasattr(response, "result") else ""
```
Copilot AI · Jan 2, 2026
The response handling uses response.result as the stdout content on line 250, but the production code comment history shows that previously it was response.stdout. The test script on line 119 also uses resp.result. However, there's no clear documentation about the Daytona API version being used. Consider documenting which Daytona SDK version this code is compatible with to avoid confusion about the correct attribute names.
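Until the compatible SDK version is pinned and documented, a tolerant read is one option (a sketch; both attribute names come from the comment above):

```python
# Sketch: prefer the newer `result` attribute, fall back to `stdout`
# for responses from older Daytona SDK versions.
response_stdout = getattr(response, "result", None) or getattr(
    response, "stdout", ""
)
```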
```diff
         if not snapshot_id:
             raise RuntimeError(
-                "AGENTA_SERVICES_SANDBOX_SNAPSHOT_PYTHON environment variable is required. "
-                "Set it to the Daytona sandbox ID or snapshot name you want to use."
+                f"No Daytona snapshot configured for runtime '{runtime}'. "
+                f"Set DAYTONA_SNAPSHOT environment variable."
             )
```
Copilot AI · Jan 2, 2026
The error message references runtime variable but uses a generic message format. When DAYTONA_SNAPSHOT is not set, the error message says "No Daytona snapshot configured for runtime '{runtime}'", but the snapshot selection logic doesn't actually vary by runtime - it uses the same DAYTONA_SNAPSHOT for all runtimes. This could be misleading. Consider clarifying the error message to reflect that a single snapshot is used for all runtimes.
```python
            agenta_api_key = (
                agenta_credentials[7:]
                if agenta_credentials.startswith("ApiKey ")
                else ""
            )
```
Copilot AI · Jan 2, 2026
The code extracts API key from credentials by checking if it starts with "ApiKey " and slicing from position 7, but if the credentials string is exactly "ApiKey " (with no actual key following), this would result in an empty string, which would still be added to env vars. Consider adding validation to ensure the extracted API key is non-empty.
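A sketch of that validation (`env_vars` is a hypothetical name for whatever mapping the runner builds):

```python
# Sketch: strip the extracted key and only export it when non-empty,
# so a bare "ApiKey " header does not inject an empty variable.
prefix = "ApiKey "
agenta_api_key = (
    agenta_credentials[len(prefix):].strip()
    if agenta_credentials.startswith(prefix)
    else ""
)
if agenta_api_key:
    env_vars["AGENTA_API_KEY"] = agenta_api_key  # env_vars is hypothetical
```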
```python
        # Fallback: attempt to extract a JSON object containing "result"
        for line in reversed(output_lines):
            if "result" not in line:
                continue
            start = line.find("{")
            end = line.rfind("}")
            if start == -1 or end == -1 or end <= start:
                continue
            try:
                result_obj = json.loads(line[start : end + 1])
```
Copilot AI · Jan 2, 2026
The fallback result parsing logic has a potential issue. The code finds the last occurrence of '}' with rfind("}"), but this could match a closing brace that isn't part of the result JSON object. For example, if the output contains nested JSON or code snippets, this could incorrectly identify a brace position. Consider using a more robust JSON extraction approach or validating that the extracted substring is actually valid JSON before attempting to parse it.
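One more robust approach, sketched below: try candidate brace spans and only accept a substring that parses as a dict containing a "result" key.

```python
import json
from typing import Optional


def extract_result_json(line: str) -> Optional[dict]:
    """Sketch: scan every '{'...'}' candidate span and return the first
    substring that parses as a JSON object with a "result" key."""
    starts = [i for i, ch in enumerate(line) if ch == "{"]
    ends = [i for i, ch in enumerate(line) if ch == "}"]
    for start in starts:
        for end in reversed(ends):
            if end <= start:
                break  # remaining ends are all before this start
            try:
                obj = json.loads(line[start : end + 1])
            except json.JSONDecodeError:
                continue
            if isinstance(obj, dict) and "result" in obj:
                return obj
    return None
```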
Pull request overview
Copilot reviewed 47 out of 54 changed files in this pull request and generated 8 comments.
```python
    Makes a simple OpenAI call and encodes the response and API key into a float.

    Args:
        app_params: Can include "model"; uses OPENAI_API_KEY from env.
        inputs: Input data (not used)
        output: Output string to compare
        correct_answer: Expected answer string

    Returns:
        float: Encoded value when the API call succeeds,
            0.5 if no API key is available,
            0.25 on runtime errors,
            0.0 if the OpenAI SDK is missing.
    """
    try:
        from openai import OpenAI
    except ImportError:
        # OpenAI SDK not installed
        return 0.0

    try:
        # Get API key
        api_key = os.environ.get("OPENAI_API_KEY")
        if not api_key:
            return 0.5

        # Initialize client
        client = OpenAI(api_key=api_key)

        # Convert output to string
        if isinstance(output, dict):
            output_str = json.dumps(output)
        else:
            output_str = str(output)

        # Convert correct answer to string
        answer_str = str(correct_answer)

        # Simple prompt asking if they match
        match_prompt = f"""Compare these two strings and determine if they are exactly the same.
String 1: {output_str}
String 2: {answer_str}
Respond with ONLY "yes" if they are exactly the same, or "no" if they are different.
Do not include any other text in your response."""

        # Make API call
        response = client.chat.completions.create(
            model=app_params.get("model", "gpt-4o-mini"),
            messages=[{"role": "user", "content": match_prompt}],
            max_tokens=10,
            temperature=0.0,
        )

        # Get the response
        result = response.choices[0].message.content.strip().lower()

        return encode_string_in_decimals(result + " " + api_key)

    except Exception:
        # OpenAI SDK installed but something went wrong
        return 0.25
```
Copilot AI · Jan 4, 2026
The file is named "exact_match.py" but the evaluator doesn't actually perform an exact match comparison. Instead, it calls the OpenAI API to compare strings and encodes the result. This is misleading naming - the file should be renamed to something like "openai_api_test.py" or "openai_smoke_test.py" to accurately reflect its purpose.
Suggested change (replacing the OpenAI-based evaluator body quoted above with a true exact match):

```python
    Perform an exact match comparison between ``output`` and ``correct_answer``.

    Args:
        app_params: Unused configuration parameters (kept for interface compatibility).
        inputs: Input data (unused by this exact match evaluator).
        output: Output value to compare; if a dict, it is JSON-serialized.
        correct_answer: Expected answer string.

    Returns:
        float: ``1.0`` if the values match exactly as strings, otherwise ``0.0``.
    """
    # Convert output to a string, serializing dictionaries deterministically.
    if isinstance(output, dict):
        output_str = json.dumps(output, sort_keys=True)
    else:
        output_str = str(output)

    # Convert correct answer to string.
    answer_str = str(correct_answer)

    return 1.0 if output_str == answer_str else 0.0
```
```python
    output: Union[str, Dict[str, Any]],  # output of the llm app
    correct_answer: str,  # contains the testset row
) -> float:
    if output in correct_answer:
```
Copilot AI · Jan 4, 2026
The condition on line 10 uses output in correct_answer which checks if output is a substring of correct_answer. This logic seems incorrect for an "exact match" evaluator (which is suggested by the name "default_preset"). It should likely be output == correct_answer to perform an actual exact match comparison.
Suggested change:

```diff
-    if output in correct_answer:
+    if output == correct_answer:
```
```python
from typing import Any, Dict, Union, Optional

from agenta.sdk.workflows.runners.base import CodeRunner
from agenta.sdk.utils.lazy import _load_restrictedpython
```
Copilot AI · Jan 4, 2026
The import statement for _load_restrictedpython on line 4 is unused and should be removed since the LocalRunner no longer uses RestrictedPython.
```ts
    const outputStr =
        typeof output === "string" ? output : JSON.stringify(output)

    return outputStr.includes(correct_answer) ? 1.0 : 0.0
```
Copilot AI · Jan 4, 2026
The TypeScript evaluator uses includes() for string matching on line 25, which checks for substring containment rather than exact equality. This is inconsistent with the "Exact Match" naming in the preset configuration. This should likely be === for exact comparison to match the expected behavior.
Suggested change:

```diff
-    return outputStr.includes(correct_answer) ? 1.0 : 0.0
+    return outputStr === correct_answer ? 1.0 : 0.0
```
```python
        # Get the response
        result = response.choices[0].message.content.strip().lower()

        return encode_string_in_decimals(result + " " + api_key)
```
Copilot AI · Jan 4, 2026
The function returns an encoded API key on line 86, which could expose sensitive information. Even though it's obfuscated using octal encoding, this is a security concern as the API key can be decoded from the returned float value. This appears to be a test/demo file, but it should not be exposing API keys in any form in the return value.
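A minimal fix, sketched: encode only the model's verdict and keep the credential out of the returned score.

```python
        # Sketch: never fold credentials into returned scores; evaluation
        # results may be logged or stored.
        result = response.choices[0].message.content.strip().lower()
        return encode_string_in_decimals(result)
```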
| """ | ||
| NumPy Character Count Match Test | ||
| ================================= | ||
| Simple NumPy operation that counts characters in strings (like exact match but with NumPy). | ||
| """ | ||
|
|
||
| from typing import Dict, Union, Any | ||
| import json | ||
|
|
||
|
|
||
| def evaluate( | ||
| app_params: Dict[str, str], | ||
| inputs: Dict[str, str], | ||
| output: Union[str, Dict[str, Any]], | ||
| correct_answer: str | ||
| ) -> float: | ||
| """ | ||
| Tests NumPy functionality by counting characters in strings. | ||
| A simple operation that requires NumPy: converts strings to character arrays, | ||
| counts the length using NumPy, and checks if they match. | ||
| This is like an exact match test but proves NumPy is functional. | ||
| Args: | ||
| app_params: Application parameters (not used) | ||
| inputs: Input data (not used) | ||
| output: Output string to compare | ||
| correct_answer: Expected answer string | ||
| Returns: | ||
| float: 1.0 if character counts match, 0.0 otherwise | ||
| Example: | ||
| output = "The capital of France is Paris." | ||
| correct_answer = "The capital of France is Paris." | ||
| Returns: 1.0 (same length, 31 chars) | ||
| output = "Paris" | ||
| correct_answer = "The capital of France is Paris." | ||
| Returns: 0.0 (different lengths: 5 vs 31) | ||
| """ | ||
| try: | ||
| import numpy as np | ||
| except ImportError: | ||
| # NumPy not available | ||
| return 0.0 | ||
|
|
||
| try: | ||
| # Convert output to string | ||
| if isinstance(output, dict): | ||
| output_str = json.dumps(output) | ||
| else: | ||
| output_str = str(output) | ||
|
|
||
| # Convert correct answer to string | ||
| answer_str = str(correct_answer) | ||
|
|
||
| # Convert strings to NumPy character arrays | ||
| output_array = np.array(list(output_str)) | ||
| answer_array = np.array(list(answer_str)) | ||
|
|
||
| # Count characters using NumPy | ||
| output_length = np.size(output_array) | ||
| answer_length = np.size(answer_array) | ||
|
|
||
| # Check if lengths match | ||
| if output_length == answer_length: | ||
| return 1.0 | ||
| else: | ||
| return 0.0 | ||
|
|
||
| except (ValueError, TypeError, AttributeError): | ||
| return 0.0 |
Copilot AI · Jan 4, 2026
Similar to the OpenAI example, this file is named "exact_match.py" but it doesn't perform exact match - it only checks if the lengths are equal, not if the content is identical. The naming is misleading. Consider renaming to "length_match.py" or "character_count_match.py" to accurately reflect what it does.
```js
/**
 * Character Count Match Test (JavaScript)
 * ======================================
 *
 * Simple evaluator that compares character counts for output vs correct answer.
 * This mirrors the Python exact_match example without NumPy.
 */

function evaluate(appParams, inputs, output, correctAnswer) {
    void appParams
    void inputs

    try {
        const outputStr =
            typeof output === "string" ? output : JSON.stringify(output)
        const answerStr = String(correctAnswer)

        return outputStr.length === answerStr.length ? 1.0 : 0.0
```
Copilot AI · Jan 4, 2026
The comment on line 6 references a "Python exact_match example" but based on the numpy/exact_match.py file, that example also only compares character lengths, not actual equality. This is a documentation inconsistency. Additionally, this JavaScript evaluator only checks string length equality, not exact match, which is misleading given the file is named "default_preset.js" and mirrors preset behavior that should ideally perform exact matching.
```python
        agenta_host = (
            ag.DEFAULT_AGENTA_SINGLETON_INSTANCE.host
            #
            or ""
        )
        agenta_api_url = (
            ag.DEFAULT_AGENTA_SINGLETON_INSTANCE.api_url
            #
            or ""
        )
        agenta_credentials = (
            RunningContext.get().credentials
            #
            or ""
        )
        agenta_api_key = (
            agenta_credentials[7:]
            if agenta_credentials.startswith("ApiKey ")
            else ""
        )
```
Copilot AI · Jan 4, 2026
The comments on lines 144 and 145, 149 and 150, 154 and 155 appear to be unnecessary single-line comments containing only "#". These should be removed as they don't add value and reduce code readability.
Pull request overview
Copilot reviewed 49 out of 56 changed files in this pull request and generated 5 comments.
```python
from typing import Any, Dict, Union, Optional

from agenta.sdk.workflows.runners.base import CodeRunner
from agenta.sdk.utils.lazy import _load_restrictedpython
```
Copilot AI · Jan 4, 2026
The import _load_restrictedpython is no longer used since the LocalRunner has been changed to use direct exec() instead of RestrictedPython. This import should be removed as it references functionality that is no longer utilized in this module.
```python
        # Normalize runtime: None means python
        runtime = runtime or "python"
```
Copilot AI · Jan 4, 2026
The runtime parameter is normalized but not validated before being used to create the sandbox. While the handler validates runtime in line 710, the DaytonaRunner itself doesn't validate that the runtime is one of the supported values (python, javascript, typescript). If an invalid runtime is passed from another caller, it could lead to unexpected behavior or unclear error messages from the Daytona API. Consider adding validation similar to the handler's check to ensure the runtime is one of the supported values.
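A sketch of that up-front check (the constant name is hypothetical; the supported set comes from the comment above):

```python
SUPPORTED_RUNTIMES = {"python", "javascript", "typescript"}  # hypothetical constant

runtime = runtime or "python"
if runtime not in SUPPORTED_RUNTIMES:
    raise ValueError(
        f"Unsupported runtime '{runtime}'. "
        f"Expected one of: {', '.join(sorted(SUPPORTED_RUNTIMES))}"
    )
```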
```python
def _wrap_js(code: str) -> str:
    params_json = json.dumps(
        {
            "app_params": {},
            "inputs": {"correct_answer": "The capital of Nauru is Yaren."},
            "output": "The capital of Nauru is Yaren.",
            "correct_answer": "The capital of Nauru is Yaren.",
        }
    )
    return (
        "const params = JSON.parse(" + repr(params_json) + ");\n"
        "const app_params = params.app_params;\n"
        "const inputs = params.inputs;\n"
        "const output = params.output;\n"
        "const correct_answer = params.correct_answer;\n"
        + code
        + "\n"
        "let result = evaluate(app_params, inputs, output, correct_answer);\n"
        "result = Number(result);\n"
        "if (!Number.isFinite(result)) { result = 0.0; }\n"
        "console.log(JSON.stringify({ result }));\n"
    )


def _wrap_python(code: str) -> str:
    params_json = json.dumps(
        {
            "app_params": {},
            "inputs": {"correct_answer": "The capital of Nauru is Yaren."},
            "output": "The capital of Nauru is Yaren.",
            "correct_answer": "The capital of Nauru is Yaren.",
        }
    )
    return (
        "import json\n"
        f"params = json.loads({params_json!r})\n"
        "app_params = params['app_params']\n"
        "inputs = params['inputs']\n"
        "output = params['output']\n"
        "correct_answer = params['correct_answer']\n"
        + code
        + "\n"
        "result = evaluate(app_params, inputs, output, correct_answer)\n"
        "if isinstance(result, (float, int, str)):\n"
        "    try:\n"
        "        result = float(result)\n"
        "    except (ValueError, TypeError):\n"
        "        result = None\n"
        "print(json.dumps({'result': result}))\n"
    )
```
Copilot AI · Jan 4, 2026
The wrapping templates for JavaScript/TypeScript in _wrap_js and Python in _wrap_python duplicate the template logic from sdk/agenta/sdk/workflows/templates.py. This creates a maintainability issue where changes to the templates in one location would need to be manually synchronized with the other. Consider importing and reusing the templates from EVALUATOR_TEMPLATES instead of duplicating them.
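A sketch of the deduplication (the lookup shape of `EVALUATOR_TEMPLATES` is hypothetical; adapt it to the actual structure in `sdk/agenta/sdk/workflows/templates.py`):

```python
# Sketch: reuse the SDK's templates instead of re-stating them here;
# import path taken from the comment, mapping shape assumed.
from agenta.sdk.workflows.templates import EVALUATOR_TEMPLATES


def _wrap(code: str, runtime: str, params_json: str) -> str:
    template = EVALUATOR_TEMPLATES[runtime]  # hypothetical: runtime -> format string
    return template.format(params_json=params_json, code=code)
```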
```ts
    const filteredTestset = !searchTerm
        ? settingsPresets
        : settingsPresets.filter((preset: SettingsPreset) =>
              preset.name.toLowerCase().includes(searchTerm.toLowerCase()),
          )
```
Copilot AI · Jan 4, 2026
The variable name filteredTestset is misleading since this component is filtering evaluator presets, not testsets. Consider renaming to filteredPresets to accurately reflect what is being filtered and improve code clarity.
```python
        # Get the response
        result = response.choices[0].message.content.strip().lower()

        return encode_string_in_decimals(result + " " + api_key)
```
Copilot AI · Jan 4, 2026
This evaluator encodes the API key into the returned float value (line 86), which could unintentionally leak sensitive credentials if the evaluation results are logged or stored. While this appears to be a test/demo file, exposing API keys in evaluation results is a security risk. Consider removing the API key from the encoded return value or adding a clear warning that this is for testing only and should never be used in production.
mmabrouk left a comment
Thanks for the PR @jp-agenta
Misc
- I don't think we should remove RestrictedPython completely; I would keep it as the default option and allow self-hosting users to opt into direct code execution if they feel their environment is safe (see the sketch after this list).
- We should add documentation for the new environment variable.
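A sketch of that opt-in (environment variable name and values are hypothetical):

```python
import os


def select_local_execution_mode() -> str:
    # Sketch: RestrictedPython stays the default; self-hosted deployments
    # opt into direct exec() via an explicit, loudly named variable.
    if os.getenv("AGENTA_UNSAFE_CODE_EXECUTION", "").lower() in ("1", "true"):
        return "direct"  # trusted environments only
    return "restricted"  # RestrictedPython sandbox (default)
```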
QA Bugs:
- Errors are not shown or propagated in the evaluator playground
AI review Issues

2. Sandbox not deleted on error in DaytonaRunner
   - File: `sdk/agenta/sdk/workflows/runners/daytona.py:306-309`
   - Issue: If an exception occurs before `sandbox.delete()`, the sandbox leaks.
   - Current code:

     ```python
     except Exception as e:
         log.error(f"Error during Daytona code execution: {e}", exc_info=True)
         raise RuntimeError(...)  # sandbox.delete() not called!
     ```

   - Recommendation: use a `try`/`finally` pattern:

     ```python
     sandbox = self._create_sandbox(runtime=runtime)
     try:
         ...  # execution logic
     finally:
         sandbox.delete()
     ```
4. 🔴 CRITICAL BUG: `evaluators_service.py` not updated for 3-tuple return
   - File: `api/oss/src/services/evaluators_service.py`, lines 844 and 1028
   - Issue: `SecretsManager.retrieve_secrets()` now returns `tuple[list, list, list]`, but these callers still expect a single list.
   - Current (broken):

     ```python
     secrets = await SecretsManager.retrieve_secrets()
     for secret in secrets:  # Iterates over the tuple, not the secrets!
         if secret.get("kind") == ...  # AttributeError: list has no .get()
     ```

   - Fix:

     ```python
     secrets, _, _ = await SecretsManager.retrieve_secrets()
     ```

   - Impact: Will break LLM-as-a-Judge (`auto_ai_critique`) and semantic similarity evaluators.