
feat(ir): return callable CompiledProgram from ir.compile()#961

Closed

Hzfengsy wants to merge 1 commit into hw-native-sys:main from Hzfengsy:issue-958-callable-compile-api

Conversation

@Hzfengsy
Member

Summary

  • Make ir.compile() return a CompiledProgram object that is callable with torch tensors (Triton-like API)
  • Maintain full backward compatibility via __str__ and __fspath__ — existing code using ir.compile() as a path string continues to work
  • Add execute_compiled() runtime helper for executing pre-compiled programs with user-provided tensors

API

# Compile once
compiled = ir.compile(MyProgram)

# In-place style: output tensor passed as argument
compiled(a, b, c)  # c modified on device

# Return style: output allocated and returned
c = compiled(a, b)

# Backward compat: still works as a path
os.path.join(compiled, "kernels")  # fine

Key design decisions

  • Orchestration function metadata extracted lazily from IR (param_directions, return_types) to detect outputs vs inputs automatically
  • DataType → torch.dtype mapping uses string keys because nanobind DataType instances are not singletons (hash differs for equal values)
  • Deferred runtime imports in __call__ to avoid circular dependency between ir and runtime modules
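
The path-like backward compatibility and deferred-import pattern described above can be sketched as follows (a minimal illustration with hypothetical class and attribute names, not the actual implementation):

```python
import os


class CompiledProgramSketch:
    """Minimal sketch of the path-like protocol described above (hypothetical
    class; the real CompiledProgram also wraps the compiled IR and execution)."""

    def __init__(self, output_dir: str):
        self._output_dir = output_dir

    def __str__(self) -> str:
        # String contexts (f-strings, print) see the artifact path.
        return self._output_dir

    def __fspath__(self) -> str:
        # os.PathLike protocol: os.path.join / open / pathlib accept this object.
        return self._output_dir

    def __call__(self, *tensors):
        # In the real class, the runtime import would happen here (deferred)
        # to avoid the ir <-> runtime circular dependency; execution is elided.
        raise NotImplementedError("execution elided in this sketch")


compiled = CompiledProgramSketch("/tmp/out")
print(os.path.join(compiled, "kernels"))  # behaves like the old path string
```

Because `__fspath__` makes the object a virtual subclass of `os.PathLike`, existing callers that treated the return value as a path keep working unchanged.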

Fixes #958

Test plan

  • 21 unit tests covering backward compat, metadata extraction, arg validation, output allocation, and ir.compile() return type
  • Full test suite passes (3448 tests, 0 failures)
  • All pre-commit hooks pass (ruff, pyright, formatting)

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings April 10, 2026 10:06
@coderabbitai

coderabbitai bot commented Apr 10, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

ir.compile() now returns a new CompiledProgram object (callable) instead of a path string. CompiledProgram wraps compiled IR, lazily extracts orchestration metadata, supports in-place and return-style calls, and delegates execution to runtime.execute_compiled. Runtime exposes execute_compiled to run precompiled artifacts.
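
The in-place vs. return-style dispatch that the walkthrough describes comes down to argument arity; a minimal sketch (hypothetical helper name and error message, not the real method):

```python
def dispatch_style(n_args: int, n_params: int, n_outputs: int) -> str:
    """Pick the calling convention from the argument count, as the
    walkthrough describes (hypothetical helper for illustration)."""
    n_inputs = n_params - n_outputs
    if n_args == n_params:
        # Caller supplied output tensors too: modify them in place.
        return "in-place"
    if n_args == n_inputs:
        # Caller supplied only inputs: outputs are allocated and returned.
        return "return-style"
    raise TypeError(
        f"expects {n_params} args (in-place) or {n_inputs} (return style), got {n_args}"
    )


print(dispatch_style(3, 3, 1))  # in-place
print(dispatch_style(2, 3, 1))  # return-style
```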

Changes

  • Module exports (python/pypto/ir/__init__.py, python/pypto/runtime/__init__.py): Re-exported CompiledProgram and execute_compiled and added them to the respective __all__ lists.
  • Compilation interface (python/pypto/ir/compile.py): compile() now returns a CompiledProgram, adds platform: str = "a2a3sim" and device_id: int = 0 parameters, and uses a TYPE_CHECKING-guarded import plus a local runtime import to avoid cycles.
  • Compiled program class (python/pypto/ir/compiled_program.py): New CompiledProgram class that stores program/output_dir/platform/device_id, extracts metadata lazily (_extract_param_infos), provides path-like behavior (__str__, __fspath__, __eq__, __hash__), and implements a callable __call__ supporting in-place and return-style invocation, building full argument lists for execution.
  • Runtime execution (python/pypto/runtime/runner.py): Added execute_compiled() to run compiled artifacts: optional header patching, golden-script generation (_write_call_golden), input serialization, and dispatch to _execute_on_device().
  • Tests (tests/ut/ir/test_compiled_program.py): New unit tests covering metadata extraction, path-like semantics, equality/representation, argument validation, _build_full_args allocation behavior, and that ir.compile() returns a usable CompiledProgram whose path exists.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CP as CompiledProgram
    participant Runner as runtime.execute_compiled
    participant Device as _execute_on_device

    User->>CP: __call__(*torch.Tensor)
    activate CP

    CP->>CP: _get_metadata() / _extract_param_infos() (lazy)
    Note over CP: Determine param names, directions,\noutput_indices, shapes, dtypes

    alt In-place style (args == total params)
        CP->>CP: validate arg count (in-place)
        CP->>CP: _build_full_args() (identity)
    else Return style (args == input params)
        CP->>CP: validate arg count (return-style)
        CP->>CP: _build_full_args() (allocate outputs)
    end

    CP->>Runner: execute_compiled(work_dir, tensors, param_infos, output_indices, platform, device_id)
    activate Runner

    Runner->>Runner: optionally patch headers
    Runner->>Runner: _write_call_golden() (optional)
    Runner->>Runner: serialize inputs to work_dir/cache/Default_inputs.pt
    Runner->>Device: _execute_on_device(...)
    activate Device
    Device-->>Runner: execution complete
    deactivate Device

    deactivate Runner

    alt Return style
        CP-->>User: return tensor or tuple[tensors]
    else In-place style
        CP-->>User: return None
    end

    deactivate CP

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A rabbit hops, compiled and spry,

I nibble metadata on the fly.
Pass your tensors, in-place or new,
Kernels run—no golden cue.
A tiny hop, and outputs fly!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 44.68%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Title check ✅ Passed: The title clearly and specifically summarizes the main change: making ir.compile() return a callable CompiledProgram object.
  • Description check ✅ Passed: The description is well detailed and relates to the changeset, explaining the new CompiledProgram API, backward compatibility, key design decisions, and test coverage.
  • Linked Issues check ✅ Passed: The PR implements all core objectives from issue #958: the CompiledProgram callable API with in-place and return-style execution, lazy metadata extraction, backward compatibility via __str__/__fspath__, and the execute_compiled() runtime helper.
  • Out of Scope Changes check ✅ Passed: All code changes are scoped to implementing the CompiledProgram callable API and related infrastructure; no unrelated modifications were detected.


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a CompiledProgram wrapper for the ir.compile output, providing a Triton-like callable API to execute compiled programs directly with torch tensors. The implementation supports both in-place and return-style calling conventions while preserving backward compatibility for path-based artifact access. Review feedback identifies a serialization mismatch in the runtime runner, potential runtime crashes when automatically allocating tensors for dynamic shapes, and opportunities to improve type inference for scalar parameters.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (7)
python/pypto/ir/compiled_program.py (3)

95-95: Add strict=True to zip for defensive programming.

While orch_func.params and orch_func.param_directions should always have matching lengths by IR construction, adding strict=True provides an early fail-fast if this invariant is ever violated.

♻️ Proposed fix
-    for i, (param, direction) in enumerate(zip(orch_func.params, orch_func.param_directions)):
+    for i, (param, direction) in enumerate(zip(orch_func.params, orch_func.param_directions, strict=True)):

40-52: Missing dtype mappings for some PyPTO types.

The mapping is missing several DataType values exported from pypto.ir: FP4, FP8E4M3FN, FP8E5M2, HF4, HF8, INT4, UINT16, UINT32, UINT64. While PyTorch may not support all of these natively, users attempting return-style calls with these types will get an unhelpful "Unsupported dtype" error.

Consider adding a note in the error message at line 310 about which dtypes are supported, or mapping to the closest supported type where reasonable.
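
As an aside, the string-keyed mapping that works around non-singleton DataType instances (mentioned in the PR summary) can be illustrated without torch or nanobind; all names below are hypothetical stand-ins:

```python
class FakeDataType:
    """Stand-in for a nanobind-backed enum whose equal-looking values are
    distinct objects with identity-based hashing (for illustration only)."""

    def __init__(self, name: str):
        self.name = name

    def __str__(self) -> str:
        return self.name
    # No __eq__/__hash__: the default identity semantics mimic the
    # non-singleton behavior described in the PR.


# Keying by str(dtype) makes lookup independent of object identity.
_DTYPE_MAP = {"FP32": "torch.float32", "FP16": "torch.float16"}


def to_torch_dtype(dt: FakeDataType):
    return _DTYPE_MAP.get(str(dt))  # None signals an unsupported dtype


a, b = FakeDataType("FP32"), FakeDataType("FP32")
assert {a: 1}.get(b) is None      # instance-keyed lookup fails
print(to_torch_dtype(b))          # torch.float32 (string stand-in here)
```

An instance-keyed dict silently misses on "equal" values, which is why the string-keyed lookup is the safer design; extending the dict (or enumerating its keys in the error message) addresses the comment above.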


197-205: Equality and hash semantics may surprise users.

__eq__ compares only output_dir, ignoring platform, device_id, and program. Two CompiledProgram instances pointing to the same directory but with different platforms will be considered equal. This is likely intentional for backward compatibility with string paths, but could cause subtle bugs if users expect full equality.

Consider documenting this behavior in the class docstring, e.g., "Equality is based solely on output directory for backward compatibility with path strings."
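
The path-only equality semantics being documented can be sketched like this (a hypothetical minimal class, not the real CompiledProgram):

```python
import os


class CompiledProgramEq:
    """Equality and hashing are based solely on the output directory, for
    backward compatibility with path strings; platform and device_id are
    deliberately ignored (sketch of the semantics the review describes)."""

    def __init__(self, output_dir: str, platform: str = "a2a3sim", device_id: int = 0):
        self._output_dir = output_dir
        self.platform = platform
        self.device_id = device_id

    def __fspath__(self) -> str:
        return self._output_dir

    def __eq__(self, other) -> bool:
        # Any str or os.PathLike (including another instance, which is
        # PathLike via __fspath__) compares by path only.
        if isinstance(other, (str, os.PathLike)):
            return self._output_dir == os.fspath(other)
        return NotImplemented

    def __hash__(self) -> int:
        return hash(self._output_dir)


a = CompiledProgramEq("/tmp/out", platform="a2a3sim")
b = CompiledProgramEq("/tmp/out", platform="a5sim")
print(a == b, a == "/tmp/out")  # True True: platform is ignored
```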

tests/ut/ir/test_compiled_program.py (2)

169-176: Use raw string for regex pattern with metacharacters.

The pattern "expects 3 .* or 2" contains the . metacharacter. Use a raw string to make intent clear and satisfy static analysis.

♻️ Proposed fix
         # Program has 3 params (2 in + 1 out), with return.
         # Valid: 3 args (in-place) or 2 args (return style)
-        with pytest.raises(TypeError, match="expects 3 .* or 2"):
+        with pytest.raises(TypeError, match=r"expects 3 .* or 2"):
             cp(a)  # 1 arg is neither 3 nor 2

93-108: Minor: Prefix unused variable with underscore.

Per static analysis, out_idx at line 95 is unused. Prefix with underscore to indicate intentional discard.

♻️ Proposed fix
     def test_extracts_param_names_and_directions(self):
         prog = _make_program_with_orchestration()
-        infos, out_idx, _ = _extract_param_infos(prog)
+        infos, _out_idx, _ = _extract_param_infos(prog)
 
         assert len(infos) == 3
python/pypto/ir/compile.py (1)

52-54: Consider documenting the platform/device_id interaction with backend_type.

RunConfig.__post_init__ auto-corrects platform to match backend_type (e.g., if backend_type=Ascend950, it ensures platform starts with "a5"). The compile() function doesn't perform this validation, so users could inadvertently create a CompiledProgram with mismatched platform and backend_type.

Either add validation here, or document in the docstring that platform should match backend_type, or delegate validation to CompiledProgram.__init__.

python/pypto/runtime/runner.py (1)

879-906: Consider validating platform parameter.

Unlike RunConfig.__post_init__ (lines 278-288) which validates and auto-corrects platform values, execute_compiled passes the platform directly to _execute_on_device without validation. An invalid platform value could propagate to CodeRunner and cause unclear errors.

♻️ Suggested validation
 def execute_compiled(
     work_dir: Path,
     tensors: list[torch.Tensor],
     param_infos: list,
     output_indices: list[int],
     *,
     platform: str,
     device_id: int,
     pto_isa_commit: str | None = None,
 ) -> None:
+    valid_platforms = ("a2a3sim", "a2a3", "a5sim", "a5")
+    if platform not in valid_platforms:
+        raise ValueError(f"Invalid platform {platform!r}. Expected one of {valid_platforms}.")
+
     work_dir = Path(work_dir)
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 1073bb3d-11bc-4cf4-9eb5-5959a2ad2780

📥 Commits

Reviewing files that changed from the base of the PR and between 3bf1c65 and 6327606.

📒 Files selected for processing (6)
  • python/pypto/ir/__init__.py
  • python/pypto/ir/compile.py
  • python/pypto/ir/compiled_program.py
  • python/pypto/runtime/__init__.py
  • python/pypto/runtime/runner.py
  • tests/ut/ir/test_compiled_program.py

Contributor

Copilot AI left a comment


Pull request overview

This PR updates the Python IR compilation API so pypto.ir.compile() returns a callable CompiledProgram object (Triton-like execution style) while preserving path-like backward compatibility, and adds a runtime helper to execute precompiled artifacts with user-provided tensors.

Changes:

  • Add pypto.ir.CompiledProgram wrapper with __call__, lazy orchestration metadata extraction, and __str__/__fspath__ for path-like behavior.
  • Change pypto.ir.compile() to return CompiledProgram (with platform/device binding for later execution).
  • Add pypto.runtime.execute_compiled() plus a minimal auto-generated golden script for device execution via the existing runner infrastructure.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

Show a summary per file

  • python/pypto/ir/compile.py: Returns CompiledProgram instead of str; adds platform/device_id wiring.
  • python/pypto/ir/compiled_program.py: New callable wrapper and orchestration-signature metadata extraction/allocation logic.
  • python/pypto/ir/__init__.py: Re-exports CompiledProgram from pypto.ir.
  • python/pypto/runtime/runner.py: Adds execute_compiled() and _write_call_golden() to run compiled outputs with provided tensors.
  • python/pypto/runtime/__init__.py: Re-exports execute_compiled.
  • tests/ut/ir/test_compiled_program.py: New unit tests for backward compat, metadata extraction, and basic call/alloc behavior.


@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (2)
python/pypto/ir/compiled_program.py (1)

309-317: ⚠️ Potential issue | 🟠 Major

Return-style allocation fails for dynamic output dimensions.

Dynamic dims are represented as -1 (Line 105), but Line 317 passes them directly to torch.zeros(...), which raises at runtime.

🐛 Proposed fix
             if i in output_set:
                 # Allocate output tensor from IR metadata
                 if info.shape is None:
                     raise ValueError(f"Cannot allocate output tensor {info.name!r}: no shape in IR")
+                if any(d < 0 for d in info.shape):
+                    raise ValueError(
+                        f"Cannot allocate output tensor {info.name!r}: dynamic dimensions {info.shape}. "
+                        "Use in-place calling style and pass output tensors explicitly."
+                    )
                 torch_dtype = _to_torch_dtype(info.dtype)
                 if torch_dtype is None:
                     raise ValueError(f"Unsupported dtype {info.dtype} for output tensor {info.name!r}")
                 all_tensors.append(torch.zeros(info.shape, dtype=torch_dtype))
python/pypto/runtime/runner.py (1)

925-930: ⚠️ Potential issue | 🔴 Critical

Input cache format is incompatible with _load_inputs and drops user tensors.

At Line 928, Default_inputs.pt is saved as list-of-tuples, but _load_inputs() expects list-of-dicts. That causes cache load failure and fallback to placeholder tensors, so callable execution can ignore provided inputs.

🐛 Proposed fix
     # 3. Save user tensors to cache so the golden_inputs_patch can load them
     cache_dir = work_dir / "cache"
     cache_dir.mkdir(exist_ok=True)
-    inputs_data = [(info.name, tensor) for info, tensor in zip(param_infos, tensors)]
-    torch.save(inputs_data, cache_dir / "Default_inputs.pt")
+    inputs_data = [(info.name, tensor) for info, tensor in zip(param_infos, tensors, strict=True)]
+    _save_inputs(inputs_data, cache_dir / "Default_inputs.pt")
🧹 Nitpick comments (2)
python/pypto/ir/compile.py (1)

41-41: Move inline # noqa suppressions to Ruff config.

Please avoid inline suppressions in python/pypto/ and add targeted per-file ignores in ruff.toml instead (with rationale), for both PLR0913 and PLC0415.

Based on learnings: In hw-native-sys/pypto, prefer fixing root causes in linter configuration (per-file-ignores in ruff.toml) rather than inline # noqa under python/pypto.

Also applies to: 175-175

python/pypto/ir/compiled_program.py (1)

277-277: Use Ruff per-file ignore instead of inline # noqa.

Please move this suppression to ruff.toml per-file-ignores (with rationale) instead of inline # noqa.

Based on learnings: In hw-native-sys/pypto, Python files under python/pypto should prefer ruff.toml per-file-ignores over inline suppression comments.

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@python/pypto/ir/compiled_program.py`:
- Around line 309-317: The allocation code for outputs in the loop over
param_infos fails when info.shape contains dynamic dims encoded as -1 because
torch.zeros(...) rejects negative sizes; before calling torch.zeros, map
info.shape to a concrete size tuple replacing any -1 with 1 (e.g. size =
tuple(dim if dim != -1 else 1 for dim in info.shape)), then call
torch.zeros(size, dtype=torch_dtype) and append to all_tensors; keep using
_to_torch_dtype to get torch_dtype and raise as before if None.
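A minimal sketch of that substitution (assuming dynamic dimensions are encoded as -1, as the comment describes):

```python
def concrete_size(shape):
    """Map a shape with dynamic dims (-1) to a concrete size tuple.

    torch.zeros() rejects negative sizes, so each dynamic dimension
    is replaced with a placeholder extent of 1 before allocation.
    """
    return tuple(dim if dim != -1 else 1 for dim in shape)

# e.g. size = concrete_size(info.shape), then:
# out = torch.zeros(size, dtype=torch_dtype)
```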

In `@python/pypto/runtime/runner.py`:
- Around line 925-930: The cache is being saved as a list-of-tuples but
_load_inputs() expects a list-of-dicts, causing user tensors to be dropped;
change the saved format so inputs_data is a list of dicts (e.g. each entry
includes the parameter name and the tensor under the keys `_load_inputs()` expects)
and write that to cache_dir / "Default_inputs.pt"; ensure you still use
param_infos and tensors to build the list and keep the same filename and
cache_dir usage so _load_inputs() can successfully read the user-provided
tensors.

---

Nitpick comments:
In `@python/pypto/ir/compile.py`:
- Line 41: Remove the inline "# noqa" suppressions from the compile function
(def compile(...)) and any other inline PLR0913/PLC0415 suppressions in the
python/pypto package, and instead add targeted per-file-ignores in ruff.toml for
the affected files: declare the specific codes (PLR0913, PLC0415) under
per-file-ignores with a short rationale comment explaining why the linter is
suppressed for that file; then run Ruff to confirm no inline "# noqa" remain and
adjust the ruff.toml entries if other files in python/pypto require the same
targeted ignores.

In `@python/pypto/ir/compiled_program.py`:
- Line 277: The inline suppression "# noqa: PLC0415" on the import "from
pypto.runtime.runner import execute_compiled" in compiled_program.py should be
removed and instead add a per-file ignore entry in ruff.toml for this file;
update ruff.toml's per-file-ignores to include the module path
"python/pypto/ir/compiled_program.py": ["PLC0415"] with a short rationale
comment (e.g., "import moved to avoid circular import at runtime"), then delete
the inline "# noqa" so the file-level ignore documents the exception centrally.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: ccf14cca-eec9-4f93-b821-10ef1a2553f4

📥 Commits

Reviewing files that changed from the base of the PR and between 6327606 and df719d4.

📒 Files selected for processing (6)
  • python/pypto/ir/__init__.py
  • python/pypto/ir/compile.py
  • python/pypto/ir/compiled_program.py
  • python/pypto/runtime/__init__.py
  • python/pypto/runtime/runner.py
  • tests/ut/ir/test_compiled_program.py
✅ Files skipped from review due to trivial changes (1)
  • python/pypto/runtime/__init__.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • python/pypto/ir/__init__.py

@Hzfengsy Hzfengsy force-pushed the issue-958-callable-compile-api branch 5 times, most recently from 04f4cff to 097696b Compare April 10, 2026 11:10
Make ir.compile() return a CompiledProgram object that can be called
with torch tensors (Triton-like API), while maintaining backward
compatibility via __str__ and __fspath__.

Two calling styles:
  compiled(a, b, c)  # in-place: c modified on device
  c = compiled(a, b) # return: output allocated and returned

Fixes hw-native-sys#958
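The backward-compat mechanism the commit message describes can be sketched roughly like this (a simplified stand-in, not the real CompiledProgram):

```python
import os


class CompiledProgram:
    """Minimal stand-in: str()/fspath() expose the build directory."""

    def __init__(self, output_dir: str) -> None:
        self._output_dir = output_dir

    def __str__(self) -> str:
        return self._output_dir

    def __fspath__(self) -> str:
        # Implementing os.PathLike lets os.path.join() and open()
        # accept the object wherever a path string was used before.
        return self._output_dir


compiled = CompiledProgram("/tmp/build")
print(os.path.join(compiled, "kernels"))  # joins like a plain path
```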
@Hzfengsy Hzfengsy force-pushed the issue-958-callable-compile-api branch from 097696b to 99d7f32 Compare April 10, 2026 11:24
@Hzfengsy Hzfengsy marked this pull request as draft April 10, 2026 11:47
@Hzfengsy
Member Author

Closing in favor of #1055.

@Hzfengsy Hzfengsy closed this Apr 16, 2026


Development

Successfully merging this pull request may close these issues.

[RFC] Triton-like callable API: ir.compile() returns callable, compiled(*tensors)

2 participants