Conversation
Explain block/subblock runtime queries in workload-partitioning terms and remove redundant supported-forms wording from conversion ops docs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add detailed mode parameter documentation (mode=0 vs mode=1)
- Add 'Why get_buf/rls_buf is More Programmer-Friendly' section:
  - No manual priming/draining for ping/pong loops
  - No loop peeling for complex/nested loop dependencies
  - Simpler mental model (buffer ID + program order)
- Add quick example comparison showing set_flag overhead vs get_buf simplicity
- Update Example 2 and 3b with explicit mode=0 in code
- Update comparison table with 'Loop peeling' row
- set_flag/wait_flag: 2 IDs per buffer (1 forward + 1 reverse pipe-pair)
- get_buf/rls_buf: 1 ID per buffer (handles both directions automatically)
- 8 per pipe-pair is HW limit, not a formula
- set_flag/wait_flag: 8 IDs per pipe-pair direction (HW limit)
- get_buf/rls_buf: 1 buffer ID per shared resource (HW limit: 32 global), same ID used across all pipelines
- Event ID mgmt: each buffer occupies 1 ID per direction (removed misleading 4 IDs calc)
- Drain example: use concrete EVT_*_0/EVT_*_1 instead of {(N-1)%2} expressions
- 4 set_flag + 4 wait_flag (not 8)
- 4 IDs = 2 pipe-pair directions × 2 ping/pong buffers
- set_flag/wait_flag: 1 MTE2 load, 8 Vector slices — must peel set/wait outside loop
- get_buf/rls_buf: same pattern but acquire/release can stay inside or outside
- Acquire/release per slice inside loop
- Iteration 0 blocks until MTE2 done, iterations 1-7 proceed immediately
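The "iteration 0 blocks, the rest proceed" behavior described above can be sketched with a toy Python model. Note this is not the real Ascend API: `get_buf`/`rls_buf` and the MTE2/Vector roles are stand-ins, modeled here with a simple event.

```python
# Toy model of the commit's claim: one producer load (MTE2) guards
# 8 consumer slices (Vector); only the first acquire can block.
import threading

class Buffer:
    def __init__(self):
        self._ready = threading.Event()

    def mark_written(self):
        # Producer side: models the MTE2 load completing.
        self._ready.set()

    def get_buf(self):
        # Consumer acquire: blocks until the data is ready.
        # After the first wakeup the event stays set, so subsequent
        # iterations return immediately.
        self._ready.wait()

    def rls_buf(self):
        # Release is a no-op in this single-load model.
        pass

buf = Buffer()
order = []

def vector_pipeline():
    for i in range(8):      # 8 slices consume one load
        buf.get_buf()       # blocks only on iteration 0
        order.append(i)
        buf.rls_buf()

t = threading.Thread(target=vector_pipeline)
t.start()
buf.mark_written()          # MTE2 load completes
t.join()
print(order)                # [0, 1, 2, 3, 4, 5, 6, 7]
```

With set_flag/wait_flag the equivalent pattern would need the wait peeled out of the loop; here the acquire can sit inside the loop because iterations 1-7 see the event already set.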
Add the merged v0.3 PTO micro-instruction release spec document for A5, including ISA group references and updated synchronization notes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Code Review
This pull request introduces the VPTO backend for the PTO compiler, including new IR definitions, lowering passes, and an LLVM emission helper. My review identified a critical bug in PTOVPTOExpandBridgeOps.cpp where an incorrect TypeRange is passed to pto::VldusOp, a security risk in pto.py due to the use of eval(), and several architectural concerns regarding the robustness of external tool dependencies and IR validation logic.
```cpp
auto load = rewriter.create<pto::VldusOp>(
    op.getLoc(), TypeRange{vecType, alignType, loadPtr.getType()},
    ValueRange{loadPtr, align});
```
The TypeRange provided to rewriter.create<pto::VldusOp> contains 3 types, but the TableGen definition for VldusOp in VPTOOps.td only specifies 2 results (result and updated_align). Providing an incorrect number of result types will cause a verification failure or crash at runtime. The third type loadPtr.getType() should be removed to align with the operation's definition.
Suggested change:

```diff
 auto load = rewriter.create<pto::VldusOp>(
-    op.getLoc(), TypeRange{vecType, alignType, loadPtr.getType()},
+    op.getLoc(), TypeRange{vecType, alignType},
     ValueRange{loadPtr, align});
```
```python
value = eval(compile(expr, self.py_fn.__code__.co_filename, "eval"),
             globals_dict, {})
```
Using eval() on code derived from user-provided function annotations is a security risk. While this is a DSL for kernel authoring, it allows arbitrary code execution if the input source is not fully trusted. Consider using safer alternatives like typing.get_type_hints or a restricted AST evaluator that only allows specific type-related expressions.
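The restricted-evaluator alternative can be sketched as below. This is an illustrative Python sketch, not code from pto.py: the names `SAFE_GLOBALS` and `safe_eval_annotation` are made up, and a real DSL would whitelist its own type constructors.

```python
import ast

# Hypothetical whitelist of names an annotation may reference.
SAFE_GLOBALS = {"int": int, "float": float, "list": list, "tuple": tuple}

def safe_eval_annotation(expr: str, env: dict):
    """Evaluate a type-annotation expression while rejecting arbitrary code."""
    tree = ast.parse(expr, mode="eval")
    for node in ast.walk(tree):
        # Reject anything that can execute code: calls, lambdas, comprehensions.
        if isinstance(node, (ast.Call, ast.Lambda, ast.ListComp, ast.DictComp,
                             ast.SetComp, ast.GeneratorExp)):
            raise ValueError(f"disallowed construct: {type(node).__name__}")
        # Only whitelisted or caller-supplied names may be referenced.
        if isinstance(node, ast.Name) and node.id not in SAFE_GLOBALS \
                and node.id not in env:
            raise ValueError(f"unknown name: {node.id}")
    return eval(compile(tree, "<annotation>", "eval"),
                {"__builtins__": {}, **SAFE_GLOBALS}, dict(env))

print(safe_eval_annotation("tuple", {}))   # <class 'tuple'>
try:
    safe_eval_annotation("__import__('os').system('ls')", {})
except ValueError as err:
    print("rejected:", err)                # rejected: disallowed construct: Call
```

`typing.get_type_hints` is the simpler option when annotations are genuine types; an AST walk like this one is the fallback when the DSL needs expressions beyond what that function accepts.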
```cpp
queryDefaultTargetAttrs(const VPTOEmissionOptions &options,
                        llvm::raw_ostream &diagOS) {
  static llvm::StringMap<QueriedTargetAttrs> cache;

  if (options.targetTriple.empty() || options.march.empty() ||
      options.aicoreArch.empty()) {
    diagOS << "VPTO LLVM emission failed: missing target query options\n";
    return failure();
  }

  std::string cacheKey =
      options.targetTriple + "|" + options.march + "|" + options.aicoreArch;
  if (auto it = cache.find(cacheKey); it != cache.end())
    return it->second;

  auto bisheng = llvm::sys::findProgramByName("bisheng");
  if (!bisheng) {
    diagOS << "VPTO LLVM emission failed: unable to find 'bisheng' in PATH\n";
    return failure();
  }
  const std::string &bishengPath = *bisheng;

  llvm::SmallString<64> inputPath;
  llvm::SmallString<64> outputPath;
  int inputFD = -1;
  int outputFD = -1;
  if (auto ec = llvm::sys::fs::createTemporaryFile("ptoas-vpto-target-query",
                                                   "c", inputFD, inputPath)) {
    diagOS << "VPTO LLVM emission failed: cannot create bisheng query input: "
           << ec.message() << "\n";
    return failure();
  }
  if (auto ec = llvm::sys::fs::createTemporaryFile("ptoas-vpto-target-query",
                                                   "ll", outputFD, outputPath)) {
    llvm::sys::fs::remove(inputPath);
    llvm::sys::Process::SafelyCloseFileDescriptor(inputFD);
    diagOS << "VPTO LLVM emission failed: cannot create bisheng query output: "
           << ec.message() << "\n";
    return failure();
  }

  auto cleanup = llvm::make_scope_exit([&]() {
    llvm::sys::fs::remove(inputPath);
    llvm::sys::fs::remove(outputPath);
  });

  {
    llvm::raw_fd_ostream inputOS(inputFD, /*shouldClose=*/false);
    inputOS << "void f(void) {}\n";
  }
  llvm::sys::Process::SafelyCloseFileDescriptor(inputFD);
  llvm::sys::Process::SafelyCloseFileDescriptor(outputFD);

  llvm::SmallString<128> stderrPath;
  int stderrFD = -1;
  if (auto ec = llvm::sys::fs::createTemporaryFile("ptoas-vpto-target-query",
                                                   "stderr", stderrFD,
                                                   stderrPath)) {
    diagOS << "VPTO LLVM emission failed: cannot create bisheng query stderr: "
           << ec.message() << "\n";
    return failure();
  }
  auto stderrCleanup = llvm::make_scope_exit([&]() {
    llvm::sys::fs::remove(stderrPath);
  });
  llvm::sys::Process::SafelyCloseFileDescriptor(stderrFD);

  llvm::SmallVector<std::string> argStorage = {
      bishengPath,
      ("--target=" + options.targetTriple),
      ("-march=" + options.march),
      ("--cce-aicore-arch=" + options.aicoreArch),
      "--cce-aicore-only",
      "-x",
      "c",
      inputPath.str().str(),
      "-S",
      "-emit-llvm",
      "-o",
      outputPath.str().str(),
  };
  llvm::SmallVector<llvm::StringRef> args;
  args.reserve(argStorage.size());
  for (const std::string &arg : argStorage)
    args.push_back(arg);

  std::string execErr;
  bool execFailed = false;
  int rc = llvm::sys::ExecuteAndWait(
      bishengPath, args, std::nullopt,
      {std::nullopt, std::nullopt, llvm::StringRef(stderrPath)}, 0, 0,
      &execErr, &execFailed);

  auto stderrBuffer = llvm::MemoryBuffer::getFile(stderrPath);
  llvm::StringRef stderrText =
      stderrBuffer ? stderrBuffer.get()->getBuffer() : llvm::StringRef();

  if (execFailed || rc != 0) {
    diagOS << "VPTO LLVM emission failed: bisheng target query failed\n";
    diagOS << "Command:";
    for (llvm::StringRef arg : args)
      diagOS << " " << arg;
    diagOS << "\n";
    if (!execErr.empty())
      diagOS << execErr << "\n";
    if (!stderrText.empty())
      diagOS << stderrText << "\n";
    return failure();
  }

  auto outputBuffer = llvm::MemoryBuffer::getFile(outputPath);
  if (!outputBuffer) {
    diagOS << "VPTO LLVM emission failed: cannot read bisheng query output\n";
    return failure();
  }

  FailureOr<std::string> targetCPU =
      extractQuotedLLVMFnAttr(outputBuffer.get()->getBuffer(), "target-cpu");
  FailureOr<std::string> targetFeatures =
      extractQuotedLLVMFnAttr(outputBuffer.get()->getBuffer(), "target-features");
  if (failed(targetCPU) || failed(targetFeatures)) {
    diagOS << "VPTO LLVM emission failed: cannot parse bisheng target attrs\n";
    diagOS << outputBuffer.get()->getBuffer() << "\n";
    return failure();
  }

  QueriedTargetAttrs attrs{*targetCPU, *targetFeatures};
  cache[cacheKey] = attrs;
  return attrs;
}
```
The queryDefaultTargetAttrs function executes an external binary (bisheng) and parses its output to derive target attributes. This approach is fragile as it depends on the environment's PATH, the presence of the CANN toolkit, and the stability of the compiler's output format. It also introduces significant overhead during the emission phase. A more robust approach would be to pass these target attributes explicitly via tool flags or a configuration file.
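The fragility of the parsing step is easy to see in a small model. This Python analog of `extractQuotedLLVMFnAttr` is a guess at its behavior (the actual C++ implementation is not shown in this PR): it scrapes `attr="value"` pairs out of textual LLVM IR, so any change to how the compiler prints function attributes silently breaks the query.

```python
import re

def extract_quoted_fn_attr(ir_text: str, attr: str):
    # Find the first attr="value" pair in textual LLVM IR, e.g.
    # "target-cpu"="...". Returns None when the attribute is absent.
    m = re.search(r'"%s"="([^"]*)"' % re.escape(attr), ir_text)
    return m.group(1) if m else None

# A typical attribute-group line from emitted .ll output (values made up):
ir = 'attributes #0 = { noinline "target-cpu"="example-cpu" "target-features"="+f1,+f2" }'
print(extract_quoted_fn_attr(ir, "target-cpu"))       # example-cpu
print(extract_quoted_fn_attr(ir, "target-features"))  # +f1,+f2
print(extract_quoted_fn_attr(ir, "missing-attr"))     # None
```

Passing the CPU and feature strings in via tool flags, as the comment suggests, removes both the subprocess launch and this format dependency.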
```cpp
static LogicalResult eraseDeadVPTOMemRefScaffold(ModuleOp module) {
  bool erasedAny = true;
  while (erasedAny) {
    erasedAny = false;
    SmallVector<pto::CastPtrOp> trivialCasts;
    SmallVector<Operation *> deadOps;
    module.walk([&](Operation *op) {
      if (auto castOp = dyn_cast<pto::CastPtrOp>(op)) {
        if (isTrivialVPTOBoundaryCastPtr(castOp)) {
          trivialCasts.push_back(castOp);
          return;
        }
        if (castOp->use_empty())
          deadOps.push_back(op);
        return;
      }

      if (!op->use_empty())
        return;
      if (isa<pto::PointerCastOp, pto::BindTileOp, memref::ReinterpretCastOp,
              memref::SubViewOp, memref::MemorySpaceCastOp>(op))
        deadOps.push_back(op);
    });

    for (pto::CastPtrOp castOp : trivialCasts) {
      if (!castOp->getBlock())
        continue;
      castOp.getResult().replaceAllUsesWith(castOp.getInput());
      castOp.erase();
      erasedAny = true;
    }

    for (Operation *op : deadOps) {
      if (!op->getBlock())
        continue;
      op->erase();
      erasedAny = true;
    }
  }
  return success();
}
```
The eraseDeadVPTOMemRefScaffold function uses a manual while loop for fixed-point iteration to erase dead operations. This can be inefficient for large modules. MLIR provides the GreedyPatternRewriteDriver which is designed to handle such canonicalizations and dead code elimination more efficiently and idiomatically.
```cpp
                              StringRef vectorRole) {
  auto actual = VPTOLegalityHelper::getMaskGranularity(maskType);
  auto expected = VPTOLegalityHelper::inferMaskGranularityFromType(vectorType);
  if (!actual || !expected || *actual == *expected)
```
The validateMaskMatchesVectorFamily function returns success() if expected is nullopt. However, inferMaskGranularityFromType returns nullopt for 64-bit element types (e.g., i64). This means that predicated operations on 64-bit vectors are silently skipped during mask granularity validation, which could lead to invalid IR passing through this stage.
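The gap is easy to demonstrate with a toy model of the two helpers. This is an illustrative Python sketch, not the real `VPTOLegalityHelper` logic, and the granularity values in the table are made up: the point is only the early-out when the inferred granularity is unknown.

```python
def infer_mask_granularity_from_type(elem_bits):
    # Mirrors the reported behavior: 64-bit element types have no
    # mapping, so the helper returns None (nullopt in the C++).
    return {8: 1, 16: 2, 32: 4}.get(elem_bits)  # values are illustrative

def validate_mask_matches_vector_family(actual_granularity, elem_bits):
    expected = infer_mask_granularity_from_type(elem_bits)
    # Reviewed early-out: unknown on either side counts as success,
    # which is exactly the hole for i64 vectors.
    if actual_granularity is None or expected is None \
            or actual_granularity == expected:
        return True   # success()
    return False      # failure

print(validate_mask_matches_vector_family(1, 32))  # False: mismatch caught
print(validate_mask_matches_vector_family(1, 64))  # True: i64 silently passes
```

Making the 64-bit case either an explicit error or an explicit entry in the granularity table would close the hole.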
No description provided.