Conversation
Explain block/subblock runtime queries in workload-partitioning terms and remove redundant supported-forms wording from conversion ops docs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add detailed mode parameter documentation (mode=0 vs mode=1)
- Add 'Why get_buf/rls_buf is More Programmer-Friendly' section:
  - No manual priming/draining for ping/pong loops
  - No loop peeling for complex/nested loop dependencies
  - Simpler mental model (buffer ID + program order)
- Add quick example comparison showing set_flag overhead vs get_buf simplicity
- Update Example 2 and 3b with explicit mode=0 in code
- Update comparison table with 'Loop peeling' row
- set_flag/wait_flag: 2 IDs per buffer (1 forward + 1 reverse pipe-pair)
- get_buf/rls_buf: 1 ID per buffer (handles both directions automatically)
- 8 per pipe-pair is HW limit, not a formula
- set_flag/wait_flag: 8 IDs per pipe-pair direction (HW limit)
- get_buf/rls_buf: 1 buffer ID per shared resource (HW limit: 32 global), same ID used across all pipelines
- Event ID mgmt: each buffer occupies 1 ID per direction (removed misleading 4 IDs calc)
- Drain example: use concrete EVT_*_0/EVT_*_1 instead of {(N-1)%2} expressions
- 4 set_flag + 4 wait_flag (not 8)
- 4 IDs = 2 pipe-pair directions × 2 ping/pong buffers
- set_flag/wait_flag: 1 MTE2 load, 8 Vector slices — must peel set/wait outside loop
- get_buf/rls_buf: same pattern but acquire/release can stay inside or outside
- Acquire/release per slice inside loop
- Iteration 0 blocks until MTE2 done, iterations 1-7 proceed immediately
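The "iteration 0 blocks, the rest proceed" behavior described above can be sketched with a toy Python model. Note this is not the real Ascend API: `get_buf`/`rls_buf` and the MTE2/Vector roles are stand-ins, modeled here with a simple event.

```python
# Toy model of the commit's claim: one producer load (MTE2) guards
# 8 consumer slices (Vector); only the first acquire can block.
import threading

class Buffer:
    def __init__(self):
        self._ready = threading.Event()

    def mark_written(self):
        # Producer side: models the MTE2 load completing.
        self._ready.set()

    def get_buf(self):
        # Consumer acquire: blocks until the data is ready.
        # After the first wakeup the event stays set, so subsequent
        # iterations return immediately.
        self._ready.wait()

    def rls_buf(self):
        # Release is a no-op in this single-load model.
        pass

buf = Buffer()
order = []

def vector_pipeline():
    for i in range(8):      # 8 slices consume one load
        buf.get_buf()       # blocks only on iteration 0
        order.append(i)
        buf.rls_buf()

t = threading.Thread(target=vector_pipeline)
t.start()
buf.mark_written()          # MTE2 load completes
t.join()
print(order)                # [0, 1, 2, 3, 4, 5, 6, 7]
```

With set_flag/wait_flag the equivalent pattern would need the wait peeled out of the loop; here the acquire can sit inside the loop because iterations 1-7 see the event already set.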
Add the merged v0.3 PTO micro-instruction release spec document for A5, including ISA group references and updated synchronization notes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Code Review
This pull request introduces the VPTO backend for the PTO compiler, including new IR definitions, lowering passes, and an LLVM emission helper. My review identified a critical bug in PTOVPTOExpandBridgeOps.cpp where an incorrect TypeRange is passed to pto::VldusOp, a security risk in pto.py due to the use of eval(), and several architectural concerns regarding the robustness of external tool dependencies and IR validation logic.
```cpp
auto load = rewriter.create<pto::VldusOp>(
    op.getLoc(), TypeRange{vecType, alignType, loadPtr.getType()},
    ValueRange{loadPtr, align});
```
The TypeRange provided to rewriter.create<pto::VldusOp> contains 3 types, but the TableGen definition for VldusOp in VPTOOps.td only specifies 2 results (result and updated_align). Providing an incorrect number of result types will cause a verification failure or crash at runtime. The third type loadPtr.getType() should be removed to align with the operation's definition.
Suggested change:

```diff
 auto load = rewriter.create<pto::VldusOp>(
-    op.getLoc(), TypeRange{vecType, alignType, loadPtr.getType()},
+    op.getLoc(), TypeRange{vecType, alignType},
     ValueRange{loadPtr, align});
```
```python
value = eval(compile(expr, self.py_fn.__code__.co_filename, "eval"),
             globals_dict, {})
```
Using eval() on code derived from user-provided function annotations is a security risk. While this is a DSL for kernel authoring, it allows arbitrary code execution if the input source is not fully trusted. Consider using safer alternatives like typing.get_type_hints or a restricted AST evaluator that only allows specific type-related expressions.
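The restricted-evaluator alternative can be sketched as below. This is an illustrative Python sketch, not code from pto.py: the names `SAFE_GLOBALS` and `safe_eval_annotation` are made up, and a real DSL would whitelist its own type constructors.

```python
import ast

# Hypothetical whitelist of names an annotation may reference.
SAFE_GLOBALS = {"int": int, "float": float, "list": list, "tuple": tuple}

def safe_eval_annotation(expr: str, env: dict):
    """Evaluate a type-annotation expression while rejecting arbitrary code."""
    tree = ast.parse(expr, mode="eval")
    for node in ast.walk(tree):
        # Reject anything that can execute code: calls, lambdas, comprehensions.
        if isinstance(node, (ast.Call, ast.Lambda, ast.ListComp, ast.DictComp,
                             ast.SetComp, ast.GeneratorExp)):
            raise ValueError(f"disallowed construct: {type(node).__name__}")
        # Only whitelisted or caller-supplied names may be referenced.
        if isinstance(node, ast.Name) and node.id not in SAFE_GLOBALS \
                and node.id not in env:
            raise ValueError(f"unknown name: {node.id}")
    return eval(compile(tree, "<annotation>", "eval"),
                {"__builtins__": {}, **SAFE_GLOBALS}, dict(env))

print(safe_eval_annotation("tuple", {}))   # <class 'tuple'>
try:
    safe_eval_annotation("__import__('os').system('ls')", {})
except ValueError as err:
    print("rejected:", err)                # rejected: disallowed construct: Call
```

`typing.get_type_hints` is the simpler option when annotations are genuine types; an AST walk like this one is the fallback when the DSL needs expressions beyond what that function accepts.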
```cpp
queryDefaultTargetAttrs(const VPTOEmissionOptions &options,
                        llvm::raw_ostream &diagOS) {
  static llvm::StringMap<QueriedTargetAttrs> cache;

  if (options.targetTriple.empty() || options.march.empty() ||
      options.aicoreArch.empty()) {
    diagOS << "VPTO LLVM emission failed: missing target query options\n";
    return failure();
  }

  std::string cacheKey =
      options.targetTriple + "|" + options.march + "|" + options.aicoreArch;
  if (auto it = cache.find(cacheKey); it != cache.end())
    return it->second;

  auto bisheng = llvm::sys::findProgramByName("bisheng");
  if (!bisheng) {
    diagOS << "VPTO LLVM emission failed: unable to find 'bisheng' in PATH\n";
    return failure();
  }
  const std::string &bishengPath = *bisheng;

  llvm::SmallString<64> inputPath;
  llvm::SmallString<64> outputPath;
  int inputFD = -1;
  int outputFD = -1;
  if (auto ec = llvm::sys::fs::createTemporaryFile("ptoas-vpto-target-query",
                                                   "c", inputFD, inputPath)) {
    diagOS << "VPTO LLVM emission failed: cannot create bisheng query input: "
           << ec.message() << "\n";
    return failure();
  }
  if (auto ec = llvm::sys::fs::createTemporaryFile("ptoas-vpto-target-query",
                                                   "ll", outputFD, outputPath)) {
    llvm::sys::fs::remove(inputPath);
    llvm::sys::Process::SafelyCloseFileDescriptor(inputFD);
    diagOS << "VPTO LLVM emission failed: cannot create bisheng query output: "
           << ec.message() << "\n";
    return failure();
  }

  auto cleanup = llvm::make_scope_exit([&]() {
    llvm::sys::fs::remove(inputPath);
    llvm::sys::fs::remove(outputPath);
  });

  {
    llvm::raw_fd_ostream inputOS(inputFD, /*shouldClose=*/false);
    inputOS << "void f(void) {}\n";
  }
  llvm::sys::Process::SafelyCloseFileDescriptor(inputFD);
  llvm::sys::Process::SafelyCloseFileDescriptor(outputFD);

  llvm::SmallString<128> stderrPath;
  int stderrFD = -1;
  if (auto ec = llvm::sys::fs::createTemporaryFile("ptoas-vpto-target-query",
                                                   "stderr", stderrFD,
                                                   stderrPath)) {
    diagOS << "VPTO LLVM emission failed: cannot create bisheng query stderr: "
           << ec.message() << "\n";
    return failure();
  }
  auto stderrCleanup = llvm::make_scope_exit([&]() {
    llvm::sys::fs::remove(stderrPath);
  });
  llvm::sys::Process::SafelyCloseFileDescriptor(stderrFD);

  llvm::SmallVector<std::string> argStorage = {
      bishengPath,
      ("--target=" + options.targetTriple),
      ("-march=" + options.march),
      ("--cce-aicore-arch=" + options.aicoreArch),
      "--cce-aicore-only",
      "-x",
      "c",
      inputPath.str().str(),
      "-S",
      "-emit-llvm",
      "-o",
      outputPath.str().str(),
  };
  llvm::SmallVector<llvm::StringRef> args;
  args.reserve(argStorage.size());
  for (const std::string &arg : argStorage)
    args.push_back(arg);

  std::string execErr;
  bool execFailed = false;
  int rc = llvm::sys::ExecuteAndWait(
      bishengPath, args, std::nullopt,
      {std::nullopt, std::nullopt, llvm::StringRef(stderrPath)}, 0, 0,
      &execErr, &execFailed);

  auto stderrBuffer = llvm::MemoryBuffer::getFile(stderrPath);
  llvm::StringRef stderrText =
      stderrBuffer ? stderrBuffer.get()->getBuffer() : llvm::StringRef();

  if (execFailed || rc != 0) {
    diagOS << "VPTO LLVM emission failed: bisheng target query failed\n";
    diagOS << "Command:";
    for (llvm::StringRef arg : args)
      diagOS << " " << arg;
    diagOS << "\n";
    if (!execErr.empty())
      diagOS << execErr << "\n";
    if (!stderrText.empty())
      diagOS << stderrText << "\n";
    return failure();
  }

  auto outputBuffer = llvm::MemoryBuffer::getFile(outputPath);
  if (!outputBuffer) {
    diagOS << "VPTO LLVM emission failed: cannot read bisheng query output\n";
    return failure();
  }

  FailureOr<std::string> targetCPU =
      extractQuotedLLVMFnAttr(outputBuffer.get()->getBuffer(), "target-cpu");
  FailureOr<std::string> targetFeatures =
      extractQuotedLLVMFnAttr(outputBuffer.get()->getBuffer(), "target-features");
  if (failed(targetCPU) || failed(targetFeatures)) {
    diagOS << "VPTO LLVM emission failed: cannot parse bisheng target attrs\n";
    diagOS << outputBuffer.get()->getBuffer() << "\n";
    return failure();
  }

  QueriedTargetAttrs attrs{*targetCPU, *targetFeatures};
  cache[cacheKey] = attrs;
  return attrs;
}
```
The queryDefaultTargetAttrs function executes an external binary (bisheng) and parses its output to derive target attributes. This approach is fragile as it depends on the environment's PATH, the presence of the CANN toolkit, and the stability of the compiler's output format. It also introduces significant overhead during the emission phase. A more robust approach would be to pass these target attributes explicitly via tool flags or a configuration file.
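The fragility of the parsing step is easy to see in a small model. This Python analog of `extractQuotedLLVMFnAttr` is a guess at its behavior (the actual C++ implementation is not shown in this PR): it scrapes `attr="value"` pairs out of textual LLVM IR, so any change to how the compiler prints function attributes silently breaks the query.

```python
import re

def extract_quoted_fn_attr(ir_text: str, attr: str):
    # Find the first attr="value" pair in textual LLVM IR, e.g.
    # "target-cpu"="...". Returns None when the attribute is absent.
    m = re.search(r'"%s"="([^"]*)"' % re.escape(attr), ir_text)
    return m.group(1) if m else None

# A typical attribute-group line from emitted .ll output (values made up):
ir = 'attributes #0 = { noinline "target-cpu"="example-cpu" "target-features"="+f1,+f2" }'
print(extract_quoted_fn_attr(ir, "target-cpu"))       # example-cpu
print(extract_quoted_fn_attr(ir, "target-features"))  # +f1,+f2
print(extract_quoted_fn_attr(ir, "missing-attr"))     # None
```

Passing the CPU and feature strings in via tool flags, as the comment suggests, removes both the subprocess launch and this format dependency.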
```cpp
static LogicalResult eraseDeadVPTOMemRefScaffold(ModuleOp module) {
  bool erasedAny = true;
  while (erasedAny) {
    erasedAny = false;
    SmallVector<pto::CastPtrOp> trivialCasts;
    SmallVector<Operation *> deadOps;
    module.walk([&](Operation *op) {
      if (auto castOp = dyn_cast<pto::CastPtrOp>(op)) {
        if (isTrivialVPTOBoundaryCastPtr(castOp)) {
          trivialCasts.push_back(castOp);
          return;
        }
        if (castOp->use_empty())
          deadOps.push_back(op);
        return;
      }

      if (!op->use_empty())
        return;
      if (isa<pto::PointerCastOp, pto::BindTileOp, memref::ReinterpretCastOp,
              memref::SubViewOp, memref::MemorySpaceCastOp>(op))
        deadOps.push_back(op);
    });

    for (pto::CastPtrOp castOp : trivialCasts) {
      if (!castOp->getBlock())
        continue;
      castOp.getResult().replaceAllUsesWith(castOp.getInput());
      castOp.erase();
      erasedAny = true;
    }

    for (Operation *op : deadOps) {
      if (!op->getBlock())
        continue;
      op->erase();
      erasedAny = true;
    }
  }
  return success();
}
```
The eraseDeadVPTOMemRefScaffold function uses a manual while loop for fixed-point iteration to erase dead operations. This can be inefficient for large modules. MLIR provides the GreedyPatternRewriteDriver which is designed to handle such canonicalizations and dead code elimination more efficiently and idiomatically.
```cpp
                              StringRef vectorRole) {
  auto actual = VPTOLegalityHelper::getMaskGranularity(maskType);
  auto expected = VPTOLegalityHelper::inferMaskGranularityFromType(vectorType);
  if (!actual || !expected || *actual == *expected)
```
The validateMaskMatchesVectorFamily function returns success() if expected is nullopt. However, inferMaskGranularityFromType returns nullopt for 64-bit element types (e.g., i64). This means that predicated operations on 64-bit vectors are silently skipped during mask granularity validation, which could lead to invalid IR passing through this stage.
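The gap is easy to demonstrate with a toy model of the two helpers. This is an illustrative Python sketch, not the real `VPTOLegalityHelper` logic, and the granularity values in the table are made up: the point is only the early-out when the inferred granularity is unknown.

```python
def infer_mask_granularity_from_type(elem_bits):
    # Mirrors the reported behavior: 64-bit element types have no
    # mapping, so the helper returns None (nullopt in the C++).
    return {8: 1, 16: 2, 32: 4}.get(elem_bits)  # values are illustrative

def validate_mask_matches_vector_family(actual_granularity, elem_bits):
    expected = infer_mask_granularity_from_type(elem_bits)
    # Reviewed early-out: unknown on either side counts as success,
    # which is exactly the hole for i64 vectors.
    if actual_granularity is None or expected is None \
            or actual_granularity == expected:
        return True   # success()
    return False      # failure

print(validate_mask_matches_vector_family(1, 32))  # False: mismatch caught
print(validate_mask_matches_vector_family(1, 64))  # True: i64 silently passes
```

Making the 64-bit case either an explicit error or an explicit entry in the granularity table would close the hole.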
No description provided.