Skip to content

Feature vpto#471

Open
mouliangyu wants to merge 14 commits intohw-native-sys:mainfrom
mouliangyu:feature-vpto
Open

Feature vpto#471
mouliangyu wants to merge 14 commits intohw-native-sys:mainfrom
mouliangyu:feature-vpto

Conversation

@mouliangyu
Copy link
Copy Markdown

No description provided.

mouliangyu and others added 14 commits April 12, 2026 22:19
Explain block/subblock runtime queries in workload-partitioning terms and remove redundant supported-forms wording from conversion ops docs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add detailed mode parameter documentation (mode=0 vs mode=1)
- Add 'Why get_buf/rls_buf is More Programmer-Friendly' section:
  - No manual priming/draining for ping/pong loops
  - No loop peeling for complex/nested loop dependencies
  - Simpler mental model (buffer ID + program order)
- Add quick example comparison showing set_flag overhead vs get_buf simplicity
- Update Example 2 and 3b with explicit mode=0 in code
- Update comparison table with 'Loop peeling' row
- set_flag/wait_flag: 2 IDs per buffer (1 forward + 1 reverse pipe-pair)
- get_buf/rls_buf: 1 ID per buffer (handles both directions automatically)
- 8 per pipe-pair is HW limit, not a formula
- set_flag/wait_flag: 8 IDs per pipe-pair direction (HW limit)
- get_buf/rls_buf: 1 buffer ID per shared resource (HW limit: 32 global), same ID used across all pipelines
- Event ID mgmt: each buffer occupies 1 ID per direction (removed misleading 4 IDs calc)
- Drain example: use concrete EVT_*_0/EVT_*_1 instead of {(N-1)%2} expressions
- 4 set_flag + 4 wait_flag (not 8)
- 4 IDs = 2 pipe-pair directions × 2 ping/pong buffers
- set_flag/wait_flag: 1 MTE2 load, 8 Vector slices — must peel set/wait outside loop
- get_buf/rls_buf: same pattern but acquire/release can stay inside or outside
- Acquire/release per slice inside loop
- Iteration 0 blocks until MTE2 done, iterations 1-7 proceed immediately
Add the merged v0.3 PTO micro-instruction release spec document for A5,
including ISA group references and updated synchronization notes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the VPTO backend for the PTO compiler, including new IR definitions, lowering passes, and an LLVM emission helper. My review identified a critical bug in PTOVPTOExpandBridgeOps.cpp where an incorrect TypeRange is passed to pto::VldusOp, a security risk in pto.py due to the use of eval(), and several architectural concerns regarding the robustness of external tool dependencies and IR validation logic.

Comment on lines +85 to +87
auto load = rewriter.create<pto::VldusOp>(
op.getLoc(), TypeRange{vecType, alignType, loadPtr.getType()},
ValueRange{loadPtr, align});
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The TypeRange provided to rewriter.create<pto::VldusOp> contains 3 types, but the TableGen definition for VldusOp in VPTOOps.td only specifies 2 results (result and updated_align). Providing an incorrect number of result types will cause a verification failure or crash at runtime. The third type loadPtr.getType() should be removed to align with the operation's definition.

Suggested change
auto load = rewriter.create<pto::VldusOp>(
op.getLoc(), TypeRange{vecType, alignType, loadPtr.getType()},
ValueRange{loadPtr, align});
auto load = rewriter.create<pto::VldusOp>(
op.getLoc(), TypeRange{vecType, alignType},
ValueRange{loadPtr, align});

Comment on lines +749 to +750
value = eval(compile(expr, self.py_fn.__code__.co_filename, "eval"),
globals_dict, {})
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-high high

Using eval() on code derived from user-provided function annotations is a security risk. While this is a DSL for kernel authoring, it allows arbitrary code execution if the input source is not fully trusted. Consider using safer alternatives like typing.get_type_hints or a restricted AST evaluator that only allows specific type-related expressions.

Comment on lines +240 to +370
queryDefaultTargetAttrs(const VPTOEmissionOptions &options,
llvm::raw_ostream &diagOS) {
static llvm::StringMap<QueriedTargetAttrs> cache;

if (options.targetTriple.empty() || options.march.empty() ||
options.aicoreArch.empty()) {
diagOS << "VPTO LLVM emission failed: missing target query options\n";
return failure();
}

std::string cacheKey =
options.targetTriple + "|" + options.march + "|" + options.aicoreArch;
if (auto it = cache.find(cacheKey); it != cache.end())
return it->second;

auto bisheng = llvm::sys::findProgramByName("bisheng");
if (!bisheng) {
diagOS << "VPTO LLVM emission failed: unable to find 'bisheng' in PATH\n";
return failure();
}
const std::string &bishengPath = *bisheng;

llvm::SmallString<64> inputPath;
llvm::SmallString<64> outputPath;
int inputFD = -1;
int outputFD = -1;
if (auto ec = llvm::sys::fs::createTemporaryFile("ptoas-vpto-target-query",
"c", inputFD, inputPath)) {
diagOS << "VPTO LLVM emission failed: cannot create bisheng query input: "
<< ec.message() << "\n";
return failure();
}
if (auto ec = llvm::sys::fs::createTemporaryFile("ptoas-vpto-target-query",
"ll", outputFD, outputPath)) {
llvm::sys::fs::remove(inputPath);
llvm::sys::Process::SafelyCloseFileDescriptor(inputFD);
diagOS << "VPTO LLVM emission failed: cannot create bisheng query output: "
<< ec.message() << "\n";
return failure();
}

auto cleanup = llvm::make_scope_exit([&]() {
llvm::sys::fs::remove(inputPath);
llvm::sys::fs::remove(outputPath);
});

{
llvm::raw_fd_ostream inputOS(inputFD, /*shouldClose=*/false);
inputOS << "void f(void) {}\n";
}
llvm::sys::Process::SafelyCloseFileDescriptor(inputFD);
llvm::sys::Process::SafelyCloseFileDescriptor(outputFD);

llvm::SmallString<128> stderrPath;
int stderrFD = -1;
if (auto ec = llvm::sys::fs::createTemporaryFile("ptoas-vpto-target-query",
"stderr", stderrFD,
stderrPath)) {
diagOS << "VPTO LLVM emission failed: cannot create bisheng query stderr: "
<< ec.message() << "\n";
return failure();
}
auto stderrCleanup = llvm::make_scope_exit([&]() {
llvm::sys::fs::remove(stderrPath);
});
llvm::sys::Process::SafelyCloseFileDescriptor(stderrFD);

llvm::SmallVector<std::string> argStorage = {
bishengPath,
("--target=" + options.targetTriple),
("-march=" + options.march),
("--cce-aicore-arch=" + options.aicoreArch),
"--cce-aicore-only",
"-x",
"c",
inputPath.str().str(),
"-S",
"-emit-llvm",
"-o",
outputPath.str().str(),
};
llvm::SmallVector<llvm::StringRef> args;
args.reserve(argStorage.size());
for (const std::string &arg : argStorage)
args.push_back(arg);

std::string execErr;
bool execFailed = false;
int rc = llvm::sys::ExecuteAndWait(
bishengPath, args, std::nullopt,
{std::nullopt, std::nullopt, llvm::StringRef(stderrPath)}, 0, 0,
&execErr, &execFailed);

auto stderrBuffer = llvm::MemoryBuffer::getFile(stderrPath);
llvm::StringRef stderrText =
stderrBuffer ? stderrBuffer.get()->getBuffer() : llvm::StringRef();

if (execFailed || rc != 0) {
diagOS << "VPTO LLVM emission failed: bisheng target query failed\n";
diagOS << "Command:";
for (llvm::StringRef arg : args)
diagOS << " " << arg;
diagOS << "\n";
if (!execErr.empty())
diagOS << execErr << "\n";
if (!stderrText.empty())
diagOS << stderrText << "\n";
return failure();
}

auto outputBuffer = llvm::MemoryBuffer::getFile(outputPath);
if (!outputBuffer) {
diagOS << "VPTO LLVM emission failed: cannot read bisheng query output\n";
return failure();
}

FailureOr<std::string> targetCPU =
extractQuotedLLVMFnAttr(outputBuffer.get()->getBuffer(), "target-cpu");
FailureOr<std::string> targetFeatures =
extractQuotedLLVMFnAttr(outputBuffer.get()->getBuffer(), "target-features");
if (failed(targetCPU) || failed(targetFeatures)) {
diagOS << "VPTO LLVM emission failed: cannot parse bisheng target attrs\n";
diagOS << outputBuffer.get()->getBuffer() << "\n";
return failure();
}

QueriedTargetAttrs attrs{*targetCPU, *targetFeatures};
cache[cacheKey] = attrs;
return attrs;
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The queryDefaultTargetAttrs function executes an external binary (bisheng) and parses its output to derive target attributes. This approach is fragile as it depends on the environment's PATH, the presence of the CANN toolkit, and the stability of the compiler's output format. It also introduces significant overhead during the emission phase. A more robust approach would be to pass these target attributes explicitly via tool flags or a configuration file.

Comment on lines +39 to +79
static LogicalResult eraseDeadVPTOMemRefScaffold(ModuleOp module) {
bool erasedAny = true;
while (erasedAny) {
erasedAny = false;
SmallVector<pto::CastPtrOp> trivialCasts;
SmallVector<Operation *> deadOps;
module.walk([&](Operation *op) {
if (auto castOp = dyn_cast<pto::CastPtrOp>(op)) {
if (isTrivialVPTOBoundaryCastPtr(castOp)) {
trivialCasts.push_back(castOp);
return;
}
if (castOp->use_empty())
deadOps.push_back(op);
return;
}

if (!op->use_empty())
return;
if (isa<pto::PointerCastOp, pto::BindTileOp, memref::ReinterpretCastOp,
memref::SubViewOp, memref::MemorySpaceCastOp>(op))
deadOps.push_back(op);
});

for (pto::CastPtrOp castOp : trivialCasts) {
if (!castOp->getBlock())
continue;
castOp.getResult().replaceAllUsesWith(castOp.getInput());
castOp.erase();
erasedAny = true;
}

for (Operation *op : deadOps) {
if (!op->getBlock())
continue;
op->erase();
erasedAny = true;
}
}
return success();
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The eraseDeadVPTOMemRefScaffold function uses a manual while loop for fixed-point iteration to erase dead operations. This can be inefficient for large modules. MLIR provides the GreedyPatternRewriteDriver which is designed to handle such canonicalizations and dead code elimination more efficiently and idiomatically.

StringRef vectorRole) {
auto actual = VPTOLegalityHelper::getMaskGranularity(maskType);
auto expected = VPTOLegalityHelper::inferMaskGranularityFromType(vectorType);
if (!actual || !expected || *actual == *expected)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The validateMaskMatchesVectorFamily function returns success() if expected is nullopt. However, inferMaskGranularityFromType returns nullopt for 64-bit element types (e.g., i64). This means that predicated operations on 64-bit vectors are silently skipped during mask granularity validation, which could lead to invalid IR passing through this stage.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants