Document contracts for all custom operators #1049
Conversation
Agent-Logs-Url: https://github.com/microsoft/onnxruntime-extensions/sessions/cdb9185e-44ff-4191-8d4c-b00889e20918 Co-authored-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
Pull request overview
This PR expands docs/custom_ops.md to document the input/output/attribute contracts for custom operators registered under operators/**/*.cc, filling previously missing or TODO sections and adding new operator groups.
Changes:
- Added documentation for multiple previously undocumented tokenizer, math, tensor, audio, vision, and CUDA-gated operators.
- Introduced new documentation sections grouping operators by domain (Audio, Vision, CUDA).
- Replaced several TODO stubs with detailed contracts.
@copilot please fix the comments and validate against the C++ documentation and implementations.
Agent-Logs-Url: https://github.com/microsoft/onnxruntime-extensions/sessions/3f620564-a099-495a-8067-d1d71deb349b Co-authored-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
Addressed all reviewer comments in 56a19d6, cross-validated each against the C++ implementations in
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 20 comments.
> #### Inputs
>
> ***input: tensor(string)***
>
> 1D string tensor containing the input texts.
The CLIPTokenizer contract here is incomplete vs the kernel's OnModelAttach/Compute signature. In addition to vocab, merges, and padding_length, the implementation also accepts optional attributes like model_name and added_token (for added-tokens text), and it has an additional optional add_special_tokens bool input (parsed as an optional input) that controls BOS/EOS insertion. Please document these so users can build nodes that match the actual op interface.
Suggested change:

```diff
 #### Inputs

 ***input: tensor(string)***

 1D string tensor containing the input texts.

+***model_name: string*** (optional)
+
+Optional model identifier used by the tokenizer implementation when model-specific defaults or special-token behavior are needed.
+
+***added_token: string*** (optional)
+
+Serialized added-tokens text used to provide extra token definitions beyond `vocab` and `merges`.
+
+***add_special_tokens: tensor(bool)*** (optional)
+
+Optional scalar or 1D boolean input that controls whether special tokens are added during tokenization. When true, the tokenizer inserts the model's BOS/EOS or other required special tokens; when false, tokenization is performed without adding them.
```
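The effect of the suggested `add_special_tokens` input can be sketched in plain Python. This is an illustration only, not the kernel code; the BOS/EOS ids below are CLIP's customary `<|startoftext|>`/`<|endoftext|>` ids, used here as an assumption:

```python
# Illustrative sketch: how an optional add_special_tokens input toggles
# BOS/EOS insertion around the encoded ids (ids are assumed, not read
# from the actual kernel).
BOS_ID, EOS_ID = 49406, 49407

def encode_ids(token_ids, add_special_tokens=True):
    # When true, wrap the sequence in BOS/EOS; when false, return ids as-is.
    if add_special_tokens:
        return [BOS_ID] + list(token_ids) + [EOS_ID]
    return list(token_ids)

print(encode_ids([320, 1125]))                            # [49406, 320, 1125, 49407]
print(encode_ids([320, 1125], add_special_tokens=False))  # [320, 1125]
```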
> expect(node, inputs=[x], outputs=[y],
>        name='test_string_length')
> ```
> </details>
>
> ### StringConcat
>
> <details>
> <summary>StringConcat details</summary>
>
> Concat the corresponding string in the two string tensor. Two input tensors should have the same dimension.

> ***all_special_ids: string*** (optional)
>
> Comma-separated list of special token ids.
>
> ***skip_special_tokens: int64_t*** (default is 0)
>
> When 1, ids in `all_special_ids` are skipped during decoding.
BpeDecoder's all_special_ids format is documented as comma-separated, but the implementation parses it line-by-line (newline-separated) using std::stoll and optional tab-delimited payloads. A comma-separated list will not be parsed correctly (only the first id would be read). Please update the docs to reflect the expected newline-separated format (one id per line, or id\t...).
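A plain-Python sketch of the line-by-line parsing described in this comment (one id per line, with an optional tab-delimited payload that is ignored for id extraction; the helper name is hypothetical):

```python
# Sketch of newline-separated id parsing, mirroring the described
# std::stoll-per-line behavior with optional "\t" payloads.
def parse_all_special_ids(text):
    ids = []
    for line in text.splitlines():
        if line.strip():
            ids.append(int(line.split("\t")[0]))
    return ids

# Build the attribute as one id per line, not comma-separated:
attr = "\n".join(str(i) for i in [50256, 50257])
print(parse_all_special_ids(attr))  # [50256, 50257]
```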
> Converts a ragged int64 tensor to a dense 2D tensor, padding shorter rows with a configurable value.
>
> #### Attributes
>
> ***missing_value: int64_t*** (default is -1)
>
> Value used to pad short rows.
>
> #### Inputs
>
> ***input0: tensor(int64)***
>
> 1D row-splits tensor indicating the start index of each row within `input3`.
>
> ***input1: tensor(int64)***
>
> 1D tensor of flat indices (unused by some consumers; reserved).
>
> ***input2: tensor(int64)***
>
> 1D tensor of length 2 describing the target dense shape `[num_rows, max_row_width]`.
>
> ***input3: tensor(int64)***
>
> 1D flat values tensor.
RaggedTensorToDense contract here doesn't match the current kernel implementation. The code pads using input2[0] (a missing-value input tensor), and uses input3 as the row-splits/index array; it does not use the missing_value attribute or the documented input2 target-shape tensor/input3 flat-values tensor ordering. Please update the documented inputs/attributes to reflect the actual parameter order and padding source (or update the kernel to match this documented API).
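For orientation, a generic ragged-to-dense padding sketch in plain Python. This illustrates the operation itself, not the kernel's disputed input ordering (the function and argument names are hypothetical):

```python
# Generic ragged-to-dense padding: row_splits is a prefix-sum array,
# values is the flat value buffer, missing_value pads short rows.
def ragged_to_dense(row_splits, values, missing_value):
    num_rows = len(row_splits) - 1
    width = max(row_splits[i + 1] - row_splits[i] for i in range(num_rows))
    rows = []
    for i in range(num_rows):
        row = values[row_splits[i]:row_splits[i + 1]]
        rows.append(row + [missing_value] * (width - len(row)))
    return rows

print(ragged_to_dense([0, 2, 3], [7, 8, 9], -1))  # [[7, 8], [9, -1]]
```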
> ***downsampling_rate: int64_t*** (default is -1)
>
> Target sample rate to resample the decoded audio to. When -1, the native sample rate of the decoded stream is used.
>
> ***stereo_to_mono: int64_t*** (default is 0)
AudioDecoder attributes in this doc section don't match the operator implementation. The kernel reads target_sample_rate / target_sample_rates (defaulting to 16000) and stereo_to_mono (defaulting to 1), not downsampling_rate with default -1 and stereo_to_mono default 0. Please update the documented attribute names and defaults to reflect the actual contract (or adjust the implementation if the doc is intended).
Suggested change:

```diff
-***downsampling_rate: int64_t*** (default is -1)
-
-Target sample rate to resample the decoded audio to. When -1, the native sample rate of the decoded stream is used.
-
-***stereo_to_mono: int64_t*** (default is 0)
+***target_sample_rate: int64_t*** (default is 16000)
+
+Target sample rate to resample the decoded audio to.
+
+***target_sample_rates: ints*** (default is 16000)
+
+Target sample rate(s) to resample the decoded audio to.
+
+***stereo_to_mono: int64_t*** (default is 1)
```
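A minimal sketch of what `stereo_to_mono=1` typically means, assuming channel averaging (a common convention; the actual kernel's downmix method is not specified here):

```python
# Illustrative stereo-to-mono downmix by averaging the two channels.
def stereo_to_mono(left, right):
    return [(l + r) / 2.0 for l, r in zip(left, right)]

print(stereo_to_mono([2.0, 4.0], [0.0, 4.0]))  # [1.0, 4.0]
```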
> #### Inputs
>
> ***input_1: tensor(string)***
>
> The first string tensor.

> ***en_normalization: int64_t*** (default is 0)
>
> Apply a minimal English-oriented post-processing step (e.g. undo leading-space markers).
>
> ***whitespace_token: string*** (optional)
>
> ***bos_token: string*** (optional)
>
> ***eos_token: string*** (optional)
>
> ***unk_token: string*** (optional)
>
> Optional overrides for well-known special tokens.
BpeDecoder documents whitespace_token as a string override, but the kernel treats whitespace_token as an int64_t flag attribute (0/1) that inserts spaces around special tokens. Please fix the documented type/semantics so they match the actual attribute behavior.
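A sketch of the flag semantics this comment describes, assuming "inserts spaces around special tokens" means padding each special token with a space on either side (all names here are hypothetical, not the kernel's API):

```python
# whitespace_token as an int flag: when 1, pad special tokens with spaces.
def render(tokens, special, whitespace_token=0):
    out = []
    for t in tokens:
        if t in special and whitespace_token:
            out.append(" " + t + " ")
        else:
            out.append(t)
    return "".join(out)

print(render(["a", "<eos>"], {"<eos>"}, whitespace_token=1))  # 'a <eos> '
```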
> ***mapping_file_name***
>
> the formative mapping table
>
> ***unmapping_value***
>
> the result returned when a vector aren't found in the map
VectorToString's documented attribute names (mapping_file_name, unmapping_value) don't match the implementation. The kernel reads attributes named map (mapping table contents) and unk (fallback string). Please update this Attributes section (and the surrounding narrative) to use the actual attribute names so users can construct valid nodes.
Suggested change:

```diff
-***mapping_file_name***
-
-the formative mapping table
-
-***unmapping_value***
-
-the result returned when a vector aren't found in the map
+***map***
+
+the formatted mapping table contents
+
+***unk***
+
+the result returned when a vector is not found in the map
```
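The `map`/`unk` lookup semantics can be sketched with a plain dict (the dict here is an illustrative stand-in, not the serialized attribute format the kernel parses):

```python
# Illustrative vector-to-string lookup with an "unk" fallback.
mapping = {(0,): "a", (1,): "b"}  # hypothetical table
unk = "unknown_word"

def vector_to_string(vec):
    return mapping.get(tuple(vec), unk)

print(vector_to_string([0]))  # 'a'
print(vector_to_string([9]))  # 'unknown_word'
```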
> Removes leading and trailing whitespace characters from every string in the input tensor. Similar to `str.strip()` in Python.
The StringStrip behavior described here doesn't match the current implementation: it only treats ASCII whitespace (" \t\n\r\f\v"), and when a string contains only whitespace it is left unchanged (Python's str.strip() would return an empty string). Please either adjust the docs to reflect these semantics or update the implementation to match Python.
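The semantics described in this comment can be sketched in Python (a model of the described behavior, not the C++ code):

```python
# ASCII-only strip; an all-whitespace string is left unchanged,
# per the reviewed implementation (unlike Python's str.strip()).
ASCII_WS = " \t\n\r\f\v"

def string_strip(s):
    stripped = s.strip(ASCII_WS)
    if stripped == "" and s != "":
        return s  # whitespace-only input is returned as-is
    return stripped

print(string_strip("  hello\t"))  # 'hello'
print(string_strip(" \t ") == " \t ")  # True: left unchanged
```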
|
|
> ***segments: tensor(int64)***
>
> 2D tensor of shape `[num_segments, 2]` where each row contains the `(begin_sample, end_sample)` indices of a detected segment.
SplitSignalSegments output is documented as (begin_sample, end_sample) indices, but the kernel actually outputs segment boundaries in milliseconds (it converts seconds to ms by multiplying by 1000 before writing the int64 outputs). Please update the output description to reflect the correct units.
Suggested change:

```diff
-2D tensor of shape `[num_segments, 2]` where each row contains the `(begin_sample, end_sample)` indices of a detected segment.
+2D tensor of shape `[num_segments, 2]` where each row contains the `(begin_ms, end_ms)` boundaries in milliseconds of a detected segment.
```
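The seconds-to-milliseconds conversion described in the comment is simply:

```python
# Second-based boundaries become int64 millisecond outputs (x1000).
segments_s = [(0.48, 1.52), (2.00, 3.75)]  # hypothetical detected segments
segments_ms = [(int(b * 1000), int(e * 1000)) for b, e in segments_s]
print(segments_ms)  # [(480, 1520), (2000, 3750)]
```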
> 2D tensor of shape `[N, 2]` with `(begin, end)` indices, as produced by `SplitSignalSegments`.
>
> ***merge_gap_ms: tensor(int64)***
>
> Scalar gap threshold in milliseconds. Segments separated by less than this value are merged.
>
> #### Outputs
>
> ***output: tensor(int64)***
>
> 2D tensor of shape `[M, 2]` (M <= N) of the merged segment boundaries.
MergeSignalSegments operates on the segment boundaries produced by SplitSignalSegments, which are in milliseconds. This section currently describes segments as generic (begin, end) indices and doesn't clarify the unit; please specify that both segments and merge_gap_ms are in milliseconds so users don't pass sample indices by mistake.
Suggested change:

```diff
-2D tensor of shape `[N, 2]` with `(begin, end)` indices, as produced by `SplitSignalSegments`.
+2D tensor of shape `[N, 2]` with segment boundaries `(begin_ms, end_ms)` in milliseconds, as produced by `SplitSignalSegments`.

 ***merge_gap_ms: tensor(int64)***

-Scalar gap threshold in milliseconds. Segments separated by less than this value are merged.
+Scalar gap threshold in milliseconds. Segments separated by less than this many milliseconds are merged.

 #### Outputs

 ***output: tensor(int64)***

-2D tensor of shape `[M, 2]` (M <= N) of the merged segment boundaries.
+2D tensor of shape `[M, 2]` (M <= N) of the merged segment boundaries in milliseconds.
```
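A sketch of the gap-based merge this contract describes (an illustration of the documented semantics, not the kernel implementation):

```python
# Merge sorted (begin_ms, end_ms) segments whose gap is below the threshold.
def merge_segments(segments_ms, merge_gap_ms):
    merged = []
    for b, e in segments_ms:
        if merged and b - merged[-1][1] < merge_gap_ms:
            # Gap smaller than the threshold: extend the previous segment.
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((b, e))
    return merged

print(merge_segments([(0, 480), (520, 900)], 100))  # [(0, 900)]
print(merge_segments([(0, 480), (700, 900)], 100))  # [(0, 480), (700, 900)]
```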
|
|
> ***n_element: tensor(int64)***
>
> 1D tensor holding the number of elements in each row.
RaggedTensorToSparse input n_element is documented as per-row counts, but the implementation expects a row-splits / prefix-sum array of length num_rows + 1 (it computes each row length as n_element[i] - n_element[i-1] and uses the last element as total value count). Please update the input description accordingly.
Suggested change:

```diff
-1D tensor holding the number of elements in each row.
+1D row-splits / prefix-sum tensor of length `num_rows + 1`, where each entry gives the cumulative number of elements up to that row. The length of row `i` is `n_element[i + 1] - n_element[i]`, and the last value is the total number of elements.
```
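The prefix-sum convention in the corrected description works like this:

```python
# Row lengths recovered from a row-splits / prefix-sum array of
# length num_rows + 1; the last entry is the total value count.
splits = [0, 2, 2, 5]  # hypothetical: 3 rows, 5 values total
lengths = [splits[i + 1] - splits[i] for i in range(len(splits) - 1)]
print(lengths)      # [2, 0, 3]
print(splits[-1])   # 5
```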
`docs/custom_ops.md` was missing or stubbed-out (TODO) contracts for a large portion of the operators registered under `operators/**/*.cc`. This PR fills in those gaps so every registered op has documented inputs, outputs, and attributes.

Changes to `docs/custom_ops.md`:

- Math/tensor operators: `Inverse`, `NegPos`, `SegmentExtraction`, `SegmentSum`, `RaggedTensorToSparse`, `RaggedTensorToDense`, `StringSplit`, `StringUpper`, `StringLower`, `StringECMARegexSplitWithOffsets`, `StringRaggedTensorToDense`, `StringMapping`, `StringHashFast`.
- Tokenizers: `CLIPTokenizer`, `RobertaTokenizer`, `SpmTokenizer`, `HfBertTokenizer`, `HfJsonTokenizer`, `SentencepieceDecoder`, `BpeDecoder`, `TrieTokenizer`, `TrieDetokenizer`, `BlingFireSentenceBreaker`.
- Audio: `StftNorm`, `SplitSignalSegments`, `MergeSignalSegments`, plus `AudioDecoder`.
- Text: `StringStrip`.
- Vision: `DecodeImage`, `EncodeImage`, `DrawBoundingBoxes`, `GaussianBlur`, `ImageDecoder`, `ImageReader`.
- CUDA-gated operators (`USE_CUDA`): `FastGelu`, `MulSigmoid`, `MulMulSigmoid`, `NegXPlus1`, `ReplaceZero`, `AddSharedInput`, `MulSharedInput`, `ScatterNDOfShape`, `MaskedScatterNDOfShape`, `Transpose2DCastFP16`, `Transpose2DCastFP32`.

Notes:

- Attribute names are documented as implemented (e.g. `maskedValue` on `MaskedScatterNDOfShape`) to stay accurate; renaming would be a breaking change and is out of scope.
- `StringUpper` is documented as ASCII-only (`::toupper` over raw bytes) while `StringLower` is documented as Unicode-aware (decodes UTF-8 into `char32_t` via `ustring`), matching current behavior in `operators/text/string_upper.cc` and `string_lower.cc`.
- `StringSlice` remains documented but is not registered in any current `OrtOpLoader`; left untouched since the issue is about adding missing docs, not pruning.
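The documented asymmetry between `StringUpper` (ASCII-only) and `StringLower` (Unicode-aware) can be illustrated with a small plain-Python sketch of the described semantics (not the C++ code itself):

```python
# ASCII-only uppercasing: bytes outside a-z are passed through untouched,
# so accented characters keep their case.
def ascii_upper(s):
    return "".join(c.upper() if "a" <= c <= "z" else c for c in s)

print(ascii_upper("héllo"))   # 'HéLLO' -- the é is left unchanged
# Unicode-aware lowercasing handles the accented character too:
print("HÉLLO".lower())        # 'héllo'
```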