Document contracts for all custom operators #1049
Conversation
Agent-Logs-Url: https://github.com/microsoft/onnxruntime-extensions/sessions/cdb9185e-44ff-4191-8d4c-b00889e20918 Co-authored-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
Pull request overview
This PR expands docs/custom_ops.md to document the input/output/attribute contracts for custom operators registered under operators/**/*.cc, filling previously missing or TODO sections and adding new operator groups.
Changes:
- Added documentation for multiple previously undocumented tokenizer, math, tensor, audio, vision, and CUDA-gated operators.
- Introduced new documentation sections grouping operators by domain (Audio, Vision, CUDA).
- Replaced several TODO stubs with detailed contracts.
@copilot please fix the comments and validate against the C++ documentation and implementations.
Agent-Logs-Url: https://github.com/microsoft/onnxruntime-extensions/sessions/3f620564-a099-495a-8067-d1d71deb349b Co-authored-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
Addressed all reviewer comments in 56a19d6, cross-validated each against the C++ implementations in
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 20 comments.
> #### Inputs
>
> ***input: tensor(string)***
>
> 1D string tensor containing the input texts.
The CLIPTokenizer contract here is incomplete vs the kernel's OnModelAttach/Compute signature. In addition to vocab, merges, and padding_length, the implementation also accepts optional attributes like model_name and added_token (for added-tokens text), and it has an additional optional add_special_tokens bool input (parsed as an optional input) that controls BOS/EOS insertion. Please document these so users can build nodes that match the actual op interface.
Suggested change:

```diff
 #### Inputs

 ***input: tensor(string)***

 1D string tensor containing the input texts.

+***model_name: string*** (optional)
+
+Optional model identifier used by the tokenizer implementation when model-specific defaults or special-token behavior are needed.
+
+***added_token: string*** (optional)
+
+Serialized added-tokens text used to provide extra token definitions beyond `vocab` and `merges`.
+
+***add_special_tokens: tensor(bool)*** (optional)
+
+Optional scalar or 1D boolean input that controls whether special tokens are added during tokenization. When true, the tokenizer inserts the model's BOS/EOS or other required special tokens; when false, tokenization is performed without adding them.
```
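The effect of the suggested `add_special_tokens` input can be sketched in plain Python. This is an illustration only, not the kernel code; the BOS/EOS ids below are CLIP's customary `<|startoftext|>`/`<|endoftext|>` ids, used here as an assumption:

```python
# Illustrative sketch: how an optional add_special_tokens input toggles
# BOS/EOS insertion around the encoded ids (ids are assumed, not read
# from the actual kernel).
BOS_ID, EOS_ID = 49406, 49407

def encode_ids(token_ids, add_special_tokens=True):
    # When true, wrap the sequence in BOS/EOS; when false, return ids as-is.
    if add_special_tokens:
        return [BOS_ID] + list(token_ids) + [EOS_ID]
    return list(token_ids)

print(encode_ids([320, 1125]))                            # [49406, 320, 1125, 49407]
print(encode_ids([320, 1125], add_special_tokens=False))  # [320, 1125]
```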
> expect(node, inputs=[x], outputs=[y],
>        name='test_string_length')
> ```
> </details>
>
> ### StringConcat
>
> <details>
> <summary>StringConcat details</summary>
>
> Concat the corresponding string in the two string tensor. Two input tensors should have the same dimension.

> ***all_special_ids: string*** (optional)
>
> Comma-separated list of special token ids.
>
> ***skip_special_tokens: int64_t*** (default is 0)
>
> When 1, ids in `all_special_ids` are skipped during decoding.
BpeDecoder's all_special_ids format is documented as comma-separated, but the implementation parses it line-by-line (newline-separated) using std::stoll and optional tab-delimited payloads. A comma-separated list will not be parsed correctly (only the first id would be read). Please update the docs to reflect the expected newline-separated format (one id per line, or id\t...).
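A plain-Python sketch of the line-by-line parsing described in this comment (one id per line, with an optional tab-delimited payload that is ignored for id extraction; the helper name is hypothetical):

```python
# Sketch of newline-separated id parsing, mirroring the described
# std::stoll-per-line behavior with optional "\t" payloads.
def parse_all_special_ids(text):
    ids = []
    for line in text.splitlines():
        if line.strip():
            ids.append(int(line.split("\t")[0]))
    return ids

# Build the attribute as one id per line, not comma-separated:
attr = "\n".join(str(i) for i in [50256, 50257])
print(parse_all_special_ids(attr))  # [50256, 50257]
```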
> Converts a ragged int64 tensor to a dense 2D tensor, padding shorter rows with a configurable value.
>
> #### Attributes
>
> ***missing_value: int64_t*** (default is -1)
>
> Value used to pad short rows.
>
> #### Inputs
>
> ***input0: tensor(int64)***
>
> 1D row-splits tensor indicating the start index of each row within `input3`.
>
> ***input1: tensor(int64)***
>
> 1D tensor of flat indices (unused by some consumers; reserved).
>
> ***input2: tensor(int64)***
>
> 1D tensor of length 2 describing the target dense shape `[num_rows, max_row_width]`.
>
> ***input3: tensor(int64)***
>
> 1D flat values tensor.
RaggedTensorToDense contract here doesn't match the current kernel implementation. The code pads using input2[0] (a missing-value input tensor), and uses input3 as the row-splits/index array; it does not use the missing_value attribute or the documented input2 target-shape tensor/input3 flat-values tensor ordering. Please update the documented inputs/attributes to reflect the actual parameter order and padding source (or update the kernel to match this documented API).
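For orientation, a generic ragged-to-dense padding sketch in plain Python. This illustrates the operation itself, not the kernel's disputed input ordering (the function and argument names are hypothetical):

```python
# Generic ragged-to-dense padding: row_splits is a prefix-sum array,
# values is the flat value buffer, missing_value pads short rows.
def ragged_to_dense(row_splits, values, missing_value):
    num_rows = len(row_splits) - 1
    width = max(row_splits[i + 1] - row_splits[i] for i in range(num_rows))
    rows = []
    for i in range(num_rows):
        row = values[row_splits[i]:row_splits[i + 1]]
        rows.append(row + [missing_value] * (width - len(row)))
    return rows

print(ragged_to_dense([0, 2, 3], [7, 8, 9], -1))  # [[7, 8], [9, -1]]
```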
> ***downsampling_rate: int64_t*** (default is -1)
>
> Target sample rate to resample the decoded audio to. When -1, the native sample rate of the decoded stream is used.
>
> ***stereo_to_mono: int64_t*** (default is 0)
AudioDecoder attributes in this doc section don't match the operator implementation. The kernel reads target_sample_rate / target_sample_rates (defaulting to 16000) and stereo_to_mono (defaulting to 1), not downsampling_rate with default -1 and stereo_to_mono default 0. Please update the documented attribute names and defaults to reflect the actual contract (or adjust the implementation if the doc is intended).
Suggested change:

```diff
-***downsampling_rate: int64_t*** (default is -1)
-
-Target sample rate to resample the decoded audio to. When -1, the native sample rate of the decoded stream is used.
-
-***stereo_to_mono: int64_t*** (default is 0)
+***target_sample_rate: int64_t*** (default is 16000)
+
+Target sample rate to resample the decoded audio to.
+
+***target_sample_rates: ints*** (default is 16000)
+
+Target sample rate(s) to resample the decoded audio to.
+
+***stereo_to_mono: int64_t*** (default is 1)
```
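A minimal sketch of what `stereo_to_mono=1` typically means, assuming channel averaging (a common convention; the actual kernel's downmix method is not specified here):

```python
# Illustrative stereo-to-mono downmix by averaging the two channels.
def stereo_to_mono(left, right):
    return [(l + r) / 2.0 for l, r in zip(left, right)]

print(stereo_to_mono([2.0, 4.0], [0.0, 4.0]))  # [1.0, 4.0]
```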
> #### Inputs
>
> ***input_1: tensor(string)***
>
> The first string tensor.

> ***en_normalization: int64_t*** (default is 0)
>
> Apply a minimal English-oriented post-processing step (e.g. undo leading-space markers).
>
> ***whitespace_token: string*** (optional)
>
> ***bos_token: string*** (optional)
>
> ***eos_token: string*** (optional)
>
> ***unk_token: string*** (optional)
>
> Optional overrides for well-known special tokens.
BpeDecoder documents whitespace_token as a string override, but the kernel treats whitespace_token as an int64_t flag attribute (0/1) that inserts spaces around special tokens. Please fix the documented type/semantics so they match the actual attribute behavior.
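A sketch of the flag semantics this comment describes, assuming "inserts spaces around special tokens" means padding each special token with a space on either side (all names here are hypothetical, not the kernel's API):

```python
# whitespace_token as an int flag: when 1, pad special tokens with spaces.
def render(tokens, special, whitespace_token=0):
    out = []
    for t in tokens:
        if t in special and whitespace_token:
            out.append(" " + t + " ")
        else:
            out.append(t)
    return "".join(out)

print(render(["a", "<eos>"], {"<eos>"}, whitespace_token=1))  # 'a <eos> '
```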
> ***mapping_file_name***
>
> the formative mapping table
>
> ***unmapping_value***
>
> the result returned when a vector aren't found in the map
VectorToString's documented attribute names (mapping_file_name, unmapping_value) don't match the implementation. The kernel reads attributes named map (mapping table contents) and unk (fallback string). Please update this Attributes section (and the surrounding narrative) to use the actual attribute names so users can construct valid nodes.
Suggested change:

```diff
-***mapping_file_name***
-
-the formative mapping table
-
-***unmapping_value***
-
-the result returned when a vector aren't found in the map
+***map***
+
+the formatted mapping table contents
+
+***unk***
+
+the result returned when a vector is not found in the map
```
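The `map`/`unk` lookup semantics can be sketched with a plain dict (the dict here is an illustrative stand-in, not the serialized attribute format the kernel parses):

```python
# Illustrative vector-to-string lookup with an "unk" fallback.
mapping = {(0,): "a", (1,): "b"}  # hypothetical table
unk = "unknown_word"

def vector_to_string(vec):
    return mapping.get(tuple(vec), unk)

print(vector_to_string([0]))  # 'a'
print(vector_to_string([9]))  # 'unknown_word'
```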
> Removes leading and trailing whitespace characters from every string in the input tensor. Similar to `str.strip()` in Python.
The StringStrip behavior described here doesn't match the current implementation: it only treats ASCII whitespace (" \t\n\r\f\v"), and when a string contains only whitespace it is left unchanged (Python's str.strip() would return an empty string). Please either adjust the docs to reflect these semantics or update the implementation to match Python.
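The semantics described in this comment can be sketched in Python (a model of the described behavior, not the C++ code):

```python
# ASCII-only strip; an all-whitespace string is left unchanged,
# per the reviewed implementation (unlike Python's str.strip()).
ASCII_WS = " \t\n\r\f\v"

def string_strip(s):
    stripped = s.strip(ASCII_WS)
    if stripped == "" and s != "":
        return s  # whitespace-only input is returned as-is
    return stripped

print(string_strip("  hello\t"))  # 'hello'
print(string_strip(" \t ") == " \t ")  # True: left unchanged
```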
|
|
> ***segments: tensor(int64)***
>
> 2D tensor of shape `[num_segments, 2]` where each row contains the `(begin_sample, end_sample)` indices of a detected segment.
SplitSignalSegments output is documented as (begin_sample, end_sample) indices, but the kernel actually outputs segment boundaries in milliseconds (it converts seconds to ms by multiplying by 1000 before writing the int64 outputs). Please update the output description to reflect the correct units.
Suggested change:

```diff
-2D tensor of shape `[num_segments, 2]` where each row contains the `(begin_sample, end_sample)` indices of a detected segment.
+2D tensor of shape `[num_segments, 2]` where each row contains the `(begin_ms, end_ms)` boundaries in milliseconds of a detected segment.
```
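The seconds-to-milliseconds conversion described in the comment is simply:

```python
# Second-based boundaries become int64 millisecond outputs (x1000).
segments_s = [(0.48, 1.52), (2.00, 3.75)]  # hypothetical detected segments
segments_ms = [(int(b * 1000), int(e * 1000)) for b, e in segments_s]
print(segments_ms)  # [(480, 1520), (2000, 3750)]
```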
> 2D tensor of shape `[N, 2]` with `(begin, end)` indices, as produced by `SplitSignalSegments`.
>
> ***merge_gap_ms: tensor(int64)***
>
> Scalar gap threshold in milliseconds. Segments separated by less than this value are merged.
>
> #### Outputs
>
> ***output: tensor(int64)***
>
> 2D tensor of shape `[M, 2]` (M <= N) of the merged segment boundaries.
MergeSignalSegments operates on the segment boundaries produced by SplitSignalSegments, which are in milliseconds. This section currently describes segments as generic (begin, end) indices and doesn't clarify the unit; please specify that both segments and merge_gap_ms are in milliseconds so users don't pass sample indices by mistake.
Suggested change:

```diff
-2D tensor of shape `[N, 2]` with `(begin, end)` indices, as produced by `SplitSignalSegments`.
+2D tensor of shape `[N, 2]` with segment boundaries `(begin_ms, end_ms)` in milliseconds, as produced by `SplitSignalSegments`.

 ***merge_gap_ms: tensor(int64)***

-Scalar gap threshold in milliseconds. Segments separated by less than this value are merged.
+Scalar gap threshold in milliseconds. Segments separated by less than this many milliseconds are merged.

 #### Outputs

 ***output: tensor(int64)***

-2D tensor of shape `[M, 2]` (M <= N) of the merged segment boundaries.
+2D tensor of shape `[M, 2]` (M <= N) of the merged segment boundaries in milliseconds.
```
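A sketch of the gap-based merge this contract describes (an illustration of the documented semantics, not the kernel implementation):

```python
# Merge sorted (begin_ms, end_ms) segments whose gap is below the threshold.
def merge_segments(segments_ms, merge_gap_ms):
    merged = []
    for b, e in segments_ms:
        if merged and b - merged[-1][1] < merge_gap_ms:
            # Gap smaller than the threshold: extend the previous segment.
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((b, e))
    return merged

print(merge_segments([(0, 480), (520, 900)], 100))  # [(0, 900)]
print(merge_segments([(0, 480), (700, 900)], 100))  # [(0, 480), (700, 900)]
```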
|
|
> ***n_element: tensor(int64)***
>
> 1D tensor holding the number of elements in each row.
RaggedTensorToSparse input n_element is documented as per-row counts, but the implementation expects a row-splits / prefix-sum array of length num_rows + 1 (it computes each row length as n_element[i] - n_element[i-1] and uses the last element as total value count). Please update the input description accordingly.
Suggested change:

```diff
-1D tensor holding the number of elements in each row.
+1D row-splits / prefix-sum tensor of length `num_rows + 1`, where each entry gives the cumulative number of elements up to that row. The length of row `i` is `n_element[i + 1] - n_element[i]`, and the last value is the total number of elements.
```
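The prefix-sum convention in the corrected description works like this:

```python
# Row lengths recovered from a row-splits / prefix-sum array of
# length num_rows + 1; the last entry is the total value count.
splits = [0, 2, 2, 5]  # hypothetical: 3 rows, 5 values total
lengths = [splits[i + 1] - splits[i] for i in range(len(splits) - 1)]
print(lengths)      # [2, 0, 3]
print(splits[-1])   # 5
```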
`docs/custom_ops.md` was missing or stubbed-out (TODO) contracts for a large portion of the operators registered under `operators/**/*.cc`. This PR fills in those gaps so every registered op has documented inputs, outputs, and attributes.

Changes to `docs/custom_ops.md`:

- Math/tensor operators: `Inverse`, `NegPos`, `SegmentExtraction`, `SegmentSum`, `RaggedTensorToSparse`, `RaggedTensorToDense`, `StringSplit`, `StringUpper`, `StringLower`, `StringECMARegexSplitWithOffsets`, `StringRaggedTensorToDense`, `StringMapping`, `StringHashFast`.
- Tokenizers: `CLIPTokenizer`, `RobertaTokenizer`, `SpmTokenizer`, `HfBertTokenizer`, `HfJsonTokenizer`, `SentencepieceDecoder`, `BpeDecoder`, `TrieTokenizer`, `TrieDetokenizer`, `BlingFireSentenceBreaker`.
- Audio: `StftNorm`, `SplitSignalSegments`, `MergeSignalSegments`, plus `AudioDecoder`.
- Text: `StringStrip`.
- Vision: `DecodeImage`, `EncodeImage`, `DrawBoundingBoxes`, `GaussianBlur`, `ImageDecoder`, `ImageReader`.
- CUDA-gated operators (`USE_CUDA`): `FastGelu`, `MulSigmoid`, `MulMulSigmoid`, `NegXPlus1`, `ReplaceZero`, `AddSharedInput`, `MulSharedInput`, `ScatterNDOfShape`, `MaskedScatterNDOfShape`, `Transpose2DCastFP16`, `Transpose2DCastFP32`.

Notes:

- Attribute names are documented as implemented (e.g. `maskedValue` on `MaskedScatterNDOfShape`) to stay accurate; renaming would be a breaking change and is out of scope.
- `StringUpper` is documented as ASCII-only (`::toupper` over raw bytes) while `StringLower` is documented as Unicode-aware (decodes UTF-8 into `char32_t` via `ustring`), matching current behavior in `operators/text/string_upper.cc` and `string_lower.cc`.
- `StringSlice` remains documented but is not registered in any current `OrtOpLoader`; left untouched since the issue is about adding missing docs, not pruning.
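The documented asymmetry between `StringUpper` (ASCII-only) and `StringLower` (Unicode-aware) can be illustrated with a small plain-Python sketch of the described semantics (not the C++ code itself):

```python
# ASCII-only uppercasing: bytes outside a-z are passed through untouched,
# so accented characters keep their case.
def ascii_upper(s):
    return "".join(c.upper() if "a" <= c <= "z" else c for c in s)

print(ascii_upper("héllo"))   # 'HéLLO' -- the é is left unchanged
# Unicode-aware lowercasing handles the accented character too:
print("HÉLLO".lower())        # 'héllo'
```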