Skip to content

feat: enable handling of URNs alongside URIs#522

Merged
vbarua merged 6 commits into
substrait-io:mainfrom
benbellick:urn-migration4
Oct 6, 2025
Merged

feat: enable handling of URNs alongside URIs#522
vbarua merged 6 commits into
substrait-io:mainfrom
benbellick:urn-migration4

Conversation

@benbellick
Copy link
Copy Markdown
Member

@benbellick benbellick commented Sep 26, 2025

Closes #509

Summary

This PR introduces a graceful URI -> URN migration. See this issue in the upstream substrait repo for details.

The gist is that the library will be able to accept both uris and urns in input plans, but will always output both when producing plans. This is facilitated by keeping a map internally which tracks a bijection between uris and urns. A later PR will drop support for uris.

This is a breaking change.

Recommended PR Review Strategy

To facilitate easier review of this PR, I suggest breaking the commits into three chunks.

  1. The first chunk fully migrates uri to urn without any graceful handling. You can think about this as the target state of the repository once the migration is fully complete. The only other feature added here is to add some light urn validation (to ensure the format is extension:<organization>:<name>).
  2. The second chunk just moves the internally used BidiMap to another file so that it can later be used to internally track the uri<->urn mapping.
  3. The third chunk is the real meat of the PR. This is where the uri<->urn mapping is introduced. This mapping is maintained internally inside of an ExtensionCollection which is now passed around as an argument to many of the classes.

Testing

To robustly test this implementation, a new (light) test framework is introduced, where input plans are presented in JSON format. They then undergo proto -> POJO -> proto roundtrip conversion, and then compared with an expected JSON-format plan. The goal here is to ensure that regardless of the usage of uri or urn in input, both are present in the output. As an example, see here for an expected input plan in which only uri is used, and here for the expected output plan, in which both uri and urn are present.

This test framework may be dropped once the migration is complete, though perhaps it will prove useful for other reasons.

Comment thread core/build.gradle.kts Outdated
dependencies {
testImplementation(platform("org.junit:junit-bom:${JUNIT_VERSION}"))
testImplementation("org.junit.jupiter:junit-jupiter")
testImplementation("com.google.protobuf:protobuf-java-util:${PROTOBUF_VERSION}")
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is necessary to allow us to read in JSON-based plans for testing purposes.

@@ -0,0 +1,62 @@
package io.substrait.extension;
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This already existed in ExtensionCollector.java as a private class. Instead, I have moved it into a separate file so that we can also leverage it for a bijection between uris and urns for migration purposes.

@benbellick benbellick force-pushed the urn-migration4 branch 8 times, most recently from 9207bb9 to 6844151 Compare September 27, 2025 17:02
return extensionCollection.getUrnFromUri(uri);
}

private SimpleExtension.FunctionAnchor resolveFunctionAnchor(
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benbellick benbellick force-pushed the urn-migration4 branch 3 times, most recently from 7d01a77 to a11a419 Compare September 27, 2025 18:01
@benbellick benbellick changed the title Urn migration4 Graceful URI -> URN Migration Sep 27, 2025
@benbellick benbellick marked this pull request as ready for review September 27, 2025 18:12
@benbellick benbellick force-pushed the urn-migration4 branch 6 times, most recently from 3a4d72c to 57cdadd Compare September 29, 2025 21:12
Comment on lines +32 to +48
/** Load a proto Plan from a JSON resource file using JsonFormat */
private Plan loadPlanFromJson(String resourcePath) throws IOException {
try (InputStream inputStream = getClass().getClassLoader().getResourceAsStream(resourcePath)) {
if (inputStream == null) {
throw new IOException("Resource not found: " + resourcePath);
}

String jsonContent =
new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8))
.lines()
.collect(Collectors.joining("\n"));

Plan.Builder planBuilder = Plan.newBuilder();
JsonFormat.parser().merge(jsonContent, planBuilder);
return planBuilder.build();
}
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vbarua is the protobuf JSON format now an official format of Substrait? If not, isn't relying on the protobuf JSON format in the core unit tests making it more official than it used to be? What are your thoughts?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a prior iteration of this PR, I did construct the plans by hand via the proto builders, rather than using JSON files. However, they were getting quite unreadable, and so I opted for switching to the JSON approach.

Happy to consider other approaches, but even if substrait doesn't have official JSON support, I don't see harm in using it internally in testing this library. No library APIs are being exposed specifically for JSON.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Protobuf specifies a canonical JSON representation for proto3 messages. I guess that, even if there is no official JSON format specifically for Substrait, JSON is an official format for the protobuf objects used to represent a Substrait plan.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was asking because a while back I proposed to add some convenience methods for JSON serialization in this PR #381 since I had assumed the protobuf JSON format was the official Substrait JSON format which back then turned out to not be the case. I do understand that most people resort to looking at the protobuf JSON format for readability and debugging.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, my objection in that PR specifically was around exposing any kind of public JSON producing method that outputs the protobuf JSON format (protojson). If/when we do decide on a first-class Substrait JSON format, I want to avoid the issue of having an existing API that produces protojson because it won't necessarily match the format that we decide on, and would result in another awkward migration.

That is a separate concern from working with / using protojson plans. I don't see any issues with working with protojson directly, especially if it's not exposed to end-users as part of our public API.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a general direction it stills feels a bit inconsistent to me. I would either endorse protobuf JSON as an official format or not make use of it.

Independent of that I'm wondering how the maintenance of these static files would work and how the intended maintenance process would look like. Would we ever need to modify/regenerate them and how?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't make use of JSON here, do you have a suggestions for an alternative testing strategy? I also would say that these JSON can be fully removed once the migration is complete. They are only there to test this specific uri / urn dual functionality.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not saying I'm against the JSON. I just find the messaging around the role of the protobuf JSON in the Substrait project a bit inconsistent.

I think you said yourself you were previously using the proto builders which would be an alternative.

In a prior iteration of this PR, I did construct the plans by hand via the proto builders, rather than using JSON files. However, they were getting quite unreadable, and so I opted for switching to the JSON approach.

Using those might not be super readable, on the other hand they might have the benefit that the proto builders would break immediately when the deprecated functionality is removed and which might have the benefit that whoever removes the deprecated functionality does not forget to also delete the all the JSON files.

Again, I'm not against using the JSON files. I would just like to make sure we have a consistent (and ideally simple) messaging around the use of the JSON format.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Direct construction via the builders is an alternative, but using the builders results in unreadable test cases. In my view, the tests should be optimized for readability because their goal is to communicate to the reader the correctness of this implementation.

These tests will also break once the next step of the migration takes place, necessitating their removal. (This is because once the migration is complete, generated plans will no longer emit uri information, which the expected plans do have.)

Also, I will personally shepherd the migration through here, so I will make sure that they are removed later.

I don't have the full context on the history of the JSON substrait format, but my understanding now is that:
a) there is no public facing substrait JSON API that is considered part of the spec, and
b) we can internally use JSON representations of plans as long as we don't publicly expose a JSON API in this library

Those two things are not in conflict. If it would help clarify the matter, I would be happy to add a comment to the code that makes it clear that this JSON is for internal use only.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have the full context on the history of the JSON substrait format, but my understanding now is that:
a) there is no public facing substrait JSON API that is considered part of the spec, and

This is my understanding as well. There is a placeholder in the documentation mentioning that there will be a JSON format but in reality there is no officially approved JSON format for Substrait: https://substrait.io/serialization/text_serialization/

b) we can internally use JSON representations of plans as long as we don't publicly expose a JSON API in this library

This is what I'm not sure about and that's why I asked. Since there is no officially approved JSON format but a lot of people are using the protobuf JSON format to work with Substrait plans then de facto it is part of Substrait since people use it. I would prefer if the Substrait project would for example specify: We acknowledge that Substrait uses protobuf and protobuf exists in a binary and JSON serialization form. We do recommend to use the binary serialization format wherever possible and suggest to limit the use of JSON to debugging and testing.

I'm not objecting to your PR and have no issue if it gets merged. I just find the fact that we as the Substrait project claim there will be an official JSON format but it's not the protobuf JSON format while almost everyone is using the protobuf JSON format a bit awkwar. The most straightforward resolution and my preferred would be to just acknowledge that the protobuf JSON is an official format to work with Substrait.

@benbellick benbellick mentioned this pull request Sep 30, 2025
@vbarua vbarua changed the title Graceful URI -> URN Migration feat: enable handling of URNs alongside URIs Sep 30, 2025
Copy link
Copy Markdown
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found time for a first pass. Will do another one tomorrow.

Also, I had a suggestion that I turned into a separate PR #529

Comment thread core/src/main/java/io/substrait/extension/BidiMap.java
Comment thread core/src/main/java/io/substrait/extension/SimpleExtension.java Outdated
Comment thread core/src/main/java/io/substrait/extension/ExtensionCollector.java Outdated
Comment on lines +32 to +48
/** Load a proto Plan from a JSON resource file using JsonFormat */
private Plan loadPlanFromJson(String resourcePath) throws IOException {
try (InputStream inputStream = getClass().getClassLoader().getResourceAsStream(resourcePath)) {
if (inputStream == null) {
throw new IOException("Resource not found: " + resourcePath);
}

String jsonContent =
new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8))
.lines()
.collect(Collectors.joining("\n"));

Plan.Builder planBuilder = Plan.newBuilder();
JsonFormat.parser().merge(jsonContent, planBuilder);
return planBuilder.build();
}
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, my objection in that PR specifically was around exposing any kind of public JSON producing method that outputs the protobuf JSON format (protojson). If/when we do decide on a first-class Substrait JSON format, I want to avoid the issue of having an existing API that produces protojson because it won't necessarily match the format that we decide on, and would result in another awkward migration.

That is a separate concern from working with / using protojson plans. I don't see any issues with working with protojson directly, especially if it's not exposed to end-users as part of our public API.

This commit fully moves from URIs to URNs. This is meant as an
intermediate commit in the graceful migration. Later work will re-add
support for URIs in a way such that both URIs and URNs are accepted on
parse and emitted in plans. It also adds necessary kotlin dep to make debugging work.

BREAKING CHANGE: this commit alters the API by dropping support for
URIs and adding support for URNs.
Also introduces a put method to ensure no duplicates, as well as a
merge method.
Implement comprehensive fallback strategy to resolve extensions from either
URN or legacy URI references with conflict detection for protobuf
ambiguity.

Introduce round-trip tests to ensure output always has uri + urn
regardless of combination of uri or urn in input plans.

BREAKING CHANGE: This PR alters the extension loading API to require
both URI explicitly, and URN implicitly in the plan. Extensions which lack
URN will throw an error, and parsed plans which contain a uri/urn
without a matching urn/uri pre-loaded into the context will throw an
error.
@benbellick
Copy link
Copy Markdown
Member Author

@vbarua I have rebased onto main, and incorporated your suggestions. Thanks!

@benbellick benbellick requested a review from vbarua October 1, 2025 15:24
Copy link
Copy Markdown
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments, but changes look good overall.

Comment thread core/src/main/java/io/substrait/extension/ImmutableExtensionLookup.java Outdated
Comment thread core/src/main/java/io/substrait/extension/ImmutableExtensionLookup.java Outdated
Comment on lines +32 to +48
/** Load a proto Plan from a JSON resource file using JsonFormat */
private Plan loadPlanFromJson(String resourcePath) throws IOException {
try (InputStream inputStream = getClass().getClassLoader().getResourceAsStream(resourcePath)) {
if (inputStream == null) {
throw new IOException("Resource not found: " + resourcePath);
}

String jsonContent =
new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8))
.lines()
.collect(Collectors.joining("\n"));

Plan.Builder planBuilder = Plan.newBuilder();
JsonFormat.parser().merge(jsonContent, planBuilder);
return planBuilder.build();
}
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nielspardon does this seem reasonable to you / make sense?

Comment thread core/src/main/java/io/substrait/extension/AbstractExtensionLookup.java Outdated
Comment thread core/src/main/java/io/substrait/extension/SimpleExtension.java Outdated
Comment thread core/src/main/java/io/substrait/extension/ImmutableExtensionLookup.java Outdated
Comment thread core/src/main/java/io/substrait/extension/ImmutableExtensionLookup.java Outdated
Comment thread core/src/main/java/io/substrait/extension/ImmutableExtensionLookup.java Outdated
@vbarua vbarua merged commit 26e1e3e into substrait-io:main Oct 6, 2025
12 checks passed
@benbellick benbellick deleted the urn-migration4 branch October 8, 2025 15:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Migrate extensions from URIs to URNs

4 participants