feat: enable handling of URNs alongside URIs#522
Conversation
4eabe8b to
4a3236a
Compare
| dependencies { | ||
| testImplementation(platform("org.junit:junit-bom:${JUNIT_VERSION}")) | ||
| testImplementation("org.junit.jupiter:junit-jupiter") | ||
| testImplementation("com.google.protobuf:protobuf-java-util:${PROTOBUF_VERSION}") |
There was a problem hiding this comment.
This is necessary to allow us to read in JSON-based plans for testing purposes.
| @@ -0,0 +1,62 @@ | |||
| package io.substrait.extension; | |||
There was a problem hiding this comment.
This already existed in ExtensionCollector.java as a private class. Instead, I have moved it into a separate file so that we can also leverage it for a bijection between uris and urns for migration purposes.
9207bb9 to
6844151
Compare
| return extensionCollection.getUrnFromUri(uri); | ||
| } | ||
|
|
||
| private SimpleExtension.FunctionAnchor resolveFunctionAnchor( |
There was a problem hiding this comment.
7d01a77 to
a11a419
Compare
3a4d72c to
57cdadd
Compare
| /** Load a proto Plan from a JSON resource file using JsonFormat */ | ||
| private Plan loadPlanFromJson(String resourcePath) throws IOException { | ||
| try (InputStream inputStream = getClass().getClassLoader().getResourceAsStream(resourcePath)) { | ||
| if (inputStream == null) { | ||
| throw new IOException("Resource not found: " + resourcePath); | ||
| } | ||
|
|
||
| String jsonContent = | ||
| new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8)) | ||
| .lines() | ||
| .collect(Collectors.joining("\n")); | ||
|
|
||
| Plan.Builder planBuilder = Plan.newBuilder(); | ||
| JsonFormat.parser().merge(jsonContent, planBuilder); | ||
| return planBuilder.build(); | ||
| } | ||
| } |
There was a problem hiding this comment.
@vbarua is the protobuf JSON format now an official format of Substrait? If not, isn't relying on the protobuf JSON format in the core unit tests making it more official than it used to be? What are your thoughts?
There was a problem hiding this comment.
In a prior iteration of this PR, I did construct the plans by hand via the proto builders, rather than using JSON files. However, they were getting quite unreadable, and so I opted for switching to the JSON approach.
Happy to consider other approaches, but even if substrait doesn't have official JSON support, I don't see harm in using it internally in testing this library. No library APIs are being exposed specifically for JSON.
There was a problem hiding this comment.
Protobuf specifies a canonical JSON representation for proto3 messages. I guess that, even if there is no official JSON format specifically for Substrait, JSON is an official format for the protobuf objects used to represent a Substrait plan.
There was a problem hiding this comment.
I was asking because a while back I proposed to add some convenience methods for JSON serialization in this PR #381 since I had assumed the protobuf JSON format was the official Substrait JSON format which back then turned out to not be the case. I do understand that most people resort to looking at the protobuf JSON format for readability and debugging.
There was a problem hiding this comment.
Ah, my objection in that PR specifically was around exposing any kind of public JSON producing method that outputs the protobuf JSON format (protojson). If/when we do decide on a first-class Substrait JSON format, I want to avoid the issue of having an existing API that produces protojson because it won't necessarily match the format that we decide on, and would result in another awkward migration.
That is a separate concern from working with / using protojson plans. I don't see any issues with working with protojson directly, especially if it's not exposed to end-users as part of our public API.
There was a problem hiding this comment.
From a general direction it stills feels a bit inconsistent to me. I would either endorse protobuf JSON as an official format or not make use of it.
Independent of that I'm wondering how the maintenance of these static files would work and how the intended maintenance process would look like. Would we ever need to modify/regenerate them and how?
There was a problem hiding this comment.
If we don't make use of JSON here, do you have a suggestions for an alternative testing strategy? I also would say that these JSON can be fully removed once the migration is complete. They are only there to test this specific uri / urn dual functionality.
There was a problem hiding this comment.
I'm not saying I'm against the JSON. I just find the messaging around the role of the protobuf JSON in the Substrait project a bit inconsistent.
I think you said yourself you were previously using the proto builders which would be an alternative.
In a prior iteration of this PR, I did construct the plans by hand via the proto builders, rather than using JSON files. However, they were getting quite unreadable, and so I opted for switching to the JSON approach.
Using those might not be super readable, on the other hand they might have the benefit that the proto builders would break immediately when the deprecated functionality is removed and which might have the benefit that whoever removes the deprecated functionality does not forget to also delete the all the JSON files.
Again, I'm not against using the JSON files. I would just like to make sure we have a consistent (and ideally simple) messaging around the use of the JSON format.
There was a problem hiding this comment.
Direct construction via the builders is an alternative, but using the builders results in unreadable test cases. In my view, the tests should be optimized for readability because their goal is to communicate to the reader the correctness of this implementation.
These tests will also break once the next step of the migration takes place, necessitating their removal. (This is because once the migration is complete, generated plans will no longer emit uri information, which the expected plans do have.)
Also, I will personally shepherd the migration through here, so I will make sure that they are removed later.
I don't have the full context on the history of the JSON substrait format, but my understanding now is that:
a) there is no public facing substrait JSON API that is considered part of the spec, and
b) we can internally use JSON representations of plans as long as we don't publicly expose a JSON API in this library
Those two things are not in conflict. If it would help clarify the matter, I would be happy to add a comment to the code that makes it clear that this JSON is for internal use only.
There was a problem hiding this comment.
I don't have the full context on the history of the JSON substrait format, but my understanding now is that:
a) there is no public facing substrait JSON API that is considered part of the spec, and
This is my understanding as well. There is a placeholder in the documentation mentioning that there will be a JSON format but in reality there is no officially approved JSON format for Substrait: https://substrait.io/serialization/text_serialization/
b) we can internally use JSON representations of plans as long as we don't publicly expose a JSON API in this library
This is what I'm not sure about and that's why I asked. Since there is no officially approved JSON format but a lot of people are using the protobuf JSON format to work with Substrait plans then de facto it is part of Substrait since people use it. I would prefer if the Substrait project would for example specify: We acknowledge that Substrait uses protobuf and protobuf exists in a binary and JSON serialization form. We do recommend to use the binary serialization format wherever possible and suggest to limit the use of JSON to debugging and testing.
I'm not objecting to your PR and have no issue if it gets merged. I just find the fact that we as the Substrait project claim there will be an official JSON format but it's not the protobuf JSON format while almost everyone is using the protobuf JSON format a bit awkwar. The most straightforward resolution and my preferred would be to just acknowledge that the protobuf JSON is an official format to work with Substrait.
| /** Load a proto Plan from a JSON resource file using JsonFormat */ | ||
| private Plan loadPlanFromJson(String resourcePath) throws IOException { | ||
| try (InputStream inputStream = getClass().getClassLoader().getResourceAsStream(resourcePath)) { | ||
| if (inputStream == null) { | ||
| throw new IOException("Resource not found: " + resourcePath); | ||
| } | ||
|
|
||
| String jsonContent = | ||
| new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8)) | ||
| .lines() | ||
| .collect(Collectors.joining("\n")); | ||
|
|
||
| Plan.Builder planBuilder = Plan.newBuilder(); | ||
| JsonFormat.parser().merge(jsonContent, planBuilder); | ||
| return planBuilder.build(); | ||
| } | ||
| } |
There was a problem hiding this comment.
Ah, my objection in that PR specifically was around exposing any kind of public JSON producing method that outputs the protobuf JSON format (protojson). If/when we do decide on a first-class Substrait JSON format, I want to avoid the issue of having an existing API that produces protojson because it won't necessarily match the format that we decide on, and would result in another awkward migration.
That is a separate concern from working with / using protojson plans. I don't see any issues with working with protojson directly, especially if it's not exposed to end-users as part of our public API.
This commit fully moves from URIs to URNs. This is meant as an intermediate commit in the graceful migration. Later work will re-add support for URIs in a way such that both URIs and URNs are accepted on parse and emitted in plans. It also adds necessary kotlin dep to make debugging work. BREAKING CHANGE: this commit alters the API by dropping support for URIs and adding support for URNs.
Also introduces a put method to ensure no duplicates, as well as a merge method.
Implement comprehensive fallback strategy to resolve extensions from either URN or legacy URI references with conflict detection for protobuf ambiguity. Introduce round-trip tests to ensure output always has uri + urn regardless of combination of uri or urn in input plans. BREAKING CHANGE: This PR alters the extension loading API to require both URI explicitly, and URN implicitly in the plan. Extensions which lack URN will throw an error, and parsed plans which contain a uri/urn without a matching urn/uri pre-loaded into the context will throw an error.
57cdadd to
7fd825a
Compare
|
@vbarua I have rebased onto main, and incorporated your suggestions. Thanks! |
vbarua
left a comment
There was a problem hiding this comment.
Left some comments, but changes look good overall.
| /** Load a proto Plan from a JSON resource file using JsonFormat */ | ||
| private Plan loadPlanFromJson(String resourcePath) throws IOException { | ||
| try (InputStream inputStream = getClass().getClassLoader().getResourceAsStream(resourcePath)) { | ||
| if (inputStream == null) { | ||
| throw new IOException("Resource not found: " + resourcePath); | ||
| } | ||
|
|
||
| String jsonContent = | ||
| new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8)) | ||
| .lines() | ||
| .collect(Collectors.joining("\n")); | ||
|
|
||
| Plan.Builder planBuilder = Plan.newBuilder(); | ||
| JsonFormat.parser().merge(jsonContent, planBuilder); | ||
| return planBuilder.build(); | ||
| } | ||
| } |
There was a problem hiding this comment.
@nielspardon does this seem reasonable to you / make sense?
1830d03 to
3fbdd82
Compare
3fbdd82 to
89ce8ec
Compare
Closes #509
Summary
This PR introduces a graceful URI -> URN migration. See this issue in the upstream substrait repo for details.
The gist is that the library will be able to accept both uris and urns in input plans, but will always output both when producing plans. This is facilitated by keeping a map internally which tracks a bijection between uris and urns. A later PR will drop support for uris.
This is a breaking change.
Recommended PR Review Strategy
To facilitate easier review of this PR, I suggest breaking the commits into three chunks.
extension:<organization>:<name>).BidiMapto another file so that it can later be used to internally track the uri<->urn mapping.ExtensionCollectionwhich is now passed around as an argument to many of the classes.Testing
To robustly test this implementation, a new (light) test framework is introduced, where input plans are presented in JSON format. They then undergo proto -> POJO -> proto roundtrip conversion, and then compared with an expected JSON-format plan. The goal here is to ensure that regardless of the usage of uri or urn in input, both are present in the output. As an example, see here for an expected input plan in which only uri is used, and here for the expected output plan, in which both uri and urn are present.
This test framework may be dropped once the migration is complete, though perhaps it will prove useful for other reasons.