
Enable Flink projection pushdown for PSC connector #131

Open
KevBrowne wants to merge 4 commits into pinterest:main from KevBrowne:enable-flink-projection-pushdown

Conversation

Contributor

@KevBrowne KevBrowne commented Feb 23, 2026

Summary

This PR implements SupportsProjectionPushDown for the PSC Flink connector, enabling Flink's optimizer to push column projections down to the source. This optimization reduces deserialization overhead by only deserializing the columns actually needed by queries, rather than deserializing the entire schema.

Key changes:

PscDynamicSource now implements SupportsProjectionPushDown
Added applyProjection() method to compute query-specific projections
Introduced format vs output projection separation (keyFormatProjection/valueFormatProjection for deserialization, keyOutputProjection/valueOutputProjection for row assembly)
Updated getScanRuntimeProvider() and createPscDeserializationSchema() to use the new projections
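
The format vs. output projection split can be sketched in isolation. The following is a simplified, dependency-free illustration of the bookkeeping (class and field names here are hypothetical, not the connector's actual API; the real implementation lives in PscDynamicSource and uses Flink's SupportsProjectionPushDown):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical sketch: given projection paths from the planner, compute
// (a) which top-level fields the format must deserialize, and
// (b) where each projected field's top-level slot lands in the decoded row.
public class ProjectionSketch {
    // Top-level field indices the deserializer must decode ("format" projection).
    public final int[] formatProjection;
    // For each projection path, the slot of its top-level field in the decoded
    // row ("output" projection).
    public final int[] outputProjection;

    public ProjectionSketch(int[][] projectedPaths) {
        // Collect distinct top-level indices in first-seen order.
        LinkedHashSet<Integer> topLevel = new LinkedHashSet<>();
        for (int[] path : projectedPaths) {
            if (path.length == 0) {
                throw new IllegalArgumentException("Projection path must have at least one element");
            }
            topLevel.add(path[0]);
        }
        formatProjection = topLevel.stream().mapToInt(Integer::intValue).toArray();

        // Map each path's top-level index to its position in the decoded row.
        List<Integer> order = new ArrayList<>(topLevel);
        outputProjection = new int[projectedPaths.length];
        for (int i = 0; i < projectedPaths.length; i++) {
            outputProjection[i] = order.indexOf(projectedPaths[i][0]);
        }
    }
}
```

For example, SELECT a, c over schema (a, b, c) gives paths [[0], [2]]: the format decodes top-level fields {0, 2}, and the two decoded fields map to output positions 0 and 1.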

Test plan

Unit Test

% mvn -pl psc-flink test -Dtest=PscTableCommonUtilsTest,PscDynamicTableFactoryTest,PscProjectionPushdownTest -Dgpg.skip=true -Djacoco.skip=true
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for com.pinterest.psc:psc-examples:jar:4.1.4-SNAPSHOT
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-source-plugin is missing. @ com.pinterest.psc:psc-java-oss:4.1.4-SNAPSHOT, /home/kbrowne/code/psc/pom.xml, line 155, column 21
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for com.pinterest.psc:psc-integration-test:jar:4.1.4-SNAPSHOT
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-source-plugin is missing. @ com.pinterest.psc:psc-java-oss:4.1.4-SNAPSHOT, /home/kbrowne/code/psc/pom.xml, line 155, column 21
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for com.pinterest.psc:psc-flink:jar:4.1.4-SNAPSHOT
[WARNING] 'dependencies.dependency.(groupId:artifactId:type:classifier)' must be unique: org.apache.flink:flink-table-planner_${scala.binary.version}:jar -> duplicate declaration of version ${flink.version} @ line 327, column 21
[WARNING] 'dependencies.dependency.(groupId:artifactId:type:classifier)' must be unique: org.apache.flink:flink-json:jar -> duplicate declaration of version ${flink.version} @ line 417, column 21
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for com.pinterest.psc:psc-logging:jar:4.1.4-SNAPSHOT
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-source-plugin is missing. @ com.pinterest.psc:psc-java-oss:4.1.4-SNAPSHOT, /home/kbrowne/code/psc/pom.xml, line 155, column 21
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for com.pinterest.psc:psc-common:jar:4.1.4-SNAPSHOT
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-source-plugin is missing. @ com.pinterest.psc:psc-java-oss:4.1.4-SNAPSHOT, /home/kbrowne/code/psc/pom.xml, line 155, column 21
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for com.pinterest.psc:psc-flink-logging:jar:4.1.4-SNAPSHOT
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-source-plugin is missing. @ com.pinterest.psc:psc-java-oss:4.1.4-SNAPSHOT, /home/kbrowne/code/psc/pom.xml, line 155, column 21
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for com.pinterest.psc:psc-java-oss:pom:4.1.4-SNAPSHOT
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-source-plugin is missing. @ line 155, column 21
[WARNING] 
[WARNING] It is highly recommended to fix these problems because they threaten the stability of your build.
[WARNING] 
[WARNING] For this reason, future Maven versions might no longer support building such malformed projects.
[WARNING] 
[INFO] Inspecting build with total of 1 modules...
[INFO] Installing Nexus Staging features:
[INFO]   ... total of 1 executions of maven-deploy-plugin replaced with nexus-staging-maven-plugin
[INFO] 
[INFO] --------------------< com.pinterest.psc:psc-flink >---------------------
[INFO] Building psc-flink 4.1.4-SNAPSHOT
[INFO] --------------------------------[ jar ]---------------------------------
[INFO] 
[INFO] --- jacoco-maven-plugin:0.8.5:prepare-agent (prepare-unit-tests) @ psc-flink ---
[INFO] Skipping JaCoCo execution because property jacoco.skip is set.
[INFO] argLine set to empty
[INFO] 
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ psc-flink ---
[WARNING] Using platform encoding (UTF-8 actually) to copy filtered resources, i.e. build is platform dependent!
[INFO] Copying 1 resource
[INFO] 
[INFO] --- maven-compiler-plugin:3.8.1:compile (default-compile) @ psc-flink ---
[INFO] Nothing to compile - all classes are up to date
[INFO] 
[INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ psc-flink ---
[WARNING] Using platform encoding (UTF-8 actually) to copy filtered resources, i.e. build is platform dependent!
[INFO] Copying 99 resources
[INFO] 
[INFO] --- maven-compiler-plugin:3.8.1:testCompile (default-testCompile) @ psc-flink ---
[INFO] Nothing to compile - all classes are up to date
[INFO] 
[INFO] --- maven-surefire-plugin:3.0.0-M5:test (default-test) @ psc-flink ---
[INFO] 
[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO] Running com.pinterest.flink.streaming.connectors.psc.table.PscDynamicTableFactoryTest
[INFO] Tests run: 69, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.059 s - in com.pinterest.flink.streaming.connectors.psc.table.PscDynamicTableFactoryTest
[INFO] Running com.pinterest.flink.streaming.connectors.psc.table.PscProjectionPushdownTest
[INFO] Tests run: 16, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.37 s - in com.pinterest.flink.streaming.connectors.psc.table.PscProjectionPushdownTest
[INFO] Running com.pinterest.flink.streaming.connectors.psc.table.PscTableCommonUtilsTest
[INFO] Tests run: 17, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.551 s - in com.pinterest.flink.streaming.connectors.psc.table.PscTableCommonUtilsTest
[INFO] 
[INFO] Results:
[INFO] 
[INFO] Tests run: 102, Failures: 0, Errors: 0, Skipped: 0
[INFO] 
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  4.938 s
[INFO] Finished at: 2026-03-06T04:29:39Z
[INFO] ------------------------------------------------------------------------

Internal Testing E2E w/ Flink

After submitting a DDL with 42 columns:

SELECT * FROM <Table> LIMIT 10; --> Produced RowType w/ all 42 columns

SELECT a, b, c FROM <Table> LIMIT 10; --> Produced RowType w/ 3 columns a, b, c

SELECT a FROM <Table> WHERE b = '<value>' LIMIT 10; --> Produced RowType w/ 2 columns a, b

@KevBrowne KevBrowne force-pushed the enable-flink-projection-pushdown branch from 43f57d1 to 5f25ef7 on February 23, 2026 21:41
@KevBrowne KevBrowne marked this pull request as ready for review February 24, 2026 14:34
@KevBrowne KevBrowne requested a review from a team as a code owner February 24, 2026 14:34
@KevBrowne KevBrowne force-pushed the enable-flink-projection-pushdown branch from 5f25ef7 to 813dd37 on February 24, 2026 14:39
Contributor

@nickpan47 nickpan47 left a comment

Could you please check w/ Ashish to see why nested projection is not supported? If you plan to add another PR to support nested projection, please add a TODO as a quick follow-up. If we don't plan to support it in the first release, I wonder why, given that we have tested that the partial deserializer actually supports nested pushdown.

@KevBrowne KevBrowne force-pushed the enable-flink-projection-pushdown branch from 813dd37 to 125e731 on March 2, 2026 18:05
@KevBrowne KevBrowne requested a review from nickpan47 March 2, 2026 18:10
Contributor

@nickpan47 nickpan47 left a comment

Can you answer the question on how the nested pushdown will be translated into the selected Thrift metadata needed in the PartialThriftDeserializer? Thanks!

"Projection path must have at least one element but got: %s",
Arrays.toString(path));
// For nested projection, we only need the top-level field index to determine
// which fields to deserialize. The format (e.g., Thrift) handles nested extraction.
Contributor

How do we pass the list of nested fields as the ThriftMetadata.ThriftStruct metadata to the PartialThriftDeserializer, if we only record the top-level field index?

Contributor Author

Not sure why the PR did not show as updated.

To expand on just getting the top-level row, i.e.

int physicalPos = path[0];

as part of that loop I added logic to collect all the paths for the top-level field. So, for instance, if we have

CREATE TABLE events (
    id INT,           
    user ROW<         
        name STRING,      
        age INT,          
        address ROW<     
            city STRING,      
            zip STRING        
        >
    >,
    timestamp BIGINT  
)

we should now be able to capture

[1, 0], [1, 1], [1, 2], [1, 2, 0], etc.

I also updated the unit tests to cover this as well.
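
The path capture described here can be sketched without Flink types. `Node` below is a hypothetical stand-in for a nested row type; the real code walks the table schema's logical type tree:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: enumerate every projection path rooted at one
// top-level field of a nested row type. A Node with no children models a
// primitive field; a Node with children models a ROW type.
public class PathExpansion {
    public static class Node {
        final Node[] children;
        public Node(Node... children) { this.children = children; }
    }

    public static List<int[]> expand(Node field, int topLevelIndex) {
        List<int[]> out = new ArrayList<>();
        List<Integer> prefix = new ArrayList<>();
        prefix.add(topLevelIndex);
        collect(field, prefix, out);
        return out;
    }

    private static void collect(Node node, List<Integer> prefix, List<int[]> out) {
        for (int i = 0; i < node.children.length; i++) {
            prefix.add(i);
            out.add(prefix.stream().mapToInt(Integer::intValue).toArray());
            collect(node.children[i], prefix, out); // recurse into nested rows
            prefix.remove(prefix.size() - 1);
        }
    }
}
```

For the `user` column of the events table (index 1, with name, age, and address<city, zip>), this yields [1, 0], [1, 1], [1, 2], [1, 2, 0], [1, 2, 1].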

@KevBrowne KevBrowne requested a review from nickpan47 March 6, 2026 04:35
Contributor

@nickpan47 nickpan47 left a comment

Thanks! I have a comment on the separation of projection variables between top-level-only vs. nested. PTAL.

physicalPos >= 0 && physicalPos < physicalFieldCount,
"Projected field index out of bounds: %s",
physicalPos);
physicalIndexToOutputIndex[physicalPos] = outputPos;
Contributor

Would there be an issue if we have a table foo (a STRING, b ROW<key STRING, value STRING>) and then run SELECT b.key, b.value FROM foo? In this case, physicalIndexToOutputIndex[1] would collide, since two sub-fields of the same top-level field need to map to two output positions.

Contributor Author

Made changes to this because, yes, physicalIndexToOutputIndex[1] would be overwritten, losing b.key.

So now we properly record the outputPos in a map, e.g.:

pathsByTopLevelIndex = {
    1: [
        { path: [1, 0], outputPos: 0 },  // b.key  → output position 0
        { path: [1, 1], outputPos: 1 }   // b.value → output position 1
    ]
}
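
A minimal sketch of that grouping (the names ProjectedPath and group are illustrative, not the connector's actual identifiers):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: group every projected path under its top-level field
// index together with its output position, so two sub-fields of the same
// top-level column (e.g. b.key and b.value) no longer collide.
public class OutputMapping {
    public static final class ProjectedPath {
        public final int[] path;
        public final int outputPos;
        ProjectedPath(int[] path, int outputPos) {
            this.path = path;
            this.outputPos = outputPos;
        }
    }

    public static Map<Integer, List<ProjectedPath>> group(int[][] projectedPaths) {
        Map<Integer, List<ProjectedPath>> byTopLevel = new LinkedHashMap<>();
        for (int outputPos = 0; outputPos < projectedPaths.length; outputPos++) {
            int[] path = projectedPaths[outputPos];
            byTopLevel.computeIfAbsent(path[0], k -> new ArrayList<>())
                      .add(new ProjectedPath(path, outputPos));
        }
        return byTopLevel;
    }
}
```

With SELECT b.key, b.value (paths [[1, 0], [1, 1]]), both entries survive under top-level index 1 instead of the second overwriting the first.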

this.valueNestedProjection = valueNestedList.toArray(new int[0][]);

// Remap decoded fields into the projected output row order.
this.keyOutputProjection =
Contributor

So, for a nested projection that includes multiple sub-fields of the same top-level field, keyOutputProjection will only record the last sub-field's output position as keyOutputProjection[topFieldIndex]? The same applies to valueOutputProjection. Hence, for the example foo with b.key and b.value above, valueOutputProjection[1] = 1, which is the index position of b.value. Is this intended?

Contributor Author

Similar to the comment above.

* Converts nested projection paths to dot-separated field names.
* Example: [[1, 0], [2]] with schema (a, b ROW<x, y>, c) → ["b.x", "c"]
*/
public static List<String> convertPathsToFieldNames(int[][] paths, DataType dataType) {
Contributor

Is this only used by unit tests? I don't see it accessing the private static member variables of the PscDynamicSource class either. It can be moved entirely to the test class.

// Full nested projection paths for each format field.
// Each int[] is a path: [topLevelIndex] for top-level, [topLevelIndex, nestedIndex, ...] for nested.
// Used by formats that support nested projection (e.g., Thrift's PartialThriftDeserializer).
protected int[][] keyNestedProjection;
Contributor

Why do we instantiate two separate variables for nested projection? Instead, we can change the definition of keyProjection and valueProjection to int[][] and consolidate the member variables.

Contributor Author

@KevBrowne KevBrowne Mar 11, 2026

Done. I changed keyProjection/valueProjection to int[][] and consolidated the variables, so we no longer need keyNestedProjection and valueNestedProjection.

DataType physicalDataType,
@Nullable DecodingFormat<DeserializationSchema<RowData>> keyDecodingFormat,
DecodingFormat<DeserializationSchema<RowData>> valueDecodingFormat,
int[][] keyProjection,
Contributor

One main question that I still have here: in these keyProjection / valueProjection arrays for nested projection, the embedded indices are based on the field/sub-field indices in the DDL table schema definition, right? Does that mean nested projection also relies on the field/sub-field indices of the table DDL schema being exactly the same as those of the Thrift schema? For example, if the Thrift schema has fields 1:f1, 2:f2, 3:f3, 4:f4, and the table DDL has not been updated yet and only has the three fields 1:f1, 2:f2, 3:f4, then a projection on f1 and f4 will have indices 1,3. However, due to the difference between the table DDL and the Thrift schema, that projection resolves to 1:f1 and 3:f3, which is not what we expect.

The only reliable way to implement this is to base the nested projection on field/sub-field names instead of indices. This is what I believe should be implemented: a) int[][] keyProjection and int[][] valueProjection are indices into the table DDL schema; b) we find the list of field/sub-field names according to the table DDL schema based on keyProjection and valueProjection; c) we query and convert those field/sub-field names to Thrift field/sub-field indices, which form the actual nested projection int[][] passed into the partial Thrift deserializer.
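
The name-based translation proposed here could look roughly like the following sketch (the field names and lookup map are hypothetical; in practice they would come from the table DDL and the Thrift metadata):

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch: resolve DDL projection indices to field names first,
// then look the names up in the Thrift schema to get the Thrift field IDs
// that the partial deserializer actually needs.
public class NameBasedProjection {
    public static int[] toThriftIds(int[] ddlProjection,
                                    List<String> ddlFieldNames,
                                    Map<String, Integer> thriftIdsByName) {
        int[] thriftIds = new int[ddlProjection.length];
        for (int i = 0; i < ddlProjection.length; i++) {
            // Step 1: DDL index -> field name.
            String name = ddlFieldNames.get(ddlProjection[i]);
            // Step 2: field name -> Thrift field ID.
            Integer id = thriftIdsByName.get(name);
            if (id == null) {
                throw new IllegalArgumentException("Field not in Thrift schema: " + name);
            }
            thriftIds[i] = id;
        }
        return thriftIds;
    }
}
```

In the example above, the Thrift schema has 1:f1 … 4:f4 while the DDL declares only f1, f2, f4; projecting DDL indices {0, 2} then resolves to Thrift IDs {1, 4} rather than the incorrect {1, 3}.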

Contributor

@nickpan47 nickpan47 left a comment

One last comment on moving from index-based matching to name-based matching between the table DDL and Thrift. PTAL.
