Enable Flink projection pushdown for PSC connector by KevBrowne · Pull Request #131 · pinterest/psc

KevBrowne · 2026-02-23T21:31:38Z

Summary

This PR implements SupportsProjectionPushDown for the PSC Flink connector, enabling Flink's optimizer to push column projections down to the source. This optimization reduces deserialization overhead by only deserializing the columns actually needed by queries, rather than deserializing the entire schema.

Key changes:

PscDynamicSource now implements SupportsProjectionPushDown
Added applyProjection() method to compute query-specific projections
Introduced format vs output projection separation (keyFormatProjection/valueFormatProjection for deserialization, keyOutputProjection/valueOutputProjection for row assembly)
Updated getScanRuntimeProvider() and createPscDeserializationSchema() to use the new projections
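
To make the format vs output split concrete, here is a minimal sketch in plain Java. It is not the connector's code: `ProjectionSplitSketch`, `formatProjection`, and `outputProjection` are hypothetical names, and it assumes a top-level-only projection handed over as the `int[][]` index paths that Flink passes to `applyProjection()`. The first method answers "which physical fields must be decoded", the second answers "where does each decoded field land in the output row".

```java
import java.util.Arrays;

/** Hypothetical helper for illustration; not the connector's actual code. */
final class ProjectionSplitSketch {

    /** Top-level physical indices the format has to decode, ascending and de-duplicated. */
    static int[] formatProjection(int[][] projectedFields) {
        return Arrays.stream(projectedFields)
                .mapToInt(path -> path[0])
                .distinct()
                .sorted()
                .toArray();
    }

    /**
     * For each decoded field (same order as formatProjection), the position it
     * takes in the output row. Assumes a top-level-only projection.
     */
    static int[] outputProjection(int[][] projectedFields) {
        int[] decoded = formatProjection(projectedFields);
        int[] outputPositions = new int[decoded.length];
        for (int outputPos = 0; outputPos < projectedFields.length; outputPos++) {
            // Locate this projected field among the decoded fields.
            int decodedPos = Arrays.binarySearch(decoded, projectedFields[outputPos][0]);
            outputPositions[decodedPos] = outputPos;
        }
        return outputPositions;
    }

    public static void main(String[] args) {
        // Schema (a, b, c); SELECT c, a arrives as the index paths [[2], [0]].
        int[][] projected = {{2}, {0}};
        System.out.println(Arrays.toString(formatProjection(projected)));
        System.out.println(Arrays.toString(outputProjection(projected)));
    }
}
```

Keeping the two arrays separate means the decoder can emit fields in whatever order suits the format, while row assembly alone is responsible for the query's column order.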

Test plan

Unit Test

% mvn -pl psc-flink test -Dtest=PscTableCommonUtilsTest,PscDynamicTableFactoryTest,PscProjectionPushdownTest -Dgpg.skip=true -Djacoco.skip=true
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for com.pinterest.psc:psc-examples:jar:4.1.4-SNAPSHOT
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-source-plugin is missing. @ com.pinterest.psc:psc-java-oss:4.1.4-SNAPSHOT, /home/kbrowne/code/psc/pom.xml, line 155, column 21
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for com.pinterest.psc:psc-integration-test:jar:4.1.4-SNAPSHOT
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-source-plugin is missing. @ com.pinterest.psc:psc-java-oss:4.1.4-SNAPSHOT, /home/kbrowne/code/psc/pom.xml, line 155, column 21
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for com.pinterest.psc:psc-flink:jar:4.1.4-SNAPSHOT
[WARNING] 'dependencies.dependency.(groupId:artifactId:type:classifier)' must be unique: org.apache.flink:flink-table-planner_${scala.binary.version}:jar -> duplicate declaration of version ${flink.version} @ line 327, column 21
[WARNING] 'dependencies.dependency.(groupId:artifactId:type:classifier)' must be unique: org.apache.flink:flink-json:jar -> duplicate declaration of version ${flink.version} @ line 417, column 21
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for com.pinterest.psc:psc-logging:jar:4.1.4-SNAPSHOT
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-source-plugin is missing. @ com.pinterest.psc:psc-java-oss:4.1.4-SNAPSHOT, /home/kbrowne/code/psc/pom.xml, line 155, column 21
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for com.pinterest.psc:psc-common:jar:4.1.4-SNAPSHOT
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-source-plugin is missing. @ com.pinterest.psc:psc-java-oss:4.1.4-SNAPSHOT, /home/kbrowne/code/psc/pom.xml, line 155, column 21
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for com.pinterest.psc:psc-flink-logging:jar:4.1.4-SNAPSHOT
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-source-plugin is missing. @ com.pinterest.psc:psc-java-oss:4.1.4-SNAPSHOT, /home/kbrowne/code/psc/pom.xml, line 155, column 21
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for com.pinterest.psc:psc-java-oss:pom:4.1.4-SNAPSHOT
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-source-plugin is missing. @ line 155, column 21
[WARNING] 
[WARNING] It is highly recommended to fix these problems because they threaten the stability of your build.
[WARNING] 
[WARNING] For this reason, future Maven versions might no longer support building such malformed projects.
[WARNING] 
[INFO] Inspecting build with total of 1 modules...
[INFO] Installing Nexus Staging features:
[INFO]   ... total of 1 executions of maven-deploy-plugin replaced with nexus-staging-maven-plugin
[INFO] 
[INFO] --------------------< com.pinterest.psc:psc-flink >---------------------
[INFO] Building psc-flink 4.1.4-SNAPSHOT
[INFO] --------------------------------[ jar ]---------------------------------
[INFO] 
[INFO] --- jacoco-maven-plugin:0.8.5:prepare-agent (prepare-unit-tests) @ psc-flink ---
[INFO] Skipping JaCoCo execution because property jacoco.skip is set.
[INFO] argLine set to empty
[INFO] 
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ psc-flink ---
[WARNING] Using platform encoding (UTF-8 actually) to copy filtered resources, i.e. build is platform dependent!
[INFO] Copying 1 resource
[INFO] 
[INFO] --- maven-compiler-plugin:3.8.1:compile (default-compile) @ psc-flink ---
[INFO] Nothing to compile - all classes are up to date
[INFO] 
[INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ psc-flink ---
[WARNING] Using platform encoding (UTF-8 actually) to copy filtered resources, i.e. build is platform dependent!
[INFO] Copying 99 resources
[INFO] 
[INFO] --- maven-compiler-plugin:3.8.1:testCompile (default-testCompile) @ psc-flink ---
[INFO] Nothing to compile - all classes are up to date
[INFO] 
[INFO] --- maven-surefire-plugin:3.0.0-M5:test (default-test) @ psc-flink ---
[INFO] 
[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO] Running com.pinterest.flink.streaming.connectors.psc.table.PscDynamicTableFactoryTest
[INFO] Tests run: 69, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.059 s - in com.pinterest.flink.streaming.connectors.psc.table.PscDynamicTableFactoryTest
[INFO] Running com.pinterest.flink.streaming.connectors.psc.table.PscProjectionPushdownTest
[INFO] Tests run: 16, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.37 s - in com.pinterest.flink.streaming.connectors.psc.table.PscProjectionPushdownTest
[INFO] Running com.pinterest.flink.streaming.connectors.psc.table.PscTableCommonUtilsTest
[INFO] Tests run: 17, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.551 s - in com.pinterest.flink.streaming.connectors.psc.table.PscTableCommonUtilsTest
[INFO] 
[INFO] Results:
[INFO] 
[INFO] Tests run: 102, Failures: 0, Errors: 0, Skipped: 0
[INFO] 
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  4.938 s
[INFO] Finished at: 2026-03-06T04:29:39Z
[INFO] ------------------------------------------------------------------------

Internal Testing E2E w/ Flink

After submitting DDL with 42 columns

SELECT * FROM <Table> LIMIT 10; --> Produced RowType w/ all 42 columns

SELECT a, b, c FROM <Table> LIMIT 10; --> Produced RowType w/ 3 columns a,b,c

SELECT a FROM <Table> WHERE b = '<value>' LIMIT 10; --> Produced RowType w/ 2 columns a,b

nickpan47

Could you please check w/ Ashish to see why nested projection is not supported? If you plan to add another PR to support nested projection, please add a TODO as a quick follow-up. If we don't plan to support it in the first release, I wonder why, given that we have tested that the partial deserializer actually supports nested pushdown.

...flink/src/main/java/com/pinterest/flink/streaming/connectors/psc/table/PscDynamicSource.java

nickpan47

Can you answer the question on how the nested pushdown will be translated into the selected Thrift metadata needed in the PartialThriftDeserializer? Thanks!

nickpan47 · 2026-03-05T23:35:19Z

...flink/src/main/java/com/pinterest/flink/streaming/connectors/psc/table/PscDynamicSource.java

+                    "Projection path must have at least one element but got: %s",
+                    Arrays.toString(path));
+            // For nested projection, we only need the top-level field index to determine
+            // which fields to deserialize. The format (e.g., Thrift) handles nested extraction.


How do we pass the list of nested fields as the ThriftMetadata.ThriftStruct metadata to the PartialThriftDeserializer, if we only record the top-level field index?

Not sure why the PR did not show the update.

To expand on just taking the top-level index, i.e.

int physicalPos = path[0];

as part of that loop I added logic to collect all the paths under each top-level field. So, for instance, if we have

CREATE TABLE events ( id INT, user ROW< name STRING, age INT, address ROW< city STRING, zip STRING > >, timestamp BIGINT )

we should now be able to capture

[1, 0], [1, 1], [1, 2], [1, 2, 0], etc.

I also updated the unit tests to cover this.

nickpan47

Thanks! Have a comment on the separation of projection variables between top-level only vs nested. PTAL

nickpan47 · 2026-03-09T23:01:30Z

...flink/src/main/java/com/pinterest/flink/streaming/connectors/psc/table/PscDynamicSource.java

+                    physicalPos >= 0 && physicalPos < physicalFieldCount,
+                    "Projected field index out of bounds: %s",
+                    physicalPos);
+            physicalIndexToOutputIndex[physicalPos] = outputPos;


Would there be an issue if we have a table foo (a string, b row<key string, value string>) and then run select b.key, b.value from foo? In this case, physicalIndexToOutputIndex[1] would collide, since two sub-fields of the same top-level field need to map to two different outputPos values.

Made changes here because, yes, physicalIndexToOutputIndex[1] would be overwritten, losing b.key.

So now we record the outputPos for every path in a map instead, e.g.

pathsByTopLevelIndex = {
  1: [
    { path: [1, 0], outputPos: 0 },   // b.key → output position 0
    { path: [1, 1], outputPos: 1 }    // b.value → output position 1
  ]
}
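
As a rough illustration of the grouping described above, the sketch below builds such a map from the `int[][]` paths, using each path's position in the array as its output position. `PathGroupingSketch` and `ProjectedPath` are hypothetical names chosen for this example; the PR's actual types may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch; the names here are not taken from the PR. */
final class PathGroupingSketch {

    /** One projected path plus the output row position it should fill. */
    static final class ProjectedPath {
        final int[] path;
        final int outputPos;

        ProjectedPath(int[] path, int outputPos) {
            this.path = path;
            this.outputPos = outputPos;
        }

        @Override
        public String toString() {
            return Arrays.toString(path) + " -> " + outputPos;
        }
    }

    /** Groups paths by their top-level index so sibling sub-fields never overwrite each other. */
    static Map<Integer, List<ProjectedPath>> groupByTopLevelIndex(int[][] projectedFields) {
        Map<Integer, List<ProjectedPath>> grouped = new TreeMap<>();
        for (int outputPos = 0; outputPos < projectedFields.length; outputPos++) {
            int[] path = projectedFields[outputPos];
            if (path.length == 0) {
                throw new IllegalArgumentException(
                        "Projection path must have at least one element");
            }
            grouped.computeIfAbsent(path[0], k -> new ArrayList<>())
                    .add(new ProjectedPath(path, outputPos));
        }
        return grouped;
    }

    public static void main(String[] args) {
        // foo (a STRING, b ROW<key STRING, value STRING>); SELECT b.key, b.value
        int[][] projected = {{1, 0}, {1, 1}};
        System.out.println(groupByTopLevelIndex(projected));
    }
}
```

Because the map value is a list, both `b.key` and `b.value` keep their own output position, which is exactly what a single `physicalIndexToOutputIndex[1]` slot could not represent.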

nickpan47 · 2026-03-09T23:10:27Z

...flink/src/main/java/com/pinterest/flink/streaming/connectors/psc/table/PscDynamicSource.java

+        this.valueNestedProjection = valueNestedList.toArray(new int[0][]);
+
+        // Remap decoded fields into the projected output row order.
+        this.keyOutputProjection =


So, for a nested projection that includes multiple sub-fields of the same top-level field, keyOutputProjection will only hold the last sub-field's output pos at keyOutputProjection[topFieldIndex]? The same applies to valueOutputProjection. Hence, for the example foo with b.key and b.value above, valueOutputProjection[1]=1, which is the output position of b.value. Is this intended?

Similar to the comment above.

nickpan47 · 2026-03-09T23:16:04Z

...flink/src/main/java/com/pinterest/flink/streaming/connectors/psc/table/PscDynamicSource.java

+     * Converts nested projection paths to dot-separated field names.
+     * Example: [[1, 0], [2]] with schema (a, b ROW&lt;x, y&gt;, c) → ["b.x", "c"]
+     */
+    public static List<String> convertPathsToFieldNames(int[][] paths, DataType dataType) {


Is this only used by unit tests? I don't see it access the private static member variables in PscDynamicSource class either. It can be completely moved to the test class.
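
For readers without the diff at hand, the conversion that Javadoc describes can be sketched as below. This is a standalone illustration with a hypothetical `Field` type standing in for Flink's `DataType`; it is not the PR's `convertPathsToFieldNames` implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; Field is a stand-in for a real row type. */
final class PathNamesSketch {

    /** A field name plus its sub-fields in declaration order (empty for leaves). */
    static final class Field {
        final String name;
        final List<Field> children;

        Field(String name, Field... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }
    }

    /** Resolves each index path against the schema and joins the names with dots. */
    static List<String> toFieldNames(int[][] paths, List<Field> topLevel) {
        List<String> names = new ArrayList<>();
        for (int[] path : paths) {
            List<Field> level = topLevel;
            StringBuilder dotted = new StringBuilder();
            for (int index : path) {
                // An out-of-range index throws IndexOutOfBoundsException here.
                Field field = level.get(index);
                if (dotted.length() > 0) {
                    dotted.append('.');
                }
                dotted.append(field.name);
                level = field.children;
            }
            names.add(dotted.toString());
        }
        return names;
    }

    public static void main(String[] args) {
        // Schema (a, b ROW<x, y>, c), as in the Javadoc example.
        List<Field> schema = Arrays.asList(
                new Field("a"),
                new Field("b", new Field("x"), new Field("y")),
                new Field("c"));
        System.out.println(toFieldNames(new int[][] {{1, 0}, {2}}, schema));
    }
}
```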

nickpan47 · 2026-03-09T23:22:25Z

...flink/src/main/java/com/pinterest/flink/streaming/connectors/psc/table/PscDynamicSource.java

+    // Full nested projection paths for each format field.
+    // Each int[] is a path: [topLevelIndex] for top-level, [topLevelIndex, nestedIndex, ...] for nested.
+    // Used by formats that support nested projection (e.g., Thrift's PartialThriftDeserializer).
+    protected int[][] keyNestedProjection;


Why do we instantiate two separate variables for nested projection? Instead, we can change the definition of keyProjection and valueProjection to int[][] and consolidate the member variables.

Done. I changed keyProjection/valueProjection to int[][] and consolidated the variables, so we no longer need keyNestedProjection and valueNestedProjection.

nickpan47 · 2026-03-12T08:51:30Z

...flink/src/main/java/com/pinterest/flink/streaming/connectors/psc/table/PscDynamicSource.java

+            DataType physicalDataType,
+            @Nullable DecodingFormat<DeserializationSchema<RowData>> keyDecodingFormat,
+            DecodingFormat<DeserializationSchema<RowData>> valueDecodingFormat,
+            int[][] keyProjection,


One main question that I still have here: in the keyProjection / valueProjection arrays for nested projection, the embedded indices are based on the field/sub-field indices in the table DDL schema definition, right? So does that also mean that nested projection relies on the field/sub-field indices of the table DDL schema being exactly the same as those of the Thrift schema? E.g., suppose the Thrift schema has fields 1:f1, 2:f2, 3:f3, 4:f4, and the table DDL has not been updated yet and only has three fields, 1:f1, 2:f2, 3:f4. If the projection is on f1 and f4, the indices will be 1,3. However, because the table DDL and the Thrift schema differ, the projection will resolve to 1:f1 and 3:f3, which is not what was expected.

The only reliable way to implement this is to base the nested projection on field/sub-field names instead of indices. This is what I believe needs to be implemented: a) int[][] keyProjection and int[][] valueProjection are indices into the table DDL schema; b) we find the list of field/sub-field names in the table DDL schema from keyProjection and valueProjection; c) we look up those names in the Thrift schema and convert them to Thrift field/sub-field indices, which form the actual nested projection int[][] that is passed into the partial Thrift deserializer.
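
Steps a) to c) can be sketched roughly as follows. Everything here is hypothetical: `DdlField` and `ThriftField` are simplified stand-ins rather than Flink or Thrift types, DDL positions are 0-based, and the Thrift side is modelled as a name-to-field-id map. The example reuses the f1..f4 scenario from the comment above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; none of these types exist in PSC, Flink, or Thrift. */
final class NameBasedProjectionSketch {

    /** A DDL column: its name plus sub-fields in declaration order (empty for leaves). */
    static final class DdlField {
        final String name;
        final List<DdlField> children;

        DdlField(String name, DdlField... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }
    }

    /** A Thrift field: its numeric field id plus sub-fields keyed by name. */
    static final class ThriftField {
        final int id;
        final Map<String, ThriftField> children = new HashMap<>();

        ThriftField(int id) {
            this.id = id;
        }
    }

    /** Translates one DDL index path into Thrift field ids by matching names level by level. */
    static int[] toThriftPath(int[] ddlPath, List<DdlField> ddl, Map<String, ThriftField> thrift) {
        int[] thriftPath = new int[ddlPath.length];
        List<DdlField> ddlLevel = ddl;
        Map<String, ThriftField> thriftLevel = thrift;
        for (int depth = 0; depth < ddlPath.length; depth++) {
            DdlField ddlField = ddlLevel.get(ddlPath[depth]);        // b) DDL index -> name
            ThriftField thriftField = thriftLevel.get(ddlField.name); // c) name -> Thrift field
            if (thriftField == null) {
                throw new IllegalArgumentException("No Thrift field named " + ddlField.name);
            }
            thriftPath[depth] = thriftField.id;
            ddlLevel = ddlField.children;
            thriftLevel = thriftField.children;
        }
        return thriftPath;
    }

    public static void main(String[] args) {
        // Thrift schema: 1:f1, 2:f2, 3:f3, 4:f4. The DDL lags behind and lacks f3.
        Map<String, ThriftField> thrift = new HashMap<>();
        for (int id = 1; id <= 4; id++) {
            thrift.put("f" + id, new ThriftField(id));
        }
        List<DdlField> ddl = Arrays.asList(
                new DdlField("f1"), new DdlField("f2"), new DdlField("f4"));

        // a) The projection on f1 and f4 is DDL positions 0 and 2 (0-based).
        System.out.println(Arrays.toString(toThriftPath(new int[] {0}, ddl, thrift)));
        System.out.println(Arrays.toString(toThriftPath(new int[] {2}, ddl, thrift)));
    }
}
```

With name matching, the third DDL column resolves to Thrift field 4 (f4) rather than field 3 (f3), which is the mismatch the index-based approach would produce.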

nickpan47

One last comment on moving from index-based to name-based matching between the table DDL and Thrift. PTAL.

KevBrowne force-pushed the enable-flink-projection-pushdown branch from 43f57d1 to 5f25ef7 Compare February 23, 2026 21:41

KevBrowne marked this pull request as ready for review February 24, 2026 14:34

KevBrowne requested a review from a team as a code owner February 24, 2026 14:34

KevBrowne force-pushed the enable-flink-projection-pushdown branch from 5f25ef7 to 813dd37 Compare February 24, 2026 14:39

nickpan47 suggested changes Mar 1, 2026

Enable Flink projection pushdown for PSC connector

125e731

KevBrowne force-pushed the enable-flink-projection-pushdown branch from 813dd37 to 125e731 Compare March 2, 2026 18:05

KevBrowne requested a review from nickpan47 March 2, 2026 18:10

nickpan47 suggested changes Mar 5, 2026

nested logic updated

2daf54e

KevBrowne requested a review from nickpan47 March 6, 2026 04:35

nickpan47 suggested changes Mar 9, 2026

Kevin Browne added 2 commits March 11, 2026 23:06

fix pos ordering

b6bc368

consolidate variables

70b9366

nickpan47 reviewed Mar 12, 2026

nickpan47 suggested changes Mar 12, 2026
