Open
Labels: enhancement (New feature or request)
Description
Search before asking
- I searched in the issues and found nothing similar.
Motivation
Solution
As discussed in PIP-40, we propose introducing a dedicated vector data type in Paimon to better support storage and retrieval of vector data for AI workloads.
PIP-40 can be roughly split into two parts:
(1) introducing the vector data type itself;
(2) allowing users to specify the file format for vector data, to further optimize storage/access efficiency in mixed workloads.
For Part (1), a basic implementation is already available and includes:
- Introducing a new vector type. To avoid confusion with the existing term "Vector" in the codebase, the new type is named VecType (defined in VecType.java).
- Providing a ColumnVector implementation for the vector type, with support in the paimon-arrow module, so Arrow-related file formats (e.g., Lance) can map FixedSizeList to VecType.
- Adding Flink-side compatibility: via configuration, a Flink Array can be stored as Paimon VecType.
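To illustrate why a fixed-length vector type maps naturally onto Arrow's FixedSizeList, here is a minimal, self-contained sketch of the flat layout: all vectors share one contiguous value buffer, and row `i` occupies the slots from `i * dimension` up to `(i + 1) * dimension`. The class `FixedSizeVectorColumn` and its methods are hypothetical names used only for illustration; they are not the VecType or ColumnVector API from the draft. The length check in `append` mirrors the kind of validation that would presumably be needed when a variable-length Flink Array is stored as a fixed-length vector.

```java
import java.util.Arrays;

/**
 * Hypothetical sketch of a fixed-size vector column (not Paimon's actual API).
 * Values are stored back to back in one flat buffer, the same idea as the
 * child buffer of an Arrow FixedSizeList.
 */
public class FixedSizeVectorColumn {
    private final int dimension; // number of elements in every vector
    private float[] values;      // flat buffer holding rowCount * dimension floats
    private int rowCount;

    public FixedSizeVectorColumn(int dimension) {
        if (dimension <= 0) {
            throw new IllegalArgumentException("dimension must be positive");
        }
        this.dimension = dimension;
        this.values = new float[dimension * 4]; // room for 4 rows initially
    }

    /** Appends one vector; rejects arrays whose length differs from the dimension. */
    public void append(float[] vector) {
        if (vector.length != dimension) {
            throw new IllegalArgumentException(
                    "expected length " + dimension + " but got " + vector.length);
        }
        int offset = rowCount * dimension;
        if (offset + dimension > values.length) {
            values = Arrays.copyOf(values, values.length * 2); // grow the buffer
        }
        System.arraycopy(vector, 0, values, offset, dimension);
        rowCount++;
    }

    /** Returns a copy of the vector stored at the given row. */
    public float[] get(int row) {
        if (row < 0 || row >= rowCount) {
            throw new IndexOutOfBoundsException("row " + row);
        }
        int offset = row * dimension; // no per-row offset table is needed
        return Arrays.copyOfRange(values, offset, offset + dimension);
    }

    public int size() {
        return rowCount;
    }

    public static void main(String[] args) {
        FixedSizeVectorColumn column = new FixedSizeVectorColumn(3);
        column.append(new float[] {0.1f, 0.2f, 0.3f});
        column.append(new float[] {1.0f, 2.0f, 3.0f});
        System.out.println(Arrays.toString(column.get(1)));
    }
}
```

Because every row has the same length, a reader can locate any vector with a single multiplication instead of consulting an offsets buffer, which is the main storage advantage over a general variable-length array type.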
For Part (2) (specifying the file format for vector data), work is still in progress.
The overall work is tracked by the following tasks:
- Add a new DataType extension to represent the vector type.
- Map the vector type to Arrow FixedSizeList for Arrow-based file formats.
- Provide compatibility at the Flink connector layer, enabling vector type read/write via Flink SQL.
- Add relevant tests, including an end-to-end (E2E) test.
- Support specifying the file format for the vector store.
Although the code is still in a draft state, it changes some basic interfaces (e.g., DataGetters), so I'd like to discuss it early. Any comments on this, @JingsongLi?
Anything else?
No response
Are you willing to submit a PR?
- I'm willing to submit a PR!