[Feature] Introduce a new vector data type #7011

@ColdL

Search before asking

  • I searched in the issues and found nothing similar.

Motivation

PIP-40

Solution

As discussed in PIP-40, we propose introducing a dedicated vector data type in Paimon to better support storage and retrieval of vector data for AI workloads.

PIP-40 can be roughly split into two parts:
(1) introducing the vector data type itself;
(2) allowing users to specify the file format for vector data, to further optimize storage/access efficiency in mixed workloads.

For Part (1), a basic implementation is already available and includes:

  • Introducing a new vector type. To avoid confusion with the existing term "Vector" in the codebase, the new type is named VecType (VecType.java).
  • Providing a ColumnVector implementation for the vector type, with support in the paimon-arrow module, so Arrow-related file formats (e.g., Lance) can map FixedSizeList to VecType.
  • Adding Flink-side compatibility: via configuration, a Flink Array can be stored as Paimon VecType.
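To make the idea of storing an array as a fixed-length vector concrete, here is a minimal sketch. It is not the actual VecType implementation: the class name VecTypeSketch and the method fromArray are hypothetical, and it assumes float elements and that a vector value is valid only when its element count matches the declared length.

```java
import java.util.Arrays;

/** Hypothetical sketch of a fixed-length vector type; not the real Paimon VecType. */
final class VecTypeSketch {
    // Number of elements every value of this type must have (assumed to be part of the type).
    private final int length;

    VecTypeSketch(int length) {
        if (length <= 0) {
            throw new IllegalArgumentException("Vector length must be positive: " + length);
        }
        this.length = length;
    }

    int length() {
        return length;
    }

    /** Accepts a variable-length array only if it has exactly the declared number of elements. */
    float[] fromArray(float[] array) {
        if (array == null) {
            throw new IllegalArgumentException("Array must not be null");
        }
        if (array.length != length) {
            throw new IllegalArgumentException(
                    "Expected " + length + " elements but got " + array.length);
        }
        // Defensive copy so later changes to the caller's array do not affect the stored value.
        return Arrays.copyOf(array, length);
    }
}
```

The length check is the main difference from a plain array type: a mismatched array is rejected at write time instead of being stored as-is.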

For Part (2) (specifying the file format for vector data), work is still in progress.

  • Add a new DataType extension to represent the vector type.
  • Map the vector type to Arrow FixedSizeList for Arrow-based file formats.
  • Provide compatibility at the Flink connector layer, enabling vector type read/write via Flink SQL.
  • Add relevant tests, including an end-to-end (E2E) test.
  • Support specifying the file format for the vector store.
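As a rough illustration of why a FixedSizeList mapping suits vector data, the sketch below stores all vectors back to back in one flat buffer, so row i starts at offset i * dim and no per-row offsets are needed. It is plain Java and uses neither Paimon nor Arrow APIs. The class name FixedSizeListLayoutSketch is hypothetical, float elements are assumed, and null handling (a validity bitmap in Arrow) is left out.

```java
/** Hypothetical sketch of a fixed-size-list column layout; not Paimon or Arrow code. */
final class FixedSizeListLayoutSketch {
    private final int dim;        // elements per vector (the fixed list size)
    private final float[] values; // all vectors stored contiguously, row after row

    FixedSizeListLayoutSketch(int dim, float[][] rows) {
        if (dim <= 0) {
            throw new IllegalArgumentException("dim must be positive: " + dim);
        }
        this.dim = dim;
        this.values = new float[rows.length * dim];
        for (int i = 0; i < rows.length; i++) {
            if (rows[i].length != dim) {
                throw new IllegalArgumentException(
                        "Row " + i + " has " + rows[i].length + " elements, expected " + dim);
            }
            // Row i occupies the half-open range [i * dim, (i + 1) * dim) in the flat buffer.
            System.arraycopy(rows[i], 0, values, i * dim, dim);
        }
    }

    int rowCount() {
        return values.length / dim;
    }

    /** Returns a copy of the vector stored at the given row. */
    float[] get(int row) {
        if (row < 0 || row >= rowCount()) {
            throw new IndexOutOfBoundsException("row " + row);
        }
        float[] out = new float[dim];
        System.arraycopy(values, row * dim, out, 0, dim);
        return out;
    }
}
```

For example, with dim = 2 and rows {1, 2} and {3, 4}, the flat buffer holds 1, 2, 3, 4 and get(1) reads the two elements starting at offset 2.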

Although the code is still in a draft state, it changes some basic interfaces (e.g., DataGetters), so I'd like to discuss it early. Any comments on this, @JingsongLi?

Anything else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!

Labels: enhancement (New feature or request)