DuckDB Paimon Extension 🦆

This extension enables DuckDB to read and query Apache Paimon format data directly — no ETL pipelines, no Flink/Spark clusters required. Just open a DuckDB shell and run SQL against your Paimon tables.

Like DuckDB's existing lake-format extensions (e.g. Iceberg), duckdb-paimon brings DuckDB's powerful local analytics to the Paimon data lake ecosystem.

About Apache Paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations. It innovatively combines lake format and LSM structure, bringing realtime streaming updates into the lake architecture.

Implementation

This extension is built on top of paimon-cpp, an open-source C++ library that provides native access to Paimon format data. It is the first library that brings native Paimon read/write capabilities to the C++ ecosystem.

  • Zero JVM dependency — No Java runtime required. Pure C++ implementation means minimal memory footprint and instant startup.
  • Apache Arrow data exchange — Data flows between paimon-cpp and DuckDB via Apache Arrow, the industry standard for columnar in-memory data, enabling zero-copy transfers with no serialization overhead.
  • Parallel scan architecture — Paimon tables are split into independent Splits, and DuckDB's multi-threaded execution engine reads them in parallel to fully utilize multi-core CPUs.
  • Secure credential management — OSS credentials are managed through DuckDB's native Secret Manager with scope isolation and automatic key redaction.
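For instance, the degree of scan parallelism simply follows DuckDB's own thread setting, with no Paimon-specific tuning needed (a sketch using the sample table bundled under data/):

```sql
-- Independent Paimon splits are read by DuckDB's worker threads;
-- raising the thread count raises scan parallelism.
SET threads = 8;
SELECT count(*) FROM paimon_scan('./data/testdb.db/testtbl');
```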

Features

  • Read Paimon table data (local and remote OSS)
  • Projection pushdown optimization
  • Predicate pushdown optimization
  • Multiple file formats supported for both manifest and data files
  • Catalog ATTACH support
  • DuckDB Secret-based OSS credential management
  • Snapshot history inspection
  • Snapshot-based time travel queries
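The pushdown optimizations require no special syntax; a plain query over the bundled sample table exercises both (column names as shown in the Usage section below):

```sql
-- Only f0 and f3 are materialized (projection pushdown), and the
-- f1 filter is evaluated inside the Paimon scan (predicate pushdown).
SELECT f0, f3
FROM paimon_scan('./data/testdb.db/testtbl')
WHERE f1 = 2;
```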

Use Cases

Lightweight Ad-hoc Queries on Realtime Lakehouses

Data is written into Paimon by Flink in real time. Analysts can query it directly on OSS using DuckDB + duckdb-paimon — no compute cluster needed, reducing query latency from minutes to seconds.

Data Validation & Quality Checks

Use DuckDB in CI/CD pipelines to run data quality assertions on Paimon tables, verifying that Flink job outputs meet expectations. Lightweight, fast, and dependency-free.
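As a sketch, a CI step might run an assertion query and fail the build when it returns a non-zero count (the table path and the amount column here are hypothetical):

```sql
-- Hypothetical quality check: orders written by Flink must have
-- positive amounts; the CI script fails the build if bad_rows > 0.
SELECT count(*) AS bad_rows
FROM paimon_scan('oss://my-bucket/warehouse/prod.db/orders')
WHERE amount <= 0;
```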

Data Exploration & Debugging

Data engineers developing Flink jobs can instantly inspect the current state of Paimon tables using DuckDB Shell, quickly locating data issues — far more efficient than launching a Flink SQL Client.

Cross-format Federated Queries

DuckDB natively supports Parquet, CSV, JSON, Iceberg, and more. Combined with duckdb-paimon, you can JOIN Paimon tables with other data sources without any data movement:

-- Join a Paimon orders table with a local CSV dimension table
SELECT o.order_id, o.amount, c.customer_name
FROM paimon_scan('oss://...', 'db', 'orders') o
JOIN read_csv('customers.csv') c ON o.customer_id = c.id;

Development Guide

Building

Clone the repository with submodules:

git clone --recurse-submodules https://github.com/polardb/duckdb-paimon.git
cd duckdb-paimon

--recurse-submodules pulls DuckDB and paimon-cpp, which are required to build the extension.

Build in release mode:

GEN=ninja make

Or build in debug mode:

GEN=ninja make debug

Running the Tests

# Release
make test

# Debug
make test_debug

Usage

The examples below use sample data bundled in the data/ directory of this repository. Start the DuckDB shell with the extension pre-loaded:

./build/release/duckdb

Query Local Paimon Tables

Pass the table path directly to paimon_scan, or use separate warehouse / database / table arguments:

SELECT * FROM paimon_scan('./data/testdb.db/testtbl');
┌─────────┬───────┬───────┬────────┐
│   f0    │  f1   │  f2   │   f3   │
│ varchar │ int32 │ int32 │ double │
├─────────┼───────┼───────┼────────┤
│ Alice   │     1 │     0 │   11.0 │
│ Bob     │     1 │     1 │   12.1 │
│ Cathy   │     1 │     2 │   13.2 │
│ David   │     2 │     0 │   21.0 │
│ Eve     │     2 │     1 │   22.1 │
│ Frank   │     2 │     2 │   23.2 │
│ Grace   │     3 │     0 │   31.0 │
│ Henry   │     3 │     1 │   32.1 │
│ Iris    │     3 │     2 │   33.2 │
└─────────┴───────┴───────┴────────┘

-- SELECT * FROM paimon_scan('./data', 'testdb', 'testtbl');

Query Remote OSS Paimon Tables

First create a secret to supply OSS credentials, then query using either a full table path or separate warehouse / database / table arguments:

-- Configure OSS credentials
CREATE SECRET my_oss (
    TYPE paimon,
    key_id 'your-access-key-id',
    secret 'your-access-key-secret',
    endpoint 'oss-cn-hangzhou.aliyuncs.com'
);

SELECT * FROM paimon_scan('oss://your-bucket/warehouse/your_db.db/your_table');
SELECT * FROM paimon_scan('oss://your-bucket/warehouse', 'your_db', 'your_table');
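Secrets can also be restricted to a path prefix with DuckDB's standard SCOPE option, so the credentials are only applied to matching oss:// paths (bucket name hypothetical):

```sql
-- Scoped secret: used only for paths under oss://prod-bucket/
CREATE SECRET prod_oss (
    TYPE paimon,
    key_id 'prod-access-key-id',
    secret 'prod-access-key-secret',
    endpoint 'oss-cn-hangzhou.aliyuncs.com',
    SCOPE 'oss://prod-bucket/'
);
```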

Attach as Catalog

ATTACH a Paimon warehouse as a catalog to browse and query all databases and tables inside it with standard DuckDB SQL:

ATTACH './data' AS my_catalog (TYPE paimon);

SHOW ALL TABLES;
SELECT * FROM my_catalog.testdb.testtbl;

-- For an OSS warehouse:
-- ATTACH 'oss://my-bucket/warehouse' AS my_catalog (TYPE paimon);
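Once attached, the warehouse is also visible through DuckDB's standard catalog metadata, which is handy for scripted discovery:

```sql
-- List every table the attached Paimon catalog exposes
SELECT table_schema, table_name
FROM information_schema.tables
WHERE table_catalog = 'my_catalog';
```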

Inspect Snapshot History

Use paimon_snapshots to list all snapshots of a Paimon table — useful for auditing commit history, diagnosing data issues, or identifying a snapshot ID for time-travel queries:

SELECT snapshot_id, commit_kind, commit_time, total_record_count
FROM paimon_snapshots('./data/testdb.db/testtbl')
ORDER BY snapshot_id;
┌─────────────┬─────────────┬─────────────────────────┬────────────────────┐
│ snapshot_id │ commit_kind │       commit_time       │ total_record_count │
│    int64    │   varchar   │        timestamp        │       int64        │
├─────────────┼─────────────┼─────────────────────────┼────────────────────┤
│           1 │ APPEND      │ 2026-01-15 10:48:23.486 │                  3 │
│           2 │ APPEND      │ 2026-01-15 10:48:23.509 │                  6 │
│           3 │ APPEND      │ 2026-01-15 10:48:23.528 │                  9 │
└─────────────┴─────────────┴─────────────────────────┴────────────────────┘

-- SELECT snapshot_id, commit_kind, commit_time, total_record_count
-- FROM paimon_snapshots('oss://your-bucket/warehouse', 'your_db', 'your_table')
-- ORDER BY snapshot_id;
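For scripting, the newest snapshot ID can be picked up directly, e.g. as input to a time-travel query:

```sql
-- Latest committed snapshot of the sample table
SELECT max(snapshot_id) AS latest_snapshot
FROM paimon_snapshots('./data/testdb.db/testtbl');
```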

Time Travel Queries

Query a historical version of a table by snapshot ID or by timestamp. Use paimon_snapshots first to identify the snapshot you want.

-- Read from a specific snapshot (6 rows — state after the second append)
SELECT * FROM paimon_scan('./data/testdb.db/testtbl', snapshot_from_id=2);

-- Read from a point in time (returns the snapshot active at that moment)
SELECT * FROM paimon_scan('./data/testdb.db/testtbl', snapshot_from_timestamp=TIMESTAMP '2026-01-15 10:48:23.5');

When using an ATTACHed catalog, the same functionality is available via DuckDB's native AT clause:

ATTACH './data' AS my_catalog (TYPE paimon);

-- AT (VERSION => snapshot_id)
SELECT count(*) FROM my_catalog.testdb.testtbl AT (VERSION => 2);

-- AT (TIMESTAMP => point_in_time)
SELECT count(*) FROM my_catalog.testdb.testtbl AT (TIMESTAMP => TIMESTAMP '2026-01-15 10:48:23.5');
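Snapshots can also be compared against each other. Assuming append-only commits (as in the sample data), a plain set difference shows the rows a commit added; this is an illustrative sketch, not a dedicated API:

```sql
-- Rows appended between snapshot 1 and snapshot 2
SELECT * FROM paimon_scan('./data/testdb.db/testtbl', snapshot_from_id=2)
EXCEPT
SELECT * FROM paimon_scan('./data/testdb.db/testtbl', snapshot_from_id=1);
```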


Join the Community

We welcome contributions and discussions! If you have questions, ideas, or want to connect with other users and developers, join our DingTalk community group.

Generated from duckdb/extension-template