sandboxws/awesome-flink

Awesome Flink

A curated list of awesome Apache Flink frameworks, libraries, connectors, tools, and resources.

Apache Flink is an open-source unified stream and batch data processing framework with powerful state management, event-time semantics, and exactly-once guarantees.

Contents

Packages & Libraries

DSLs & Frameworks

Official Connectors

Connectors maintained under the Apache Flink project or Apache umbrella.

Community Connectors

Connectors maintained by database vendors or independent developers.

  • ClickHouse Connector - Flink connector for ClickHouse, maintained by the ClickHouse team.
  • ClickHouse Connector (itinycheng) - Flink SQL connector for ClickHouse with catalog support, read/write for complex types.
  • StarRocks Connector - Read/write connector for StarRocks with DataStream, Table API, SQL, and Flink CDC 3.0 support.
  • Doris Connector - Flink connector for Apache Doris maintained by the Doris community.
  • Redis Connector - Async Redis connector built on Lettuce, supporting SQL join/sink with query caching.
  • HTTP Connector - Source and sink for REST APIs with DataStream, Table, and SQL support.
  • Snowflake Connector - Flink connector for Snowflake, maintained by DeltaStream.
  • NATS Connector - Connector for NATS messaging, maintained by Synadia.
  • OceanBase Connector - Flink connector for OceanBase distributed database.
  • NebulaGraph Connector - Flink connector for NebulaGraph graph database.
  • QuestDB Connector - Flink sink for QuestDB time-series database using InfluxDB Line Protocol.
  • TiBigData - TiDB connectors for Flink Table API, maintained by the TiDB incubator.

Machine Learning

  • Flink ML - Official machine learning library for Flink with algorithms for classification, regression, and clustering.
  • Alink - Alibaba's machine learning platform built on Flink for batch and stream processing.
  • dl-on-flink - Deep learning framework integration (TensorFlow, PyTorch) running on Flink.

Complex Event Processing

  • Flink CEP - Built-in Complex Event Processing library for detecting patterns in event streams.
  • Flink CEP SQL - SQL-based pattern matching using the MATCH_RECOGNIZE clause.

State Backends

  • RocksDB State Backend - Production-grade state backend for large state using embedded RocksDB.
  • ForSt State Backend - Next-generation state backend (ForSt, short for "For Streaming", is derived from RocksDB) designed for disaggregated state on remote storage.

Testing & Quality

  • flink-testing - Official testing utilities including MiniCluster, test harnesses, and test sources/sinks.

Monitoring & Observability

  • Flink Reactor Console - Real-time dashboard and GraphQL server for managing Apache Flink clusters.
  • Flink Metrics System - Built-in metrics system with reporters for Prometheus, Graphite, Datadog, and more.
  • Flink Web UI - Built-in dashboard for monitoring job status, backpressure, checkpoints, and task managers.

Flink SQL

Tools & Frameworks

  • Flink Reactor DSL - Write streaming pipelines as TypeScript components. Compile to Flink SQL + Kubernetes CRDs.
  • Flink SQL Gateway - REST service for submitting Flink SQL statements remotely over a standard API.
  • Flink SQL Client - Interactive CLI for writing and executing Flink SQL queries against running clusters.

Connectors & Catalogs

  • Hive Catalog - Persistent catalog using Hive Metastore for managing Flink SQL metadata.
  • Paimon Catalog - Native catalog integration for Apache Paimon streaming lakehouse tables.
  • Iceberg Catalog - Catalog implementation for managing Apache Iceberg tables in Flink SQL.
  • JDBC Catalog - Catalog for exposing existing relational database tables as Flink SQL tables.

Tutorials & Examples

UDFs & Extensions

  • Flink UDF Documentation - Official guide for implementing scalar, table, and aggregate user-defined functions.
  • flink-faker - Table source for generating fake test data using SQL DDL with Datafaker expressions.

Flink 2.x

Flink 2.0 marked a major milestone — the biggest release in the project's history with 165 contributors, 25 FLIPs, and sweeping architectural changes including disaggregated state management, API modernization, and the removal of legacy APIs. Subsequent 2.x releases have continued the push toward unified real-time data and AI workloads.

What Changed from 1.x

Flink 2.0 is a breaking release. Key removals and changes to be aware of:

  • Flink 2.0 Release Notes - Complete list of breaking changes and migration notes.
  • Upgrading Applications and Flink Framework - Official guide for upgrading from 1.x to 2.x.
  • FLIP-458: Long-Term Support for Final 1.x Release - LTS plan for the final Flink 1.x version to ease migration.
  • Removed APIs: DataSet API, Scala DataStream/DataSet APIs, SourceFunction/SinkFunction/Sink V1, TableSource/TableSink, FsStateBackend, MemoryStateBackend, and per-job deployment mode — 210+ deprecated classes removed in total.
  • Java version changes: Java 8 dropped. Java 17 is the new default. Java 11 (minimum) and Java 21 are supported.
  • Configuration overhaul: The legacy flink-conf.yaml is replaced by config.yaml, which uses standard YAML syntax. A migration tool is provided.
  • State compatibility: Recovery from 1.x savepoints may require migration strategies — state compatibility is not guaranteed across the major version boundary.
  • Connector impact: Connectors depending on SourceFunction/SinkFunction/Sink V1 do not work on 2.x. Official connectors (Kafka, Paimon, JDBC, Elasticsearch) shipped 2.x-compatible versions at launch.
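
To make the configuration change above concrete, here is a minimal sketch of how flat, dotted keys in the style of the legacy flink-conf.yaml map onto the nested layout that standard YAML allows in config.yaml. It is not the official migration tool. The helper name `nest_flat_keys` and the sample values are assumptions chosen for illustration, and the sketch does not handle the case where one key is a prefix of another.

```python
def nest_flat_keys(flat):
    """Turn {"a.b.c": v} into {"a": {"b": {"c": v}}}.

    Illustrative only: assumes no key is a strict prefix of another
    (for example "a.b" together with "a.b.c"), which would need special handling.
    """
    nested = {}
    for dotted_key, value in flat.items():
        *parents, leaf = dotted_key.split(".")
        node = nested
        for part in parents:
            # Create the intermediate mapping on first use, then descend into it.
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


# Sample values are hypothetical; the key names follow the legacy flat style.
legacy = {
    "jobmanager.rpc.address": "localhost",
    "taskmanager.numberOfTaskSlots": 4,
    "state.backend.type": "rocksdb",
}

nested = nest_flat_keys(legacy)
print(nested)
```

For real upgrades, prefer the migration tool that ships with Flink; this sketch only shows the shape of the change.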

Flink 2.0

Released March 2025. A ground-up modernization of the Flink runtime and API surface.

Headline features:

  • Disaggregated State Management — Decouples state storage from compute using the new ForSt state backend over distributed file systems. Enables asynchronous, non-blocking state access and fast rescaling for jobs with hundreds of terabytes of state. In Nexmark benchmarks, throughput reaches 75-120% of that of traditional local state stores.
  • DataStream V2 API (experimental) — New DataStream replacement with ProcessFunction, partitioning primitives, state, time services, and watermark processing.
  • Materialized Tables — Unified real-time and historical data management through a single pipeline with schema/query updates without reprocessing. Native Kubernetes/YARN submission and Paimon integration for ACID transactions.
  • Adaptive Batch Execution — Dynamic broadcast join selection and automatic join skew optimization, achieving 8-16% TPC-DS benchmark improvements.
  • AI/ML in CDC — Flink CDC 3.3 with dynamic AI model invocation (OpenAI chat/embedding models) and specialized SQL syntax for defining and invoking AI models.
  • SQL enhancements — QUALIFY clause for window function filtering, SQL Gateway in application mode, seven critical SQL operators with async state access.
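
To illustrate the window function filtering mentioned above, here is a minimal sketch of the query shape that a QUALIFY clause shortens: ranking rows per key with ROW_NUMBER() and keeping only the top row. It runs on Python's built-in sqlite3 module purely so it can be tried without a cluster. SQLite does not support QUALIFY, so the sketch uses the wrapping-subquery form. The `orders` table and its columns are hypothetical, and the QUALIFY form in the comment follows the general SQL syntax for that clause rather than a verified Flink statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (user_id TEXT, amount INTEGER, ts INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("alice", 10, 1), ("alice", 25, 2), ("bob", 7, 1), ("bob", 3, 3)],
)

# Subquery form: compute the row number per user, then filter on it outside.
# With QUALIFY (assumed syntax), the outer SELECT would be unnecessary:
#   SELECT user_id, amount FROM orders
#   QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts DESC) = 1
latest_per_user = conn.execute(
    """
    SELECT user_id, amount
    FROM (
        SELECT user_id, amount,
               ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts DESC) AS rn
        FROM orders
    )
    WHERE rn = 1
    ORDER BY user_id
    """
).fetchall()

print(latest_per_user)  # [('alice', 25), ('bob', 3)]
```

The subquery form is the same pattern commonly used for deduplication and Top-N queries. The sketch needs SQLite 3.25 or newer for window function support.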

Flink 2.1

Released July 2025. Focused on unified real-time data and AI, with major SQL and streaming improvements.

Headline features:

  • Model DDLs & ML_PREDICT — Define AI models as catalog objects and invoke them via ML_PREDICT for real-time inference in SQL queries, with built-in OpenAI support.
  • Process Table Functions (PTFs) — Stateful transformations with managed state, event-time, timers, and changelog access — capabilities that previously required DataStream expertise, now accessible from SQL.
  • Variant Type — Semi-structured data type for deeply nested or evolving schemas, with native Paimon integration.
  • Delta Join — New streaming join operator requiring significantly less state than regular joins, enabled by default.
  • StreamingMultiJoinOperator — Zero intermediate state for cascaded joins sharing common keys.
  • SQL Connector for Keyed State — Query keyed state from checkpoints and savepoints directly via Flink SQL.

Flink 2.2

Released December 2025. The latest stable release, advancing AI integration and operational maturity.

Headline features:

  • ML_PREDICT in Table API — Model inference now available programmatically beyond SQL.
  • VECTOR_SEARCH — Real-time vector similarity search within Flink SQL for AI-powered retrieval.
  • Materialized Table Enhancements — Optional FRESHNESS clause, bucketing via DISTRIBUTED BY, SHOW MATERIALIZED TABLES, and customizable defaults via MaterializedTableEnricher.
  • SinkUpsertMaterializer V2 — Fixes exponential performance degradation in changelog reconciliation.
  • Delta Join Improvements — Expanded SQL pattern support, CDC source support, and caching to reduce external storage requests.
  • Operational improvements — Balanced task scheduling across TaskManagers, time-based job history retention, RateLimiter for scan sources, and balanced splits assignment for addressing data skew.

Resources

Official Resources

Books

Courses & Tutorials

Papers

Blogs

  • Robin Moffatt (rmoff.net) - In-depth Flink SQL tutorials covering watermarks, joins, changelogs, Iceberg integration, and CDC.
  • Ververica Blog - Technical blog from the original Flink creators covering architecture, best practices, and ecosystem.
  • Confluent - Apache Flink - Confluent's Apache Flink product page with resources on managed Flink and streaming architectures.
  • Flink Community Blog - Official Apache Flink blog with release notes and community updates.

Videos & Talks

  • Flink Forward - Annual conference dedicated to Apache Flink with talks from 2015 to 2025.
  • Flink Forward YouTube - Recorded talks and presentations from Flink Forward conferences.
  • Apache Flink YouTube - Official Apache Flink YouTube channel with tutorials and community talks.

Community

Related Projects

  • Apache Kafka - Distributed event streaming platform commonly used as a source and sink for Flink.
  • Apache Spark Structured Streaming - Alternative stream processing framework with micro-batch semantics.
  • Apache Beam - Unified model for batch and stream processing that can use Flink as a runner.
  • Apache Kafka Streams - Lightweight stream processing library built on Kafka.
  • Materialize - Streaming SQL database powered by Timely Dataflow.
  • RisingWave - Distributed SQL streaming database for real-time analytics.

Contributing

Contributions are welcome! Please read the contribution guidelines before submitting a pull request.

License

CC BY-SA 4.0

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
