gordonmurray/flink_paimon_duckdb_rill

Real-Time Analytics Pipeline with Flink, Paimon, and Rill

A complete streaming analytics stack that captures MySQL changes via CDC, stores them in Apache Paimon format, and visualizes them with Rill dashboards.

πŸš€ What You Get

  • Real-Time CDC: Captures every MySQL change using Flink CDC
  • Lake Storage: Stores data in Apache Paimon format on S3-compatible storage
  • Live Dashboard: Rill analytics with automated catalog management
  • Automated Fixes: Sidecar container handles DuckDB catalog prefix issues
  • One Command Start: Everything runs with docker compose up

πŸ—οΈ Architecture

MySQL β†’ Flink CDC β†’ Apache Paimon β†’ MinIO β†’ Rill Dashboard
  ↑                                           ↓
  Manual inserts                          Analytics

Components:

  • MySQL/MariaDB: Source database with sample product data
  • Apache Flink: Real-time CDC processing engine
  • Apache Paimon: Lake storage format optimized for streaming
  • MinIO: S3-compatible object storage
  • Rill: Modern analytics dashboard with DuckDB engine
  • Rill Patcher: Automated sidecar handling catalog prefix issues

⚑ Quick Start

Prerequisites

  • Docker and Docker Compose
  • 8GB+ RAM recommended
  • Ports 3000, 3306, 8081, 9000-9001 available
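To check the port requirement up front, a quick pre-flight sketch (this helper is not part of the repo; it relies on bash's /dev/tcp pseudo-device, and the port list is copied from above):

```shell
# Returns success if nothing is listening on the given localhost port.
# The connect via /dev/tcp succeeds only when a listener exists.
port_free() {
  ! (echo > "/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

# Report any required port that is already taken.
for p in 3000 3306 8081 9000 9001; do
  port_free "$p" || echo "port $p is already in use"
done
```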

1. Clone and Start

git clone https://github.com/gordonmurray/flink_paimon_duckdb_rill.git
cd flink_paimon_duckdb_rill
docker compose up -d

2. Initialize the CDC Pipeline

./setup_cdc.sh

3. Open the Dashboard

Navigate to: http://localhost:3000

The dashboard will show live data with automatic 60-second refresh.

πŸ§ͺ Test Real-Time Updates

Add new products to see live updates:

# Add some products
docker exec mariadb mysql -u root -prootpassword -e "
INSERT INTO mydatabase.products (name, price) VALUES
('New Product 1', 99.99),
('New Product 2', 199.99);"

# Check MySQL count
docker exec mariadb mysql -u root -prootpassword -e "SELECT COUNT(*) FROM mydatabase.products;"

# Wait 60 seconds for dashboard to refresh
# You'll see the updated count automatically!

πŸ”§ How It Works

CDC Pipeline

  1. MySQL Changes: Any INSERT/UPDATE/DELETE in MySQL is captured
  2. Flink Processing: Flink CDC reads the MySQL binlog in real time
  3. Paimon Storage: Changes are written to Paimon tables in MinIO
  4. Rill Dashboard: Visualizes data with 60-second refresh cycle

The Catalog Prefix Solution

DuckDB creates random catalog prefixes (e.g., main8514e79c) on startup. Our rill-patcher sidecar:

  1. Waits for Rill to start
  2. Discovers the current catalog alias via SQL
  3. Patches the model file with the correct prefix
  4. Refreshes data every 60 seconds
  5. Re-patches if Rill restarts with a new prefix
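The loop above hinges on one operation: rewriting the stale prefix in the model file. A minimal sketch of that step (the function name, file layout, and the `main<hex>` pattern are assumptions for illustration, not the repo's actual rill-patcher.sh):

```shell
# Rewrite any stale DuckDB catalog prefix in a model file to the
# freshly discovered one, e.g. main8514e79c -> main1a2b3c4d.
# Assumes prefixes always look like "main" followed by hex digits.
patch_model() {
  local new_prefix="$1" model_file="$2"
  sed -i -E "s/main[0-9a-f]+/${new_prefix}/g" "$model_file"
}
```

Because the replacement matches the same pattern it writes, re-running the patcher after a Rill restart is safe: an already-correct prefix is simply rewritten to itself.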

Why Apache Paimon?

  • Optimized for streaming updates with ACID guarantees
  • Supports both batch and streaming workloads
  • Compatible with multiple query engines
  • Efficient storage with automatic compaction

πŸ“Š Monitoring

Service Health Checks

# Check all containers
docker ps

# Monitor CDC job
curl -s http://localhost:8081/jobs | jq

# Test Rill Dashboard API
curl -s "http://localhost:3000/v1/instances/default/query" \
  -H "Content-Type: application/json" \
  -d '{"sql":"SELECT COUNT(*) FROM paimon_products"}'

# View Paimon files in MinIO
docker exec minio mc ls --recursive local/warehouse/

Data Flow Verification

# MySQL data
docker exec mariadb mysql -u root -prootpassword -e "SELECT COUNT(*) FROM mydatabase.products;"

# MinIO storage
docker exec minio mc ls --recursive local/warehouse/cdc_db.db/products_sink/

# Rill dashboard count
curl -s "http://localhost:3000/v1/instances/default/query" \
  -H "Content-Type: application/json" \
  -d '{"sql":"SELECT COUNT(*) FROM paimon_products"}' | jq '.data[0]'

πŸ› οΈ Development

Project Structure

β”œβ”€β”€ docker-compose.yml          # Complete stack definition
β”œβ”€β”€ conf/
β”‚   └── flink-conf.yaml        # Flink configuration
β”œβ”€β”€ rill/
β”‚   β”œβ”€β”€ connectors/           # DuckDB S3 configuration
β”‚   β”œβ”€β”€ models/               # SQL model definitions
β”‚   β”œβ”€β”€ metrics/              # Metrics definitions
β”‚   └── dashboards/           # Dashboard configs
β”œβ”€β”€ rill-patcher.sh           # Automated catalog management
β”œβ”€β”€ duckdb/
β”‚   └── test_s3.py            # DuckDB query examples
β”œβ”€β”€ sql/
β”‚   β”œβ”€β”€ init.sql              # MySQL initial data
β”‚   └── setup_paimon_cdc.sql  # CDC pipeline setup
└── setup_cdc.sh              # CDC initialization script

Key Configuration Files

Flink Config (conf/flink-conf.yaml):

  • Configures Flink job manager and task manager
  • Sets checkpointing intervals
  • Defines S3/MinIO credentials
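As a rough illustration, the relevant settings look like this (values here are assumptions based on the docker-compose defaults, not the repo's actual conf/flink-conf.yaml):

```yaml
# Illustrative fragment only -- see conf/flink-conf.yaml for the real values.
jobmanager.rpc.address: jobmanager
execution.checkpointing.interval: 10s
s3.endpoint: http://minio:9000
s3.access-key: minioadmin
s3.secret-key: minioadmin
s3.path.style.access: true
```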

CDC Setup (sql/setup_paimon_cdc.sql):

  • Creates Paimon catalog
  • Defines source MySQL table
  • Creates sink Paimon table
  • Starts CDC pipeline
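Those four steps map onto a handful of Flink SQL statements. An illustrative sketch (the column list, credentials, and connector options are assumptions; the authoritative version is sql/setup_paimon_cdc.sql):

```sql
-- 1. Paimon catalog backed by the MinIO warehouse bucket
CREATE CATALOG paimon WITH (
  'type' = 'paimon',
  'warehouse' = 's3://warehouse/'
);

-- 2. MySQL CDC source reading the binlog of mydatabase.products
CREATE TABLE products_source (
  id INT,
  name STRING,
  price DECIMAL(10, 2),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'mariadb',
  'port' = '3306',
  'username' = 'root',
  'password' = 'rootpassword',
  'database-name' = 'mydatabase',
  'table-name' = 'products'
);

-- 3. Paimon sink table
CREATE DATABASE IF NOT EXISTS paimon.cdc_db;
CREATE TABLE IF NOT EXISTS paimon.cdc_db.products_sink (
  id INT,
  name STRING,
  price DECIMAL(10, 2),
  PRIMARY KEY (id) NOT ENFORCED
);

-- 4. The streaming job that keeps the sink in sync
INSERT INTO paimon.cdc_db.products_sink
SELECT * FROM products_source;
```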

🚨 Troubleshooting

Common Issues

CDC Pipeline not starting

# Check if the job started:
curl -s http://localhost:8081/jobs | jq

# If not, run setup again:
./setup_cdc.sh

No data in MinIO

# Check Flink job status
curl -s http://localhost:8081/jobs

# Restart CDC setup
./setup_cdc.sh

Verify data flow

# Check Flink job metrics
curl -s http://localhost:8081/jobs/<job-id>/metrics

# List Paimon files
docker exec minio mc ls local/warehouse/cdc_db.db/

Clean Restart

# Complete reset
docker compose down -v
docker compose up -d
./setup_cdc.sh
# Wait 2-3 minutes for full initialization

Built with: Apache Flink • Apache Paimon • Rill • DuckDB
