This repository demonstrates several patterns for streaming data from Kafka into DuckDB (and DuckLake) using Python and Spark. It provides practical examples for ingesting, processing, and aggregating event data in real time, suitable for analytics and prototyping modern data pipelines. For a full write-up and context read the blogpost TODDO.
pipelines/pattern_1_1.py: Basic streaming from Kafka to DuckDB with periodic aggregation.pipelines/pattern_1_2.py: Streaming from Kafka to DuckLake with Change Data Feed aggregation.pipelines/pattern_2.py: Streaming from Kafka to DuckDB using Spark Structured Streaming.pipelines/bonus_pattern.py: Streaming views in DuckDB using the Tributary extension.scripts/producer.py: Produces random user event messages to a Kafka topic.scripts/cleanup.py: Cleans up local DuckDB/DuckLake files and deletes Kafka topics.
- Python 3.8+
- Docker (for running Kafka)
- Java (for Spark)
- Install dependencies:
pip install -r requirements.txt
Start the Kafka broker using Docker Compose:
docker compose up -d
Run the producer to generate random user events:
python scripts/producer.py --bootstrap-servers localhost:9092 --topic my_topic --duration 60
Example: Run Pattern 1.1 (DuckDB streaming and aggregation)
python pipelines/pattern_1_1.py --bootstrap-servers localhost:9092 --topic my_topic --duration 60
Other patterns can be run similarly (add --bootstrap-servers, --topic, and --duration as needed):
- DuckLake:
python pipelines/pattern_1_2.py --bootstrap-servers localhost:9092 --topic my_topic --duration 60 - Spark:
python pipelines/pattern_2.py --bootstrap-servers localhost:9092 --topic my_topic --duration 60 - Tributary:
python pipelines/bonus_pattern.py --bootstrap-servers localhost:9092 --topic my_topic --duration 60
To remove generated databases and Kafka topics:
python scripts/cleanup.py --bootstrap-servers localhost:9092 --topic my_topic
- The default Kafka topic is
my_topic. You can change this via command-line arguments. - DuckLake and Tributary extensions are installed automatically by the scripts.