- This is an educational project for learning purposes only. It is not intended for production use.
Author: Mahmudur R Manna
This project sets up and deploys Apache Flink with Apache Beam on Minikube, backed by MinIO, for scalable data processing pipelines. It provides a fully containerized, Kubernetes-based infrastructure designed for developing, testing, and prototyping ML workflows.
- Minikube Cluster - A local Kubernetes cluster to host all components.
- Flink JobManager - Manages job execution and coordinates with TaskManagers.
- Flink TaskManager - Executes distributed data processing tasks as directed by the JobManager.
- Beam Job Server - Accepts Apache Beam jobs and submits them to Flink for execution.
- Data Ingestion App - Submits jobs to the Beam Job Server for processing data.
- MinIO DataLake - Provides object storage for raw and processed data.
- The Data Ingestion App submits processing jobs to the Beam Job Server.
- The Beam Job Server delegates job execution to the Flink JobManager.
- The Flink JobManager coordinates distributed execution with Flink TaskManagers.
- The Data Ingestion App interacts with MinIO for raw data storage.
- The Flink TaskManagers interact with MinIO for storing processed data.
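The flow above can be sketched from the client side. The snippet below builds the CLI-style options a Data Ingestion App would pass to Beam's PortableRunner so the pipeline is handed to the Beam Job Server and translated for Flink. The job-server address, worker-pool port, and output path are illustrative assumptions, not values taken from this repository.

```python
def beam_pipeline_args(job_endpoint="beam-jobserver:8099",
                       output="s3://processed-data/out"):
    """Build Apache Beam options that target a Beam Job Server.

    PortableRunner ships the pipeline to the job server, which submits
    it to Flink. environment_type=EXTERNAL tells the runner to use an
    already-running SDK worker pool at the given address (assumed here).
    """
    return [
        "--runner=PortableRunner",
        f"--job_endpoint={job_endpoint}",
        "--environment_type=EXTERNAL",
        "--environment_config=localhost:50000",
        f"--output={output}",
    ]

# With apache-beam installed, these args would drive a pipeline, e.g.:
#   with beam.Pipeline(options=PipelineOptions(beam_pipeline_args())) as p:
#       p | beam.Create(["raw record"]) | beam.io.WriteToText("s3://...")
```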
- Minikube: Kubernetes cluster manager for local setups.
- kubectl: Kubernetes CLI for managing cluster resources.
- Docker: Container platform for building and running containerized applications.
- Install Minikube:
- macOS:
brew install minikube
- Linux:
curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
sudo install minikube-linux-amd64 /usr/local/bin/minikube
- Windows: Download from Minikube Releases.
- Start Minikube:
minikube start
- Install kubectl:
- macOS:
brew install kubectl
- Linux:
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
- Windows: Download from Kubernetes Releases.
- Verify kubectl:
kubectl version --client
- Install Docker:
- macOS:
brew install --cask docker
- Linux: Follow instructions on Docker Website.
- Windows: Download from Docker Website.
- Start Docker Service:
sudo systemctl start docker
- Verify Docker Installation:
docker --version
AGNOSTICMLFLOW
├── documentation
│ ├── Components.png
│ ├── architecture-diagram.png
│ ├── README.md
│
├── iac # Infrastructure as Code (Deployment Configurations)
│ ├── cloud # Placeholder for Cloud Deployment Configurations
│ ├── local # Local Kubernetes Deployment Configurations
│ │ ├── beam # Apache Beam JobServer Configurations
│ │ │ ├── beam-src # Source files or configurations for Beam
│ │ │ ├── beam-jobserver-deployment.yaml
│ │ │ ├── beam-jobserver-service.yaml
│ │ │
│ │ ├── flink # Apache Flink Configurations
│ │ │ ├── Dockerfile
│ │ │ ├── entrypoint.sh
│ │ │ ├── flink-conf.yaml
│ │ │ ├── flink-data-pvc.yaml
│ │ │ ├── flink-ingress.yaml
│ │ │ ├── flink-master-deployment.yaml
│ │ │ ├── flink-master-service.yaml
│ │ │ ├── flink-tmp-data-pvc.yaml
│ │ │ ├── flink-worker-deployment.yaml
│ │ │ ├── flink-worker-service.yaml
│ │ │
│ │ ├── minio # MinIO Object Storage Configurations
│ │ │ ├── minio-deployment.yaml
│ │ │ ├── minio-ingress.yaml
│ │ │ ├── minio-service.yaml
│ │ │ ├── minio.crt
│ │ │ ├── minio.key
│ │ │
│ │ ├── spark # Apache Spark Configurations
│ │ │ ├── Dockerfile-spark-standalone
│ │ │ ├── spark-ingress.yaml
│ │ │ ├── spark-master-deployment.yaml
│ │ │ ├── spark-master-service.yaml
│ │ │ ├── spark-worker-deployment.yaml
│ │ │ ├── spark-worker-service.yaml
│ │ │
│ │ ├── README.md
│
├── onprem # Placeholder for On-Prem Deployment Configurations
│ ├── README.md
│
├── middleware # Middleware Configurations (If applicable)
│ ├── README.md
│
├── portal # UI/Portal for Application Management
│ ├── README.md
│
├── services # Microservices for Specific Tasks
│ ├── customer-churn # Domain-Specific Use Cases
│ │ ├── data_ingestion_app # Data Ingestion Application
│ │ │ ├── scripts # Utility Scripts for Data Handling
│ │ │ ├── src # Source Code
│ │ │ ├── tests # Unit and Integration Tests
│ │ │ ├── Dockerfile # Container Configuration for App
│ │ │ ├── ingestion-deployment.yaml
│ │ │ ├── kube-config.yaml
│ │ │ ├── poetry.lock # Dependency Lock File
│ │ │ ├── pyproject.toml # Python Project Configuration
│ │ │ ├── README.md
│ │ │ ├── requirements.txt # Python Dependencies
│ │ │ ├── sales_data.csv # Example Dataset
│ │ │ ├── model_builder # Model Training Code
│ │ │ ├── predictor # Model Serving Code
│ │ │ ├── trainer # Model Training Pipeline
│ │ │
│ │ ├── domain-2 # Additional Domain Services (Placeholder)
│ │ │ ├── README.md
│
├── README.md # Main Documentation and Instructions
minikube start --cpus=4 --memory=12288 --disk-size=50g
kubectl create secret tls minio-tls-secret --cert=minio.crt --key=minio.key
kubectl apply -f minio-deployment.yaml
kubectl apply -f minio-service.yaml
minikube addons enable ingress
kubectl apply -f minio-ingress.yaml
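Once MinIO is up, applications reach it through its S3-compatible API. A minimal sketch of the connection settings a client would use is below; the service name `minio-service`, port 9000, and the default `minioadmin` credentials are illustrative assumptions, not values from this repository's manifests.

```python
def minio_client_kwargs(host="minio-service", port=9000, secure=False,
                        access_key="minioadmin", secret_key="minioadmin"):
    """Build S3-compatible connection settings for a MinIO endpoint.

    All values here are placeholders; substitute the service name,
    port, and credentials from your own deployment.
    """
    scheme = "https" if secure else "http"
    return {
        "endpoint_url": f"{scheme}://{host}:{port}",
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
    }

# These kwargs can be passed to boto3.client("s3", **kwargs), after
# which s3://bucket/key paths resolve against MinIO instead of AWS.
```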
kubectl create namespace cloud2-namespace
kubectl config set-context --current --namespace=cloud2-namespace
docker build -t flink:1.13.0-with-docker .
kubectl apply -f iac/local/flink/flink-conf.yaml
kubectl apply -f iac/local/flink/flink-data-pvc.yaml
kubectl apply -f iac/local/flink/flink-tmp-data-pvc.yaml
kubectl apply -f iac/local/flink/flink-master-deployment.yaml
kubectl apply -f iac/local/flink/flink-master-service.yaml
kubectl apply -f iac/local/flink/flink-worker-deployment.yaml
kubectl apply -f iac/local/flink/flink-worker-service.yaml
kubectl apply -f iac/local/flink/flink-ingress.yaml
kubectl apply -f iac/local/beam/beam-jobserver-deployment.yaml
kubectl get pods -n cloud2-namespace
kubectl get svc -n cloud2-namespace
docker build -t data-ingestion-app:latest -f src/Dockerfile .
kubectl apply -f iac/local/data-ingestion-deployment.yaml
kubectl logs -f <pod-name>
- Scalability: Distributed processing using Flink and Beam.
- Data Storage: Integrated with MinIO for scalable object storage.
- Extensibility: Flexible architecture to support additional pipelines.
- Local Testing: Minikube ensures easy local development and testing.
kubectl logs <pod-name> -n cloud2-namespace
kubectl exec -it <pod-name> -- /bin/bash
kubectl describe node
This project sets up a scalable data pipeline using Kubernetes, Flink, and Beam, providing an environment for developing distributed data processing workflows. It is ideal for prototyping and integrating with cloud-native platforms for production deployment.
For further enhancements, consider integrating monitoring tools such as Prometheus and Grafana, or extending the workflows for multi-cloud deployments.

