Agnostic ML Pipelines for Multi-Cloud: Apache Flink with Apache Beam on Minikube with MinIO

NOTE

  • This is an educational project for learning purposes only. It is not intended for production use.


Overview

This project deploys Apache Flink with Apache Beam on Minikube, backed by MinIO, to run scalable data processing pipelines. It provides a fully containerized, Kubernetes-based infrastructure designed for developing, testing, and prototyping ML workflows.

Core Components

  1. Minikube Cluster - A local Kubernetes cluster to host all components.
  2. Flink JobManager - Manages job execution and coordinates with TaskManagers.
  3. Flink TaskManager - Executes distributed data processing tasks as directed by the JobManager.
  4. Beam Job Server - Accepts Apache Beam jobs and submits them to Flink for execution.
  5. Data Ingestion App - Submits jobs to the Beam Job Server for processing data.
  6. MinIO DataLake - Provides object storage for raw and processed data.


Component Interaction

  • The Data Ingestion App submits processing jobs to the Beam Job Server.
  • The Beam Job Server delegates job execution to the Flink JobManager.
  • The Flink JobManager coordinates distributed execution with Flink TaskManagers.
  • The Data Ingestion App interacts with MinIO for raw data storage.
  • The Flink TaskManagers interact with MinIO for storing processed data.
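
To make this flow concrete, below is a minimal Beam pipeline sketch of the kind the Data Ingestion App could submit: it targets the Beam Job Server, which hands the job to Flink, and reads and writes MinIO through the S3 protocol. The job server address, MinIO endpoint, credentials, and bucket names are illustrative assumptions (not taken from this repository's manifests), and apache_beam[aws] is assumed to be installed so that s3:// paths resolve against MinIO.

# Hypothetical pipeline: submit to the Beam Job Server, run on Flink,
# read raw data from MinIO, and write processed output back to MinIO.
# Endpoints, credentials, and bucket names below are assumptions.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=beam-jobserver:8099",    # Beam Job Server service (assumed name/port)
    "--environment_type=LOOPBACK",           # simplest harness for local experiments
    "--s3_endpoint_url=http://minio:9000",   # MinIO service (assumed name/port)
    "--s3_access_key_id=minioadmin",         # example credentials only
    "--s3_secret_access_key=minioadmin",
])

with beam.Pipeline(options=options) as p:
    (p
     | "ReadRaw" >> beam.io.ReadFromText("s3://raw-data/sales_data.csv")
     | "DropEmptyLines" >> beam.Filter(lambda line: line.strip() != "")
     | "WriteProcessed" >> beam.io.WriteToText("s3://processed-data/sales_clean"))

The LOOPBACK environment is only convenient for experimentation; a cluster deployment would typically use a different SDK harness environment type.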

Prerequisites

Tools Required

  • Minikube: Kubernetes cluster manager for local setups.
  • kubectl: Kubernetes CLI for managing cluster resources.
  • Docker: Container platform for building and running containerized applications.

Installing Dependencies

Minikube

  1. Install Minikube:

    • macOS:
      brew install minikube
    • Linux:
      curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
      sudo install minikube-linux-amd64 /usr/local/bin/minikube
    • Windows: Download from Minikube Releases.
  2. Start Minikube:

    minikube start

kubectl

  1. Install kubectl:

    • macOS:
      brew install kubectl
    • Linux:
      curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
      sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
    • Windows: Download from Kubernetes Releases.
  2. Verify kubectl:

    kubectl version --client

Docker

  1. Install Docker:

    • macOS:
      brew install --cask docker
    • Linux:
      curl -fsSL https://get.docker.com -o get-docker.sh
      sudo sh get-docker.sh
    • Windows: Download Docker Desktop from the Docker website.
  2. Start Docker Service:

    sudo systemctl start docker
  3. Verify Docker Installation:

    docker --version

Project Structure

AGNOSTICMLFLOW
├── documentation
│   ├── Components.png
│   ├── architecture-diagram.png
│   ├── README.md
│
├── iac                           # Infrastructure as Code (Deployment Configurations)
│   ├── cloud                     # Placeholder for Cloud Deployment Configurations
│   ├── local                     # Local Kubernetes Deployment Configurations
│   │   ├── beam                  # Apache Beam JobServer Configurations
│   │   │   ├── beam-src          # Source files or configurations for Beam
│   │   │   ├── beam-jobserver-deployment.yaml
│   │   │   ├── beam-jobserver-service.yaml
│   │   │
│   │   ├── flink                 # Apache Flink Configurations
│   │   │   ├── Dockerfile
│   │   │   ├── entrypoint.sh
│   │   │   ├── flink-conf.yaml
│   │   │   ├── flink-data-pvc.yaml
│   │   │   ├── flink-ingress.yaml
│   │   │   ├── flink-master-deployment.yaml
│   │   │   ├── flink-master-service.yaml
│   │   │   ├── flink-tmp-data-pvc.yaml
│   │   │   ├── flink-worker-deployment.yaml
│   │   │   ├── flink-worker-service.yaml
│   │   │
│   │   ├── minio                 # MinIO Object Storage Configurations
│   │   │   ├── minio-deployment.yaml
│   │   │   ├── minio-ingress.yaml
│   │   │   ├── minio-service.yaml
│   │   │   ├── minio.crt
│   │   │   ├── minio.key
│   │   │
│   │   ├── spark                 # Apache Spark Configurations
│   │   │   ├── Dockerfile-spark-standalone
│   │   │   ├── spark-ingress.yaml
│   │   │   ├── spark-master-deployment.yaml
│   │   │   ├── spark-master-service.yaml
│   │   │   ├── spark-worker-deployment.yaml
│   │   │   ├── spark-worker-service.yaml
│   │   │
│   │   ├── README.md
│
├── onprem                        # Placeholder for On-Prem Deployment Configurations
│   ├── README.md
│
├── middleware                    # Middleware Configurations (If applicable)
│   ├── README.md
│
├── portal                        # UI/Portal for Application Management
│   ├── README.md
│
├── services                      # Microservices for Specific Tasks
│   ├── customer-churn            # Domain-Specific Use Cases
│   │   ├── data_ingestion_app    # Data Ingestion Application
│   │   │   ├── scripts           # Utility Scripts for Data Handling
│   │   │   ├── src               # Source Code
│   │   │   ├── tests             # Unit and Integration Tests
│   │   │   ├── Dockerfile        # Container Configuration for App
│   │   │   ├── ingestion-deployment.yaml
│   │   │   ├── kube-config.yaml
│   │   │   ├── poetry.lock       # Dependency Lock File
│   │   │   ├── pyproject.toml    # Python Project Configuration
│   │   │   ├── README.md
│   │   │   ├── requirements.txt  # Python Dependencies
│   │   │   ├── sales_data.csv    # Example Dataset
│   │   │   ├── model_builder     # Model Training Code
│   │   │   ├── predictor         # Model Serving Code
│   │   │   ├── trainer           # Model Training Pipeline
│   │   │
│   │   ├── domain-2              # Additional Domain Services (Placeholder)
│   │   │   ├── README.md
│
├── README.md                     # Main Documentation and Instructions


Setup Instructions

1. Start Minikube

minikube start --cpus=4 --memory=12288 --disk-size=50g

2. Deploy MinIO Data Lake

Create a TLS Secret

kubectl create secret tls minio-tls-secret --cert=minio.crt --key=minio.key

Deploy MinIO in Default Namespace

kubectl apply -f minio-deployment.yaml 
kubectl apply -f minio-service.yaml

Deploy Ingress

minikube addons enable ingress
kubectl apply -f minio-ingress.yaml
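
Create the Data Buckets (Optional)

The pipelines need their source and destination buckets to exist. A minimal sketch using boto3, with an illustrative endpoint, credentials, and bucket names that must be adapted to the MinIO deployment above (when running it from outside the cluster, point endpoint_url at the MinIO ingress instead of the in-cluster service name):

# Hypothetical helper: create the buckets the pipelines will use in MinIO.
# Endpoint, credentials, and bucket names are assumptions.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://minio:9000",    # in-cluster MinIO service (assumed)
    aws_access_key_id="minioadmin",      # example credentials only
    aws_secret_access_key="minioadmin",
)

for bucket in ("raw-data", "processed-data"):
    try:
        s3.create_bucket(Bucket=bucket)
        print(f"Created bucket: {bucket}")
    except s3.exceptions.BucketAlreadyOwnedByYou:
        print(f"Bucket already exists: {bucket}")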

3. Deploy Flink in a Different Namespace

Create a Namespace for Flink and Related Components

kubectl create namespace cloud2-namespace

kubectl config set-context --current --namespace=cloud2-namespace

Build the Custom Flink Docker Image

Run the build from iac/local/flink/, where the Dockerfile lives. For the cluster to use a locally built image, you may first need to point your shell at Minikube's Docker daemon, for example with eval $(minikube docker-env).

docker build -t flink:1.13.0-with-docker .

Deploy Flink ConfigMap and Volumes

kubectl apply -f iac/local/flink/flink-conf.yaml
kubectl apply -f iac/local/flink/flink-data-pvc.yaml
kubectl apply -f iac/local/flink/flink-tmp-data-pvc.yaml

Deploy Flink JobManager

kubectl apply -f iac/local/flink/flink-master-deployment.yaml
kubectl apply -f iac/local/flink/flink-master-service.yaml

Deploy Flink TaskManager

kubectl apply -f iac/local/flink/flink-worker-deployment.yaml
kubectl apply -f iac/local/flink/flink-worker-service.yaml

Deploy Flink Ingress

kubectl apply -f iac/local/flink/flink-ingress.yaml

4. Deploy Beam Job Server

kubectl apply -f iac/local/beam/beam-jobserver-deployment.yaml
kubectl apply -f iac/local/beam/beam-jobserver-service.yaml

5. Verify Deployments

kubectl get pods -n cloud2-namespace
kubectl get svc -n cloud2-namespace

6. Run the Data Ingestion App

Build the Docker Image

docker build -t data-ingestion-app:latest -f src/Dockerfile .

Deploy the Data Ingestion App

kubectl apply -f iac/local/data-ingestion-deployment.yaml

Monitor Logs

kubectl logs -f <pod-name> 
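
The app's actual entry point lives under services/customer-churn/data_ingestion_app/src; the sketch below only illustrates the kind of work it performs, namely staging the raw CSV in MinIO before the processing pipeline runs. The file name sales_data.csv comes from the repository, but the endpoint, credentials, bucket names, and the run_pipeline step are assumptions.

# Hypothetical outline of the ingestion step: upload the raw dataset to
# MinIO so the Beam/Flink pipeline can read it from the raw-data bucket.
import boto3

def upload_raw_data(local_path: str = "sales_data.csv") -> None:
    s3 = boto3.client(
        "s3",
        endpoint_url="http://minio:9000",    # MinIO service (assumed)
        aws_access_key_id="minioadmin",      # example credentials only
        aws_secret_access_key="minioadmin",
    )
    s3.upload_file(local_path, "raw-data", "sales_data.csv")

if __name__ == "__main__":
    upload_raw_data()
    # A run_pipeline() step would then submit the Beam pipeline sketched in
    # the "Component Interaction" section to the Beam Job Server.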

Key Features

  • Scalability: Distributed processing using Flink and Beam.
  • Data Storage: Integrated with MinIO for scalable object storage.
  • Extensibility: Flexible architecture to support additional pipelines.
  • Local Testing: Minikube ensures easy local development and testing.

Troubleshooting

1. Check Logs:

kubectl logs <pod-name> -n cloud2-namespace

2. Debug Kubernetes Pods:

kubectl exec -it <pod-name> -- /bin/bash

3. Check Node Resources:

kubectl describe node

Conclusion

This project sets up a scalable data pipeline using Kubernetes, Flink, and Beam, providing an environment for developing distributed data processing workflows. It is well suited for prototyping and can serve as a starting point for integration with cloud-native platforms.

For further enhancements, consider integrating monitoring tools such as Prometheus and Grafana, or extending the workflows to multi-cloud deployments.
