AirFlow + Kafka + Spark + Docker

A streaming data application built on the New York City Taxi Fare dataset.

The data pipeline is shown in the following airflow DAG picture:

DAG: 1_streaming

Writes events to a Kafka cluster. The producer's partitioner maps each message from the train.csv file to a partition of the topic named transactions, and the producer sends a produce request to the leader of that partition. The partitioners shipped with Kafka guarantee that all messages with the same non-empty key are sent to the same partition.
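
The keyed-partitioning guarantee can be illustrated with a small sketch. This is not Kafka's actual partitioner (the Java client's default hashes the key with murmur2); `zlib.crc32` stands in as a deterministic hash, and `NUM_PARTITIONS`, `choose_partition`, and the sample keys are hypothetical names and values chosen only for illustration.

```python
import zlib

NUM_PARTITIONS = 3  # hypothetical partition count for the "transactions" topic


def choose_partition(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a non-empty key to a partition index deterministically.

    Stand-in for Kafka's default partitioner: crc32 replaces murmur2 here,
    so the indices differ from what a real cluster would assign.
    """
    if not key:
        raise ValueError("this sketch only covers non-empty keys")
    return zlib.crc32(key) % num_partitions


# Hypothetical message keys; the first and third are identical on purpose.
keys = [
    b"2009-06-15 17:26:21.0000001",
    b"2010-01-05 16:52:16.0000002",
    b"2009-06-15 17:26:21.0000001",
]
partitions = [choose_partition(k) for k in keys]
same_key_same_partition = partitions[0] == partitions[2]
```

Because the partition index depends only on the key bytes and the partition count, repeated keys always land on the same partition as long as the number of partitions does not change.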

DAG: 2_consumer

1. Create a path used to recover from failures if something goes wrong during data processing.
1. Read (consume) messages from one or more Kafka topics. In this case there is a single topic: Topic1 = "transactions".
1. Run validation (Data Test) on the data based on an expectation suite.
1. If the Data Test is True, there is no problem with the data, and it can be modeled and saved. If the Data Test is False, the data is discarded.
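
The validate-then-save-or-discard step can be sketched in plain Python. The project reads its expectation suite from a JSON file (see TEST_SUITE in SETUP); the sketch below does not use that file or its format. `EXPECTATIONS`, `data_test`, `process_batch`, the column names, and the thresholds are all hypothetical stand-ins.

```python
from typing import Callable, Dict, List

Row = Dict[str, float]

# Hypothetical expectation suite: column name -> predicate every value must satisfy.
EXPECTATIONS: Dict[str, Callable[[float], bool]] = {
    "fare_amount": lambda v: v is not None and v >= 0,
    "passenger_count": lambda v: v is not None and 0 <= v <= 8,
}


def data_test(batch: List[Row]) -> bool:
    """Return True only if every row satisfies every expectation."""
    return all(
        column in row and check(row[column])
        for row in batch
        for column, check in EXPECTATIONS.items()
    )


def process_batch(batch: List[Row], sink: List[Row]) -> bool:
    """Append the batch to sink when the Data Test passes, otherwise discard it."""
    passed = data_test(batch)
    if passed:
        sink.extend(batch)
    return passed


saved: List[Row] = []
good = [{"fare_amount": 4.5, "passenger_count": 1}]
bad = [{"fare_amount": -3.0, "passenger_count": 1}]  # negative fare fails the test
good_result = process_batch(good, saved)
bad_result = process_batch(bad, saved)
```

Only the batch that passes ends up in `saved`; the failing batch is dropped as a whole, which mirrors the all-or-nothing behavior described in the steps above.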

The data is modeled and saved as the following datasets:

ride_per_month: This query calculates the fare amount per month and year. The DataFrame is saved to the data lake partitioned by the columns pickup year and pickup year month.
ride_amount_per_hour: This query calculates the fare amount per hour. The DataFrame is saved to the data lake partitioned by the columns pickup year and pickup year month.
taxi_ride_local_per_hour: This query calculates how many taxi rides there are per hour at the same location. The DataFrame is saved to the data lake partitioned by the columns pickup year and pickup year month.
taxi_ride_local: This query calculates how many taxi rides there are per month at the same location. The DataFrame is saved to the data lake partitioned by the columns pickup year and pickup year month.
taxi_ride_local_ranking: This query calculates how many taxi rides there are at the same location. The DataFrame is saved to the data lake partitioned by the columns pickup year and pickup year month.
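
As a rough illustration of the first dataset, the sketch below sums fares per month and builds a partition directory per result. It is plain Python rather than the Spark job the pipeline runs, and the column names (`pickup_datetime`, `fare_amount`), the timestamp format, the directory naming, and the sample rows are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample rows; the column names and timestamp format are assumptions.
rides = [
    {"pickup_datetime": "2012-06-15 17:26:21", "fare_amount": 4.5},
    {"pickup_datetime": "2012-06-20 08:10:00", "fare_amount": 10.0},
    {"pickup_datetime": "2012-07-01 23:59:59", "fare_amount": 7.5},
]


def fare_per_month(rows):
    """Sum fare_amount per (pickup year, pickup year-month)."""
    totals = defaultdict(float)
    for row in rows:
        ts = datetime.strptime(row["pickup_datetime"], "%Y-%m-%d %H:%M:%S")
        totals[(ts.year, ts.strftime("%Y-%m"))] += row["fare_amount"]
    return dict(totals)


def partition_path(base, year, year_month):
    """Build a Hive-style partition directory; the directory names are hypothetical."""
    return f"{base}/pickup_year={year}/pickup_year_month={year_month}"


totals = fare_per_month(rides)
paths = sorted(partition_path("ride_per_month", y, ym) for (y, ym) in totals)
```

Partitioning by year and year-month keeps each month's output in its own directory, so a reader interested in one period only has to scan that directory.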

Docker Container based architecture:

Container 1: PostgreSQL for the Airflow db
Container 2: Airflow + KafkaProducer
Container 3: Zookeeper for the Kafka server
Container 4: Kafka Server
Container 5: Spark + Hadoop

Container 2 is responsible for producing data in a streaming fashion from the source data (train.csv).
Container 5 is responsible for consuming the data in a partitioned way.

To bind all the containers together with docker-compose, I used pre-configured Dockerfiles available on Docker Hub.

SETUP:

Before starting any DAG, some configuration is required. The steps are described below:

1. Download train.csv from https://www.kaggle.com/competitions/new-york-city-taxi-fare-prediction/data?select=train.csv and save it in the data directory.
1. In the Airflow UI, create the Airflow variables under Admin>Variables:
  - BOOTSTRAP_SERVERS = kafka:9092
  - DATA_OUTPUT = /usr/local/airflow/data/output/
  - TEST_SUITE = /usr/local/airflow/data/great_expectation_suite.json
  - PATH_STREAM = /data/train.csv
1. Export JAVA_HOME inside the container:
  - Access the container using: docker exec -ti [airflow-container-id] bash
  - export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
  - source ~/.bashrc
1. Start Hive inside the container:
  - Access the container using: docker exec -ti [airflow-container-id] bash
  - Before you run Hive for the first time, run: schematool -initSchema -dbType derby
  - If you already ran Hive and initSchema then fails:
    - cd /data/hive/
    - mv metastore_db metastore_db.tmp
    - Re-run: schematool -initSchema -dbType derby
1. cd /opt/apache-hive-2.0.1-bin/bin/
1. chmod 777 hive
1. hive --service metastore
1. Now trigger the DAG 1_streaming from the Airflow UI.
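
Before triggering the DAG, it can help to confirm that the four Airflow variables from the setup list exist. The following is a minimal sketch of such a check, not part of the repository: `missing_variables` and `airflow_lookup` are hypothetical helpers, and `Variable.get(name, default_var=None)` is assumed to match the Airflow 1.x/2.x API available in the container.

```python
REQUIRED_VARIABLES = ("BOOTSTRAP_SERVERS", "DATA_OUTPUT", "TEST_SUITE", "PATH_STREAM")


def missing_variables(lookup):
    """Return the names in REQUIRED_VARIABLES for which lookup(name) is empty or None."""
    return [name for name in REQUIRED_VARIABLES if not lookup(name)]


def airflow_lookup(name):
    """Read one Airflow Variable; only usable where Airflow is installed."""
    # Imported lazily so the sketch loads on machines without Airflow.
    from airflow.models import Variable

    return Variable.get(name, default_var=None)


# Dictionary stand-in for the Admin>Variables screen, with one value left out.
configured = {
    "BOOTSTRAP_SERVERS": "kafka:9092",
    "DATA_OUTPUT": "/usr/local/airflow/data/output/",
    "TEST_SUITE": "/usr/local/airflow/data/great_expectation_suite.json",
}
missing = missing_variables(configured.get)
```

Inside the Airflow container, calling `missing_variables(airflow_lookup)` would run the same check against the real variables; an empty list means all four are set.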
