cloud-bigdata-project

Project structure

All scripts that were used to reach our conclusions are stored in the /scripts folder in the root of this repo.

To streamline our workflow, we use environment variables for workspace-dependent settings, such as the path to the dataset. To run the scripts successfully, create a .env file modeled on .env.example.

It is also worth mentioning there are two Python files in the root of the repo:

env_wrapper.py: reads the environment variables.
schema.py: defines the dataset's schema.
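
As a rough illustration of what an environment wrapper of this kind might look like, the sketch below reads the two variables used later in this README (FILENAME and OUTPUT_DIR) from the process environment and fails with a clear message when one is missing. It is not the contents of env_wrapper.py: the function name get_required_env is hypothetical, and the real file may first load a .env file with python-dotenv.

```python
import os


def get_required_env(name, environ=None):
    """Return the value of an environment variable or raise a clear error.

    Hypothetical helper for illustration; not taken from env_wrapper.py.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


if __name__ == "__main__":
    # FILENAME and OUTPUT_DIR are the variables exported later in this README.
    for key in ("FILENAME", "OUTPUT_DIR"):
        try:
            print(key, "=", get_required_env(key))
        except RuntimeError as err:
            print(err)
```

Failing early on a missing variable tends to give a clearer error than letting Spark fail later on an empty path.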

Datasets

Original dataset:

https://www.kaggle.com/datasets/dilwong/flightprices

Two smaller versions of this dataset (1MB each) are available in this repo.

To obtain a smaller dataset from the original, the following command keeps 1 out of every 30 lines (starting with line 1, so the first line of the file is kept), producing a dataset of about 1GB. The first~step address form is a GNU sed extension.

sed -n '1~30p' itineraries.csv > small_itineraries.csv
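
Where GNU sed is not available, the same sampling can be approximated in Python. The following is a minimal sketch, not part of the repo: the function name sample_every_nth_line is hypothetical, and it works on raw lines just as sed does, so it assumes no CSV field contains an embedded newline.

```python
from itertools import islice


def sample_every_nth_line(src_path, dst_path, step=30):
    """Copy lines 1, 1+step, 1+2*step, ... from src_path to dst_path.

    Line 1 is always kept, mirroring sed -n '1~30p'.
    Returns the number of lines written.
    """
    written = 0
    with open(src_path, "r", encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        # islice streams the file, so the full dataset is never held in memory.
        for line in islice(src, 0, None, step):
            dst.write(line)
            written += 1
    return written
```

For example, sample_every_nth_line("itineraries.csv", "small_itineraries.csv") would write the same lines as the sed command above.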

PySpark on Google Cloud

There will be a single Bucket shared across the team.

First, grant the Google service account of the cluster that will run the scripts the following permissions on the shared Bucket: "Storage Legacy Bucket Owner" and "Storage Legacy Object Owner".

The cluster's service account email can be found in the master VM's details.

Then, the cluster will be able to access the shared Bucket.

Running PySpark on a cluster

Run the following commands to install the required dependency and set the environment variables.

python -m pip install python-dotenv

BUCKET=gs://pacolo2
export FILENAME=$BUCKET/itineraries.csv
export OUTPUT_DIR=$BUCKET/scripts_output/

Then, to run scripts, the following command can be used:

spark-submit --py-files $BUCKET/env_wrapper.py,$BUCKET/schema.py $BUCKET/scripts/<SCRIPT-NAME>.py
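
If scripts are launched often, the command above can be assembled programmatically. This is only a sketch: build_spark_submit is a hypothetical helper, not something the repo provides, and "example" is used purely as an illustrative script name.

```python
import shlex


def build_spark_submit(bucket, script_name):
    """Build the spark-submit argument list shown in this README.

    bucket is a gs:// URI such as the BUCKET variable above;
    script_name is the script's file name without the .py extension.
    """
    bucket = bucket.rstrip("/")
    py_files = ",".join(
        f"{bucket}/{name}" for name in ("env_wrapper.py", "schema.py")
    )
    return [
        "spark-submit",
        "--py-files", py_files,
        f"{bucket}/scripts/{script_name}.py",
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(build_spark_submit("gs://pacolo2", "example")))
```

Returning an argument list rather than a single string means the result can be passed directly to subprocess.run without shell quoting concerns.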

Running PySpark locally

To run a script on your local machine, you need a .env file that defines the variables listed in .env.example.
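
A quick way to confirm that a local .env covers everything in .env.example is to compare the variable names in both files. The sketch below is an assumption-laden helper, not part of the repo: parse_env_keys and missing_keys are hypothetical names, and only simple KEY=VALUE lines, "export " prefixes, blank lines and # comments are handled.

```python
from pathlib import Path


def parse_env_keys(text):
    """Return the set of variable names defined in .env-style text."""
    keys = set()
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        keys.add(line.split("=", 1)[0].strip())
    return keys


def missing_keys(example_text, env_text):
    """Names present in .env.example but absent from .env, sorted."""
    return sorted(parse_env_keys(example_text) - parse_env_keys(env_text))


if __name__ == "__main__":
    example, env = Path(".env.example"), Path(".env")
    if example.exists() and env.exists():
        missing = missing_keys(example.read_text(), env.read_text())
        print(missing or "No variables missing")
    else:
        print("Expected both .env.example and .env in the current folder")
```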

Then, to run a script, use the following command from the scripts folder (the relative paths assume that location):

spark-submit --py-files ../env_wrapper.py,../schema.py <SCRIPT-NAME>.py

There is an example script to test the correct functioning of your environment: example.py
