Unstack the “Big Stack”

The objective of this project is to build a streaming pipeline for Stack Overflow data and to analyze and visualize interesting trends in it.

Launched in 2008, Stack Overflow is a widely used question-and-answer forum for professional and enthusiast programmers, and it generates large volumes of data that must be processed continuously. A post can be a question, an answer to a question, or a comment on another post. Posts carry attributes such as tags, upvotes, downvotes, and views. Users are encouraged to post quality questions and answers, and are awarded reputation scores and badges for doing so. These features help employers identify potential developers on the site for a particular technology.

The project provides some interesting trends and insights into the Stack Overflow website.

Team Members:

Ankita Kundra
Gayatri Ganapathy
Kunal Niranjan Desai
Ria Gupta

Data Pipeline

Project Working

Data

https://bigquery.cloud.google.com/dataset/bigquery-public-data:stackoverflow
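
A minimal sketch of reading from this public dataset in Python, assuming the `google-cloud-bigquery` client library is installed and Google Cloud credentials are configured. The table name `posts_questions`, the selected columns, and the helper names `build_query` and `fetch_rows` are assumptions chosen for illustration; the project's own queries may differ.

```python
"""Sketch: pull a sample of Stack Overflow questions from the public BigQuery dataset."""
from __future__ import annotations

# Assumption: the table and columns used below are for illustration only.
DATASET = "bigquery-public-data.stackoverflow"


def build_query(table: str = "posts_questions", limit: int = 1000) -> str:
    """Return a SQL string selecting a few post attributes from one table."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return (
        "SELECT id, title, tags, score, view_count, creation_date "
        f"FROM `{DATASET}.{table}` "
        f"LIMIT {limit}"
    )


def fetch_rows(sql: str) -> list[dict]:
    """Run the query and return the rows as plain dictionaries."""
    # Imported here so build_query can be used without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()  # relies on application default credentials
    return [dict(row.items()) for row in client.query(sql).result()]


if __name__ == "__main__":
    for row in fetch_rows(build_query(limit=5)):
        print(row["id"], row["title"])
```

Keeping the query builder separate from the network call makes it easy to inspect the SQL before any bytes are billed.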

Tableau link to our project:

https://public.tableau.com/views/StackOverflow_Analysis/UserAnalysis?:display_count=y&publish=yes&:origin=viz_share_link

Steps to run the project:

Use the Google BigQuery API to query the data from the BigQuery dataset.
Upload the files to an Amazon S3 bucket and set up Amazon EMR.
Use startup.sh to set up the environment in the AWS CLI.
Run table_name_producer.py and table_name_stream.py to start the Kafka producer and consumer processes.
Check that Parquet files are created in the cluster under the specified location.
Use the Tableau link above to view the visualizations of the analysis.
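
The producer step above can be sketched roughly as follows. This is not the repository's table_name_producer.py; it is a hedged illustration that assumes the `kafka-python` package, a broker at `localhost:9092`, and a topic named `posts_questions`. The helper names `serialize_row` and `publish_rows` are hypothetical.

```python
"""Sketch: publish table rows to a Kafka topic as JSON messages."""
import json
from typing import Any, Iterable, Mapping


def serialize_row(row: Mapping[str, Any]) -> bytes:
    """Encode one row as UTF-8 JSON; non-JSON types (e.g. datetimes) become strings."""
    return json.dumps(dict(row), default=str, sort_keys=True).encode("utf-8")


def publish_rows(producer: Any, topic: str, rows: Iterable[Mapping[str, Any]]) -> int:
    """Send each row to `topic` and return the number of messages sent.

    `producer` only needs send(topic, value=...) and flush(),
    which kafka-python's KafkaProducer provides.
    """
    sent = 0
    for row in rows:
        producer.send(topic, value=serialize_row(row))
        sent += 1
    producer.flush()  # block until buffered messages are delivered
    return sent


if __name__ == "__main__":
    # Assumptions: broker address and topic name are placeholders.
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    sample = [{"id": 1, "tags": "python|kafka", "score": 3}]
    print(publish_rows(producer, "posts_questions", sample), "message(s) sent")
```

The consumer side is not shown here; per the steps above, it is expected to read the topic and write Parquet files to the configured location.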

