The objective of this project is to process Stack Overflow data as a real-time stream and to analyze it in order to visualize some interesting trends.
Started in 2008, Stack Overflow is a widely used question-and-answer forum for professional and enthusiast programmers, and it holds a large volume of data that must be handled and processed continuously. A post can be a question, an answer to a question, or a comment on another post. Posts carry attributes such as tags, upvotes, downvotes, and views. Users are encouraged to post quality questions and answers and are awarded reputation scores and badges. These features help employers identify potential developers on the site for a particular technology.
The project presents some interesting trends and insights from the Stack Overflow website.
- Ankita Kundra
- Gayatri Ganapathy
- Kunal Niranjan Desai
- Ria Gupta
https://bigquery.cloud.google.com/dataset/bigquery-public-data:stackoverflow
- Use the Google BigQuery API to query the data from the BigQuery dataset.
- Upload the files to an Amazon S3 bucket and set up Amazon EMR.
- Use startup.sh to set up the environment in AWS CLI.
- Run table_name_producer.py and table_name_stream.py to start the Kafka producer and consumer processes.
- Check that Parquet files have been written to the cluster at the specified location.
- Use the Tableau link above to view the visualizations of the analysis.
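
The producer step above could look roughly like the following sketch. It is not the project's actual table_name_producer.py. It assumes that each exported row is sent as one JSON-encoded Kafka message, that the third-party kafka-python package is installed, and that a broker is reachable at localhost:9092. The helper names (encode_row, publish_rows, run_with_kafka), the topic name, and the sample row are all hypothetical.

```python
import json
from typing import Callable, Iterable, Mapping


def encode_row(row: Mapping[str, object]) -> bytes:
    """Serialize one table row as UTF-8 JSON (assumed message format)."""
    # default=str turns non-JSON types such as timestamps into strings.
    return json.dumps(row, sort_keys=True, default=str).encode("utf-8")


def publish_rows(
    rows: Iterable[Mapping[str, object]],
    send: Callable[[str, bytes], None],
    topic: str,
) -> int:
    """Send every row to `topic` through `send` and return the message count."""
    count = 0
    for row in rows:
        send(topic, encode_row(row))
        count += 1
    return count


def run_with_kafka() -> None:
    """Wire publish_rows to a real broker (needs kafka-python and a running Kafka)."""
    from kafka import KafkaProducer  # third-party: pip install kafka-python

    # Broker address is an assumption; replace it with the cluster's own.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    def send(topic: str, payload: bytes) -> None:
        producer.send(topic, value=payload)

    # Placeholder row; a real run would iterate over the exported table files.
    rows = [{"id": 1, "tags": "python|pandas", "view_count": 10}]
    sent = publish_rows(rows, send, topic="posts_questions")
    producer.flush()  # block until buffered messages are delivered
    print(f"sent {sent} messages")
```

Passing the send function in as a parameter keeps the serialization logic testable without a broker. Calling run_with_kafka() performs the actual publish.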

