ds_salary_proj

Data Science Salary Estimator: Project Overview

Created a tool that estimates data science salaries to help data scientists negotiate their income when they get a job. Scraped over sample of job descriptions from glassdoor using python and selenium. Engineered features from the text of each job description to quantify the value companies put on different Data-Science tools like python, excel, aws, and spark. Optimized Linear, Lasso, and Random Forest Regressors using GridsearchCV to reach the best model. Built a client facing API using flask.

Resources & Knowledge-base:

Code and Resources Used

Python Version:3.8
Packages: pandas, numpy, sklearn, matplotlib, seaborn, selenium, flask, json, pickle
For Web Framework Requirements: pip install -r requirements.txt
Special Thanks to KenJee Github: https://github.com/PlayingNumbers
Tutorial(youtube) : https://www.youtube.com/watch?v=GmW4F6MHqqs&list=PL2zq7klxX5ASFejJj80ob9ZAnBHdz5O1t
& Thanks to Ömer SakaryaScraper Github: https://github.com/arapfaik/scraping-glassdoor-selenium
Scraper Article: https://towardsdatascience.com/selenium-tutorial-scraping-glassdoor-com-in-10-minutes-3d0915c6d905
Flask Productionization: https://towardsdatascience.com/productionize-a-machine-learning-model-with-flask-and-heroku-8201260503d2
Web Scraping

Data-Scraping / Data-Collection

This is the first step where the script is written to pull-out the information available on web-page(information stored by web-elements and fields: -span,div, li, etc.)of the website. After pulling out the information it is stored in file for further processing.

P.S: the code has been modified according to the updated structure of the website for webscraping, there are fields/data that I couldn't able to find, for reading purposes the old lines of code is commented and updated code added underneath for better understanding.please take a note that I have changed the names of the column and files according to my need, if you are copy pasting this code you have to look for syntax errors in names of files and data-columns that are used in tutorial.

glassdoor_scrapper.py (this is the modified version for re-structured website) click the link below:
https://github.com/wizrox/dsSlryProj/blob/master/glassdoor_scrapper.py

Data Cleaning

After scraping the data, It has to be clean and arrange to select the information suitable for Data EDA(Exploration and Data Analysis). this is done through the script which drop the redundant and excessive tables and keep the valid data/fields in column. upon the extraction of the necessary data , it will be stored in another csv file for further computation and processing.

Data EDA

After getting the relevant information in the form of data-set, There will be applied computational and mathemetical formulas to analyze the patterns and accuracy of the information, there are various techniques used to find out the accurate prediction of the result. after listing the comparison of the output from different models and techniques. A 'pickle' file is created whcih will be imported later on to used with different application scenario like client/server or service call to API.

Productionization

After making a pickle file, to implement it in a client/server scenario, Flask API is downloaded and used to stimulate as a local server for listening to client request.

ds_salary_proj

#Resources:

https://www.youtube.com/watch?v=GmW4F6MHqqs&list=PL2zq7klxX5ASFejJj80ob9ZAnBHdz5O1t

https://towardsdatascience.com/selenium-tutorial-scraping-glassdoor-com-in-10-minutes-3d0915c6d905

https://github.com/PlayingNumbers/ds_salary_proj

https://github.com/arapfaik/scraping-glassdoor-selenium

Name		Name	Last commit message	Last commit date
Latest commit History 40 Commits
.ipynb_checkpoints		.ipynb_checkpoints
ChromeDriver		ChromeDriver
FirefoxDriver		FirefoxDriver
FlaskAPI		FlaskAPI
__pycache__		__pycache__
README.md		README.md
any.py		any.py
data_cleaning.py		data_cleaning.py
data_eda.ipynb		data_eda.ipynb
eda_data.csv		eda_data.csv
glassdoor_scrapper.py		glassdoor_scrapper.py
glsDr_jobs.csv		glsDr_jobs.csv
model_building.py		model_building.py
model_file.p		model_file.p
salary_data_cleansed.csv		salary_data_cleansed.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Data Science Salary Estimator: Project Overview

Resources & Knowledge-base:

Code and Resources Used

Data-Scraping / Data-Collection

Data Cleaning

Data EDA

Productionization

ds_salary_proj

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Data Science Salary Estimator: Project Overview

Resources & Knowledge-base:

Code and Resources Used

Data-Scraping / Data-Collection

Data Cleaning

Data EDA

Productionization

ds_salary_proj

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages