This project collects Indeed job offers and stores them in a database.
It uses Python and the Selenium package to scrape data from the Indeed website and stores that data in a PostgreSQL database.
Note that this project was built on Linux.
Getting started with PostgreSQL and setting up the service connection file
Point PostgreSQL clients at the service connection file:
export PGSERVICEFILE=.pg_service.conf
Create the tables:
psql "service=offers" < sql/tables/create_tables.sql
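The service=offers connection string refers to a section named offers in the file that PGSERVICEFILE points to. That file is INI-style, so a minimal sketch of generating one with Python's configparser could look like the following. Every value below (host, port, database name, user) is a placeholder assumed for illustration, not the project's actual configuration, and build_service_file is a hypothetical helper name.

```python
import configparser
import io


def build_service_file(service: str, settings: dict) -> str:
    """Return the text of a service file containing a single service section."""
    parser = configparser.ConfigParser()
    parser[service] = settings
    buffer = io.StringIO()
    # Write entries as "key=value" with no spaces around the delimiter.
    parser.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()


# Placeholder connection settings (assumptions); replace them with your own.
example_settings = {
    "host": "localhost",
    "port": "5432",
    "dbname": "offers",
    "user": "postgres",
}

service_file_text = build_service_file("offers", example_settings)
print(service_file_text)
```

Saving the printed text as .pg_service.conf in the project directory would give psql a section to resolve when it sees "service=offers".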
Create the functions:
psql "service=offers" < sql/functions/scraped_stats.sql
psql "service=offers" < sql/functions/update_timestamp_on_updates.sql
Create the triggers:
psql "service=offers" < sql/triggers/trigger_jobs.sql
Create a virtual environment and activate it:
python3 -m venv venv
source venv/bin/activate
Install the dependencies:
pip3 install -r requirements.txt
A few things to note before launching the indeed_scraping.py script:
In the __name__ == '__main__' part: if you don't want the Google Chrome browser window to open, set the boolean (the last argument of each call) to True:
soup_list = scrap_offer_indeed(list_keyword, offer_age, indeed_country, False)
"""
...
...
"""
df_description = scrap_indeed_description(id_offers_to_scrap, url_offers_to_scrap, False)
You can also set how many scraping iterations to run in the for loop:
for i in range(10):
Edit the number passed to range() to change it.
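The source does not show how the boolean is used inside scrap_offer_indeed or scrap_indeed_description. A common way to run Chrome without a visible window from Selenium is to pass a headless argument through ChromeOptions, and the sketch below assumes that approach. The helper names chrome_arguments and build_driver are hypothetical and are not taken from indeed_scraping.py.

```python
def chrome_arguments(headless: bool) -> list:
    """Return the Chrome command-line arguments for the requested mode."""
    arguments = []
    if headless:
        # "--headless=new" asks recent Chrome versions to run without a window.
        arguments.append("--headless=new")
    return arguments


def build_driver(headless: bool):
    """Create a Chrome WebDriver (needs selenium and Chrome to be installed)."""
    # Imported here so chrome_arguments can be used without selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for argument in chrome_arguments(headless):
        options.add_argument(argument)
    return webdriver.Chrome(options=options)


print(chrome_arguments(True))
print(chrome_arguments(False))
```

With this layout, passing True corresponds to a run with no browser window, which matches the behaviour described for the boolean above.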
Launch the indeed_scraping.py script:
python3 indeed_scraping.py
Get statistics for the jobs table:
psql "service=offers" -c "SELECT scraped_stats();"
Output:
{
"scraped": 30,
"total_jobs": 977,
"scrap_progress": 0.03,
"to_scrap": 947
}
Get a sample row from the jobs table:
psql "service=offers" -c "SELECT id, job_title, company_name, company_rating, scraped_at, url FROM jobs LIMIT 1"
Output:
| id | job_title | company_name | company_rating | scraped_at | url |
|---|---|---|---|---|---|
| 5143d988833fff26 | Procurement Data Analyst | PCL Construction | 3.8 | 2023-01-03 17:38:41 | https://ca.indeed.com/viewjob?jk=5143d988833fff26 |
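The figures in the scraped_stats() output shown earlier are related by simple arithmetic: to_scrap is total_jobs minus scraped, and scrap_progress is the scraped fraction. The SQL definition is not reproduced here, so the Python sketch below only mirrors those relationships. Rounding to two decimals is an assumption inferred from the sample output, and compute_scraped_stats is a hypothetical name.

```python
def compute_scraped_stats(total_jobs: int, scraped: int) -> dict:
    """Rebuild the four fields of the stats output from two counts."""
    to_scrap = total_jobs - scraped
    # Assumed rounding: two decimals, as in the sample output (0.03).
    progress = round(scraped / total_jobs, 2) if total_jobs else 0.0
    return {
        "scraped": scraped,
        "total_jobs": total_jobs,
        "scrap_progress": progress,
        "to_scrap": to_scrap,
    }


stats = compute_scraped_stats(total_jobs=977, scraped=30)
print(stats)
```

With the counts from the sample output (977 jobs, 30 scraped) this gives a progress of 0.03 and 947 offers left to scrape.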
