SiteToSheet is a Python project that combines web scraping, natural language processing, and Google API integration to extract data from websites and store it in Google Sheets.

## Installation

This project requires Python 3.7 or later. To set up the project environment:
- Clone the repository:
- Create a virtual environment: `python -m venv venv`
- Activate the virtual environment:
  - On Windows: `venv\Scripts\activate`
  - On macOS and Linux: `source venv/bin/activate`
- Install the required packages: `pip install -r requirements.txt`

## Dependencies

This project relies on several key libraries:
- Web Scraping: BeautifulSoup4
- Natural Language Processing: spaCy (with the `en_core_web_sm` model)
- Google API Integration: google-api-python-client, gspread
- Data Manipulation: pandas, numpy
- Mapping: googlemaps
- Environment Management: python-dotenv
- Rate Limiting: ratelimit
For a complete list of dependencies, see the requirements.txt file.
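
To confirm that the libraries listed above are installed in the active environment, a small check along these lines may help. This is a minimal sketch, not part of the project: the function name `missing_modules` and the `REQUIRED_MODULES` list are hypothetical, and the import names are the usual ones for these packages (for example, `bs4` for BeautifulSoup4 and `dotenv` for python-dotenv).

```python
from importlib.util import find_spec
from typing import Iterable, List

# Import names assumed for the packages listed above (hypothetical list;
# requirements.txt remains the authoritative source).
REQUIRED_MODULES = [
    "bs4",              # BeautifulSoup4
    "spacy",            # spaCy
    "en_core_web_sm",   # spaCy English model
    "googleapiclient",  # google-api-python-client
    "gspread",
    "pandas",
    "numpy",
    "googlemaps",
    "dotenv",           # python-dotenv
    "ratelimit",
]


def missing_modules(names: Iterable[str] = REQUIRED_MODULES) -> List[str]:
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed modules were found.")
```

Running the script inside the activated virtual environment prints any modules that still need to be installed.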

## Configuration

- Set up a Google Cloud project and enable the necessary APIs (Sheets, Maps). Follow the steps at https://developers.google.com/sheets/api/quickstart/python#set-up-environment, completing everything up to but not including "Install the Google Client Library".
- Create and download a `sheet_credentials.json` file for Google API authentication. The file must be named `sheet_credentials.json`. Unlike the tutorial, store it in your `LocalAppData/SiteToSheet` folder, which on Windows machines is typically `C:\Users\{user}\AppData\Local\SiteToSheet`. If it is not present, the package will ask for its creation.
- Create a `.env` file in the project root alongside your `sheet_credentials.json` and add your API keys:
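
The credentials location described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the package's actual lookup code: the function name `expected_credentials_path` is hypothetical, and the sketch only covers Windows, where the `LOCALAPPDATA` environment variable points at the `AppData\Local` folder. On other systems it returns `None` rather than guessing a location.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

APP_FOLDER = "SiteToSheet"
CREDENTIALS_FILE = "sheet_credentials.json"


def expected_credentials_path(
    env: Mapping[str, str] = os.environ,
) -> Optional[Path]:
    """Build the expected credentials path from LOCALAPPDATA.

    Returns None when LOCALAPPDATA is not set (for example on macOS or
    Linux), because the location on those systems is not covered here.
    """
    local_app_data = env.get("LOCALAPPDATA")
    if not local_app_data:
        return None
    return Path(local_app_data) / APP_FOLDER / CREDENTIALS_FILE


if __name__ == "__main__":
    path = expected_credentials_path()
    if path is None:
        print("LOCALAPPDATA is not set; cannot infer the credentials folder.")
    elif path.is_file():
        print(f"Found credentials at {path}")
    else:
        print(f"No credentials file at {path}")
```

Passing the environment as a parameter keeps the function easy to check without touching the real user profile.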

## Usage

Once all credentials are present, you can begin filling in your Google spreadsheet with the data you wish to search.
#TODO include video of excel spreadsheet being filled in

## Features

- Web scraping with BeautifulSoup4
- Natural language processing with spaCy
- Google Sheets integration for data storage
- Google Maps API for geolocation services
- Rate limiting to respect API usage limits
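
To illustrate the rate-limiting idea in the last item, here is a minimal sliding-window limiter written with the standard library only. It is a stand-in for illustration and not the project's code: the project lists the `ratelimit` package for this purpose, and the decorator name `rate_limited` and its parameters are hypothetical.

```python
import time
from collections import deque
from functools import wraps


def rate_limited(max_calls, period, clock=time.monotonic, sleep=time.sleep):
    """Allow at most `max_calls` calls per `period` seconds.

    When the limit is reached, the call blocks until the oldest call
    leaves the time window. `clock` and `sleep` are injectable so the
    behaviour can be checked without real waiting.
    """
    def decorator(func):
        calls = deque()  # timestamps of recent calls, oldest first

        @wraps(func)
        def wrapper(*args, **kwargs):
            while True:
                now = clock()
                # Forget calls that have left the time window.
                while calls and now - calls[0] >= period:
                    calls.popleft()
                if len(calls) < max_calls:
                    break
                # Wait until the oldest recorded call expires.
                sleep(period - (now - calls[0]))
            calls.append(now)
            return func(*args, **kwargs)

        return wrapper

    return decorator


# Hypothetical usage: at most 5 lookups per second.
@rate_limited(max_calls=5, period=1.0)
def lookup(address):
    # Placeholder for a real API request.
    return f"looked up {address}"
```

Blocking until a slot frees up, rather than raising an error, keeps calling code simple when requests are issued in a loop over spreadsheet rows.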

## License

This project is licensed under the MIT License. See the LICENSE file for details.
This project uses the following third-party services and libraries:
- Google Maps Distance Matrix API: Subject to the Google Maps Platform Terms of Service
- Google Sheets API: Subject to the Google APIs Terms of Service
- spaCy: Licensed under the MIT License
Users of this software are responsible for ensuring their own compliance with the terms of these services and libraries.

## Acknowledgments

This project makes use of several open-source libraries and APIs. We thank the maintainers and contributors of these projects.