A small Python/Selenium project that:
- Reads OA.Report pages as seen by users (`dev` and `staging`, which reflect the `beta` and `api` API endpoints respectively).
- Extracts figures from the Insights, Explore, and Actions sections.
- Writes them into Google Sheets so that QA can compare snapshots across dates and spot any abnormal regressions or spikes quickly. It also feeds into an archive kept for historical tracking purposes.
It runs daily via a GitHub Actions workflow.
For each task run via GitHub Actions, we target the website (not the API directly) in two environments:
- Staging website → `https://staging.oa.report/<org-slug>?orgkey=...`
- Dev website → `https://dev.oa.report/<org-slug>?orgkey=...`
For each environment, we scrape three sections of each organisation’s page:
- Insights – `extractors/insights.py`
- Explore – `extractors/explore.py`
- Actions – `extractors/actions.py`
That’s 3 sections × 2 envs = 6 scrapes for each organisation’s OA.Report per full daily run.
The configuration lives in `config/settings.yaml` and explicitly lists, per environment (a minimal loading sketch follows this list):
- The exact org URLs to visit (with their `orgkey`) for Insights, Explore, and Actions.
- Which Google Sheet each scrape should write to, split by env (e.g., `api_insights_parsed_data`, `beta_explore_parsed_data`, etc.).
- Additional settings:
  - `explore.years_to_keep: 5` (keep five years of Explore data)
  - XPaths for each section (e.g., `year_buttons`, `insights_cards`, `explore_table`, `actions_buttons`)
  - Small delays (`page_load`, `data_load`)
  - Output filenames per section (env-specific names derived at runtime)
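A minimal sketch of how an extractor might read these settings; the key names below are illustrative assumptions, and the authoritative layout is `config/settings.template.yaml`:

```python
# Illustrative only: key names are assumed, not guaranteed to match settings.yaml.
import yaml

with open("config/settings.yaml") as fh:
    settings = yaml.safe_load(fh)

env = "staging"  # or "dev"

org_urls = settings[env]["insights"]["urls"]          # org URLs incl. their orgkey
sheet_name = settings[env]["insights"]["sheet"]       # e.g. api_insights_parsed_data
years_to_keep = settings["explore"]["years_to_keep"]  # 5: keep five years of Explore data
xpaths = settings["xpaths"]                           # year_buttons, insights_cards, explore_table, ...
delays = settings["delays"]                           # page_load, data_load
```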
Each extractor reads the config, opens a headless Selenium browser, and navigates to the configured URLs for the chosen env.
For each section, generally, one row = one piece of data.
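For illustration, the navigation step might look roughly like this. It is a sketch only, assuming Selenium 4 with Chrome; the URL, XPath, and delay are placeholders for values read from `settings.yaml`:

```python
# Illustrative sketch; the real logic lives in extractors/*.py.
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

PAGE_LOAD_DELAY = 5  # placeholder; the real value comes from settings.yaml (delays.page_load)
CARDS_XPATH = "//div[contains(@class, 'insights')]"  # placeholder for the configured XPath

options = Options()
options.add_argument("--headless=new")  # no visible browser window
driver = webdriver.Chrome(options=options)

driver.get("https://staging.oa.report/<org-slug>?orgkey=...")  # URL comes from settings.yaml
time.sleep(PAGE_LOAD_DELAY)  # give the page time to render its figures

# One scraped element generally becomes one output row.
rows = [el.text for el in driver.find_elements(By.XPATH, CARDS_XPATH)]

driver.quit()
```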
Insights (`extractors/insights.py`):
- Visits each configured org page (on staging/dev), as set in `settings.yaml`.
- Captures All time and the last two years, including the current year.
- Collects the snapshot’s date range, the metric’s name, its value (percentages for most metrics, e.g. “with open code”, and an absolute value for “total publications” only), the page URL, and the collection date & time. An illustrative row follows this list.
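For illustration only, a single collected Insights row might look like this; the field names and values are hypothetical, not the parser’s actual schema:

```python
# Hypothetical example of one Insights row; every value here is made up.
row = {
    "date_range": "All time",
    "metric": "with open code",
    "value": "12%",  # percentages for most Insights metrics; absolute only for "total publications"
    "url": "https://staging.oa.report/<org-slug>?orgkey=...",
    "collected_at": "2025-01-01 02:00 UTC",
}
```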
Explore (`extractors/explore.py`):
- Visits each configured org page (on staging/dev), as set in `settings.yaml`.
- Takes a snapshot of the last five years (the one exception, set via `explore.years_to_keep` in `settings.yaml`).
- Scrapes the Explore table and flattens/pivots it so that one row = one metric (e.g., `OPEN ACCESS = 73` for a given year); a sketch of this step follows the list.
- Collects the snapshot’s date range, the metric’s name, its value (always in absolute numbers), the page URL, and the collection date & time.
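A sketch of that flatten/pivot step, assuming the scraped Explore table is held in a pandas DataFrame with one column per year; the metric names and numbers below are made up:

```python
# Illustrative only: reshape a wide Explore table (one column per year)
# into long form, so that one row = one metric for one year.
import pandas as pd

wide = pd.DataFrame({
    "metric": ["OPEN ACCESS", "TOTAL PUBLICATIONS"],
    "2023": [73, 110],
    "2024": [80, 95],
})

long_form = wide.melt(id_vars="metric", var_name="year", value_name="value")
print(long_form)
#                metric  year  value
# 0         OPEN ACCESS  2023     73
# 1  TOTAL PUBLICATIONS  2023    110
# 2         OPEN ACCESS  2024     80
# 3  TOTAL PUBLICATIONS  2024     95
```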
Actions (`extractors/actions.py`):
- Visits each configured org page (on staging/dev), as set in `settings.yaml`.
- Captures All time and the last two years, including the current year.
- Collects the snapshot’s date range, the metric’s name, its value (always in absolute numbers), the page URL, and the collection date & time.
Remember: this reads what the UI renders (i.e. the exact figures a user sees), so this part of QA reflects the true user-facing state without requiring a human to navigate to every organisation and track its figures by hand.
- Local CSVs per section/env for quick artifacts.
- Google Sheets via `export/google_sheets.py`:
  - Uses a service account (`config/google_creds.json`) with `gspread`/`oauth2client`.
  - Includes basic retry logic for rate limits (“Quota exceeded”), retrying up to 3 times with a delay; a sketch follows this list.
  - Appends rows to one of the six existing Google Sheets (one for each task run) specified in `settings.yaml`.
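A minimal sketch of that retry pattern, assuming `gspread` with `oauth2client` service-account credentials; the helper name, back-off delay, and example row are illustrative, and the real code is in `export/google_sheets.py`:

```python
# Illustrative retry wrapper around a Google Sheets append, assuming gspread
# with oauth2client service-account credentials; names and delays are examples.
import time

import gspread
from oauth2client.service_account import ServiceAccountCredentials

SCOPES = [
    "https://spreadsheets.google.com/feeds",
    "https://www.googleapis.com/auth/drive",
]

creds = ServiceAccountCredentials.from_json_keyfile_name("config/google_creds.json", SCOPES)
client = gspread.authorize(creds)


def append_rows_with_retry(sheet_title, rows, retries=3, delay=30):
    """Append rows to the first worksheet, retrying on rate-limit ("Quota exceeded") errors."""
    worksheet = client.open(sheet_title).sheet1
    for attempt in range(1, retries + 1):
        try:
            worksheet.append_rows(rows)
            return
        except gspread.exceptions.APIError as exc:
            if "Quota exceeded" in str(exc) and attempt < retries:
                time.sleep(delay)  # back off, then retry
            else:
                raise


# Example call with a made-up row, targeting a sheet named in settings.yaml.
append_rows_with_retry("api_insights_parsed_data", [["All time", "with open code", "12%"]])
```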
- The GitHub Actions workflow (`.github/workflows/schedule.yml`) runs the six scrapes daily (and can be triggered manually).
- On failure, the workflow sends a notification via email, which should be routed to Front.
- The “Statistics QA” Google Sheet is the human-friendly hub.
- Because the parser writes timestamped snapshots daily, QA can pick two dates (often 2–3 weeks apart) and use Sheet formulas to compute deltas and flag anomalies (spikes/drops).
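As an illustration of the kind of delta check done in the Sheet, here is a rough pandas equivalent; the CSV name, column names, dates, and threshold are all hypothetical:

```python
# Illustrative only: flag metrics whose value changed sharply between two snapshot dates.
import pandas as pd

snapshots = pd.read_csv("insights_staging.csv")  # hypothetical local artifact with numeric values

before = snapshots[snapshots["collected_at"].str.startswith("2025-01-01")]
after = snapshots[snapshots["collected_at"].str.startswith("2025-01-22")]  # ~3 weeks later

merged = before.merge(after, on=["metric", "date_range"], suffixes=("_before", "_after"))
merged["delta_pct"] = (merged["value_after"] - merged["value_before"]) / merged["value_before"] * 100

# Anything moving more than 20% (an arbitrary example threshold) is worth a closer look.
print(merged.loc[merged["delta_pct"].abs() > 20, ["metric", "date_range", "delta_pct"]])
```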
- No API reading: the parser does not call `api.oa.works`/`beta.oa.works` directly; it reads the websites (`staging.oa.report`, `dev.oa.report`), i.e. what the API outputs to the UI.
- Independent from Ghost Inspector: GI is a separate front-end test/alert system. The parser is for historical QA via Sheets and is not coupled to GI.
```
.
├── config/                          # Configuration files (never commit secrets; both files are in .gitignore)
│   ├── google_creds.json            # Google service account credentials
│   ├── google_creds.template.json   # Template for google_creds.json — remove .template
│   ├── settings.yaml                # List of parsed URLs, XPaths, output files, Google Sheets mapping
│   └── settings.template.yaml       # Template for settings.yaml — remove .template
├── extractors/                      # Main parser scripts
│   ├── insights.py                  # Parses Insights data
│   ├── compare_snapshots.py         # Compares two data snapshots — not in use at the moment
│   ├── explore.py                   # Parses Explore data
│   └── actions.py                   # Parses Actions data
├── export/                          # Google Sheets export handler
│   └── google_sheets.py             # Uploads data frames to Google Sheets
├── .github/
│   └── workflows/schedule.yml       # GitHub Actions runner (scheduled daily)
└── requirements.txt                 # Python dependencies
```
This project requires two configuration files in the `config/` directory, which the user must set up:
- `config/google_creds.json` – for Google Sheets API authentication
- `config/settings.yaml` – for parser options, including URLs, XPath selectors, and Google Sheet names
This file contains credentials for the Google service account that uploads data to Google Sheets.
- Go to Google Cloud Console
- Create or select the `OAREPORT-PARSER` project
- Go to APIs & Services → Credentials
- Click Create Credentials → Service Account
- Follow the prompts (you can skip role assignment)
- After creating the service account:
  - Go to the account → Keys tab → Add Key → Create new key → JSON
  - Download and save it as `config/google_creds.json`
  - Alternatively, use the template provided in this repo (`config/google_creds.template.json`): rename it to `config/google_creds.json` and replace its contents with the JSON key downloaded from the Google Cloud Console
- Share your target Google Sheets with the service account email (e.g. `...@...iam.gserviceaccount.com`), giving it Editor access
This file contains:
- URLs to be parsed (i.e. organisational OA.Reports)
- XPath selectors (to target specific page elements)
- Output filenames (if generating CSVs on your own machine)
- Delay durations (time to wait after certain UI interactions)
- Google Sheets export details (for automated exports)
This is already set up to automatically run daily.
Go to GitHub → Settings → Secrets and Variables → Actions
Create the following secrets under Repository secrets:
| Name | Value |
|---|---|
| `GOOGLE_CREDS_JSON` | Paste the contents of `google_creds.json` |
| `SETTINGS_YAML` | Paste the contents of `settings.yaml` |
The current workflow will:
- Run daily at 2am UTC
- Or can be triggered manually via the GitHub UI
Go to GitHub → Actions → Daily OA.Report Parsing → Run workflow
Triggering the workflow manually runs all of the parsers. Every run, whether failed or successful, is also listed on this page.
Useful for one-off exports to a single CSV file, or for testing changes to the parsers before committing them to the repository.
- Set up your Python environment:

  ```bash
  python3 -m venv venv
  source venv/bin/activate
  pip install -r requirements.txt
  ```

- Run any parser manually:

  ```bash
  # For Insights
  python extractors/insights.py --env staging
  python extractors/insights.py --env dev

  # For Explore
  python extractors/explore.py --env staging
  python extractors/explore.py --env dev

  # For Actions
  python extractors/actions.py --env staging
  python extractors/actions.py --env dev
  ```

The data will be exported both to CSV (in your local project folder) and to the configured remote Google Sheet.