- Project Purpose
- Features
- Infrastructure Overview
- Data Processing Flow
- Google Sheet Structure
- Prerequisites
- Setup and Configuration
- Running the Workflow
- Efficiency and Performance Optimization
- Limitations
- Troubleshooting
The goal of this project is to leverage Google Sheets to obtain financial data for free, without relying on third-party APIs, using only Google Finance. The project also integrates BigQuery and Dataform for data transformation, along with Google Workflows for orchestration and Google Sheets management. Firestore is used to store configurations and currency pairs.
The infrastructure is built around a workflow that automates the creation and management of a Google Sheet and schedules Dataform flows. Firestore is used to store information about currency pairs and the Google Sheet, ensuring persistence and reuse of the spreadsheet id. A collection of documents in Firestore represents the currency exchange rates to be calculated.
- Google Sheets as a Data Source:
- Google Sheets is used to fetch financial data via Google Finance.
- A specific structure (detailed in Google Sheet Structure) is used to manage historical and current data efficiently. The Google Sheet is created by the service account managing the workflow and is shared as read-only for all users. This ensures idempotency, as only the service account can modify the sheet, eliminating user-induced errors.
- Data Ingestion in BigQuery:
- BigQuery reads data from the Google Sheet using external tables that reference the spreadsheet.
- This enables real-time access to updated exchange rates without requiring additional API calls.
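As a sketch of how such an external table might be created with the `bq` CLI (the spreadsheet ID, dataset, and table names below are illustrative placeholders, not the project's actual names):

```shell
# Build an external table definition pointing BigQuery at the Google Sheet.
bq mkdef --source_format=GOOGLE_SHEETS --autodetect \
  "https://docs.google.com/spreadsheets/d/SPREADSHEET_ID" > rates_def.json

# Create the external table in the landing dataset from that definition.
bq mk --external_table_definition=rates_def.json landing.rates_history_ext
```

To read a specific sheet (e.g., the values-only history sheet) rather than the first one, the generated definition JSON can be edited to set `googleSheetsOptions.range` before running `bq mk`.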
- Data Transformation with Dataform:
- The workflow calculates the historical exchange rate for one currency pair at a time.
- It materializes the data in BigQuery by reading from the external table in the landing dataset.
- Using an incremental table, the processed data is stored in the commons dataset, where the data is fully consolidated.
- This process is repeated for all currency pairs that have not been previously calculated.
- At the end of the process, the workflow triggers a final step that reads only the exchange rates for yesterday and today, integrating them into the historical dataset. This ensures that even when executed daily, only a small amount of data is processed.
- The exchange rates for today are managed as a view, combining historical data with today’s rates, since they fluctuate throughout the day and are not yet final.
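A minimal sketch of such a view, assuming hypothetical dataset, table, and column names (the project's actual Dataform definitions are not shown here):

```shell
# Hypothetical: union consolidated history with today's live rates.
# All dataset/table/column names are illustrative assumptions.
bq query --use_legacy_sql=false '
CREATE OR REPLACE VIEW commons.rates_with_today AS
SELECT pair, rate_date, rate FROM commons.rates_history
UNION ALL
SELECT pair, rate_date, rate FROM landing.current_rates_ext'
```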
- Workflow Orchestration with Google Workflows:
- A Google Workflow manages the creation and configuration of the spreadsheet.
- It also triggers and schedules Dataform execution for continuous data updates.
- Details on running the workflow and its parameters are covered in the Running the Workflow section.
- Firestore for Configuration Management:
- Firestore stores essential configurations, such as the `master_spreadsheet_id` and the list of `currency-pairs` to process. This ensures persistence and allows dynamic management of currency pairs.
The Google Sheet is meticulously structured to optimize data retrieval and processing:
- `HIDDEN_RATES_HISTORY`: Stores the full exchange rate history using Google Finance formulas. This sheet is primarily for calculation and is not directly queried by BigQuery for historical data due to potential performance issues with live formulas.
- `RATES_HISTORY`: A copy of the historical data from `HIDDEN_RATES_HISTORY` (values only, no formulas). This sheet is used by BigQuery as the source for historical data, ensuring stable and efficient reads.
- `PREV_DAY_RATES`: Stores exchange rates for the previous day. This helps in incremental updates.
- `CURRENT_RATES`: Holds exchange rates for the current day, fetched live.
This structure ensures that historical data is computed and stored efficiently, and daily updates append new information without reprocessing the entire dataset.
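For reference, a historical currency series in `HIDDEN_RATES_HISTORY` would typically come from a `GOOGLEFINANCE` formula along these lines (the exact cell layout and date arguments used by the workflow are not shown here; the dates are examples):

```
=GOOGLEFINANCE("CURRENCY:USDEUR", "price", DATE(2020,1,1), TODAY(), "DAILY")
```

This returns a two-column range of dates and closing rates, which is why the values are copied into `RATES_HISTORY` as static data before BigQuery reads them.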
Before deploying and running this project, ensure you have the following:
- A Google Cloud Platform (GCP) Project.
- The `gcloud` command-line tool installed and configured.
- Permissions to enable APIs, create service accounts, and manage GCP resources (BigQuery, Dataform, Workflows, Firestore, Google Sheets).
Follow these steps to set up the environment:
Ensure your GCP project is selected and you have appropriate billing enabled.
Enable the following APIs in your GCP project:
- Google Drive API
- Google Sheets API
- BigQuery API
- Dataform API
- Workflows API
- Firestore API
- IAM Service Credentials API
- Cloud Scheduler API (if you plan to schedule the workflow automatically)
You can enable them via the GCP Console or using `gcloud services enable [SERVICE_NAME]`.
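For example, the list above can be enabled in a single command (service names per the GCP documentation):

```shell
gcloud services enable \
  drive.googleapis.com \
  sheets.googleapis.com \
  bigquery.googleapis.com \
  dataform.googleapis.com \
  workflows.googleapis.com \
  firestore.googleapis.com \
  iamcredentials.googleapis.com \
  cloudscheduler.googleapis.com  # only needed if scheduling the workflow
```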
Create a dedicated service account for this project. This service account will need the following roles (or equivalent custom roles):
- BigQuery: `BigQuery Data Editor` (to manage tables and run queries), `BigQuery User` (to run jobs).
- Dataform: `Dataform Editor` (to trigger Dataform runs).
- Workflows: `Workflows Invoker` (to execute workflows).
- Firestore: `Cloud Datastore User` (to read and write configuration data).
- Service Account: `Service Account Token Creator` (if the workflow needs to act as this service account).
- Cloud Logging: `Logs Writer` (to write logs).
Download the JSON key for this service account if needed for local development or specific authentication methods, though Workflows can run with a specified service account directly.
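A sketch of creating the service account and binding the roles listed above (the account name and `PROJECT_ID` are illustrative placeholders; the `roles/...` IDs are the standard predefined roles corresponding to those names):

```shell
# Create the dedicated service account (name is an example).
gcloud iam service-accounts create currency-exchange-wf \
  --display-name="Currency exchange workflow"

SA="currency-exchange-wf@PROJECT_ID.iam.gserviceaccount.com"

# Grant each required role at the project level.
for role in roles/bigquery.dataEditor roles/bigquery.user \
            roles/dataform.editor roles/workflows.invoker \
            roles/datastore.user roles/iam.serviceAccountTokenCreator \
            roles/logging.logWriter; do
  gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:${SA}" --role="${role}"
done
```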
Firestore is used to store metadata and configuration.
- Create a Firestore Database: In your GCP project, create a Firestore database in Native mode. Choose a region.
- Data Structure: The project expects a specific structure in Firestore. The main database (e.g., `currency-exchange` or default) should contain:
  - A collection named `configurations`.
  - Within `configurations`, a document named `wf-currency-exchange`. This document holds:
    - `master_spreadsheet_id` (string): Stores the ID of the Google Sheet used. This is managed by the workflow.
    - `startDate` (string, optional, format `YYYY-MM-DD`): The specific start date from which to fetch historical data. If not provided, the `defaultStartDate` from the workflow parameters will be used.
    - A sub-collection named `currency-pairs`.
  - The `currency-pairs` sub-collection contains documents where each document ID is the currency pair (e.g., `USDEUR`, `GBPEUR`). Each currency pair document can have:
    - `status` (string): Operational field managed by the workflow (e.g., "SUCCEEDED", "FAILED").
Example Firestore Structure:
```
🗄️ currency-exchange
└─ 📂 configurations
   └─ 📃 wf-currency-exchange
      ├─ master_spreadsheet_id (string, operational)
      ├─ start_date (string, operational)
      └─ 📂 currency-pairs
         ├─ 📃 USDEUR
         │  └─ status (string, operational)
         └─ 📃 GBPEUR
            └─ status (string, operational)
```
You will need to manually create the `configurations/wf-currency-exchange` document initially if the workflow doesn't create it. The `currency-pairs` documents need to be added to define which exchange rates to process.
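One way to create these documents from the command line is the Firestore REST API. In this sketch, `PROJECT_ID` is a placeholder, the field values are examples, and `(default)` assumes the default database (a named database such as `currency-exchange` would replace that path segment):

```shell
TOKEN=$(gcloud auth print-access-token)
BASE="https://firestore.googleapis.com/v1/projects/PROJECT_ID/databases/(default)/documents"

# Create configurations/wf-currency-exchange with an optional startDate.
curl -X POST "${BASE}/configurations?documentId=wf-currency-exchange" \
  -H "Authorization: Bearer ${TOKEN}" -H "Content-Type: application/json" \
  -d '{"fields": {"startDate": {"stringValue": "2020-01-01"}}}'

# Add a currency pair document under the currency-pairs sub-collection.
curl -X POST "${BASE}/configurations/wf-currency-exchange/currency-pairs?documentId=USDEUR" \
  -H "Authorization: Bearer ${TOKEN}" -H "Content-Type: application/json" \
  -d '{"fields": {}}'
```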
- Create a Dataform repository in your GCP project.
- Connect it to your Git provider or use the default Dataform core.
- Populate the repository with your Dataform SQLX files for data transformation.
- Configure the `dataform.json` or individual file configurations to use the correct datasets (e.g., `landing` for sources, `commons` for output).
- Ensure the Dataform service account (usually configurable in Dataform settings) has BigQuery permissions.
- Write your workflow definition in a YAML file (e.g., `currency_exchange_workflow.yaml`).
- Deploy the workflow using `gcloud`:

```shell
gcloud workflows deploy YOUR_WORKFLOW_NAME \
  --source=currency_exchange_workflow.yaml \
  --location=YOUR_REGION \
  --service-account=YOUR_SERVICE_ACCOUNT_EMAIL
```
The workflow orchestrates the entire data retrieval and processing pipeline. It has two execution modes:
- Without Parameters: Runs with default settings. The workflow will attempt to use an existing spreadsheet ID from Firestore or create a new one. It will process currency pairs based on their status in Firestore.
- With Parameters (JSON format): Accepts the following optional parameters:
  - `fullRefresh` (boolean, optional, default: `false`): If `true`, all BigQuery tables in the `commons` dataset managed by Dataform for this process will be rebuilt from scratch. This forces a full recalculation of all historical data for all pairs.
  - `recreateSpreadsheet` (boolean, optional, default: `false`): If `true`, the workflow will delete the existing Google Sheet (if its ID is found in Firestore) and create a brand new one. This is useful if the sheet structure is corrupted or needs a fresh start.
  - `defaultStartDate` (string, optional, format `YYYY-MM-DD`): The start date used for fetching historical data when no `startDate` is set in Firestore.
Example of running with parameters:
```shell
gcloud workflows execute YOUR_WORKFLOW_NAME \
  --location=YOUR_REGION \
  --data='{"fullRefresh": true, "recreateSpreadsheet": false, "defaultStartDate": "2020-01-01"}'
```

The workflow can also be triggered via the GCP Console or scheduled using Cloud Scheduler to call the Workflow execution API.
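Scheduling via Cloud Scheduler can be sketched as follows. The job name, cron expression, and placeholders (`PROJECT_ID`, region, workflow name, service account) are illustrative; the URI is the Workflow Executions API endpoint, and `argument` carries the workflow's JSON parameters as a string:

```shell
# Example: trigger the workflow daily at 06:00.
gcloud scheduler jobs create http currency-exchange-daily \
  --location=YOUR_REGION \
  --schedule="0 6 * * *" \
  --uri="https://workflowexecutions.googleapis.com/v1/projects/PROJECT_ID/locations/YOUR_REGION/workflows/YOUR_WORKFLOW_NAME/executions" \
  --http-method=POST \
  --message-body='{"argument": "{}"}' \
  --oauth-service-account-email=YOUR_SERVICE_ACCOUNT_EMAIL
```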
- Utilizing external tables in BigQuery avoids unnecessary duplication of data while ensuring real-time availability.
This approach provides a cost-effective, scalable, and automated solution for obtaining and processing financial data using Google’s ecosystem.
- Google Finance Data: The `GOOGLEFINANCE` function in Google Sheets has its own limitations:
  - Data is not real-time tick data; it can be delayed (e.g., up to 20 minutes).
  - There might be undocumented rate limits if the function is used excessively across many cells or sheets simultaneously, potentially leading to `#N/A` errors.
  - Historical data accuracy and availability depend on Google Finance.
- Google Sheets Quotas: Google Sheets has its own usage quotas (e.g., number of cells, complexity of formulas) that could be hit with a very large number of currency pairs or extremely long history.
- Error Handling: While the workflow aims for robustness, comprehensive error handling for all edge cases (e.g., Google Finance outages, unexpected Sheets behavior) requires careful design.
- Workflow Logs: Check Google Workflows execution history and logs for errors during orchestration.
- Dataform Execution Logs: Review Dataform run logs in the GCP console for issues related to SQL transformations or BigQuery jobs.
- BigQuery Job History: Examine BigQuery job history for errors related to queries, external table access, or data loading.
- Google Sheet Errors: Open the Google Sheet to check for any `#ERROR!`, `#N/A`, or other formula errors in the `HIDDEN_RATES_HISTORY` sheet.
- Permissions: Double-check that the service account used by the Workflow and Dataform has all the necessary IAM permissions.
