Python pipeline to:

- Read products from BigQuery (`bm_mock_data.products`)
- Read sales from GCS (`gs://bm_mock_sales/sales_YYYY-MM-DD.parquet`)
- Filter sales using a reference date with different granularities: day, month, quarter, year
- Join products and sales on `product_id`
- Optionally filter by `brand` and `product_id`
- Export the final dataset to a date-suffixed file (Parquet or CSV)
```
gcp-sales-pipeline/
├── .github/
│   └── workflows/
│       └── ci.yml                  # GitHub Actions CI (lint, unit, integration)
├── .gitignore                      # Excludes .env, data, venv, etc.
├── .pre-commit-config.yaml         # Local checks (black, flake8, tests)
├── pyproject.toml
├── README.md
├── src/
│   └── gcp_sales_pipeline/
│       ├── __init__.py
│       ├── bq_client.py
│       ├── cli.py
│       ├── exceptions.py
│       ├── filters.py
│       ├── gcs_client.py
│       └── pipeline.py
└── tests/
    ├── unit/
    │   └── test_filters.py
    └── integration/
        └── test_pipeline_integration.py
```

At runtime (not committed to git):

```
.env                            # contains GOOGLE_APPLICATION_CREDENTIALS
bot-sandbox-interviews-sa.json  # service account key (do NOT commit)
data/
└── export/                     # output files (ignored)
```
Sales files are expected to live in:
gs://bm_mock_sales/sales_YYYY-MM-DD.parquet
One file per day, e.g.:
- `gs://bm_mock_sales/sales_2025-01-01.parquet`
- `gs://bm_mock_sales/sales_2025-01-02.parquet`
- etc.
Each file should contain at least:
- `product_id` (STRING)
- `price` (FLOAT)
- `quantity` (INTEGER)
- `sold_at` (TIMESTAMP)
- `order_id` (STRING)
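For illustration, a single daily file could be loaded into pandas roughly as in the sketch below; `load_daily_sales` is a hypothetical helper shown only to make the layout concrete, not the pipeline's actual `gcs_client.py` implementation.

```python
# Minimal sketch (not the actual gcs_client.py): read one daily sales file from GCS.
import io
from datetime import date

import pandas as pd
from google.cloud import storage


def load_daily_sales(day: date, bucket_name: str = "bm_mock_sales") -> pd.DataFrame:
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(f"sales_{day.isoformat()}.parquet")
    data = blob.download_as_bytes()  # raises NotFound if that day has no file
    # Expected columns: product_id, price, quantity, sold_at, order_id
    return pd.read_parquet(io.BytesIO(data))
```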
The pipeline:
- Computes the date range corresponding to the requested granularity (day / month / quarter / year); a sketch of this computation follows the list.
- Loads daily Parquet files in that range.
- Filters again on `sold_at` to match the period exactly.
- Joins with product data from BigQuery.
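As a rough sketch, assuming a helper along these lines in `filters.py` (the real function name and signature may differ), the date-range computation could look like:

```python
# Illustrative sketch of the per-granularity date-range computation.
from datetime import date, timedelta


def compute_date_range(ref: date, granularity: str) -> tuple[date, date]:
    """Return the inclusive (start, end) dates covering `ref` for the given granularity."""
    if granularity == "day":
        return ref, ref
    if granularity == "month":
        start = ref.replace(day=1)
        next_month = date(ref.year + (ref.month == 12), ref.month % 12 + 1, 1)
        return start, next_month - timedelta(days=1)
    if granularity == "quarter":
        q_start_month = 3 * ((ref.month - 1) // 3) + 1
        start = date(ref.year, q_start_month, 1)
        end_month = q_start_month + 2
        next_month = date(ref.year + (end_month == 12), end_month % 12 + 1, 1)
        return start, next_month - timedelta(days=1)
    if granularity == "year":
        return date(ref.year, 1, 1), date(ref.year, 12, 31)
    raise ValueError(f"Unknown granularity: {granularity}")


# e.g. compute_date_range(date(2025, 4, 15), "quarter") -> (2025-04-01, 2025-06-30)
```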
Products are read from table:
bm_mock_data.products
Expected schema:
- `product_id` (STRING)
- `product_name` (STRING)
- `category` (STRING)
- `brand` (STRING)
- `condition` (STRING)
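A minimal sketch of the BigQuery read, assuming the standard `google-cloud-bigquery` client (the real `bq_client.py` may add validation and error handling on top of this):

```python
# Sketch only: load the products table into a DataFrame.
import pandas as pd
from google.cloud import bigquery


def load_products(project_id: str | None = None,
                  dataset: str = "bm_mock_data",
                  table: str = "products") -> pd.DataFrame:
    client = bigquery.Client(project=project_id)
    query = f"SELECT * FROM `{dataset}.{table}`"
    # to_dataframe() relies on pyarrow, which is already a runtime dependency
    return client.query(query).to_dataframe()
```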
- Python 3.10+
- Access to a GCP project with:
  - BigQuery dataset `bm_mock_data` and table `products`
  - GCS bucket `bm_mock_sales` with daily Parquet sales files
- A service account JSON file, e.g. `bot-sandbox-interviews-sa.json`
Create a `.env` file at the project root (this file is ignored by git):

```
GOOGLE_APPLICATION_CREDENTIALS=/absolute/path/to/bot-sandbox-interviews-sa.json
```
⚠️ Do not commit `.env` or `bot-sandbox-interviews-sa.json`. `.gitignore` already excludes `.env` and typical secrets/artefacts.

The CLI automatically loads `.env` via `python-dotenv`.
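In practice this means the Google clients can be constructed without explicit credential handling, roughly as in this sketch:

```python
# Sketch: .env is loaded first, then the Google clients pick up
# GOOGLE_APPLICATION_CREDENTIALS from the environment automatically.
from dotenv import load_dotenv
from google.cloud import bigquery, storage

load_dotenv()  # reads .env at the project root

bq_client = bigquery.Client()
gcs_client = storage.Client()
```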
Using uv:

```bash
uv venv
source .venv/bin/activate
uv pip install -e ".[dev]"
```

Or using the standard `venv` + `pip`:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
```

This installs:

- Runtime deps: `pandas`, `google-cloud-bigquery`, `google-cloud-storage`, `pyarrow`, `python-dotenv`
- Dev tools: `pytest`, `pytest-mock`, `black`, `flake8`, `pre-commit`
The CLI entrypoint is `gcp-sales-pipeline`, defined in `pyproject.toml`. It accepts the following options (a rough sketch of the flag wiring follows the list):

- `--date` (required): reference date, `YYYY-MM-DD`
- `--project-id`: GCP project ID
- `--granularity`: `day` (default), `month`, `quarter`, `year`
- `--brand`: may be passed multiple times to filter by several brands
- `--product-id`: may be passed multiple times to filter by several product IDs
- `--bq-dataset`: BigQuery dataset (default: `bm_mock_data`)
- `--bq-table`: products table name (default: `products`)
- `--gcs-bucket`: GCS bucket (default: `bm_mock_sales`)
- `--output-dir`: export directory (default: `data/export`)
- `--output-format`: `parquet` (default) or `csv`
- `--max-sales-files`: max number of daily files (days) to load (default: `500`)
- `--log-level`: `DEBUG`, `INFO`, `WARNING`, `ERROR` (default: `INFO`)
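For illustration, the repeatable flags and choices above could be declared like this with `argparse` (a hedged sketch: the real `cli.py` may use a different parser or option wiring, and only a subset of options is shown):

```python
# Hypothetical sketch of the option definitions; not the actual cli.py.
import argparse

parser = argparse.ArgumentParser(prog="gcp-sales-pipeline")
parser.add_argument("--date", required=True, help="reference date, YYYY-MM-DD")
parser.add_argument("--project-id")
parser.add_argument("--granularity", default="day",
                    choices=["day", "month", "quarter", "year"])
parser.add_argument("--brand", action="append",
                    help="repeat the flag to filter by several brands")
parser.add_argument("--product-id", dest="product_ids", action="append",
                    help="repeat the flag to filter by several product IDs")
parser.add_argument("--output-format", default="parquet", choices=["parquet", "csv"])
parser.add_argument("--max-sales-files", type=int, default=500)
args = parser.parse_args()
```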
Filter by a single day (default granularity):

```bash
gcp-sales-pipeline \
  --date 2025-11-17
```

Filter by month:

```bash
gcp-sales-pipeline \
  --date 2025-03-01 \
  --granularity month
```

Filter by quarter:

```bash
gcp-sales-pipeline \
  --date 2025-04-15 \
  --granularity quarter
```

Filter by year, restricted to specific brands and product IDs:

```bash
gcp-sales-pipeline \
  --project-id my-gcp-project \
  --date 2025-01-01 \
  --granularity year \
  --brand Nike \
  --brand Adidas \
  --product-id P12345 \
  --product-id P67890
```

Semantics of filters (for the last example):

- `product_id` is in `{P12345, P67890}`
- `brand` is in `{Nike, Adidas}`
- `sold_at` is in the date range corresponding to the requested granularity.
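Applied to the joined dataset, those optional filters amount to simple membership checks, as in this pandas sketch (`apply_filters` is an illustrative name, not necessarily the pipeline's own):

```python
# Sketch: optional brand / product_id filters as pandas membership checks.
import pandas as pd


def apply_filters(df: pd.DataFrame, brands=None, product_ids=None) -> pd.DataFrame:
    if brands:
        df = df[df["brand"].isin(brands)]
    if product_ids:
        df = df[df["product_id"].isin(product_ids)]
    return df
```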
By default, exports are written to:
data/export/
With filenames like:
- `sales_products_day_2025-11-17.parquet`
- `sales_products_month_2025-03.parquet`
- `sales_products_quarter_2025-Q2.parquet`
- `sales_products_year_2025.parquet`

You can switch to CSV using `--output-format csv`.
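The date suffix in those filenames follows directly from the granularity; a sketch of how it could be derived (the helper name is hypothetical):

```python
# Sketch: date suffix used in the export filename, per granularity.
from datetime import date


def export_suffix(ref: date, granularity: str) -> str:
    if granularity == "day":
        return ref.isoformat()                           # 2025-11-17
    if granularity == "month":
        return f"{ref.year}-{ref.month:02d}"             # 2025-03
    if granularity == "quarter":
        return f"{ref.year}-Q{(ref.month - 1) // 3 + 1}" # 2025-Q2
    return str(ref.year)                                 # 2025


# e.g. f"sales_products_{granularity}_{export_suffix(ref, granularity)}.parquet"
```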
The code uses custom exception types:
- `DataLoadError`: issues while reading from BigQuery / GCS (network, permissions, etc.)
- `DataQualityError`: schema issues (e.g. missing required columns in the BigQuery table)
- `TooManyFilesError`: the requested date range exceeds `--max-sales-files` daily files
The CLI maps these to exit codes:
- `1` → data load / data quality problem
- `2` → too many files / too large date range
- `99` → unexpected error
Error details are also logged using Python’s logging module.
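A condensed sketch of how the exceptions and the exit-code mapping fit together (`run_pipeline` is a placeholder for the real entry point in `pipeline.py`):

```python
# Sketch: custom exceptions plus the CLI exit-code mapping described above.
import logging
import sys


class DataLoadError(Exception):
    """Raised when reading from BigQuery / GCS fails."""


class DataQualityError(Exception):
    """Raised when required columns are missing."""


class TooManyFilesError(Exception):
    """Raised when the date range exceeds --max-sales-files daily files."""


def main() -> int:
    try:
        run_pipeline()  # hypothetical entry point
    except (DataLoadError, DataQualityError) as exc:
        logging.error("Data problem: %s", exc)
        return 1
    except TooManyFilesError as exc:
        logging.error("Too many files: %s", exc)
        return 2
    except Exception:
        logging.exception("Unexpected error")
        return 99
    return 0


if __name__ == "__main__":
    sys.exit(main())
```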
```bash
pytest tests/unit
```

Currently focused on:

- Date / granularity filtering logic (see the sketch below).
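A unit test for that logic might look like the following sketch, assuming a `compute_date_range` helper in `filters.py` (the real function name and signature may differ):

```python
# Sketch of a unit test for the granularity logic (tests/unit/test_filters.py).
from datetime import date

from gcp_sales_pipeline.filters import compute_date_range  # hypothetical helper


def test_quarter_range_covers_q2():
    start, end = compute_date_range(date(2025, 4, 15), "quarter")
    assert start == date(2025, 4, 1)
    assert end == date(2025, 6, 30)
```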
```bash
pytest tests/integration
```

These tests:

- Mock BigQuery and GCS clients.
- Run the full pipeline.
- Check that:
  - the join between products and sales works
  - the exported file is produced
  - the schema matches expectations.
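Such a test could be structured roughly like this sketch using `pytest-mock`; `run_pipeline`, `load_products`, `load_sales`, and their signatures are assumptions about the module layout, not the actual API:

```python
# Heavily hedged sketch: exact function names / signatures are assumptions.
import pandas as pd


def test_pipeline_joins_and_exports(mocker, tmp_path):
    products = pd.DataFrame({
        "product_id": ["P1"], "product_name": ["Runner"],
        "category": ["Shoes"], "brand": ["Nike"], "condition": ["new"],
    })
    sales = pd.DataFrame({
        "product_id": ["P1"], "price": [99.9], "quantity": [2],
        "sold_at": [pd.Timestamp("2025-11-17 10:00:00")], "order_id": ["O1"],
    })
    # Replace the real BigQuery / GCS readers with in-memory fakes.
    mocker.patch("gcp_sales_pipeline.pipeline.load_products", return_value=products)
    mocker.patch("gcp_sales_pipeline.pipeline.load_sales", return_value=sales)

    from gcp_sales_pipeline.pipeline import run_pipeline
    output_path = run_pipeline(date="2025-11-17", granularity="day",
                               output_dir=tmp_path)

    exported = pd.read_parquet(output_path)
    assert len(exported) == 1
    assert {"product_id", "brand", "price", "quantity"} <= set(exported.columns)
```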
The repo includes a `.pre-commit-config.yaml` running:

- `black` (formatter)
- `flake8` (linter)
- unit tests (`pytest tests/unit`)
- integration tests (`pytest tests/integration`)
Install hooks:
```bash
pre-commit install
```

After that, every commit will automatically:

- Format code with `black`
- Lint with `flake8`
- Run unit tests
- Run integration tests
If any step fails, the commit is blocked.
The workflow file:
.github/workflows/ci.yml
Defines three jobs:

- `lint`: runs `black --check` and `flake8`
- `unit-tests`: runs `pytest tests/unit`
- `integration-tests`: runs `pytest tests/integration`
Jobs run on pushes and pull requests, ensuring consistent quality between local development and CI.