feat(destinations): add Google Cloud Storage destination (csv/json/jsonl/parquet + gzip) (closes #169) by masukai · Pull Request #623 · drt-hub/drt

masukai · 2026-06-08T00:35:52Z

Summary

GCS destination — the natural pair for BigQuery users and the second of the v0.8 cloud-storage trio (S3 / GCS / Azure Blob). Same shape as the S3 destination from #613.

Phase B of the v0.8.0 cloud-storage work:

✅ Phase A: shared _blob_serializer.py refactor (merged refactor(destinations): extract shared blob serialiser for S3/GCS/Azure (prep #169 #170) #622)
✅ Phase B: GCS destination (this PR)
⏳ Phase C: Azure Blob (feat: add Azure Blob Storage destination (CSV/Parquet export) #170, next)

Implementation

Thin _client + blob.upload_from_string shim on top of drt/destinations/_blob_serializer.py — the csv / json / jsonl / parquet + gzip + key-naming logic is shared with S3, not duplicated.

Formats: csv / json / jsonl / parquet, optional gzip for text formats.
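
As a rough illustration of the shim described above, the upload step might look like the sketch below. It assumes a google-cloud-storage style client; the function name `upload_body` and its parameters are hypothetical and not taken from drt's source. Only `client.bucket(...)`, `bucket.blob(...)`, `blob.content_encoding`, and `blob.upload_from_string(...)` are real client-library calls.

```python
from typing import Any, Optional


def upload_body(
    client: Any,
    bucket_name: str,
    key: str,
    body: bytes,
    content_type: str,
    content_encoding: Optional[str] = None,
) -> None:
    """Upload already-serialised bytes to one GCS object (hypothetical helper)."""
    blob = client.bucket(bucket_name).blob(key)
    if content_encoding:
        # e.g. "gzip" for compressed text formats; must be set before the upload
        blob.content_encoding = content_encoding
    blob.upload_from_string(body, content_type=content_type)
```

The serialised body, object key, and content type would come from the shared serialiser, so this function only has to move bytes.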

Auth:

Default: Application Default Credentials chain (GOOGLE_APPLICATION_CREDENTIALS env → gcloud auth application-default login → GCE/GKE/Cloud Run attached SA)
Override: credentials_path for non-GCP CI / cron environments
project_id optional (the keyfile usually carries one)
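
The credential threading above could be sketched roughly as follows. `build_gcs_client` is a hypothetical name, not drt's actual `_client`. The sketch assumes the google-cloud-storage package and uses its `storage.Client(...)` constructor and `storage.Client.from_service_account_json(...)` factory.

```python
from typing import Any, Optional


def build_gcs_client(
    project_id: Optional[str] = None,
    credentials_path: Optional[str] = None,
) -> Any:
    """Build a GCS client from optional overrides (hypothetical helper)."""
    # Lazy import: loading this module does not require the GCS extra.
    from google.cloud import storage

    # Pass `project` only when configured, so the keyfile or the
    # environment can still supply a default project.
    kwargs = {"project": project_id} if project_id else {}
    if credentials_path:
        # Explicit service-account keyfile for non-GCP CI / cron hosts.
        return storage.Client.from_service_account_json(credentials_path, **kwargs)
    # No keyfile given: fall back to Application Default Credentials.
    return storage.Client(**kwargs)
```

Omitting `project` rather than passing `None` matters in the keyfile branch, because the client library only falls back to the keyfile's own project when the argument is absent.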

Failure semantics:

[gcs] missing-extras → ImportError bubbles up (deployment mistake, surfaced once at the top by the engine)
[parquet] missing-extras → row failure with drt-core[parquet] hint preserved (matches S3)
Upload errors (network / auth / permissions) → row failures so other batches keep going
Empty batches short-circuit before any google.cloud import or GCS call (implicit "no driver was imported" contract from feat(tests): SQL destination contract tests (Step 2b of #364 follow-up) #595)
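
The failure semantics listed above can be sketched in a few lines. Everything here is illustrative: `LoadResult`, `load_batch`, and the injected `serialize` / `upload` callables are assumed names rather than drt's real types, and the real engine likely records richer per-row errors.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class LoadResult:  # hypothetical result type
    success: int = 0
    failed: int = 0
    row_errors: list[str] = field(default_factory=list)


def load_batch(
    records: list[dict[str, Any]],
    serialize: Callable[[list[dict[str, Any]]], bytes],
    upload: Callable[[bytes], None],
) -> LoadResult:
    result = LoadResult()
    if not records:
        # Empty batch: return before any driver import or network call.
        return result
    try:
        body = serialize(records)
    except Exception as exc:
        # Serialisation problems (including a missing parquet extra)
        # fail the rows and skip the upload; the message is preserved.
        result.failed = len(records)
        result.row_errors.append(str(exc))
        return result
    try:
        upload(body)
    except ImportError:
        # Missing GCS driver is a deployment mistake: let it bubble up.
        raise
    except Exception as exc:
        # Network / auth / permission errors become row failures so
        # later batches can still run.
        result.failed = len(records)
        result.row_errors.append(str(exc))
        return result
    result.success = len(records)
    return result
```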

Example

destination:
  type: gcs
  bucket: my-data-exports
  prefix: drt/users/
  format: jsonl
  compression: gzip

For BigQuery external-table interop:

destination:
  type: gcs
  bucket: my-data-lake
  prefix: events/
  format: parquet
  parquet_compression: snappy
  project_id: my-gcp-project

Tests

17 unit tests in tests/unit/test_gcs_destination.py:

Config validation (minimal, describe, gs:// scheme)
Empty-batch short-circuit (no google.cloud import)
CSV / JSON / JSONL upload call shape per format
gzip → blob.content_encoding = "gzip" + .jsonl.gz extension + body decompresses correctly
Credential threading: ADC vs project_id vs credentials_path
Key-template overrides + no-prefix bare-filename
Failure paths: upload error → row failures, [gcs] missing-extras → ImportError bubble, [parquet] missing-extras → row failure with hint, serialisation error → row failures (no upload)
Parquet orchestration (mocked pandas/pyarrow — see note below)
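
To make the gzip check in the list above concrete, here is a stdlib-only sketch of the JSONL + gzip path and the round-trip a test would assert. The real logic lives in drt/destinations/_blob_serializer.py, whose API is not shown in this PR, so `serialize_jsonl_gzip` and its return shape are assumptions for illustration only.

```python
import gzip
import json
from typing import Any


def serialize_jsonl_gzip(records: list[dict[str, Any]]) -> tuple[bytes, str, str]:
    """Return (body, file extension, content encoding) for a batch."""
    # One JSON document per line, newline-terminated.
    text = "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)
    body = gzip.compress(text.encode("utf-8"))
    return body, ".jsonl.gz", "gzip"


# Round-trip: the uploaded body must decompress back to the original rows.
body, ext, encoding = serialize_jsonl_gzip([{"id": 1}, {"id": 2}])
rows = [json.loads(line) for line in gzip.decompress(body).decode("utf-8").splitlines()]
```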

Why the parquet test mocks pandas/pyarrow

The S3 destination's test_parquet_uploads_binary_body_with_octet_stream already validates the real PAR1 binary end-to-end. If the GCS suite did the same, both test classes would call df.to_parquet(...), and pyarrow would raise "A type extension with name pandas.period already defined" on the second registration. The mocked-pandas pattern matches the existing test_parquet_orchestration_with_mocked_pandas_runs_on_ci in the S3 suite.
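
One way to get this kind of isolation is to stub pandas through sys.modules so the real library is never imported. The sketch below assumes the serialiser imports pandas lazily inside the function. `to_parquet_bytes` and `run_with_mocked_pandas` are hypothetical stand-ins, not the names used in the drt test suite.

```python
import io
import sys
from typing import Any
from unittest import mock


def to_parquet_bytes(records: list[dict[str, Any]]) -> bytes:
    """Hypothetical stand-in for the parquet branch of a serialiser."""
    import pandas as pd  # lazy import, so a sys.modules stub is picked up

    buf = io.BytesIO()
    pd.DataFrame(records).to_parquet(buf, compression="snappy")
    return buf.getvalue()


def run_with_mocked_pandas(records: list[dict[str, Any]]) -> tuple[bytes, mock.MagicMock]:
    fake_pd = mock.MagicMock(name="pandas")

    def fake_to_parquet(buf: io.BytesIO, **kwargs: Any) -> None:
        # Write recognisable bytes instead of a real parquet file.
        buf.write(b"PAR1-fake")

    fake_pd.DataFrame.return_value.to_parquet.side_effect = fake_to_parquet
    # patch.dict restores sys.modules on exit, so other tests are unaffected.
    with mock.patch.dict(sys.modules, {"pandas": fake_pd}):
        body = to_parquet_bytes(records)
    return body, fake_pd
```

This checks orchestration only (the DataFrame is built and `to_parquet` receives the expected compression). It says nothing about whether the bytes are valid parquet, which the S3 test already covers.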

Test plan

pytest tests/unit/test_gcs_destination.py — 17 passed
pytest tests/unit/test_gcs_destination.py tests/unit/test_s3_destination.py tests/unit/test_blob_serializer.py — 61 passed (no cross-test pollution)
Full unit + contracts: 1478 passed
make lint — ruff + mypy all green
CI green on 3.10–3.13 + CodeQL

Docs

docs/connectors/gcs.md — full reference (formats, auth, naming, sync modes, notes)
README.md + README.ja.md destination tables updated (v0.7.9 row)

i18n marker bump for README.ja.md follows the established post-merge housekeeping pattern (same as #618 for #613 S3).

CHANGELOG

[Unreleased] → Added entry above the S3 entry.

Related

Closes feat: add GCS destination (CSV/Parquet export) #169
Builds on feat(destinations): add Amazon S3 destination (csv/json/jsonl/parquet + gzip) (closes #168) #613 (S3 destination, merged) + refactor(destinations): extract shared blob serialiser for S3/GCS/Azure (prep #169 #170) #622 (shared serialiser refactor, merged)
Unblocks feat: add Azure Blob Storage destination (CSV/Parquet export) #170 (Azure Blob — next PR in the trio, same shared module)
v0.8 milestone: https://github.com/drt-hub/drt/milestone/5

🤖 Generated with Claude Code

codecov · 2026-06-08T00:38:48Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

masukai · 2026-06-09T11:05:10Z

@yodakanohoshi — small ping when you have a window 🙏

GCS destination, Phase B of the v0.8 cloud-storage trio (S3 ✅ / GCS this PR / Azure Blob #624). Thin shim on top of the shared _blob_serializer.py from #622 (already merged), so the diff is small for what it ships — _client + blob.upload_from_string + 17 unit tests. ADC + service-account-JSON auth paths.

This one's the bigger review out of the four. If timing's tight, prioritising #623 / #624 over the docs PRs (#620 / #621) is fine.

…ge/mirror) (closes #167) (#629)

* feat(destinations): add Databricks Delta Lake destination (insert/merge/mirror) (closes #167)

Third DWH destination alongside Snowflake (#353) and BigQuery (#584 in flight), completing the major-DWH lineup. Supports the same three modes as Snowflake's leg:

- INSERT (append, `config.mode: insert`)
- MERGE (upsert via Delta Lake's native MERGE INTO, `config.mode: merge`)
- sync.mode: mirror (#340 family, Databricks leg): MERGE upsert + end-of-sync DELETE-missing

Auth via Databricks SQL Connector:

- `host_env`: workspace hostname (dbc-*.cloud.databricks.com)
- `http_path_env`: SQL warehouse HTTP path (/sql/1.0/warehouses/*)
- `token_env`: personal access token (PAT, dapi*)

Unity Catalog three-part names (catalog.schema.table) are the default; legacy workspaces use `catalog: hive_metastore`.

Merge implementation note: Databricks Delta Lake doesn't have session-local temp tables (no `CREATE TEMP TABLE` syntax), so the merge path creates a uniquely-named scratch Delta table `catalog.schema.__drt_staging_<table>` cloned from the target's schema, stages rows via per-row INSERT, executes MERGE INTO, and DROP TABLEs the staging at the end. The `__drt_staging_*` prefix makes it identifiable in audit logs. The token-bearing principal needs CREATE on the schema in addition to MODIFY on the target.

Mirror semantics match the Snowflake leg of #340:

- `sync.mode: mirror` forces the MERGE write path regardless of `config.mode`
- End-of-sync issues `DELETE FROM <table> WHERE upsert_key NOT IN (observed)`
- Composite keys use `WHERE (c1, c2) NOT IN ((v1a, v1b), ...)` form
- Safety guard: skips DELETE entirely when no batch produced records

22 unit tests in tests/unit/test_databricks_destination.py cover:

- Config validation (schema: YAML alias, three-part FQN in describe(), Hive Metastore catalog)
- Empty-batch short-circuit (#595 contract)
- databricks.sql.connect() kwargs shape, which protects against silent template-copy drift from the Snowflake destination
- INSERT happy path + on_error=skip / on_error=fail
- MERGE happy path + upsert_key required + composite key ON clause + all-columns-are-key (no UPDATE clause)
- Mirror invariants: upsert_key validation, MERGE-write-path forcing, single-column DELETE, composite-key DELETE tuple form, skip-when-no-records safety guard, no-op finalize_sync for non-mirror modes
- test_connection round-trip

databricks.sql is mocked via sys.modules injection, so no real Databricks workspace or databricks-sql-connector install is required.

Requires `pip install drt-core[databricks]` (depends on databricks-sql-connector>=3.0, already in pyproject extras).

New `docs/connectors/databricks.md` covers all three modes, auth flow with PAT generation steps, Unity Catalog vs Hive Metastore, the merge-path staging design (why Delta scratch table and not CREATE TEMP TABLE), and a sync-mode compatibility table. README destination table updated on both English and Japanese sides (Databricks Delta Lake row added after Snowflake, v0.7.9). i18n marker bump for README.ja.md follows the established post-merge housekeeping pattern (#618-style).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* test(databricks): cover MERGE/mirror edge paths to push patch coverage to 100%

codecov/patch on the prior commit hit 94.73% (target 86.72%): the gate passed but not at 100%. Uncovered slice was the 3 branches not exercised by the happy-path tests:

- **Lines 152-163**: MERGE-path staging INSERT failure handler (the per-row try/except inside the staging-table INSERT loop)
- **Line 196**: mirror's ``failed_indices`` skip path inside the ``_mirror_keys`` accumulator (skip rows that didn't make it into the destination so they don't count as "observed in source" for the end-of-sync DELETE)
- **Line 202**: the ``Unsupported mode`` defensive fallthrough ValueError (unreachable in normal flow because Pydantic Literal validates at config-load time, but tracked by coverage)

4 new tests:

1. `test_merge_staging_insert_failure_on_error_skip`: first staging INSERT fails, second succeeds; verifies result.failed=1 + row_errors recorded + MERGE still runs against whatever made it into staging.
2. `test_merge_staging_insert_failure_on_error_fail_raises`: same failure scenario but with on_error=fail; verifies the exception re-raises and the connection is still closed via try/finally.
3. `test_unsupported_mode_raises`: manually corrupts ``config.mode`` to "garbage" after Pydantic construction (bypasses Literal validation via ``object.__setattr__``) and verifies the defensive ValueError fires.
4. `test_mirror_skips_failed_keys_from_delete_observed_set`: mirror load with a staging failure on row 1, then finalize_sync; verifies the DELETE's NOT-IN list contains only the survivor's key (id=2), not the failed row's key (id=1). This catches the semantic bug where a row that failed to load would be deleted from the destination on next mirror run.

drt/destinations/databricks.py file coverage: 94% → 100% (119/119 stmts). Coverage now matches the S3 / GCS / Azure Blob destinations from #613 / #623 / #624.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…onl/parquet + gzip) (closes #169)

GCS destination: the natural pair for BigQuery users and the second of the v0.8 cloud-storage trio (S3 / GCS / Azure Blob). Same shape as the S3 destination from #613: four formats (csv / json / jsonl / parquet) with optional gzip compression for the text formats (parquet keeps its own column-level compression via `parquet_compression`).

Implementation is a thin `_client` + `blob.upload_from_string` shim on top of `drt/destinations/_blob_serializer.py` from #622; the csv/json/jsonl/parquet + gzip + key-naming logic is shared with S3, not duplicated. Azure Blob (#170) lands next on the same module.

Authentication:

- Default: Application Default Credentials chain (GOOGLE_APPLICATION_CREDENTIALS env → `gcloud auth application-default login` → GCE/GKE/Cloud Run attached SA)
- Override: `credentials_path` for non-GCP CI / cron environments
- `project_id` optional (the keyfile usually carries one)

Failure semantics:

- [gcs] missing-extras → ImportError bubbles up (deployment mistake, surfaced once at the top by the engine)
- [parquet] missing-extras → row failure with the `drt-core[parquet]` install hint preserved (matches S3)
- Upload errors (network / auth / permissions) → row failures so other batches keep going
- Empty batches short-circuit before any `google.cloud` import or GCS call (implicit "no driver was imported" contract from #595)

17 unit tests in `tests/unit/test_gcs_destination.py` cover config validation, the empty-batch short-circuit, `blob.upload_from_string` call shape per format, gzip → `blob.content_encoding = "gzip"` + `.gz` extension, ADC vs `project_id` vs `credentials_path` threading, key-template overrides, and the failure paths.

The parquet orchestration test mocks `pandas` + `pyarrow` rather than calling them end-to-end: the S3 destination's parquet test already validates the real PAR1 binary, and double-registering the pyarrow type extension across two test classes raises `A type extension with name pandas.period already defined`.

`docs/connectors/gcs.md` covers all four formats, the ADC / service-account auth flow, the object-naming convention, and notes on BigQuery external-table interop. README destination table updated on both English and Japanese sides. i18n marker bump for README.ja.md → README.md follow-up via the established post-merge housekeeping pattern (#618-style).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…inventory

Surfaced by `make check-drift` after rebasing onto main (the drift-audit infra from #630 landed while this PR was open). GCS was registered in the connector registry but missing from two surfaces the audit tracks:

- `/drt-create-sync` skill destinations list (as "Google Cloud Storage (GCS)"; the "(GCS)" suffix is load-bearing for the drift matcher, which looks for the `gcs` type key as a substring)
- `drt_list_connectors` MCP inventory

`make check-drift` now exits 0 on this branch, so GCS lands already consistent across docs / skill / MCP rather than tripping the weekly audit after merge.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…l/parquet + gzip) (closes #170)

Completes the v0.8 cloud-storage trio (S3 #168 / GCS #169 / Azure Blob #170). Same shape as the S3 destination from #613 and the GCS destination from #623: four formats (csv/json/jsonl/parquet) with optional gzip for the text formats.

Implementation is a thin `_service_client` + `blob_client.upload_blob` shim on top of `drt/destinations/_blob_serializer.py` from #622; the csv/json/jsonl/parquet + gzip + key-naming logic is shared with S3 and GCS, not duplicated.

Authentication offers two paths (exactly one must be set; config errors raise immediately):

1. Connection string via `connection_string_env` pointing at an env var holding the storage-account connection string: the right shape for CI / cron / non-Azure deployments.
2. DefaultAzureCredential via `account_url` pointing at the storage-account blob endpoint: the right shape for Azure-hosted apps with managed identity (App Service / AKS / Container Apps / VMs / Functions), plus local dev via `az login`.

Content-Type and Content-Encoding metadata are carried on the blob via Azure's `ContentSettings` (the Azure equivalent of S3's `ContentType` / `ContentEncoding` put_object kwargs). `upload_blob` is called with `overwrite=True`, which is irrelevant under the default timestamped key (a fresh blob name every run) and intentional for `key_template` users who set a fixed name like `latest.csv`.

Failure semantics:

- [azure] missing-extras → ImportError bubbles up (deployment mistake, surfaced once at the top by the engine)
- [parquet] missing-extras → row failure with the `drt-core[parquet]` install hint preserved (matches S3 / GCS)
- Upload errors (network / auth / permissions) → row failures so other batches keep going
- Config errors (empty `connection_string_env`, neither auth path set) → ValueError raised immediately
- Empty batches short-circuit before any `azure.storage.blob` import or Azure call (implicit "no driver was imported" contract from #595)

18 unit tests in `tests/unit/test_azure_blob_destination.py` cover config validation, the empty-batch short-circuit, `upload_blob` call shape per format, gzip → `ContentSettings(content_type=..., content_encoding="gzip")` + `.gz` extension, both auth paths (connection-string env-var resolution + DefaultAzureCredential threading), the negative auth paths, key-template overrides, and the failure paths.

The parquet orchestration test mocks `pandas` + `pyarrow` rather than calling them end-to-end, for the same `A type extension with name pandas.period already defined` rationale as the GCS suite (#623).

`docs/connectors/azure-blob.md` covers all four formats, both auth paths with example connection-string and managed-identity flows, the blob-naming convention, and Content-Encoding interop notes for Azure Data Factory / Synapse / Databricks. README destination table updated on both English and Japanese sides. `[azure]` extras: `azure-storage-blob>=12.0` + `azure-identity>=1.15`. i18n marker bump for README.ja.md → README.md follow-up via the established post-merge housekeeping pattern (#618-style).

Conflict expected against #623 (GCS) on `drt/config/models.py` / `drt/connectors/registry.py` / README destination tables / CHANGELOG. All are adjacent-line conflicts that resolve trivially during rebase. The destinations themselves don't touch each other.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

masukai mentioned this pull request Jun 8, 2026

feat(destinations): add Azure Blob Storage destination (csv/json/jsonl/parquet + gzip) (closes #170) #624

Merged

5 tasks

masukai requested a review from yodakanohoshi June 9, 2026 11:04

masukai and others added 2 commits June 11, 2026 09:32

masukai force-pushed the feat/gcs-destination branch from 540abe2 to 63fabc6 Compare June 11, 2026 00:34

masukai merged commit ea2a600 into main Jun 11, 2026
8 checks passed

masukai deleted the feat/gcs-destination branch June 11, 2026 00:42

github-actions Bot locked and limited conversation to collaborators Jun 11, 2026

masukai merged 2 commits into main from feat/gcs-destination