feat(storage): create-table from a source table + BigQuery partition/clustering #468

Open: yustme wants to merge 2 commits into main from feat/create-table-from-source

@yustme yustme commented Jun 29, 2026

Contributor

What

Brings the Keboola connection capability from keboola/connection#7697 (DMD-1677) to the CLI: storage create-table can now create a table by copying a source table into a different BigQuery partition/clustering layout, and exposes the partition/clustering layout flags the tables-definition endpoint already supported but the CLI never surfaced.

This is the supported way to repartition a populated BigQuery table: copy it into the new layout, then flip it into place with the existing storage swap-tables.

Changes

  • --source-table-id (+ optional --source-branch-id): derive the new table's schema from a source table and copy its rows into the requested layout. Mutually exclusive with --column (and --not-null/--default, which attach to column specs).
  • --column is now optional (was required); exactly one of --column / --source-table-id must be given.
  • BigQuery layout flags (usable in both column mode and source mode): --time-partitioning-type/-field/-expiration-ms, --range-partitioning-field/-start/-end/-interval (range bounds are passed as strings), --clustering-field (repeatable). Time and range partitioning are mutually exclusive.
  • BigQuery-only, with a pre-flight guard: when any source/layout flag is used, the project backend is verified via one token-verify call and a non-BigQuery project fails fast (exit 2, clear message) before the create is issued. Plain --column creates make no extra call. The connection 422 codes (backendDoesNotSupportSourceTable, sourceAliasNotPersisted, sourceTableMissingReferencedColumn, sourceTableNotFound) remain a server-side backstop.
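
The flag rules listed above can be sketched as one validation function. This is a minimal illustration, not the PR's actual service code: the function name `validate_create_table_args`, its parameters, and the error messages are all hypothetical, and only the three rules themselves come from the description.

```python
from __future__ import annotations


def validate_create_table_args(
    columns: list[str] | None = None,
    source_table_id: str | None = None,
    not_null: list[str] | None = None,
    defaults: list[str] | None = None,
    time_partitioning_type: str | None = None,
    range_partitioning_field: str | None = None,
) -> str:
    """Return the create mode ("columns" or "source") or raise ValueError.

    Hypothetical helper: names and messages are assumptions for illustration.
    """
    has_columns = bool(columns)
    has_source = source_table_id is not None

    # Exactly one of --column / --source-table-id must be given.
    if has_columns == has_source:
        raise ValueError("give exactly one of --column or --source-table-id")

    # --not-null/--default attach to column specs, so they are invalid in source mode.
    if has_source and (not_null or defaults):
        raise ValueError("--not-null/--default cannot be combined with --source-table-id")

    # Time and range partitioning are mutually exclusive.
    if time_partitioning_type and range_partitioning_field:
        raise ValueError("choose either time or range partitioning, not both")

    return "source" if has_source else "columns"
```

Returning the mode lets a caller branch once on "source" versus "columns" instead of re-checking the flags later.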

The client (KeboolaClient.create_table) builds the tables-definition body conditionally — source XOR columns, plus the optional timePartitioning/rangePartitioning/clustering objects — mirroring the exact shapes connection expects.
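
As a rough sketch of what "source XOR columns, plus optional layout objects" could look like, the helper below assembles such a body. It is not the real `KeboolaClient.create_table`: the function name and its parameters are hypothetical, and apart from `timePartitioning`, `rangePartitioning`, and `clustering` (named in the description), every key, including `source`, `tableId`, `branchId`, `fields`, `start`, and `end`, is an assumption about the payload shape.

```python
from __future__ import annotations

from typing import Any


def build_table_definition_body(
    name: str,
    *,
    columns: list[dict[str, Any]] | None = None,
    source_table_id: str | None = None,
    source_branch_id: str | None = None,
    time_partitioning: dict[str, Any] | None = None,
    range_partitioning: dict[str, Any] | None = None,
    clustering_fields: list[str] | None = None,
) -> dict[str, Any]:
    """Build a request body with either a source reference or columns, never both."""
    body: dict[str, Any] = {"name": name}

    if source_table_id is not None:
        # Source mode: the schema is derived server-side, so no columns are sent.
        source: dict[str, Any] = {"tableId": source_table_id}  # assumed key names
        if source_branch_id is not None:
            source["branchId"] = source_branch_id
        body["source"] = source
    else:
        body["columns"] = list(columns or [])

    # Layout objects are attached only when the caller asked for them.
    if time_partitioning:
        body["timePartitioning"] = dict(time_partitioning)
    if range_partitioning:
        layout = dict(range_partitioning)
        for bound in ("start", "end"):  # range bounds travel as strings
            if bound in layout:
                layout[bound] = str(layout[bound])
        body["rangePartitioning"] = layout
    if clustering_fields:
        body["clustering"] = {"fields": list(clustering_fields)}  # assumed inner shape

    return body
```

Omitting unset keys, rather than sending nulls, keeps a plain columns create byte-for-byte the same as it was before the new flags existed.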

Example

# Copy a populated table into a DAY-partitioned, tenant-clustered layout...
kbagent storage create-table --project P --bucket-id in.c-main --name events_repart \
  --source-table-id in.c-main.events \
  --time-partitioning-type DAY --time-partitioning-field created_at \
  --clustering-field tenant_id --primary-key id

# ...then swap it into the original's place.
kbagent storage swap-tables --project P --table-id in.c-main.events \
  --target-table-id in.c-main.events_repart --branch <ID> --yes

Tests

  • New tests/test_storage_create_table.py: client body shaping (source vs columns, partition/clustering, string range bounds), service-layer XOR + partition validation, the BigQuery pre-flight guard (fires before POST; skipped for plain creates), and CLI flag pass-through / --column no longer required.
  • Backend-aware E2E step in tests/test_e2e.py: on BigQuery it runs the source copy followed by the swap; on other backends it asserts that the pre-flight guard rejects the call with exit 2.
  • Updated existing test_storage_write.py call-signature assertions.
  • make check green (lint, format, typecheck, skill, version, command-sync, changelog, error-codes, 4198 tests).

Docs / version

  • Agent surfaces synced: context.py, CLAUDE.md, commands-reference.md, keboola-expert.md (tool matrix), gotchas.md, storage-types-workflow.md, regenerated SKILL.md.
  • Bumped to 0.66.0 with a changelog entry (make version-sync).

…clustering

Extend `storage create-table` to mirror keboola/connection#7697:

- `--source-table-id` (+ optional `--source-branch-id`): copy an existing
  table's data into the requested partition/clustering layout instead of
  building from `--column`. The schema is derived from the source, so
  `--column`/`--not-null`/`--default` are forbidden. This is the supported
  way to repartition a populated BigQuery table; pair with `swap-tables`.
- `--column` is now optional and mutually exclusive with `--source-table-id`.
- New BigQuery layout flags (also usable on a plain columns create):
  `--time-partitioning-type`/`-field`/`-expiration-ms`,
  `--range-partitioning-field`/`-start`/`-end`/`-interval` (bounds are
  strings), `--clustering-field`. Time vs range partitioning are mutually
  exclusive.
- BigQuery-only with a one-call pre-flight guard: when any source/layout
  flag is used, the project backend is verified first and a non-BigQuery
  project fails fast (exit 2) before the create. Plain `--column` creates
  are unaffected. Connection 422 codes remain as a server-side backstop.

Client builds the tables-definition body conditionally (source XOR columns,
optional layout). Adds unit tests (client/service/CLI), a backend-aware E2E
step, and full agent doc-sync. Bumps to 0.66.0.
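
The pre-flight guard described in the commit message above might be structured along these lines. This is only a sketch under assumptions: `get_backend` is a hypothetical callable standing in for the single token-verify call and is assumed to return the backend name as a string, while `BackendNotSupportedError` and `preflight_backend_guard` are illustrative names, not identifiers from the PR.

```python
from __future__ import annotations

from typing import Any, Callable


class BackendNotSupportedError(Exception):
    """Hypothetical error; the CLI is described as exiting with code 2 here."""

    exit_code = 2


def preflight_backend_guard(
    get_backend: Callable[[], str],
    source_table_id: str | None = None,
    **layout_flags: Any,
) -> None:
    """Verify the backend only when a source or layout flag is in use."""
    needs_bigquery = source_table_id is not None or any(
        value not in (None, [], ()) for value in layout_flags.values()
    )
    if not needs_bigquery:
        return  # plain --column create: no extra call is made

    backend = get_backend()  # the one token-verify round trip (assumed wrapper)
    if backend.lower() != "bigquery":
        raise BackendNotSupportedError(
            f"source/layout flags require a BigQuery project, got {backend!r}"
        )
```

Passing the lookup in as a callable makes it straightforward to assert in a test that the call is skipped for plain creates and that the guard fires before any POST.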

devin-ai-integration Bot commented Jun 29, 2026


Code Review

Overall a clean, well-structured PR. The 3-layer architecture (command → service → client) is maintained, validations are thorough, test coverage spans all layers, and documentation is updated across all mandatory surfaces. Backward compatibility for plain --column creates is verified by test. CI green.


Findings (all addressed in 920b5bd)

1. --source-branch-id without --source-table-id is silently ignored (Medium) — Fixed: validation added + test (test_source_branch_id_without_source_table_rejected).

2. if_not_exists "skipped" path missing new keys (Low) — Fixed: source_table_id, source_branch_id, time_partitioning, range_partitioning, clustering (all None) added to skipped dict + test assertions in test_storage_write.py.

3. Range partitioning human display is minimal (Nit) — Fixed: now shows Range partitioning: field [start, end) step interval.

4. uv.lock revision 3→2 (Nit) — Fixed: reverted to revision 3.
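
For finding 3, the human-readable line could be produced by a formatter like the one below. The output pattern follows the review text ("field [start, end) step interval"); the function name and the dictionary keys `field`, `start`, `end`, and `interval` are assumptions made for this illustration.

```python
from __future__ import annotations

from typing import Any, Mapping


def format_range_partitioning(layout: Mapping[str, Any]) -> str:
    """Render a range-partitioning layout as a half-open interval with its step."""
    return (
        f"Range partitioning: {layout['field']} "
        f"[{layout['start']}, {layout['end']}) step {layout['interval']}"
    )


print(
    format_range_partitioning(
        {"field": "customer_id", "start": "0", "end": "1000", "interval": "10"}
    )
)
# prints: Range partitioning: customer_id [0, 1000) step 10
```

The `[start, end)` notation signals that the upper bound is exclusive.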


What's done well

  • Pre-flight backend guard — fails on non-BigQuery projects before the API call with a clear error message; plain columns creates pay no penalty (no extra API call).
  • XOR validation --column vs --source-table-id + prohibition of --not-null/--default in source mode — fail-fast with precise error messages.
  • _build_bigquery_layout and _build_source as pure helper functions separated from service logic.
  • Test coverage: client body shaping (3), service validation (7), backend guard (4), CLI pass-through (2), E2E (backend-aware path).
  • Doc sync: context.py, CLAUDE.md, commands-reference.md, keboola-expert.md, gotchas.md, storage-types-workflow.md, SKILL.md — all updated.

All findings addressed. LGTM 👍

- Reject --source-branch-id without --source-table-id (was silently dropped)
- Keep skipped if-not-exists envelope schema-consistent with created path
- Show range partitioning bounds in human output
- Restore uv.lock revision 3 (unrelated downgrade)