Storage-client parity gaps & robustness improvements (from sapi-python-client comparison) #417

Description

@padak

Surfaced while comparing kbagent's KeboolaClient against the official sapi-python-client (kbcstorage) during the in-process library facade work (#416 / #415).

Headline: kbagent is already at parity with or ahead of the official client on almost every axis: timeouts (kbcstorage has none), 429 + Retry-After handling, connect/read-timeout retries, connection pooling, bucket sharing/linking (kbcstorage raises NotImplementedError), typed-table creation, multi-cloud upload/download with no boto3/azure/gcs dependency, and the Query Service (kbcstorage has none). The items below are the genuine gaps worth triaging; none of them block the facade.

File refs are against src/keboola_agent_cli/.

Storage export / import (highest value)

  • [high] where-column filter on export / preview / download. kbcstorage exposes where_column + where_operator (eq/neq) + where_values on preview/export/export_raw (tables.py). kbagent's get_table_data_preview / export_table_async support only columns + limit — no way to export "rows where status = active" via the credential-only Storage path (the Query Service path needs a workspace). Scope: medium (client + service + download-table --where-* flags).
  • [high] changed_since / changed_until incremental export. kbcstorage threads these through every export path; kbagent has no equivalent (grep returns zero Storage hits). Blocks "pull only rows changed since my last sync" without a full table download. Scope: small-medium.
  • [high] add-column (typed) to match the existing delete-column. kbagent can create a fully-typed table and drop columns, but cannot add a typed column to an existing table (Storage API POST /tables/{id}/columns supports it). Internal asymmetry, not a kbcstorage gap. Scope: small-medium (reuse _parse_column_spec).
  • [med] escaped_by / without_headers / columns on import. import_table_async / upload_table hard-code only delimiter + enclosure; headerless or backslash-escaped CSVs can't be loaded. Scope: small.
  • [med] Export file_format rfc / escaped / raw. export_table_async only switches csv vs parquet; always RFC CSV. Scope: small.
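
A minimal sketch of how the where-filter and incremental-export options above could be threaded into one request-parameter builder. `build_export_params` is a hypothetical helper, not an existing kbagent function, and the camelCase wire names (`whereColumn`, `whereOperator`, `whereValues[]`, `changedSince`, `changedUntil`) are an assumption based on what kbcstorage appears to send; they should be checked against the Storage API docs before use.

```python
from __future__ import annotations

from typing import Sequence

# Operators named in this issue (kbcstorage exposes eq / neq).
ALLOWED_OPERATORS = ("eq", "neq")


def build_export_params(
    columns: Sequence[str] | None = None,
    limit: int | None = None,
    where_column: str | None = None,
    where_operator: str = "eq",
    where_values: Sequence[str] | None = None,
    changed_since: str | None = None,
    changed_until: str | None = None,
) -> dict[str, object]:
    """Hypothetical helper: compose preview/export parameters.

    The wire names below are assumptions, not confirmed API facts.
    """
    if where_operator not in ALLOWED_OPERATORS:
        raise ValueError(f"where_operator must be one of {ALLOWED_OPERATORS}")
    # A filter needs both the column and at least one value.
    if bool(where_column) != bool(where_values):
        raise ValueError("where_column and where_values must be given together")

    params: dict[str, object] = {}
    if columns:
        params["columns"] = ",".join(columns)
    if limit is not None:
        params["limit"] = limit
    if where_column:
        params["whereColumn"] = where_column
        params["whereOperator"] = where_operator
        params["whereValues[]"] = list(where_values or [])
    if changed_since:
        params["changedSince"] = changed_since
    if changed_until:
        params["changedUntil"] = changed_until
    return params
```

Keeping the validation in one place would let get_table_data_preview, export_table_async and the download-table `--where-*` flags share the same rules, so the CLI cannot accept a filter that the client later rejects.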

Robustness

  • [med] Retry jitter. Backoff in http_base.py is a bare BACKOFF_BASE * 2**attempt with no jitter, so on a stack-wide 5xx every request in kbagent's headline parallel multi-project fan-out (MCP read tools, org_service, lineage_service) retries in lock-step. Add equal/full jitter + a RETRY_JITTER constant. Scope: one-liner.
  • [med] Per-slice retry on large sliced downloads. The cloud GETs in download_sliced_file / _CloudDownloader.stream_to_file and the manifest fetch are one-shot httpx streams with no retry — one flaky slice aborts a multi-GB export. Wrap in a bounded retry on 429/5xx/TransportError reusing the existing retry constants. Scope: ~25 lines.
  • [low] Stop reaching into httpx.Client._headers private API (client.py seeds queue/query/encrypt sub-clients via self._client._headers.copy()). Store the composed headers on the instance in BaseHttpClient.__init__ and copy that. Scope: one-liner.
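
To make the jitter and bounded-retry items concrete, here is a rough sketch. BACKOFF_BASE and RETRY_JITTER are the names used in this issue, but the values, `MAX_RETRIES`, `backoff_delay` and `call_with_retry` are hypothetical and only illustrate the shape; the real constants in http_base.py may differ.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

BACKOFF_BASE = 1.0     # assumed value, for illustration only
MAX_RETRIES = 3        # hypothetical constant
RETRY_JITTER = "full"  # "full", "equal", or anything else for no jitter


def backoff_delay(
    attempt: int,
    base: float = BACKOFF_BASE,
    jitter: str = RETRY_JITTER,
    rng: Callable[[], float] = random.random,
) -> float:
    """Exponential backoff with optional jitter (attempt is 0-based)."""
    cap = base * 2 ** attempt
    if jitter == "full":
        # Uniform in [0, cap): spreads parallel callers across the window.
        return cap * rng()
    if jitter == "equal":
        # Half fixed, half random: keeps a minimum wait of cap / 2.
        return cap / 2 + (cap / 2) * rng()
    return cap


def call_with_retry(
    fn: Callable[[], T],
    is_retriable: Callable[[Exception], bool],
    max_retries: int = MAX_RETRIES,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run fn, retrying up to max_retries times on retriable errors."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_retries or not is_retriable(exc):
                raise
            sleep(backoff_delay(attempt))
    raise AssertionError("unreachable")
```

For the sliced-download case, `fn` would be the fetch of a single slice and `is_retriable` would accept 429/5xx responses and transport errors, so one flaky slice costs a few seconds instead of the whole export. A slice that streams to disk would also need to truncate or reopen its target file before each retry.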

Library facade (lib.py) enhancements

  • [med] Native in-memory upload/download on KeboolaClient. lib.Files.upload(bytes) / read_bytes currently stage through a temp dir (works, hardened in #416, "feat(lib): importable in-process library facade (Client/Files)", 0.61.0). A _upload_to_cloud(source: BinaryIO | bytes) + a download_to_bytesio(url) would let the facade skip temp files entirely. The AWS branch already materializes bytes for SigV4 hashing, so the lift is small. Scope: ~70 lines.
  • [med] Extend the facade with Client.tables / Client.buckets / Client.jobs namespaces mirroring Client.files, matching the kbcstorage idiom Keboola users know. Thin wrappers over existing KeboolaClient methods. Scope: small-medium.
  • [low] upload_file return shape is stale. It echoes the prepare response, so FileEntry.created after an upload can be None and tags is what you asked for, not server-canonical. Either re-fetch get_file_info(id) after upload or document the contract. Scope: ~5 lines.
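
As a rough illustration of the in-memory path, the two helpers below show the normalization a `source: BinaryIO | bytes` signature would need and how streamed chunks could be collected without a temp file. `read_source` and `to_bytesio` are hypothetical names; the actual cloud upload/download calls are deliberately left out.

```python
import io
from typing import BinaryIO, Iterable, Union

BytesLike = Union[bytes, bytearray, memoryview]


def read_source(source: Union[BinaryIO, BytesLike]) -> bytes:
    """Hypothetical helper: normalize an upload source to bytes."""
    if isinstance(source, (bytes, bytearray, memoryview)):
        return bytes(source)
    data = source.read()
    if not isinstance(data, bytes):
        # A text-mode file returns str, which cannot be hashed or uploaded as-is.
        raise TypeError("source must be bytes or a binary-mode file object")
    return data


def to_bytesio(chunks: Iterable[bytes]) -> io.BytesIO:
    """Hypothetical helper: collect streamed chunks into a rewound buffer."""
    buf = io.BytesIO()
    for chunk in chunks:
        buf.write(chunk)
    buf.seek(0)  # rewind so callers can read from the start
    return buf
```

Reading the whole source into memory matches what the AWS branch is described as doing already for SigV4 hashing; very large files would still want the existing temp-file path.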

Lower priority / validate demand first

  • [med] workspace load CLI drops per-table load options (columns / where / datatypes / load-clone). The client load_workspace_tables already posts the raw input array, so the payload flows through — it's CLI/service plumbing that narrows it. Scope: small-medium.
  • [med] Storage event-driven triggers (run config X when table(s) Y change, with cooldown) are entirely absent. kbagent has time-based flow schedule / schedule and agent-task triggers but no data-event trigger. Decide whether this is intentionally out of scope vs flows before building. Scope: medium.
  • [low] Table snapshot / restore. kbcstorage hints at it via create_raw(snapshot_id=...); kbagent has clone-table/pull-table (branch materialization) but no named snapshot/restore. Validate demand. Scope: medium-large.
  • [low] create_config lacks state / is_disabled / configurationId kwargs (useful for idempotent provisioning with deterministic IDs). Scope: XS.
  • [low] list_files lacks max_id / run_id filters (max_id enables descending-window pagination; run_id finds files from a specific job run). Scope: ~6 lines.
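
The descending-window pagination that max_id would enable can be sketched as below. `iter_files_descending` and the injected `fetch_page` callable are hypothetical; the `maxId` / `runId` / `limit` parameter names, the newest-first ordering, and `maxId` being an inclusive upper bound are all assumptions to verify against the Storage API.

```python
from typing import Callable, Dict, Iterator, List, Optional

# fetch_page takes query params and returns one page of file dicts,
# assumed to be ordered newest-first (descending id).
FetchPage = Callable[[Dict[str, object]], List[dict]]


def iter_files_descending(
    fetch_page: FetchPage,
    page_size: int = 100,
    run_id: Optional[str] = None,
) -> Iterator[dict]:
    """Hypothetical sketch: walk files from newest to oldest in windows."""
    max_id: Optional[int] = None
    while True:
        params: Dict[str, object] = {"limit": page_size}
        if max_id is not None:
            params["maxId"] = max_id
        if run_id is not None:
            params["runId"] = run_id
        page = fetch_page(params)
        if not page:
            return
        yield from page
        if len(page) < page_size:
            return  # short page: nothing older remains
        # Assumes maxId is inclusive, so step just below the oldest id seen.
        max_id = min(int(f["id"]) for f in page) - 1
```

Unlike offset paging, the window is anchored to ids, so files uploaded while the listing is in progress do not shift later pages.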

Each box is independently shippable. Suggest cherry-picking the high export/import items + the two med robustness items first; the facade enhancements can follow as the in-process library use case (jasnost) matures.
