Unify producer batching across all read blocks (#400) by spicy-sauce · Pull Request #401 · datayoga-io/datayoga

spicy-sauce · 2026-05-28T14:27:55Z

Closes #400. Side-effect closes #293, #294, #295, #296, #377.

Summary

Move batching out of individual producer blocks into the Producer base class so every read block has consistent batch_size behavior, and three buggy producers stop yielding single records.

New Producer.produce_chunks() override hook; base produce() re-chunks subclass output into batches of up to batch_size (smaller on EOS or flush_ms), with bounded backpressure (asyncio.Queue(maxsize=1)) and source-error propagation.
Schema fragments (batchable.schema.json, streamable.schema.json) + new $inherit convention resolved at load time by schema_utils.resolve_inherits (wired into Block.get_json_schema and Job.get_json_schema).
7 producers migrated: std/read, files/read_csv, parquet/read, relational/read, redis/read_stream, http/receiver, azure/read_event_hub.
relational/read gains a separate fetch_size (default 10000) so DB round-trip size stays independent of pipeline batch_size.
generate-docs.sh resolves $inherit before jsonschema2mk (stdlib-only Python, no datayoga_core import) so per-block reference docs render inherited properties.

Spec: docs/superpowers/specs/2026-05-28-producer-batching-unification-design.md

Breaking change

azure/read_event_hub.batch_size is renamed to max_batch_size. The name batch_size now means pipeline batch size on this block, consistent with every other producer.

Users with batch_size: <N> in YAML for azure/read_event_hub (where it used to control the SDK callback size) must rename it to max_batch_size: <N> to preserve previous behavior. The literal batch_size: <N> still validates but with the new pipeline-level meaning. Schema now also has additionalProperties: false so typos fail loudly.

Other behavior notes (non-breaking, worth a callout)

redis/read_stream snapshot mode now drains all available pending+new records before exiting, rather than exiting after the first non-pending read. This is a strict improvement but a subtle semantic shift.
All producers now expose batch_size (default 1000). Streaming producers (redis/read_stream, http/receiver, azure/read_event_hub) also expose flush_ms (default 1000) — partial batches flush after that many ms of inactivity instead of being held indefinitely.

Test plan

CI core test suite green (83 unit tests passing locally and in CI)
Integration test redis_to_relational passes end-to-end against real Redis (closes redis.read_stream producer sends messages one by one, should be in batches based on batch_size arg #377)
Spot-check sample jobs: files/read_csv → std/write runs from the scaffold and produces expected output; smoke-tested std/read, files/read_csv, parquet/read end-to-end with batching
Real-Redis smoke (Redis 7 container): redis/read_stream snapshot mode with batch_size=10 on 25 messages produces batches of [10, 10, 5], acks correctly
Backpressure verified: 100k-row CSV with slow downstream → RSS stays flat (41.9 MB) and tracemalloc peak 4.4 MB; bounded queue prevents pre-loading
Error propagation verified: nonexistent CSV → FileNotFoundError propagates (not silent EOS)
For Azure Event Hub users: confirm YAML files have been updated to use max_batch_size instead of batch_size if SDK callback size was being configured

Review status

GitHub Copilot review: 3 comments raised (silent EOS on errors, unbounded queue, "exactly" vs "up to" wording). All 3 fixed (5734e03) with tests; replies posted on each thread.

🤖 Generated with Claude Code

…batch_size

…initialization

…ayoga-io/datayoga into batch_size_in_std_read_block

Brainstormed design for unifying batch handling across all 7 producer blocks (std/read, files/read_csv, relational/read, parquet/read, redis/read_stream, azure/read_event_hub, http/receiver). Closes the gap behind #294, #295, #296, #377 by making the Producer base class own batching via a new produce_chunks() hook. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Task-by-task TDD plan covering: schema fragment loader, Producer base class, and per-producer migrations (std/read, files/read_csv, parquet/read, relational/read, redis/read_stream, http/receiver, azure/read_event_hub), plus autogen + docs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…, BREAKING)

)

#400)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Explicitly aclose() the producer async generator on consumer exit so the underlying pump task and aiohttp server are torn down cleanly. Removes a "Task was destroyed but it is pending!" warning at test teardown. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…-unification # Conflicts: # core/src/datayoga_core/blocks/std/read/block.py # core/src/datayoga_core/blocks/std/read/block.schema.json # docs/reference/blocks/std_read.md # schemas/job.schema.json

One-line docstrings on every method, helper, and inner function added or modified by this PR: Producer base class, all 7 migrated producers, and the per-block test helpers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…pt (#400) - Run isort/autopep8 on test files; collapse the blank line between third-party imports (pytest, etc.) and datayoga_core imports that isort flagged. - Rewrite the $inherit resolution in scripts/generate-docs.sh to use only the Python standard library, so the docs CI job (which installs only node) no longer hits ModuleNotFoundError on prometheus_client when importing datayoga_core. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

unit-tests CI does `pip install .[test]`. The azure/read_event_hub test module imports the block, which imports azure.eventhub at module load. Without azure-eventhub in the test extras, pytest's collection fails on ModuleNotFoundError. Other producer test modules (parquet, redis, http, relational) already work because their backing deps are in [test]. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

formatting-check runs prettier --check on all .md including these. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR centralizes producer batching in the Producer base class so read blocks expose consistent batch_size behavior, adds shared schema fragments for batchable/streamable producers, and updates docs/tests around the migration.

Changes:

Adds Producer.produce_chunks() and base re-chunking/flush behavior.
Migrates read producers to the new hook and adds fetch_size / max_batch_size where needed.
Adds $inherit schema resolution plus regenerated schemas/reference docs.

Reviewed changes

Copilot reviewed 46 out of 54 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
`core/src/datayoga_core/producer.py`	Implements base producer batching and flush logic.
`core/src/datayoga_core/schema_utils.py`	Adds `$inherit` schema-fragment resolver.
`core/src/datayoga_core/block.py`	Resolves inherited schema fragments for block validation.
`core/src/datayoga_core/job.py`	Resolves inherited fragments in generated job schema.
`core/src/datayoga_core/resources/schemas/batchable.schema.json`	Defines shared `batch_size` property.
`core/src/datayoga_core/resources/schemas/streamable.schema.json`	Defines shared streaming `batch_size`/`flush_ms` properties.
`core/src/datayoga_core/blocks//read/block.py`	Migrates producer implementations to `produce_chunks()`.
`core/src/datayoga_core/blocks/*/block.schema.json`	Uses shared batching schema fragments.
`core/src/datayoga_core/tests/*` and block tests	Adds coverage for schema inheritance and producer batching.
`core/pyproject.toml`	Adds Azure test dependencies.
`scripts/generate-docs.sh`	Resolves `$inherit` before reference-doc generation.
`schemas/job.schema.json`	Regenerated aggregate schema with inherited properties.
`docs/reference/**`	Regenerated reference docs for producer batching properties.
`docs/processing-strategies.md`	Documents producer batching, streaming flush, and Azure migration.
`docs/superpowers/specs/2026-05-28-producer-batching-unification-design.md`	Adds design spec.
`docs/superpowers/plans/2026-05-28-producer-batching-unification.md`	Adds implementation plan.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

#400) Two correctness fixes flagged by the Copilot PR reviewer: 1. Source errors no longer become silent EOS. The pump now captures non-cancellation exceptions and re-raises them on the consumer side after flushing the partial buffer, so a Redis disconnect, broken CSV, or DB read error fails the job loudly instead of being treated as end-of-stream against truncated input. 2. The internal queue is now bounded (maxsize=1), restoring the backpressure the old yield-driven model had. Without this, large bounded sources (parquet, relational, csv) could pre-load the entire table/file into memory while downstream was processing batch 1. The pump's `finally: put(EOS)` is skipped on cancellation to avoid deadlocking against a full queue. Also: corrected processing-strategies docs to say "up to batch_size" instead of "exactly batch_size", since partial batches fire on EOS and flush_ms timeout. Three new tests: - test_source_errors_propagate_instead_of_silent_eos - test_source_error_flushes_buffer_before_raising - test_pump_does_not_outrun_consumer_unboundedly Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

One-liner docstrings on every test_* function: producer batching, schema inherit, and per-block read tests. Matches the docstring coverage applied to production code earlier in this PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Audit pass: the test literally asserted block.properties.get("max_batch_size", 300) == 300 which is testing the standard library, not Block code. Removed. Remaining 4 azure tests cover validation, schema shape, accepted properties, and the documented breaking-change behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The 2225-line plan was execution scaffolding for the work that now ships. The spec remains as the architectural record. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Per review discussion: drop the custom \$inherit extension and use JSON Schema's standard allOf + \$ref composition instead. The on-disk schemas are now idiomatic JSON Schema, understood by any standards- compliant tool. Changes: - Each producer block schema gains "\$schema": draft/2019-09 and uses allOf: [{"\$ref": "../../../resources/schemas/<fragment>.schema.json"}] to inherit batch_size (and flush_ms for streaming producers). - additionalProperties: false -> unevaluatedProperties: false, which is composition-aware (the additionalProperties + allOf interaction is a known JSON Schema gotcha that rejects allOf-contributed properties). - schema_utils.resolve_inherits -> resolve_refs: walks the schema, inlines local-file \$refs recursively, detects cycles. The validation code path (Block.validate, Job.get_json_schema) stays unchanged — resolved schemas are flat. - Tests in test_schema_inherit.py rewritten for the new mechanics: inlining, transitive resolution, cycle detection, non-local refs passthrough, default base-dir fallback. - generate-docs.sh: walks standard \$ref instead of \$inherit, and also flattens allOf properties for jsonschema2mk's benefit (docs-only). - Aggregate schemas/job.schema.json regenerated. External \$ref-aware tools (IDE plugins, OpenAPI exporters) can now follow the schemas without our custom resolver. jsonschema2mk is the one tool that doesn't grok \$ref, so the docs generator keeps its pre-resolution step. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@AbstractMethod

Four corrections after PR review and migration to standard JSON Schema: - Status: Design -> Implemented in PR #401. - flush_ms code snippet now shows the bounded queue (maxsize=1) and source-error propagation that Copilot review surfaced, plus the cancelled-flag dance that the bounded queue requires. - Producer ABC risk note: produce_chunks is the new override hook but not formally @AbstractMethod, so legacy produce() overrides keep working (correcting an earlier overstatement). - Event Hub schema risk note: we use unevaluatedProperties: false, not additionalProperties: false (composition-aware). - Drop CHANGELOG mention (no CHANGELOG in this repo; PR description carries the breaking change note). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Section said schema gains additionalProperties: false; actually it gains unevaluatedProperties: false (composition-aware). Also reframed the "rejects batch_size: 300 loudly" claim, which was always wrong — that literal property name still validates after the rename, just with new semantics. Typos are what additionalProperties/unevaluatedProperties catches. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The file now tests schema_utils.resolve_refs (standard JSON Schema \$ref resolution), not the old custom \$inherit extension. Filename was stale. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 46 out of 54 changed files in this pull request and generated 2 comments.

Copilot re-review caught a real regression: my PR set count=batch_size on all xreadgroup calls, including the pending-entry-list reads (id="0"). Combined with the unconditional flip to read_pending=False after the first call, this meant: only the first batch_size pending entries got processed per job session; anything beyond was stranded until restart. The "stay in pending mode until empty" attempt I tried first didn't work: XREADGROUP id="0" always returns from the start of the PEL (since the producer doesn't ack inside produce_chunks), so a smaller count just makes us re-read the same first page forever. Fix: revert to the pre-PR semantic for the pending read — count=None drains the entire PEL in one call. Keep count=batch_size for the new-message read (id=">") so the #377 batching contract still applies to live streams. The producer's yield-as-a-chunk behavior (the actual fix for #377) is unchanged. Updated tests: - test_redis_new_message_read_uses_count_equal_to_batch_size: pending call uses count=None, new-message call uses count=batch_size - test_redis_drains_full_pel_in_one_call_even_when_larger_than_batch_size: 20 pending entries drain in a single call; base class re-chunks to batch_size=5 -> four batches of 5 Real-Redis smoke against Redis 7 with PEL=25, batch_size=5: produces batches=[5,5,5,5,5], all PEL delivered, no re-reads. The other Copilot comment (sync redis-py with block=0 freezes the asyncio event loop) is a real architectural concern but pre-existing — same behavior in pre-PR code. Deferred to a follow-up issue if needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three final verification additions: 1. Property-based rechunker tests (hypothesis): probe ~1000 generated chunk-size/batch-size combinations to verify the rechunker's invariants — record conservation, order preservation, all batches well-formed, no empty emissions, partial-only-at-end. Adds hypothesis to test extras. Catches the class of bug Copilot flagged where my existing tests only covered specific inputs, not the contract. 2. test_external_task_cancellation_cleans_up_pump: simulates the Job.shutdown / Job.run cancellation path (cancelling the outer task that iterates produce()) and verifies no producer pump task is orphaned afterward. The spec claims this works; now there's a test. 3. mypy fix: Producer.DEFAULT_FLUSH_MS was inferred as None-only, making subclass overrides with int fail strict type-checking. Now typed as Optional[int]. mypy clean on all 9 changed source files. 90 tests pass (was 84: +5 property tests, +1 external-cancel test). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

spicy-sauce and others added 26 commits April 1, 2025 13:43

Use sa.sql.text for SQL statement execution in Block class

124bb93

Add batch processing capability to Block class and update schema for …

4773f9a

…batch_size

update json schemas

c02ab04

update autogenerated docs

3a43bc1

Fix log message to remove unnecessary exclamation mark in batch size …

0b51d63

…initialization

Merge branch 'batch_size_in_std_read_block' of https://github.com/dat…

4bf88eb

…ayoga-io/datayoga into batch_size_in_std_read_block

Merge main into feature/when-block-in-redis-lookup

13fc0a7

update json schemas

8f4f992

Increase timeout for integration tests from 10 to 15 minutes

807d61a

Add $inherit schema fragment resolver (#400)

5c178b6

Tighten $inherit resolver: reject non-list, guard against nested (#400)

cb126c6

Remove unused List import in schema_utils (#400)

0931918

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Producer base class re-chunks via produce_chunks (#400)

c9dbe92

Migrate std/read to produce_chunks (#400, #296)

f1311d8

Migrate files/read_csv to produce_chunks (#400)

12c13fb

Migrate parquet/read to produce_chunks, fix one-by-one yield (#400, #293

1af0c66

)

Migrate relational/read to produce_chunks, add fetch_size (#400, #295)

85ac26a

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Migrate http/receiver to produce_chunks (#400)

3b72998

Migrate redis/read_stream to batched xreadgroup (#400, #377)

0b774ac

Migrate azure/read_event_hub; rename batch_size -> max_batch_size (#400…

38cf4ec

…, BREAKING)

Regenerate JSON schemas and reference docs after producer batching (#400

b67bc4a

)

Resolve $inherit before jsonschema2mk so block docs include batch_size (

6069465

#400)

Document producer batching model in processing-strategies (#400)

9dbc5d3

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

spicy-sauce added enhancement New feature or request core block tests labels May 28, 2026

spicy-sauce added the documentation Improvements or additions to documentation label May 28, 2026

spicy-sauce self-assigned this May 28, 2026

spicy-sauce and others added 5 commits May 28, 2026 17:33

Merge remote-tracking branch 'origin/main' into 400-producer-batching…

50102aa

…-unification # Conflicts: # core/src/datayoga_core/blocks/std/read/block.py # core/src/datayoga_core/blocks/std/read/block.schema.json # docs/reference/blocks/std_read.md # schemas/job.schema.json

Format superpowers spec and plan with prettier (#400)

a3d9275

formatting-check runs prettier --check on all .md including these. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

spicy-sauce requested a review from Copilot May 28, 2026 15:47

Copilot started reviewing on behalf of spicy-sauce May 28, 2026 15:47 View session

Copilot AI reviewed May 28, 2026

View reviewed changes

Comment thread core/src/datayoga_core/producer.py Outdated

Comment thread core/src/datayoga_core/producer.py Outdated

Comment thread docs/processing-strategies.md Outdated

spicy-sauce and others added 8 commits May 28, 2026 19:04

Drop retrospective implementation plan; keep design spec (#400)

056f8cf

The 2225-line plan was execution scaffolding for the work that now ships. The spec remains as the architectural record. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Rename test_schema_inherit.py -> test_schema_refs.py (#400)

26a271b

The file now tests schema_utils.resolve_refs (standard JSON Schema \$ref resolution), not the old custom \$inherit extension. Filename was stale. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

spicy-sauce requested a review from Copilot May 31, 2026 10:50

Copilot started reviewing on behalf of spicy-sauce May 31, 2026 10:50 View session

Copilot AI reviewed May 31, 2026

View reviewed changes

Comment thread core/src/datayoga_core/blocks/redis/read_stream/block.py

Comment thread core/src/datayoga_core/blocks/redis/read_stream/block.py

spicy-sauce and others added 2 commits May 31, 2026 14:05

spicy-sauce merged commit 9aeef13 into main May 31, 2026
5 checks passed

spicy-sauce deleted the 400-producer-batching-unification branch May 31, 2026 12:23

This was referenced May 31, 2026

Implement batch support for the redis/read_stream producer #294

Closed

Implement batch support for the relational/read producer #295

Closed

Implement batch support for the std/read producer #296

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unify producer batching across all read blocks (#400)#401

Unify producer batching across all read blocks (#400)#401
spicy-sauce merged 41 commits into
mainfrom
400-producer-batching-unification

spicy-sauce commented May 28, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

spicy-sauce commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Breaking change

Other behavior notes (non-breaking, worth a callout)

Test plan

Review status

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

spicy-sauce commented May 28, 2026 •

edited

Loading