
feat(admin-api): support inter-table dependencies in derived dataset validation#1912

Draft
mitchhs12 wants to merge 10 commits into main from mitchhs12/inter-table-deps

@mitchhs12 mitchhs12 commented Mar 5, 2026

Summary

Adds full inter-table dependency support for derived datasets — both validation (admin API) and runtime (dump engine). Tables within a derived dataset can now reference sibling tables using self.<table_name> syntax (e.g., SELECT * FROM self.blocks_base), consistent with the existing self. UDF convention.

Part 1: Validation (admin API)

  • Add SelfSchemaProvider.add_table() for progressive schema registration during topological processing
  • Add cycle detection via topological_sort(), returning CYCLIC_DEPENDENCY 400 error
  • Process tables in dependency order in /schema and /manifests endpoints
  • 5 integration tests: basic self-ref, 3-table chain, cycle rejection, self-referencing table, mixed deps
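
The ordering and cycle check described in Part 1 can be sketched roughly as follows. This is a minimal illustration using Kahn's algorithm, not the PR's actual `topological_sort()`; the names `topo_sort` and `CycleError` and the map-of-sets input shape are assumptions made for the example, and dependencies that are not sibling tables are simply ignored here.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical error type; the PR's real `CyclicDepError` may carry different fields.
#[derive(Debug, PartialEq, Eq)]
pub struct CycleError {
    /// Tables that could not be ordered because they sit on, or depend on, a cycle.
    pub unresolved: Vec<String>,
}

/// Kahn-style ordering. `deps` maps each table to the sibling tables it reads via `self.<name>`.
/// Returns the tables so that every table appears after all of its dependencies.
pub fn topo_sort(deps: &BTreeMap<String, BTreeSet<String>>) -> Result<Vec<String>, CycleError> {
    // Unmet dependencies per table. Names that are not tables in this dataset are
    // dropped (assumption: they are external and validated elsewhere).
    let mut pending: BTreeMap<&str, BTreeSet<&str>> = deps
        .iter()
        .map(|(table, ds)| {
            let known: BTreeSet<&str> = ds
                .iter()
                .map(String::as_str)
                .filter(|d| deps.contains_key(*d))
                .collect();
            (table.as_str(), known)
        })
        .collect();

    let mut order = Vec::with_capacity(deps.len());
    loop {
        // Tables whose dependencies have all been emitted already.
        let ready: Vec<&str> = pending
            .iter()
            .filter(|(_, ds)| ds.is_empty())
            .map(|(table, _)| *table)
            .collect();
        if ready.is_empty() {
            break;
        }
        for table in ready {
            pending.remove(table);
            for ds in pending.values_mut() {
                ds.remove(table);
            }
            order.push(table.to_string());
        }
    }

    if pending.is_empty() {
        Ok(order)
    } else {
        // Whatever is left can never become ready: a cycle (including a self-cycle).
        Err(CycleError {
            unresolved: pending.keys().map(|t| t.to_string()).collect(),
        })
    }
}
```

In a handler, an `Err` from a function like this would be the point at which a CYCLIC_DEPENDENCY 400 response is produced, and the `Ok` order would drive progressive `add_table()` registration.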

Part 2: Runtime (dump engine)

  • Split physical_for_dump::create() into resolve_external_deps() + build_catalog() so callers can inject self-ref entries alongside external deps
  • Extract partition_table_refs() to separate self. refs from external deps
  • Register sibling tables in both planning (SelfSchemaProvider for column types) and execution (ResolvedTableEntry for physical data) phases
  • Replace earliest_block() early-exit with notification-driven polling loop — self-ref tables wait for sibling data instead of exiting immediately
  • Pass sibling PhysicalTable map from orchestrator to each table task
  • Tables continue dumping in parallel (no topological ordering at runtime) — the existing streaming/notification system handles dependency ordering naturally
  • Un-ignore intra_deps_test E2E test
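
As a rough illustration of the partition step, the sketch below splits a list of table references into sibling names and external references. It is not the PR's `partition_table_refs()`: `TableRef` is a simplified local stand-in for DataFusion's `TableReference` (which uses `Arc<str>` fields), and `partition_refs` and its return shape are hypothetical.

```rust
/// Simplified stand-in for DataFusion's `TableReference`, for illustration only.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TableRef {
    Bare { table: String },
    Partial { schema: String, table: String },
    Full { catalog: String, schema: String, table: String },
}

/// Splits references into (sibling table names, external references).
/// Only a schema-qualified reference whose schema is exactly `self` counts as a sibling.
pub fn partition_refs(refs: Vec<TableRef>) -> (Vec<String>, Vec<TableRef>) {
    let mut siblings = Vec::new();
    let mut external = Vec::new();
    for r in refs {
        match r {
            // `self.<table_name>` refers to a table in the same derived dataset.
            TableRef::Partial { schema, table } if schema == "self" => siblings.push(table),
            // Everything else is left for external dependency resolution.
            other => external.push(other),
        }
    }
    (siblings, external)
}
```

Keeping the split in one small function is what lets the external-dependency resolver stay unaware of `self.` references: it only ever sees the second half of the tuple.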

Key design decisions

  • self. convention: Aligns with UDF convention (self.functionName()). Parsed by DataFusion as TableReference::Partial { schema: "self", table: "..." }
  • Resolve + build split: for_dump.rs has zero self-ref knowledge — it resolves external deps and builds catalogs from generic entries. Self-ref resolution lives in table.rs where it belongs
  • Parallel, not sequential: Per Leo's feedback, tables dump in parallel. The streaming query notification pipeline handles ordering — same mechanism as external deps
  • Notification-driven start block: Self-ref tables subscribe to sibling notifications and wait until data appears, protected by FailFastJoinSet cancellation if a sibling fails
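
The "wait instead of exit" behaviour can be pictured with the following synchronous sketch. The real dump engine is asynchronous and relies on its own notification pipeline and FailFastJoinSet; here a `std::sync::mpsc` channel stands in for both, and `SiblingUpdate`, `WaitError`, and `wait_for_start_block` are hypothetical names introduced only for this example.

```rust
use std::sync::mpsc::Receiver;

/// Hypothetical notification: a sibling reports the earliest block it holds, if any.
pub struct SiblingUpdate {
    pub earliest_block: Option<u64>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum WaitError {
    /// The sending side was dropped, e.g. because the sibling task failed or was cancelled.
    SiblingGone,
}

/// Blocks until a sibling reports data, rather than returning early when none exists yet.
pub fn wait_for_start_block(rx: &Receiver<SiblingUpdate>) -> Result<u64, WaitError> {
    loop {
        match rx.recv() {
            // Sibling has data: this is the block to start from.
            Ok(SiblingUpdate { earliest_block: Some(block) }) => return Ok(block),
            // Sibling is still empty: keep waiting for the next notification.
            Ok(SiblingUpdate { earliest_block: None }) => continue,
            // Channel closed: stop waiting so the task does not hang forever.
            Err(_) => return Err(WaitError::SiblingGone),
        }
    }
}
```

The closed-channel branch plays the role that cancellation plays in the description above: a waiting table must have some way to stop when the sibling it depends on will never produce data.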

Files changed

| File | Changes |
| --- | --- |
| common/src/self_schema_provider.rs | add_table() for progressive schema registration |
| common/src/catalog/physical/for_dump.rs | Split create() into resolve_external_deps() + build_catalog(), add ResolvedTableEntry |
| worker-datasets-derived/src/job_impl.rs | Build siblings map, pass to each materialize_table() call; 1 unit test |
| worker-datasets-derived/src/job_impl/table.rs | partition_table_refs(), self-ref resolution in both phases, notification-driven polling; 5 unit tests |
| datasets-derived/src/sorting.rs | topological_sort() and CyclicDepError |
| admin-api/src/handlers/schema.rs | Topological ordering + cycle detection in /schema |
| admin-api/src/handlers/common.rs | Topological ordering + cycle detection in /manifests |
| tests/src/tests/it_dependencies.rs | Remove #[ignore] from intra_deps_test |
| docs/feat/data-inter-table-dependencies.md | Document runtime support, remove "validation only" limitation |

Related

  • Leo's prior runtime implementation: Table self reference #1524 (closed — codebase has since been refactored)
  • Feature doc: docs/feat/data-inter-table-dependencies.md

Test plan

  • 5 integration tests for validation (self-ref, chain, cycle, self-cycle, mixed deps)
  • 6 unit tests for runtime (partition logic, error fatality)
  • E2E intra_deps_test passes (dump + query with inter-table deps)
  • All 11 crate tests pass (cargo test -p amp-worker-datasets-derived)
  • Format, check, clippy — zero warnings

Add self-qualified table references (self.<table_name>) for tables within
the same derived dataset, consistent with the existing self.functionName()
UDF convention. Tables are topologically sorted by their dependencies and
validated in order, with each table's schema progressively registered so
subsequent tables can reference it.

- Add SelfSchemaProvider::add_table() for progressive schema registration
- Wire topological sort and cycle detection into schema.rs and common.rs
- Add CATALOG_QUALIFIED_TABLE and INVALID_TABLE_NAME error codes to
  schema.rs, replacing the generic TABLE_REFERENCE_RESOLUTION for these
  cases — the new reference extraction step catches these errors earlier
  (before planning), so they now get specific codes matching common.rs
- Add inter-table dependency integration tests (cycle, chain, mixed deps)
- Update test fixtures to use self. prefix for intra-dataset references
- Add feature documentation
…limitations

Add CATALOG_QUALIFIED_TABLE and INVALID_TABLE_NAME to the error codes
table, and document that runtime execution (dump/query) is not yet
supported.

This E2E test exercises the full pipeline (register → dump → query),
but runtime self-ref resolution in the dump engine is not yet
implemented. The dump engine's physical_for_dump::create treats "self"
as a dependency alias and fails with DependencyAliasNotFound.

Keep ignored until runtime support is added.
///
/// After a table's schema is inferred during inter-table dependency processing,
/// call this method so that subsequent tables can reference it via bare SQL names.
pub fn add_table(&self, name: &str, schema: SchemaRef) {
Suggested change
- pub fn add_table(&self, name: &str, schema: SchemaRef) {
+ pub fn add_table(&self, name: impl Into<String>, schema: SchemaRef) {

So we can pass a TableName type
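
A small sketch of what the suggested signature permits, under stated assumptions: `TableName` here is a hypothetical newtype with a `From<TableName> for String` impl, `SchemaRef` is a placeholder alias rather than Arrow's type, and `Provider` with its `RwLock`-backed map is an invented stand-in for `SelfSchemaProvider` (interior mutability is assumed because the method takes `&self`).

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical newtype standing in for the project's `TableName`.
pub struct TableName(pub String);

impl From<TableName> for String {
    fn from(name: TableName) -> String {
        name.0
    }
}

/// Placeholder for Arrow's `SchemaRef`; here just a shared list of column names.
pub type SchemaRef = Arc<Vec<String>>;

/// Invented stand-in for `SelfSchemaProvider`.
#[derive(Default)]
pub struct Provider {
    tables: RwLock<HashMap<String, SchemaRef>>,
}

impl Provider {
    /// Accepts `&str`, `String`, or any type convertible into `String`.
    pub fn add_table(&self, name: impl Into<String>, schema: SchemaRef) {
        self.tables
            .write()
            .expect("lock poisoned")
            .insert(name.into(), schema);
    }

    /// Reports whether a table with this name has been registered.
    pub fn has_table(&self, name: &str) -> bool {
        self.tables.read().expect("lock poisoned").contains_key(name)
    }
}
```

With this shape, both `provider.add_table("blocks_base", schema)` and `provider.add_table(TableName(...), schema)` compile, and an owned name is moved into the map without an extra allocation.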

- Use DepAliasOrSelfRef instead of String for type-driven design
- Preserve error chain with #[source] instead of stringifying errors
- Rename tests to follow function_scenario_outcome convention
- Replace .expect() with let-else + continue
- Add doc comment to NonIncrementalQuery error variant
- Use let-else for defensive manifest lookup in common.rs
…_table

Per review feedback — allows callers to pass owned types directly.
Replace "bare SQL names" with "self.<table_name>" which accurately
describes the reference convention.
- Remove unnecessary table_names_set intermediate in common.rs
- Fix terminology: "intra-dataset" → "inter-table" for consistency
- Assert specific error code (INVALID_PLAN) in self-ref test
…E (500)

The unresolved self.table_a error from DataFusion is not tagged with
amp::invalid_input, so it maps to SCHEMA_INFERENCE (500) not
INVALID_PLAN (400).
…port

Enable self-ref tables (e.g., `SELECT * FROM self.blocks_base`) to work
at runtime during dump materialization. Previously only validation was
supported (PR #1).

Key changes:
- Split `physical_for_dump::create()` into `resolve_external_deps()` +
  `build_catalog()` so callers can inject self-ref entries
- Add `partition_table_refs()` to separate `self.` refs from external deps
- Register sibling tables in both planning (SelfSchemaProvider) and
  execution (ResolvedTableEntry) phases
- Replace earliest_block early-exit with notification-driven polling loop
  so self-ref tables wait for sibling data instead of exiting immediately
- Pass sibling PhysicalTable map from orchestrator to each table task
- Un-ignore `intra_deps_test` E2E test