feat(admin-api): support inter-table dependencies in derived dataset validation by mitchhs12 · Pull Request #1912 · edgeandnode/amp

mitchhs12 · 2026-03-05T19:46:31Z

Summary

Adds full inter-table dependency support for derived datasets — both validation (admin API) and runtime (dump engine). Tables within a derived dataset can now reference sibling tables using self.<table_name> syntax (e.g., SELECT * FROM self.blocks_base), consistent with the existing self. UDF convention.

Part 1: Validation (admin API)

Add SelfSchemaProvider.add_table() for progressive schema registration during topological processing
Add cycle detection via topological_sort(), returning a CYCLIC_DEPENDENCY error (HTTP 400) when a cycle is found
Process tables in dependency order in /schema and /manifests endpoints
5 integration tests: basic self-ref, 3-table chain, cycle rejection, self-referencing table, mixed deps
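The ordering and cycle check listed above can be sketched with Kahn's algorithm. This is a minimal, std-only illustration, not the PR's `topological_sort()`: the names `topo_order` and `CycleError` and the map-of-names input are assumptions, since the actual signature in `sorting.rs` is not shown here. A table that references itself is treated as a cycle, matching the "self-referencing table" test case.

```rust
use std::collections::{BTreeMap, VecDeque};

/// Hypothetical error type; the PR's `CyclicDepError` may carry different data.
#[derive(Debug, PartialEq, Eq)]
pub struct CycleError {
    /// Tables that could not be ordered because they sit on or behind a cycle.
    pub unresolved: Vec<String>,
}

/// Orders tables so that each one comes after the siblings it references.
/// `deps` maps a table name to the sibling names used in its `self.<table>` refs.
pub fn topo_order(deps: &BTreeMap<&str, Vec<&str>>) -> Result<Vec<String>, CycleError> {
    // Number of sibling dependencies each table is still waiting on.
    let mut pending: BTreeMap<&str, usize> = BTreeMap::new();
    // Reverse edges: dependency -> tables that reference it.
    let mut dependents: BTreeMap<&str, Vec<&str>> = BTreeMap::new();

    for (&table, table_deps) in deps {
        let mut count = 0;
        for &dep in table_deps {
            // Assumption: names that are not sibling tables are reported elsewhere.
            if deps.contains_key(dep) {
                dependents.entry(dep).or_default().push(table);
                count += 1;
            }
        }
        pending.insert(table, count);
    }

    // Start with tables that have no sibling dependencies.
    let mut ready: VecDeque<&str> = pending
        .iter()
        .filter(|(_, n)| **n == 0)
        .map(|(t, _)| *t)
        .collect();
    let mut order = Vec::with_capacity(deps.len());

    while let Some(table) = ready.pop_front() {
        order.push(table.to_string());
        for &dependent in dependents.get(table).into_iter().flatten() {
            if let Some(n) = pending.get_mut(dependent) {
                *n -= 1;
                if *n == 0 {
                    ready.push_back(dependent);
                }
            }
        }
    }

    if order.len() == deps.len() {
        Ok(order)
    } else {
        // Anything still waiting on a dependency is part of, or behind, a cycle.
        let unresolved = pending
            .into_iter()
            .filter(|(_, n)| *n > 0)
            .map(|(t, _)| t.to_string())
            .collect();
        Err(CycleError { unresolved })
    }
}
```

In a handler, an `Err` from such a function would be the natural place to map to the CYCLIC_DEPENDENCY response, while the `Ok` order drives progressive schema registration.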

Part 2: Runtime (dump engine)

Split physical_for_dump::create() into resolve_external_deps() + build_catalog() so callers can inject self-ref entries alongside external deps
Extract partition_table_refs() to separate self. refs from external deps
Register sibling tables in both planning (SelfSchemaProvider for column types) and execution (ResolvedTableEntry for physical data) phases
Replace earliest_block() early-exit with notification-driven polling loop — self-ref tables wait for sibling data instead of exiting immediately
Pass sibling PhysicalTable map from orchestrator to each table task
Tables continue dumping in parallel (no topological ordering at runtime) — the existing streaming/notification system handles dependency ordering naturally
Un-ignore intra_deps_test E2E test
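The split between `self.` references and external dependencies can be illustrated roughly as follows. This is a std-only sketch under stated assumptions: `TableRef`, `PartitionedRefs`, `parse_ref`, and `partition_refs` are hypothetical stand-ins, and the real `partition_table_refs()` works on DataFusion table references rather than on plain strings.

```rust
/// Hypothetical stand-in for a parsed `schema.table` reference.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TableRef {
    pub schema: String,
    pub table: String,
}

/// Result of splitting references into sibling tables and external deps.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct PartitionedRefs {
    /// Names of sibling tables referenced as `self.<table_name>`.
    pub self_refs: Vec<String>,
    /// References that go through a dependency alias.
    pub external: Vec<TableRef>,
}

/// Parses exactly `schema.table`; bare names and three-part names are rejected.
pub fn parse_ref(raw: &str) -> Option<TableRef> {
    let (schema, table) = raw.split_once('.')?;
    if schema.is_empty() || table.is_empty() || table.contains('.') {
        return None;
    }
    Some(TableRef {
        schema: schema.to_string(),
        table: table.to_string(),
    })
}

/// Separates `self.` references from everything else, preserving input order.
pub fn partition_refs(refs: impl IntoIterator<Item = TableRef>) -> PartitionedRefs {
    let mut out = PartitionedRefs::default();
    for r in refs {
        if r.schema == "self" {
            out.self_refs.push(r.table);
        } else {
            out.external.push(r);
        }
    }
    out
}
```

Keeping the partition step separate means the external half can be handed to a generic resolver unchanged, while the `self_refs` half is looked up in the siblings map.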

Key design decisions

self. convention: Aligns with UDF convention (self.functionName()). Parsed by DataFusion as TableReference::Partial { schema: "self", table: "..." }
Resolve + build split: for_dump.rs has zero self-ref knowledge — it resolves external deps and builds catalogs from generic entries. Self-ref resolution lives in table.rs where it belongs
Parallel, not sequential: Per Leo's feedback, tables dump in parallel. The streaming query notification pipeline handles ordering — same mechanism as external deps
Notification-driven start block: Self-ref tables subscribe to sibling notifications and wait until data appears, protected by FailFastJoinSet cancellation if a sibling fails
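The notification-driven start block can be pictured with a blocking channel. This is only a sketch of the idea, assuming a synchronous `std::sync::mpsc` channel in place of the engine's notification pipeline; `SiblingUpdate`, `WaitError`, and `wait_for_start_block` are hypothetical names, and the deadline is an addition for the sketch rather than something the PR describes. A dropped sender plays the role of a sibling task that failed or was cancelled.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Hypothetical notification sent by a sibling table task.
#[derive(Debug)]
pub struct SiblingUpdate {
    /// `None` until the sibling has written any data.
    pub earliest_block: Option<u64>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum WaitError {
    /// The sibling task ended (failed or was cancelled) before producing data.
    SiblingGone,
    /// No data appeared within the allowed time.
    TimedOut,
}

/// Waits for the sibling's first data instead of exiting when none exists yet.
pub fn wait_for_start_block(
    rx: &Receiver<SiblingUpdate>,
    max_wait: Duration,
) -> Result<u64, WaitError> {
    let deadline = Instant::now() + max_wait;
    loop {
        let remaining = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(remaining) {
            // Data is available: this is the block to start from.
            Ok(SiblingUpdate { earliest_block: Some(block) }) => return Ok(block),
            // Notified, but still no data: keep waiting.
            Ok(SiblingUpdate { earliest_block: None }) => continue,
            Err(RecvTimeoutError::Disconnected) => return Err(WaitError::SiblingGone),
            Err(RecvTimeoutError::Timeout) => return Err(WaitError::TimedOut),
        }
    }
}
```

The key contrast with an early exit is the `continue` arm: an empty sibling is treated as "not ready yet" rather than "nothing to do".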

Files changed

File	Changes
`common/src/self_schema_provider.rs`	`add_table()` for progressive schema registration
`common/src/catalog/physical/for_dump.rs`	Split `create()` → `resolve_external_deps()` + `build_catalog()`, add `ResolvedTableEntry`
`worker-datasets-derived/src/job_impl.rs`	Build siblings map, pass to each `materialize_table()` call; 1 unit test
`worker-datasets-derived/src/job_impl/table.rs`	`partition_table_refs()`, self-ref resolution in both phases, notification-driven polling; 5 unit tests
`datasets-derived/src/sorting.rs`	`topological_sort()` and `CyclicDepError`
`admin-api/src/handlers/schema.rs`	Topological ordering + cycle detection in `/schema`
`admin-api/src/handlers/common.rs`	Topological ordering + cycle detection in `/manifests`
`tests/src/tests/it_dependencies.rs`	Remove `#[ignore]` from `intra_deps_test`
`docs/feat/data-inter-table-dependencies.md`	Document runtime support, remove "validation only" limitation

Test plan

5 integration tests for validation (self-ref, chain, cycle, self-cycle, mixed deps)
6 unit tests for runtime (partition logic, error fatality)
E2E intra_deps_test passes (dump + query with inter-table deps)
All 11 crate tests pass (cargo test -p amp-worker-datasets-derived)
Format, check, clippy — zero warnings

Add self-qualified table references (self.<table_name>) for tables within the same derived dataset, consistent with the existing self.functionName() UDF convention. Tables are topologically sorted by their dependencies and validated in order, with each table's schema progressively registered so subsequent tables can reference it.

- Add SelfSchemaProvider::add_table() for progressive schema registration
- Wire topological sort and cycle detection into schema.rs and common.rs
- Add CATALOG_QUALIFIED_TABLE and INVALID_TABLE_NAME error codes to schema.rs, replacing the generic TABLE_REFERENCE_RESOLUTION for these cases. The new reference extraction step catches these errors earlier (before planning), so they now get specific codes matching common.rs
- Add inter-table dependency integration tests (cycle, chain, mixed deps)
- Update test fixtures to use self. prefix for intra-dataset references
- Add feature documentation

This E2E test exercises the full pipeline (register → dump → query), but runtime self-ref resolution in the dump engine is not yet implemented. The dump engine's physical_for_dump::create treats "self" as a dependency alias and fails with DependencyAliasNotFound. Keep ignored until runtime support is added.

LNSD · 2026-03-05T21:13:41Z

crates/core/common/src/self_schema_provider.rs

+    ///
+    /// After a table's schema is inferred during inter-table dependency processing,
+    /// call this method so that subsequent tables can reference it via bare SQL names.
+    pub fn add_table(&self, name: &str, schema: SchemaRef) {


Suggested change

-    pub fn add_table(&self, name: &str, schema: SchemaRef) {
+    pub fn add_table(&self, name: impl Into<String>, schema: SchemaRef) {

So we can pass a TableName type.
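To illustrate what the suggestion enables, here is a minimal sketch. Everything in it is an assumption made for the example: `SchemaRef` is a stand-in for Arrow's type, `TableName` is a hypothetical newtype with a `From<TableName> for String` conversion, and `SketchSchemaProvider` is not the real `SelfSchemaProvider`.

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, Mutex};

/// Stand-in for Arrow's `SchemaRef`; here just a list of column names.
pub type SchemaRef = Arc<Vec<String>>;

/// Hypothetical table-name newtype.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TableName(String);

impl TableName {
    pub fn new(name: &str) -> Self {
        TableName(name.to_string())
    }
}

impl From<TableName> for String {
    fn from(name: TableName) -> Self {
        name.0
    }
}

/// Simplified provider; interior mutability allows registration through `&self`.
#[derive(Default)]
pub struct SketchSchemaProvider {
    tables: Mutex<BTreeMap<String, SchemaRef>>,
}

impl SketchSchemaProvider {
    /// Accepts `&str`, `String`, or any owned type convertible into `String`.
    pub fn add_table(&self, name: impl Into<String>, schema: SchemaRef) {
        self.tables
            .lock()
            .expect("lock poisoned")
            .insert(name.into(), schema);
    }

    /// Registered table names in sorted order.
    pub fn table_names(&self) -> Vec<String> {
        self.tables
            .lock()
            .expect("lock poisoned")
            .keys()
            .cloned()
            .collect()
    }
}
```

With this signature, callers holding an owned name avoid an extra borrow-then-allocate step, and existing `&str` call sites keep compiling.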

- Use DepAliasOrSelfRef instead of String for type-driven design
- Preserve error chain with #[source] instead of stringifying errors
- Rename tests to follow function_scenario_outcome convention
- Replace .expect() with let-else + continue
- Add doc comment to NonIncrementalQuery error variant
- Use let-else for defensive manifest lookup in common.rs

…port Enable self-ref tables (e.g., `SELECT * FROM self.blocks_base`) to work at runtime during dump materialization. Previously only validation was supported (PR #1).

Key changes:
- Split `physical_for_dump::create()` into `resolve_external_deps()` + `build_catalog()` so callers can inject self-ref entries
- Add `partition_table_refs()` to separate `self.` refs from external deps
- Register sibling tables in both planning (SelfSchemaProvider) and execution (ResolvedTableEntry) phases
- Replace earliest_block early-exit with notification-driven polling loop so self-ref tables wait for sibling data instead of exiting immediately
- Pass sibling PhysicalTable map from orchestrator to each table task
- Un-ignore `intra_deps_test` E2E test

mitchhs12 added 3 commits March 5, 2026 14:06

docs(feat): update inter-table dependencies doc with error codes and …

a41c2a7

…limitations Add CATALOG_QUALIFIED_TABLE and INVALID_TABLE_NAME to the error codes table, and document that runtime execution (dump/query) is not yet supported.

LNSD reviewed Mar 5, 2026

mitchhs12 added 7 commits March 5, 2026 16:25

refactor(common): accept impl Into<String> in SelfSchemaProvider::add…

cb96f16

…_table Per review feedback — allows callers to pass owned types directly.

style(common): format add_table method body

a7619ef

docs(common): fix inaccurate doc comment on add_table

64d4367

Replace "bare SQL names" with "self.<table_name>" which accurately describes the reference convention.

refactor(admin-api): minor tidying of inter-table dep code

da7611c

- Remove unnecessary table_names_set intermediate in common.rs
- Fix terminology: "intra-dataset" → "inter-table" for consistency
- Assert specific error code (INVALID_PLAN) in self-ref test

fix(tests): correct self-ref test assertion to expect SCHEMA_INFERENC…

a8e5103

…E (500) The unresolved self.table_a error from DataFusion is not tagged with amp::invalid_input, so it maps to SCHEMA_INFERENCE (500) not INVALID_PLAN (400).

mitchhs12 wants to merge 10 commits into main from mitchhs12/inter-table-deps
