feat(admin-api): support inter-table dependencies in derived dataset validation#1912
Draft
Add self-qualified table references (self.<table_name>) for tables within the same derived dataset, consistent with the existing self.functionName() UDF convention. Tables are topologically sorted by their dependencies and validated in order, with each table's schema progressively registered so subsequent tables can reference it.
- Add SelfSchemaProvider::add_table() for progressive schema registration
- Wire topological sort and cycle detection into schema.rs and common.rs
- Add CATALOG_QUALIFIED_TABLE and INVALID_TABLE_NAME error codes to schema.rs, replacing the generic TABLE_REFERENCE_RESOLUTION for these cases. The new reference extraction step catches these errors earlier (before planning), so they now get specific codes matching common.rs
- Add inter-table dependency integration tests (cycle, chain, mixed deps)
- Update test fixtures to use self. prefix for intra-dataset references
- Add feature documentation
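The ordering step described in this commit can be illustrated with a minimal, std-only sketch of Kahn's algorithm. The function name `topological_sort` and the error name `CyclicDepError` appear in this PR, but the signature, fields, and map-based input below are assumptions made for illustration, not the code in `sorting.rs`.

```rust
use std::collections::{BTreeMap, BTreeSet, VecDeque};

/// Hypothetical error shape; the PR's actual `CyclicDepError` may differ.
#[derive(Debug, PartialEq)]
pub struct CyclicDepError {
    /// Tables that could not be ordered because they sit on or behind a cycle.
    pub unresolved: Vec<String>,
}

/// Orders tables so that each one comes after the siblings it references.
/// `deps` maps a table name to the names it reads via `self.<table_name>`.
/// Names that are not keys of `deps` are ignored in this sketch.
pub fn topological_sort(
    deps: &BTreeMap<String, BTreeSet<String>>,
) -> Result<Vec<String>, CyclicDepError> {
    // Count unmet sibling dependencies per table and build reverse edges.
    let mut pending: BTreeMap<&str, usize> = BTreeMap::new();
    let mut dependents: BTreeMap<&str, Vec<&str>> = BTreeMap::new();
    for (table, table_deps) in deps {
        let mut count = 0;
        for dep in table_deps.iter().filter(|d| deps.contains_key(*d)) {
            dependents.entry(dep.as_str()).or_default().push(table.as_str());
            count += 1;
        }
        pending.insert(table.as_str(), count);
    }

    // Kahn's algorithm: repeatedly emit tables with no unmet dependencies.
    let mut ready: VecDeque<&str> = pending
        .iter()
        .filter(|(_, n)| **n == 0)
        .map(|(t, _)| *t)
        .collect();
    let mut order = Vec::with_capacity(deps.len());
    while let Some(table) = ready.pop_front() {
        order.push(table.to_string());
        for dependent in dependents.get(table).into_iter().flatten() {
            if let Some(n) = pending.get_mut(dependent) {
                *n -= 1;
                if *n == 0 {
                    ready.push_back(*dependent);
                }
            }
        }
    }

    // Any table never emitted still has an unmet dependency, so a cycle exists.
    if order.len() == deps.len() {
        Ok(order)
    } else {
        let unresolved = pending
            .into_iter()
            .filter(|(_, n)| *n > 0)
            .map(|(t, _)| t.to_string())
            .collect();
        Err(CyclicDepError { unresolved })
    }
}
```

In a validation loop, the returned order would be walked front to back, registering each table's inferred schema before the next table is planned. A table that references itself is reported as a cycle in this sketch.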
…limitations Add CATALOG_QUALIFIED_TABLE and INVALID_TABLE_NAME to the error codes table, and document that runtime execution (dump/query) is not yet supported.
This E2E test exercises the full pipeline (register → dump → query), but runtime self-ref resolution in the dump engine is not yet implemented. The dump engine's physical_for_dump::create treats "self" as a dependency alias and fails with DependencyAliasNotFound. Keep ignored until runtime support is added.
LNSD
reviewed
Mar 5, 2026
///
/// After a table's schema is inferred during inter-table dependency processing,
/// call this method so that subsequent tables can reference it via bare SQL names.
pub fn add_table(&self, name: &str, schema: SchemaRef) {
Contributor
Suggested change
- pub fn add_table(&self, name: &str, schema: SchemaRef) {
+ pub fn add_table(&self, name: impl Into<String>, schema: SchemaRef) {
So we can pass a TableName type
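To make the suggestion concrete, here is a small std-only sketch of how `impl Into<String>` lets one method accept `&str`, `String`, and a name newtype. The `SchemaRef` alias, the `TableName` definition, the lock-based storage, and `table_names()` are all stand-ins assumed for illustration; the real types in this codebase are not shown in the thread.

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, RwLock};

/// Stand-in for the real schema handle; here it is just a list of column names.
pub type SchemaRef = Arc<Vec<String>>;

/// Hypothetical newtype standing in for the `TableName` type from the review.
pub struct TableName(pub String);

impl From<TableName> for String {
    fn from(name: TableName) -> Self {
        name.0
    }
}

/// Simplified provider: interior mutability lets `add_table` take `&self`.
#[derive(Default)]
pub struct SelfSchemaProvider {
    tables: RwLock<BTreeMap<String, SchemaRef>>,
}

impl SelfSchemaProvider {
    /// Accepts anything convertible into `String`, so owned names move in
    /// without an extra allocation at the call site.
    pub fn add_table(&self, name: impl Into<String>, schema: SchemaRef) {
        let mut tables = self.tables.write().unwrap_or_else(|e| e.into_inner());
        tables.insert(name.into(), schema);
    }

    /// Helper for this sketch only: lists registered names in sorted order.
    pub fn table_names(&self) -> Vec<String> {
        let tables = self.tables.read().unwrap_or_else(|e| e.into_inner());
        tables.keys().cloned().collect()
    }
}
```

With this shape, `provider.add_table("blocks_base", schema)`, `provider.add_table(String::from("logs"), schema)`, and `provider.add_table(TableName(...), schema)` all compile against the same signature.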
- Use DepAliasOrSelfRef instead of String for type-driven design
- Preserve error chain with #[source] instead of stringifying errors
- Rename tests to follow function_scenario_outcome convention
- Replace .expect() with let-else + continue
- Add doc comment to NonIncrementalQuery error variant
- Use let-else for defensive manifest lookup in common.rs
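As an illustration of why preserving the error chain matters, the sketch below implements `std::error::Error::source()` by hand, which is roughly what a `#[source]` attribute on a field arranges. The error types `PlanError` and `TableValidationError` and the helper `error_chain` are hypothetical names invented for this example, not types from this PR.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical low-level error, e.g. something a query planner might return.
#[derive(Debug)]
pub struct PlanError(pub String);

impl fmt::Display for PlanError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.0)
    }
}

impl Error for PlanError {}

/// Hypothetical wrapper that keeps the cause as a value instead of a string.
#[derive(Debug)]
pub struct TableValidationError {
    pub table: String,
    pub source: PlanError,
}

impl fmt::Display for TableValidationError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // The cause is intentionally not repeated here; callers walk `source()`.
        write!(f, "failed to validate table `{}`", self.table)
    }
}

impl Error for TableValidationError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&self.source)
    }
}

/// Collects the message of an error and of every cause behind it.
pub fn error_chain(err: &dyn Error) -> Vec<String> {
    let mut messages = vec![err.to_string()];
    let mut current = err.source();
    while let Some(cause) = current {
        messages.push(cause.to_string());
        current = cause.source();
    }
    messages
}
```

Had the wrapper stored `source.to_string()` instead, `error_chain` would return a single message and the original error type would no longer be reachable by callers.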
…_table Per review feedback — allows callers to pass owned types directly.
Replace "bare SQL names" with "self.<table_name>" which accurately describes the reference convention.
- Remove unnecessary table_names_set intermediate in common.rs
- Fix terminology: "intra-dataset" → "inter-table" for consistency
- Assert specific error code (INVALID_PLAN) in self-ref test
…E (500) The unresolved self.table_a error from DataFusion is not tagged with amp::invalid_input, so it maps to SCHEMA_INFERENCE (500) not INVALID_PLAN (400).
…port Enable self-ref tables (e.g., `SELECT * FROM self.blocks_base`) to work at runtime during dump materialization. Previously only validation was supported (PR #1).
Key changes:
- Split `physical_for_dump::create()` into `resolve_external_deps()` + `build_catalog()` so callers can inject self-ref entries
- Add `partition_table_refs()` to separate `self.` refs from external deps
- Register sibling tables in both planning (SelfSchemaProvider) and execution (ResolvedTableEntry) phases
- Replace earliest_block early-exit with notification-driven polling loop so self-ref tables wait for sibling data instead of exiting immediately
- Pass sibling PhysicalTable map from orchestrator to each table task
- Un-ignore `intra_deps_test` E2E test
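The split between `self.` references and external dependencies can be sketched as below. Only the function name `partition_table_refs` comes from this commit message; the `TableRef` struct, the signature, and the return shape are assumptions for illustration.

```rust
/// A table reference reduced to `schema.table` for this sketch (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TableRef {
    pub schema: String,
    pub table: String,
}

/// Splits references into sibling table names (schema `self`) and external refs.
/// Input order is preserved within each group.
pub fn partition_table_refs(refs: Vec<TableRef>) -> (Vec<String>, Vec<TableRef>) {
    let (self_refs, external): (Vec<TableRef>, Vec<TableRef>) =
        refs.into_iter().partition(|r| r.schema == "self");
    // For sibling tables only the bare table name is needed downstream.
    let sibling_names = self_refs.into_iter().map(|r| r.table).collect();
    (sibling_names, external)
}
```

Keeping the two groups separate means the external group can go through dependency-alias resolution unchanged, while the sibling names are looked up in the map of tables belonging to the same dataset.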
Summary
Adds full inter-table dependency support for derived datasets, covering both validation (admin API) and runtime (dump engine). Tables within a derived dataset can now reference sibling tables using `self.<table_name>` syntax (e.g., `SELECT * FROM self.blocks_base`), consistent with the existing `self.` UDF convention.

Part 1: Validation (admin API)
- `SelfSchemaProvider.add_table()` for progressive schema registration during topological processing
- `topological_sort()`, returning `CYCLIC_DEPENDENCY` 400 error
- `/schema` and `/manifests` endpoints

Part 2: Runtime (dump engine)
- Split `physical_for_dump::create()` into `resolve_external_deps()` + `build_catalog()` so callers can inject self-ref entries alongside external deps
- Add `partition_table_refs()` to separate `self.` refs from external deps
- Replace `earliest_block()` early-exit with notification-driven polling loop, so self-ref tables wait for sibling data instead of exiting immediately
- Pass sibling `PhysicalTable` map from orchestrator to each table task
- Un-ignore `intra_deps_test` E2E test

Key design decisions
- `self.` convention: Aligns with UDF convention (`self.functionName()`). Parsed by DataFusion as `TableReference::Partial { schema: "self", table: "..." }`
- `for_dump.rs` has zero self-ref knowledge: it resolves external deps and builds catalogs from generic entries. Self-ref resolution lives in `table.rs` where it belongs
- `FailFastJoinSet` cancellation if a sibling fails

Files changed
- `common/src/self_schema_provider.rs`: `add_table()` for progressive schema registration
- `common/src/catalog/physical/for_dump.rs`: `create()` → `resolve_external_deps()` + `build_catalog()`, add `ResolvedTableEntry`
- `worker-datasets-derived/src/job_impl.rs`: `materialize_table()` call; 1 unit test
- `worker-datasets-derived/src/job_impl/table.rs`: `partition_table_refs()`, self-ref resolution in both phases, notification-driven polling; 5 unit tests
- `datasets-derived/src/sorting.rs`: `topological_sort()` and `CyclicDepError`
- `admin-api/src/handlers/schema.rs`: `/schema`
- `admin-api/src/handlers/common.rs`: `/manifests`
- `tests/src/tests/it_dependencies.rs`: remove `#[ignore]` from `intra_deps_test`
- `docs/feat/data-inter-table-dependencies.md`

Related
- `docs/feat/data-inter-table-dependencies.md`

Test plan
- `intra_deps_test` passes (dump + query with inter-table deps)
- `cargo test -p amp-worker-datasets-derived`