Create placeholder field in empty target deploy schemas#9431
Conversation
This comment has been minimized.
This comment has been minimized.
Integration report for "Create placeholder field in empty target deploy schemas"
|
There was a problem hiding this comment.
This PR changes _fetch_stub_schema in bigquery_etl/util/target.py so that, when both client.get_table and Schema.for_table's dry-run fail to produce a non-empty schema for an unmanaged stub-target dependency, it writes a one-field placeholder schema (_bqetl_stub_placeholder STRING NULLABLE) instead of an empty fields: []. The function now also treats a successful-but-empty-fields response from either source as a failure and falls through to the placeholder. A new TestFetchStubSchema class covers the dry-run-empty-fields path.
Cross-cutting observations:
- The intent — letting downstream
SELECT *stage views survive an inaccessible source — is well scoped and the implementation is small and localized. The control flow (early-return on success, single placeholder write at the bottom) is a clean refactor of the previous nested try/except. - The main correctness concern is that
Schema.for_tablealready swallows its own exceptions and returnscls({"fields": []})(bigquery_etl/schema/__init__.py:109), so the newexcept Exception as e: for_table_err = earm is effectively unreachable in normal use. The warning will printdry-run: Noneeven thoughfor_tableprinted the real error to stdout earlier — left inline. - Test coverage exercises one of three new fall-through paths (
get_tableraises +for_tablereturns empty fields). The empty-bq_table.schemabranch and the both-raise branch aren't covered; adding parametrized cases would lock the placeholder behavior in. - Minor: typo in the test name, and
_PLACEHOLDER_STUB_SCHEMAis a shared mutable module global — fine for today's read-only usage, easy to harden.
Reviewer checklist items applicable here (issue ref, schema impact, restricted namespace) are satisfied: the PR links #9429, doesn't add new query fields, and touches utility code rather than restricted SQL namespaces.
| for_table_err: Optional[Exception] = None | ||
| try: | ||
| Schema.for_table( | ||
| schema = Schema.for_table( | ||
| project=project, | ||
| dataset=dataset, | ||
| table=name, | ||
| id_token=id_token, | ||
| partitioned_by=resolve_partition_for(sql_dir, project, dataset, name), | ||
| ).to_yaml_file(out_path) | ||
| except Exception as for_table_err: | ||
| print( | ||
| f"Warning: Could not fetch schema for {project}.{dataset}.{name}: " | ||
| f"get_table: {get_table_err}; dry-run: {for_table_err}" | ||
| ) | ||
| if schema.schema.get("fields"): | ||
| schema.to_yaml_file(out_path) | ||
| return | ||
| except Exception as e: | ||
| for_table_err = e |
There was a problem hiding this comment.
issue: Schema.for_table already catches its own exceptions internally and returns cls({"fields": []}) on failure (see bigquery_etl/schema/__init__.py:109-111) — it prints the underlying error and swallows it. That means this except branch is rarely taken in practice, and the warning emitted below will almost always show dry-run: None, hiding the actual reason the dry-run failed. The genuinely informative signal is the empty-fields fallthrough, not the exception. Consider either (a) re-raising in Schema.for_table and letting this caller decide how to log, or (b) dropping for_table_err and tightening the warning so readers don't chase a None.
| f"get_table: {get_table_err}; dry-run: {for_table_err}. " | ||
| f"Writing placeholder schema for stage deploy." | ||
| ) | ||
| Schema(_PLACEHOLDER_STUB_SCHEMA).to_yaml_file(out_path) |
There was a problem hiding this comment.
thought: downstream consumers that aren't simple SELECT * views (e.g., a view that references a specific column by name, or a query that joins on a particular field) will still fail against this placeholder — they'll just fail later in the deploy with a more obscure error than "empty schema." The PR description acknowledges the scope is SELECT * views, so this is fine, but it might be worth a short note in the warning ("downstream views that reference specific columns will still fail") so the deploy log makes the limitation obvious. Optional.
There was a problem hiding this comment.
This is true but they fail at the same spot, just with the different error message so I don't think it makes a difference
Description
For target deploys, empty schemas (
fields: []) are being created when the dry run to get schemas fails. The empty schema causes issues since the downstream tables and views can't query it. Example failed run: https://github.com/mozilla/bigquery-etl/actions/runs/26305494345/job/77442626593?pr=9429. This is blocking #9429Creating a schema with one field allows at least
SELECT *views to work. I think this hasn't come up before because there are very few cases where it would fail and they haven't been edited so they didn't get stage deployed.buildhub2is one example that I used as a test failure: https://github.com/mozilla/bigquery-etl/actions/runs/26310264998/job/77457419956Reviewer, please follow this checklist