Feature: User-Controlled Collection IDs for DPS STAC #91

@hrodmn

Description

Feature: User-Controlled STAC Collection IDs

Background

Currently, STAC collection IDs are auto-assigned by DpsStacItemGenerator using a
deterministic formula derived from the DPS job's .met.json metadata file:

{username}__{algorithm_name}__{algorithm_version}__{tag}

This value is slugified (special characters replaced) and then unconditionally written
into the collection field of every STAC item before publishing to the ingestor queue —
regardless of what collection the user's catalog.json specifies. Users have requested
the ability to control the collection ID so that outputs from related jobs and algorithm
runs can be organized into a single, meaningfully named collection.
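For illustration, the deterministic ID construction could be sketched as follows. This is not the actual DpsStacItemGenerator code; the slugification rule (lowercase, non-alphanumeric runs collapsed to a hyphen) is an assumption:

```python
import re

def slugify(value: str) -> str:
    """Assumed rule: lowercase, collapse disallowed characters to hyphens."""
    return re.sub(r"[^a-z0-9_-]+", "-", value.lower()).strip("-")

def deterministic_collection_id(met: dict) -> str:
    """Build {username}__{algorithm_name}__{algorithm_version}__{tag}."""
    parts = (met["username"], met["algorithm_name"],
             met["algorithm_version"], met["tag"])
    return "__".join(slugify(p) for p in parts)

# deterministic_collection_id({"username": "jsmith",
#     "algorithm_name": "my flood detector",
#     "algorithm_version": "1.2.0", "tag": "test run"})
# → "jsmith__my-flood-detector__1-2-0__test-run"
```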

This ticket proposes an initial implementation using admin-mediated collection creation,
designed to extend toward self-service and algorithm-level authorization.


How the current pipeline works

DpsStacItemGenerator (link) is triggered by S3 event notifications when a DPS job writes a
catalog.json to the output bucket. For each event:

  1. The DPS output prefix is extracted from the S3 key path using a timestamp pattern
  2. A .met.json file is loaded from that prefix — this is the authoritative source of
    job context, containing at minimum: username, algorithm_name,
    algorithm_version, and tag
  3. A deterministic collection ID is constructed from those fields and slugified
  4. The catalog.json is read via pystac; every item's collection field is
    overwritten with the deterministic ID before publishing to the ingestor SNS topic

Some users are already setting the collection field in their STAC items, but the
current code silently overwrites it. This feature stops that overwrite and makes the
item-provided collection ID the primary routing mechanism.
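The proposed routing rule, stripped of the authorization step described later, amounts to a one-line preference change. A minimal sketch (function name is illustrative; slugification omitted):

```python
def resolve_collection_id(item: dict, met: dict) -> str:
    """Prefer the item's own collection field; otherwise fall back to
    the deterministic ID built from the .met.json job context."""
    fallback = "__".join(
        met[k] for k in ("username", "algorithm_name", "algorithm_version", "tag")
    )
    return item.get("collection") or fallback
```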


Proposed Approach

Phase 1: Admin-Mediated Collection Creation (this ticket)

Users request a named collection via an out-of-band process (GitHub issue or intake
form). An admin creates the collection in pgSTAC using a new CLI tool, setting the
requesting user as owner. Users then set the collection field in their algorithm's
STAC item outputs; DpsStacItemGenerator will respect it if authorization passes.

Metadata storage design

There are two distinct categories of collection metadata introduced by this feature,
and they warrant different storage strategies:

Access control (who may write, which algorithms are approved) is governance data.
It should not appear in public STAC API responses and has no place in the STAC spec.
This belongs in pgstac's private JSONB column, which is purpose-built for backend
metadata that is never surfaced to API consumers:

// pgstac.collections.private — never returned by the STAC API
{
  "owner": "jsmith",
  "contributors": ["kwilliams"],
  "approved_algorithms": [
    {"name": "my-flood-detector", "version": "1.2.0"},
    {"name": "my-flood-detector", "version": "1.3.0"}
  ]
}

Algorithm provenance (which DPS algorithms have contributed data to this
collection) is catalog metadata. It is legitimately useful to anyone browsing the
catalog and belongs in the public STAC collection document. Rather than overloading the
standard providers field — which is designed for data lineage attribution, not
pipeline tracking — this is a natural fit for a MAAP DPS STAC extension:

// pgstac.collections.content — returned by the STAC API
"stac_extensions": [
  "https://maap-project.org/stac/extensions/dps/v1.0.0/schema.json"
],
"maap_dps:contributing_algorithms": [
  {"name": "my-flood-detector", "version": "1.2.0"},
  {"name": "my-flood-detector", "version": "1.3.0"}
]

This separation keeps providers clean for its intended provenance purpose, keeps
access control private, and gives MAAP DPS pipeline metadata a coherent versioned home
that can grow (e.g., DPS environment info, platform version) without polluting the
base STAC document. The extension schema would be maintained in this repository.

Note that approved_algorithms (private, access control) and
maap_dps:contributing_algorithms (public, provenance) are related but distinct: the
former defines what is permitted, the latter records what has actually run. They are
updated at different times and by different actors.

Authorization logic in DpsStacItemGenerator

When a catalog event is received, the Lambda reads job context from .met.json as
today, then resolves the target collection ID from the STAC items themselves before
applying authorization:

met.json fields available: username, algorithm_name, algorithm_version, tag
item["collection"] field: user-specified collection ID (optional)

If items do not specify a collection field (or specify one that is absent from
pgSTAC):
→ Fall back to the existing deterministic ID formula. Emit a structured warning to
CloudWatch if a collection was specified but not found in pgstac.

If the catalog specifies a collection ID that exists in pgstac:

| Check | Source | Pass | Fail |
| --- | --- | --- | --- |
| username is owner or contributor | private column | Continue | Hard failure — do not fall back silently |
| algorithm_name + algorithm_version approved (if list non-empty) | private column | Continue | Hard failure — emit structured error |

If both checks pass: publish items with the user-specified collection ID and update
maap_dps:contributing_algorithms.

The distinction between "collection not found" (soft fallback) and "collection found
but unauthorized" (hard failure) is intentional. A missing collection plausibly means
the user hasn't requested creation yet. An authorization failure means an explicit
policy was violated and silently redirecting items elsewhere would be a governance
failure.
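The three-outcome decision could be sketched as below. The record argument stands for the parsed private column (None when the collection is absent from pgstac); the empty-list behavior assumes open-by-default, which is still an open question (see question 3):

```python
from enum import Enum

class Outcome(Enum):
    FALLBACK = "fallback"    # collection not found: use deterministic ID
    PUBLISH = "publish"      # found and authorized
    HARD_FAIL = "hard_fail"  # found but policy violated

def authorize(private, username, algo_name, algo_version):
    # Collection absent from pgstac: soft fallback, with a warning logged
    if private is None:
        return Outcome.FALLBACK
    # Ownership check against the private column
    if username != private.get("owner") and username not in private.get("contributors", []):
        return Outcome.HARD_FAIL
    # Algorithm approval check; an empty list permits any algorithm here
    # (open-by-default, pending question 3)
    approved = private.get("approved_algorithms", [])
    if approved and not any(
        a["name"] == algo_name and a["version"] == algo_version for a in approved
    ):
        return Outcome.HARD_FAIL
    return Outcome.PUBLISH
```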

Database query

Authorization is resolved by querying pgstac directly via the existing pgBouncer
connection (VPC-accessible EC2), avoiding HTTP overhead through the STAC API:

SELECT private
FROM pgstac.collections
WHERE id = $1

This is a single query per catalog event regardless of item count. pgBouncer in
transaction pooling mode handles concurrent Lambda invocations efficiently. On
successful authorization, a second update appends the contributing algorithm to the
public content if not already present.
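That follow-up update could look like the following PostgreSQL sketch (field path and parameter order are assumptions, not settled design); the containment guard keeps the append idempotent:

```sql
-- Append the contributing algorithm to the public collection document
-- only if an identical {name, version} entry is not already present
UPDATE pgstac.collections
SET content = jsonb_set(
    content,
    '{maap_dps:contributing_algorithms}',
    COALESCE(content->'maap_dps:contributing_algorithms', '[]'::jsonb)
        || jsonb_build_object('name', $2::text, 'version', $3::text)
)
WHERE id = $1
  AND NOT COALESCE(content->'maap_dps:contributing_algorithms', '[]'::jsonb)
      @> jsonb_build_array(jsonb_build_object('name', $2::text, 'version', $3::text));
```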

Admin CLI

A lightweight wrapper around pypgstac that:

  • Validates the proposed collection ID against naming rules and the reserved-name blocklist
  • Writes the private column with owner, contributors, and optional approved algorithm list
  • Initializes the maap_dps extension fields in the public collection document

This is the highest-leverage piece of initial work. Without it, the catalog will
accumulate collections with inconsistent ownership and algorithm approval records.
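The records the CLI would write might be assembled as in this sketch (field names follow the storage design above; the actual pypgstac load step and the full collection document — description, extent, license — are omitted):

```python
def build_collection_records(collection_id, owner, contributors=None, approved=None):
    """Assemble the private governance record and the public document skeleton."""
    private = {
        "owner": owner,
        "contributors": contributors or [],
        "approved_algorithms": approved or [],
    }
    content = {
        "id": collection_id,
        "type": "Collection",
        "stac_version": "1.0.0",
        "stac_extensions": [
            "https://maap-project.org/stac/extensions/dps/v1.0.0/schema.json"
        ],
        # Provenance starts empty; DpsStacItemGenerator appends on ingestion
        "maap_dps:contributing_algorithms": [],
    }
    return private, content
```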

Backfill

Existing auto-assigned collections will have private ownership records backfilled
using the username already embedded in their deterministic IDs. The algorithm_name
and algorithm_version components are also present in the deterministic ID, so
approved_algorithms and maap_dps:contributing_algorithms can be backfilled as well.
This allows the authorization check in DpsStacItemGenerator to go live without a
legacy carve-out for existing collections.
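The backfill parse is mechanical when the ID follows the four-part pattern; a sketch, with None signalling the feasibility caveat raised in question 9:

```python
def parse_deterministic_id(collection_id):
    """Recover ownership and algorithm fields from an auto-assigned ID.
    Returns None when the ID does not split into exactly four parts."""
    parts = collection_id.split("__")
    if len(parts) != 4:
        return None
    username, algorithm_name, algorithm_version, tag = parts
    return {
        "owner": username,
        "approved_algorithms": [
            {"name": algorithm_name, "version": algorithm_version}
        ],
        "tag": tag,
    }
```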


Phase 2: Self-Service Collection Creation (future)

The Phase 1 design is structured so that self-service collection creation slots in
without changing the DpsStacItemGenerator authorization logic. The Lambda already
handles all authorization outcomes. Future work will add a collection request UI or
API, synchronous ID reservation for race condition handling, and algorithm approval
management for collection owners.


Algorithm Authoring Convention

The collection ID should be treated as a runtime parameter, not a hardcoded value
inside algorithm code. DPS supports arbitrary named input parameters, and algorithms
should declare a collection_id input parameter that is passed through to the STAC
item outputs at job runtime:

# Recommended pattern in algorithm code
from typing import Optional

def run(collection_id: Optional[str] = None, **kwargs):
    items = generate_stac_items(...)  # algorithm-specific item generation
    for item in items:
        if collection_id:
            # Route outputs to the user-specified collection; omitting
            # collection_id keeps the deterministic fallback behavior
            item.collection_id = collection_id
    write_catalog(items)

When a user submits a DPS job, they can then pass their target collection ID as a job
input parameter without modifying the algorithm itself:

algorithm: my-flood-detector v1.2.0
inputs:
  collection_id: jsmith--flood-catalog-2025
  ...

This convention should be documented as a best practice in the MAAP algorithm
authoring guide. Its benefits are:

  • The same algorithm version can route to different collections (dev, staging,
    production; personal vs. shared)
  • Collection governance decisions are separated from algorithm logic — the algorithm
    doesn't need to know or care about catalog organization
  • Users who don't specify a collection_id parameter get the deterministic fallback
    behavior automatically, so the convention is opt-in and backward compatible

Algorithms that hardcode a collection ID in their output items will still work — the
authorization check applies regardless of how the collection ID got into the item —
but hardcoding is discouraged because it couples a specific catalog governance decision
to algorithm code that may be shared or reused by others.


Naming Rules

Enforced by the admin CLI now, by the self-service API later:

  • Lowercase alphanumeric characters, hyphens, and underscores only
  • 3–64 characters; no leading or trailing hyphens or underscores
  • Case-insensitive uniqueness (my-collection and My-Collection are the same)
  • Reserved names blocked: api, admin, system, search, conformance,
    queryables, and any existing system collection patterns

Collection IDs are immutable after creation. The current deterministic ID formula
uses __ (double underscore) as a delimiter — user-specified IDs should avoid this
pattern to remain visually distinguishable from auto-assigned IDs.
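These rules reduce to a small validator; a sketch the admin CLI (and later the self-service API) could share — the exact regex is an interpretation of the rules above:

```python
import re

RESERVED = {"api", "admin", "system", "search", "conformance", "queryables"}
# 3-64 chars, lowercase alphanumeric/hyphen/underscore,
# alphanumeric at both ends
ID_PATTERN = re.compile(r"^[a-z0-9][a-z0-9_-]{1,62}[a-z0-9]$")

def validate_collection_id(collection_id):
    """Return a list of naming-rule violations; empty means valid."""
    errors = []
    if not ID_PATTERN.fullmatch(collection_id):
        errors.append("must be 3-64 lowercase alphanumeric, hyphen, or "
                      "underscore characters with alphanumeric endpoints")
    if collection_id.lower() in RESERVED:
        errors.append(f"'{collection_id}' is a reserved name")
    if "__" in collection_id:
        # Double underscore is the auto-assigned ID delimiter
        errors.append("double underscore is reserved for auto-assigned IDs")
    return errors
```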


Error Surfacing

DpsStacItemGenerator currently has no feedback channel back to the user after a DPS
job completes. Collection governance introduces new async failure modes — collection not
found, user not authorized, algorithm version not approved — that users need visibility
into. At minimum, the Lambda will emit structured CloudWatch log events for every
governance decision. A user-facing feedback mechanism (DPS job callback or ingestion
status dashboard) is a dependency that should be resolved before this feature ships.


Key Questions to Resolve Before Proceeding

1. What does tag represent in the deterministic ID formula, and how should it be
handled in the fallback path?

The current format is {username}__{algorithm_name}__{algorithm_version}__{tag}. Users appear to apply the tag field inconsistently: some use it to group related jobs, while others use it as a unique per-job identifier. The existing scheme works well when the tag groups jobs, but poorly when it is unique per job, since each job then lands in its own collection.

2. Multi-collection catalogs: authorize per item or require uniformity?
Because the collection ID is read from each STAC item, a single catalog could
theoretically contain items targeting different collections. Define whether this is
supported (each unique collection ID is authorized separately within one job) or
rejected (all items in a catalog must target the same collection ID). Requiring
uniformity is simpler to reason about and likely sufficient for current use cases.

3. What is the policy for an empty or absent approved_algorithms list in private?
Two reasonable interpretations: (a) no list means any algorithm is permitted for
authorized users (open by default, better for research flexibility), or (b) no list
means no algorithm is approved until explicitly configured (closed by default, better
for production data quality control). This should be decided as a platform-wide default
and documented clearly.

4. Can algorithm approval be version-wildcarded?
Should the approved list support {"name": "my-detector", "version": "*"} to approve
all versions of an algorithm, or must each version be explicitly listed? Explicit
versioning is stricter and better for production collections; wildcards are more
convenient for active development workflows. These could coexist as separate
authorization tiers.
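If wildcards were adopted, the approval match could be as simple as this sketch (a hypothetical helper, not settled design):

```python
def algorithm_approved(approved, name, version):
    """Match an algorithm against the approved list; "*" approves
    every version of the named algorithm."""
    return any(
        entry["name"] == name and entry["version"] in ("*", version)
        for entry in approved
    )
```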

5. Who can manage the approved_algorithms list — owner only, or contributors too?
Recommend owner only, to prevent contributors from approving their own algorithm
versions without the collection owner's sign-off. This defines the permission boundary
for the algorithm approval management path built in Phase 2.

6. Should the "unauthorized" outcome fall back to the deterministic ID or hard-fail?
Confirmed preference in prior discussion was hard failure, but this should be validated
with stakeholders since it is a behavioral change for existing users who might
inadvertently specify a collection ID they don't own.

7. Namespace convention: prefixed or flat?
Should user-specified IDs use a prefix convention (e.g., jsmith--flood-catalog) to
prevent naming conflicts, or rely on a flat namespace with uniqueness enforcement? A
prefix is easy to enforce in the admin CLI and eliminates conflicts entirely.

8. MAAP DPS extension scope: what else belongs in it?
Algorithm name and version are the obvious starting point for maap_dps:contributing_algorithms.
Should the extension also capture other DPS job context — platform version, compute
environment, job ID — at the collection level? Defining the extension scope now avoids
schema churn later. A companion item-level extension (recording per-item job provenance)
may also be worth scoping alongside the collection-level one.

9. Backfill scope and feasibility.
How many existing collections need private ownership records and maap_dps extension
fields added? Are algorithm_name and algorithm_version reliably recoverable from all
existing deterministic IDs via the __ delimiter pattern?


Work Breakdown (Phase 1)

  • Clarify tag semantics and its role (if any) in the deterministic fallback ID
  • Decide whether multi-collection catalogs (items targeting different collection IDs
    in a single job) are supported or rejected
  • Resolve open questions above with stakeholders
  • Define naming rules and reserved-name blocklist
  • Define MAAP DPS STAC extension schema (maap_dps:contributing_algorithms and
    scope of additional fields); publish schema document in this repository
  • Build admin CLI (pypgstac wrapper with naming validation, private column
    initialization, and maap_dps extension field initialization)
  • Backfill private ownership records and maap_dps extension fields on existing
    collections
  • Update DpsStacItemGenerator to:
    • Query pgBouncer private column for ownership and approved algorithm list
    • Implement three-outcome user authorization logic
    • Implement algorithm+version approval check
    • Fall back to deterministic ID when collection not found; hard-fail on auth violations
    • Update maap_dps:contributing_algorithms in content on successful ingestion
  • Emit structured CloudWatch log events for all governance decisions
  • Decide and implement error surfacing channel (DPS callback or ingestion dashboard)
  • Document the collection request process and algorithm approval process for users
