Feature: User-Controlled Collection IDs for DPS STAC #91

@hrodmn

Description

Feature: User-Controlled STAC Collection IDs

Background

Currently, STAC collection IDs are auto-assigned by DpsStacItemGenerator using a
deterministic formula derived from the DPS job's .met.json metadata file:

{username}__{algorithm_name}__{algorithm_version}__{tag}

This value is slugified (special characters replaced) and then unconditionally written
into the collection field of every STAC item before publishing to the ingestor queue —
regardless of what collection the user's catalog.json specifies. Users have requested
the ability to control the collection ID so that outputs from related jobs and algorithm
runs can be organized into a single, meaningfully named collection.
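For illustration, the deterministic ID construction could be sketched as follows. This is not the actual DpsStacItemGenerator code; the slugification rule (lowercase, non-alphanumeric runs collapsed to a hyphen) is an assumption:

```python
import re

def slugify(value: str) -> str:
    """Assumed rule: lowercase, collapse disallowed characters to hyphens."""
    return re.sub(r"[^a-z0-9_-]+", "-", value.lower()).strip("-")

def deterministic_collection_id(met: dict) -> str:
    """Build {username}__{algorithm_name}__{algorithm_version}__{tag}."""
    parts = (met["username"], met["algorithm_name"],
             met["algorithm_version"], met["tag"])
    return "__".join(slugify(p) for p in parts)

# deterministic_collection_id({"username": "jsmith",
#     "algorithm_name": "my flood detector",
#     "algorithm_version": "1.2.0", "tag": "test run"})
# → "jsmith__my-flood-detector__1-2-0__test-run"
```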

This ticket proposes an initial implementation using admin-mediated collection creation,
designed to extend toward self-service and algorithm-level authorization.


How the current pipeline works

DpsStacItemGenerator (link) is triggered by S3 event notifications when a DPS job writes a
catalog.json to the output bucket. For each event:

  1. The DPS output prefix is extracted from the S3 key path using a timestamp pattern
  2. A .met.json file is loaded from that prefix — this is the authoritative source of
    job context, containing at minimum: username, algorithm_name,
    algorithm_version, and tag
  3. A deterministic collection ID is constructed from those fields and slugified
  4. The catalog.json is read via pystac; every item's collection field is
    overwritten with the deterministic ID before publishing to the ingestor SNS topic

Some users are already setting the collection field in their STAC items, but the
current code silently overwrites it. This feature stops that overwrite and makes the
item-provided collection ID the primary routing mechanism.
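The proposed routing rule, stripped of the authorization step described later, amounts to a one-line preference change. A minimal sketch (function name is illustrative; slugification omitted):

```python
def resolve_collection_id(item: dict, met: dict) -> str:
    """Prefer the item's own collection field; otherwise fall back to
    the deterministic ID built from the .met.json job context."""
    fallback = "__".join(
        met[k] for k in ("username", "algorithm_name", "algorithm_version", "tag")
    )
    return item.get("collection") or fallback
```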


Proposed Approach

Phase 1: Admin-Mediated Collection Creation (this ticket)

Users request a named collection via an out-of-band process (GitHub issue or intake
form). An admin creates the collection in pgSTAC using a new CLI tool, setting the
requesting user as owner. Users then set the collection field in their algorithm's
STAC item outputs; DpsStacItemGenerator will respect it if authorization passes.

Metadata storage design

There are two distinct categories of collection metadata introduced by this feature,
and they warrant different storage strategies:

Access control (who may write, which algorithms are approved) is governance data.
It should not appear in public STAC API responses and has no place in the STAC spec.
This belongs in pgstac's private JSONB column, which is purpose-built for backend
metadata that is never surfaced to API consumers:

// pgstac.collections.private — never returned by the STAC API
{
  "owner": "jsmith",
  "contributors": ["kwilliams"],
  "approved_algorithms": [
    {"name": "my-flood-detector", "version": "1.2.0"},
    {"name": "my-flood-detector", "version": "1.3.0"}
  ]
}

Algorithm provenance (which DPS algorithms have contributed data to this
collection) is catalog metadata. It is legitimately useful to anyone browsing the
catalog and belongs in the public STAC collection document. Rather than overloading the
standard providers field — which is designed for data lineage attribution, not
pipeline tracking — this is a natural fit for a MAAP DPS STAC extension:

// pgstac.collections.content — returned by the STAC API
"stac_extensions": [
  "https://maap-project.org/stac/extensions/dps/v1.0.0/schema.json"
],
"maap_dps:contributing_algorithms": [
  {"name": "my-flood-detector", "version": "1.2.0"},
  {"name": "my-flood-detector", "version": "1.3.0"}
]

This separation keeps providers clean for its intended provenance purpose, keeps
access control private, and gives MAAP DPS pipeline metadata a coherent versioned home
that can grow (e.g., DPS environment info, platform version) without polluting the
base STAC document. The extension schema would be maintained in this repository.

Note that approved_algorithms (private, access control) and
maap_dps:contributing_algorithms (public, provenance) are related but distinct: the
former defines what is permitted, the latter records what has actually run. They are
updated at different times and by different actors.

Authorization logic in DpsStacItemGenerator

When a catalog event is received, the Lambda reads job context from .met.json as
today, then resolves the target collection ID from the STAC items themselves before
applying authorization:

met.json fields available: username, algorithm_name, algorithm_version, tag
item["collection"] field: user-specified collection ID (optional)

If items do not specify a collection field (or specify one that is absent from
pgSTAC):
→ Fall back to the existing deterministic ID formula. Emit a structured warning to
CloudWatch if a collection was specified but not found in pgstac.

If the catalog specifies a collection ID that exists in pgstac:

| Check | Source | Pass | Fail |
| --- | --- | --- | --- |
| username is owner or contributor | private column | Continue | Hard failure — do not fall back silently |
| algorithm_name + algorithm_version approved (if list non-empty) | private column | Continue | Hard failure — emit structured error |

If both checks pass: publish items with the user-specified collection ID and update
maap_dps:contributing_algorithms.

The distinction between "collection not found" (soft fallback) and "collection found
but unauthorized" (hard failure) is intentional. A missing collection plausibly means
the user hasn't requested creation yet. An authorization failure means an explicit
policy was violated and silently redirecting items elsewhere would be a governance
failure.
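The three-outcome decision could be sketched as below. The record argument stands for the parsed private column (None when the collection is absent from pgstac); the empty-list behavior assumes open-by-default, which is still an open question (see question 3):

```python
from enum import Enum

class Outcome(Enum):
    FALLBACK = "fallback"    # collection not found: use deterministic ID
    PUBLISH = "publish"      # found and authorized
    HARD_FAIL = "hard_fail"  # found but policy violated

def authorize(private, username, algo_name, algo_version):
    # Collection absent from pgstac: soft fallback, with a warning logged
    if private is None:
        return Outcome.FALLBACK
    # Ownership check against the private column
    if username != private.get("owner") and username not in private.get("contributors", []):
        return Outcome.HARD_FAIL
    # Algorithm approval check; an empty list permits any algorithm here
    # (open-by-default, pending question 3)
    approved = private.get("approved_algorithms", [])
    if approved and not any(
        a["name"] == algo_name and a["version"] == algo_version for a in approved
    ):
        return Outcome.HARD_FAIL
    return Outcome.PUBLISH
```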

Database query

Authorization is resolved by querying pgstac directly via the existing pgBouncer
connection (VPC-accessible EC2), avoiding HTTP overhead through the STAC API:

SELECT private
FROM pgstac.collections
WHERE id = $1

This is a single query per catalog event regardless of item count. pgBouncer in
transaction pooling mode handles concurrent Lambda invocations efficiently. On
successful authorization, a second update appends the contributing algorithm to the
public content if not already present.
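That follow-up update could look like the following PostgreSQL sketch (field path and parameter order are assumptions, not settled design); the containment guard keeps the append idempotent:

```sql
-- Append the contributing algorithm to the public collection document
-- only if an identical {name, version} entry is not already present
UPDATE pgstac.collections
SET content = jsonb_set(
    content,
    '{maap_dps:contributing_algorithms}',
    COALESCE(content->'maap_dps:contributing_algorithms', '[]'::jsonb)
        || jsonb_build_object('name', $2::text, 'version', $3::text)
)
WHERE id = $1
  AND NOT COALESCE(content->'maap_dps:contributing_algorithms', '[]'::jsonb)
      @> jsonb_build_array(jsonb_build_object('name', $2::text, 'version', $3::text));
```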

Admin CLI

A lightweight wrapper around pypgstac that:

  • Validates the proposed collection ID against naming rules and the reserved-name blocklist
  • Writes the private column with owner, contributors, and optional approved algorithm list
  • Initializes the maap_dps extension fields in the public collection document

This is the highest-leverage piece of initial work. Without it, the catalog will
accumulate collections with inconsistent ownership and algorithm approval records.
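The records the CLI would write might be assembled as in this sketch (field names follow the storage design above; the actual pypgstac load step and the full collection document — description, extent, license — are omitted):

```python
def build_collection_records(collection_id, owner, contributors=None, approved=None):
    """Assemble the private governance record and the public document skeleton."""
    private = {
        "owner": owner,
        "contributors": contributors or [],
        "approved_algorithms": approved or [],
    }
    content = {
        "id": collection_id,
        "type": "Collection",
        "stac_version": "1.0.0",
        "stac_extensions": [
            "https://maap-project.org/stac/extensions/dps/v1.0.0/schema.json"
        ],
        # Provenance starts empty; DpsStacItemGenerator appends on ingestion
        "maap_dps:contributing_algorithms": [],
    }
    return private, content
```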

Backfill

Existing auto-assigned collections will have private ownership records backfilled
using the username already embedded in their deterministic IDs. The algorithm_name
and algorithm_version components are also present in the deterministic ID, so
approved_algorithms and maap_dps:contributing_algorithms can be backfilled as well.
This allows the authorization check in DpsStacItemGenerator to go live without a
legacy carve-out for existing collections.
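The backfill parse is mechanical when the ID follows the four-part pattern; a sketch, with None signalling the feasibility caveat raised in question 9:

```python
def parse_deterministic_id(collection_id):
    """Recover ownership and algorithm fields from an auto-assigned ID.
    Returns None when the ID does not split into exactly four parts."""
    parts = collection_id.split("__")
    if len(parts) != 4:
        return None
    username, algorithm_name, algorithm_version, tag = parts
    return {
        "owner": username,
        "approved_algorithms": [
            {"name": algorithm_name, "version": algorithm_version}
        ],
        "tag": tag,
    }
```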


Phase 2: Self-Service Collection Creation (future)

The Phase 1 design is structured so that self-service collection creation slots in
without changing the DpsStacItemGenerator authorization logic. The Lambda already
handles all authorization outcomes. Future work will add a collection request UI or
API, synchronous ID reservation for race condition handling, and algorithm approval
management for collection owners.


Algorithm Authoring Convention

The collection ID should be treated as a runtime parameter, not a hardcoded value
inside algorithm code. DPS supports arbitrary named input parameters, and algorithms
should declare a collection_id input parameter that is passed through to the STAC
item outputs at job runtime:

# Recommended pattern in algorithm code
from typing import Optional

def run(collection_id: Optional[str] = None, **kwargs):
    items = generate_stac_items(...)  # algorithm-specific item generation
    for item in items:
        if collection_id:
            # Route outputs to the user-specified collection; omitting
            # collection_id keeps the deterministic fallback behavior
            item.collection_id = collection_id
    write_catalog(items)

When a user submits a DPS job, they can then pass their target collection ID as a job
input parameter without modifying the algorithm itself:

algorithm: my-flood-detector v1.2.0
inputs:
  collection_id: jsmith--flood-catalog-2025
  ...

This convention should be documented as a best practice in the MAAP algorithm
authoring guide. Its benefits are:

  • The same algorithm version can route to different collections (dev, staging,
    production; personal vs. shared)
  • Collection governance decisions are separated from algorithm logic — the algorithm
    doesn't need to know or care about catalog organization
  • Users who don't specify a collection_id parameter get the deterministic fallback
    behavior automatically, so the convention is opt-in and backward compatible

Algorithms that hardcode a collection ID in their output items will still work — the
authorization check applies regardless of how the collection ID got into the item —
but hardcoding is discouraged because it couples a specific catalog governance decision
to algorithm code that may be shared or reused by others.


Naming Rules

Enforced by the admin CLI now, by the self-service API later:

  • Lowercase alphanumeric characters, hyphens, and underscores only
  • 3–64 characters; no leading or trailing hyphens or underscores
  • Case-insensitive uniqueness (my-collection and My-Collection are the same)
  • Reserved names blocked: api, admin, system, search, conformance,
    queryables, and any existing system collection patterns

Collection IDs are immutable after creation. The current deterministic ID formula
uses __ (double underscore) as a delimiter — user-specified IDs should avoid this
pattern to remain visually distinguishable from auto-assigned IDs.
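These rules reduce to a small validator; a sketch the admin CLI (and later the self-service API) could share — the exact regex is an interpretation of the rules above:

```python
import re

RESERVED = {"api", "admin", "system", "search", "conformance", "queryables"}
# 3-64 chars, lowercase alphanumeric/hyphen/underscore,
# alphanumeric at both ends
ID_PATTERN = re.compile(r"^[a-z0-9][a-z0-9_-]{1,62}[a-z0-9]$")

def validate_collection_id(collection_id):
    """Return a list of naming-rule violations; empty means valid."""
    errors = []
    if not ID_PATTERN.fullmatch(collection_id):
        errors.append("must be 3-64 lowercase alphanumeric, hyphen, or "
                      "underscore characters with alphanumeric endpoints")
    if collection_id.lower() in RESERVED:
        errors.append(f"'{collection_id}' is a reserved name")
    if "__" in collection_id:
        # Double underscore is the auto-assigned ID delimiter
        errors.append("double underscore is reserved for auto-assigned IDs")
    return errors
```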


Error Surfacing

DpsStacItemGenerator currently has no feedback channel back to the user after a DPS
job completes. Collection governance introduces new async failure modes — collection not
found, user not authorized, algorithm version not approved — that users need visibility
into. At minimum, the Lambda will emit structured CloudWatch log events for every
governance decision. A user-facing feedback mechanism (DPS job callback or ingestion
status dashboard) is a dependency that should be resolved before this feature ships.


Key Questions to Resolve Before Proceeding

1. What does tag represent in the deterministic ID formula, and how should it be
handled in the fallback path?

The current format is {username}__{algorithm_name}__{algorithm_version}__{tag}. Users appear to apply the tag field inconsistently: some use it to group related jobs, while others use it as a unique per-job identifier. The existing scheme works well when the tag groups jobs, but poorly when it is unique per job, since each job then lands in its own collection.

2. Multi-collection catalogs: authorize per item or require uniformity?
Because the collection ID is read from each STAC item, a single catalog could
theoretically contain items targeting different collections. Define whether this is
supported (each unique collection ID is authorized separately within one job) or
rejected (all items in a catalog must target the same collection ID). Requiring
uniformity is simpler to reason about and likely sufficient for current use cases.

3. What is the policy for an empty or absent approved_algorithms list in private?
Two reasonable interpretations: (a) no list means any algorithm is permitted for
authorized users (open by default, better for research flexibility), or (b) no list
means no algorithm is approved until explicitly configured (closed by default, better
for production data quality control). This should be decided as a platform-wide default
and documented clearly.

4. Can algorithm approval be version-wildcarded?
Should the approved list support {"name": "my-detector", "version": "*"} to approve
all versions of an algorithm, or must each version be explicitly listed? Explicit
versioning is stricter and better for production collections; wildcards are more
convenient for active development workflows. These could coexist as separate
authorization tiers.
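If wildcards were adopted, the approval match could be as simple as this sketch (a hypothetical helper, not settled design):

```python
def algorithm_approved(approved, name, version):
    """Match an algorithm against the approved list; "*" approves
    every version of the named algorithm."""
    return any(
        entry["name"] == name and entry["version"] in ("*", version)
        for entry in approved
    )
```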

5. Who can manage the approved_algorithms list — owner only, or contributors too?
Recommend owner only, to prevent contributors from approving their own algorithm
versions without the collection owner's sign-off. This defines the permission boundary
for the algorithm approval management path built in Phase 2.

6. Should the "unauthorized" outcome fall back to the deterministic ID or hard-fail?
Confirmed preference in prior discussion was hard failure, but this should be validated
with stakeholders since it is a behavioral change for existing users who might
inadvertently specify a collection ID they don't own.

7. Namespace convention: prefixed or flat?
Should user-specified IDs use a prefix convention (e.g., jsmith--flood-catalog) to
prevent naming conflicts, or rely on a flat namespace with uniqueness enforcement? A
prefix is easy to enforce in the admin CLI and eliminates conflicts entirely.

8. MAAP DPS extension scope: what else belongs in it?
Algorithm name and version are the obvious starting point for maap_dps:contributing_algorithms.
Should the extension also capture other DPS job context — platform version, compute
environment, job ID — at the collection level? Defining the extension scope now avoids
schema churn later. A companion item-level extension (recording per-item job provenance)
may also be worth scoping alongside the collection-level one.

9. Backfill scope and feasibility.
How many existing collections need private ownership records and maap_dps extension
fields added? Are algorithm_name and algorithm_version reliably recoverable from all
existing deterministic IDs via the __ delimiter pattern?


Work Breakdown (Phase 1)

  • Clarify tag semantics and its role (if any) in the deterministic fallback ID
  • Decide whether multi-collection catalogs (items targeting different collection IDs
    in a single job) are supported or rejected
  • Resolve open questions above with stakeholders
  • Define naming rules and reserved-name blocklist
  • Define MAAP DPS STAC extension schema (maap_dps:contributing_algorithms and
    scope of additional fields); publish schema document in this repository
  • Build admin CLI (pypgstac wrapper with naming validation, private column
    initialization, and maap_dps extension field initialization)
  • Backfill private ownership records and maap_dps extension fields on existing
    collections
  • Update DpsStacItemGenerator to:
    • Query pgBouncer private column for ownership and approved algorithm list
    • Implement three-outcome user authorization logic
    • Implement algorithm+version approval check
    • Fall back to deterministic ID when collection not found; hard-fail on auth violations
    • Update maap_dps:contributing_algorithms in content on successful ingestion
  • Emit structured CloudWatch log events for all governance decisions
  • Decide and implement error surfacing channel (DPS callback or ingestion dashboard)
  • Document the collection request process and algorithm approval process for users
