feat(metadata-db,controller): add idempotency key for job deduplication #1910
shiyasmohd wants to merge 1 commit into main from
Conversation
LNSD left a comment:
Please, check my comments 🙂
```rust
fn idempotency_key(job_kind: JobKind<'_>, reference: &HashReference) -> IdempotencyKey<'static> {
    let input = format!("{}:{}", job_kind.as_str(), reference);
    let hash = datasets_common::hash::hash(input);
    // SAFETY: The hash is a validated 64-char hex string produced by our hash function.
    IdempotencyKey::from_owned_unchecked(hash.into_inner())
}
```
Right now the idempotency_key() function lives in the controller scheduler, which means the controller owns the knowledge of how to compute the key for each job kind. I think this responsibility belongs closer to the worker's job crates themselves. Each crate already owns its job_kind.rs with JOB_KIND and MaterializeRawJobKind and other similar types.
Could we add a job_key.rs module (or similar) to both worker-datasets-raw and worker-datasets-derived that each exposes a function to compute their own idempotency key? Something like:
```rust
// In worker-datasets-raw/src/job_key.rs
pub fn idempotency_key(reference: &HashReference) -> IdempotencyKey<'static> { ... }

// In worker-datasets-derived/src/job_key.rs
pub fn idempotency_key(reference: &HashReference) -> IdempotencyKey<'static> { ... }
```
Each would use its own JOB_KIND constant internally, so the key format ({job_kind}:{reference}) stays the same, but the computation is co-located with the job definition rather than centralized in the controller.
This way:
- Each worker crate owns its full job identity (kind + key computation)
- The controller just calls `worker_datasets_raw::job_key::idempotency_key(...)` without needing to know the hashing format
- Adding a new worker type in the future doesn't require touching the controller's scheduler
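To make the suggestion concrete, here is a minimal, self-contained sketch of what a worker-owned `job_key.rs` could look like. All of the types below (`HashReference`, `IdempotencyKey`) and the hashing step (`DefaultHasher` standing in for `datasets_common::hash::hash`) are simplified stand-ins, not the real crate definitions:

```rust
// Hypothetical sketch: stand-in types, not the real crate definitions.
use std::collections::hash_map::DefaultHasher;
use std::fmt;
use std::hash::{Hash, Hasher};

pub struct HashReference {
    pub namespace: String,
    pub name: String,
    pub manifest_hash: String,
}

impl fmt::Display for HashReference {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}/{}@{}", self.namespace, self.name, self.manifest_hash)
    }
}

#[derive(Debug, PartialEq)]
pub struct IdempotencyKey(pub String);

// Each worker crate would use its own JOB_KIND constant here.
pub const JOB_KIND: &str = "materialize-raw";

/// Key format stays `{job_kind}:{reference}`, but the computation is
/// co-located with the job definition instead of the controller.
pub fn idempotency_key(reference: &HashReference) -> IdempotencyKey {
    let input = format!("{}:{}", JOB_KIND, reference);
    // Stand-in hash; the real code uses datasets_common::hash::hash.
    let mut hasher = DefaultHasher::new();
    input.hash(&mut hasher);
    IdempotencyKey(format!("{:016x}", hasher.finish()))
}

fn main() {
    let r = HashReference {
        namespace: "_".into(),
        name: "anvil_rpc".into(),
        manifest_hash: "abc123".into(),
    };
    // Same reference always yields the same key.
    assert_eq!(idempotency_key(&r), idempotency_key(&r));
    println!("{:?}", idempotency_key(&r));
}
```

The controller would then only depend on the function, not on the key format.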
```rust
async fn schedule_job(
    &self,
    job_kind: JobKind<'_>,
    dataset_reference: HashReference,
    job_descriptor: JobDescriptor,
    worker_id: Option<NodeSelector>,
```
I have a question about one scenario.
The key is computed from {job_kind}:{namespace}/{name}@{manifest_hash}, which doesn't include job options like end_block, parallelism, or worker_id. This means if a user deploys foo/bar@v1 with end_block: 1000, and then deploys the same dataset with end_block: 2000 while the first job is still active, the second request silently returns the existing job ID. The caller gets back a 200 with a job that's running with end_block: 1000. No error, no indication that their end_block: 2000 was ignored. This goes against the principle of least surprise.
This also means there's no way to update the end_block of a running job through a single deploy call. The caller would need to manually stop the job first, then redeploy, but nothing in the API response indicates that.
I would suggest that, when the key matches an active job but the descriptor differs, we return a "conflict" error explaining that an active job with different options exists and must be stopped first. At least then the caller knows what happened.
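A minimal sketch of the suggested behaviour, using illustrative stand-in types (`JobDescriptor`, `ScheduleOutcome`, `ScheduleError`, and an in-memory store in place of the real metadata DB), none of which match the actual codebase:

```rust
// Hypothetical sketch: detect a key match with a differing descriptor
// and surface it as a conflict instead of a silent idempotent hit.
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
struct JobDescriptor {
    end_block: u64,
    parallelism: u32,
}

#[derive(Debug, PartialEq)]
enum ScheduleOutcome {
    Created(u64),  // new job
    Existing(u64), // idempotent hit: same key, same descriptor
}

#[derive(Debug, PartialEq)]
enum ScheduleError {
    // Same key but different options: caller must stop the job first.
    Conflict { active_job_id: u64 },
}

struct Scheduler {
    next_id: u64,
    by_key: HashMap<String, (u64, JobDescriptor)>,
}

impl Scheduler {
    fn schedule(&mut self, key: &str, desc: JobDescriptor) -> Result<ScheduleOutcome, ScheduleError> {
        if let Some((id, existing)) = self.by_key.get(key) {
            return if *existing == desc {
                Ok(ScheduleOutcome::Existing(*id))
            } else {
                // Surface the mismatch instead of silently returning
                // the active job with stale options.
                Err(ScheduleError::Conflict { active_job_id: *id })
            };
        }
        let id = self.next_id;
        self.next_id += 1;
        self.by_key.insert(key.to_string(), (id, desc));
        Ok(ScheduleOutcome::Created(id))
    }
}

fn main() {
    let mut s = Scheduler { next_id: 1, by_key: HashMap::new() };
    let d1 = JobDescriptor { end_block: 1000, parallelism: 4 };
    assert_eq!(s.schedule("k", d1.clone()), Ok(ScheduleOutcome::Created(1)));
    // Identical redeploy: idempotent, returns the existing job.
    assert_eq!(s.schedule("k", d1), Ok(ScheduleOutcome::Existing(1)));
    // Re-deploy with a different end_block: conflict, not a silent 200.
    let d2 = JobDescriptor { end_block: 2000, parallelism: 4 };
    assert_eq!(s.schedule("k", d2), Err(ScheduleError::Conflict { active_job_id: 1 }));
}
```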
I've just realised that the job descriptor is meant to be immutable in the current design. That's problematic for the "different job options" scenario. We should find a better way to handle this.
Can you research how we can preserve the immutability property of the job ledger while allowing job rescheduling with different options, without creating multiple job instances?
Maybe move the job descriptor to the "scheduled" event and propagate that info. Idk.
Having multiple jobs for the same logical dataset is in many situations a feature, not a bug, so be careful how deeply you constrain this. Graph Node originally bolted de-duplication of subgraphs into DB primary key constraints, and undoing it was quite a bit of work.
The current design uses an idempotency key. Can you describe the use cases where multiple dataset materialization jobs are desirable?
It allows resyncs without user intervention and without downtime. Situations where a resync is desirable:
To cover this scenario, the idempotency key would be composed of: job kind + dataset reference + physical table revision ID (a.k.a. location ID). This also extends the current set of use cases of the deploy endpoint: materialize a dataset in a custom location.
This is a more "open-ended" suggestion, but I think we can accommodate it with the flexible idempotency key and job descriptor design.
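A tiny sketch of the extended key shape. The function name, string-typed parameters, and plain concatenation (in place of the real hashing step) are all illustrative assumptions:

```rust
// Hypothetical sketch: job kind + dataset reference + physical table
// revision (location) ID. Concatenation stands in for the real hash.
fn idempotency_key(job_kind: &str, reference: &str, location_id: &str) -> String {
    format!("{}:{}:{}", job_kind, reference, location_id)
}

fn main() {
    // Same dataset, different physical location: distinct keys, so a
    // resync can run alongside the job serving the current location.
    let active = idempotency_key("materialize-raw", "_/anvil_rpc@abc123", "loc-1");
    let resync = idempotency_key("materialize-raw", "_/anvil_rpc@abc123", "loc-2");
    assert_ne!(active, resync);
    println!("{active} vs {resync}");
}
```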
Yes, adding the physical ID to the idempotency key completely addresses my concern.
Summary
Introduces idempotency keys for dataset deployment jobs so that repeated deploy requests for the same dataset version return the same job ID instead of creating duplicates. As a result, there is one job ID per dataset.
Changes
Database (metadata-db)
- Adds an `idempotency_key` column to `jobs` with a unique constraint; existing rows get `legacy:{job_id}` and the column enforces `NOT NULL`
- Adds an `IdempotencyKey` newtype for type-safe handling of keys
- Adds `get_by_idempotency_key` for lookups
- Uses `ON CONFLICT (idempotency_key) DO UPDATE` so duplicate keys update the descriptor and return the existing job ID

Controller
- Computes the key as `hash(job_kind:namespace/name@manifest_hash)` (e.g. `materialize-raw:_/anvil_rpc@abc123...`)
- `end_block`, `parallelism`, and `worker_id` are not part of the key, so re-deploying with different parameters while a job is active returns the same job ID

Tests
- `it_idempotency.rs` for metadata-db idempotency behavior
- `it_admin_api_datasets_deploy.rs` verifies that redeploy with a different `end_block` returns the same job ID when the job is still active