Dispatch CancelActivityNexusTask to worker#9233
Dispatch CancelActivityNexusTask to worker#9233rkannan82 wants to merge 1 commit intokannan/activity-cancel/task-definitionfrom
Conversation
a91654f to
6a655f7
Compare
d402868 to
54b6d1a
Compare
13d8513 to
eb79c81
Compare
d103589 to
417606c
Compare
eb79c81 to
d1f2abf
Compare
417606c to
6246c4e
Compare
d1f2abf to
7489808
Compare
6246c4e to
fec3c41
Compare
37a51d2 to
11b1049
Compare
fec3c41 to
1dd975d
Compare
45ac313 to
e96dbe8
Compare
11c74b6 to
2cb3108
Compare
72fe398 to
37c2a1c
Compare
ac7c57e to
89d2cf5
Compare
5a9c85e to
d1572d7
Compare
bergundy
left a comment
There was a problem hiding this comment.
Can you confirm whether you're covering cancel requests and pause requests?
Do we also care about canceling activities that we know timed out or are we letting the worker take care of that?
When will you be adding support for standalone activities too?
service/history/cancel_activity.go
Outdated
| return nil | ||
| } | ||
|
|
||
| return t.dispatchCancelTaskToWorker(ctx, task.NamespaceID, task.WorkerControlTaskQueue, taskTokens) |
There was a problem hiding this comment.
You definitely want to release the lock before making this RPC.
There was a problem hiding this comment.
Why store the worker control task queue on the task? That seems redundant if you're already reading from mutable state, which IMHO should serve as the source of truth.
service/history/cancel_activity.go
Outdated
| Operation: nexus.CancelActivitiesOperation, | ||
| Payload: &commonpb.Payload{ | ||
| Metadata: map[string][]byte{ | ||
| "encoding": []byte("proto"), |
There was a problem hiding this comment.
Use the SDK's data converter here instead of inventing your own proto encoding scheme please.
common/nexus/constants.go
Outdated
| const ( | ||
| // WorkerControlService is the Nexus service name for worker control operations. | ||
| WorkerControlService = "temporal.api.worker.v1.WorkerService" | ||
| // CancelActivitiesOperation is the operation to cancel running activities. | ||
| CancelActivitiesOperation = "cancel-activities" | ||
| ) |
There was a problem hiding this comment.
These aren't nexus properties IMHO. It's part of the worker service, not generic for Nexus.
service/history/cancel_activity.go
Outdated
|
|
||
| // buildActivityTaskTokens builds task tokens for activities that need cancellation. | ||
| // Returns the serialized task tokens and any error. | ||
| func buildActivityTaskTokens( |
There was a problem hiding this comment.
Share this function everywhere where you construct task tokens instead of reimplementing this.
service/history/cancel_activity.go
Outdated
| return err | ||
| } | ||
|
|
||
| if resp.GetRequestTimeout() != nil { |
There was a problem hiding this comment.
What about async outcomes and operation failures?
service/history/cancel_activity.go
Outdated
| "google.golang.org/protobuf/proto" | ||
| ) | ||
|
|
||
| func (t *transferQueueActiveTaskExecutor) processCancelActivityTask( |
There was a problem hiding this comment.
I would put this on the outbound queue since it sync matches into workers and would have different latency characteristics than typical internal tasks.
There was a problem hiding this comment.
That will an exception re. how we divided queues IMO.
Not sure I follow the concern re. fail to sync match, if it fails the task can retry using a user error, which can be handled in the same way as namespace rate limit error today. I think this task is best effort and will be discarded after 5mins.
Maybe I missed something on the nexus sync match dispatch flow?
There was a problem hiding this comment.
IMHO this is still outbound because it blocks on a response from a worker. We would need to allow multiple seconds of timeout for this round trip.
There was a problem hiding this comment.
ah...we wait for the response! that's the part I missed. I thought DispatchNexusTask only do the dispatch part.
yeah agree if we need to allow multiple seconds for a task, we can't use transfer queue. cc @rkannan82 sorry I missed this when reviewing the design and pointed you in the wrong direction.
service/history/cancel_activity.go
Outdated
| return nil | ||
| } | ||
|
|
||
| ctx, cancel := context.WithTimeout(ctx, taskTimeout) |
There was a problem hiding this comment.
What's taskTimeout? You should be deliberate about setting a reasonable timeout for the whole dispatch to the worker. That's different from dispatching tasks internally within the cluster.
89d2cf5 to
79e4e3e
Compare
9ee5998 to
423bb54
Compare
79e4e3e to
7910d4e
Compare
423bb54 to
49949f7
Compare
49949f7 to
5b2a3b9
Compare
What changed?
Sends a Nexus task to the control queue on which the worker is listening for cancellation.
Main changes:
Called by transfer_queue_active_task_executor.go.
Controlled by dynamic config: EnableActivityCancellationNexusTask
Why?
To support activity cancellation without activity heartbeat.
Overall flow:
How did you test it?
Potential risks
No. The task dispatch is disabled by default.