Retriggering without in-tree-actions #52

There's a draft [RFC](https://github.com/mozilla-releng/releng-rfcs/pull/51) up for this, but I don't think we're ready for an RFC just yet, as there are some issues to work out first. Let's get consensus on the high-level approach here, and then write up the RFC to tackle the implementation details.

### How Retriggers Currently Work

Currently, when a developer retriggers a task, they click a button in Treeherder, which fires an in-tree-action hook task. The template for this in-tree-action hook is the contents of the repo's `.taskcluster.yml` file, which was downloaded by fxci-config at deploy time. The hook template is rendered from context that conforms to a [trigger_schema](https://github.com/mozilla-releng/fxci-config/blob/ec4117df913dc4b9bf3f8e87c43d84f412ffdb15/src/ciadmin/generate/in_tree_actions.py#L184) and that (presumably?) gets provided by Treeherder. It appears that Treeherder consumes the `actions.json` artifact from the Decision task, and includes this context as part of the `trigger_schema`.

This `in-tree-action` hook task then:

1. Clones the repo
2. Executes the relevant action using `taskgraph action-callback`

For retriggers, this action task typically takes ~1-2 minutes if there is a hot checkout cache, but it can take ~4-5 minutes if it needs to clone.
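To make the Treeherder side of this a bit more concrete, here is a rough sketch of looking up the retrigger entry in an `actions.json` document. The field names (`actions`, `name`, `kind`, `hookGroupId`, `hookId`, `hookPayload`) follow my reading of the Taskcluster actions spec, and the sample document and the `find_action` helper are made up for illustration, so treat this as an assumption rather than what Treeherder actually does.

```python
from typing import Any, Optional


def find_action(actions_json: dict[str, Any], name: str) -> Optional[dict[str, Any]]:
    """Return the first action entry with the given name, or None if absent."""
    for action in actions_json.get("actions", []):
        if action.get("name") == name:
            return action
    return None


# Made-up sample; a real actions.json comes from the Decision task's artifacts.
sample_actions_json = {
    "version": 1,
    "variables": {},
    "actions": [
        {
            "name": "retrigger",
            "kind": "hook",
            "hookGroupId": "example-hook-group",  # hypothetical value
            "hookId": "example-hook-id",  # hypothetical value
            "hookPayload": {},
        }
    ],
}

retrigger = find_action(sample_actions_json, "retrigger")
```

Whatever the client does with the entry afterwards (rendering `hookPayload`, triggering the hook) is out of scope for this sketch.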

### Problem

There are two things we'd like to improve here:

1. The actual performance of retriggered tasks
2. The sluggishness developers perceive when retriggering

Performance-wise, shaving ~1-2 minutes off a test task that takes ~20-40 minutes (roughly 2.5-10% of its runtime) is nice, but not a huge win imo. But the felt impact of needing to wait for an "rt" task to complete makes the delay feel much worse than it likely is. This felt impact is magnified by Treeherder's ingestion delay: first, developers need to wait for the rt task to show up; then they need to wait 1-5 minutes for it to finish; then they need to wait for their actual task to show up.

Whether it's worth the effort to improve the felt performance, or whether we should focus solely on actual performance, is still an open question imo. However, I do wholeheartedly agree that it would be much nicer overall if retriggers were snappier. Gecko test task retriggers in particular are something like 95% of all action tasks in fxci (I made that number up, but it's probably close), so focusing on these would have an outsized effect on how Taskcluster is perceived.

### Possible Solutions

The root of the issue is that the `in-tree-action` task clones the repo regardless of which action is being triggered. What we want is logic along these lines:

```
if action_type == "retrigger" and taskdef["extra"]["retrigger"] is True:
    call the TC create task API
elif action_type == "retrigger" and taskdef["extra"]["retrigger"] is False:
    call the TC rerun API
else:
    normal action task flow of cloning repo + running taskgraph action-callback
```
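As a slightly more concrete version of the pseudocode above, here is a minimal Python sketch of the dispatch. Everything in it is an assumption for illustration: `handle_action` is a hypothetical name, and the three callables stand in for the TC create task API, the TC rerun API, and the existing clone + `taskgraph action-callback` flow. It also assumes that a task with no `extra.retrigger` key falls through to the normal flow, which the pseudocode doesn't specify.

```python
from typing import Any, Callable, Mapping

TaskDef = Mapping[str, Any]


def handle_action(
    action_type: str,
    taskdef: TaskDef,
    create_task: Callable[[TaskDef], None],
    rerun_task: Callable[[TaskDef], None],
    run_action_callback: Callable[[str, TaskDef], None],
) -> str:
    """Pick the cheapest path for an action and return the name of that path."""
    # Assumption: a missing flag is neither True nor False, so it falls
    # through to the normal action task flow.
    retrigger_flag = taskdef.get("extra", {}).get("retrigger")

    if action_type == "retrigger" and retrigger_flag is True:
        # Fast path: create a fresh copy of the task, no clone needed.
        create_task(taskdef)
        return "create"
    if action_type == "retrigger" and retrigger_flag is False:
        # Fast path: rerun the existing task, no clone needed.
        rerun_task(taskdef)
        return "rerun"

    # Slow path: clone the repo and run taskgraph action-callback.
    run_action_callback(action_type, taskdef)
    return "action-callback"
```

Passing the three operations in as callables keeps the sketch independent of where the logic ends up living, which is the open question below.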

Here are three possible places we could put this logic:

1. In `.taskcluster.yml` - We could likely solve this without any fundamental changes by modifying the `.taskcluster.yml` file to behave completely differently when the action name is "retrigger". The downside is that we'd need to implement this change in every project where we want this behaviour, and it would make the `.taskcluster.yml` files even more complicated.

2. In the in-tree-action hook generation code - Another approach might be to have the code that generates these hooks "wrap" whatever is in the repo's `.taskcluster.yml` file in some outer JSON-e logic. This would avoid the downsides of option 1, but we'd be "concatenating" JSON-e, which theoretically should work fine I think, but might be messy and will definitely be confusing to future maintainers.

3. In separate global hooks - Finally, we could create brand new global hooks in the `fxci-config` repo specifically for retriggering (i.e., move them out of the "generic" actions). It's unclear to me whether there are still scenarios where we'd need to clone the repo, or whether every task boils down to a binary "should rerun or not". If it's the former, this becomes a bit more complex, but it's probably still doable. Another downside here is scopes, but we can likely mitigate this by generating one hook per trust domain / level. That would just require some new logic in ciadmin.
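To illustrate the "one hook per trust domain / level" idea from option 3, here is a small sketch of how such hooks could be enumerated. The `retrigger_hook_ids` helper and the `retrigger/<domain>-level-<n>` naming are entirely hypothetical; this is not existing ciadmin code, and the real naming and scope layout would need to be settled in the RFC.

```python
from itertools import product
from typing import Iterable


def retrigger_hook_ids(trust_domains: Iterable[str], levels: Iterable[int]) -> list[str]:
    """Build one hypothetical hook ID per (trust domain, level) pair."""
    return [
        f"retrigger/{domain}-level-{level}"
        for domain, level in product(trust_domains, levels)
    ]


# Example domains and levels, chosen only for illustration.
hook_ids = retrigger_hook_ids(["gecko", "comm"], [1, 3])
```

Each generated hook would then only need the scopes appropriate to its own trust domain and level, rather than one global hook holding all of them.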

Overall, I like option 3 the best if it's feasible. If it's not feasible, I like option 2 the second best. If we want a quick win and are happy for this feature to be Gecko-only, option 1 might even be fine.
