There's a draft RFC up for this, but I don't think we're quite at the point where we're ready for an RFC just yet as there are some issues to work out. Let's get consensus on the high level approach here and then write out the RFC to tackle the implementation details.
How Retriggers Currently Work
Currently when a developer re-triggers, they click a button in Treeherder, which fires an in-tree-action hook task. The template for this in-tree-action hook is the contents of the repo's .taskcluster.yml file, which was downloaded by fxci-config at deploy time. The in-tree-action hook template is rendered from context that conforms to a trigger_schema and that (presumably?) gets provided by Treeherder. It appears that Treeherder consumes the actions.json artifact from the Decision task, and includes this context as part of the trigger_schema.
This in-tree-action hook task then:
- Clones the repo
- Executes the relevant action using taskgraph action-callback
For retriggers, this action task typically takes ~1-2 minutes if there is a hot checkout cache. But it can take ~4-5 minutes if it needs to clone.
Problem
There are two problems we'd like to solve here:
- Improve performance of re-triggered tasks
- Reduce the sluggishness that developers perceive
Performance-wise, saving ~1-2 minutes off a test task that takes ~20-40 minutes is nice, but not a huge win imo. But needing to wait for an "rt" task to complete makes the delay feel much larger than it likely is. This felt impact is magnified by Treeherder's ingestion delay. First, developers need to wait for rt to show up. Then they need to wait 1-5 minutes for it to finish. Then they need to wait for their actual task to show up.
Whether it's worth the effort to improve the felt performance, or whether we should focus solely on actual performance, is an open question imo. However, I do wholeheartedly agree that it would be much nicer overall if retriggers were snappier. Gecko test task retriggers in particular are like 95% of all action tasks in fxci (I made that up, but it's probably close), so focusing in on these would have outsized benefits for how Taskcluster is perceived.
Possible Solutions
The root of the issue is that the in-tree-action task clones the repo regardless of what action is being re-triggered. What we want is some kind of logic that goes like this:
    if action_type == "retrigger" and taskdef["extra"]["retrigger"] is True:
        call the TC create task API
    elif action_type == "retrigger" and taskdef["extra"]["retrigger"] is False:
        call the TC rerun API
    else:
        normal action task flow of cloning repo + running taskgraph action-callback
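To make the branching concrete, here is a minimal Python sketch of just the decision step. The function name, the returned labels, and the handling of a missing flag are all hypothetical and only for illustration. It does not call any Taskcluster API, and it assumes the task definition is available as a plain dict.

```python
from typing import Any, Mapping

# Hypothetical labels for the three paths described above.
CREATE_TASK = "create_task"          # call the TC create task API
RERUN = "rerun"                      # call the TC rerun API
ACTION_CALLBACK = "action_callback"  # clone repo + taskgraph action-callback


def choose_retrigger_path(action_type: str, taskdef: Mapping[str, Any]) -> str:
    """Decide how to handle an action without cloning the repo first."""
    if action_type != "retrigger":
        # Every other action keeps the normal action task flow.
        return ACTION_CALLBACK

    retrigger = taskdef.get("extra", {}).get("retrigger")
    if retrigger is True:
        return CREATE_TASK
    if retrigger is False:
        return RERUN

    # Assumption: if the flag is missing or not a boolean, fall back to
    # the existing flow rather than guessing.
    return ACTION_CALLBACK
```

Falling back to the normal flow when the flag is absent is a conservative choice on my part, not something the current actions implement.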
Here are three possible places we could put this logic:
1. In .taskcluster.yml - We could likely solve this without any fundamental changes by modifying the .taskcluster.yml file to have completely different behaviour if the action name is "retrigger". The downside is that we'd need to implement this change in every project where we want this behaviour, and it would make the .taskcluster.yml files even more complicated.
2. In the in-tree-action hook generation code - Another approach might be to have the code that makes these hooks "wrap" whatever is in the repo's .taskcluster.yml file in some outer JSON-e logic. This would solve the downsides of 1, but we'd be "concatenating" JSON-e, which theoretically should work fine I think, but might be messy and will definitely be confusing to future maintainers of this.
3. In separate global hooks - Finally, we could create brand new global hooks in the fxci-config repo specifically for retriggering (i.e., move them out of the "generic" actions). It's unclear to me if there are still scenarios where we'd need to clone the repo, or if every task can boil down to a binary "should rerun or not". If it's the former, this becomes a bit more complex, but probably still doable. Another downside here is scopes, but we can likely mitigate this by generating one hook per trust domain / level. It would just require some new logic in ciadmin.
Overall, I like option 3 the best if it's feasible. If it's not feasible, I like option 2 the second best. If we want a quick win and are happy for this feature to be Gecko-only, option 1 might even be fine.
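For a sense of what the "wrapping" in option 2 could look like, here is a rough Python sketch that builds an outer JSON-e `$if` around the in-tree template. The function name and both template arguments are hypothetical, and it assumes the render context exposes the action name as `action.name`, which would need checking against what the hook generation code actually provides.

```python
from typing import Any, Dict


def wrap_action_template(
    in_tree_template: Dict[str, Any],
    fast_path_template: Dict[str, Any],
) -> Dict[str, Any]:
    """Wrap a repo's .taskcluster.yml contents in an outer JSON-e conditional.

    fast_path_template is a placeholder for a task definition that would
    handle retriggers without cloning the repo.
    """
    return {
        # Assumption: the context contains an `action` object with a `name`.
        "$if": "action.name == 'retrigger'",
        "then": fast_path_template,
        # Everything else renders the untouched in-tree template.
        "else": in_tree_template,
    }
```

The in-tree template is passed through unmodified, which is the "concatenating JSON-e" concern in a nutshell: the outer layer only works if its context names never clash with whatever the inner template expects.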