Description
https://ocdsdeploy.readthedocs.io/en/latest/reference/downtime.html
We can add details about:
- alerts (uptime monitors)
- common diagnostic steps
- common solutions (e.g. contact the hosting provider)
- links to relevant pages, e.g. if the server is a total loss, we need to redeploy the server and restore from backups
- Perhaps @yolile can review the Testing backups instructions to see if they are clear enough to use as instructions for restoring a server
We can consider a protocol of having a team member available when a tool is presented publicly. They are unlikely to be able to help in time if the site goes down during the presentation, so this is more relevant after the presentation, to monitor the impact of new traffic. That said, we should be testing the performance beforehand. If needed, we can update our QASP to e.g. run stress tests with ApacheBench.
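To make the stress-test idea concrete, here is a minimal Python sketch of the kind of check ApacheBench performs (roughly what `ab -n 100 -c 10 <url>` does: a fixed number of requests at a fixed concurrency). It is a stand-in for illustration, not ApacheBench itself and not something the QASP currently specifies. The function names `timed_get` and `load_test` and the default values are hypothetical.

```python
import http.client
import statistics
import time
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def timed_get(url, timeout=10.0):
    """Fetch url once; return (HTTP status, or None if no response, elapsed seconds)."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
            status = response.status
    except urllib.error.HTTPError as error:
        # The server answered, but with a 4xx/5xx status.
        status = error.code
    except (OSError, http.client.HTTPException):
        # Connection refused, DNS failure, timeout, broken response, etc.
        status = None
    return status, time.perf_counter() - start


def load_test(url, requests=100, concurrency=10):
    """Send `requests` GETs with up to `concurrency` in flight; summarize the results."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_get, [url] * requests))
    durations = [elapsed for _, elapsed in results]
    failures = sum(1 for status, _ in results if status is None or status >= 400)
    return {
        "requests": requests,
        "failures": failures,
        "median_seconds": statistics.median(durations),
        "max_seconds": max(durations),
    }
```

For example, `load_test("https://example.com/", requests=50, concurrency=5)` (with the placeholder URL replaced by the tool being presented) returns a small dictionary that can be compared before and after a change. What counts as acceptable would still need to be agreed separately.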
Perhaps with these changes, the name of the page ought to change.
At the same time, we can shorten or clarify the current text as needed.
Once ready:
- Yohanna and I can each do a test run of redeploying a server (all except DNS changes).
- Yohanna and I can socialize the changes with the OCP team.
Note that, based on the incident log, we expect this to be minimal additional work for Yohanna and me. (I can add an "impact" column to the log, and look up the downtime length reported on Slack, to be able to answer any questions from the team.)