Update downtime protocol #569

@jpmckinney

https://ocdsdeploy.readthedocs.io/en/latest/reference/downtime.html

We can add details about:

  • alerts (uptime monitors)
  • common diagnostic steps
  • common solutions (e.g. contact the hosting provider)
  • links to relevant pages, e.g. if the server is a total loss, we need to redeploy the server and restore from backups
    • Perhaps @yolile can review the Testing backups instructions, to see if they are clear enough to use as instructions for restoring a server
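As a starting point for the "common diagnostic steps" item, here is a minimal sketch of checks run from the outside in (DNS, then TCP, then HTTP), using only the Python standard library. The function names, default port and timeouts are assumptions for illustration; none of this is taken from the current downtime page or from our uptime monitors.

```python
import http.client
import socket
import urllib.error
import urllib.request


def check_dns(host):
    """Return the resolved IP addresses for host, or [] if resolution fails."""
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


def check_tcp(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_http(url, timeout=10.0):
    """Return the HTTP status code for url, or None if no response is received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # the server answered, but with a 4xx/5xx status
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return None


def diagnose(host, url, port=443):
    """Run the three checks and return the results as a dict."""
    return {
        "dns": check_dns(host),
        "tcp": check_tcp(host, port),
        "http": check_http(url),
    }
```

For example, `diagnose("example.org", "https://example.org/")` (a placeholder host) shows which layer fails first: an empty `dns` list points to DNS, a false `tcp` to the server or firewall, and a `None` or 5xx `http` to the web server or application.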

We can consider a protocol of having a team member available when a tool is presented publicly. They are unlikely to be able to help in time if the site goes down during the presentation, so this is more relevant after the presentation, to monitor the impact of new traffic. That said, we should be testing performance beforehand. If needed, we can update our QASP to, e.g., run stress tests with ApacheBench.
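To make the stress-test idea concrete, here is a rough sketch of what such a test measures: a fixed number of requests at a fixed concurrency, then failures and latency percentiles. This is a standard-library stand-in, not ApacheBench itself (whose `-n` and `-c` options set the same two numbers). The function names and the nearest-rank 95th percentile are assumptions for illustration.

```python
import http.client
import statistics
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def timed_fetch(url, timeout=10.0):
    """Return (ok, seconds) for one GET request to url."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
            ok = 200 <= response.status < 400
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and timeouts are all OSError subclasses.
        ok = False
    return ok, time.perf_counter() - start


def run_load(fetch, total, concurrency):
    """Call fetch() `total` times using `concurrency` worker threads."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(fetch) for _ in range(total)]
        return [future.result() for future in futures]


def summarize(results):
    """Summarize a non-empty list of (ok, seconds) pairs."""
    latencies = sorted(seconds for _, seconds in results)
    failures = sum(1 for ok, _ in results if not ok)
    # Nearest-rank 95th percentile, computed with integer arithmetic.
    p95_index = max(0, (95 * len(latencies) + 99) // 100 - 1)
    return {
        "requests": len(results),
        "failures": failures,
        "median_s": statistics.median(latencies),
        "p95_s": latencies[p95_index],
    }
```

A run against a staging copy might look like `summarize(run_load(lambda: timed_fetch("https://staging.example.org/"), total=200, concurrency=10))`, where the URL is a placeholder. Passing `fetch` as a callable keeps the load loop testable without network access.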

Perhaps with these changes, the name of the page ought to change.

At the same time, we can shorten or clarify the current text as needed.

Once ready:

  • Yohanna and I can each do a test run of redeploying a server (all except DNS changes).
  • Yohanna and I can socialize the changes with the OCP team.

Noting that, based on the incident log, we expect this to be minimal additional work for Yohanna and me. I can add an "impact" column, and look up the downtime length reported on Slack, to be able to answer any questions from the team.

Labels: documentation (Improvements or additions to documentation)