Several services provided by the RDO Infrastructure are considered critical (see https://www.rdoproject.org/infra/service-continuity/). A failure or degradation of any of them can affect multiple projects. The RDO Infra maintainers have different levels of knowledge about each service, so we need to document the most common troubleshooting procedures for the critical services. This will help in the following areas:
- Allowing consistent troubleshooting when the most knowledgeable person is not around (e.g. weekends or holidays).
- Preventing the "someone gets hit by a bus" effect.
Based on the RDO Service Continuity page, we should document troubleshooting procedures for at least:
- review.rdoproject.org nodepool nodes (or nodepool in general)
- RDO Trunk repositories
- DLRN DB instance
- images.rdoproject.org
- trunk.registry.rdoproject.org
- www.rdoproject.org
- lists.rdoproject.org
We should rely on upstream published documentation as much as possible.
It's okay to link to a documentation location (in git or on readthedocs) from the service continuity page, but I'm not sure rdoproject.org is a good place for technical documentation like this. Upstream uses system-config [1][2] for this purpose. Our equivalent would be rdo-infra-playbooks, I guess? Some projects already have their own built-in documentation (RDO registry, delorean, weirdo), so we could link to them as appropriate from the "main" documentation hub.
[1]: https://docs.openstack.org/infra/system-config/
[2]: https://github.com/openstack-infra/system-config
Maybe we could create a new rdo-docs repo for that, and publish it to readthedocs.org as mentioned on IRC?
The RDO Registry documentation is published at http://rdo-container-registry.readthedocs.io/