Document URL:
https://docs.openshift.com/container-platform/3.6/admin_guide/backup_restore.html
https://docs.openshift.com/container-platform/3.7/admin_guide/backup_restore.html
https://docs.openshift.com/container-platform/3.9/admin_guide/backup_restore.html

Describe the issue:
Backing up and restoring etcd is documented differently in multiple places. The version on https://docs.openshift.com/ has multiple flaws. The comments marked with "*" under the corresponding chapters below identify what should be changed or clarified.

Prerequisites
Cluster Backup
Master Backup
Etcd Backup
  * 1) Should master services be running or not?
  * 2) Instructions for the containerized case are missing except for a mention of "docker exec", which is not enough to explain how to create the backup in the containerized case. Also, is it really necessary to run the command in the container? In the day-two operations guide, the backup process is the same for containerized and non-containerized.
  * 3) The variable "$ETCD_DATA_DIR" is used before it is defined; just print /var/lib/etcd instead.
Registry Certificates Backup
Cluster Restore for Single-member etcd Clusters
  * No instructions for containerized etcd. Is the service name the only difference?
Cluster Restore for Multiple-member etcd Clusters
Embedded etcd
External etcd
Containerized etcd Deployments
  * 3) The --peers value and description are misleading. The example contains two IP addresses, and it is not explained where they come from, while the explanation says that "only active members" should be specified. How can you have more than one active member when you have just restored to a one-node cluster?
Non-containerized etcd Deployments
  * 3) Same as above.
Adding New etcd Hosts
  * 1) Here it is also unclear what the --peers value should be. Now it is suddenly 3 random IP addresses, but according to the description below, it should be only one at this point?
  * 4b) In the containerized case, the quotation marks need to be removed from the values added to /etc/etcd/etcd.conf; otherwise it won't work, according to the lab documentation from summit (link below).
  * 4c) In the containerized case, the service is called etcd_container and not etcd.
Bringing OpenShift Container Platform Services Back Online
Project Backup
Role Bindings
Service Accounts
Secrets
Persistent Volume Claims
Project Restore
Application Data Backup
Application Data Restore

Suggestions for improvement:
The information is too scattered and differs depending on which documentation source you look at. Some things that I think need to be clarified:
- In some places etcdctl2 is used and in others etcdctl3. What do we recommend customers use, and how?
- What is the difference between containerized and rpm-based when it comes to backup and restore?
- When adding nodes to the cluster, different options are used for --endpoints, --peers and --peer-urls depending on the API version. The values that should go with each option are not explained well. I think the best way would be to have a clear example of, e.g., 3 masters with defined IP addresses:
    master1.example.com <-> 192.168.0.1
    master2.example.com <-> 192.168.0.2
    master3.example.com <-> 192.168.0.3
  instead of the seemingly random IP addresses on docs.openshift.com, whose origin is never explained.

Additional information:
https://access.redhat.com/documentation/en-us/openshift_container_platform/3.9/html/day_two_operations_guide/day_two_host_level_tasks#day-two-guide-etcd-backup <--- this contains different instructions than what is on docs.openshift.com, but I haven't tried them.
https://rht-labs-events.github.io/summit-lab-2018-doc/#/scenario2/README <--- I have successfully restored a containerized cluster using these instructions. This was easier to follow, but is it really necessary to run the instructions inside the container?
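To illustrate the kind of defined example suggested above, here is a minimal sketch that derives the option values from the three-master mapping. It is not taken from the official documentation. It assumes the etcd default ports (2379 for clients, 2380 for peers) and https URLs, and the helper names `client_urls` and `peer_url` are hypothetical.

```python
# Hypothetical three-master cluster from the example in this report.
MASTERS = {
    "master1.example.com": "192.168.0.1",
    "master2.example.com": "192.168.0.2",
    "master3.example.com": "192.168.0.3",
}

CLIENT_PORT = 2379  # etcd default client port (assumed unchanged)
PEER_PORT = 2380    # etcd default peer port (assumed unchanged)


def client_urls(hosts, scheme="https"):
    """Comma-separated client URLs for the given hosts.

    This is the form of value passed to --endpoints (or the older --peers).
    """
    return ",".join(f"{scheme}://{MASTERS[h]}:{CLIENT_PORT}" for h in hosts)


def peer_url(host, scheme="https"):
    """Peer URL of a single host, the form of value used when adding a member."""
    return f"{scheme}://{MASTERS[host]}:{PEER_PORT}"


# Directly after restoring to a one-member cluster on master1,
# master1 is the only active member, so it is the only endpoint.
endpoints_after_restore = client_urls(["master1.example.com"])

# The member being added next (master2) is identified by its peer URL.
new_member_peer_url = peer_url("master2.example.com")

print(endpoints_after_restore)  # https://192.168.0.1:2379
print(new_member_peer_url)      # https://192.168.0.2:2380
```

With a mapping like this, every address in an example command can be traced back to a named host, and it is visible that only one endpoint exists immediately after the restore.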
All in all, I believe we need clear instructions on how to do this, with defined examples, that also clearly state where rpm-based and containerized deployments differ. A user looking for backup and restore instructions may be in a very delicate situation where faulty instructions can lead to data loss. Because of that, this is one of the most important parts of the documentation, and it needs to be very clear.
There is also a KB article that tries to explain something that is unclear in the documentation: https://access.redhat.com/solutions/3154561 This should of course be fixed in the original doc instead of being described in a KB.
*** Bug 1579344 has been marked as a duplicate of this bug. ***
Hi Ture! I think this one's going to take a couple of PRs to fix because different changes need to go on different branches. First, the KB. I think using this command, which is in 3.7 and later, makes it much clearer. Do you agree that this change makes the KB unnecessary? https://github.com/openshift/openshift-docs/pull/9601 Second, I'm working on removing the parts of the backup_restore guide in the admin section that are clearer in the day 2 guide. Are you ok with the changes for the rest of the fixes for this bug going in the versions that contain the day 2 guide, which are 3.7 and later?
Hi Kathryn!

> Do you agree that this change makes the KB unnecessary?
> https://github.com/openshift/openshift-docs/pull/9601

Yes, that PR makes the KB unnecessary.

> Are you ok with the changes for the rest of the fixes for this bug going in
> the versions that contain the day 2 guide, which are 3.7 and later?

If the backup and restore instructions end up in the Day Two Operations Guide, that is fine by me, as long as the instructions are complete and in the same location.
Hi Ture! Thanks! My current plan is to put the backup instructions in the day 2 guide and the restore instructions in the admin guide. I think they're related tasks but not in the same workflow. You don't have a version on this bug, so I'm only going to apply the changes going back to 3.7. @Vikram, how do you unpublish a KB? We'll need to unpublish https://access.redhat.com/solutions/3154561 this sprint.
@Vikram, I don't have the option to unpublish the KB. Will you unpublish it for me? The KB portion of this change is live on docs.openshift, eg https://docs.openshift.com/enterprise/3.2/admin_guide/backup_restore.html and the portal, eg https://access.redhat.com/documentation/en-us/openshift_container_platform/3.6/html-single/cluster_administration/#etcd-backup
(In reply to Kathryn Alexander from comment #8)
> @Vikram, I don't have the option to unpublish the KB. Will you unpublish it
> for me?
>
> The KB portion of this change is live on docs.openshift, eg
> https://docs.openshift.com/enterprise/3.2/admin_guide/backup_restore.html
>
> and the portal, eg
> https://access.redhat.com/documentation/en-us/openshift_container_platform/3.6/html-single/cluster_administration/#etcd-backup

Done. I have marked the KBase as outdated.