Bug 1579304 - https://docs.openshift.com/ description of etcd backup and restore differs from other sources and is in some places wrong or misleading
Summary: https://docs.openshift.com/ description of etcd backup and restore differs fr...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Documentation
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Kathryn Alexander
QA Contact: Vikram Goyal
URL:
Whiteboard:
Duplicates: 1579344
Depends On:
Blocks:
Reported: 2018-05-17 10:35 UTC by Ture Karlsson
Modified: 2021-03-24 19:08 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-24 19:08:16 UTC
Target Upstream Version:
Embargoed:



Description Ture Karlsson 2018-05-17 10:35:05 UTC
Document URL:

https://docs.openshift.com/container-platform/3.6/admin_guide/backup_restore.html
https://docs.openshift.com/container-platform/3.7/admin_guide/backup_restore.html
https://docs.openshift.com/container-platform/3.9/admin_guide/backup_restore.html

Describe the issue: 

Backing up and restoring etcd is documented differently in multiple places, and the version on https://docs.openshift.com/ has multiple flaws. The comments marked with "*" below, placed under the corresponding chapters of the documentation, point out what should be either changed or clarified.

Prerequisites
Cluster Backup
    Master Backup
    Etcd Backup

    * 1) Should master services be running or not?
    * 2) Instructions for the containerized case are missing except for a mention of "docker exec", which is not enough to explain how to create the backup in that case. Also, is it really necessary to run the command in the container? In the day-two operations guide, the backup process is the same for containerized and non-containerized.
    * 3) The variable "$ETCD_DATA_DIR" is used before it is defined; just print /var/lib/etcd instead.

    Registry Certificates Backup
Cluster Restore for Single-member etcd Clusters

* No instructions for containerized etcd. Is the service name the only difference?

Cluster Restore for Multiple-member etcd Clusters
    Embedded etcd
    External etcd
        Containerized etcd Deployments

        * 3) The --peers value and description are misleading. The example contains two IP addresses with no explanation of where they come from, and the description says that "only active members" should be specified. How can there be more than one active member when you have just restored to a one-node cluster?

        Non-containerized etcd Deployments

        * 3) Same here as above

Adding New etcd Hosts

* 1) Here it is also unclear what the --peers value should be. Now it is suddenly three seemingly random IP addresses, but according to the description below, there should be only one at this point?
* 4b) In the containerized case, the quotation marks need to be removed from the values added to /etc/etcd/etcd.conf; otherwise it won't work, according to the lab documentation from summit (link below).
* 4c) In the containerized case, the service is called etcd_container and not etcd.
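
To illustrate point 4b, here is a minimal Python sketch of stripping the surrounding quotation marks from KEY="value" lines in a file such as /etc/etcd/etcd.conf. The helper name strip_value_quotes and the sample values are hypothetical and not taken from the documentation; whether unquoted values are actually required is based only on the summit lab instructions linked below.

```python
import re

# Matches KEY="value" or KEY='value'; commented-out lines do not match
# because the key must start the line (after optional whitespace).
_QUOTED = re.compile(r'^(\s*[A-Za-z_][A-Za-z0-9_]*=)(["\'])(.*)\2\s*$')


def strip_value_quotes(conf_text: str) -> str:
    """Return conf_text with matching outer quotes removed from each value."""
    cleaned = []
    for line in conf_text.splitlines():
        match = _QUOTED.match(line)
        # Keep the KEY= prefix and the bare value; leave other lines untouched.
        cleaned.append(match.group(1) + match.group(3) if match else line)
    result = "\n".join(cleaned)
    # splitlines() drops a trailing newline, so restore it if it was there.
    return result + "\n" if conf_text.endswith("\n") else result


if __name__ == "__main__":
    sample = 'ETCD_NAME="master1"\nETCD_DATA_DIR=/var/lib/etcd\n'  # hypothetical values
    print(strip_value_quotes(sample), end="")
```

The sketch only transforms text; reading and writing the real file, and restarting the etcd_container service afterwards, are left out on purpose.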

Bringing OpenShift Container Platform Services Back Online
Project Backup
    Role Bindings
    Service Accounts
    Secrets
    Persistent Volume Claims
Project Restore
Application Data Backup
Application Data Restore

Suggestions for improvement: 

The information for this is too scattered and differs depending on which documentation source you look at. Some things that I think need to be clarified:

- In some places etcdctl2 is used and in others etcdctl3. What do we recommend customers use, and how?

- What is the difference between containerized and rpm-based when it comes to backup and restore? 

- When adding nodes to the cluster, different options are used for --endpoints, --peers and --peer-urls depending on the API version. The values that should go with each option are not explained well. I think the best way would be to have a clear example of e.g. 3 masters with defined IP addresses, e.g.:
master1.example.com <-> 192.168.0.1
master2.example.com <-> 192.168.0.2
master3.example.com <-> 192.168.0.3
This would replace the seemingly random IP addresses currently used on docs.openshift.com, which are never explained.
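
As a sketch of what such a defined example could look like, the following Python snippet derives the option values from the three-master mapping above. It assumes etcd's default client port 2379 and peer port 2380 and https URLs; the function names are hypothetical, and which option (--peers, --endpoints or --peer-urls) takes which value still depends on the API version and must be confirmed against the etcdctl documentation.

```python
# Host-to-IP mapping taken from the example in this report.
MASTERS = {
    "master1.example.com": "192.168.0.1",
    "master2.example.com": "192.168.0.2",
    "master3.example.com": "192.168.0.3",
}

CLIENT_PORT = 2379  # etcd default client port (assumed unchanged)
PEER_PORT = 2380    # etcd default peer port (assumed unchanged)


def client_endpoints(ips):
    """Comma-separated client URLs for the members that are currently active."""
    return ",".join(f"https://{ip}:{CLIENT_PORT}" for ip in ips)


def peer_url(ip):
    """Peer URL advertised by a single member, e.g. one being added."""
    return f"https://{ip}:{PEER_PORT}"


if __name__ == "__main__":
    # Right after restoring to a one-member cluster, only master1 is active.
    print(client_endpoints([MASTERS["master1.example.com"]]))
    # The member being added next advertises its own peer URL.
    print(peer_url(MASTERS["master2.example.com"]))
```

With named hosts like these, it is visible at each step why the list of active members starts with a single address and grows as members are added.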

Additional information: 

https://access.redhat.com/documentation/en-us/openshift_container_platform/3.9/html/day_two_operations_guide/day_two_host_level_tasks#day-two-guide-etcd-backup        <--- this contains different instructions than what is on docs.openshift.com, but I haven't tried them.
https://rht-labs-events.github.io/summit-lab-2018-doc/#/scenario2/README      <--- I have successfully restored a containerized cluster using these instructions. This was easier to follow, but is it really necessary to run the instructions inside the container?

All in all, I believe we need clear instructions on how to do this, with defined examples. They should also clearly state where the rpm-based and containerized procedures differ.

When users look for backup and restore instructions, they might be in a very delicate situation where faulty instructions can lead to data loss. Because of that, this is one of the most important parts of the documentation, and it needs to be very clear.

Comment 1 Ture Karlsson 2018-05-17 10:42:00 UTC
There is also a KB article that tries to explain something that is unclear in the documentation: https://access.redhat.com/solutions/3154561

This should of course be changed in the original doc instead of being described in a KB.

Comment 2 Kathryn Alexander 2018-05-17 12:44:28 UTC
*** Bug 1579344 has been marked as a duplicate of this bug. ***

Comment 3 Kathryn Alexander 2018-05-25 17:21:19 UTC
Hi Ture! I think this one's going to take a couple of PRs to fix because different changes need to go on different branches.

First, the KB. I think using this command, which is in 3.7 and later, makes it much clearer. Do you agree that this change makes the KB unnecessary? https://github.com/openshift/openshift-docs/pull/9601


Second, I'm working on removing the parts of the backup_restore guide in the admin section that are clearer in the day 2 guide. Are you ok with the changes for the rest of the fixes for this bug going in the versions that contain the day 2 guide, which is everything after 3.7?

Comment 4 Kathryn Alexander 2018-05-25 17:24:33 UTC
Hi Ture! I think this one's going to take a couple of PRs to fix because different changes need to go on different branches.

First, the KB. I think using this command, which is in 3.7 and later, makes it much clearer. Do you agree that this change makes the KB unnecessary? https://github.com/openshift/openshift-docs/pull/9601


Second, I'm working on removing the parts of the backup_restore guide in the admin section that are clearer in the day 2 guide. Are you ok with the changes for the rest of the fixes for this bug going in the versions that contain the day 2 guide, which are 3.7 and later?

Comment 5 Ture Karlsson 2018-05-28 07:03:33 UTC
Hi Kathryn!

> Do you agree that this change makes the KB unnecessary? https://github.com/openshift/openshift-docs/pull/9601

Yes that PR makes the KB unnecessary.

> Are you ok with the changes for the rest of the fixes for this bug going in the versions that contain the day 2 guide, which are 3.7 and later?

If the backup and restore instructions end up in the Day Two Operations Guide, that is fine by me, as long as the instructions are complete and in the same location.

Comment 6 Kathryn Alexander 2018-05-29 14:00:15 UTC
Hi Ture!

Thanks!

My current plan is to put the backup instructions in the day 2 guide and the restore instructions in the admin guide. I think they're related tasks but not in the same workflow.

You don't have a version on this bug, so I'm only going to apply the changes going back to 3.7.

@Vikram, how do you unpublish a KB? We'll need to unpublish https://access.redhat.com/solutions/3154561 this sprint.

Comment 8 Kathryn Alexander 2018-05-31 15:06:09 UTC
@Vikram, I don't have the option to unpublish the KB. Will you unpublish it for me?

The KB portion of this change is live on docs.openshift, eg https://docs.openshift.com/enterprise/3.2/admin_guide/backup_restore.html

and the portal, eg https://access.redhat.com/documentation/en-us/openshift_container_platform/3.6/html-single/cluster_administration/#etcd-backup

Comment 9 Vikram Goyal 2018-06-04 02:51:01 UTC
(In reply to Kathryn Alexander from comment #8)
> @Vikram, I don't have the option to unpublish the KB. Will you unpublish it
> for me?
> 
> The KB portion of this change is live on docs.openshift, eg
> https://docs.openshift.com/enterprise/3.2/admin_guide/backup_restore.html
> 
> and the portal, eg
> https://access.redhat.com/documentation/en-us/openshift_container_platform/3.6/html-single/cluster_administration/#etcd-backup

Done. I have marked the KBase as outdated.

