Bug 1656397

Summary: [DOCS] Master/etcd backup and restore documentation is confusing and needs cleaning up
Product: OpenShift Container Platform Reporter: Paul Dwyer <pdwyer>
Component: Documentation Assignee: Andrew Taylor <antaylor>
Status: CLOSED CURRENTRELEASE QA Contact: ge liu <geliu>
Severity: high Docs Contact: Vikram Goyal <vigoyal>
Priority: high    
Version: 3.11.0 CC: antaylor, aos-bugs, clichybi, dmoessne, jialiu, jokerman, knakayam, ksuzumur, mmccomas, sburke, vigoyal
Target Milestone: ---   
Target Release: 3.11.z   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1744543 (view as bug list) Environment:
Last Closed: 2019-08-20 17:23:38 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1744543    

Description Paul Dwyer 2018-12-05 12:22:07 UTC
Document URL: 
https://docs.openshift.com/container-platform/3.11/day_two_guide/host_level_tasks.html

Describe the issue: 
The docs are currently very confusing as they seem to be trying to cover a wide range of scenarios.
In a pressure situation (i.e. when trying to recover an environment), incorrect steps will be followed, leading to escalations. Not every customer tests backup and recovery before it's too late.

The way the sections are laid out, there are duplicate steps and information, which again leads to confusion.

There is no clear flow through the steps. For example, when recovering etcd, once you have recovered etcd on one node, which steps do you follow next to scale up the other etcd nodes? The same question applies if you are following only the steps relating to static pods.

Is there a need to have v2 and v3 commands?
We should be using v3 data in etcd since 3.6, yes?

Suggestions for improvement: 
https://docs.openshift.com/container-platform/3.11/day_two_guide/host_level_tasks.html#creating-master-backup_deprecating-etcd

This does not mention backing up pod definition /etc/origin/node/pods/etcd.yaml for static pods which is mentioned in the recovery steps
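
A minimal sketch of what such a backup step could look like, assuming a plain file copy into a dated directory is sufficient. The function name `backup_pod_definition` and the backup directory layout are hypothetical and are not taken from the documentation; only the `/etc/origin/node/pods/etcd.yaml` path comes from this report.

```shell
# Sketch only: copy the etcd static pod definition into a dated backup directory.
# backup_pod_definition and its directory layout are hypothetical names.
backup_pod_definition() {
    manifest="$1"      # e.g. /etc/origin/node/pods/etcd.yaml
    backup_root="$2"   # e.g. a backup directory of your choice
    if [ ! -f "$manifest" ]; then
        echo "pod definition not found: $manifest" >&2
        return 1
    fi
    dest="$backup_root/etcd-pod-$(date +%Y%m%d)"
    mkdir -p "$dest" || return 1
    # -p preserves the file mode and timestamps of the original
    cp -p "$manifest" "$dest/" || return 1
    # Print the path of the copy so callers can record it
    echo "$dest/$(basename "$manifest")"
}
```

It might be called as `backup_pod_definition /etc/origin/node/pods/etcd.yaml /backup`, where `/backup` is a placeholder.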

The default in 3.11 is static pods for etcd so this should be the main focus, with other scenarios in separate sections.

https://docs.openshift.com/container-platform/3.11/day_two_guide/host_level_tasks.html#etcd-data-backup_backing-up-etcd-before-deprecating

In steps for static pods:
"Specify the etcd endpoint IP address that you obtained." - why not use the variable you created $ETCD_EP


https://docs.openshift.com/container-platform/3.11/day_two_guide/host_level_tasks.html#restoring-etcd_deprecating-etcd
If using static pods then the following step is not needed
# systemctl restart etcd.service
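
One way a script could decide whether that restart applies is sketched below. It assumes, and this is only an assumption, that the presence of the pod definition file at `/etc/origin/node/pods/etcd.yaml` indicates a static pod deployment. The function names are hypothetical.

```shell
# Sketch only: treat the presence of the static pod definition as the signal
# that etcd runs as a static pod (an assumption, not a documented check).
etcd_is_static_pod() {
    manifest="${1:-/etc/origin/node/pods/etcd.yaml}"
    [ -f "$manifest" ]
}

# Print a hint instead of restarting anything.
restart_hint() {
    if etcd_is_static_pod "$1"; then
        echo "static pod: skip 'systemctl restart etcd.service'"
    else
        echo "no static pod definition found: 'systemctl restart etcd.service' may apply"
    fi
}
```

The sketch prints a hint rather than restarting a service, so it is safe to run on any host.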

https://docs.openshift.com/container-platform/3.11/day_two_guide/host_level_tasks.html#restoring-etcd-on-a-static-pod
building etcd tools will fail if go is not available on host

https://docs.openshift.com/container-platform/3.11/day_two_guide/host_level_tasks.html#restoring-etcd-v2-v3-data

Clarify whether you do actually need to do most of the steps on all nodes simultaneously (or not) and if so make that clear and fix up the last statement which says:

9. After the first instance is running, you can restore the rest of your etcd servers.

If it's not the case and you do one at time make that clear in the earlier steps.

- https://docs.openshift.com/container-platform/3.11/day_two_guide/host_level_tasks.html#restoring-etcd-v3-snapshot

Add in notes about adding other members back in if you're recovering to an "existing cluster" or to go run scale if you're adding new ones from scratch

- https://docs.openshift.com/container-platform/3.11/day_two_guide/host_level_tasks.html#restoring-etcd-on-a-static-pod

Add in notes about adding other members back in if you're recovering to an "existing cluster" or to go run scale if you're adding new ones from scratch

Additional information: 
There is also probably duplication of this same info in other sections of the docs as well which would need cleaning up

Comment 9 Andrew Taylor 2019-04-11 20:29:57 UTC
Hi Paul,

Thanks for filing this bug. I agree that the steps are confusing and difficult to follow in a panic situation. 

As this bug is requesting an extensive modification to our documentation, I intend to double (and triple) check every change you've mentioned to ensure that we get all the necessary changes covered. I did my best to break it down into categories of changes. 

#######
Item 1: 
#######

> Is there a need to have v2 and v3 commands?
> We should be using v3 data in etcd since 3.6, yes?

Yes, or even 3.5. Not sure how far back we'll make these changes; I'll ask QE when we get a pull request filed.  

First, I removed all mentions of v2. I took some liberties with the notes (blue circle with an "i" in it). One simply stated that etcdctl2 and etcdctl3 were aliases, and another said that these aliases do not provide the full endpoint list to the etcdctl command, so I combined them:

[NOTE]
====
`etcdctl3` is an alias for the `etcdctl` tool that contains the proper flags to
query the etcd cluster. However, the `etcdctl3` alias does not provide the full
endpoint list to the `etcdctl` command, so you must specify the `--endpoints`
option to list all the endpoints. 
====
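
As an illustration of the note above, the full endpoint list could be assembled as follows. This is a sketch: `join_endpoints` is a hypothetical helper, and it assumes every member listens on etcd's default client port 2379 over HTTPS.

```shell
# Sketch only: build a comma-separated --endpoints value from member host names.
# Assumes each member serves clients on https://<host>:2379.
join_endpoints() {
    result=""
    for host in "$@"; do
        # Prepend a comma only when result already holds an entry
        result="${result:+$result,}https://${host}:2379"
    done
    printf '%s\n' "$result"
}
```

With placeholder host names, the result could be passed as `etcdctl3 --endpoints "$(join_endpoints master-0 master-1 master-2)" endpoint health`.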

I also replaced all etcdctl2 with etcdctl3 commands.  

#######
Item 2: 
#######

> Suggestions for improvement: 
> https://docs.openshift.com/container-platform/3.11/day_two_guide/host_level_tasks.html#creating-master-backup_deprecating-etcd

> This does not mention backing up pod definition /etc/origin/node/pods/etcd.yaml for static pods  which is mentioned in the recovery steps
> The default in 3.11 is static pods for etcd so this should be the main focus, with other scenarios in separate sections.

We can probably include the backup pod definitions section here. I will need to discuss exactly how we will make these changes with my team. 

#######
Item 3: 
#######

> https://docs.openshift.com/container-platform/3.11/day_two_guide/host_level_tasks.html#etcd-data-backup_backing-up-etcd-before-deprecating

> In steps for static pods:
> "Specify the etcd endpoint IP address that you obtained." - why not use the variable you created $ETCD_EP

The code block above, where the (1) callout is, includes the <ETCD_EP> variable. The "Specify the etcd endpoint..." note just provides clarification. I think you're right, and this could safely be changed to state `--endpoints ${ETCD_EP}`. I'm open to any input you may have from personal testing.
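
A sketch of the variable reuse discussed in this item: derive the endpoint from the pod definition once, then pass it straight to `--endpoints`. The helper names are hypothetical. The snapshot command is only printed, not run, and the certificate options a real cluster would need are omitted.

```shell
# Sketch only: extract host:port from the first https:// URL in the pod definition.
etcd_endpoint_from_manifest() {
    grep 'https://' "$1" | head -n 1 | cut -d '/' -f3
}

# Print (do not run) an etcdctl v3 snapshot command that reuses the endpoint.
# Certificate flags are intentionally left out of this illustration.
snapshot_command() {
    printf 'ETCDCTL_API=3 etcdctl --endpoints %s snapshot save %s\n' "$1" "$2"
}
```

For example, `ETCD_EP=$(etcd_endpoint_from_manifest /etc/origin/node/pods/etcd.yaml)` followed by `snapshot_command "$ETCD_EP" /var/lib/etcd/snapshot.db` prints the command with the endpoint filled in, so the reader never retypes the IP address. The snapshot path here is a placeholder.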

#######
Item 4: 
#######

> https://docs.openshift.com/container-platform/3.11/day_two_guide/host_level_tasks.html#restoring-etcd_deprecating-etcd
> If using static pods then the following step is not needed
> # systemctl restart etcd.service

Removed, since that section is specifically about static pods. 

There was a note beneath this describing "if you run etcd as a static pod..."  which I modified to say "as a static pod (the default configuration)..." for clarity to the end user. The phrasing may change while under peer review, but I think that concept is sufficient. 

#######
Item 5: 
#######

> https://docs.openshift.com/container-platform/3.11/day_two_guide/host_level_tasks.html#restoring-etcd-on-a-static-pod
> building etcd tools will fail if go is not available on host

Sorry, what do you mean by "if go is not available on the host"?

#######
Item 6: 
#######

> https://docs.openshift.com/container-platform/3.11/day_two_guide/host_level_tasks.html#restoring-etcd-v2-v3-data
> Clarify whether you do actually need to do most of the steps on all nodes simultaneously (or not) and if so make that clear and fix up the last statement which says:
> 9. After the first instance is running, you can restore the rest of your etcd servers. 
> If it's not the case and you do one at time make that clear in the earlier steps.

To me, step 9 reads "do this on the first node, then restore the rest of your etcd servers". I will likely have to confirm this with engineering. 

#######
Item 7: 
#######

> https://docs.openshift.com/container-platform/3.11/day_two_guide/host_level_tasks.html#restoring-etcd-v3-snapshot
> Add in notes about adding other members back in if you're recovering to an "existing cluster" or to go run scale if you're adding new ones from scratch
> https://docs.openshift.com/container-platform/3.11/day_two_guide/host_level_tasks.html#restoring-etcd-on-a-static-pod
> Add in notes about adding other members back in if you're recovering to an "existing cluster" or to go run scale if you're adding new ones from scratch

I added a step in those sections to "Repeat the previous steps for every etcd node to be added." Can you elaborate on "go run scale"? Is this in reference to a specific command that needs to be run on new clusters?

#######
Item 8: 
#######

> Additional information: 
> There is also probably duplication of this same info in other sections of the docs as well which would need cleaning up

These docs are modularized, meaning that a section referenced on another page uses the same content. Making these changes here will change them elsewhere in the documentation as well.



##############
CURRENT STATUS
##############

I will leave this as assigned to me while I discuss several points mentioned above with my team, which may take some time. 

If you could provide any feedback on the above (particularly item 5) that will be helpful. I'll set a needinfo on you (Paul) in the meantime. 



Let me know if you have any questions!

Thanks,
Andrew

Comment 18 Andrew Taylor 2019-07-31 21:16:14 UTC
Sorry for the delay. Below is an update on each item: 

#######
Item 1: 
#######

> Is there a need to have v2 and v3 commands?
> We should be using v3 data in etcd since 3.6, yes?

According to several previous bugs, it seems that v2 storage was used even after v3 came out, which is why the v2 and v3 commands overlap. Per Scott Dodson in June 2018:

"When etcd 3.x was initially released and we were still using v2 storage we found that backing up only v2 data prevented etcd from starting properly after restoration. I've not validated that the inverse (only backing up v3) is true, but given the varied use of datastores it seems safest to backup and restore both datastores until such a time that we ship a version of etcd that removes v2 support."

http://post-office.corp.redhat.com/archives/openshift-sme/2018-June/msg00584.html

As v2 is still supported, I believe we should continue with Scott's advice and keep the v2 sections in place. 

If Red Hat customers need a v3 guide, we have a KCS solution for their use:

https://access.redhat.com/solutions/1981013

#######
Item 2: 
#######

> Suggestions for improvement: 
> https://docs.openshift.com/container-platform/3.11/day_two_guide/host_level_tasks.html#creating-master-backup_deprecating-etcd

> This does not mention backing up pod definition /etc/origin/node/pods/etcd.yaml for static pods  which is mentioned in the recovery steps

Fixed.

> The default in 3.11 is static pods for etcd so this should be the main focus, with other scenarios in separate sections.

I agree in principle; however, I do not wish to change this, for the same reasons as in Item 1 above.


#######
Item 3: 
#######

> https://docs.openshift.com/container-platform/3.11/day_two_guide/host_level_tasks.html#etcd-data-backup_backing-up-etcd-before-deprecating

> In steps for static pods:
> "Specify the etcd endpoint IP address that you obtained." - why not use the variable you created $ETCD_EP

The code block above, where the (1) callout is, includes the <ETCD_EP> variable. The "Specify the etcd endpoint..." note just provides clarification. I think you're right, and this could safely be changed to state `--endpoints $ETCD_EP`. I'm open to any input you may have from personal testing.

#######
Item 4: 
#######

> https://docs.openshift.com/container-platform/3.11/day_two_guide/host_level_tasks.html#restoring-etcd_deprecating-etcd
> If using static pods then the following step is not needed
> # systemctl restart etcd.service

Removed, since that section is specifically about static pods. 

#######
Item 5: 
#######

> https://docs.openshift.com/container-platform/3.11/day_two_guide/host_level_tasks.html#restoring-etcd-on-a-static-pod
> building etcd tools will fail if go is not available on host

> This piece was getting customers to build etcd tools direct on the servers, to do this they will need the "go" command installed otherwise it will not work.
> I am not a fan of getting customers to install compilers/extra tools and build things on production environments.
> I would suggest not asking customers to do this.

I believe installing Go should be a prerequisite if building the etcd tools is required. I added this to the note in this section.
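
A prerequisite check along these lines could run before any build step. This is a sketch: `require_tool` is a hypothetical helper built on the POSIX `command -v` lookup, and it only reports whether a tool is on the `PATH`.

```shell
# Sketch only: fail early when a required tool (for example, go) is not on PATH.
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "required tool not found on PATH: $1" >&2
    return 1
}
```

A build script could start with `require_tool go || exit 1`, so that the failure is reported before any build command runs.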

#######
Item 6: 
#######

> https://docs.openshift.com/container-platform/3.11/day_two_guide/host_level_tasks.html#restoring-etcd-v2-v3-data
> Clarify whether you do actually need to do most of the steps on all nodes simultaneously (or not) and if so make that clear and fix up the last statement which says:
> 9. After the first instance is running, you can restore the rest of your etcd servers. 
> If it's not the case and you do one at time make that clear in the earlier steps.

To me, step 9 reads "do this on the first node, then restore the rest of your etcd servers". It says this in the introduction. No changes made.  

#######
Item 7: 
#######

> https://docs.openshift.com/container-platform/3.11/day_two_guide/host_level_tasks.html#restoring-etcd-v3-snapshot
> Add in notes about adding other members back in if you're recovering to an "existing cluster" or to go run scale if you're adding new ones from scratch
> https://docs.openshift.com/container-platform/3.11/day_two_guide/host_level_tasks.html#restoring-etcd-on-a-static-pod
> Add in notes about adding other members back in if you're recovering to an "existing cluster" or to go run scale if you're adding new ones from scratch

I added a step in those sections to "Repeat the previous steps for every etcd node to be added." 

#######
Item 8: 
#######

> Additional information: 
> There is also probably duplication of this same info in other sections of the docs as well which would need cleaning up

These docs are modularized, meaning that a section referenced on another page uses the same content. Making these changes here will change them elsewhere in the documentation as well. This item is resolved.

##############
CURRENT STATUS
##############

Pull request filed:

https://github.com/openshift/openshift-docs/pull/16086

Comment 21 Andrew Taylor 2019-08-15 18:21:16 UTC
Changes merged; setting release pending.

Comment 22 Andrew Taylor 2019-08-20 17:23:38 UTC
These changes are now live: 
https://docs.openshift.com/container-platform/3.11/day_two_guide/host_level_tasks.html

I will close this bug at this time.