Bug 1656397
| Field | Value |
|---|---|
| Summary | [DOCS] Master/etcd backup and restore documentation is confusing and needs cleaning up |
| Product | OpenShift Container Platform |
| Component | Documentation |
| Status | CLOSED CURRENTRELEASE |
| Severity | high |
| Priority | high |
| Version | 3.11.0 |
| Target Release | 3.11.z |
| Hardware | x86_64 |
| OS | Linux |
| Reporter | Paul Dwyer <pdwyer> |
| Assignee | Andrew Taylor <antaylor> |
| QA Contact | ge liu <geliu> |
| Docs Contact | Vikram Goyal <vigoyal> |
| CC | antaylor, aos-bugs, clichybi, dmoessne, jialiu, jokerman, knakayam, ksuzumur, mmccomas, sburke, vigoyal |
| Doc Type | If docs needed, set a value |
| Clones | 1744543 |
| Last Closed | 2019-08-20 17:23:38 UTC |
| Type | Bug |
| Bug Blocks | 1744543 |
Description
Paul Dwyer
2018-12-05 12:22:07 UTC
Hi Paul,

Thanks for filing this bug. I agree that the steps are confusing and difficult to follow in a panic situation. Because this bug requests an extensive modification to our documentation, I intend to double (and triple) check every change you've mentioned to ensure that all the necessary changes are covered. I did my best to break it down into categories of changes.

####### Item 1: #######

> Is there a need to have v2 and v3 commands?
> We should be using v3 data in etcd since 3.6, yes?

Yes, or even 3.5. I'm not sure how far back we'll make these changes; I'll ask QE when we get a pull request filed.

First, I removed all mentions of v2. I took some liberties with the notes (blue circle with an "i" in it). One simply stated that etcdctl2 and etcdctl3 were aliases, and another said that these aliases do not provide the full endpoint list to the etcdctl command, so I combined them:

[NOTE]
====
`etcdctl3` is an alias for the `etcdctl` tool that contains the proper flags to query the etcd cluster. However, the `etcdctl3` alias does not provide the full endpoint list to the `etcdctl` command, so you must specify the `--endpoints` option to list all the endpoints.
====

I also replaced all etcdctl2 commands with etcdctl3 commands.

####### Item 2: #######

> Suggestions for improvement:
> https://docs.openshift.com/container-platform/3.11/day_two_guide/host_level_tasks.html#creating-master-backup_deprecating-etcd
> This does not mention backing up pod definition /etc/origin/node/pods/etcd.yaml for static pods which is mentioned in the recovery steps
> The default in 3.11 is static pods for etcd so this should be the main focus, with other scenarios in separate sections.

We can probably include the backup pod definitions section here. I will need to discuss exactly how we will make these changes with my team.
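For illustration, the pod definition backup raised in Item 2 could be sketched roughly as follows. This is a minimal sketch, not the documented procedure: the `backup_pod_definition` helper and the destination directory in the example call are hypothetical, and only the `/etc/origin/node/pods/etcd.yaml` path comes from the report above.

```shell
#!/bin/sh
# Hypothetical helper: copy a static pod definition into a backup directory,
# preserving mode and timestamps. Neither the function name nor the
# destination layout comes from the product documentation.
backup_pod_definition() {
  src="$1"
  dest_dir="$2"
  if [ ! -f "$src" ]; then
    echo "pod definition not found: $src" >&2
    return 1
  fi
  mkdir -p "$dest_dir" && cp -p "$src" "$dest_dir/"
}

# Example call (commented out so the sketch has no side effects when sourced;
# the /backup destination is an assumed location):
# backup_pod_definition /etc/origin/node/pods/etcd.yaml "/backup/etcd-config-$(date +%Y%m%d)"
```

Failing loudly when the source file is missing matters here, because a silent no-op would leave the recovery steps without the file they expect.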
####### Item 3: #######

> https://docs.openshift.com/container-platform/3.11/day_two_guide/host_level_tasks.html#etcd-data-backup_backing-up-etcd-before-deprecating
> In steps for static pods:
> "Specify the etcd endpoint IP address that you obtained." - why not use the variable you created $ETCD_EP

The code block above where the (1) is includes the <ETCD_EP> variable. The "Specify the etcd endpoint..." note is just providing clarification. I think you're right and this could be safely changed to just state `--endpoints ${ETCD_EP}`. I'm open to any input you may have from personal testing.

####### Item 4: #######

> https://docs.openshift.com/container-platform/3.11/day_two_guide/host_level_tasks.html#restoring-etcd_deprecating-etcd
> If using static pods then the following step is not needed
> # systemctl restart etcd.service

Removed, since that section is specifically about static pods. There was a note beneath this describing "if you run etcd as a static pod..." which I modified to say "as a static pod (the default configuration)..." for clarity to the end user. The phrasing may change while under peer review, but I think that concept is sufficient.

####### Item 5: #######

> https://docs.openshift.com/container-platform/3.11/day_two_guide/host_level_tasks.html#restoring-etcd-on-a-static-pod
> building etcd tools will fail if go is not available on host

Sorry, what do you mean by "if go is not available on the host"?

####### Item 6: #######

> https://docs.openshift.com/container-platform/3.11/day_two_guide/host_level_tasks.html#restoring-etcd-v2-v3-data
> Clarify whether you do actually need to do most of the steps on all nodes simultaneously (or not) and if so make that clear and fix up the last statement which says:
> 9. After the first instance is running, you can restore the rest of your etcd servers.
> If it's not the case and you do one at time make that clear in the earlier steps.
To me, step 9 reads "do this on the first node, then restore the rest of your etcd servers". I will likely have to confirm this with engineering.

####### Item 7: #######

> https://docs.openshift.com/container-platform/3.11/day_two_guide/host_level_tasks.html#restoring-etcd-v3-snapshot
> Add in notes about adding other members back in if you're recovering to an "existing cluster" or to go run scale if you're adding new ones from scratch
> https://docs.openshift.com/container-platform/3.11/day_two_guide/host_level_tasks.html#restoring-etcd-on-a-static-pod
> Add in notes about adding other members back in if you're recovering to an "existing cluster" or to go run scale if you're adding new ones from scratch

I added a step in those sections to "Repeat the previous steps for every etcd node to be added." Can you elaborate on "go run scale"? Is this in reference to a specific command that needs to be run on new clusters?

####### Item 8: #######

> Additional information:
> There is also probably duplication of this same info in other sections of the docs as well which would need cleaning up

These docs are modularized, meaning we can reference sections on another page and it's using the same content. Making these changes here will change them elsewhere in the documentation as well.

############## CURRENT STATUS ##############

I will leave this as assigned to me while I discuss several points mentioned above with my team, which may take some time. If you could provide any feedback on the above (particularly item 5), that will be helpful. I'll set a needinfo on you (Paul) in the meantime. Let me know if you have any questions!

Thanks,
Andrew

Sorry for the delay. Below is an update on each item:

####### Item 1: #######

> Is there a need to have v2 and v3 commands?
> We should be using v3 data in etcd since 3.6, yes?

According to several previous bugs, it seems that v2 storage was used even after v3 came out, which is why the v2 and v3 commands overlap.
Per Scott Dodson in June 2018: "When etcd 3.x was initially released and we were still using v2 storage we found that backing up only v2 data prevented etcd from starting properly after restoration. I've not validated that the inverse (only backing up v3) is true, but given the varied use of datastores it seems safest to backup and restore both datastores until such a time that we ship a version of etcd that removes v2 support."
http://post-office.corp.redhat.com/archives/openshift-sme/2018-June/msg00584.html

As v2 is still supported, I believe we should continue with Scott's advice and keep the v2 sections in place. If Red Hat customers need a v3 guide, we have KCS for their use: https://access.redhat.com/solutions/1981013

####### Item 2: #######

> Suggestions for improvement:
> https://docs.openshift.com/container-platform/3.11/day_two_guide/host_level_tasks.html#creating-master-backup_deprecating-etcd
> This does not mention backing up pod definition /etc/origin/node/pods/etcd.yaml for static pods which is mentioned in the recovery steps

Fixed.

> The default in 3.11 is static pods for etcd so this should be the main focus, with other scenarios in separate sections.

I agree in principle, however I do not wish to change this for the same reasons as Item 1 above.

####### Item 3: #######

> https://docs.openshift.com/container-platform/3.11/day_two_guide/host_level_tasks.html#etcd-data-backup_backing-up-etcd-before-deprecating
> In steps for static pods:
> "Specify the etcd endpoint IP address that you obtained." - why not use the variable you created $ETCD_EP

The code block above where the (1) is includes the <ETCD_EP> variable. The "Specify the etcd endpoint..." note is just providing clarification. I think you're right and this could be safely changed to just state `--endpoints $ETCD_EP`. I'm open to any input you may have from personal testing.
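To illustrate the Item 3 suggestion, here is a minimal sketch of reusing the variable rather than retyping the address. The endpoint value, the snapshot path, and the `snapshot_cmd` helper are placeholders introduced for this example, and the helper only prints the command line, so nothing is run against a cluster. The exact wrapper the docs use around this command (for example, certificate flags or running it inside the etcd pod) is not reproduced here.

```shell
#!/bin/sh
# Placeholder endpoint; in the documented procedure ETCD_EP is set in an
# earlier step. 192.0.2.10 is a documentation-only address (RFC 5737).
ETCD_EP="${ETCD_EP:-https://192.0.2.10:2379}"

# Hypothetical helper: print (not run) an etcdctl v3 snapshot command that
# takes its endpoint from the variable instead of a hand-typed IP address.
snapshot_cmd() {
  printf 'etcdctl3 --endpoints %s snapshot save %s\n' "$1" "$2"
}

# The snapshot path below is an assumed location, not taken from this bug.
snapshot_cmd "$ETCD_EP" /var/lib/etcd/snapshot.db
```

Passing the variable keeps the step copy-pasteable during an outage, which is the point the reporter raised.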
####### Item 4: #######

> https://docs.openshift.com/container-platform/3.11/day_two_guide/host_level_tasks.html#restoring-etcd_deprecating-etcd
> If using static pods then the following step is not needed
> # systemctl restart etcd.service

Removed, since that section is specifically about static pods.

####### Item 5: #######

> https://docs.openshift.com/container-platform/3.11/day_two_guide/host_level_tasks.html#restoring-etcd-on-a-static-pod
> building etcd tools will fail if go is not available on host
> This piece was getting customers to build etcd tools direct on the servers, to do this they will need the "go" command installed otherwise it will not work.
> I am not a fan of getting customers to install compilers/extra tools and build things on production environments.
> I would suggest not asking customers to do this.

I believe installing go should be a prereq if building the etcd tools is required. Added to the note in this section.

####### Item 6: #######

> https://docs.openshift.com/container-platform/3.11/day_two_guide/host_level_tasks.html#restoring-etcd-v2-v3-data
> Clarify whether you do actually need to do most of the steps on all nodes simultaneously (or not) and if so make that clear and fix up the last statement which says:
> 9. After the first instance is running, you can restore the rest of your etcd servers.
> If it's not the case and you do one at time make that clear in the earlier steps.

To me, step 9 reads "do this on the first node, then restore the rest of your etcd servers". It says this in the introduction. No changes made.
####### Item 7: #######

> https://docs.openshift.com/container-platform/3.11/day_two_guide/host_level_tasks.html#restoring-etcd-v3-snapshot
> Add in notes about adding other members back in if you're recovering to an "existing cluster" or to go run scale if you're adding new ones from scratch
> https://docs.openshift.com/container-platform/3.11/day_two_guide/host_level_tasks.html#restoring-etcd-on-a-static-pod
> Add in notes about adding other members back in if you're recovering to an "existing cluster" or to go run scale if you're adding new ones from scratch

I added a step in those sections to "Repeat the previous steps for every etcd node to be added."

####### Item 8: #######

> Additional information:
> There is also probably duplication of this same info in other sections of the docs as well which would need cleaning up

These docs are modularized, meaning we can reference sections on another page and it's using the same content. Making these changes here will change them elsewhere in documentation, as well. This item is resolved.

############## CURRENT STATUS ##############

Pull request filed: https://github.com/openshift/openshift-docs/pull/16086

Changes merged; setting release pending.

These changes are now live: https://docs.openshift.com/container-platform/3.11/day_two_guide/host_level_tasks.html

I will close this bug at this time.