Bug 1405338 - [DOCS] Etcd restore procedure
Summary: [DOCS] Etcd restore procedure
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Documentation
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Ashley Hardin
QA Contact: Johnny Liu
Docs Contact: Vikram Goyal
URL:
Whiteboard:
Depends On: 1419670 1421072
Blocks:
 
Reported: 2016-12-16 08:59 UTC by Jaspreet Kaur
Modified: 2020-04-15 15:00 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-05-15 22:02:54 UTC
Target Upstream Version:
Embargoed:



Description Jaspreet Kaur 2016-12-16 08:59:39 UTC
Document URL: https://access.redhat.com/documentation/en/openshift-container-platform/3.3/single/installation-and-configuration/#downgrading-restoring-etcd


Section Number and Name: 6.7. Restoring etcd


Describe the issue: 

We run our own OCP 3.3 cluster on our infrastructure and needed a way to automate restoring the entire cluster if problems appear; however, the documented procedure seems to be wrong on many levels.

- there is no embedded etcd in an HA deployment, and the docs say nothing about it (note that etcd is external, as an etcd service is running on all nodes)
- the procedure to restore the external etcd cluster is unclear and does not work: adding the 2nd member fails because the config file still lists all the members, while at that point only one is up, so the config file needs to be updated for that (see the configuration sketch after this list)
- the embedded etcd procedure doesn't say to start the new cluster with only one peer and then add the others as you go; with all peers listed it fails randomly (it probably asks the peers round-robin and, when it reaches one that is down, fails with the message "cluster is unconfigured")
- as the engineers probably know, there is no way to add the second member of the cluster without telling it that the cluster is "existing"; otherwise it just boots with the old cluster ID and generates a ton of errors
- when adding nodes, only existing peers must be listed in the member add procedure; otherwise the configuration is destroyed, because the first node taken from the initial configuration listens on localhost and cannot expose the service to the other members
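
Roughly, this is what /etc/etcd/etcd.conf on the 2nd member needs to contain before it can join the restored cluster (master1/master2.example.com are placeholder hostnames; the values come from the member add output): ETCD_INITIAL_CLUSTER lists only the member that is already up plus the one being added, and ETCD_INITIAL_CLUSTER_STATE must be "existing":

    # /etc/etcd/etcd.conf on the 2nd master (placeholder hostnames)
    ETCD_NAME="master2.example.com"
    ETCD_INITIAL_CLUSTER="master1.example.com=https://master1.example.com:2380,master2.example.com=https://master2.example.com:2380"
    ETCD_INITIAL_CLUSTER_STATE="existing"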

This being said, the procedure that works is as follows (illustrative command sketches follow the list):

- shut down all OpenShift services on the masters
- shut down the etcd service on all OCP masters
- remove the /var/lib/etcd/member folder on all OCP masters (or back up the folder)
- restore etcd from backup on the 1st OCP master and boot it up with --force-new-cluster, but from the command line, not from the service
- update the member so that it no longer has localhost there, then kill the etcd process
- start the etcd service on the 1st OCP master
- add the 2nd member by mentioning only the 1st peer (1st master)
- edit the config file of the 2nd member (/etc/etcd/etcd.conf) with the info that member add provided (ETCD_NAME, ETCD_INITIAL_CLUSTER, and ETCD_INITIAL_CLUSTER_STATE) and start the etcd service; the cluster is then up and running with 2 nodes
- add the rest of the members by mentioning existing peers only, editing the config files on the rest of the members with the new ETCD_INITIAL_CLUSTER that you get with every member add, and starting etcd on them (I haven't tested adding multiple members at a time as the etcd documentation describes, but if you have, it would be useful to know)
- when the etcd cluster is up, the OCP services can be brought back online; I just want to mention here that the OCP cluster (the api, controllers and node services) starts up with just one etcd member with no problems and doesn't need the entire etcd cluster
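
Roughly, the first-master steps above look like this (the hostname master1.example.com, the backup location /var/lib/etcd.bak and the <member-id> are placeholders; the master service names are the ones our OCP 3.3 native-HA installs use):

    # on every master: stop the OpenShift and etcd services
    systemctl stop atomic-openshift-master-api atomic-openshift-master-controllers
    systemctl stop etcd
    mv /var/lib/etcd/member /var/lib/etcd/member.orig

    # on the 1st master: restore the backed-up member directory and force a new
    # single-member cluster from the command line, not via the service
    cp -a /var/lib/etcd.bak/member /var/lib/etcd/
    etcd --data-dir /var/lib/etcd --force-new-cluster

    # in a second shell: replace the localhost peer URL that --force-new-cluster
    # sets, then kill the ad-hoc etcd process and start the regular service
    etcdctl member list
    etcdctl member update <member-id> https://master1.example.com:2380
    pkill etcd
    systemctl start etcd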
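
Adding the 2nd member (and the rest) then looks roughly like this, again with placeholder hostnames:

    # on the 1st master: register the 2nd master with the running cluster
    etcdctl member add master2.example.com https://master2.example.com:2380
    # the output prints the ETCD_NAME, ETCD_INITIAL_CLUSTER and
    # ETCD_INITIAL_CLUSTER_STATE values to put into /etc/etcd/etcd.conf
    # on the new member (see the configuration sketch above)

    # on the 2nd master: edit /etc/etcd/etcd.conf accordingly, make sure no
    # stale data is left, then start the service
    rm -rf /var/lib/etcd/member
    systemctl start etcd

    # repeat member add / conf edit / service start for each remaining master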
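
And finally, verifying the cluster and bringing OpenShift back (service names again as on our 3.3 native-HA masters and nodes):

    etcdctl cluster-health

    # on each master, once etcd reports a healthy cluster
    systemctl start atomic-openshift-master-api atomic-openshift-master-controllers
    # on each node
    systemctl start atomic-openshift-node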
 
I need feedback on the procedure and would also like the OpenShift documentation to be clear about these things.


Suggestions for improvement: 

Additional information:

Comment 17 Ashley Hardin 2017-05-10 17:01:33 UTC
Verified by Gaoyun Pei in the PR, so moving to verified

