Bug 1405338 - [DOCS] Etcd restore procedure
Summary: [DOCS] Etcd restore procedure
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Documentation
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Ashley Hardin
QA Contact: Johnny Liu
Docs Contact: Vikram Goyal
URL:
Whiteboard:
Depends On: 1419670 1421072
Blocks:
 
Reported: 2016-12-16 08:59 UTC by Jaspreet Kaur
Modified: 2020-04-15 15:00 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-05-15 22:02:54 UTC
Target Upstream Version:
Embargoed:



Description Jaspreet Kaur 2016-12-16 08:59:39 UTC
Document URL: https://access.redhat.com/documentation/en/openshift-container-platform/3.3/single/installation-and-configuration/#downgrading-restoring-etcd


Section Number and Name: 6.7. Restoring etcd


Describe the issue: 

We run our own OCP 3.3 cluster on our infrastructure and needed a way to automate restoring the entire cluster if problems appear; however, the documented procedure seems to be wrong on many levels.

- there is no embedded etcd in an HA deployment, and the docs say nothing about it (note that etcd is external, as an etcd service is running on all nodes)
- the procedure to restore the external etcd cluster is unclear and does not work: adding the 2nd member fails because the config file still lists all the members, while at that point only one is up, so the config file needs to be updated for that (see the configuration sketch after this list)
- the embedded etcd procedure doesn't say to start the new cluster with only one peer and then add the others as you go; with all peers listed it fails randomly (it probably asks the peers round-robin and, when it reaches one that is down, fails with the message "cluster is unconfigured")
- as the engineers probably know, there is no way to add the second member of the cluster without telling it that the cluster is "existing"; otherwise it just boots with the old cluster ID and generates a ton of errors
- when adding nodes, only existing peers must be listed in the member add procedure; otherwise the configuration is destroyed, because the first node taken from the initial configuration listens on localhost and cannot expose the service to the other members
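
Roughly, this is what /etc/etcd/etcd.conf on the 2nd member needs to contain before it can join the restored cluster (master1/master2.example.com are placeholder hostnames; the values come from the member add output): ETCD_INITIAL_CLUSTER lists only the member that is already up plus the one being added, and ETCD_INITIAL_CLUSTER_STATE must be "existing":

    # /etc/etcd/etcd.conf on the 2nd master (placeholder hostnames)
    ETCD_NAME="master2.example.com"
    ETCD_INITIAL_CLUSTER="master1.example.com=https://master1.example.com:2380,master2.example.com=https://master2.example.com:2380"
    ETCD_INITIAL_CLUSTER_STATE="existing"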

This being said, the procedure that works is as follows (illustrative command sketches follow the list):

- shut down all OpenShift services on the masters
- shut down the etcd service on all OCP masters
- remove the /var/lib/etcd/member folder on all OCP masters (or back up the folder)
- restore etcd from backup on the 1st OCP master and boot it up with --force-new-cluster, but from the command line, not from the service
- update the member so that it no longer has localhost there, then kill the etcd process
- start the etcd service on the 1st OCP master
- add the 2nd member by mentioning only the 1st peer (1st master)
- edit the config file of the 2nd member (/etc/etcd/etcd.conf) with the info that member add provided (ETCD_NAME, ETCD_INITIAL_CLUSTER, and ETCD_INITIAL_CLUSTER_STATE) and start the etcd service; the cluster is then up and running with 2 nodes
- add the rest of the members by mentioning existing peers only, editing the config files on the rest of the members with the new ETCD_INITIAL_CLUSTER that you get with every member add, and starting etcd on them (I haven't tested adding multiple members at a time as the etcd documentation describes, but if you have, it would be useful to know)
- when the etcd cluster is up, the OCP services can be brought back online; I just want to mention here that the OCP cluster (the api, controllers and node services) starts up with just one etcd member with no problems and doesn't need the entire etcd cluster
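
Roughly, the first-master steps above look like this (the hostname master1.example.com, the backup location /var/lib/etcd.bak and the <member-id> are placeholders; the master service names are the ones our OCP 3.3 native-HA installs use):

    # on every master: stop the OpenShift and etcd services
    systemctl stop atomic-openshift-master-api atomic-openshift-master-controllers
    systemctl stop etcd
    mv /var/lib/etcd/member /var/lib/etcd/member.orig

    # on the 1st master: restore the backed-up member directory and force a new
    # single-member cluster from the command line, not via the service
    cp -a /var/lib/etcd.bak/member /var/lib/etcd/
    etcd --data-dir /var/lib/etcd --force-new-cluster

    # in a second shell: replace the localhost peer URL that --force-new-cluster
    # sets, then kill the ad-hoc etcd process and start the regular service
    etcdctl member list
    etcdctl member update <member-id> https://master1.example.com:2380
    pkill etcd
    systemctl start etcd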
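
Adding the 2nd member (and the rest) then looks roughly like this, again with placeholder hostnames:

    # on the 1st master: register the 2nd master with the running cluster
    etcdctl member add master2.example.com https://master2.example.com:2380
    # the output prints the ETCD_NAME, ETCD_INITIAL_CLUSTER and
    # ETCD_INITIAL_CLUSTER_STATE values to put into /etc/etcd/etcd.conf
    # on the new member (see the configuration sketch above)

    # on the 2nd master: edit /etc/etcd/etcd.conf accordingly, make sure no
    # stale data is left, then start the service
    rm -rf /var/lib/etcd/member
    systemctl start etcd

    # repeat member add / conf edit / service start for each remaining master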
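
And finally, verifying the cluster and bringing OpenShift back (service names again as on our 3.3 native-HA masters and nodes):

    etcdctl cluster-health

    # on each master, once etcd reports a healthy cluster
    systemctl start atomic-openshift-master-api atomic-openshift-master-controllers
    # on each node
    systemctl start atomic-openshift-node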
 
I need feedback on the procedure and would also like the OpenShift documentation to be clear about these things.


Suggestions for improvement: 

Additional information:

Comment 17 Ashley Hardin 2017-05-10 17:01:33 UTC
Verified by Gaoyun Pei in the PR, so moving to verified

