Bug 1673705

Summary:	Single ETCD Recovery is not possible because of discovery option
Product:	OpenShift Container Platform	Reporter:	jooho lee <jlee>
Component:	Master	Assignee:	Sam Batschelet <sbatsche>
Status:	CLOSED CURRENTRELEASE	QA Contact:	ge liu <geliu>
Severity:	high	Docs Contact:
Priority:	urgent
Version:	4.1.0	CC:	aos-bugs, erich, jokerman, mfojtik, mifiedle, mmccomas, xxia
Target Milestone:	---
Target Release:	4.1.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2019-04-03 15:02:18 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1664187

Description jooho lee 2019-02-07 20:00:32 UTC

Description of problem:

Testing ETCD recovery on OCP4.

To break one of the etcd members, I removed the /var/lib/etcd folder, after which the pod started crashing repeatedly.

To make the stale member start, I added the following:

```
echo "ETCD_FORCE_NEW_CLUSTER=true" >> /run/etcd/environment
echo "ETCD_INITIAL_CLUSTER=etcd-member-ip-10-0-23-78.us-east-2.compute.internal=https://ocp4-etcd-1.jlee.rhcee.support:2380"  >> /run/etcd/environment
echo "ETCD_INITIAL_CLUSTER_STATE=new"  >> /run/etcd/environment
```

However, the etcd container still finds the other members because of the discovery-srv option:
```
     --discovery-srv ocp4.jlee.rhcee.support 
```
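
For context, etcd's DNS discovery resolves SRV records under the domain passed to --discovery-srv, which would explain why the member keeps learning about its peers regardless of ETCD_INITIAL_CLUSTER. A minimal sketch that lists the record names to check, assuming `dig` is available wherever the lookup is run (the helper name `etcd_srv_names` is hypothetical and not part of etcd or OpenShift):

```shell
#!/bin/sh
# Print the SRV record names that etcd DNS discovery looks up for a domain.
etcd_srv_names() {
    domain="$1"
    # One record set for TLS peers, one for plain-text peers.
    printf '%s\n' "_etcd-server-ssl._tcp.${domain}" "_etcd-server._tcp.${domain}"
}

# Print the dig commands instead of running them, so the sketch has no
# network dependency; copy a line into a shell to perform the lookup.
for name in $(etcd_srv_names "ocp4.jlee.rhcee.support"); do
    echo "dig +short SRV ${name}"
done
```

If the SRV lookup returns all three members, the recovering member will still discover them at startup.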

According to this etcd issue (https://github.com/etcd-io/etcd/issues/8585), it is not possible to recover an etcd member when public discovery is in use.


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Remove the /var/lib/etcd folder.
2. Remove the stale member from the etcd cluster.
3. Add the following to /run/etcd/environment:
```
echo "ETCD_FORCE_NEW_CLUSTER=true" >> /run/etcd/environment
echo "ETCD_INITIAL_CLUSTER=etcd-member-ip-10-0-23-78.us-east-2.compute.internal=https://ocp4-etcd-1.jlee.rhcee.support:2380"  >> /run/etcd/environment
echo "ETCD_INITIAL_CLUSTER_STATE=new"  >> /run/etcd/environment
```
4. rm /var/lib/etcd/*
5. chown -R etcd:etcd /var/lib/etcd 
6. restorecon -Rv /var/lib/etcd 
7. oc delete pod $STALE_ETCD_POD  -n kube-system 
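
Note that step 3 appends with `>>`, so repeating the reproduction leaves duplicate entries in /run/etcd/environment. A rough sketch of an idempotent alternative follows; `set_env_var` is a hypothetical helper, not something shipped with OpenShift, and the demo writes to a scratch file rather than the real environment file. Because it rewrites the file through a temporary copy, ownership and SELinux labels may need to be checked afterwards on a real host.

```shell
#!/bin/sh
# set_env_var FILE KEY VALUE: drop any existing KEY= line, then append KEY=VALUE.
set_env_var() {
    file="$1"; key="$2"; value="$3"
    touch "$file"
    # Keep every line that does not already define KEY
    # (grep exits non-zero when nothing is left, hence the || true).
    grep -v "^${key}=" "$file" > "${file}.tmp" || true
    echo "${key}=${value}" >> "${file}.tmp"
    mv "${file}.tmp" "$file"
}

# Demo against a scratch file.
env_file="$(mktemp)"
set_env_var "$env_file" ETCD_FORCE_NEW_CLUSTER true
set_env_var "$env_file" ETCD_INITIAL_CLUSTER_STATE new
set_env_var "$env_file" ETCD_INITIAL_CLUSTER_STATE new   # rerun: no duplicate line
cat "$env_file"
```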


Actual results:
The etcd pod keeps crashing.

Expected results:
The etcd member starts up in a new cluster.

Additional info: