Bug 1673705

Summary: Single ETCD Recovery is not possible because of discovery option
Product: OpenShift Container Platform Reporter: jooho lee <jlee>
Component: Master Assignee: Sam Batschelet <sbatsche>
Status: CLOSED CURRENTRELEASE QA Contact: ge liu <geliu>
Severity: high Docs Contact:
Priority: urgent    
Version: 4.1.0 CC: aos-bugs, erich, jokerman, mfojtik, mifiedle, mmccomas, xxia
Target Milestone: ---   
Target Release: 4.1.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-04-03 15:02:18 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1664187    

Description jooho lee 2019-02-07 20:00:32 UTC
Description of problem:

Testing ETCD recovery on OCP4.

To break one of the etcd members, I removed the /var/lib/etcd folder, after which the pod started crash-looping.

To get the stale member to start, I added the following to /run/etcd/environment:

```
echo "ETCD_FORCE_NEW_CLUSTER=true" >> /run/etcd/environment
echo "ETCD_INITIAL_CLUSTER=etcd-member-ip-10-0-23-78.us-east-2.compute.internal=https://ocp4-etcd-1.jlee.rhcee.support:2380"  >> /run/etcd/environment
echo "ETCD_INITIAL_CLUSTER_STATE=new"  >> /run/etcd/environment
```
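
Because `>>` appends, re-running the commands above leaves duplicate entries in the file. A minimal sketch of an idempotent alternative follows, assuming the file is a plain list of `KEY=value` lines; `set_env_var` is a hypothetical helper name, not part of etcd or OpenShift.

```shell
#!/bin/sh
# Hypothetical helper: set KEY=value in an env file,
# replacing any existing line for the same key instead of appending a duplicate.
set_env_var() {
    file="$1"
    key="$2"
    value="$3"
    tmp="$(mktemp)" || return 1
    # Keep every existing line that does not already define this key.
    if [ -f "$file" ]; then
        grep -v "^${key}=" "$file" > "$tmp"
    fi
    printf '%s=%s\n' "$key" "$value" >> "$tmp"
    # Copy back (rather than mv) so the target keeps its owner and mode.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Example usage with the path from this report:
#   set_env_var /run/etcd/environment ETCD_FORCE_NEW_CLUSTER true
#   set_env_var /run/etcd/environment ETCD_INITIAL_CLUSTER_STATE new
```

The trailing `=` in the match pattern keeps `ETCD_INITIAL_CLUSTER` from clobbering `ETCD_INITIAL_CLUSTER_STATE`.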

However, the etcd container still finds the other members because of the --discovery-srv option:
```
     --discovery-srv ocp4.jlee.rhcee.support 
```

According to this upstream issue (https://github.com/etcd-io/etcd/issues/8585), it is not possible to recover an etcd member while using public discovery.
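
To see which peers the member can find through discovery, it may help to inspect the DNS SRV records directly. With `--discovery-srv <domain>`, etcd DNS discovery looks up `_etcd-server-ssl._tcp.<domain>` and `_etcd-server._tcp.<domain>`. The sketch below only builds those names and optionally resolves them; the function names are hypothetical, and the lookup assumes `dig` is installed and can reach the cluster DNS.

```shell
#!/bin/sh
# Print the SRV record names that etcd DNS discovery queries for a domain.
etcd_srv_names() {
    domain="$1"
    printf '_etcd-server-ssl._tcp.%s\n' "$domain"
    printf '_etcd-server._tcp.%s\n' "$domain"
}

# Resolve each record with dig (needs dig and network access to the cluster DNS).
lookup_etcd_srv() {
    for name in $(etcd_srv_names "$1"); do
        echo "== $name"
        dig +short SRV "$name"
    done
}

# Example, using the domain from the --discovery-srv flag in this report:
#   lookup_etcd_srv ocp4.jlee.rhcee.support
```

If the records still list all of the original peers, the restarted member will keep discovering them regardless of the values set in the environment file.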


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Remove the /var/lib/etcd folder.
2. Remove the stale member from the etcd cluster.
3. Add the following to /run/etcd/environment:
```
echo "ETCD_FORCE_NEW_CLUSTER=true" >> /run/etcd/environment
echo "ETCD_INITIAL_CLUSTER=etcd-member-ip-10-0-23-78.us-east-2.compute.internal=https://ocp4-etcd-1.jlee.rhcee.support:2380"  >> /run/etcd/environment
echo "ETCD_INITIAL_CLUSTER_STATE=new"  >> /run/etcd/environment
```
4. rm /var/lib/etcd/*
5. chown -R etcd:etcd /var/lib/etcd 
6. restorecon -Rv /var/lib/etcd 
7. oc delete pod $STALE_ETCD_POD  -n kube-system 
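
For step 2, the stale member's ID has to be looked up before it can be removed. A minimal sketch, assuming the comma-separated output of `etcdctl member list` under the v3 API (ID, status, name, peer URLs, client URLs); `member_id_by_name` and `STALE_MEMBER_NAME` are hypothetical names, and the endpoint and TLS flags are omitted because they depend on the cluster.

```shell
#!/bin/sh
# Hypothetical helper: read `etcdctl member list` output on stdin
# and print the ID of the member whose name matches the first argument.
member_id_by_name() {
    awk -F', ' -v name="$1" '$3 == name { print $1 }'
}

# Example usage (endpoint and certificate flags omitted):
#   id="$(ETCDCTL_API=3 etcdctl member list | member_id_by_name "$STALE_MEMBER_NAME")"
#   ETCDCTL_API=3 etcdctl member remove "$id"
```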


Actual results:
The etcd pod keeps crashing.

Expected results:
Start up in a new cluster.

Additional info: