1673705 – Single ETCD Recovery is not possible because of discovery option

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Master
Sub Component:
Version:	4.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	high
Target Milestone:	---
Target Release:	4.1.0
Assignee:	Sam Batschelet
QA Contact:	ge liu
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1664187

Reported:	2019-02-07 20:00 UTC by jooho lee
Modified:	2019-04-03 15:02 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-04-03 15:02:18 UTC
Target Upstream Version:
Embargoed:

Description jooho lee 2019-02-07 20:00:32 UTC

Description of problem:

I am testing etcd recovery on OCP 4.

To break one of the etcd members, I removed the /var/lib/etcd folder, after which the pod started crash-looping.

To get the stale member to start, I added the following to /run/etcd/environment:

```
echo "ETCD_FORCE_NEW_CLUSTER=true" >> /run/etcd/environment
echo "ETCD_INITIAL_CLUSTER=etcd-member-ip-10-0-23-78.us-east-2.compute.internal=https://ocp4-etcd-1.jlee.rhcee.support:2380"  >> /run/etcd/environment
echo "ETCD_INITIAL_CLUSTER_STATE=new"  >> /run/etcd/environment
```

However, the etcd container still finds the other members because of the discovery-srv option:
```
     --discovery-srv ocp4.jlee.rhcee.support 
```

According to this etcd issue (https://github.com/etcd-io/etcd/issues/8585), it is not possible to recover an etcd member when public discovery is used.
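
To see which peers the member would still discover, the SRV records behind --discovery-srv can be inspected directly. Below is a minimal sketch, assuming `dig` is installed and the cluster DNS is reachable. The helper names `etcd_srv_names` and `lookup_etcd_srv` are hypothetical; the record names follow etcd's documented DNS discovery lookup order (TLS name first, then the plain name).

```shell
#!/bin/sh
# Hypothetical helper: print the SRV record names that etcd looks up for a
# given --discovery-srv domain, in lookup order.
etcd_srv_names() {
    domain="$1"
    printf '%s\n' \
        "_etcd-server-ssl._tcp.${domain}" \
        "_etcd-server._tcp.${domain}"
}

# Hypothetical helper: resolve each SRV name with dig and print the answers.
# Needs dig and network access to the DNS server that serves the cluster.
lookup_etcd_srv() {
    for name in $(etcd_srv_names "$1"); do
        echo "== ${name}"
        dig +short SRV "${name}" || true
    done
}

# Example, using the domain from this report:
#   lookup_etcd_srv ocp4.jlee.rhcee.support
```

If the lookup still returns all of the original peers, that would be consistent with the member finding the other members regardless of the ETCD_INITIAL_CLUSTER override.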


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Remove the /var/lib/etcd folder.
2. Remove the stale member from the etcd cluster.
3. Add the following to /run/etcd/environment:
```
echo "ETCD_FORCE_NEW_CLUSTER=true" >> /run/etcd/environment
echo "ETCD_INITIAL_CLUSTER=etcd-member-ip-10-0-23-78.us-east-2.compute.internal=https://ocp4-etcd-1.jlee.rhcee.support:2380"  >> /run/etcd/environment
echo "ETCD_INITIAL_CLUSTER_STATE=new"  >> /run/etcd/environment
```
4. rm /var/lib/etcd/*
5. chown -R etcd:etcd /var/lib/etcd 
6. restorecon -Rv /var/lib/etcd 
7. oc delete pod $STALE_ETCD_POD  -n kube-system 
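
Because step 3 appends with `>>`, repeating the steps leaves duplicate entries in the environment file. A minimal sketch of setting each variable exactly once is shown below. The helper name `set_env_var` is hypothetical and not part of any OpenShift or etcd tooling, and the sketch only edits the file; it does not address the discovery behaviour reported here.

```shell
#!/bin/sh
# Hypothetical helper: set KEY=VALUE in an environment file, replacing any
# existing line for KEY so that repeated runs do not pile up duplicates.
set_env_var() {
    file="$1"; key="$2"; value="$3"
    touch "$file"
    # Keep every line that does not define KEY (grep exits 1 if none match).
    grep -v "^${key}=" "$file" > "${file}.tmp" || true
    # Append the new definition, then swap the rewritten file into place.
    printf '%s=%s\n' "$key" "$value" >> "${file}.tmp"
    mv "${file}.tmp" "$file"
}

# Example, mirroring step 3:
#   ENV_FILE=/run/etcd/environment
#   set_env_var "$ENV_FILE" ETCD_FORCE_NEW_CLUSTER true
#   set_env_var "$ENV_FILE" ETCD_INITIAL_CLUSTER_STATE new
```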


Actual results:
The etcd pod keeps crashing.

Expected results:
The etcd member starts up in a new cluster.

Additional info: