Bug 1673705 - Single ETCD Recovery is not possible because of discovery option
Summary: Single ETCD Recovery is not possible because of discovery option
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Master
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.1.0
Assignee: Sam Batschelet
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks: 1664187
Reported: 2019-02-07 20:00 UTC by jooho lee
Modified: 2019-04-03 15:02 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-04-03 15:02:18 UTC
Target Upstream Version:
Embargoed:



Description jooho lee 2019-02-07 20:00:32 UTC
Description of problem:

Testing ETCD recovery on OCP4.

To break one of the ETCD members, I removed the /var/lib/etcd folder, after which the pod started crashing repeatedly.

To get the stale member to start, I added the following:

```
echo "ETCD_FORCE_NEW_CLUSTER=true" >> /run/etcd/environment
echo "ETCD_INITIAL_CLUSTER=etcd-member-ip-10-0-23-78.us-east-2.compute.internal=https://ocp4-etcd-1.jlee.rhcee.support:2380"  >> /run/etcd/environment
echo "ETCD_INITIAL_CLUSTER_STATE=new"  >> /run/etcd/environment
```

However, the ETCD container still finds the other members because of the discovery-srv option:
```
     --discovery-srv ocp4.jlee.rhcee.support 
```
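
For context, etcd's DNS discovery resolves SRV records under the domain passed to `--discovery-srv`; the etcd clustering documentation names them `_etcd-server-ssl._tcp.<domain>` and `_etcd-server._tcp.<domain>`. The following is a minimal sketch for listing the record names to inspect on the affected cluster. The helper name `etcd_srv_names` is hypothetical, and the `dig` lookup is left commented out because it needs access to the cluster's DNS.

```shell
#!/bin/sh
# Print the SRV record names that etcd DNS discovery queries for a domain.
# etcd_srv_names is a hypothetical helper, not part of etcd or OpenShift.
etcd_srv_names() {
    domain="$1"
    printf '%s\n' "_etcd-server-ssl._tcp.${domain}" "_etcd-server._tcp.${domain}"
}

etcd_srv_names ocp4.jlee.rhcee.support

# To see which members discovery would return (requires network access):
# for name in $(etcd_srv_names ocp4.jlee.rhcee.support); do
#     dig +short SRV "$name"
# done
```

If the stale member's peer URL still appears in these records alongside the other members, that would be consistent with the behavior described here.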

According to https://github.com/etcd-io/etcd/issues/8585, it is not possible to recover an ETCD member using public discovery.


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Remove /var/lib/etcd folder
2. Remove the stale member from ETCD cluster.
3. Add the following to /run/etcd/environment:
```
echo "ETCD_FORCE_NEW_CLUSTER=true" >> /run/etcd/environment
echo "ETCD_INITIAL_CLUSTER=etcd-member-ip-10-0-23-78.us-east-2.compute.internal=https://ocp4-etcd-1.jlee.rhcee.support:2380"  >> /run/etcd/environment
echo "ETCD_INITIAL_CLUSTER_STATE=new"  >> /run/etcd/environment
```
4. rm /var/lib/etcd/*
5. chown -R etcd:etcd /var/lib/etcd 
6. restorecon -Rv /var/lib/etcd 
7. oc delete pod $STALE_ETCD_POD  -n kube-system 
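
Because step 3 appends with `>>`, repeating the reproduction leaves duplicate entries in the environment file. A minimal sketch of a helper that writes each key only once is shown below; `set_env_var` is a hypothetical name, and the demonstration runs against a scratch file from `mktemp` rather than /run/etcd/environment.

```shell
#!/bin/sh
# set_env_var FILE KEY VALUE
# Drop any existing KEY= line from FILE, then append KEY=VALUE.
# Hypothetical helper for illustration; not part of etcd or OpenShift.
set_env_var() {
    file="$1"; key="$2"; value="$3"
    touch "$file"
    tmp="$(mktemp)"
    # grep -v exits non-zero when nothing is left to print, so tolerate that.
    grep -v "^${key}=" "$file" > "$tmp" || true
    printf '%s=%s\n' "$key" "$value" >> "$tmp"
    # Rewrite the file in place so its ownership and labels stay as they were.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Demonstration on a scratch file (assumption: a real run would target
# /run/etcd/environment as in step 3).
env_file="$(mktemp)"
set_env_var "$env_file" ETCD_FORCE_NEW_CLUSTER true
set_env_var "$env_file" ETCD_INITIAL_CLUSTER_STATE new
set_env_var "$env_file" ETCD_INITIAL_CLUSTER_STATE new   # repeat: no duplicate line
cat "$env_file"
```

The trailing `=` in the match pattern keeps `ETCD_INITIAL_CLUSTER` and `ETCD_INITIAL_CLUSTER_STATE` from being treated as the same key.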


Actual results:
The ETCD pod keeps crashing.

Expected results:
The ETCD member starts up in a new cluster.

Additional info:

