Bug 1678921

Summary: Master instance is deleted, new master is created with etcd pod not able to join cluster.
Product: OpenShift Container Platform Reporter: Ryan Howe <rhowe>
Component: Master Assignee: Michal Fojtik <mfojtik>
Status: CLOSED DUPLICATE QA Contact: Xingxing Xia <xxia>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 4.1.0 CC: aos-bugs, decarr, jokerman, mmccomas
Target Milestone: ---   
Target Release: 4.1.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-03-07 12:17:58 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Ryan Howe 2019-02-19 21:44:19 UTC
Description of problem:

In an OCP 4.0 install, when one of the master instances is deleted, a new instance is created automatically. The etcd static pod created on the new master is not added to the existing etcd cluster, and the etcd member of the deleted master is never removed from the cluster.


Version-Release number of selected component (if applicable):
4.0  v0.12 installer

How reproducible:
100%

Steps to Reproduce:
1. Install cluster 
2. Delete a master ec2 instance 
aws ec2 terminate-instances --region us-west-2 --instance-ids i-007f05ba505734630
3. Check the nodes and confirm that a new master instance was created and started.

Actual results:
 The etcd member started on the new master fails to start up and is never added to the existing cluster.

Expected results:
 The new etcd member is added to the existing cluster, and the old etcd member is removed from the cluster.

Additional info:

# oc get pod --all-namespaces | grep etcd
kube-system                                  etcd-member-ip-10-0-15-225.us-west-2.compute.internal                         1/1       Running     0          1d
kube-system                                  etcd-member-ip-10-0-23-205.us-west-2.compute.internal                         1/1       Running     0          1d
kube-system                                  etcd-member-ip-10-0-38-249.us-west-2.compute.internal                         1/1       Running     0          1d

# oc get nodes 
NAME                                         STATUS    ROLES     AGE       VERSION
ip-10-0-142-75.us-west-2.compute.internal    Ready     worker    1d        v1.11.0+406fc897d8
ip-10-0-15-225.us-west-2.compute.internal    Ready     master    1d        v1.11.0+406fc897d8
ip-10-0-155-141.us-west-2.compute.internal   Ready     worker    1d        v1.11.0+406fc897d8
ip-10-0-166-138.us-west-2.compute.internal   Ready     worker    1d        v1.11.0+406fc897d8
ip-10-0-23-205.us-west-2.compute.internal    Ready     master    1d        v1.11.0+406fc897d8
ip-10-0-38-249.us-west-2.compute.internal    Ready     master    1d        v1.11.0+406fc897d8

Delete instance "ip-10-0-23-205.us-west-2.compute.internal"

# oc get pod --all-namespaces | grep etcd
kube-system                                  etcd-member-ip-10-0-15-225.us-west-2.compute.internal                         1/1       Running     0          1d
kube-system                                  etcd-member-ip-10-0-23-155.us-west-2.compute.internal                         0/1       Init:0/2    5          38m
kube-system                                  etcd-member-ip-10-0-38-249.us-west-2.compute.internal                         1/1       Running     0          1d

# oc get nodes 
NAME                                         STATUS    ROLES     AGE       VERSION
ip-10-0-142-75.us-west-2.compute.internal    Ready     worker    1d        v1.11.0+406fc897d8
ip-10-0-15-225.us-west-2.compute.internal    Ready     master    1d        v1.11.0+406fc897d8
ip-10-0-155-141.us-west-2.compute.internal   Ready     worker    1d        v1.11.0+406fc897d8
ip-10-0-166-138.us-west-2.compute.internal   Ready     worker    1d        v1.11.0+406fc897d8
ip-10-0-23-155.us-west-2.compute.internal    Ready     master    39m       v1.11.0+406fc897d8
ip-10-0-38-249.us-west-2.compute.internal    Ready     master    1d        v1.11.0+406fc897d8


# oc rsh etcd-member-ip-10-0-15-225.us-west-2.compute.internal
# ETCDCTL_API=3 etcdctl --cert=/peer.crt --key=/peer.key --cacert=/etc/ssl/etcd/ca.crt --endpoints https://localhost:2379 member list
6852a310452cfe52, started, etcd-member-ip-10-0-15-225.us-west-2.compute.internal, https://rtest-etcd-0.test.redhat.com:2380, https://10.0.15.225:2379
68f9eddcc9186b35, started, etcd-member-ip-10-0-23-205.us-west-2.compute.internal, https://rtest-etcd-1.test.redhat.com:2380, https://10.0.23.205:2379
d764bbf70b8f188f, started, etcd-member-ip-10-0-38-249.us-west-2.compute.internal, https://rtest-etcd-2.test.redhat.com:2380, https://10.0.38.249:2379
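
The mismatch between the member list above and the current masters can be checked mechanically. The following is a minimal Python sketch, not part of any OpenShift or etcd tooling. It assumes the `member list` output format shown in this report (five comma-separated fields) and assumes that the node IP can be derived from the AWS-style node name (`ip-A-B-C-D.<region>...`). The function names `node_ip`, `parse_members`, `stale_members`, and `masters_without_member` are hypothetical helpers introduced only for illustration.

```python
from urllib.parse import urlparse

# Output of `etcdctl member list`, copied from this report.
MEMBER_LIST = """\
6852a310452cfe52, started, etcd-member-ip-10-0-15-225.us-west-2.compute.internal, https://rtest-etcd-0.test.redhat.com:2380, https://10.0.15.225:2379
68f9eddcc9186b35, started, etcd-member-ip-10-0-23-205.us-west-2.compute.internal, https://rtest-etcd-1.test.redhat.com:2380, https://10.0.23.205:2379
d764bbf70b8f188f, started, etcd-member-ip-10-0-38-249.us-west-2.compute.internal, https://rtest-etcd-2.test.redhat.com:2380, https://10.0.38.249:2379
"""

# Master node names from the second `oc get nodes` listing in this report.
CURRENT_MASTERS = [
    "ip-10-0-15-225.us-west-2.compute.internal",
    "ip-10-0-23-155.us-west-2.compute.internal",
    "ip-10-0-38-249.us-west-2.compute.internal",
]


def node_ip(node_name):
    """Derive the IP from a node name (assumption: AWS private DNS naming)."""
    host = node_name.split(".", 1)[0]       # "ip-10-0-15-225"
    return ".".join(host.split("-")[1:5])   # "10.0.15.225"


def parse_members(text):
    """Parse `member list` lines into dicts; lines without 5 fields are skipped."""
    members = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 5:
            continue
        member_id, status, name, peer_url, client_url = fields
        members.append({
            "id": member_id,
            "status": status,
            "name": name,
            "peer_url": peer_url,
            "client_url": client_url,
        })
    return members


def stale_members(members, master_names):
    """Members whose client URL does not point at any current master."""
    live_ips = {node_ip(name) for name in master_names}
    return [m for m in members
            if urlparse(m["client_url"]).hostname not in live_ips]


def masters_without_member(members, master_names):
    """Current masters that have no registered etcd member."""
    member_ips = {urlparse(m["client_url"]).hostname for m in members}
    return [name for name in master_names if node_ip(name) not in member_ips]


if __name__ == "__main__":
    members = parse_members(MEMBER_LIST)
    for m in stale_members(members, CURRENT_MASTERS):
        print("stale member:", m["id"], m["name"])
    for name in masters_without_member(members, CURRENT_MASTERS):
        print("master without etcd member:", name)
```

With the data from this report, the sketch flags member 68f9eddcc9186b35 (the deleted master 10.0.23.205) as stale and reports that the replacement master ip-10-0-23-155 has no etcd member registered.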


# ETCDCTL_API=3 etcdctl --cert=/peer.crt --key=/peer.key --cacert=/etc/ssl/etcd/ca.crt --endpoints https://10.0.38.249:2379,https://10.0.23.205:2379,https://10.0.15.225:2379,https://10.0.23.155:2379 endpoint status --write-out=table
Failed to get the status of endpoint https://10.0.23.205:2379 (context deadline exceeded)
Failed to get the status of endpoint https://10.0.23.155:2379 (context deadline exceeded)
+--------------------------+------------------+---------+---------+-----------+-----------+------------+
|         ENDPOINT         |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+--------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://10.0.38.249:2379 | d764bbf70b8f188f |  3.3.10 |   88 MB |     false |        66 |    1924899 |
| https://10.0.15.225:2379 | 6852a310452cfe52 |  3.3.10 |   88 MB |      true |        66 |    1924979 |
+--------------------------+------------------+---------+---------+-----------+-----------+------------+
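
The endpoint status above shows two reachable members while three are still registered. A short Python sketch of the majority-quorum arithmetic illustrates why this matters. It assumes only the general rule that etcd needs a majority of registered members to make progress; the helper names `quorum` and `failures_tolerated` are hypothetical and not part of etcd.

```python
def quorum(member_count):
    """Majority of the registered members: floor(n / 2) + 1."""
    return member_count // 2 + 1


def failures_tolerated(member_count, reachable):
    """How many more members can fail before quorum is lost.

    A negative result means quorum is already lost.
    """
    return reachable - quorum(member_count)


if __name__ == "__main__":
    # Values from this report: 3 registered members, 2 responding.
    registered, reachable = 3, 2
    print("quorum:", quorum(registered))
    print("further failures tolerated:", failures_tolerated(registered, reachable))
```

With 3 registered members and 2 reachable, quorum is 2 and no further failure can be tolerated. Under the same rule, adding the new master as a fourth member while the stale one stays registered would raise quorum to 3, so the stale member would likely have to be removed before the replacement is added.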

Comment 2 Xingxing Xia 2019-02-21 03:00:28 UTC
Sounds similar to bug 1667557?

Comment 3 Michal Fojtik 2019-03-07 12:17:58 UTC

*** This bug has been marked as a duplicate of bug 1667557 ***

Comment 4 Red Hat Bugzilla 2023-09-14 05:24:03 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days