Bug 1678921 - Master instance is deleted, new master is created with etcd pod not able to join cluster.
Status: CLOSED DUPLICATE of bug 1667557
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Master
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.1.0
Assignee: Michal Fojtik
QA Contact: Xingxing Xia
 
Reported: 2019-02-19 21:44 UTC by Ryan Howe
Modified: 2023-09-14 05:24 UTC (History)

Last Closed: 2019-03-07 12:17:58 UTC



Description Ryan Howe 2019-02-19 21:44:19 UTC
Description of problem:

In an OCP 4.0 install, when one of the master instances is deleted, a replacement instance is created automatically. However, the etcd static pod that starts on the new master never joins the existing etcd cluster, and the etcd member belonging to the deleted master is never removed from the cluster.


Version-Release number of selected component (if applicable):
4.0 (installer v0.12)

How reproducible:
100%

Steps to Reproduce:
1. Install a cluster.
2. Delete a master EC2 instance:
aws ec2 terminate-instances --region us-west-2 --instance-ids i-007f05ba505734630
3. Check the nodes and verify that a replacement master instance was created (see the example below).
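A minimal way to watch for the replacement node coming up (a sketch; the label selector assumes the standard master node-role label):
# oc get nodes -l node-role.kubernetes.io/master -w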

Actual results:
 The etcd member started on the new master fails during startup and is never added to the existing cluster.

Expected results:
 The new etcd member should be added to the existing cluster, and the old etcd member should be removed from the cluster.

Additional info:

# oc get pod --all-namespaces | grep etcd
kube-system                                  etcd-member-ip-10-0-15-225.us-west-2.compute.internal                         1/1       Running     0          1d
kube-system                                  etcd-member-ip-10-0-23-205.us-west-2.compute.internal                         1/1       Running     0          1d
kube-system                                  etcd-member-ip-10-0-38-249.us-west-2.compute.internal                         1/1       Running     0          1d

# oc get nodes 
NAME                                         STATUS    ROLES     AGE       VERSION
ip-10-0-142-75.us-west-2.compute.internal    Ready     worker    1d        v1.11.0+406fc897d8
ip-10-0-15-225.us-west-2.compute.internal    Ready     master    1d        v1.11.0+406fc897d8
ip-10-0-155-141.us-west-2.compute.internal   Ready     worker    1d        v1.11.0+406fc897d8
ip-10-0-166-138.us-west-2.compute.internal   Ready     worker    1d        v1.11.0+406fc897d8
ip-10-0-23-205.us-west-2.compute.internal    Ready     master    1d        v1.11.0+406fc897d8
ip-10-0-38-249.us-west-2.compute.internal    Ready     master    1d        v1.11.0+406fc897d8

Delete instance "ip-10-0-23-205.us-west-2.compute.internal"

# oc get pod --all-namespaces | grep etcd
kube-system                                  etcd-member-ip-10-0-15-225.us-west-2.compute.internal                         1/1       Running     0          1d
kube-system                                  etcd-member-ip-10-0-23-155.us-west-2.compute.internal                         0/1       Init:0/2    5          38m
kube-system                                  etcd-member-ip-10-0-38-249.us-west-2.compute.internal                         1/1       Running     0          1d
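
The new member's pod is stuck in its init containers. As an illustrative check (the init container name "discovery" is an assumption here, not taken from this cluster), its logs could be inspected with:
# oc -n kube-system logs etcd-member-ip-10-0-23-155.us-west-2.compute.internal -c discovery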

# oc get nodes 
NAME                                         STATUS    ROLES     AGE       VERSION
ip-10-0-142-75.us-west-2.compute.internal    Ready     worker    1d        v1.11.0+406fc897d8
ip-10-0-15-225.us-west-2.compute.internal    Ready     master    1d        v1.11.0+406fc897d8
ip-10-0-155-141.us-west-2.compute.internal   Ready     worker    1d        v1.11.0+406fc897d8
ip-10-0-166-138.us-west-2.compute.internal   Ready     worker    1d        v1.11.0+406fc897d8
ip-10-0-23-155.us-west-2.compute.internal    Ready     master    39m       v1.11.0+406fc897d8
ip-10-0-38-249.us-west-2.compute.internal    Ready     master    1d        v1.11.0+406fc897d8


# oc rsh etcd-member-ip-10-0-15-225.us-west-2.compute.internal
# ETCDCTL_API=3 etcdctl --cert=/peer.crt --key=/peer.key --cacert=/etc/ssl/etcd/ca.crt --endpoints https://localhost:2379 member list
6852a310452cfe52, started, etcd-member-ip-10-0-15-225.us-west-2.compute.internal, https://rtest-etcd-0.test.redhat.com:2380, https://10.0.15.225:2379
68f9eddcc9186b35, started, etcd-member-ip-10-0-23-205.us-west-2.compute.internal, https://rtest-etcd-1.test.redhat.com:2380, https://10.0.23.205:2379
d764bbf70b8f188f, started, etcd-member-ip-10-0-38-249.us-west-2.compute.internal, https://rtest-etcd-2.test.redhat.com:2380, https://10.0.38.249:2379


# ETCDCTL_API=3 etcdctl --cert=/peer.crt --key=/peer.key --cacert=/etc/ssl/etcd/ca.crt --endpoints https://10.0.38.249:2379,https://10.0.23.205:2379,https://10.0.15.225:2379,https://10.0.23.155:2379 endpoint status --write-out=table
Failed to get the status of endpoint https://10.0.23.205:2379 (context deadline exceeded)
Failed to get the status of endpoint https://10.0.23.155:2379 (context deadline exceeded)
+--------------------------+------------------+---------+---------+-----------+-----------+------------+
|         ENDPOINT         |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+--------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://10.0.38.249:2379 | d764bbf70b8f188f |  3.3.10 |   88 MB |     false |        66 |    1924899 |
| https://10.0.15.225:2379 | 6852a310452cfe52 |  3.3.10 |   88 MB |      true |        66 |    1924979 |
+--------------------------+------------------+---------+---------+-----------+-----------+------------+
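
For reference, a sketch of the manual workaround that would be expected to recover from this state (the member ID is the stale 10.0.23.205 entry from the member list above; the peer URL for the new member is an assumption based on its node IP, and the actual discovery domain may differ), run from a healthy etcd-member pod:
# ETCDCTL_API=3 etcdctl --cert=/peer.crt --key=/peer.key --cacert=/etc/ssl/etcd/ca.crt --endpoints https://localhost:2379 member remove 68f9eddcc9186b35
# ETCDCTL_API=3 etcdctl --cert=/peer.crt --key=/peer.key --cacert=/etc/ssl/etcd/ca.crt --endpoints https://localhost:2379 member add etcd-member-ip-10-0-23-155.us-west-2.compute.internal --peer-urls=https://10.0.23.155:2380
Even after that, the etcd-member pod on the new master would still need to start with a matching initial-cluster configuration, which is the part that does not happen automatically here.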

Comment 2 Xingxing Xia 2019-02-21 03:00:28 UTC
Sounds similar to bug 1667557?

Comment 3 Michal Fojtik 2019-03-07 12:17:58 UTC

*** This bug has been marked as a duplicate of bug 1667557 ***

Comment 4 Red Hat Bugzilla 2023-09-14 05:24:03 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

