1813061 – cluster-etcd-operator should not scale when upgrading from 4.3 to 4.4

Summary: cluster-etcd-operator should not scale when upgrading from 4.3 to 4.4

Keywords:
Status:	CLOSED DUPLICATE of bug 1813341
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Etcd
Sub Component:
Version:	4.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	4.4.0
Assignee:	Sam Batschelet
QA Contact:	ge liu
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1813190
Depends On:	1812584 1813341
Blocks:

Reported:	2020-03-12 20:26 UTC by Sam Batschelet
Modified:	2021-04-05 17:46 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	1812584
Environment:
Last Closed:	2020-03-16 17:16:55 UTC
Target Upstream Version:
Embargoed:

Description Sam Batschelet 2020-03-12 20:26:39 UTC

+++ This bug was initially created as a clone of Bug #1812584 +++

Description of problem:
A few manual upgrade runs failed and ended up with 4 etcd members. On closer inspection, the etcd operator checks whether a member exists by looking it up under the pod's node name. In 4.3 and earlier versions, the discovery init container set the hostname as the etcd member name.

During an upgrade, the operator sees an unready pod, does not find a member matching the node name, and adds the pod as a member. The etcd membership grows to 4 members, with 2 members pointing to the same pod, as follows:

-----
sh-4.2# etcdctl member list
6eaa8b03968621d, started, etcd-member-ip-10-0-139-57.us-west-2.compute.internal, https://etcd-0.scooter.group-b.devcluster.openshift.com:2380, https://10.0.139.57:2379
607b6768da8a2af5, started, ip-10-0-139-57.us-west-2.compute.internal, https://10.0.139.57:2380, https://10.0.139.57:2379
7ac864e4e29706a1, started, ip-10-0-170-79.us-west-2.compute.internal, https://etcd-2.scooter.group-b.devcluster.openshift.com:2380, https://10.0.170.79:2379
ef15d118336ebace, started, ip-10-0-155-153.us-west-2.compute.internal, https://etcd-1.scooter.group-b.devcluster.openshift.com:2380, https://10.0.155.153:2379
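
The duplicate in the listing above can be spotted mechanically by grouping members on their client URL. The following is a minimal sketch, not the operator's code: it assumes the five-field `etcdctl member list` output format shown above (ID, status, name, peer URL, client URL), and the helper names `parse_member_list` and `duplicate_client_urls` are hypothetical.

```python
from collections import defaultdict

# Output copied from the `etcdctl member list` run in this report.
MEMBER_LIST = """\
6eaa8b03968621d, started, etcd-member-ip-10-0-139-57.us-west-2.compute.internal, https://etcd-0.scooter.group-b.devcluster.openshift.com:2380, https://10.0.139.57:2379
607b6768da8a2af5, started, ip-10-0-139-57.us-west-2.compute.internal, https://10.0.139.57:2380, https://10.0.139.57:2379
7ac864e4e29706a1, started, ip-10-0-170-79.us-west-2.compute.internal, https://etcd-2.scooter.group-b.devcluster.openshift.com:2380, https://10.0.170.79:2379
ef15d118336ebace, started, ip-10-0-155-153.us-west-2.compute.internal, https://etcd-1.scooter.group-b.devcluster.openshift.com:2380, https://10.0.155.153:2379
"""


def parse_member_list(text):
    """Parse five-field member lines into dicts (assumed format, see above)."""
    members = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 5:
            continue  # skip blank or unexpected lines
        member_id, status, name, peer_url, client_url = fields
        members.append({
            "id": member_id,
            "status": status,
            "name": name,
            "peer_url": peer_url,
            "client_url": client_url,
        })
    return members


def duplicate_client_urls(members):
    """Return {client_url: [member names]} for URLs claimed by more than one member."""
    names_by_url = defaultdict(list)
    for member in members:
        names_by_url[member["client_url"]].append(member["name"])
    return {url: names for url, names in names_by_url.items() if len(names) > 1}


members = parse_member_list(MEMBER_LIST)
duplicates = duplicate_client_urls(members)
for url, names in duplicates.items():
    print(f"{url} is claimed by {len(names)} members: {', '.join(names)}")
```

Matching on the client URL rather than the member name is only one possible check; it is used here because the two entries for 10.0.139.57 differ in name and peer URL but share the same client URL.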


Version-Release number of selected component (if applicable):


How reproducible:
About 1 in 4 manual runs in my experience. It sometimes leads to total loss of the control plane on upgrade.

Steps to Reproduce:
1. Upgrade a 4.3 cluster to a recent 4.4 release.

Comment 1 Sam Batschelet 2020-03-13 19:48:36 UTC

*** Bug 1813190 has been marked as a duplicate of this bug. ***

Comment 3 Alay Patel 2020-03-16 17:16:55 UTC


*** This bug has been marked as a duplicate of bug 1813341 ***

Comment 4 W. Trevor King 2021-04-05 17:46:49 UTC

Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475
