1812584 – cluster-etcd-operator should not scale when upgrading from 4.3 to 4.4

Summary: cluster-etcd-operator should not scale when upgrading from 4.3 to 4.4

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Etcd
Sub Component:
Version:	4.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	4.5.0
Assignee:	Sam Batschelet
QA Contact:	ge liu
Docs Contact:
URL:
Whiteboard:
Duplicates (3):	1811706 1812860 1813057
Depends On:
Blocks:	1813061 1813190 1813341

Reported:	2020-03-11 15:55 UTC by Alay Patel
Modified:	2021-04-05 17:24 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Clones:	1813061 1813341
Environment:
Last Closed:	2020-07-13 17:19:39 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-etcd-operator pull 253	0	None	closed	Bug 1812584: fix scaling during upgrade and add unit tests	2020-11-30 11:42:03 UTC
Red Hat Product Errata	RHBA-2020:2409	0	None	None	None	2020-07-13 17:20:06 UTC

Description Alay Patel 2020-03-11 15:55:02 UTC

Description of problem:
A few manual upgrade runs failed and ended up with 4 etcd members. On closer inspection, the etcd operator checks whether a member exists by looking it up using the pod's node name. In 4.3 and earlier versions, the discovery init container set the hostname as the etcd member name.

During an upgrade, the operator sees an unready pod, does not find a member with that hostname, and adds the pod as a member. The etcd membership is extended to 4 members, with 2 members pointing to the same pod, as follows:
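
The lookup miss can be illustrated with a minimal sketch. It assumes the check is an exact match on the member name, which the description implies but does not show; `member_exists` is a hypothetical helper, not the operator's real code. The names are copied from the member list in this report, and the pre-upgrade membership is inferred from that list.

```python
# Hypothetical sketch of an exact-name membership check.
# This is NOT the operator's actual code; it only illustrates the miss.
def member_exists(member_names, node_name):
    """Return True only if some member is named exactly like the node."""
    return node_name in member_names


# Assumed membership before the extra member was added, inferred from
# the `etcdctl member list` output in this report.
members_before_scale = [
    "etcd-member-ip-10-0-139-57.us-west-2.compute.internal",
    "ip-10-0-170-79.us-west-2.compute.internal",
    "ip-10-0-155-153.us-west-2.compute.internal",
]

node = "ip-10-0-139-57.us-west-2.compute.internal"

# The node already backs a member, but under a different name,
# so an exact-name lookup reports it as missing.
found = member_exists(members_before_scale, node)
```

Under these assumptions `found` is false even though the node already hosts a member, which is the condition that would lead to a fourth member being added.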

-----
sh-4.2# etcdctl member list
6eaa8b03968621d, started, etcd-member-ip-10-0-139-57.us-west-2.compute.internal, https://etcd-0.scooter.group-b.devcluster.openshift.com:2380, https://10.0.139.57:2379
607b6768da8a2af5, started, ip-10-0-139-57.us-west-2.compute.internal, https://10.0.139.57:2380, https://10.0.139.57:2379
7ac864e4e29706a1, started, ip-10-0-170-79.us-west-2.compute.internal, https://etcd-2.scooter.group-b.devcluster.openshift.com:2380, https://10.0.170.79:2379
ef15d118336ebace, started, ip-10-0-155-153.us-west-2.compute.internal, https://etcd-1.scooter.group-b.devcluster.openshift.com:2380, https://10.0.155.153:2379
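
The duplicate can be spotted mechanically from the output above. The following is a rough sketch, not part of the operator or of etcdctl: it assumes each line has exactly one peer URL and one client URL, as in this output, and the helper names `parse_members` and `duplicated_client_urls` are made up for illustration.

```python
from collections import defaultdict

# Output copied from the `etcdctl member list` run in this report.
MEMBER_LIST = """\
6eaa8b03968621d, started, etcd-member-ip-10-0-139-57.us-west-2.compute.internal, https://etcd-0.scooter.group-b.devcluster.openshift.com:2380, https://10.0.139.57:2379
607b6768da8a2af5, started, ip-10-0-139-57.us-west-2.compute.internal, https://10.0.139.57:2380, https://10.0.139.57:2379
7ac864e4e29706a1, started, ip-10-0-170-79.us-west-2.compute.internal, https://etcd-2.scooter.group-b.devcluster.openshift.com:2380, https://10.0.170.79:2379
ef15d118336ebace, started, ip-10-0-155-153.us-west-2.compute.internal, https://etcd-1.scooter.group-b.devcluster.openshift.com:2380, https://10.0.155.153:2379
"""


def parse_members(text):
    """Parse lines of the form: id, status, name, peer URL, client URL."""
    members = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        member_id, status, name, peer_url, client_url = fields[:5]
        members.append(
            {
                "id": member_id,
                "status": status,
                "name": name,
                "peer_url": peer_url,
                "client_url": client_url,
            }
        )
    return members


def duplicated_client_urls(members):
    """Map each client URL served by more than one member to the member names."""
    by_url = defaultdict(list)
    for member in members:
        by_url[member["client_url"]].append(member["name"])
    return {url: names for url, names in by_url.items() if len(names) > 1}


members = parse_members(MEMBER_LIST)
duplicates = duplicated_client_urls(members)
```

Here `duplicates` holds a single entry, for `https://10.0.139.57:2379`, listing both `etcd-member-ip-10-0-139-57.us-west-2.compute.internal` and `ip-10-0-139-57.us-west-2.compute.internal`. Grouping by client URL rather than by name is a choice made for this sketch, since the two entries differ in both name and peer URL.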


Version-Release number of selected component (if applicable):


How reproducible:
1/4 manual runs in my experience. Sometimes leads to total loss of control plane on upgrade

Steps to Reproduce:
1. Upgrade to a recent 4.4 cluster from 4.3

Comment 3 Sam Batschelet 2020-03-13 20:08:26 UTC

*** Bug 1813057 has been marked as a duplicate of this bug. ***

Comment 4 Sam Batschelet 2020-03-13 20:40:18 UTC

*** Bug 1812860 has been marked as a duplicate of this bug. ***

Comment 5 ge liu 2020-03-16 11:43:31 UTC

Verified.
We tried several times from 4.4.0-0.nightly-2020-03-13-053209 to 4.5.0-0.nightly-2020-03-16-031426 with profile: IPI on AWS Custom VPC with RHCOS & RHEL7.7 (FIPS off). On 4.3 to 4.4, however, it seems to always appear on the UPI vSphere platform. We will test it further.

Comment 6 ge liu 2020-03-16 15:05:52 UTC

After syncing with Sam, we may close this one and open a new one if there is a problem on the vSphere platform.

Comment 7 Alay Patel 2020-03-16 17:18:57 UTC

*** Bug 1811706 has been marked as a duplicate of this bug. ***

Comment 8 W. Trevor King 2020-03-16 20:00:08 UTC

(In reply to Alay Patel from comment #0)
> 1/4 manual runs in my experience. Sometimes leads to total loss of control
> plane on upgrade

This is serious enough that I'm adding the UpgradeBlocker keyword.  I'm agnostic about pulling edges from candidate-4.4 based on this issue, but we certainly don't want to ship a fast/stable 4.3 -> 4.4 edge until it is fixed.

Comment 9 Ke Wang 2020-03-23 09:56:07 UTC

*** Bug 1812860 has been marked as a duplicate of this bug. ***

Comment 11 errata-xmlrpc 2020-07-13 17:19:39 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409
