Bug 1811706

Summary:	On upgrade: EtcdMemberIPMigratorDegraded: etcdserver: Peer URLs already exists
Product:	OpenShift Container Platform	Reporter:	Luis Sanchez <sanchezl>
Component:	Etcd Operator	Assignee:	Sam Batschelet <sbatsche>
Status:	CLOSED DUPLICATE	QA Contact:	ge liu <geliu>
Severity:	unspecified	Docs Contact:
Priority:	high
Version:	4.4	CC:	alpatel, sbatsche, sdodson, wking
Target Milestone:	---	Keywords:	Upgrades
Target Release:	4.5.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-03-16 17:18:57 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Luis Sanchez 2020-03-09 15:23:18 UTC

Description of problem:

Upgrade from 4.3 to 4.4 stuck on etcd Degraded:

	EtcdMemberIPMigratorDegraded: etcdserver: Peer URLs already exists

Versions:
4.3.0-0.nightly-2020-03-06-102925 to 4.4.0-0.nightly-2020-03-09-021257

Peer URLs:
[root@ip-10-0-170-79 kubernetes]# etcdctl member list 
6eaa8b03968621d, started, etcd-member-ip-10-0-139-57.us-west-2.compute.internal, https://etcd-0.scooter.group-b.devcluster.openshift.com:2380, https://10.0.139.57:2379
607b6768da8a2af5, started, ip-10-0-139-57.us-west-2.compute.internal, https://10.0.139.57:2380, https://10.0.139.57:2379
7ac864e4e29706a1, started, ip-10-0-170-79.us-west-2.compute.internal, https://etcd-2.scooter.group-b.devcluster.openshift.com:2380, https://10.0.170.79:2379
ef15d118336ebace, started, ip-10-0-155-153.us-west-2.compute.internal, https://etcd-1.scooter.group-b.devcluster.openshift.com:2380, https://10.0.155.153:2379
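
The listing above shows two members advertising the same client URL (https://10.0.139.57:2379), one registered with a DNS peer URL and one with an IP peer URL. A minimal sketch of how that kind of duplicate could be spotted from the simple comma-separated output follows. The function names (parse_members, duplicate_client_urls) are hypothetical, and the parser assumes exactly one peer URL and one client URL per member, as in this listing; it is not part of the etcd operator.

```python
from collections import defaultdict

# Sample copied from the `etcdctl member list` output in this report.
MEMBER_LIST = """\
6eaa8b03968621d, started, etcd-member-ip-10-0-139-57.us-west-2.compute.internal, https://etcd-0.scooter.group-b.devcluster.openshift.com:2380, https://10.0.139.57:2379
607b6768da8a2af5, started, ip-10-0-139-57.us-west-2.compute.internal, https://10.0.139.57:2380, https://10.0.139.57:2379
7ac864e4e29706a1, started, ip-10-0-170-79.us-west-2.compute.internal, https://etcd-2.scooter.group-b.devcluster.openshift.com:2380, https://10.0.170.79:2379
ef15d118336ebace, started, ip-10-0-155-153.us-west-2.compute.internal, https://etcd-1.scooter.group-b.devcluster.openshift.com:2380, https://10.0.155.153:2379
"""


def parse_members(text):
    """Parse simple-format member lines into dicts.

    Assumption: each member has a single peer URL and a single client URL,
    so the first five comma-separated fields are
    id, status, name, peer URL, client URL.
    """
    members = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) < 5:
            continue  # skip blank or unexpected lines
        member_id, status, name, peer_url, client_url = fields[:5]
        members.append(
            {
                "id": member_id,
                "status": status,
                "name": name,
                "peer_url": peer_url,
                "client_url": client_url,
            }
        )
    return members


def duplicate_client_urls(members):
    """Return {client_url: [member ids]} for URLs shared by more than one member."""
    ids_by_url = defaultdict(list)
    for member in members:
        ids_by_url[member["client_url"]].append(member["id"])
    return {url: ids for url, ids in ids_by_url.items() if len(ids) > 1}


if __name__ == "__main__":
    for url, ids in duplicate_client_urls(parse_members(MEMBER_LIST)).items():
        print(f"{url} is advertised by members: {', '.join(ids)}")
```

Run against the listing in this report, the sketch flags members 6eaa8b03968621d and 607b6768da8a2af5 as sharing one client URL, which matches the four-member list on what should be a three-member cluster.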

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 W. Trevor King 2020-03-14 04:10:05 UTC

This came up in 4.3.5 -> 4.4.0-rc.1 CI testing [1,2].  We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges.

Who is impacted?
  Customers upgrading from 4.2.99 to 4.3.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  All customers upgrading from 4.2.z to 4.3.z fail approximately 10% of the time
What is the impact?
  Up to 2 minute disruption in edge routing
  Up to 90 seconds of API downtime
  etcd loses quorum and you have to restore from backup
How involved is remediation?
  Issue resolves itself after five minutes
  Admin uses oc to fix things
  Admin must SSH to hosts, restore from backups, or other non standard admin activities
Is this a regression?
  No, it’s always been like this; we just never noticed
  Yes, from 4.2.z and 4.3.1

Looks like we hit this in CI every few hours [3], for 15 hits over the past ~3d.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/443
[2]: https://github.com/openshift/cincinnati-graph-data/pull/118#issuecomment-599007279
[3]: https://search.svc.ci.openshift.org/?search=EtcdMemberIPMigratorDegraded%3A+etcdserver%3A+Peer+URLs+already+exists&maxAge=168h&context=-1&type=build-log

Comment 2 W. Trevor King 2020-03-14 04:13:15 UTC

Setting high priority until we get an impact statement, since hung updates with unknown root causes are pretty bad.  Once we have a handle on the cause and impact, we can adjust the priority as appropriate.

Comment 3 W. Trevor King 2020-03-14 04:34:10 UTC

Possibly this is a bug introduced by bug 1812071 [1] or bug 1812210 [2]?

[1]: https://github.com/openshift/cluster-kube-apiserver-operator/pull/791
[2]: https://github.com/openshift/cluster-kube-apiserver-operator/pull/792

Comment 4 Alay Patel 2020-03-16 17:18:57 UTC


*** This bug has been marked as a duplicate of bug 1812584 ***

Comment 5 W. Trevor King 2021-04-05 17:36:22 UTC

Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1].  If you feel like this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475

Comment 6 Red Hat Bugzilla 2023-09-15 00:30:12 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.