Description of problem:
As of last night, upgrades from 4.3.5 to 4.4.0-ci appear to have started failing. The problem looks like it boils down to unhealthy etcd members.

Two example test runs:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/22128
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/22060

Mar 19 08:51:04.926 I ns/openshift-etcd-operator deployment/etcd-operator unhealthy members: ip-10-0-142-37.ec2.internal (23 times)
Mar 19 08:52:30.393 I ns/openshift-etcd-operator deployment/etcd-operator unhealthy members: ip-10-0-142-37.ec2.internal (10 times)

::EtcdMembers_UnhealthyMembers:
EtcdMembersControllerDegraded: node lister not synced
EtcdMembersDegraded: ip-10-0-143-155.us-west-1.compute.internal,ip-10-0-149-42.us-west-1.compute.internal members are unhealthy, members are unknown

How reproducible:
At present it looks to be 5 failures in a row.

Additional info:
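For triage, a rough sketch of how one might tally which members the operator flags across a CI event-log dump (hypothetical helper, not part of any product tooling; it only pattern-matches the "unhealthy members:" event lines shown above):

```python
import re
from collections import Counter

# Matches the member list in operator events like:
#   "... deployment/etcd-operator unhealthy members: ip-10-0-142-37.ec2.internal (23 times)"
UNHEALTHY_RE = re.compile(r"unhealthy members: (\S+)")

def tally_unhealthy_members(log_lines):
    """Return a Counter mapping member name -> number of event lines flagging it."""
    counts = Counter()
    for line in log_lines:
        match = UNHEALTHY_RE.search(line)
        if match:
            # A single event may list several members, comma-separated.
            for member in match.group(1).split(","):
                counts[member] += 1
    return counts

example = [
    "Mar 19 08:51:04.926 I ns/openshift-etcd-operator deployment/etcd-operator "
    "unhealthy members: ip-10-0-142-37.ec2.internal (23 times)",
    "Mar 19 08:52:30.393 I ns/openshift-etcd-operator deployment/etcd-operator "
    "unhealthy members: ip-10-0-142-37.ec2.internal (10 times)",
]
print(tally_unhealthy_members(example))
# Counter({'ip-10-0-142-37.ec2.internal': 2})
```

This only counts distinct event lines, not the "(N times)" dedup counts Prow appends; it is just a quick way to see whether one member or several are being reported.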
A similar error is showing up for upgrades from 4.4.0-rc2 to 4.5.0.

Prow link: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/22342

Error message: [Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] [Suite:openshift]

Not sure if they are related.
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges.

Who is impacted?
- Customers upgrading from 4.2.99 to 4.3.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
- All customers upgrading from 4.2.z to 4.3.z fail approximately 10% of the time

What is the impact?
- Up to 2 minute disruption in edge routing
- Up to 90 seconds of API downtime
- etcd loses quorum and you have to restore from backup

How involved is remediation?
- Issue resolves itself after five minutes
- Admin uses oc to fix things
- Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression?
- No, it's always been like this, we just never noticed
- Yes, from 4.2.z and 4.3.1
*** This bug has been marked as a duplicate of bug 1815539 ***
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475
The needinfo request[s] on this closed bug have been removed, as they have been unresolved for 500 days.