Bug 1840531 - etcd operator went into degraded status during rollback from 4.5 to 4.4
Summary: etcd operator went into degraded status during rollback from 4.5 to 4.4
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Suresh Kolichala
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks: 1846025
 
Reported: 2020-05-27 07:19 UTC by liujia
Modified: 2020-07-13 17:42 UTC
CC: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1846025
Environment:
Last Closed: 2020-07-13 17:41:57 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Product Errata RHBA-2020:2409 (last updated 2020-07-13 17:42:11 UTC)

Description liujia 2020-05-27 07:19:05 UTC
Description of problem:
Ran a rollback test along the v4.4.5 -> v4.5 nightly -> v4.4.5 path. During the downgrade from the v4.5 nightly build back to v4.4.5, etcd went into Degraded status and the rollback failed.

# ./oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-05-26-224432   True        True          20m     Unable to apply 4.4.5: the cluster operator etcd is degraded

# ./oc get co etcd
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
etcd   4.4.5     True        True          True       2m36s

# ./oc describe co etcd
...
Status:
  Conditions:
    Last Transition Time:  2020-05-27T06:02:28Z
    Message:               StaticPodsDegraded: nodes/ip-10-0-223-169.us-east-2.compute.internal pods/etcd-ip-10-0-223-169.us-east-2.compute.internal container="etcd" is not ready
StaticPodsDegraded: nodes/ip-10-0-223-169.us-east-2.compute.internal pods/etcd-ip-10-0-223-169.us-east-2.compute.internal container="etcd" is waiting: "CrashLoopBackOff" - "back-off 5m0s restarting failed container=etcd pod=etcd-ip-10-0-223-169.us-east-2.compute.internal_openshift-etcd(60224905fe5e8c763ce7cadd44f2e4ca)"
EtcdMembersDegraded: 2 of 3 members are available, ip-10-0-223-169.us-east-2.compute.internal is unhealthy
    Reason:                EtcdMembers_UnhealthyMembers::StaticPods_Error
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-05-27T06:00:47Z
    Message:               NodeInstallerProgressing: 3 nodes are at revision 2; 0 nodes have achieved new revision 3
    Reason:                NodeInstaller
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2020-05-27T06:21:37Z
    Message:               StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 2; 0 nodes have achieved new revision 3
EtcdMembersAvailable: 2 of 3 members are available, ip-10-0-223-169.us-east-2.compute.internal is unhealthy
    Reason:                AsExpected
    Status:                True
    Type:                  Available
    Last Transition Time:  2020-05-27T05:02:21Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:               <nil>
  Related Objects:
    Group:     operator.openshift.io
    Name:      cluster
    Resource:  etcds
    Group:     
    Name:      openshift-config
    Resource:  namespaces
    Group:     
    Name:      openshift-config-managed
    Resource:  namespaces
    Group:     
    Name:      openshift-etcd-operator
    Resource:  namespaces
    Group:     
    Name:      openshift-etcd
    Resource:  namespaces
...
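
For reference, member health for the EtcdMembersDegraded condition above can also be checked directly with etcdctl from one of the still-healthy etcd pods. This is only a sketch: the node name is a placeholder, and it assumes the ETCDCTL_* environment variables (endpoints and client certificates) are already set inside the etcd container, as they are in the OpenShift etcd static pods.

# ./oc rsh -n openshift-etcd etcd-<healthy-master-node>
(inside the etcd container)
# etcdctl member list -w table
# etcdctl endpoint health --cluster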

Checked node ip-10-0-223-169.us-east-2.compute.internal and found that the etcd container keeps restarting with the following errors:
2020-05-27 06:33:09.738612 I | rafthttp: started HTTP pipelining with peer 3b323fe5abf25cff
2020-05-27 06:33:09.738689 E | rafthttp: failed to find member 3b323fe5abf25cff in cluster da0dad007e3f46db
2020-05-27 06:33:09.738842 I | rafthttp: started HTTP pipelining with peer e0fe40b8cfff6089
2020-05-27 06:33:09.738867 E | rafthttp: failed to find member e0fe40b8cfff6089 in cluster da0dad007e3f46db
2020-05-27 06:33:09.739045 E | rafthttp: failed to find member 3b323fe5abf25cff in cluster da0dad007e3f46db
2020-05-27 06:33:09.739086 E | rafthttp: failed to find member e0fe40b8cfff6089 in cluster da0dad007e3f46db
2020-05-27 06:33:09.740589 N | etcdserver/membership: set the initial cluster version to 3.4
2020-05-27 06:33:09.740664 C | etcdserver/membership: cluster cannot be downgraded (current version: 3.3.18 is lower than determined cluster version: 3.4).
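
One way to confirm the version mismatch in the last log line is to compare the etcd binary version with the server versions reported by the healthy members. This is a sketch under the same assumptions as the member-health commands above (placeholder pod name, ETCDCTL_* environment preset in the container):

# ./oc exec -n openshift-etcd etcd-<healthy-master-node> -c etcd -- etcd --version
# ./oc exec -n openshift-etcd etcd-<healthy-master-node> -c etcd -- etcdctl endpoint status -w table --cluster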

Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-05-26-224432 to 4.4.5

How reproducible:
always

Steps to Reproduce:
1. Run the downgrade from v4.5 to v4.4 (a sketch for monitoring progress follows the steps):
# ./oc adm upgrade --allow-explicit-upgrade --force --to-image quay.io/openshift-release-dev/ocp-release@sha256:4a461dc23a9d323c8bd7a8631bed078a9e5eec690ce073f78b645c83fb4cdf74
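
A sketch for monitoring the rollback and surfacing the failure condition (the jsonpath expression is illustrative, not taken from the report):

# ./oc get co etcd
# ./oc get clusterversion version -o jsonpath='{.status.conditions[?(@.type=="Failing")].message}'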

Actual results:
Downgrade failed.

Expected results:
Downgrade succeeds.

Additional info:

Comment 2 Sam Batschelet 2020-05-27 13:48:07 UTC
> 2020-05-27 06:33:09.740664 C | etcdserver/membership: cluster cannot be downgraded (current version: 3.3.18 is lower than determined cluster version: 3.4).

Rollback from 4.5 to 4.4 will not be possible because we upgrade the etcd minor version (3.3 -> 3.4) in 4.5.

Comment 5 Ben Parees 2020-06-09 14:05:57 UTC
this bug is assumed to be the reason https://testgrid.k8s.io/redhat-openshift-ocp-release-4.5-informing#release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.4-to-4.5 is consistently failing.

Comment 6 Sam Batschelet 2020-06-11 21:02:48 UTC
> this bug is assumed to be the reason https://testgrid.k8s.io/redhat-openshift-ocp-release-4.5-informing#release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.4-to-4.5 is consistently failing.

Yeah, it is 100% the reason. To be clear, the PR against this BZ is not meant to fix the test; the underlying condition is not something we can resolve. Instead, the fix ensures that a backup exists on the cluster when a 4.5 upgrade fails. This gives customers/support a clean path back to 4.4.z if they forget to take a backup themselves.
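
For completeness, a rough sketch of the manual backup step the comment refers to, following the documented OCP 4.4/4.5 procedure (the node name is a placeholder):

# ./oc debug node/<master-node>
(inside the debug shell)
# chroot /host
# /usr/local/bin/cluster-backup.sh /home/core/assets/backup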

Comment 12 ge liu 2020-06-19 08:20:08 UTC
The rest of the testing is still in progress.

Comment 16 errata-xmlrpc 2020-07-13 17:41:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

