Description of problem:

Ran a rollback test on the v4.4.5 -> v4.5 nightly -> v4.4.5 path. During the downgrade from the v4.5 nightly back to v4.4.5, etcd went Degraded and the rollback failed.

# ./oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-05-26-224432   True        True          20m     Unable to apply 4.4.5: the cluster operator etcd is degraded

# ./oc get co etcd
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
etcd   4.4.5     True        True          True       2m36s

# ./oc describe co etcd
...
Status:
  Conditions:
    Last Transition Time:  2020-05-27T06:02:28Z
    Message:               StaticPodsDegraded: nodes/ip-10-0-223-169.us-east-2.compute.internal pods/etcd-ip-10-0-223-169.us-east-2.compute.internal container="etcd" is not ready
                           StaticPodsDegraded: nodes/ip-10-0-223-169.us-east-2.compute.internal pods/etcd-ip-10-0-223-169.us-east-2.compute.internal container="etcd" is waiting: "CrashLoopBackOff" - "back-off 5m0s restarting failed container=etcd pod=etcd-ip-10-0-223-169.us-east-2.compute.internal_openshift-etcd(60224905fe5e8c763ce7cadd44f2e4ca)"
                           EtcdMembersDegraded: 2 of 3 members are available, ip-10-0-223-169.us-east-2.compute.internal is unhealthy
    Reason:                EtcdMembers_UnhealthyMembers::StaticPods_Error
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-05-27T06:00:47Z
    Message:               NodeInstallerProgressing: 3 nodes are at revision 2; 0 nodes have achieved new revision 3
    Reason:                NodeInstaller
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2020-05-27T06:21:37Z
    Message:               StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 2; 0 nodes have achieved new revision 3
                           EtcdMembersAvailable: 2 of 3 members are available, ip-10-0-223-169.us-east-2.compute.internal is unhealthy
    Reason:                AsExpected
    Status:                True
    Type:                  Available
    Last Transition Time:  2020-05-27T05:02:21Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:  <nil>
Related Objects:
  Group:     operator.openshift.io
  Name:      cluster
  Resource:  etcds
  Group:
  Name:      openshift-config
  Resource:  namespaces
  Group:
  Name:      openshift-config-managed
  Resource:  namespaces
  Group:
  Name:      openshift-etcd-operator
  Resource:  namespaces
  Group:
  Name:      openshift-etcd
  Resource:  namespaces
...

Checked node ip-10-0-223-169.us-east-2.compute.internal and found that the etcd container keeps restarting with the following errors:

2020-05-27 06:33:09.738612 I | rafthttp: started HTTP pipelining with peer 3b323fe5abf25cff
2020-05-27 06:33:09.738689 E | rafthttp: failed to find member 3b323fe5abf25cff in cluster da0dad007e3f46db
2020-05-27 06:33:09.738842 I | rafthttp: started HTTP pipelining with peer e0fe40b8cfff6089
2020-05-27 06:33:09.738867 E | rafthttp: failed to find member e0fe40b8cfff6089 in cluster da0dad007e3f46db
2020-05-27 06:33:09.739045 E | rafthttp: failed to find member 3b323fe5abf25cff in cluster da0dad007e3f46db
2020-05-27 06:33:09.739086 E | rafthttp: failed to find member e0fe40b8cfff6089 in cluster da0dad007e3f46db
2020-05-27 06:33:09.740589 N | etcdserver/membership: set the initial cluster version to 3.4
2020-05-27 06:33:09.740664 C | etcdserver/membership: cluster cannot be downgraded (current version: 3.3.18 is lower than determined cluster version: 3.4).

Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-05-26-224432 to 4.4.5

How reproducible:
Always

Steps to Reproduce:
1. Run a downgrade from v4.5 to v4.4:
# ./oc adm upgrade --allow-explicit-upgrade --force --to-image quay.io/openshift-release-dev/ocp-release@sha256:4a461dc23a9d323c8bd7a8631bed078a9e5eec690ce073f78b645c83fb4cdf74

Actual results:
The downgrade failed.

Expected results:
The downgrade succeeds.

Additional info:
> 2020-05-27 06:33:09.740664 C | etcdserver/membership: cluster cannot be downgraded (current version: 3.3.18 is lower than determined cluster version: 3.4).

Rollback from 4.5 to 4.4 will not be possible because we are upgrading the etcd minor version (3.3 to 3.4), and etcd does not support downgrading a cluster's minor version.
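For illustration, the guard that produces the fatal log line above can be sketched as a plain version comparison: a member whose binary's minor version is lower than the determined cluster version refuses to start. This is a simplified, hypothetical shell rendering of the check, not the actual etcdserver/membership code:

```shell
# can_join BINARY_VERSION CLUSTER_VERSION
# Returns 0 if a member built at BINARY_VERSION may join a cluster whose
# determined version is CLUSTER_VERSION; returns 1 (downgrade refused)
# when the binary's major.minor is lower than the cluster's.
can_join() {
  binary="$1"   # e.g. 3.3.18
  cluster="$2"  # e.g. 3.4
  bmaj=${binary%%.*};  brest=${binary#*.};  bmin=${brest%%.*}
  cmaj=${cluster%%.*}; crest=${cluster#*.}; cmin=${crest%%.*}
  if [ "$bmaj" -lt "$cmaj" ] || { [ "$bmaj" -eq "$cmaj" ] && [ "$bmin" -lt "$cmin" ]; }; then
    echo "cluster cannot be downgraded (current version: $binary is lower than determined cluster version: $cluster)"
    return 1
  fi
  return 0
}
```

So once any 3.4 member has raised the cluster version to 3.4, the 4.4.5 payload's 3.3.18 etcd can never rejoin, which is why the static pod sits in CrashLoopBackOff.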
This bug is assumed to be the reason that https://testgrid.k8s.io/redhat-openshift-ocp-release-4.5-informing#release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.4-to-4.5 is consistently failing.
> this bug is assumed to be the reason https://testgrid.k8s.io/redhat-openshift-ocp-release-4.5-informing#release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.4-to-4.5 is consistently failing.

Yeah, it 100% is the reason. To be clear, the PR against this BZ is not meant to fix the test; the underlying condition is not something we can resolve. But the change will ensure that a backup exists on the cluster when a 4.5 upgrade fails. This gives customers and support a clean path back to 4.4.z if they somehow forget to take a backup themselves.
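As a hedged sketch of what "a backup exists on the cluster" could be checked against: OpenShift's cluster-backup.sh typically writes a snapshot_<timestamp>.db and a static_kuberesources_<timestamp>.tar.gz into the backup directory. The helper below (hypothetical, file-name patterns assumed) verifies both artifacts are present before anyone attempts a rollback:

```shell
# backup_is_complete DIR
# Returns 0 only if DIR contains both an etcd snapshot (snapshot_*.db) and
# the static pod resources archive (static_kuberesources_*.tar.gz) that
# cluster-backup.sh is expected to produce; returns 1 otherwise.
backup_is_complete() {
  dir="$1"
  ls "$dir"/snapshot_*.db >/dev/null 2>&1 || return 1
  ls "$dir"/static_kuberesources_*.tar.gz >/dev/null 2>&1 || return 1
  return 0
}
```

On a live cluster the directory would be something like /home/core/assets/backup on a control-plane node; the path here is an assumption for illustration only.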
The rest of the testing is still in progress....
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409