+++ This bug was initially created as a clone of Bug #1997347 +++

OCP 4.9 will include a minor bump to etcd, bringing it to v3.5.0. Once the cluster has upgraded to 3.5, etcd will fail in a rollback scenario. For this reason, we must ensure that a valid backup exists for the user before the upgrade. This backup safety net can be used to mitigate the risk of a failed upgrade in a case where the customer forgot to take a backup. Otherwise, without a backup, the customer would be forced to create a new cluster.

--- Additional comment from Sam Batschelet on 2021-08-27 15:23:17 UTC ---

I wanted to give an update: a few assumptions around storage were proven invalid for the backup pre-checks. I am working on a new approach that performs these checks in the pod init containers. As a bonus, this should simplify the operator code.
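To make the init-container idea concrete, here is a minimal sketch of the kind of storage pre-check such a container might run before taking the upgrade backup. The path, threshold, and function name are all illustrative assumptions, not taken from the actual cluster-etcd-operator code:

```shell
#!/usr/bin/env bash
# Hypothetical pre-check: verify the backup target has enough free space
# before attempting a backup. Illustrative only; not the operator's code.
set -euo pipefail

enough_space() {
  # $1: directory to check, $2: required free space in KiB
  local avail
  avail=$(df --output=avail -k "$1" | tail -n 1 | tr -d ' ')
  [ "$avail" -ge "$2" ]
}

# On a real control-plane node this would be /etc/kubernetes/cluster-backup;
# default to /tmp so the sketch runs anywhere.
backup_dir="${1:-/tmp}"
if enough_space "$backup_dir" 1024; then
  echo "pre-check passed: enough space for backup in $backup_dir"
else
  echo "pre-check failed: not enough space in $backup_dir" >&2
  exit 1
fi
```

Running the check in an init container means the backup pod simply never starts when the pre-check fails, instead of the operator having to model that state itself.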
Following [1] to do some pre-merge testing here too (easier, because we have signed 4.9 releases to use), I launched a cluster-bot cluster with 'launch 4.8,openshift/cluster-etcd-operator#652,openshift/cluster-version-operator#649'. Updating that cluster to 4.9.0-rc.0:

$ oc adm upgrade --allow-explicit-upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:d1c1401fdbfe0820036dd3f3cc5df1539b5a101fe9f21f1845e55d8655000f66
$ oc get -o json clusteroperator etcd | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2021-09-07T18:54:40Z Degraded=False AsExpected: NodeControllerDegraded: All master nodes are ready
EtcdMembersDegraded: No unhealthy members found
2021-09-07T19:05:31Z Progressing=False AsExpected: NodeInstallerProgressing: 3 nodes are at revision 3
EtcdMembersProgressing: No unstarted etcd members found
2021-09-07T18:55:52Z Available=True AsExpected: StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 3
EtcdMembersAvailable: 3 members are available
2021-09-07T18:54:40Z Upgradeable=True AsExpected: All is well
2021-09-07T19:20:56Z RecentBackup=True UpgradeBackupSuccessful: UpgradeBackup pre 4.9 located at path /etc/kubernetes/cluster-backup/upgrade-backup-2021-09-07_192056 on node "ci-ln-7wqizib-f76d1-j78qx-master-1"
$ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2021-09-07T18:50:35Z RetrievedUpdates=False NoChannel: The update channel has not been configured.
2021-09-07T19:15:06Z Available=True : Done applying 4.8.0-0.ci.test-2021-09-07-184416-ci-ln-7wqizib-latest
2021-09-07T19:20:51Z Failing=True UpgradePreconditionCheckFailed: Precondition "EtcdRecentBackup" failed because of "ControllerStarted":
2021-09-07T19:20:30Z Progressing=True UpgradePreconditionCheckFailed: Unable to apply 4.9.0-rc.0: it may not be safe to apply this update

It sits like this for a bit, until the CVO takes another pass at downloading and verifying the release and checking preconditions. After which:

$ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2021-09-07T18:50:35Z RetrievedUpdates=False NoChannel: The update channel has not been configured.
2021-09-07T19:15:06Z Available=True : Done applying 4.8.0-0.ci.test-2021-09-07-184416-ci-ln-7wqizib-latest
2021-09-07T19:22:20Z Failing=False :
2021-09-07T19:20:30Z Progressing=True : Working towards 4.9.0-rc.0: 4 of 734 done (0% complete)

So looks good to me :)

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1997347#c5
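When scripting against these one-line condition summaries, a tiny helper can pull a condition's status out of the `TYPE=STATUS` token. This is just a convenience sketch for parsing the jq-formatted lines above; it is not part of oc or the CVO:

```shell
# Extract the STATUS from a "TYPE=STATUS" token in a summary line
# produced by the jq filter shown above.
condition_status() {
  # $1: a summary line, $2: condition type (e.g. Failing)
  echo "$1" | grep -o "${2}=[A-Za-z]*" | cut -d= -f2
}

line='2021-09-07T19:22:20Z Failing=False : '
condition_status "$line" Failing   # prints: False
```

A watch loop could call this until, say, `Failing` returns `False`, mirroring the manual re-checks done in the transcript.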
Hi Trevor,

> 2021-09-07T19:20:56Z RecentBackup=True
> 2021-09-07T19:22:20Z Failing=False

The CVO's Failing condition goes to False at 2021-09-07T19:22:20Z, while etcd's RecentBackup condition goes to True at 2021-09-07T19:20:56Z, so the CVO delays 1m30 or so on the condition transition. Does the CVO need any improvements on it? Thanks!
> CVO delays 1m30 or so on the condition transition. Does CVO need any improvements on it?

Yeah, it sits for a bit to avoid hotlooping, and then pulls the release again (or launches a new container consuming the already-downloaded release image) and takes a second pass at checking the upgrade preconditions. We could probably squeeze that delay down with some CVO-side work, but I think that would be a forward-looking RFE, not something that should block us from releasing this initial implementation.
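For reference, the observed gap can be computed directly from the two transition timestamps in the transcript. A small sketch using GNU date's `-d` parsing (the function name is ours, not a CVO tool):

```shell
# Seconds elapsed between two RFC 3339 condition transitions.
# Relies on GNU date's -d timestamp parsing.
gap_seconds() {
  # $1: earlier timestamp, $2: later timestamp
  echo $(( $(date -u -d "$2" +%s) - $(date -u -d "$1" +%s) ))
}

# RecentBackup=True at 19:20:56Z, Failing=False at 19:22:20Z
gap_seconds 2021-09-07T19:20:56Z 2021-09-07T19:22:20Z   # prints: 84
```

So the measured delay in this run was 84 seconds, consistent with the "1m30 or so" estimate above.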
It's very important that we have a clear signal on this feature working as expected. Do you have a test case in Jenkins that we can exercise to ensure we have no issues? I am working on a possible addition to periodic testing that would allow testing of minor nightly images, but it would be good to have multiple test runs on the above.
Sam, currently I have the test case [1] in Polarion, which is not automated yet. I'll get it automated soon so that we can easily test it periodically. The test case only covers the testing from the CVO side; Ge Liu works on the testing from the etcd side.

[1] https://polarion.engineering.redhat.com/polarion/redirect/project/OSE/workitem?id=OCP-44209
Sam, the filename displayed in the RecentBackup message is incorrect. The RecentBackup condition reports:

  "message": "UpgradeBackup pre 4.9 located at path /etc/kubernetes/cluster-backup/upgrade-backup-2021-09-15_100541 on node \"yangyang0915-2-88b9l-master-0.c.openshift-qe.internal\"",

but that file does not exist:

# oc debug node/yangyang0915-2-88b9l-master-0.c.openshift-qe.internal
Starting pod/yangyang0915-2-88b9l-master-0copenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.0.2
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# ls /etc/kubernetes/cluster-backup/upgrade-backup-2021-09-15_100541
ls: cannot access '/etc/kubernetes/cluster-backup/upgrade-backup-2021-09-15_100541': No such file or directory
sh-4.4# ls /etc/kubernetes/cluster-backup/
upgrade-backup-2021-09-15_100535

The actual file name is upgrade-backup-2021-09-15_100535 rather than upgrade-backup-2021-09-15_100541.
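Since the timestamp in the condition message (100541) trails the on-disk directory name (100535) by a few seconds, any tooling should trust the directory listing rather than the reported path. A hypothetical helper, relying on the fact that the upgrade-backup-YYYY-MM-DD_HHMMSS naming sorts lexicographically in chronological order:

```shell
# Pick the newest upgrade-backup directory by its sortable name.
# Hypothetical helper, not part of the operator.
latest_backup() {
  # $1: backup root, e.g. /etc/kubernetes/cluster-backup
  ls -1 "$1" | grep '^upgrade-backup-' | sort | tail -n 1
}
```

On the node above, `latest_backup /etc/kubernetes/cluster-backup` would have returned the real directory, upgrade-backup-2021-09-15_100535, regardless of what the condition message claimed.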
Thank you for the report. We can look into this as a 4.9.z fix; I would not consider it blocking the release, as the goal is for a functional backup to exist. Thank you for working with us on automation.
Thanks. I just opened a bz [1] to track the filename issue.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2004451

I'm automating my test case using QE's internal framework, and I need to build the Jenkins job manually. Do you have any test plan for this feature? Do we need to test every 4.8 nightly release?
> there is some info about "version: 4.9.0-rc.1" in clusterversion, could you please double-check if it's an issue?

It looks correct. The history shows the cluster is running 4.8.0-0.nightly-2021-09-15-162303 and is being upgraded to 4.9.0-rc.1, but the upgrade has not completed yet. The desired version 4.9.0-rc.1 indicates the cluster is expected to be upgraded to 4.9.0-rc.1 whenever it is not identical with the current version.

By the way, I'm curious how you restored with the backup.
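On the restore question: per the OpenShift documentation, a restore is performed on a control-plane node by pointing /usr/local/bin/cluster-restore.sh at the backup directory, which is expected to hold an etcd snapshot (snapshot_*.db) and a static-pod resources archive (static_kuberesources_*.tar.gz). A quick sanity-check sketch (the helper name is ours, not part of the restore tooling):

```shell
# Sanity-check that a backup directory looks complete before attempting
# a restore. Per OpenShift docs, a backup holds a snapshot_*.db etcd
# snapshot and a static_kuberesources_*.tar.gz archive. Sketch only;
# the real restore is done with:
#   sudo /usr/local/bin/cluster-restore.sh <backup-dir>
backup_complete() {
  # $1: backup directory
  ls "$1"/snapshot_*.db >/dev/null 2>&1 && \
  ls "$1"/static_kuberesources_*.tar.gz >/dev/null 2>&1
}
```

Checking this up front avoids starting a restore against the wrong (or the mis-reported) backup directory, which matters given the filename discrepancy noted earlier in this bug.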
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.8.12 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3511