Description of problem:

During a serial upgrade from v4.9 -> v4.10 -> v4.11, the upgrade-triggered etcd backup is skipped during the v4.10 -> v4.11 upgrade because RecentBackup was already set to True during the first upgrade. There should be two backups since there are two upgrades (v4.9 -> v4.10, then v4.10 -> v4.11), but there is only one backup file, taken on v4.9.

sh-4.4# ls -la /etc/kubernetes/cluster-backup
total 0
drwxr-xr-x. 3 root root  46 Apr 28 06:11 .
drwxr-xr-x. 7 root root 224 Apr 28 06:11 ..
drwxr-xr-x. 2 root root 132 Apr 28 06:11 upgrade-backup-2022-04-28_061106

No cluster-backup pod.
# ./oc -n openshift-etcd get pod|grep backup
#

Version-Release number of selected component (if applicable):
v4.10.12

How reproducible:
always

Steps to Reproduce:
1. Install v4.9.30

2. Upgrade v4.9.30 to v4.10.12; during this upgrade, etcd is backed up successfully before the upgrade.

# ./oc get clusterversion -ojson|jq .items[].status.history
[
  {
    "completionTime": "2022-04-28T07:17:11Z",
    "image": "quay.io/openshift-release-dev/ocp-release@sha256:f77f4f75c1e1a4ddd0a0355f298a834db3473fd9ca473235013e9419d1df16db",
    "startedTime": "2022-04-28T06:10:43Z",
    "state": "Completed",
    "verified": true,
    "version": "4.10.12"
  },
  {
    "completionTime": "2022-04-28T03:50:36Z",
    "image": "quay.io/openshift-release-dev/ocp-release@sha256:af9369a6d57f40457440dd730f7f6a640837cb5ced64231b43b39fb2e5835fa6",
    "startedTime": "2022-04-28T03:29:55Z",
    "state": "Completed",
    "verified": false,
    "version": "4.9.30"
  }
]

# ./oc get co etcd -ojson|jq .status.conditions[-1]
{
  "lastTransitionTime": "2022-04-28T06:11:13Z",
  "message": "UpgradeBackup pre 4.9 located at path /etc/kubernetes/cluster-backup/upgrade-backup-2022-04-28_061106 on node \"jliu49-kpt7q-master-2.c.openshift-qe.internal\"",
  "reason": "UpgradeBackupSuccessful",
  "status": "True",
  "type": "RecentBackup"
}

3. Continue the upgrade to v4.11.0-0.nightly-2022-04-26-181148. Only the old backup file for v4.9 is still present, and the condition is unchanged.

# ./oc get co etcd -ojson|jq .status.conditions[-1]
{
  "lastTransitionTime": "2022-04-28T06:11:13Z",
  "message": "UpgradeBackup pre 4.9 located at path /etc/kubernetes/cluster-backup/upgrade-backup-2022-04-28_061106 on node \"jliu49-kpt7q-master-2.c.openshift-qe.internal\"",
  "reason": "UpgradeBackupSuccessful",
  "status": "True",
  "type": "RecentBackup"
}

Actual results:
The upgrade-triggered etcd backup is skipped when a backup from a previous upgrade already exists.

Expected results:
An upgrade-triggered etcd backup should be taken before each upgrade.

Additional info:
Not sure whether the CVO side or the etcd side is the right place for a fix, so we are tracking the bug against the CVO first. After discussing with etcd QE: at the very least, a new backup file should replace the old one.
Reading through [1], I'm not actually noticing anything that moves RecentBackup from True to False after a previous, successful backup is no longer considered "recent". I don't know what the freshness threshold should be (minutes? hours?), but I think it's up to etcd to make that call and set RecentBackup=False (while still keeping the condition around to point at the now-stale backup? Or removing the condition?) when they decide the backup isn't fresh enough. If that ends up allowing:

1. Cluster is on 4.9
2. Requested update to 4.10
3. etcd takes a backup
4. Update to 4.10
5. Requested update to 4.11
6. Update to 4.11
7. etcd decides the 4.9 backup is stale, and sets RecentBackup=False

that's fine with me. In the event of a disaster, the user can restore the backup from step 3 and repeat as many of the subsequent steps as they like. The value of a second snapshot at step 5 doesn't seem all that high, since it would just pick up an hour or so of the step 4 activity. A sketch of what such a staleness check could look like is below.

[1]: https://github.com/openshift/cluster-etcd-operator/blob/95049e93f7acb4bd9ca7c684702390671e7a1371/pkg/operator/upgradebackupcontroller/upgradebackupcontroller.go
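To illustrate the kind of staleness check meant above (this is not code from the actual upgradebackupcontroller; the threshold, reason string, and condition struct are all made up for the example):

// Hypothetical sketch: one way the operator could decide a previously
// successful backup is no longer "recent" and flip RecentBackup back to
// False, forcing a fresh backup on the next requested upgrade.
package main

import (
	"fmt"
	"time"
)

// condition mirrors the minimal fields of an operator status condition.
type condition struct {
	Type               string
	Status             string // "True" or "False"
	Reason             string
	Message            string
	LastTransitionTime time.Time
}

// backupStalenessThreshold is an assumed freshness window; as noted above,
// the real value (minutes? hours?) would be etcd's call.
const backupStalenessThreshold = 2 * time.Hour

// refreshRecentBackup returns an updated condition if the last successful
// backup is older than the threshold, otherwise the condition unchanged.
func refreshRecentBackup(c condition, now time.Time) condition {
	if c.Type != "RecentBackup" || c.Status != "True" {
		return c
	}
	if now.Sub(c.LastTransitionTime) < backupStalenessThreshold {
		return c
	}
	return condition{
		Type:               "RecentBackup",
		Status:             "False",
		Reason:             "UpgradeBackupStale", // hypothetical reason
		Message:            "previous upgrade backup is older than " + backupStalenessThreshold.String(),
		LastTransitionTime: now,
	}
}

func main() {
	old := condition{
		Type:               "RecentBackup",
		Status:             "True",
		Reason:             "UpgradeBackupSuccessful",
		LastTransitionTime: time.Date(2022, 4, 28, 6, 11, 13, 0, time.UTC),
	}
	fmt.Printf("%+v\n", refreshRecentBackup(old, time.Now()))
}

With something like this running in the controller sync loop, the second requested upgrade would find RecentBackup=False and take a new backup.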
May I ask how you upgraded to 4.11? Did you use the `--force` flag? If you used `--force`, no backup will be taken. @jiajliu
(In reply to melbeher from comment #2)
> May I ask how you upgraded to 4.11? Did you use the `--force` flag?
>
> If you used `--force`, no backup will be taken.
>
> @jiajliu

No `--force`.
I raised a fix here: https://github.com/openshift/cluster-etcd-operator/pull/835
Please test extensively. @geliu
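For anyone reviewing: I have not verified how PR 835 actually implements this, but the general shape of such a fix is to stop treating RecentBackup=True as sufficient on its own, and to skip the backup only if the last one was taken for the currently requested target version. A rough, self-contained Go sketch of that idea (all names are illustrative, not taken from the PR):

// Sketch only; not the actual cluster-etcd-operator code.
package main

import "fmt"

type backupRecord struct {
	TargetVersion string // desired version the backup was taken for, e.g. "4.10.12"
	Path          string
}

// needsNewBackup reports whether an upgrade to desiredVersion should trigger
// a fresh etcd backup, given the record of the last upgrade-triggered backup.
func needsNewBackup(last *backupRecord, desiredVersion string) bool {
	if last == nil {
		return true // never backed up for an upgrade
	}
	// A backup taken for an earlier upgrade (4.9 -> 4.10) does not count
	// for a later one (4.10 -> 4.11).
	return last.TargetVersion != desiredVersion
}

func main() {
	last := &backupRecord{
		TargetVersion: "4.10.12",
		Path:          "/etc/kubernetes/cluster-backup/upgrade-backup-2022-04-28_061106",
	}
	fmt.Println(needsNewBackup(last, "4.10.12")) // false: backup already taken for this upgrade
	fmt.Println(needsNewBackup(last, "4.11.0"))  // true: new target, take another backup
}

In the serial-upgrade case from the description, the record from the 4.9 -> 4.10 backup would not match the 4.11 target, so a second backup would be taken.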
Hello melbeher, this bug is fixed in 4.11, so I suppose it has not been fixed in 4.10, correct? Will it be backported to 4.10?
The status condition on the CVO is listed in Comment 6. @melbeher
So the backup status should be in the CEO (cluster-etcd-operator) conditions, not the CVO. You can see this in Comment #1.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069
Errata shipped; presumably all the NEEDINFO were addressed :)