Description of problem:

Using channel candidate-4.10 and upgrading without force, the CVO reports that the upgrade is unsafe for about 4 minutes before proceeding:

[root@ip-172-31-18-207 tmp]# oc adm upgrade
Cluster version is 4.9.23

Upstream is unset, so the cluster will use an appropriate default.
Channel: candidate-4.10 (available channels: candidate-4.10, candidate-4.9, fast-4.9)

Recommended updates:

  VERSION       IMAGE
  4.10.2        quay.io/openshift-release-dev/ocp-release@sha256:b807908afe1b7b25a8727f092cc11a104bf3996908a80f035d079b014b4b96b6
  4.10.1        quay.io/openshift-release-dev/ocp-release@sha256:95a31c7227c105311eb9f461972d5db90fa22270f017ac6eef4d1f95d3e1675c
  4.10.0        quay.io/openshift-release-dev/ocp-release@sha256:a63c470411c087c0568729fa5faf32cd8d5fa1db3c73e28f3989f17f5c458351
  4.10.0-rc.8   quay.io/openshift-release-dev/ocp-release@sha256:65e8dbc576556e0296e29ac1ae7496743e3494ad65a111e134c6e7202a498d11
  4.10.0-rc.7   quay.io/openshift-release-dev/ocp-release@sha256:a1e6c5ad9f3e7a6faa93508d7c1458f3fcd13869fdb1ac6e2dab7cc0f9269847
  4.10.0-rc.6   quay.io/openshift-release-dev/ocp-release@sha256:aed2a6eb5c99da354dfd4182f0fcdf7b8bc1c0c7db356644cf1fc669b1052012
  4.10.0-rc.5   quay.io/openshift-release-dev/ocp-release@sha256:5ebccbbcbeeb3987f9ee3fa7f593c5e24bdd89efd9c6b2e6d525346e4a235ced

[root@ip-172-31-18-207 tmp]# oc adm upgrade --to 4.10.2
Updating to 4.10.2

[root@ip-172-31-18-207 tmp]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.23    True        True          8s      Working towards 4.10.2: downloading update

[root@ip-172-31-18-207 tmp]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.23    True        True          15s     Working towards 4.10.2: downloading update

[root@ip-172-31-18-207 tmp]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.23    True        True          25s     Unable to apply 4.10.2: it may not be safe to apply this update

[root@ip-172-31-18-207 tmp]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.23    True        True          72s     Unable to apply 4.10.2: it may not be safe to apply this update

[root@ip-172-31-18-207 tmp]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.23    True        True          2m41s   Unable to apply 4.10.2: it may not be safe to apply this update

[root@ip-172-31-18-207 tmp]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.23    True        True          4m22s   Working towards 4.10.2: 94 of 770 done (12% complete)

While reporting that the upgrade is not safe, oc get -o yaml clusterversion shows this status:

  conditions:
  - lastTransitionTime: "2022-03-07T14:38:31Z"
    message: Done applying 4.9.23
    status: "True"
    type: Available
  - lastTransitionTime: "2022-03-07T15:20:31Z"
    message: 'Precondition "EtcdRecentBackup" failed because of "ControllerStarted": '
    reason: UpgradePreconditionCheckFailed
    status: "True"
    type: Failing
  - lastTransitionTime: "2022-03-07T15:20:11Z"
    message: 'Unable to apply 4.10.2: it may not be safe to apply this update'
    reason: UpgradePreconditionCheckFailed
    status: "True"
    type: Progressing
  - lastTransitionTime: "2022-03-07T15:19:37Z"
    status: "True"
    type: RetrievedUpdates

There is no end-user-facing indication of why the upgrade is not proceeding.

Version-Release number of the following components: 4.10.2

How reproducible: Always

Steps to Reproduce:
1. Install 4.9.23
2. Set the CVO channel to candidate-4.10
3. oc adm upgrade --to 4.10.2
4. oc get clusterversion several times
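Until the CLI surfaces this, the detailed reason is recoverable from the Failing condition in ClusterVersion status. A minimal client-side sketch (the dict below just mirrors the YAML conditions shown above; this is not an oc feature, only illustrative filtering):

```python
import json

# Sample ClusterVersion status, mirroring the conditions reported above.
status = json.loads("""
{
  "conditions": [
    {"type": "Available", "status": "True", "message": "Done applying 4.9.23"},
    {"type": "Failing", "status": "True",
     "reason": "UpgradePreconditionCheckFailed",
     "message": "Precondition \\"EtcdRecentBackup\\" failed because of \\"ControllerStarted\\": "},
    {"type": "Progressing", "status": "True",
     "reason": "UpgradePreconditionCheckFailed",
     "message": "Unable to apply 4.10.2: it may not be safe to apply this update"}
  ]
}
""")

def failing_detail(status):
    """Return the Failing=True condition's message, which carries the real reason."""
    for cond in status["conditions"]:
        if cond["type"] == "Failing" and cond["status"] == "True":
            return cond["message"]
    return None

print(failing_detail(status))
```

Against a live cluster the equivalent one-liner would be something like `oc get clusterversion version -o jsonpath='{.status.conditions[?(@.type=="Failing")].message}'`.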
(In reply to Mike Fiedler from comment #0)
> [root@ip-172-31-18-207 tmp]# oc get clusterversion
> NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
> version   4.9.23    True        True          25s     Unable to apply 4.10.2: it may not be safe to apply this update
> ...
> message: 'Precondition "EtcdRecentBackup" failed because of "ControllerStarted": '
> reason: UpgradePreconditionCheckFailed
> status: "True"
> ...
> There is no end-user facing indication why the upgrade is not proceeding.

This is the CVO trying to "simplify" error messages by making assumptions based on the reason. In this case, precondition failures [1], which were once mostly about verifying signatures on the target release, but which now include things like "has etcd completed a pre-minor-bump snapshot?" [2]. I think we should drop that attempt at simplification and just pass through the underlying message. But either way, this is cosmetic, and certainly not a blocker.

[1]: https://github.com/openshift/cluster-version-operator/blob/77a24f77b8f6e9975b5f41a787d6368ff609e391/pkg/payload/task.go#L242-L243
[2]: https://github.com/openshift/cluster-version-operator/pull/637
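For illustration only (this is not the CVO's actual Go code at [1], just a sketch of the behavior it describes), the current "simplify" path and the proposed pass-through boil down to:

```python
GENERIC = "it may not be safe to apply this update"

def summarize(version, reason, detail, passthrough=False):
    """Build the user-facing message for an update error.

    passthrough=False mimics the current behavior: precondition failures
    collapse into a generic warning.  passthrough=True is the proposed fix:
    keep the underlying detail visible to the end user.
    """
    if reason == "UpgradePreconditionCheckFailed" and not passthrough:
        return f"Unable to apply {version}: {GENERIC}"
    return f"Unable to apply {version}: {detail}"

detail = 'Precondition "EtcdRecentBackup" failed because of "ControllerStarted"'
print(summarize("4.10.2", "UpgradePreconditionCheckFailed", detail))
print(summarize("4.10.2", "UpgradePreconditionCheckFailed", detail, passthrough=True))
```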
Confirming the same UX on 4.8 -> 4.9 with a cluster-bot cluster:

$ oc get -o jsonpath='{.status.desired.version}{"\n"}' clusterversion version
4.8.33
$ oc adm upgrade channel fast-4.9
$ oc -n openshift-config patch cm admin-acks --patch '{"data":{"ack-4.8-kube-1.22-api-removals-in-4.9":"true"}}' --type=merge

In one terminal, poll state:

$ while sleep 5; do date --iso=s --utc; echo "ClusterVersion:"; oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'; echo; echo "etcd ClusterOperator"; oc get -o json clusteroperator etcd | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'; echo; done

In another, trigger the update:

$ oc adm upgrade --to 4.9.23

In the polling terminal, before the update, things looked like:

2022-03-07T18:53:50+00:00
ClusterVersion:
2022-03-07T18:48:14Z Available=True : Done applying 4.8.33
2022-03-07T18:47:44Z Failing=False :
2022-03-07T18:53:49Z Progressing=True DownloadingUpdate: Working towards 4.9.23: downloading update
2022-03-07T18:52:56Z RetrievedUpdates=True :

etcd ClusterOperator
2022-03-07T18:30:10Z RecentBackup=Unknown ControllerStarted:
2022-03-07T18:32:47Z Degraded=False AsExpected: NodeControllerDegraded: All master nodes are ready
EtcdMembersDegraded: No unhealthy members found
2022-03-07T18:40:16Z Progressing=False AsExpected: NodeInstallerProgressing: 3 nodes are at revision 3
EtcdMembersProgressing: No unstarted etcd members found
2022-03-07T18:31:57Z Available=True AsExpected: StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 3
EtcdMembersAvailable: 3 members are available
2022-03-07T18:30:11Z Upgradeable=True AsExpected: All is well

As we got into the update:

2022-03-07T18:53:56+00:00
ClusterVersion:
2022-03-07T18:48:14Z Available=True : Done applying 4.8.33
2022-03-07T18:47:44Z Failing=False :
2022-03-07T18:53:49Z Progressing=True DownloadingUpdate: Working towards 4.9.23: downloading update
2022-03-07T18:52:56Z RetrievedUpdates=True :

etcd ClusterOperator
2022-03-07T18:30:10Z RecentBackup=Unknown ControllerStarted:
2022-03-07T18:32:47Z Degraded=False AsExpected: NodeControllerDegraded: All master nodes are ready
EtcdMembersDegraded: No unhealthy members found
2022-03-07T18:40:16Z Progressing=False AsExpected: NodeInstallerProgressing: 3 nodes are at revision 3
EtcdMembersProgressing: No unstarted etcd members found
2022-03-07T18:31:57Z Available=True AsExpected: StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 3
EtcdMembersAvailable: 3 members are available
2022-03-07T18:30:11Z Upgradeable=True AsExpected: All is well

The CVO notices the etcd state, and complains about it the same way it does in 4.9 -> 4.10:

2022-03-07T18:54:01+00:00
ClusterVersion:
2022-03-07T18:48:14Z Available=True : Done applying 4.8.33
2022-03-07T18:53:59Z Failing=True UpgradePreconditionCheckFailed: Precondition "EtcdRecentBackup" failed because of "ControllerStarted":
2022-03-07T18:53:49Z Progressing=True UpgradePreconditionCheckFailed: Unable to apply 4.9.23: it may not be safe to apply this update
2022-03-07T18:52:56Z RetrievedUpdates=True :

etcd ClusterOperator
2022-03-07T18:30:10Z RecentBackup=Unknown UpgradeBackupInProgress: Backup pod phase: "Running"
2022-03-07T18:32:47Z Degraded=False AsExpected: NodeControllerDegraded: All master nodes are ready
EtcdMembersDegraded: No unhealthy members found
2022-03-07T18:40:16Z Progressing=False AsExpected: NodeInstallerProgressing: 3 nodes are at revision 3
EtcdMembersProgressing: No unstarted etcd members found
2022-03-07T18:31:57Z Available=True AsExpected: StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 3
EtcdMembersAvailable: 3 members are available
2022-03-07T18:30:11Z Upgradeable=True AsExpected: All is well

The etcd operator completes the backup at 18:54:05Z, and 14s later the CVO still hasn't noticed and is still talking about ControllerStarted:

2022-03-07T18:54:19+00:00
ClusterVersion:
2022-03-07T18:48:14Z Available=True : Done applying 4.8.33
2022-03-07T18:53:59Z Failing=True UpgradePreconditionCheckFailed: Precondition "EtcdRecentBackup" failed because of "ControllerStarted":
2022-03-07T18:53:49Z Progressing=True UpgradePreconditionCheckFailed: Unable to apply 4.9.23: it may not be safe to apply this update
2022-03-07T18:52:56Z RetrievedUpdates=True :

etcd ClusterOperator
2022-03-07T18:54:05Z RecentBackup=True UpgradeBackupSuccessful: UpgradeBackup pre 4.9 located at path /etc/kubernetes/cluster-backup/upgrade-backup-2022-03-07_185405 on node "ci-ln-y1rj17k-72292-4d9v8-master-1"
2022-03-07T18:32:47Z Degraded=False AsExpected: NodeControllerDegraded: All master nodes are ready
EtcdMembersDegraded: No unhealthy members found
2022-03-07T18:40:16Z Progressing=False AsExpected: NodeInstallerProgressing: 3 nodes are at revision 3
EtcdMembersProgressing: No unstarted etcd members found
2022-03-07T18:31:57Z Available=True AsExpected: StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 3
EtcdMembersAvailable: 3 members are available
2022-03-07T18:30:11Z Upgradeable=True AsExpected: All is well

But 19s after etcd has completed the backup, the CVO has accepted the update, and is getting on with reconciling the incoming manifests:

2022-03-07T18:54:24+00:00
ClusterVersion:
2022-03-07T18:48:14Z Available=True : Done applying 4.8.33
2022-03-07T18:54:23Z Failing=False :
2022-03-07T18:53:49Z Progressing=True : Working towards 4.9.23: 9 of 738 done (1% complete)
2022-03-07T18:52:56Z RetrievedUpdates=True :

etcd ClusterOperator
2022-03-07T18:54:05Z RecentBackup=True UpgradeBackupSuccessful: UpgradeBackup pre 4.9 located at path /etc/kubernetes/cluster-backup/upgrade-backup-2022-03-07_185405 on node "ci-ln-y1rj17k-72292-4d9v8-master-1"
2022-03-07T18:32:47Z Degraded=False AsExpected: NodeControllerDegraded: All master nodes are ready
EtcdMembersDegraded: No unhealthy members found
2022-03-07T18:40:16Z Progressing=False AsExpected: NodeInstallerProgressing: 3 nodes are at revision 3
EtcdMembersProgressing: No unstarted etcd members found
2022-03-07T18:31:57Z Available=True AsExpected: StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 3
EtcdMembersAvailable: 3 members are available
2022-03-07T18:30:11Z Upgradeable=True AsExpected: All is well

So if folks feel like a bit of time where the CVO is saying "it may not be safe to apply this update" while it waits for the etcd backup is worth release notes, we may want to add those release notes to 4.9 as well. Personally, I think the risk of confusion here is small, and I'd rather fix the bug, instead of trying to explain this transient, mis-reported string in release notes.
Created attachment 1864435 [details]
logs from the outgoing CVO pod

And checking 4.9 -> 4.10 again:

$ oc get -o jsonpath='{.status.desired.version}{"\n"}' clusterversion version
4.9.23
$ oc adm upgrade channel candidate-4.10

In one terminal:

$ while sleep 5; do date --iso=s --utc; echo "ClusterVersion:"; oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'; echo; echo "etcd ClusterOperator"; oc get -o json clusteroperator etcd | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'; echo; done

In another:

$ oc adm upgrade --to 4.10.2
Updating to 4.10.2

Excerpts from the monitor. Before the update:

2022-03-07T19:58:19+00:00
ClusterVersion:
2022-03-07T19:57:23Z RetrievedUpdates=True :
2022-03-07T19:36:32Z Available=True : Done applying 4.9.23
2022-03-07T19:36:32Z Failing=False :
2022-03-07T19:37:02Z Progressing=False : Cluster version is 4.9.23

etcd ClusterOperator
2022-03-07T19:19:46Z RecentBackup=Unknown ControllerStarted:
2022-03-07T19:19:46Z Degraded=False AsExpected: NodeControllerDegraded: All master nodes are ready
EtcdMembersDegraded: No unhealthy members found
2022-03-07T19:29:39Z Progressing=False AsExpected: NodeInstallerProgressing: 3 nodes are at revision 3
EtcdMembersProgressing: No unstarted etcd members found
2022-03-07T19:21:42Z Available=True AsExpected: StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 3
EtcdMembersAvailable: 3 members are available
2022-03-07T19:20:58Z Upgradeable=True AsExpected: All is well

CVO downloading the new release image:

2022-03-07T19:58:25+00:00
ClusterVersion:
2022-03-07T19:57:23Z RetrievedUpdates=True :
2022-03-07T19:36:32Z Available=True : Done applying 4.9.23
2022-03-07T19:36:32Z Failing=False :
2022-03-07T19:58:26Z Progressing=True DownloadingUpdate: Working towards 4.10.2: downloading update

etcd ClusterOperator
2022-03-07T19:19:46Z RecentBackup=Unknown ControllerStarted:
2022-03-07T19:19:46Z Degraded=False AsExpected: NodeControllerDegraded: All master nodes are ready
EtcdMembersDegraded: No unhealthy members found
2022-03-07T19:29:39Z Progressing=False AsExpected: NodeInstallerProgressing: 3 nodes are at revision 3
EtcdMembersProgressing: No unstarted etcd members found
2022-03-07T19:21:42Z Available=True AsExpected: StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 3
EtcdMembersAvailable: 3 members are available
2022-03-07T19:20:58Z Upgradeable=True AsExpected: All is well

CVO is waiting on etcd, and etcd is waiting on the backup:

2022-03-07T19:58:48+00:00
ClusterVersion:
2022-03-07T19:57:23Z RetrievedUpdates=True :
2022-03-07T19:36:32Z Available=True : Done applying 4.9.23
2022-03-07T19:58:47Z Failing=True UpgradePreconditionCheckFailed: Precondition "EtcdRecentBackup" failed because of "ControllerStarted":
2022-03-07T19:58:26Z Progressing=True UpgradePreconditionCheckFailed: Unable to apply 4.10.2: it may not be safe to apply this update

etcd ClusterOperator
2022-03-07T19:19:46Z RecentBackup=Unknown UpgradeBackupInProgress: Backup pod phase: "Pending"
2022-03-07T19:19:46Z Degraded=False AsExpected: NodeControllerDegraded: All master nodes are ready
EtcdMembersDegraded: No unhealthy members found
2022-03-07T19:29:39Z Progressing=False AsExpected: NodeInstallerProgressing: 3 nodes are at revision 3
EtcdMembersProgressing: No unstarted etcd members found
2022-03-07T19:21:42Z Available=True AsExpected: StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 3
EtcdMembersAvailable: 3 members are available
2022-03-07T19:20:58Z Upgradeable=True AsExpected: All is well

etcd completes the backup, CVO hasn't noticed yet:

2022-03-07T19:58:53+00:00
ClusterVersion:
2022-03-07T19:57:23Z RetrievedUpdates=True :
2022-03-07T19:36:32Z Available=True : Done applying 4.9.23
2022-03-07T19:58:47Z Failing=True UpgradePreconditionCheckFailed: Precondition "EtcdRecentBackup" failed because of "ControllerStarted":
2022-03-07T19:58:26Z Progressing=True UpgradePreconditionCheckFailed: Unable to apply 4.10.2: it may not be safe to apply this update

etcd ClusterOperator
2022-03-07T19:58:51Z RecentBackup=True UpgradeBackupSuccessful: UpgradeBackup pre 4.9 located at path /etc/kubernetes/cluster-backup/upgrade-backup-2022-03-07_195848 on node "ci-ln-5cjbzj2-72292-9gkd5-master-0"
2022-03-07T19:19:46Z Degraded=False AsExpected: NodeControllerDegraded: All master nodes are ready
EtcdMembersDegraded: No unhealthy members found
2022-03-07T19:29:39Z Progressing=False AsExpected: NodeInstallerProgressing: 3 nodes are at revision 3
EtcdMembersProgressing: No unstarted etcd members found
2022-03-07T19:21:42Z Available=True AsExpected: StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 3
EtcdMembersAvailable: 3 members are available
2022-03-07T19:20:58Z Upgradeable=True AsExpected: All is well

And two and a half minutes after etcd completes the backup, the CVO still hasn't noticed:

2022-03-07T20:01:29+00:00
ClusterVersion:
2022-03-07T19:57:23Z RetrievedUpdates=True :
2022-03-07T19:36:32Z Available=True : Done applying 4.9.23
2022-03-07T19:58:47Z Failing=True UpgradePreconditionCheckFailed: Precondition "EtcdRecentBackup" failed because of "ControllerStarted":
2022-03-07T19:58:26Z Progressing=True UpgradePreconditionCheckFailed: Unable to apply 4.10.2: it may not be safe to apply this update

etcd ClusterOperator
2022-03-07T19:58:51Z RecentBackup=True UpgradeBackupSuccessful: UpgradeBackup pre 4.9 located at path /etc/kubernetes/cluster-backup/upgrade-backup-2022-03-07_195848 on node "ci-ln-5cjbzj2-72292-9gkd5-master-0"
2022-03-07T19:19:46Z Degraded=False AsExpected: NodeControllerDegraded: All master nodes are ready
EtcdMembersDegraded: No unhealthy members found
2022-03-07T19:29:39Z Progressing=False AsExpected: NodeInstallerProgressing: 3 nodes are at revision 3
EtcdMembersProgressing: No unstarted etcd members found
2022-03-07T19:21:42Z Available=True AsExpected: StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 3
EtcdMembersAvailable: 3 members are available
2022-03-07T19:20:58Z Upgradeable=True AsExpected: All is well

But once it does notice, it starts rolling out the incoming manifests:

2022-03-07T20:01:34+00:00
ClusterVersion:
2022-03-07T19:57:23Z RetrievedUpdates=True :
2022-03-07T19:36:32Z Available=True : Done applying 4.9.23
2022-03-07T20:01:31Z Failing=False :
2022-03-07T19:58:26Z Progressing=True : Working towards 4.10.2: 5 of 770 done (0% complete)

etcd ClusterOperator
2022-03-07T19:58:51Z RecentBackup=True UpgradeBackupSuccessful: UpgradeBackup pre 4.9 located at path /etc/kubernetes/cluster-backup/upgrade-backup-2022-03-07_195848 on node "ci-ln-5cjbzj2-72292-9gkd5-master-0"
2022-03-07T19:19:46Z Degraded=False AsExpected: NodeControllerDegraded: All master nodes are ready
EtcdMembersDegraded: No unhealthy members found
2022-03-07T19:29:39Z Progressing=False AsExpected: NodeInstallerProgressing: 3 nodes are at revision 3
EtcdMembersProgressing: No unstarted etcd members found
2022-03-07T19:21:42Z Available=True AsExpected: StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 3
EtcdMembersAvailable: 3 members are available
2022-03-07T19:20:58Z Upgradeable=True AsExpected: All is well

Events included more detail:

$ oc -n openshift-cluster-version get events
...
91s   Warning   PreconditionsFailed   clusterversion/version   preconditions failed for payload loaded version="4.10.2" image="quay.io/openshift-release-dev/ocp-release@sha256:b807908afe1b7b25a8727f092cc11a104bf3996908a80f035d079b014b4b96b6" failures=Precondition "EtcdRecentBackup" failed because of "ControllerStarted":

I'm attaching CVO logs that I streamed down from the outgoing CVO pod:

$ oc -n openshift-cluster-version logs -f cluster-version-operator-78d979db4-gsgvw >cvo.log
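Since the events carry the full precondition detail, filtering on reason=PreconditionsFailed recovers the real blocker even while the conditions show only the generic string. A hypothetical client-side sketch over a sample event shaped like the output above:

```python
# Sample event list mirroring the PreconditionsFailed warning shown above.
events = [
    {"type": "Warning",
     "reason": "PreconditionsFailed",
     "message": ('preconditions failed for payload loaded version="4.10.2" '
                 'failures=Precondition "EtcdRecentBackup" failed because of '
                 '"ControllerStarted": ')},
    {"type": "Normal",
     "reason": "RetrievePayload",
     "message": "retrieving payload"},
]

def precondition_failures(events):
    """Pull the detailed failure text out of PreconditionsFailed warning events."""
    return [e["message"] for e in events
            if e["type"] == "Warning" and e["reason"] == "PreconditionsFailed"]

for msg in precondition_failures(events):
    print(msg)
```

The second, Normal event here is an invented placeholder just to show the filter discarding non-matching entries.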
Moving the Version field back, although I'll pick 4.2.0. The CVO has rolled messages up under "it may not be safe to apply this update" since 2019 [1].

[1]: https://github.com/openshift/cluster-version-operator/blame/77a24f77b8f6e9975b5f41a787d6368ff609e391/pkg/payload/task.go#L243
Listing the situations we have observed that cause "it may not be safe to apply this update":

1. Resource deletion is in progress at the time of the upgrade (this sometimes blocks upgrades from 4.9 to 4.10 because [1] has not been reintroduced)
2. A machine config pool is updating at the time of the upgrade
3. Etcd is backing up at the time of the upgrade
4. The target image is unsigned

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1822752
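If the underlying messages were passed through, a client could additionally map precondition names to plain-language hints about why the update is paused. A purely illustrative sketch: only "EtcdRecentBackup" is confirmed by this bug; the second key is a hypothetical name, not a real CVO precondition:

```python
# Illustrative lookup, not part of any OpenShift API.
EXPLANATIONS = {
    # Observed in this bug report:
    "EtcdRecentBackup": ("etcd is still taking its automatic pre-update backup; "
                         "the update proceeds once the backup completes"),
    # Hypothetical entry for the unsigned-image case (name is invented):
    "ReleaseImageSignature": "the target release image signature could not be verified",
}

def explain(precondition):
    """Translate a precondition name into a plain-language hint."""
    return EXPLANATIONS.get(precondition, "unrecognized precondition: " + precondition)

print(explain("EtcdRecentBackup"))
```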
Although "EtcdRecentBackup" is different from our other preconditions - it's an ongoing automated process that's expected to complete and resolve the precondition failure - the message "it may not be safe to apply this update" still applies until the etcd backup has completed.

Closing this as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2065507, which enhances `oc adm upgrade` to provide amplifying information about upgrade status.

*** This bug has been marked as a duplicate of bug 2065507 ***