Bug 2061444
Summary: CVO reports "it may not be safe to apply this update" for all precondition check failures
Product: OpenShift Container Platform
Reporter: Mike Fiedler <mifiedle>
Component: Cluster Version Operator
Assignee: Over the Air Updates <aos-team-ota>
Status: CLOSED DUPLICATE
QA Contact: Yang Yang <yanyang>
Severity: medium
Priority: medium
Version: 4.2.0
CC: aos-bugs, erich, jack.ottofaro, lmohanty, omichael, travi, wking, yanyang
Target Milestone: ---
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Doc Type: Release Note
Doc Text: CVO reports "it may not be safe to apply this update" for all precondition check failures. The precondition checks are performed before the actual update starts, and the update moves ahead once the precondition checks pass.
Story Points: ---
Clones: 2061496 (view as bug list)
Last Closed: 2022-04-26 18:26:05 UTC
Type: Bug
Description (Mike Fiedler, 2022-03-07 15:34:13 UTC)
(In reply to Mike Fiedler from comment #0)
> [root@ip-172-31-18-207 tmp]# oc get clusterversion
> NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
> version   4.9.23    True        True          25s     Unable to apply
> 4.10.2: it may not be safe to apply this update
> ...
>     message: 'Precondition "EtcdRecentBackup" failed because of
>     "ControllerStarted": '
>     reason: UpgradePreconditionCheckFailed
>     status: "True"
> ...
> There is no end-user facing indication why the upgrade is not proceeding.
This is the CVO trying to "simplify" error messages by making assumptions based on the reason. In this case, precondition failures [1], which were once mostly about verifying signatures on the target release, but which now include things like "has etcd completed a pre-minor-bump snapshot?" [2]. I think we should drop that attempt at simplification, and just pass through the underlying message. But either way, this is cosmetic, and certainly not a blocker.
[1]: https://github.com/openshift/cluster-version-operator/blob/77a24f77b8f6e9975b5f41a787d6368ff609e391/pkg/payload/task.go#L242-L243
[2]: https://github.com/openshift/cluster-version-operator/pull/637
Confirming the same UX on 4.8 -> 4.9 with a cluster-bot cluster:
$ oc get -o jsonpath='{.status.desired.version}{"\n"}' clusterversion version
4.8.33
$ oc adm upgrade channel fast-4.9
$ oc -n openshift-config patch cm admin-acks --patch '{"data":{"ack-4.8-kube-1.22-api-removals-in-4.9":"true"}}' --type=merge
In one terminal, poll state:
$ while sleep 5; do date --iso=s --utc; echo "ClusterVersion:"; oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'; echo; echo "etcd ClusterOperator"; oc get -o json clusteroperator etcd | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'; echo; done
In another, trigger the update:
$ oc adm upgrade --to 4.9.23
In the polling terminal, before the update, things looked like:
2022-03-07T18:53:50+00:00
ClusterVersion:
2022-03-07T18:48:14Z Available=True : Done applying 4.8.33
2022-03-07T18:47:44Z Failing=False :
2022-03-07T18:53:49Z Progressing=True DownloadingUpdate: Working towards 4.9.23: downloading update
2022-03-07T18:52:56Z RetrievedUpdates=True :
etcd ClusterOperator
2022-03-07T18:30:10Z RecentBackup=Unknown ControllerStarted:
2022-03-07T18:32:47Z Degraded=False AsExpected: NodeControllerDegraded: All master nodes are ready
EtcdMembersDegraded: No unhealthy members found
2022-03-07T18:40:16Z Progressing=False AsExpected: NodeInstallerProgressing: 3 nodes are at revision 3
EtcdMembersProgressing: No unstarted etcd members found
2022-03-07T18:31:57Z Available=True AsExpected: StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 3
EtcdMembersAvailable: 3 members are available
2022-03-07T18:30:11Z Upgradeable=True AsExpected: All is well
As we got into the update:
2022-03-07T18:53:56+00:00
ClusterVersion:
2022-03-07T18:48:14Z Available=True : Done applying 4.8.33
2022-03-07T18:47:44Z Failing=False :
2022-03-07T18:53:49Z Progressing=True DownloadingUpdate: Working towards 4.9.23: downloading update
2022-03-07T18:52:56Z RetrievedUpdates=True :
etcd ClusterOperator
2022-03-07T18:30:10Z RecentBackup=Unknown ControllerStarted:
2022-03-07T18:32:47Z Degraded=False AsExpected: NodeControllerDegraded: All master nodes are ready
EtcdMembersDegraded: No unhealthy members found
2022-03-07T18:40:16Z Progressing=False AsExpected: NodeInstallerProgressing: 3 nodes are at revision 3
EtcdMembersProgressing: No unstarted etcd members found
2022-03-07T18:31:57Z Available=True AsExpected: StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 3
EtcdMembersAvailable: 3 members are available
2022-03-07T18:30:11Z Upgradeable=True AsExpected: All is well
The CVO notices the etcd state, and complains about it the same way it does in 4.9 -> 4.10:
2022-03-07T18:54:01+00:00
ClusterVersion:
2022-03-07T18:48:14Z Available=True : Done applying 4.8.33
2022-03-07T18:53:59Z Failing=True UpgradePreconditionCheckFailed: Precondition "EtcdRecentBackup" failed because of "ControllerStarted":
2022-03-07T18:53:49Z Progressing=True UpgradePreconditionCheckFailed: Unable to apply 4.9.23: it may not be safe to apply this update
2022-03-07T18:52:56Z RetrievedUpdates=True :
etcd ClusterOperator
2022-03-07T18:30:10Z RecentBackup=Unknown UpgradeBackupInProgress: Backup pod phase: "Running"
2022-03-07T18:32:47Z Degraded=False AsExpected: NodeControllerDegraded: All master nodes are ready
EtcdMembersDegraded: No unhealthy members found
2022-03-07T18:40:16Z Progressing=False AsExpected: NodeInstallerProgressing: 3 nodes are at revision 3
EtcdMembersProgressing: No unstarted etcd members found
2022-03-07T18:31:57Z Available=True AsExpected: StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 3
EtcdMembersAvailable: 3 members are available
2022-03-07T18:30:11Z Upgradeable=True AsExpected: All is well
The etcd operator completes the backup at 18:54:05Z, and 14s later the CVO still hasn't noticed and is still talking about ControllerStarted:
2022-03-07T18:54:19+00:00
ClusterVersion:
2022-03-07T18:48:14Z Available=True : Done applying 4.8.33
2022-03-07T18:53:59Z Failing=True UpgradePreconditionCheckFailed: Precondition "EtcdRecentBackup" failed because of "ControllerStarted":
2022-03-07T18:53:49Z Progressing=True UpgradePreconditionCheckFailed: Unable to apply 4.9.23: it may not be safe to apply this update
2022-03-07T18:52:56Z RetrievedUpdates=True :
etcd ClusterOperator
2022-03-07T18:54:05Z RecentBackup=True UpgradeBackupSuccessful: UpgradeBackup pre 4.9 located at path /etc/kubernetes/cluster-backup/upgrade-backup-2022-03-07_185405 on node "ci-ln-y1rj17k-72292-4d9v8-master-1"
2022-03-07T18:32:47Z Degraded=False AsExpected: NodeControllerDegraded: All master nodes are ready
EtcdMembersDegraded: No unhealthy members found
2022-03-07T18:40:16Z Progressing=False AsExpected: NodeInstallerProgressing: 3 nodes are at revision 3
EtcdMembersProgressing: No unstarted etcd members found
2022-03-07T18:31:57Z Available=True AsExpected: StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 3
EtcdMembersAvailable: 3 members are available
2022-03-07T18:30:11Z Upgradeable=True AsExpected: All is well
But 19s after etcd has completed the backup, the CVO has accepted the update, and is getting on with reconciling the incoming manifests:
2022-03-07T18:54:24+00:00
ClusterVersion:
2022-03-07T18:48:14Z Available=True : Done applying 4.8.33
2022-03-07T18:54:23Z Failing=False :
2022-03-07T18:53:49Z Progressing=True : Working towards 4.9.23: 9 of 738 done (1% complete)
2022-03-07T18:52:56Z RetrievedUpdates=True :
etcd ClusterOperator
2022-03-07T18:54:05Z RecentBackup=True UpgradeBackupSuccessful: UpgradeBackup pre 4.9 located at path /etc/kubernetes/cluster-backup/upgrade-backup-2022-03-07_185405 on node "ci-ln-y1rj17k-72292-4d9v8-master-1"
2022-03-07T18:32:47Z Degraded=False AsExpected: NodeControllerDegraded: All master nodes are ready
EtcdMembersDegraded: No unhealthy members found
2022-03-07T18:40:16Z Progressing=False AsExpected: NodeInstallerProgressing: 3 nodes are at revision 3
EtcdMembersProgressing: No unstarted etcd members found
2022-03-07T18:31:57Z Available=True AsExpected: StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 3
EtcdMembersAvailable: 3 members are available
2022-03-07T18:30:11Z Upgradeable=True AsExpected: All is well
So if folks feel like a bit of time where the CVO is saying "it may not be safe to apply this update" while it waits for the etcd backup is worth release notes, we may want to add those release notes to 4.9 as well. Personally, I think the risk of confusion here is small, and I'd rather fix the bug, instead of trying to explain this transient, mis-reported string in release notes.
Created attachment 1864435 [details]
logs from the outgoing CVO pod
And checking 4.9 -> 4.10 again:
$ oc get -o jsonpath='{.status.desired.version}{"\n"}' clusterversion version
4.9.23
$ oc adm upgrade channel candidate-4.10
In one terminal:
$ while sleep 5; do date --iso=s --utc; echo "ClusterVersion:"; oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'; echo; echo "etcd ClusterOperator"; oc get -o json clusteroperator etcd | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'; echo; done
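(Editor's note: a narrower variant, not part of the original reproduction, that pulls only the two conditions at play here; it assumes the same oc/jq tooling used in the loop above.)
$ # Underlying precondition failure, hidden behind the generic Progressing roll-up:
$ oc get -o json clusterversion version | jq -r '.status.conditions[] | select(.type == "Failing") | (.reason // "") + ": " + (.message // "")'
$ # Whether the etcd operator has finished its pre-update backup:
$ oc get -o json clusteroperator etcd | jq -r '.status.conditions[] | select(.type == "RecentBackup") | (.reason // "") + ": " + (.message // "")'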
In another:
$ oc adm upgrade --to 4.10.2
Updating to 4.10.2
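(Editor's note: as a quick spot-check between polls, not part of the original transcript, running oc adm upgrade with no arguments prints the current update status, which surfaces the same "it may not be safe to apply this update" roll-up while the precondition check is failing.)
$ oc adm upgrade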
Excerpts from the monitor:
Before the update:
2022-03-07T19:58:19+00:00
ClusterVersion:
2022-03-07T19:57:23Z RetrievedUpdates=True :
2022-03-07T19:36:32Z Available=True : Done applying 4.9.23
2022-03-07T19:36:32Z Failing=False :
2022-03-07T19:37:02Z Progressing=False : Cluster version is 4.9.23
etcd ClusterOperator
2022-03-07T19:19:46Z RecentBackup=Unknown ControllerStarted:
2022-03-07T19:19:46Z Degraded=False AsExpected: NodeControllerDegraded: All master nodes are ready
EtcdMembersDegraded: No unhealthy members found
2022-03-07T19:29:39Z Progressing=False AsExpected: NodeInstallerProgressing: 3 nodes are at revision 3
EtcdMembersProgressing: No unstarted etcd members found
2022-03-07T19:21:42Z Available=True AsExpected: StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 3
EtcdMembersAvailable: 3 members are available
2022-03-07T19:20:58Z Upgradeable=True AsExpected: All is well
CVO downloading the new release image:
2022-03-07T19:58:25+00:00
ClusterVersion:
2022-03-07T19:57:23Z RetrievedUpdates=True :
2022-03-07T19:36:32Z Available=True : Done applying 4.9.23
2022-03-07T19:36:32Z Failing=False :
2022-03-07T19:58:26Z Progressing=True DownloadingUpdate: Working towards 4.10.2: downloading update
etcd ClusterOperator
2022-03-07T19:19:46Z RecentBackup=Unknown ControllerStarted:
2022-03-07T19:19:46Z Degraded=False AsExpected: NodeControllerDegraded: All master nodes are ready
EtcdMembersDegraded: No unhealthy members found
2022-03-07T19:29:39Z Progressing=False AsExpected: NodeInstallerProgressing: 3 nodes are at revision 3
EtcdMembersProgressing: No unstarted etcd members found
2022-03-07T19:21:42Z Available=True AsExpected: StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 3
EtcdMembersAvailable: 3 members are available
2022-03-07T19:20:58Z Upgradeable=True AsExpected: All is well
CVO is waiting on etcd, and etcd is waiting on the backup:
2022-03-07T19:58:48+00:00
ClusterVersion:
2022-03-07T19:57:23Z RetrievedUpdates=True :
2022-03-07T19:36:32Z Available=True : Done applying 4.9.23
2022-03-07T19:58:47Z Failing=True UpgradePreconditionCheckFailed: Precondition "EtcdRecentBackup" failed because of "ControllerStarted":
2022-03-07T19:58:26Z Progressing=True UpgradePreconditionCheckFailed: Unable to apply 4.10.2: it may not be safe to apply this update
etcd ClusterOperator
2022-03-07T19:19:46Z RecentBackup=Unknown UpgradeBackupInProgress: Backup pod phase: "Pending"
2022-03-07T19:19:46Z Degraded=False AsExpected: NodeControllerDegraded: All master nodes are ready
EtcdMembersDegraded: No unhealthy members found
2022-03-07T19:29:39Z Progressing=False AsExpected: NodeInstallerProgressing: 3 nodes are at revision 3
EtcdMembersProgressing: No unstarted etcd members found
2022-03-07T19:21:42Z Available=True AsExpected: StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 3
EtcdMembersAvailable: 3 members are available
2022-03-07T19:20:58Z Upgradeable=True AsExpected: All is well
etcd completes the backup, CVO hasn't noticed yet:
2022-03-07T19:58:53+00:00
ClusterVersion:
2022-03-07T19:57:23Z RetrievedUpdates=True :
2022-03-07T19:36:32Z Available=True : Done applying 4.9.23
2022-03-07T19:58:47Z Failing=True UpgradePreconditionCheckFailed: Precondition "EtcdRecentBackup" failed because of "ControllerStarted":
2022-03-07T19:58:26Z Progressing=True UpgradePreconditionCheckFailed: Unable to apply 4.10.2: it may not be safe to apply this update
etcd ClusterOperator
2022-03-07T19:58:51Z RecentBackup=True UpgradeBackupSuccessful: UpgradeBackup pre 4.9 located at path /etc/kubernetes/cluster-backup/upgrade-backup-2022-03-07_195848 on node "ci-ln-5cjbzj2-72292-9gkd5-master-0"
2022-03-07T19:19:46Z Degraded=False AsExpected: NodeControllerDegraded: All master nodes are ready
EtcdMembersDegraded: No unhealthy members found
2022-03-07T19:29:39Z Progressing=False AsExpected: NodeInstallerProgressing: 3 nodes are at revision 3
EtcdMembersProgressing: No unstarted etcd members found
2022-03-07T19:21:42Z Available=True AsExpected: StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 3
EtcdMembersAvailable: 3 members are available
2022-03-07T19:20:58Z Upgradeable=True AsExpected: All is well
And 2 and a half minutes after etcd completes the backup, the CVO still hasn't noticed:
2022-03-07T20:01:29+00:00
ClusterVersion:
2022-03-07T19:57:23Z RetrievedUpdates=True :
2022-03-07T19:36:32Z Available=True : Done applying 4.9.23
2022-03-07T19:58:47Z Failing=True UpgradePreconditionCheckFailed: Precondition "EtcdRecentBackup" failed because of "ControllerStarted":
2022-03-07T19:58:26Z Progressing=True UpgradePreconditionCheckFailed: Unable to apply 4.10.2: it may not be safe to apply this update
etcd ClusterOperator
2022-03-07T19:58:51Z RecentBackup=True UpgradeBackupSuccessful: UpgradeBackup pre 4.9 located at path /etc/kubernetes/cluster-backup/upgrade-backup-2022-03-07_195848 on node "ci-ln-5cjbzj2-72292-9gkd5-master-0"
2022-03-07T19:19:46Z Degraded=False AsExpected: NodeControllerDegraded: All master nodes are ready
EtcdMembersDegraded: No unhealthy members found
2022-03-07T19:29:39Z Progressing=False AsExpected: NodeInstallerProgressing: 3 nodes are at revision 3
EtcdMembersProgressing: No unstarted etcd members found
2022-03-07T19:21:42Z Available=True AsExpected: StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 3
EtcdMembersAvailable: 3 members are available
2022-03-07T19:20:58Z Upgradeable=True AsExpected: All is well
But once it does notice, it starts rolling out the incoming manifests:
2022-03-07T20:01:34+00:00
ClusterVersion:
2022-03-07T19:57:23Z RetrievedUpdates=True :
2022-03-07T19:36:32Z Available=True : Done applying 4.9.23
2022-03-07T20:01:31Z Failing=False :
2022-03-07T19:58:26Z Progressing=True : Working towards 4.10.2: 5 of 770 done (0% complete)
etcd ClusterOperator
2022-03-07T19:58:51Z RecentBackup=True UpgradeBackupSuccessful: UpgradeBackup pre 4.9 located at path /etc/kubernetes/cluster-backup/upgrade-backup-2022-03-07_195848 on node "ci-ln-5cjbzj2-72292-9gkd5-master-0"
2022-03-07T19:19:46Z Degraded=False AsExpected: NodeControllerDegraded: All master nodes are ready
EtcdMembersDegraded: No unhealthy members found
2022-03-07T19:29:39Z Progressing=False AsExpected: NodeInstallerProgressing: 3 nodes are at revision 3
EtcdMembersProgressing: No unstarted etcd members found
2022-03-07T19:21:42Z Available=True AsExpected: StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 3
EtcdMembersAvailable: 3 members are available
2022-03-07T19:20:58Z Upgradeable=True AsExpected: All is well
Events included more detail:
$ oc -n openshift-cluster-version get events
...
91s Warning PreconditionsFailed clusterversion/version preconditions failed for payload loaded version="4.10.2" image="quay.io/openshift-release-dev/ocp-release@sha256:b807908afe1b7b25a8727f092cc11a104bf3996908a80f035d079b014b4b96b6" failures=Precondition "EtcdRecentBackup" failed because of "ControllerStarted":
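(Editor's note: to pull just those precondition events without the surrounding noise, a field selector on the event reason should work; this is a sketch assuming the standard core/v1 event field selectors, not a command from the original report.)
$ oc -n openshift-cluster-version get events --field-selector reason=PreconditionsFailed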
I'm attaching CVO logs that I streamed down from the outgoing CVO pod:
$ oc -n openshift-cluster-version logs -f cluster-version-operator-78d979db4-gsgvw >cvo.log
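(Editor's note: for anyone digging through a capture like that, a case-insensitive grep is usually enough to find the relevant lines, assuming the log wording matches the "precondition" phrasing seen in the event above; this is an addition, not part of the original report.)
$ grep -i precondition cvo.log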
Moving the version back, although I'll pick 4.2.0. The CVO has rolled this up under "it may not be safe to apply this update" since 2019 [1].
[1]: https://github.com/openshift/cluster-version-operator/blame/77a24f77b8f6e9975b5f41a787d6368ff609e391/pkg/payload/task.go#L243
Listing the situations we observed that cause "it may not be safe to apply this update":
1. Resource deletion is in progress when the update starts (this sometimes blocks upgrades from 4.9 to 4.10 because [1] has not been reintroduced)
2. A machine config pool is updating when the update starts
3. Etcd is backing up when the update starts
4. The target image is unsigned
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1822752
Although "EtcdRecentBackup" is different from our other preconditions - it's an ongoing automated process that's expected to complete and resolve the precondition failure - the message "it may not be safe to apply this update" still applies until the etcd backup has completed.
Closing this as a dup of https://bugzilla.redhat.com/show_bug.cgi?id=2065507, which enhances `oc adm upgrade` to provide amplifying information about upgrade status.
*** This bug has been marked as a duplicate of bug 2065507 ***