Bug 1985802
| Summary: | cluster-version-operator needs to handle 60 seconds downtime of API server gracefully in SNO | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Naga Ravi Chaitanya Elluri <nelluri> |
| Component: | Cluster Version Operator | Assignee: | Lalatendu Mohanty <lmohanty> |
| Status: | CLOSED ERRATA | QA Contact: | Pedro Amoedo <pamoedom> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.9 | CC: | aos-bugs, jokerman, lmohanty, nelluri, pamoedom, wking |
| Target Milestone: | --- | | |
| Target Release: | 4.9.0 | | |
| Hardware: | Unspecified | | |
| OS: | All | | |
| Whiteboard: | chaos | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-10-18 17:41:09 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1984730 | | |
Description
Naga Ravi Chaitanya Elluri 2021-07-25 21:49:38 UTC
[Pre-Merge QA Testing]
- Custom version that includes this PR:
~~~
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False 68m Cluster version is 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest
~~~
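As an additional (untested here) check, the image the CVO is actually running can be read straight off its deployment; this sketch assumes the standard `openshift-cluster-version` namespace and `cluster-version-operator` deployment name:
~~~
# Not from the original run: confirm which CVO image is deployed and that its pod is up.
$ oc -n openshift-cluster-version get deployment/cluster-version-operator \
    -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
$ oc -n openshift-cluster-version get pods
~~~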
- Status of the cluster after fresh installation:
~~~
$ oc get nodes
NAME STATUS ROLES AGE VERSION
master-00.pamoedo-snotest3.qe.devcluster.openshift.com Ready master,worker 86m v1.21.1+38b3ecc
$ oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
authentication 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 67m
baremetal 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 79m
cloud-controller-manager 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 80m
cloud-credential 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 105m
cluster-autoscaler 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 78m
config-operator 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 80m
console 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 69m
csi-snapshot-controller 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 80m
dns 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 79m
etcd 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 79m
image-registry 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 72m
ingress 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 73m
insights 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 73m
kube-apiserver 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 73m
kube-controller-manager 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 78m
kube-scheduler 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 78m
kube-storage-version-migrator 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 80m
machine-api 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 79m
machine-approver 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 80m
machine-config 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 78m
marketplace 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 78m
monitoring 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 69m
network 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 81m
node-tuning 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 80m
openshift-apiserver 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 69m
openshift-controller-manager 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 79m
openshift-samples 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 74m
operator-lifecycle-manager 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 80m
operator-lifecycle-manager-catalog 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 80m
operator-lifecycle-manager-packageserver 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 75m
service-ca 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 80m
storage 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 80m
~~~
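For scripted checks, the same per-operator Available/Degraded status can also be pulled with a single jsonpath query instead of reading the full table (a sketch, not run as part of this verification):
~~~
# Not from the original run: one line per clusteroperator with its Available and Degraded status.
$ oc get clusteroperators -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Available")].status}{"\t"}{.status.conditions[?(@.type=="Degraded")].status}{"\n"}{end}'
~~~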
- Forced a kubeapiserver redeploy:
~~~
$ oc patch kubeapiserver/cluster --type merge -p "{\"spec\":{\"forceRedeploymentReason\":\"Forcing new revision with random number $RANDOM to make message unique\"}}"
kubeapiserver.operator.openshift.io/cluster patched
$ oc describe kubeapiserver/cluster | grep Redeployment
f:forceRedeploymentReason:
Force Redeployment Reason: Forcing new revision with random number 14640 to make message unique
~~~
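Since the point of this scenario is that the CVO rides out the roughly 60-second API-server outage the redeploy causes on single-node, its behaviour during that window can be spot-checked from the logs (a sketch, not captured in this run; `oc logs deployment/<name>` resolves to the current CVO pod):
~~~
# Not from the original run: check the CVO logs across the API-server restart; transient
# connection errors are expected, but the operator should keep retrying rather than exit.
$ oc -n openshift-cluster-version logs deployment/cluster-version-operator --since=10m | grep -iE 'error|retry' | tail -n 20
~~~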
- After a few minutes, the clusteroperators finished progressing and all of them are running properly, as expected:
~~~
$ oc get nodes
NAME STATUS ROLES AGE VERSION
master-00.pamoedo-snotest3.qe.devcluster.openshift.com Ready master,worker 124m v1.21.1+38b3ecc
$ oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
authentication 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 115m
baremetal 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 127m
cloud-controller-manager 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 128m
cloud-credential 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 153m
cluster-autoscaler 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 126m
config-operator 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 128m
console 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 117m
csi-snapshot-controller 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 128m
dns 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 127m
etcd 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 127m
image-registry 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 120m
ingress 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 122m
insights 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 122m
kube-apiserver 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 121m
kube-controller-manager 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 126m
kube-scheduler 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 126m
kube-storage-version-migrator 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 128m
machine-api 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 127m
machine-approver 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 128m
machine-config 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 126m
marketplace 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 126m
monitoring 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 117m
network 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 129m
node-tuning 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 128m
openshift-apiserver 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 117m
openshift-controller-manager 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 127m
openshift-samples 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 123m
operator-lifecycle-manager 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 128m
operator-lifecycle-manager-catalog 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 128m
operator-lifecycle-manager-packageserver 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 123m
service-ca 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 128m
storage 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 128m
~~~
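The settled state can also be asserted non-interactively with `oc wait` instead of re-reading the table (a sketch, not part of this run; the `--for=condition=<name>=<value>` form assumes the kubectl 1.21-level client shipped with this release):
~~~
# Not from the original run: block until every clusteroperator is Available and neither
# Progressing nor Degraded, instead of polling the table by hand.
$ oc wait clusteroperator --all --for=condition=Available=True --timeout=15m
$ oc wait clusteroperator --all --for=condition=Progressing=False --timeout=15m
$ oc wait clusteroperator --all --for=condition=Degraded=False --timeout=15m
~~~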
Best Regards.
[Pre-Merge QA Testing]-Extension
Taking advantage of the testing cluster, I also forced a redeploy of "kubecontrollermanager/cluster" and "kubescheduler/cluster" with the following commands:
~~~
$ oc patch kubecontrollermanager/cluster --type merge -p "{\"spec\":{\"forceRedeploymentReason\":\"Forcing new revision with random number $RANDOM to make message unique\"}}"
kubecontrollermanager.operator.openshift.io/cluster patched
$ oc patch kubescheduler/cluster --type merge -p "{\"spec\":{\"forceRedeploymentReason\":\"Forcing new revision with random number $RANDOM to make message unique\"}}"
kubescheduler.operator.openshift.io/cluster patched
~~~
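For these static-pod operators the rollout can also be followed through the per-node revision fields of the operator status (a sketch, not captured here; `nodeStatuses`, `currentRevision` and `targetRevision` are the usual StaticPodOperatorStatus fields):
~~~
# Not from the original run: print current vs. target static-pod revision for the single node;
# the rollout is done once they match.
$ oc get kubecontrollermanager/cluster -o jsonpath='{range .status.nodeStatuses[*]}{.nodeName}{" current="}{.currentRevision}{" target="}{.targetRevision}{"\n"}{end}'
$ oc get kubescheduler/cluster -o jsonpath='{range .status.nodeStatuses[*]}{.nodeName}{" current="}{.currentRevision}{" target="}{.targetRevision}{"\n"}{end}'
~~~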
Both operations progressed quickly and without errors; all clusteroperators are working as expected:
~~~
$ oc get co | grep "kube-apiserver\|kube-controller-manager\|kube-scheduler"
kube-apiserver 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 151m
kube-controller-manager 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 156m
kube-scheduler 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest True False False 155m
$ oc get pods -A | grep "kube-apiserver-master\|kube-controller-manager-master\|kube-scheduler-master"
openshift-kube-apiserver kube-apiserver-master-00.pamoedo-snotest3.qe.devcluster.openshift.com 5/5 Running 0 52m
openshift-kube-controller-manager kube-controller-manager-master-00.pamoedo-snotest3.qe.devcluster.openshift.com 4/4 Running 0 6m46s
openshift-kube-scheduler openshift-kube-scheduler-master-00.pamoedo-snotest3.qe.devcluster.openshift.com 3/3 Running 0 4m51s
~~~
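One extra check that maps directly onto this bug, i.e. the CVO surviving the API-server downtime without crash-looping, would be to confirm that its container restart count is still zero (a sketch, not part of this verification):
~~~
# Not from the original run: the CVO container's restart count should still be 0 after
# the forced redeploys if the API-server downtime was handled gracefully.
$ oc -n openshift-cluster-version get pods \
    -o jsonpath='{range .items[*]}{.metadata.name}{" restarts="}{.status.containerStatuses[0].restartCount}{"\n"}{end}'
~~~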
Regards.
[QA Summary]

[Version]
~~~
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-08-07-175228   True        False         9m      Cluster version is 4.9.0-0.nightly-2021-08-07-175228
$ oc adm release info --commits registry.ci.openshift.org/ocp/release:4.9.0-0.nightly-2021-08-07-175228 | grep cluster-version-operator
  cluster-version-operator   https://github.com/openshift/cluster-version-operator   0ec39d9b2ab1feee8815d7b6b4bbe2db23daf847
[pamoedo@p50 cluster-version-operator] $ git --no-pager log --oneline --first-parent origin/master -3
0ec39d9b (HEAD -> master, origin/release-4.9, origin/release-4.10, origin/master, origin/HEAD) Merge pull request #634 from LalatenduMohanty/BZ_1985802
6e9ea6f5 Merge pull request #636 from sdodson/approvers_emeritus
bd36a2e1 Merge pull request #635 from jan--f/patch-1
~~~

[Parameters]
BareMetal SNO installation with default values.

[Results]
As expected, the installation succeeded with the latest 4.9 nightly and all operators were ready:
~~~
$ oc get nodes
NAME                                                      STATUS   ROLES           AGE   VERSION
master-00.pamoedo-bz1985802.qe.devcluster.openshift.com   Ready    master,worker   26m   v1.21.1+8268f88
$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.9.0-0.nightly-2021-08-07-175228   True        False         False      6m16s
baremetal                                  4.9.0-0.nightly-2021-08-07-175228   True        False         False      22m
cloud-controller-manager                   4.9.0-0.nightly-2021-08-07-175228   True        False         False      25m
cloud-credential                           4.9.0-0.nightly-2021-08-07-175228   True        False         False      36m
cluster-autoscaler                         4.9.0-0.nightly-2021-08-07-175228   True        False         False      22m
config-operator                            4.9.0-0.nightly-2021-08-07-175228   True        False         False      23m
console                                    4.9.0-0.nightly-2021-08-07-175228   True        False         False      12m
csi-snapshot-controller                    4.9.0-0.nightly-2021-08-07-175228   True        False         False      12m
dns                                        4.9.0-0.nightly-2021-08-07-175228   True        False         False      22m
etcd                                       4.9.0-0.nightly-2021-08-07-175228   True        False         False      21m
image-registry                             4.9.0-0.nightly-2021-08-07-175228   True        False         False      11m
ingress                                    4.9.0-0.nightly-2021-08-07-175228   True        False         False      17m
insights                                   4.9.0-0.nightly-2021-08-07-175228   True        False         False      16m
kube-apiserver                             4.9.0-0.nightly-2021-08-07-175228   True        False         False      20m
kube-controller-manager                    4.9.0-0.nightly-2021-08-07-175228   True        False         False      20m
kube-scheduler                             4.9.0-0.nightly-2021-08-07-175228   True        False         False      20m
kube-storage-version-migrator              4.9.0-0.nightly-2021-08-07-175228   True        False         False      23m
machine-api                                4.9.0-0.nightly-2021-08-07-175228   True        False         False      22m
machine-approver                           4.9.0-0.nightly-2021-08-07-175228   True        False         False      22m
machine-config                             4.9.0-0.nightly-2021-08-07-175228   True        False         False      22m
marketplace                                4.9.0-0.nightly-2021-08-07-175228   True        False         False      22m
monitoring                                 4.9.0-0.nightly-2021-08-07-175228   True        False         False      12m
network                                    4.9.0-0.nightly-2021-08-07-175228   True        False         False      23m
node-tuning                                4.9.0-0.nightly-2021-08-07-175228   True        False         False      22m
openshift-apiserver                        4.9.0-0.nightly-2021-08-07-175228   True        False         False      12m
openshift-controller-manager               4.9.0-0.nightly-2021-08-07-175228   True        False         False      20m
openshift-samples                          4.9.0-0.nightly-2021-08-07-175228   True        False         False      15m
operator-lifecycle-manager                 4.9.0-0.nightly-2021-08-07-175228   True        False         False      22m
operator-lifecycle-manager-catalog         4.9.0-0.nightly-2021-08-07-175228   True        False         False      22m
operator-lifecycle-manager-packageserver   4.9.0-0.nightly-2021-08-07-175228   True        False         False      19m
service-ca                                 4.9.0-0.nightly-2021-08-07-175228   True        False         False      23m
storage                                    4.9.0-0.nightly-2021-08-07-175228   True        False         False      23m
~~~

After forcing a redeployment of "kubeapiserver/cluster", "kubescheduler/cluster" and "kubecontrollermanager/cluster", all operators recovered successfully and the pods are running as expected:
~~~
$ oc patch kubeapiserver/cluster --type merge -p "{\"spec\":{\"forceRedeploymentReason\":\"Forcing new revision with random number $RANDOM to make message unique\"}}"
$ oc patch kubescheduler/cluster --type merge -p "{\"spec\":{\"forceRedeploymentReason\":\"Forcing new revision with random number $RANDOM to make message unique\"}}"
$ oc patch kubecontrollermanager/cluster --type merge -p "{\"spec\":{\"forceRedeploymentReason\":\"Forcing new revision with random number $RANDOM to make message unique\"}}"
$ oc get co | grep kube-
kube-apiserver                  4.9.0-0.nightly-2021-08-07-175228   True   True    False   21m   NodeInstallerProgressing: 1 nodes are at revision 6; 0 nodes have achieved new revision 7
kube-controller-manager         4.9.0-0.nightly-2021-08-07-175228   True   True    False   22m   NodeInstallerProgressing: 1 nodes are at revision 9; 0 nodes have achieved new revision 10
kube-scheduler                  4.9.0-0.nightly-2021-08-07-175228   True   True    False   22m   NodeInstallerProgressing: 1 nodes are at revision 8; 0 nodes have achieved new revision 9
kube-storage-version-migrator   4.9.0-0.nightly-2021-08-07-175228   True   False   False   24m
$ oc get co | grep kube-
kube-apiserver                  4.9.0-0.nightly-2021-08-07-175228   True   False   False   26m
kube-controller-manager         4.9.0-0.nightly-2021-08-07-175228   True   False   False   26m
kube-scheduler                  4.9.0-0.nightly-2021-08-07-175228   True   False   False   26m
kube-storage-version-migrator   4.9.0-0.nightly-2021-08-07-175228   True   False   False   28m
$ oc get pods -A | grep "kube-apiserver-master\|kube-controller-manager-master\|kube-scheduler-master"
openshift-kube-apiserver            kube-apiserver-master-00.pamoedo-bz1985802.qe.devcluster.openshift.com             5/5   Running   0   3m48s
openshift-kube-controller-manager   kube-controller-manager-master-00.pamoedo-bz1985802.qe.devcluster.openshift.com    4/4   Running   1   3m49s
openshift-kube-scheduler            openshift-kube-scheduler-master-00.pamoedo-bz1985802.qe.devcluster.openshift.com   3/3   Running   1   4m38s
~~~
Best Regards.

*** Bug 1969257 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759