Bug 2003947
| Summary: | [bz-openshift-apiserver] clusteroperator/openshift-apiserver should not change condition/Degraded: master nodes drained too quickly | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Jan Chaloupka <jchaloup> |
| Component: | openshift-apiserver | Assignee: | Jan Chaloupka <jchaloup> |
| Status: | CLOSED NOTABUG | QA Contact: | Ke Wang <kewang> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.7 | CC: | aos-bugs, kewang, mfojtik, rgangwar, sttts, wking, xxia |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | 4.7.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 2003946 | Environment: | |
| Last Closed: | 2022-03-02 12:51:57 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 2003946 | | |
| Bug Blocks: | | | |
Description
Jan Chaloupka
2021-09-14 08:12:46 UTC
https://github.com/openshift/library-go/pull/1056/commits/8efd1883d406cc389eb25e2a257c4451dfbd668c needs to be ported to library-go in 4.7 as well. Let's revisit the 4.7 backport once there's a customer need.
The bug was reproduced in our 4.6 → 4.7 upgrade, so the fix needs to be backported to library-go in 4.7 as well; otherwise customers will hit it. During the upgrade from 4.6 to 4.7, kube-apiserver goes Degraded:
$ oc get clusteroperators
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
...
dns 4.6.55 True False False 4h3m
...
kube-apiserver 4.7.43 True True True 4h1m
...
machine-config 4.6.55 True False False 4h3m
Checking the must-gather logs, kube-apiserver eventually goes Degraded because of the unready node newugd-5971-pl9kh-control-plane-1:
2022-02-22T09:28:57.052569491Z I0222 09:28:57.039818 1 status_controller.go:213] clusteroperator/kube-apiserver diff {"status":{"conditions":[{"lastTransitionTime":"2022-02-22T09:20:37Z","message":"NodeInstallerDegraded: 1 nodes are failing on revision 7:\nNodeInstallerDegraded: no detailed termination message, see `oc get -n \"openshift-kube-apiserver\" pods/\"installer-7-newugd-5971-pl9kh-control-plane-2\" -oyaml`","reason":"NodeInstaller_InstallerPodFailed","status":"True","type":"Degraded"},{"lastTransitionTime":"2022-02-22T09:18:17Z","message":"NodeInstallerProgressing: 3 nodes are at revision 6; 0 nodes have achieved new revision 7","reason":"NodeInstaller","status":"True","type":"Progressing"},{"lastTransitionTime":"2022-02-22T07:44:15Z","message":"StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 6; 0 nodes have achieved new revision 7","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2022-02-22T07:41:58Z","message":"KubeletMinorVersionUpgradeable: Kubelet minor versions on 6 nodes are behind the expected API server version; nevertheless, they will continue to be supported in the next OpenShift minor version upgrade.","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
2022-02-22T09:28:57.072422040Z I0222 09:28:57.072352 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"c99ea6b0-5aae-4cdb-9f89-a63525f9aa26", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/kube-apiserver changed: Degraded message changed from "NodeControllerDegraded: The master nodes not ready: node \"newugd-5971-pl9kh-control-plane-1\" not ready since 2022-02-22 09:28:46 +0000 UTC because KubeletNotReady ([container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful])\nNodeInstallerDegraded: 1 nodes are failing on revision 7:\nNodeInstallerDegraded: no detailed termination message, see `oc get -n \"openshift-kube-apiserver\" pods/\"installer-7-newugd-5971-pl9kh-control-plane-2\" -o yaml`" to "NodeInstallerDegraded: 1 nodes are failing on revision 7:\nNodeInstallerDegraded: no detailed termination message, see `oc get -n \"openshift-kube-apiserver\" pods/\"installer-7-newugd-5971-pl9kh-control-plane-2\" -oyaml`"
Tracking the node status during the upgrade via machine-config, in log namespaces/openshift-machine-config-operator/pods/machine-config-daemon-99jhw/machine-config-daemon/machine-config-daemon/logs/current.log,
machine config for node newugd-5971-pl9kh-control-plane-1,
2022-02-22T09:18:08.851003218Z I0222 09:18:08.850955 99197 daemon.go:383] Node newugd-5971-pl9kh-control-plane-1 is part of the control plane
...
2022-02-22T09:25:30.912795104Z I0222 09:25:30.912751 99197 update.go:1714] Update prepared; beginning drain
...
2022-02-22T09:26:46.204696209Z I0222 09:26:46.204678 99197 update.go:1714] drain complete
...
2022-02-22T09:26:46.326050179Z I0222 09:26:46.325997 99197 update.go:1714] initiating reboot: Node will reboot into config rendered-master-c43d10a537590420b7654f576eb5668a
...
2022-02-22T09:28:57.312231275Z I0222 09:28:57.312169 2771 update.go:1714] completed update for config rendered-master-c43d10a537590420b7654f576eb5668a
The machine-config update on node newugd-5971-pl9kh-control-plane-1 finished at 09:28:57.312169.
Checking the machine-config-controller log, namespaces/openshift-machine-config-operator/pods/machine-config-controller-7799944b89-8pq4j/machine-config-controller/machine-config-controller/logs/current.log,
...
2022-02-22T09:28:01.367703684Z I0222 09:28:01.367628 1 node_controller.go:419] Pool master: node newugd-5971-pl9kh-control-plane-1: Reporting unready: node newugd-5971-pl9kh-control-plane-1 is reporting OutOfDisk=Unknown
2022-02-22T09:28:46.839548728Z I0222 09:28:46.839434 1 node_controller.go:419] Pool master: node newugd-5971-pl9kh-control-plane-1: Reporting unready: node newugd-5971-pl9kh-control-plane-1 is reporting NotReady=False
2022-02-22T09:28:57.004252577Z I0222 09:28:57.004209 1 node_controller.go:419] Pool master: node newugd-5971-pl9kh-control-plane-1: Reporting unready: node newugd-5971-pl9kh-control-plane-1 is reporting Unschedulable
2022-02-22T09:28:57.255035779Z I0222 09:28:57.251994 1 node_controller.go:419] Pool master: node newugd-5971-pl9kh-control-plane-1: Completed update to rendered-master-c43d10a537590420b7654f576eb5668a
2022-02-22T09:28:57.347001835Z I0222 09:28:57.346582 1 node_controller.go:419] Pool master: node newugd-5971-pl9kh-control-plane-1: Reporting ready
Node newugd-5971-pl9kh-control-plane-1 is ready at 09:28:57.346582, yet at 09:28:57.072352 the status for clusteroperator/kube-apiserver still carried the Degraded message "NodeControllerDegraded: The master nodes not ready: node \"newugd-5971-pl9kh-control-plane-1\" not ready since 2022-02-22 09:28:46".
We can see that the kube-apiserver pods do not have enough time to get back to running/available when the master nodes are drained this quickly. The kube-apiserver operator allows 1 replica to be unavailable and then changes condition/Degraded to true.
Ke Wang, this issue is specifically about the openshift-apiserver operator. In order to prevent kube-apiserver from going Degraded, there is more to be backported. Please see https://issues.redhat.com/browse/WRKLDS-293 for more detailed information. The improvements are targeted for 4.10. Given that the issue is reported against 4.7, it is unlikely we would backport all the functionality there without a customer escalation.