Bug 1886627
| Summary: | Kube-apiserver pods restarting/reinitializing periodically | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Naga Ravi Chaitanya Elluri <nelluri> |
| Component: | kube-apiserver | Assignee: | Stefan Schimanski <sttts> |
| Status: | CLOSED ERRATA | QA Contact: | Ke Wang <kewang> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.6 | CC: | aos-bugs, mfojtik, nelluri, wking, xxia |
| Target Milestone: | --- | | |
| Target Release: | 4.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Linux | | |
| Whiteboard: | aos-scalability-46 | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-02-24 15:24:26 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Naga Ravi Chaitanya Elluri
2020-10-09 00:16:53 UTC
When did you see the restarts? How did you notice them?

We noticed the kube-apiserver pods reinitializing sequentially, one after the other, at 2020-10-08T16:16:39 UTC and 2020-10-08T21:07:47 UTC. This can also be confirmed by comparing the age of the kube-apiserver pods with the age of the cluster, although the pods' restart count is 0.

The kube-apiserver deployment at 9pm was due to cert rotation. That's totally expected hours after installation. I don't see that it went degraded around 9pm.

I checked both times: 2020-10-08T16:16:39 UTC and 2020-10-08T21:07:47 UTC. As written above, I see a new revision due to cert rotation for the latter. I don't see anything around the former: no new revision and no condition changes. Moving out of blocker list until proven otherwise.

Disregard #6, I was in the wrong log file.
For the 16:16 timestamp:
2020-10-08T16:15:25.559671713Z I1008 16:15:25.559580 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"28b539b0-66c0-4551-9c93-d19a56ad9e82", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RevisionTriggered' new revision 7 triggered by "configmap/kubelet-serving-ca has changed"
and then some minutes later:
2020-10-08T16:21:04.174926102Z I1008 16:21:04.174863 1 status_controller.go:172] clusteroperator/kube-apiserver diff {"status":{"conditions":[{"lastTransitionTime":"2020-10-08T16:21:04Z","message":"NodeInstallerDegraded: 1 nodes are failing on revision 7:\nNodeInstallerDegraded: ","reason":"NodeInstaller_InstallerPodFailed","status":"True","type":"Degraded"},{"lastTransitionTime":"2020-10-08T16:16:15Z","message":"NodeInstallerProgressing: 2 nodes are at revision 6; 1 nodes are at revision 7; 0 nodes have achieved new revision 8","reason":"NodeInstaller","status":"True","type":"Progressing"},{"lastTransitionTime":"2020-10-07T21:20:08Z","message":"StaticPodsAvailable: 3 nodes are active; 2 nodes are at revision 6; 1 nodes are at revision 7; 0 nodes have achieved new revision 8","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2020-10-07T21:17:39Z","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
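
The condition diff in the log line above is easier to read once unpacked. The following is a minimal Python sketch that parses the JSON payload (copied verbatim from the part of the log line after `diff`) and lists each condition. The helper name `summarize_conditions` is made up for illustration and is not part of any OpenShift tooling.

```python
import json

# Payload copied from the status_controller.go log line (everything after "diff ").
# A raw string keeps the "\n" inside the message as a JSON escape sequence.
DIFF = r'''{"status":{"conditions":[{"lastTransitionTime":"2020-10-08T16:21:04Z","message":"NodeInstallerDegraded: 1 nodes are failing on revision 7:\nNodeInstallerDegraded: ","reason":"NodeInstaller_InstallerPodFailed","status":"True","type":"Degraded"},{"lastTransitionTime":"2020-10-08T16:16:15Z","message":"NodeInstallerProgressing: 2 nodes are at revision 6; 1 nodes are at revision 7; 0 nodes have achieved new revision 8","reason":"NodeInstaller","status":"True","type":"Progressing"},{"lastTransitionTime":"2020-10-07T21:20:08Z","message":"StaticPodsAvailable: 3 nodes are active; 2 nodes are at revision 6; 1 nodes are at revision 7; 0 nodes have achieved new revision 8","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2020-10-07T21:17:39Z","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}'''


def summarize_conditions(diff_json):
    """Return (type, status, reason, lastTransitionTime) for each condition."""
    conditions = json.loads(diff_json)["status"]["conditions"]
    return [
        # "reason" is read with .get() in case a condition omits it.
        (c["type"], c["status"], c.get("reason", ""), c["lastTransitionTime"])
        for c in conditions
    ]


for cond_type, status, reason, since in summarize_conditions(DIFF):
    print(f"{cond_type:<12} {status:<6} {reason:<34} since {since}")
```

Read this way, the only condition that changed at 16:21:04 is `Degraded`, with reason `NodeInstaller_InstallerPodFailed`, while `Available` stays `True` throughout.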
So the installer failed to run. Digging further to find out why.
So, the message "1 nodes are failing on revision 7" is a false positive. The logs reveal that there is a revision 8 pod pending and the revision 7 kube-apiserver pod is not yet ready. We used to mark that revision as failed, and that bubbled up to the condition showing the operator as degraded. This confirms the issue is cosmetic and not a 4.6.0 blocker.

Verified on a cluster with an uptime of over 6 hours.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2020-10-17-034503   True        False         7h6m    Cluster version is 4.7.0-0.nightly-2020-10-17-034503
$ oc get co | grep -v '.True.*False.*False'
$ oc get pods -A | grep -vE 'Running|Completed'

The cluster is healthy. Checked whether the same false-positive message "1 nodes are failing on revision 7" can still be found.

$ oc debug node/ip-xx-xx-137-45.us-east-2.compute.internal
sh-4.4# cd /var/log/pods
sh-4.4# grep -nr "2 nodes are at revision.*1 nodes are at revision.*0 nodes have achieved new revision" openshift-*
sh-4.4# grep -nr "1 nodes are failing on revision" openshift-*

Neither grep returned any results, so moving the bug to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633
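
For readers unfamiliar with the verification filter `grep -v '.True.*False.*False'`, here is a minimal Python sketch of what it does. It hides every cluster operator row that reads Available=True, Progressing=False, Degraded=False, so only unhealthy operators remain. The sample rows and the helper name `unhealthy_rows` are hypothetical and were not taken from the cluster in this report.

```python
import re

# Same pattern as in the grep command: a row counts as healthy when it
# contains "True" followed later by two occurrences of "False".
HEALTHY = re.compile(r".True.*False.*False")


def unhealthy_rows(rows):
    """Mimic `grep -v`: keep only the rows that do NOT match the pattern."""
    return [row for row in rows if not HEALTHY.search(row)]


# Hypothetical data rows in the column order of `oc get co`:
# NAME  VERSION  AVAILABLE  PROGRESSING  DEGRADED  SINCE
SAMPLE_ROWS = [
    "authentication   4.7.0-0.nightly-2020-10-17-034503   True   False   False   6h50m",
    "kube-apiserver   4.7.0-0.nightly-2020-10-17-034503   True   True    True    5m",
    "etcd             4.7.0-0.nightly-2020-10-17-034503   True   False   True    7h",
]

for row in unhealthy_rows(SAMPLE_ROWS):
    print(row)
```

With these sample rows, only the `kube-apiserver` and `etcd` lines are printed. Empty output from the real command, as in the verification above, means every operator was Available and neither Progressing nor Degraded.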