Bug 1922185
| Summary: | Restart count of kube-controller-manager pods is observed to be higher and increasing as the cluster grows older. | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Lakshmi Ravichandran <lakshmi.ravichandran1> |
| Component: | kube-controller-manager | Assignee: | Maciej Szulik <maszulik> |
| Status: | CLOSED NOTABUG | QA Contact: | zhou ying <yinzhou> |
| Severity: | low | Docs Contact: | |
| Priority: | low | | |
| Version: | 4.7 | CC: | aos-bugs, Holger.Wolf, mfojtik, wvoesch |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-02-01 10:34:16 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1903544 | | |
| Attachments: | | | |
Hi Team, kindly note a correction to the Description regarding the cluster's resource specification. The cluster under test has the following resource specification: master nodes - 4 CPU / 16G; worker nodes 01, 02 - 2 CPU / 8G; worker 03 - 4 CPU / 16G (bootstrap node turned into a master node).

Every development cluster built from the master branch carries a patch which shortens the default certificate rotation period from 30 days to 1/60th of that; the patch is in:
https://github.com/openshift/cluster-kube-apiserver-operator/blob/47de23de5c544bfe0649e0045c6ba667af5e469a/pkg/operator/certrotationcontroller/certrotationcontroller.go#L128

After code freeze, right after the release branch is created, that patch is removed; for 4.7 this effort is being tracked in:
https://bugzilla.redhat.com/show_bug.cgi?id=1883790

This allows us to sufficiently stress-test the certificate rotation mechanism during the development period, so that it is not affected or broken by any of the changes. You can check the metrics; the restarts should occur approximately every 6h or so.

I'm going to close this since it is not a bug, but a development feature ;-) Feel free to reopen if you see it's different.
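For reference, a minimal sketch of how those restart metrics could be checked against the in-cluster Prometheus; the route name, token handling, and query below are illustrative assumptions, not commands taken from this bug:

```
# Illustrative only: query kube_pod_container_status_restarts_total (exported by
# kube-state-metrics) for the kube-controller-manager namespace. The route name
# and auth handling may differ on your cluster.
TOKEN=$(oc whoami -t)
PROM=$(oc get route prometheus-k8s -n openshift-monitoring -o jsonpath='{.spec.host}')
curl -sk -H "Authorization: Bearer ${TOKEN}" \
  "https://${PROM}/api/v1/query" \
  --data-urlencode 'query=kube_pod_container_status_restarts_total{namespace="openshift-kube-controller-manager"}'
```

If the restarts are driven by the shortened rotation period, the per-container counters should step up at a roughly regular interval rather than spiking under load.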
Created attachment 1752030 [details]
oc describe pod/kube-controller-manager-master-03.ocp-m3558030.lnxne.boe

Description of problem:

The kube-controller-manager pods running on each master node have a higher restart count, and the count is observed to increase as the cluster grows older. The other pods in a default OpenShift cluster have either a restart count of zero or a much lower value.

The kube-controller-manager pod has three containers:
1. kube-controller-manager
2. kube-controller-manager-cert-syncer
3. kube-controller-manager-recovery-controller

Of these, restarts are observed only in the kube-controller-manager-recovery-controller container, and they account for the pod's total restart count.

Note: similar restarts of the openshift-kube-scheduler pods' kube-scheduler-recovery-controller container are observed at the same time.

For example, the output snapshots below were taken after a couple of stressors had been executed on the cluster over the preceding days.

I.
[root@m3558030 ~]# oc get pods -A | grep openshift-kube-scheduler-master; oc get pods -A | grep kube-controller-manager-master
openshift-kube-scheduler            openshift-kube-scheduler-master-01.ocp-m3558030.lnxne.boe   3/3   Running   15   5d22h
openshift-kube-scheduler            openshift-kube-scheduler-master-02.ocp-m3558030.lnxne.boe   3/3   Running   20   5d22h
openshift-kube-scheduler            openshift-kube-scheduler-master-03.ocp-m3558030.lnxne.boe   3/3   Running   19   5d21h
openshift-kube-controller-manager   kube-controller-manager-master-01.ocp-m3558030.lnxne.boe    4/4   Running   13   5d21h
openshift-kube-controller-manager   kube-controller-manager-master-02.ocp-m3558030.lnxne.boe    4/4   Running   16   5d21h
openshift-kube-controller-manager   kube-controller-manager-master-03.ocp-m3558030.lnxne.boe    4/4   Running   16   5d21h

II.
[root@m3558030 ~]# oc get pods -A | grep openshift-kube-scheduler-master; oc get pods -A | grep kube-controller-manager-master
openshift-kube-scheduler            openshift-kube-scheduler-master-01.ocp-m3558030.lnxne.boe   3/3   Running   15   6d3h
openshift-kube-scheduler            openshift-kube-scheduler-master-02.ocp-m3558030.lnxne.boe   3/3   Running   21   6d3h
openshift-kube-scheduler            openshift-kube-scheduler-master-03.ocp-m3558030.lnxne.boe   3/3   Running   20   6d3h
openshift-kube-controller-manager   kube-controller-manager-master-01.ocp-m3558030.lnxne.boe    4/4   Running   14   6d3h
openshift-kube-controller-manager   kube-controller-manager-master-02.ocp-m3558030.lnxne.boe    4/4   Running   17   6d3h
openshift-kube-controller-manager   kube-controller-manager-master-03.ocp-m3558030.lnxne.boe    4/4   Running   16   6d3h

I have attached the output of oc describe pod/kube-controller-manager-master-03.ocp-m3558030.lnxne.boe.

Additional info:
The cluster's resource spec is: master nodes - 4 CPU / 16G; worker nodes 01, 02 - 2 CPU / 8G (increased memory needed to set up the logging stack); worker 03 - 4 CPU / 16G (bootstrap node turned into a master node).

[root@m3558030 ~]# oc get nodes
NAME                                 STATUS   ROLES    AGE     VERSION
bootstrap-0.ocp-m3558030.lnxne.boe   Ready    worker   6d21h   v1.20.0+f0a2ec9
master-01.ocp-m3558030.lnxne.boe     Ready    master   6d21h   v1.20.0+f0a2ec9
master-02.ocp-m3558030.lnxne.boe     Ready    master   6d21h   v1.20.0+f0a2ec9
master-03.ocp-m3558030.lnxne.boe     Ready    master   6d21h   v1.20.0+f0a2ec9
worker-01.ocp-m3558030.lnxne.boe     Ready    worker   6d21h   v1.20.0+f0a2ec9
worker-02.ocp-m3558030.lnxne.boe     Ready    worker   6d21h   v1.20.0+f0a2ec9

Please kindly let me know what other logs would interest you; I shall gladly provide them.
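To confirm which of the three containers is accumulating the restarts, something along these lines could be used; this is a sketch that assumes jq is available on the workstation and is not a command recorded in this bug:

```
# List per-container restart counts and the last termination time for the
# kube-controller-manager static pods; the recovery-controller container is
# expected to carry nearly all of the restarts if the observation above holds.
oc get pods -n openshift-kube-controller-manager -o json \
  | jq -r '.items[]
      | select(.metadata.name | startswith("kube-controller-manager-"))
      | .metadata.name as $pod
      | .status.containerStatuses[]
      | "\($pod)  \(.name)  restarts=\(.restartCount)  lastExit=\(.lastState.terminated.finishedAt // "n/a")"'
```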
Version-Release number of selected component (if applicable):
Client Version: 4.7.0-0.nightly-s390x-2021-01-22-120029
Server Version: 4.7.0-0.nightly-s390x-2021-01-22-120029
Kubernetes Version: v1.20.0+f0a2ec9

How reproducible:
Almost every time under the test conditions; however, the restart counts of the kube-controller-manager and openshift-kube-scheduler pods vary.

Steps to Reproduce:
1. Install a healthy OCP 4.7 cluster in an s390x environment.
2. Schedule a memory stress workload on one of the worker nodes for 24 hours or more. The stress-ng tool was used in this exercise (a rough sketch is given at the end of this report).
3. Continuously monitor the restart count of the kube-controller-manager and openshift-kube-scheduler pods and observe it increasing.

Actual results:
The restart count of these pods is observed to increase, while the other pods in the cluster (oc get pods -A) either have 0 or a very low restart count.

Expected results:
The pods should either not experience any restarts or have a lower restart count.

Additional info:
A separate bug has been raised against the openshift-kube-scheduler component, as restarts of the kube-scheduler-recovery-controller container were also observed during this exercise.
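For steps 2 and 3 above, a rough sketch of the workload scheduling and the restart-count polling; the container image, stress-ng arguments, and polling interval are placeholders, since the exact workload used in this exercise is not recorded here:

```
# Placeholder stress workload pinned to one worker node; replace <stress-ng-image>
# with an s390x-compatible image that ships stress-ng.
oc run memory-stress --image=<stress-ng-image> --restart=Never \
  --overrides='{"apiVersion":"v1","spec":{"nodeName":"worker-01.ocp-m3558030.lnxne.boe"}}' \
  --command -- stress-ng --vm 2 --vm-bytes 4G --timeout 24h

# Poll the restart counts of the scheduler and controller-manager pods while the
# stress workload runs.
while true; do
  date -u
  oc get pods -n openshift-kube-scheduler | grep openshift-kube-scheduler-master
  oc get pods -n openshift-kube-controller-manager | grep kube-controller-manager-master
  sleep 600
done
```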