Bug 1922185

Summary: Restart count of kube-controller-manager pods is observed to be higher and increasing as the cluster grows older.
Product: OpenShift Container Platform
Reporter: Lakshmi Ravichandran <lakshmi.ravichandran1>
Component: kube-controller-manager
Assignee: Maciej Szulik <maszulik>
Status: CLOSED NOTABUG
QA Contact: zhou ying <yinzhou>
Severity: low
Docs Contact:
Priority: low
Version: 4.7
CC: aos-bugs, Holger.Wolf, mfojtik, wvoesch
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-02-01 10:34:16 UTC
Type: Bug
Bug Depends On:    
Bug Blocks: 1903544    
Attachments:
oc describe pod/kube-controller-manager-master-03.ocp-m3558030.lnxne.boe

Description Lakshmi Ravichandran 2021-01-29 12:21:09 UTC
Created attachment 1752030 [details]
oc describe pod/kube-controller-manager-master-03.ocp-m3558030.lnxne.boe

Description of problem:
The kube-controller-manager pods running on each master node have a higher restart count, which is observed to increase as the cluster grows older.
The other pods in a default OpenShift cluster have either a restart count of zero or a very low one.

The kube-controller-manager pod has three containers:
1. kube-controller-manager
2. kube-controller-manager-cert-syncer
3. kube-controller-manager-recovery-controller -- restarts are observed only in this container, and they account for the pod's total restart count (see the command sketch below).

Note: Similar restarts of the openshift-kube-scheduler pod's kube-scheduler-recovery-controller container are also observed at the same time.
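
For reference, per-container restart counts can be listed directly with a standard jsonpath query; a minimal sketch (the pod name below is the one from the attached describe output):

# List each container of the pod together with its restart count.
oc get pod kube-controller-manager-master-03.ocp-m3558030.lnxne.boe \
    -n openshift-kube-controller-manager \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\n"}{end}'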

for example:
---- these output snapshots were taken after a couple of stressors had been executed on the cluster over the last few days.
I
[root@m3558030 ~]# oc get pods -A | grep openshift-kube-scheduler-master; oc get pods -A | grep kube-controller-manager-master
openshift-kube-scheduler                           openshift-kube-scheduler-master-01.ocp-m3558030.lnxne.boe   3/3     Running     15         5d22h
openshift-kube-scheduler                           openshift-kube-scheduler-master-02.ocp-m3558030.lnxne.boe   3/3     Running     20         5d22h
openshift-kube-scheduler                           openshift-kube-scheduler-master-03.ocp-m3558030.lnxne.boe   3/3     Running     19         5d21h
openshift-kube-controller-manager                  kube-controller-manager-master-01.ocp-m3558030.lnxne.boe    4/4     Running     13         5d21h
openshift-kube-controller-manager                  kube-controller-manager-master-02.ocp-m3558030.lnxne.boe    4/4     Running     16         5d21h
openshift-kube-controller-manager                  kube-controller-manager-master-03.ocp-m3558030.lnxne.boe    4/4     Running     16         5d21h

II
[root@m3558030 ~]# oc get pods -A | grep openshift-kube-scheduler-master; oc get pods -A | grep kube-controller-manager-master
openshift-kube-scheduler                           openshift-kube-scheduler-master-01.ocp-m3558030.lnxne.boe   3/3     Running     15         6d3h
openshift-kube-scheduler                           openshift-kube-scheduler-master-02.ocp-m3558030.lnxne.boe   3/3     Running     21         6d3h
openshift-kube-scheduler                           openshift-kube-scheduler-master-03.ocp-m3558030.lnxne.boe   3/3     Running     20         6d3h
openshift-kube-controller-manager                  kube-controller-manager-master-01.ocp-m3558030.lnxne.boe    4/4     Running     14         6d3h
openshift-kube-controller-manager                  kube-controller-manager-master-02.ocp-m3558030.lnxne.boe    4/4     Running     17         6d3h
openshift-kube-controller-manager                  kube-controller-manager-master-03.ocp-m3558030.lnxne.boe    4/4     Running     16         6d3h

I have attached the output of oc describe pod/kube-controller-manager-master-03.ocp-m3558030.lnxne.boe.

additional info:
The cluster’s resource spec is:
master nodes - 4 CPU / 16G,
worker nodes 01, 02 - 2 CPU / 8G (increased memory needed to set up the logging stack),
worker 03 - 4 CPU / 16G (bootstrap node turned into a master node)

[root@m3558030 ~]# oc get nodes
NAME                                 STATUS   ROLES    AGE     VERSION
bootstrap-0.ocp-m3558030.lnxne.boe   Ready    worker   6d21h   v1.20.0+f0a2ec9
master-01.ocp-m3558030.lnxne.boe     Ready    master   6d21h   v1.20.0+f0a2ec9
master-02.ocp-m3558030.lnxne.boe     Ready    master   6d21h   v1.20.0+f0a2ec9
master-03.ocp-m3558030.lnxne.boe     Ready    master   6d21h   v1.20.0+f0a2ec9
worker-01.ocp-m3558030.lnxne.boe     Ready    worker   6d21h   v1.20.0+f0a2ec9
worker-02.ocp-m3558030.lnxne.boe     Ready    worker   6d21h   v1.20.0+f0a2ec9

Please kindly let me know what other logs would interest you; I shall gladly provide them.

Version-Release number of selected component (if applicable):
Client Version: 4.7.0-0.nightly-s390x-2021-01-22-120029
Server Version: 4.7.0-0.nightly-s390x-2021-01-22-120029
Kubernetes Version: v1.20.0+f0a2ec9

How reproducible:
Almost every time under the test conditions; however, the restart counts of the kube-controller-manager and openshift-kube-scheduler pods vary.

Steps to Reproduce:
1. Install a healthy OCP 4.7 cluster in an s390x environment.
2. Schedule a memory stress workload on one of the worker nodes for 24 hours or more. The stress-ng tool was used in this exercise.
3. Continuously monitor the restart counts of the kube-controller-manager and openshift-kube-scheduler pods and observe that they increase (a sketch of steps 2 and 3 follows below).
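
For illustration, a minimal sketch of steps 2 and 3, assuming stress-ng is started in a throwaway pod pinned to one worker node; the image reference and the stress-ng parameters are placeholders, not the exact ones used in this report:

# Step 2: run a memory stressor pinned to one worker node for 24h (image is a placeholder).
oc run memory-stress --image=quay.io/example/stress-ng:latest --restart=Never \
    --overrides='{"spec":{"nodeName":"worker-01.ocp-m3558030.lnxne.boe"}}' \
    -- stress-ng --vm 2 --vm-bytes 75% --timeout 24h

# Step 3: record the restart counts of the two control-plane pods every 10 minutes.
while true; do
    date
    oc get pods -n openshift-kube-controller-manager | grep kube-controller-manager-master
    oc get pods -n openshift-kube-scheduler | grep openshift-kube-scheduler-master
    sleep 600
done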

Actual results:
The pods' restart counts are observed to be increasing, while the other pods in the cluster (oc get pods -A) either have 0 or a very minimal restart count.

Expected results:
The pods should either not experience any restarts or have a lower restart count.

Additional info:
A separate bug has been raised against the openshift-kube-scheduler component, as restarts of the kube-scheduler-recovery-controller container were also observed during this exercise.

Comment 1 Lakshmi Ravichandran 2021-01-29 15:29:46 UTC
Hi Team,
kindly let me correct the description of the cluster’s resource specification above.
The cluster under test has the following resource specifications:

master nodes - 4 CPU  / 16G ,
worker nodes 01,02 - 2 CPU / 8G,
worker 03 - 4 CPU / 16G (bootstrap node turned into a master node)

Comment 2 Maciej Szulik 2021-02-01 10:34:16 UTC
Every development cluster built from the master branch has a patch which shortens the default
certificate rotation period from 30 days to 1/60th of that; the patch is in:
https://github.com/openshift/cluster-kube-apiserver-operator/blob/47de23de5c544bfe0649e0045c6ba667af5e469a/pkg/operator/certrotationcontroller/certrotationcontroller.go#L128
After code freeze, right after a release branch is created, that patch is removed;
this effort for 4.7 is being tracked in:
https://bugzilla.redhat.com/show_bug.cgi?id=1883790

This allows us to stress-test the certificate rotation mechanism sufficiently during the
development period and to confirm that it is not affected or broken by any of the changes.
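
For reference, one way to confirm the shortened rotation on such a cluster is to inspect the validity window of one of the rotated certificates; a minimal sketch, assuming the csr-signer secret in the openshift-kube-controller-manager-operator namespace (the secret name and namespace are an assumption here, not something stated in this bug):

# Print the notBefore/notAfter dates of the signer certificate (secret name is an assumption).
oc get secret csr-signer -n openshift-kube-controller-manager-operator \
    -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates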

You can check the metrics; the restarts should occur approximately every 6h or so.
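
For example, the restart metric can be queried from the in-cluster monitoring stack; a minimal sketch, assuming the standard kube_pod_container_status_restarts_total metric and the thanos-querier route in openshift-monitoring:

# Query per-container restart counts for the kube-controller-manager namespace.
TOKEN=$(oc whoami -t)
HOST=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')
curl -sk -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v1/query" \
    --data-urlencode 'query=kube_pod_container_status_restarts_total{namespace="openshift-kube-controller-manager"}'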

I'm going to close this since it is not a bug, but a development feature ;-)
Feel free to reopen if you see that it's different.