Description of problem:
Master nodes are routinely OOM-killing control plane pods. The kube-apiserver grows to consume most of the node's memory.

Version-Release number of selected component (if applicable):
4.4.9 UPI on vSphere

How reproducible:
Consistently in the customer environment, every few hours

Steps to Reproduce:
1. The onset of this issue was sudden; no known changes were made to aggravate it.
2.
3.

Actual results:
kube-apiserver is being OOM-killed

Expected results:
kube-apiserver should not be OOM-killed

Additional info:
The following will be attached to the case:
- audit logs
- etcd performance check results
- etcd object count
- pprofs
- sosreports from impacted nodes

We used https://github.com/openshift/cluster-debug-tools to try to find any obvious offenders. The only odd thing that jumped out was a large number of image reads for images which did not seem to exist:

Top 5 "GET":
19256x [ 977µs] [404] /apis/image.openshift.io/v1/images/sha256:0cbb436a0ff01b5a500b6a93fb52e25cbd2806bdf753a00c2386e874d6555e8a [system:apiserver]
16953x [ 907µs] [404] /apis/image.openshift.io/v1/images/sha256:66215b7881303d8370edf2e931c6a1b3ce9a657da85d49ad0ae3db2048ff02cd [system:apiserver]
16747x [ 1.079ms] [404] /apis/image.openshift.io/v1/images/sha256:9a5af3804ac141ad2bb1a3b52aa364a6269e0a1368535e976d66ff4afc49620f [system:apiserver]

pprofs were collected from the kube-apiserver pods and indicate that one of the pods (IP .166, master 2) spent over 26 of 30 seconds of profile time sitting in the ListResource handler.
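For anyone reproducing the audit-log analysis without cluster-debug-tools, a minimal sketch of the same idea follows. This is not the tool's implementation; it assumes a JSON-lines audit log with standard audit.k8s.io/v1 event fields (`verb`, `requestURI`, `responseStatus.code`) and simply counts the most frequent 404ed GETs, similar to the "Top 5 GET" summary above:

```python
# Hypothetical sketch: tally 404 GET requestURIs from a Kubernetes audit log.
import json
from collections import Counter

def top_404_gets(lines, n=5):
    """Return the n most common requestURIs among GET requests that returned 404."""
    counts = Counter()
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial or garbled log lines
        status = (event.get("responseStatus") or {}).get("code")
        if event.get("verb") == "get" and status == 404:
            counts[event.get("requestURI")] += 1
    return counts.most_common(n)

# Synthetic audit events for illustration only:
sample = [
    '{"verb": "get", "requestURI": "/apis/image.openshift.io/v1/images/sha256:abc", "responseStatus": {"code": 404}}',
    '{"verb": "get", "requestURI": "/apis/image.openshift.io/v1/images/sha256:abc", "responseStatus": {"code": 404}}',
    '{"verb": "list", "requestURI": "/api/v1/pods", "responseStatus": {"code": 200}}',
]
print(top_404_gets(sample))
```

Run against the attached audit logs (one file per apiserver instance), this would surface the same image.openshift.io 404 pattern shown above.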
Hi rvanderp, is the trigger for the InstallPlan being created the missing SA? We (OCS) are trying to figure out how to reproduce this issue, and this might be the clue we are looking for.
Hi Raz - The lib-bucket-provisioner pod was crash-looping on the missing SA. It appeared that a new InstallPlan was being created after each crash. I couldn't logically piece together why that would occur, other than perhaps a new InstallPlan gets created when the pod restarts; I wanted to review the source to confirm that. There may have been other missing resources, but that was the only one I could find. We created the missing SA, which resolved that specific error, but they hit other problems (which didn't really shock me, as we didn't have time to make sure the account had the right role bindings, RBAC, etc.). At that point we decided to remove the InstallPlans to give the API server some breathing room, and the cluster has been stable since then. I reproduced a similar issue on my own cluster by just installing 4.4.1 and letting it sit for a few hours.
*** Bug 1857676 has been marked as a duplicate of this bug. ***