Bug 1856990 - OLM continuously creating and listing installplans causes kube-apiserver memory consumption to explode until OOM
Summary: OLM continuously creating and listing installplans causes kube-apiserver memory consumption to explode until OOM
Keywords:
Status: CLOSED DUPLICATE of bug 1857424
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Assignee: Evan Cordell
QA Contact: Jian Zhang
URL:
Whiteboard:
Duplicates: 1857676
Depends On:
Blocks:
 
Reported: 2020-07-14 21:50 UTC by rvanderp
Modified: 2023-12-15 18:27 UTC
CC: 27 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-24 03:56:42 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
Red Hat Knowledge Base (Solution) 5221881 - Last Updated: 2020-07-17 19:51:47 UTC

Description rvanderp 2020-07-14 21:50:42 UTC
Description of problem:
Master nodes are routinely OOM-killing control plane pods. The kube-apiserver grows to consume most of the node's memory.

Version-Release number of selected component (if applicable):
4.4.9 UPI on vSphere

How reproducible:
Consistently in the customer environment every few hours

Steps to Reproduce:
1. The onset of this issue was sudden; no known changes were made that would have aggravated the issue.

Actual results:
The kube-apiserver is being OOM-killed.

Expected results:
The kube-apiserver should not be OOM-killed.

Additional info:
The following will be attached to the case:
 - audit logs
 - etcd performance check results
 - etcd object count
 - pprofs
 - sosreports from impacted nodes

We used https://github.com/openshift/cluster-debug-tools to look for any obvious offenders. The only odd thing that jumped out was a large number of reads for images that did not seem to exist (a sketch of an equivalent tally follows the list below):

Top 5 "GET":
  19256x [       977µs] [404] /apis/image.openshift.io/v1/images/sha256:0cbb436a0ff01b5a500b6a93fb52e25cbd2806bdf753a00c2386e874d6555e8a [system:apiserver]
  16953x [       907µs] [404] /apis/image.openshift.io/v1/images/sha256:66215b7881303d8370edf2e931c6a1b3ce9a657da85d49ad0ae3db2048ff02cd [system:apiserver]
  16747x [     1.079ms] [404] /apis/image.openshift.io/v1/images/sha256:9a5af3804ac141ad2bb1a3b52aa364a6269e0a1368535e976d66ff4afc49620f [system:apiserver]
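
For reference, a similar tally can be produced straight from an audit log. A minimal sketch, assuming a JSON-lines audit log saved locally as audit.log (the path and script are illustrative, not part of the collected case data):

#!/usr/bin/env python3
# Tally GET requests by requestURI from a kube-apiserver audit log
# (JSON-lines format). Illustrative stand-in for the cluster-debug-tools
# analysis above; the "audit.log" path is an assumption.
import json
from collections import Counter

counts = Counter()
with open("audit.log") as f:
    for line in f:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if event.get("verb") == "get":
            counts[event.get("requestURI", "")] += 1

for uri, n in counts.most_common(5):
    print(f"{n:>8}x  {uri}")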

pprofs were collected from the kube-apiserver pods and indicate that one of the pods (IP .166, master 2)
spent over 26 of the 30 seconds of profile time sitting in the ListResource handler.
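
For context, a minimal sketch of one way such a 30-second CPU profile can be grabbed, assuming oc is on PATH, the caller has cluster-admin, and the apiserver's profiling endpoints are enabled (not necessarily the exact procedure used on this cluster):

#!/usr/bin/env python3
# Grab a 30s CPU profile from the kube-apiserver the current kubeconfig
# points at, via the /debug/pprof/profile endpoint. Illustrative only.
import subprocess

profile = subprocess.run(
    ["oc", "get", "--raw", "/debug/pprof/profile?seconds=30"],
    check=True,
    capture_output=True,
).stdout

with open("kube-apiserver-cpu.pprof", "wb") as f:
    f.write(profile)

# Inspect the hot paths (e.g. the ListResource handler) with:
#   go tool pprof -top kube-apiserver-cpu.pprof
print(f"wrote {len(profile)} bytes to kube-apiserver-cpu.pprof")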

Comment 60 Raz Tamir 2020-07-16 14:09:40 UTC
Hi rvanderp,

Is the issue that the installPlans keep being created due to the missing SA?
We (OCS) are trying to figure out how to reproduce this issue, and this might be the clue we are looking for.

Comment 61 rvanderp 2020-07-16 14:18:00 UTC
Hi Raz -

The lib-bucket-provisioner pod was crash-looping on the missing SA. It appeared that a new installplan was being created after each crash. I couldn't logically piece together why that would occur, other than perhaps a new installplan gets created when the pod is restarted; I wanted to review the source to confirm that. There may have been other missing resources, but that was the only one I could find. We created the missing SA, which resolved that specific error, but they hit other problems (which didn't really shock me, as we didn't have time to make sure the account had the right role bindings, RBAC, etc.). At that point we decided to remove the installplans to give the API server some breathing room, and the cluster has been stable since then.

I reproduced a similar issue on my own cluster by just installing 4.4.1 and letting it sit for a few hours.
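
For anyone triaging a similar buildup, a minimal sketch of counting InstallPlans per namespace, assuming the kubernetes Python client and a kubeconfig that can list installplans cluster-wide (illustrative, not from the case data):

#!/usr/bin/env python3
# Count InstallPlans per namespace to spot runaway creation by OLM.
from collections import Counter
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()
plans = api.list_cluster_custom_object(
    group="operators.coreos.com", version="v1alpha1", plural="installplans"
)

per_ns = Counter(item["metadata"]["namespace"] for item in plans["items"])
for ns, n in per_ns.most_common():
    print(f"{n:>6}  {ns}")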

Comment 68 Andre Costa 2020-07-17 10:39:03 UTC
*** Bug 1857676 has been marked as a duplicate of this bug. ***

