Bug 1895799 - ACM pods OOMKilled after 2.0.4 upgrade to 2.1.0 on baremetal
Summary: ACM pods OOMKilled after 2.0.4 upgrade to 2.1.0 on baremetal
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Advanced Cluster Management for Kubernetes
Classification: Red Hat
Component: Cluster Lifecycle
Version: rhacm-2.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: rhacm-2.2
Assignee: Hao Liu
QA Contact: Derek Ho
Docs Contact: Christopher Dawson
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-11-09 06:05 UTC by Nikhil Gupta
Modified: 2024-03-25 16:59 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-15 20:14:16 UTC
Target Upstream Version:
Embargoed:
cdawson: rhacm-2.0.z+


Links
Github open-cluster-management backlog issues 6898 (last updated 2021-02-22 14:41:56 UTC)

Description Nikhil Gupta 2020-11-09 06:05:07 UTC
Description of problem:
After upgrading from 2.0.4 to 2.1.0 on baremetal, the pods klusterlet-addon-controller, search-prod-42fde-redisgraph, and managedcluster-import-controller were in an OOMKilled state.


Version-Release number of selected component (if applicable):
acm-2.1.0

How reproducible:
Always

Steps to Reproduce:
1. Do the ACM upgrade from 2.0.4 to 2.1.0 on baremetal

Actual results:
The pods klusterlet-addon-controller, search-prod-42fde-redisgraph, and managedcluster-import-controller were in an OOMKilled state.

Expected results:
The upgrade should complete smoothly, without any pods being OOMKilled.

Additional info:
I ended up editing the deployment for each of these pods and increasing the memory limits. After the update, the new pods have stayed up.
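
A sketch of that kind of edit with oc (the open-cluster-management namespace and the exact limit values here are illustrative assumptions, not necessarily what was used):

```
# Raise the memory limits on the affected deployments (values are illustrative)
oc -n open-cluster-management set resources deployment klusterlet-addon-controller --limits=memory=512Mi
oc -n open-cluster-management set resources deployment managedcluster-import-controller --limits=memory=512Mi
# The redisgraph deployment name carries a per-install hash; adjust to match your cluster
oc -n open-cluster-management set resources deployment search-prod-42fde-redisgraph --limits=memory=2Gi
```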

Comment 2 Nathan Weatherly 2020-11-09 13:56:32 UTC
Hey!

So these are issues pertaining to the cluster lifecycle and search squads. I'm going to reassign them so those squads can triage and connect them to known issues.


Actually, I think I can only assign one squad at a time; I will assign cluster lifecycle first since they have _2_ OOMKilled pods.

Thanks!

Nathan

Comment 3 ljawale 2020-11-09 16:05:14 UTC
Hi, 

Looks like this is a memory issue, and we have investigated it: with more than roughly 350 namespaces, klusterlet-addon-controller and managedcluster-import-controller go into an OOMKilled state. Whenever a new namespace is created, 8-9 new secrets are created, and the controllers cache all of those secrets, so memory keeps increasing. Can you please check how many namespaces you have on the hub cluster?
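
For reference, a quick way to check those counts on the hub (assuming cluster-admin access):

```
# Number of namespaces on the hub
oc get namespaces --no-headers | wc -l
# Total secrets across all namespaces (what the controllers end up caching)
oc get secrets --all-namespaces --no-headers | wc -l
```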

Thanks!

Comment 5 Jason Huddleston 2020-11-09 23:19:52 UTC
Ran into this again this morning with a fresh install of ACM 2.1.0. The cluster only has 89 namespaces. I used the following commands as a workaround:

oc get deployments -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].resources.limits.memory}{"\n"}{end}' | egrep 'redisgraph|klusterlet|managedcluster'


redisdeployment=$( oc -n open-cluster-management get deployment -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' | grep redisgraph )

oc patch deployment managedcluster-import-controller -p '{"spec": {"template": {"spec": {"containers": [{"name": "managedcluster-import-controller","resources": {"limits": {"memory":"512Mi"}}}]}}}}'
oc patch deployment klusterlet-addon-controller -p '{"spec": {"template": {"spec": {"containers": [{"name": "klusterlet-addon-controller","resources": {"limits": {"memory":"512Mi"}}}]}}}}'
oc patch deployment $redisdeployment -p '{"spec": {"template": {"spec": {"containers": [{"name": "redisgraph","resources": {"limits": {"memory":"2Gi"}}}]}}}}'

unset redisdeployment

oc get deployments -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].resources.limits.memory}{"\n"}{end}' | egrep 'redisgraph|klusterlet|managedcluster'
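
If useful, the rollout of the patched deployments can be confirmed afterwards (namespace assumed to be open-cluster-management, as in the redisgraph lookup above):

```
# Wait for the patched deployments to finish rolling out
oc -n open-cluster-management rollout status deployment/managedcluster-import-controller
oc -n open-cluster-management rollout status deployment/klusterlet-addon-controller
```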

Comment 6 ljawale 2020-11-10 20:41:30 UTC
Can you please provide cluster details so I can debug further? Or you can ping me on Slack (id @ljawale). How many ManagedClusters does this cluster have?

Comment 7 Jason Huddleston 2020-11-10 21:12:45 UTC
Cluster details sent through slack.

Comment 8 Jorge Padilla 2020-11-11 16:37:54 UTC
For the redisgraph pod:
Extrapolating from the number of secrets, it seems like the cluster has a lot of resources in general (pods, configmaps, etc.). Redisgraph is used to provide the Search functionality. The required memory is expected to be proportional to the total number of resources in the hub + managed clusters.
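
A rough way to gauge that total on the hub side (managed-cluster resources add to it):

```
# Count resources (pods, configmaps, secrets) across all hub namespaces
oc get pods,configmaps,secrets --all-namespaces --no-headers | wc -l
```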

Here's the search scalability documentation:
https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes/2.1/html/install/installing#search-scalability


The current process is manual and not ideal. We are tracking an enhancement to improve the user experience in this scenario.

Comment 9 Mike Ng 2020-11-12 13:35:40 UTC
(G2Bsync: GitHub comment 725001861 by leena-jawale, Tue, 10 Nov 2020 22:19:24 UTC)
```
[corona@bastion ~]$ oc get secrets --all-namespaces --no-headers | wc -l
5297
[corona@bastion ~]$
```
After further investigation, it turned out that the hub cluster has a large number of secrets, and `managedcluster-import-controller` and `klusterlet-addon-controller` cache all of them. So memory keeps increasing and the pods go into an OOMKilled state. We have added a fix for this in ACM 2.2; I think we should add that fix in ACM 2.1.1.
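
To see that growth reflected in pod memory before the fix (assuming cluster metrics are available):

```
# Current memory usage of the two controllers
oc adm top pods -n open-cluster-management | egrep 'klusterlet-addon-controller|managedcluster-import-controller'
```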

Comment 11 Mike Ng 2021-01-05 17:46:03 UTC
(G2Bsync: GitHub comment 754780921 by juliana-hsu, Tue, 05 Jan 2021 17:28:33 UTC)

The fix is in 2.0.6 and higher, 2.1.1, and 2.2.

Comment 12 ariv 2021-01-10 12:27:12 UTC
I also have these issues in ACM 2.1.
How can I get this fix?

Comment 13 Mike Ng 2021-01-11 14:10:53 UTC
(G2Bsync: GitHub comment 757489826 by juliana-hsu, Sun, 10 Jan 2021 14:56:36 UTC)

The fix is in 2.0.6 and higher, 2.1.1, and 2.2. Are you still having issues on these releases?

Comment 14 ariv 2021-01-11 14:23:51 UTC
The issue was resolved only after detaching the managed cluster and reimporting it to the ACM hub.

Comment 15 Mike Ng 2021-01-11 19:54:45 UTC
(G2Bsync: GitHub comment 758177846 by leena-jawale, Mon, 11 Jan 2021 19:39:49 UTC)

If you are on hub 2.1.1, there is no need to detach and reimport the cluster. The pod restart during the upgrade to 2.1.1 should pick up the memory fix.
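
A quick way to confirm which ACM version a hub is actually running (assuming the default open-cluster-management namespace):

```
# The ACM operator CSV name includes the installed version, e.g. advanced-cluster-management.v2.1.1
oc -n open-cluster-management get csv
```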

