Bug 1895799 - ACM pods OOMKilled after 2.0.4 upgrade to 2.1.0 on baremetal
Summary: ACM pods OOMKilled after 2.0.4 upgrade to 2.1.0 on baremetal
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Advanced Cluster Management for Kubernetes
Classification: Red Hat
Component: Cluster Lifecycle
Version: rhacm-2.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: rhacm-2.2
Assignee: Hao Liu
QA Contact: Derek Ho
Docs Contact: Christopher Dawson
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-11-09 06:05 UTC by Nikhil Gupta
Modified: 2024-03-25 16:59 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-15 20:14:16 UTC
Target Upstream Version:
Embargoed:
cdawson: rhacm-2.0.z+


Links
Github open-cluster-management backlog issues 6898 (last updated 2021-02-22 14:41:56 UTC)

Description Nikhil Gupta 2020-11-09 06:05:07 UTC
Description of problem:
After upgrading from 2.0.4 to 2.1.0 on baremetal, the pods klusterlet-addon-controller, search-prod-42fde-redisgraph, and managedcluster-import-controller were in an OOMKilled state.


Version-Release number of selected component (if applicable):
acm-2.1.0

How reproducible:
Always

Steps to Reproduce:
1. Do the ACM upgrade from 2.0.4 to 2.1.0 on baremetal

Actual results:
The pods klusterlet-addon-controller, search-prod-42fde-redisgraph, and managedcluster-import-controller were in an OOMKilled state.

Expected results:
The upgrade should complete smoothly, without any pods being OOMKilled.

Additional info:
I ended up editing the deployment for each of these pods and increasing the memory limits. After the update, the new pods have stayed up.
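
A sketch of that kind of edit with oc (the open-cluster-management namespace and the exact limit values here are illustrative assumptions, not necessarily what was used):

```
# Raise the memory limits on the affected deployments (values are illustrative)
oc -n open-cluster-management set resources deployment klusterlet-addon-controller --limits=memory=512Mi
oc -n open-cluster-management set resources deployment managedcluster-import-controller --limits=memory=512Mi
# The redisgraph deployment name carries a per-install hash; adjust to match your cluster
oc -n open-cluster-management set resources deployment search-prod-42fde-redisgraph --limits=memory=2Gi
```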

Comment 2 Nathan Weatherly 2020-11-09 13:56:32 UTC
Hey!

So these are issues pertaining to the cluster lifecycle and search squads. I'm going to reassign them so those squads can triage and connect them to known issues.


Actually, I think I can only assign one squad at a time; I will assign cluster lifecycle first since they have _2_ OOMKilled pods.

Thanks!

Nathan

Comment 3 ljawale 2020-11-09 16:05:14 UTC
Hi, 

Looks like this is a memory issue, and we have investigated it: with more than roughly 350 namespaces, klusterlet-addon-controller and managedcluster-import-controller go into an OOMKilled state. Whenever a new namespace is created, 8-9 new secrets are created, and the controllers cache all of those secrets, so memory keeps increasing. Can you please check how many namespaces you have on the hub cluster?
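
For reference, a quick way to check those counts on the hub (assuming cluster-admin access):

```
# Number of namespaces on the hub
oc get namespaces --no-headers | wc -l
# Total secrets across all namespaces (what the controllers end up caching)
oc get secrets --all-namespaces --no-headers | wc -l
```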

Thanks!

Comment 5 Jason Huddleston 2020-11-09 23:19:52 UTC
Ran into this again this morning with a fresh install of ACM 2.1.0. The cluster only has 89 namespaces. I used the following commands as a workaround:

oc get deployments -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].resources.limits.memory}{"\n"}{end}' | egrep 'redisgraph|klusterlet|managedcluster'


redisdeployment=$( oc -n open-cluster-management get deployment -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' | grep redisgraph )

oc patch deployment managedcluster-import-controller -p '{"spec": {"template": {"spec": {"containers": [{"name": "managedcluster-import-controller","resources": {"limits": {"memory":"512Mi"}}}]}}}}'
oc patch deployment klusterlet-addon-controller -p '{"spec": {"template": {"spec": {"containers": [{"name": "klusterlet-addon-controller","resources": {"limits": {"memory":"512Mi"}}}]}}}}'
oc patch deployment $redisdeployment -p '{"spec": {"template": {"spec": {"containers": [{"name": "redisgraph","resources": {"limits": {"memory":"2Gi"}}}]}}}}'

unset redisdeployment

oc get deployments -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].resources.limits.memory}{"\n"}{end}' | egrep 'redisgraph|klusterlet|managedcluster'
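
If useful, the rollout of the patched deployments can be confirmed afterwards (namespace assumed to be open-cluster-management, as in the redisgraph lookup above):

```
# Wait for the patched deployments to finish rolling out
oc -n open-cluster-management rollout status deployment/managedcluster-import-controller
oc -n open-cluster-management rollout status deployment/klusterlet-addon-controller
```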

Comment 6 ljawale 2020-11-10 20:41:30 UTC
Can you please provide cluster details so I can debug further? Or you can ping me on Slack (id @ljawale). How many ManagedClusters does this cluster have?

Comment 7 Jason Huddleston 2020-11-10 21:12:45 UTC
Cluster details sent through slack.

Comment 8 Jorge Padilla 2020-11-11 16:37:54 UTC
For the redisgraph pod:
Extrapolating from the number of secrets, it seems like the cluster has a lot of resources in general (pods, configmaps, etc.). Redisgraph is used to provide the Search functionality. The required memory is expected to be proportional to the total number of resources in the hub + managed clusters.
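
A rough way to gauge that total on the hub side (managed-cluster resources add to it):

```
# Count resources (pods, configmaps, secrets) across all hub namespaces
oc get pods,configmaps,secrets --all-namespaces --no-headers | wc -l
```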

Here's the search scalability documentation:
https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes/2.1/html/install/installing#search-scalability


The current process is manual and not ideal. We are tracking an enhancement to improve the user experience in this scenario.

Comment 9 Mike Ng 2020-11-12 13:35:40 UTC
(G2Bsync: GitHub comment 725001861 by leena-jawale, Tue, 10 Nov 2020 22:19:24 UTC)
```
[corona@bastion ~]$ oc get secrets --all-namespaces --no-headers | wc -l
5297
[corona@bastion ~]$
```
After further investigation, it turned out that the hub cluster has a large number of secrets, and `managedcluster-import-controller` and `klusterlet-addon-controller` cache all of them. So memory keeps increasing and the pods go into an OOMKilled state. We have added a fix for this in ACM 2.2; I think we should add that fix in ACM 2.1.1.
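
To see that growth reflected in pod memory before the fix (assuming cluster metrics are available):

```
# Current memory usage of the two controllers
oc adm top pods -n open-cluster-management | egrep 'klusterlet-addon-controller|managedcluster-import-controller'
```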

Comment 11 Mike Ng 2021-01-05 17:46:03 UTC
(G2Bsync: GitHub comment 754780921 by juliana-hsu, Tue, 05 Jan 2021 17:28:33 UTC)

The fix is in 2.0.6 and higher, 2.1.1, and 2.2.

Comment 12 ariv 2021-01-10 12:27:12 UTC
I also have these issues in ACM 2.1.
How can I get this fix?

Comment 13 Mike Ng 2021-01-11 14:10:53 UTC
(G2Bsync: GitHub comment 757489826 by juliana-hsu, Sun, 10 Jan 2021 14:56:36 UTC)

The fix is in 2.0.6 and higher, 2.1.1, and 2.2. Are you still having issues on these releases?

Comment 14 ariv 2021-01-11 14:23:51 UTC
The issue was resolved only after detaching the managed cluster and reimporting it to the ACM hub.

Comment 15 Mike Ng 2021-01-11 19:54:45 UTC
(G2Bsync: GitHub comment 758177846 by leena-jawale, Mon, 11 Jan 2021 19:39:49 UTC)

If you are on hub 2.1.1, there is no need to detach and reimport the cluster. The pod restart during the upgrade to 2.1.1 should pick up the memory fix.
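
A quick way to confirm which ACM version a hub is actually running (assuming the default open-cluster-management namespace):

```
# The ACM operator CSV name includes the installed version, e.g. advanced-cluster-management.v2.1.1
oc -n open-cluster-management get csv
```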

