Description of problem:
After upgrading from 2.0.4 to 2.1.0 on bare metal, the pods klusterlet-addon-controller, search-prod-42fde-redisgraph, and managedcluster-import-controller were in an OOMKilled state.

Version-Release number of selected component (if applicable):
acm-2.1.0

How reproducible:
Always

Steps to Reproduce:
1. Perform the ACM upgrade from 2.0.4 to 2.1.0 on bare metal

Actual results:
The pods klusterlet-addon-controller, search-prod-42fde-redisgraph, and managedcluster-import-controller were in an OOMKilled state.

Expected results:
The upgrade should complete smoothly without any issues.

Additional info:
I ended up editing the deployment for each of these pod stacks and increasing the memory limits. After the update the new pods have stayed up.
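To confirm which containers were OOM-killed, a quick check (this assumes the ACM components run in the default open-cluster-management namespace):

```
# List pods whose last container termination reason was OOMKilled.
oc get pods -n open-cluster-management \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' \
  | grep OOMKilled
```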
Hey! So these are issues pertaining to the cluster lifecycle and search squads. I'm going to reassign them so they can triage/connect to known issues. Actually, I think I can only assign one squad at a time; I will assign cluster lifecycle first since they have _2_ OOMKilled pods. Thanks! Nathan
Hi, this looks like a memory issue, and we have investigated it: with more than ~350 namespaces on the hub, klusterlet-addon-controller and managedcluster-import-controller go into an OOMKilled state. Each new namespace creates 8-9 new secrets, and the controllers cache all of those secrets, so memory usage keeps increasing. Can you please check how many namespaces you have on the hub cluster? Thanks!
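For reference, a quick way to check the scale factors described above (a rough gauge, not exact thresholds):

```
# Count namespaces on the hub; in our testing, more than ~350 triggered OOMKills.
oc get namespaces --no-headers | wc -l
# Count secrets across all namespaces; these are what the controllers cache.
oc get secrets --all-namespaces --no-headers | wc -l
```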
Ran into this again this morning with a fresh install of ACM 2.1.0. The cluster only has 89 namespaces. I used the following commands as a workaround:

```
oc get deployments -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].resources.limits.memory}{"\n"}{end}' | egrep 'redisgraph|klusterlet|managedcluster'
redisdeployment=$( oc -n open-cluster-management get deployment -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' | grep redisgraph )
oc patch deployment managedcluster-import-controller -p '{"spec": {"template": {"spec": {"containers": [{"name": "managedcluster-import-controller","resources": {"limits": {"memory":"512Mi"}}}]}}}}'
oc patch deployment klusterlet-addon-controller -p '{"spec": {"template": {"spec": {"containers": [{"name": "klusterlet-addon-controller","resources": {"limits": {"memory":"512Mi"}}}]}}}}'
oc patch deployment $redisdeployment -p '{"spec": {"template": {"spec": {"containers": [{"name": "redisgraph","resources": {"limits": {"memory":"2Gi"}}}]}}}}'
unset redisdeployment
oc get deployments -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].resources.limits.memory}{"\n"}{end}' | egrep 'redisgraph|klusterlet|managedcluster'
```
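To verify the patched pods stay up afterwards (optional; this assumes the workloads run in the open-cluster-management namespace):

```
# Show the restart count per pod; it should stop climbing after the patch.
oc -n open-cluster-management get pods \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}' \
  | egrep 'redisgraph|klusterlet|managedcluster'
```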
Can you please provide the cluster details so I can debug further? Or you can ping me on Slack (id @ljawale). How many ManagedClusters does this cluster have?
Cluster details sent through slack.
For the redisgraph pod: extrapolating from the number of secrets, it seems the cluster has a lot of resources in general (pods, configmaps, etc.). Redisgraph provides the Search functionality, and its required memory is expected to be proportional to the total resources in the hub plus the managed clusters. Here's the search scalability documentation: https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes/2.1/html/install/installing#search-scalability The current process is manual and not ideal; we are tracking an enhancement to improve the user experience in this scenario.
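As a rough way to gauge the overall resource count that drives redisgraph memory (a hedged estimate only; this is not the exact set of kinds Search collects):

```
# Order-of-magnitude count of common resources on the hub.
oc get pods,configmaps,secrets,services --all-namespaces --no-headers | wc -l
```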
G2Bsync 725001861 comment leena-jawale Tue, 10 Nov 2020 22:19:24 UTC G2Bsync
```
[corona@bastion ~]$ oc get secrets --all-namespaces --no-headers | wc -l
5297
[corona@bastion ~]$
```
After further investigation, it turned out that the hub cluster has a large number of secrets, and `managedcluster-import-controller` and `klusterlet-addon-controller` cache all of them, so memory keeps increasing and the pods go into an OOMKilled state. We have added a fix for this in ACM 2.2; I think we should add that fix in ACM 2.1.1 as well.
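To see how close those controllers are running to their memory limits (this assumes cluster metrics are available via `oc adm top`):

```
# Live memory usage for the two controllers that cache all secrets.
oc adm top pods -n open-cluster-management | egrep 'klusterlet-addon|managedcluster-import'
```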
G2Bsync 754780921 comment juliana-hsu Tue, 05 Jan 2021 17:28:33 UTC G2Bsync Fix is in 2.0.6 & higher, 2.1.1, and 2.2.
I also have these issues in ACM 2.1. How can I get this fix?
G2Bsync 757489826 comment juliana-hsu Sun, 10 Jan 2021 14:56:36 UTC G2Bsync Fix is in 2.0.6 & higher, 2.1.1, and 2.2. Are you still having issues on these releases?
The issue was resolved only after detaching the managed cluster and reimporting it to the ACM hub.
G2Bsync 758177846 comment leena-jawale Mon, 11 Jan 2021 19:39:49 UTC G2Bsync If you are on hub 2.1.1, there is no need to detach and reimport the cluster. The pod restart during the upgrade to 2.1.1 should pick up the memory fix.
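If the pods did not restart on their own during the upgrade, a rollout restart should pick up the fixed image (assuming the default deployment names and namespace):

```
oc -n open-cluster-management rollout restart deployment/klusterlet-addon-controller
oc -n open-cluster-management rollout restart deployment/managedcluster-import-controller
```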