Description of problem:
systemReserved continually warns of more and more memory being utilized daily: "message = System memory usage of 9.361G on devocp4ec-nmthf-master-0 exceeds 90% of the reservation." `top` showed kubelet and etcd accounting for most of the memory usage (9-12%), but two pprof snapshots show they are each only using about 40-60 MB. The kernel team found no evidence of a memory leak. (See comment #46: https://gss--c.visualforce.com/apex/Case_View?id=5002K00000r2vBG&sfdc.override=1#comment_a0a2K00000YGdefQAD)

Is this a spurious error? Multiple nodes in an 8-node cluster are throwing this error and the customer is concerned.

Version-Release number of selected component (if applicable):
OCP 4.6.3 on Azure, RHCOS 8.3

Steps to Reproduce:
1. Spin up a cluster.
2. Get a systemReserved error that a node is using 90% of its allocated CPU.
3. Increase the allocation beyond the previous number; everything is OK for a day, then another error fires saying usage again exceeds 90%.
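For reference, the reservation mentioned in step 3 is raised with a KubeletConfig custom resource targeting a machine config pool, as described in the node-management docs linked later in this bug. This is only a sketch: the resource name, pool label, and the 3Gi/1000m values are illustrative, not taken from this cluster.

```yaml
# Sketch: raise systemReserved via a KubeletConfig CR.
# The metadata name, MCP label, and resource values are assumptions;
# adjust them to the actual machine config pool and sizing needs.
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: set-system-reserved
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: set-system-reserved   # label the target MCP with this key/value
  kubeletConfig:
    systemReserved:
      cpu: 1000m
      memory: 3Gi
```

Applying this rolls out a new rendered MachineConfig to the selected pool, so the nodes in that pool will reboot one at a time.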
The memory manager deals with hugepages; I believe this belongs to the kubelet and condition reporting.
The only thing I see in the etcd logs are some "took too long" warnings, mostly on master-2, but there are a couple on master-0 and master-1.
~~~
2020-11-25T19:24:12.239620218Z 2020-11-25 19:24:12.239586 W | etcdserver: request "header:<ID:8010344485033188228 username:\"etcd\" auth_revision:1 > txn:<compare:<target:MOD key:\"/kubernetes.io/monitoring.coreos.com/servicemonitors/openshift-logging/monitor-elasticsearch-cluster\" mod_revision:0 > success:<request_put:<key:\"/kubernetes.io/monitoring.coreos.com/servicemonitors/openshift-logging/monitor-elasticsearch-cluster\" value_size:1509 >> failure:<>>" with result "size:7" took too long (303.371753ms) to execute
2020-11-25T19:24:12.240115924Z 2020-11-25 19:24:12.240017 W | etcdserver: read-only range request "key:\"/kubernetes.io/cronjobs/openshift-logging/curator\" " with result "range_response_count:1 size:3881" took too long (315.956509ms) to execute
2020-11-25T19:24:12.240115924Z 2020-11-25 19:24:12.240036 W | etcdserver: read-only range request "key:\"/kubernetes.io/roles/openshift-kube-scheduler/system:openshift:sa-listing-configmaps\" " with result "range_response_count:1 size:434" took too long (347.237496ms) to execute
2020-11-25T19:24:12.240431828Z 2020-11-25 19:24:12.240395 I | etcdserver/api/etcdhttp: /health OK (status code 200)
2020-11-25T19:24:12.241419640Z 2020-11-25 19:24:12.241396 W | etcdserver: read-only range request "key:\"/kubernetes.io/operator.openshift.io/openshiftcontrollermanagers/cluster\" " with result "range_response_count:1 size:2635" took too long (288.345967ms) to execute
2020-11-25T19:24:12.241640943Z 2020-11-25 19:24:12.241615 W | etcdserver: read-only range request "key:\"/kubernetes.io/monitoring.coreos.com/servicemonitors/\" range_end:\"/kubernetes.io/monitoring.coreos.com/servicemonitors0\" count_only:true " with result "range_response_count:0 size:9" took too long (222.174448ms) to execute
2020-11-25T19:24:12.241900346Z 2020-11-25 19:24:12.241826 W | etcdserver: read-only range request "key:\"/kubernetes.io/ingress/\" range_end:\"/kubernetes.io/ingress0\" count_only:true " with result "range_response_count:0 size:7" took too long (239.421262ms) to execute
2020-11-25T19:24:17.021161848Z 2020-11-25 19:24:17.021107 I | etcdserver/api/etcdhttp: /health OK (status code 200)
~~~
With the way the Go runtime works, the allocator will not release memory back to the OS until the system is under memory pressure [1]. We put this alert in so that we can see when this happens in production clusters. The Go memory allocator is going to change in Go 1.16 (and thus in future versions of OpenShift). We highly recommend upgrading to 4.6.9+, since that release includes a kernel patch for high-memory scenarios on cloud machines.

1. https://github.com/golang/go/issues/42330
In 4.7 we are enabling an option to make CRI-O and the kubelet reclaim memory faster. I created a backport for 4.6 here:
https://bugzilla.redhat.com/show_bug.cgi?id=1907929
https://github.com/openshift/machine-config-operator/pull/2397

*** This bug has been marked as a duplicate of bug 1907929 ***
I am using OCP 4.7.22 and getting a similar error. Not sure what I may be missing.

Aug 16, 2021, 8:22 PM
System memory usage of 1.347G on infra3.hsb.local exceeds 90% of the reservation. Reserved memory ensures system processes can function even when the node is fully allocated and protects against workload out of memory events impacting the proper functioning of the node. The reservation may be increased (https://docs.openshift.com/container-platform/latest/nodes/nodes/nodes-nodes-managing.html) when running nodes with high numbers of pods.

Aug 16, 2021, 8:22 PM
System memory usage of 1.099G on infra1.hsb.local exceeds 90% of the reservation. Reserved memory ensures system processes can function even when the node is fully allocated and protects against workload out of memory events impacting the proper functioning of the node. The reservation may be increased (https://docs.openshift.com/container-platform/latest/nodes/nodes/nodes-nodes-managing.html) when running nodes with high numbers of pods.
Hitting the same error on both 4.6.z (4.6.40) and 4.8.z (4.8.3). Should this be looked at again?
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days