Bug 1916501 - systemReserved complains 90% of memory is used.
Summary: systemReserved complains 90% of memory is used.
Keywords:
Status: CLOSED DUPLICATE of bug 1907929
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Harshal Patil
QA Contact: MinLi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-01-14 22:35 UTC by cshepher
Modified: 2024-03-25 17:51 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-09 20:38:32 UTC
Target Upstream Version:
Embargoed:


Attachments

Description cshepher 2021-01-14 22:35:49 UTC
Description of problem:
The systemReserved alert fires continually, warning of more and more memory being used each day:
"message = System memory usage of 9.361G on devocp4ec-nmthf-master-0 exceeds 90% of the reservation."
 
top showed kubelet and etcd taking up most of the memory usage (9-12%), but two pprof snapshots showed them using only about 40-60 MB. The kernel team found no evidence of a memory leak (see comment #46: https://gss--c.visualforce.com/apex/Case_View?id=5002K00000r2vBG&sfdc.override=1#comment_a0a2K00000YGdefQAD). Is this a spurious error? Multiple nodes in an 8-node cluster are throwing this alert, and the customer is concerned.
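
For reference, the check described by the alert text amounts to comparing measured system memory usage against 90% of the systemReserved memory. A minimal sketch of that comparison, using the 9.361G figure from the alert and a purely hypothetical 1Gi reservation (the real check is a cluster monitoring rule, not this code):

~~~
// Sketch only: compare system memory usage against 90% of the
// systemReserved memory. The 1Gi reservation is a hypothetical value.
package main

import "fmt"

func main() {
	const (
		reservedBytes = 1 << 30       // hypothetical systemReserved memory of 1Gi
		usageBytes    = 9_361_000_000 // "System memory usage of 9.361G" from the alert
	)

	limit := 0.9 * float64(reservedBytes)
	if float64(usageBytes) > limit {
		fmt.Printf("System memory usage of %.3fG exceeds 90%% of the reservation (%.3fG)\n",
			float64(usageBytes)/1e9, limit/1e9)
	}
}
~~~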

Version-Release number of selected component (if applicable):
OCP 4.6.3 on Azure, RHCOS 8.3

Steps to Reproduce:
1.  Spin up a cluster.
2.  Get the systemReserved alert that the node is using 90% of the reserved memory.
3.  Increase the reservation beyond the previous value; everything is OK for about a day, then another alert says usage is above 90% again.

Comment 2 Martin Sivák 2021-01-15 09:33:24 UTC
The memory manager deals with hugepages; I believe this belongs to the kubelet and its conditions reporting.

Comment 4 cshepher 2021-01-15 22:57:29 UTC
The only things I see in the etcd logs are some "took too long" warnings, mostly on master-2, though there are a couple on master-0 and master-1.

~~~
2020-11-25T19:24:12.239620218Z 2020-11-25 19:24:12.239586 W | etcdserver: request "header:<ID:8010344485033188228 username:\"etcd\" auth_revision:1 > txn:<compare:<target:MOD key:\"/kubernetes.io/monitoring.coreos.com/servicemonitors/openshift-logging/monitor-elasticsearch-cluster\" mod_revision:0 > success:<request_put:<key:\"/kubernetes.io/monitoring.coreos.com/servicemonitors/openshift-logging/monitor-elasticsearch-cluster\" value_size:1509 >> failure:<>>" with result "size:7" took too long (303.371753ms) to execute
2020-11-25T19:24:12.240115924Z 2020-11-25 19:24:12.240017 W | etcdserver: read-only range request "key:\"/kubernetes.io/cronjobs/openshift-logging/curator\" " with result "range_response_count:1 size:3881" took too long (315.956509ms) to execute
2020-11-25T19:24:12.240115924Z 2020-11-25 19:24:12.240036 W | etcdserver: read-only range request "key:\"/kubernetes.io/roles/openshift-kube-scheduler/system:openshift:sa-listing-configmaps\" " with result "range_response_count:1 size:434" took too long (347.237496ms) to execute
2020-11-25T19:24:12.240431828Z 2020-11-25 19:24:12.240395 I | etcdserver/api/etcdhttp: /health OK (status code 200)
2020-11-25T19:24:12.241419640Z 2020-11-25 19:24:12.241396 W | etcdserver: read-only range request "key:\"/kubernetes.io/operator.openshift.io/openshiftcontrollermanagers/cluster\" " with result "range_response_count:1 size:2635" took too long (288.345967ms) to execute
2020-11-25T19:24:12.241640943Z 2020-11-25 19:24:12.241615 W | etcdserver: read-only range request "key:\"/kubernetes.io/monitoring.coreos.com/servicemonitors/\" range_end:\"/kubernetes.io/monitoring.coreos.com/servicemonitors0\" count_only:true " with result "range_response_count:0 size:9" took too long (222.174448ms) to execute
2020-11-25T19:24:12.241900346Z 2020-11-25 19:24:12.241826 W | etcdserver: read-only range request "key:\"/kubernetes.io/ingress/\" range_end:\"/kubernetes.io/ingress0\" count_only:true " with result "range_response_count:0 size:7" took too long (239.421262ms) to execute
2020-11-25T19:24:17.021161848Z 2020-11-25 19:24:17.021107 I | etcdserver/api/etcdhttp: /health OK (status code 200)
~~~

Comment 5 Ryan Phillips 2021-01-18 16:18:46 UTC
Because of the way Go works, the allocator will not release memory back to the OS until the system is under memory pressure [1]. We put this alert in so that we can see when this happens in production clusters. The Go memory allocator is going to change in Go 1.16 (and thus in future versions of OpenShift).

We highly recommend upgrading to 4.6.9+, since it includes a kernel patch for high-memory scenarios on cloud machines.

1. https://github.com/golang/go/issues/42330
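
To illustrate the behavior described above (a generic Go sketch, not OpenShift code): memory the Go runtime has freed tends to sit in its idle heap rather than being returned to the OS right away, which is why top can attribute far more memory to a Go process than pprof shows as live. debug.FreeOSMemory() forces the runtime to hand idle pages back:

~~~
// Generic illustration of Go heap memory not being returned to the OS
// immediately after it is freed; not OpenShift code.
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

func printMem(label string) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("%-18s HeapInuse=%dMiB HeapIdle=%dMiB HeapReleased=%dMiB\n",
		label, m.HeapInuse>>20, m.HeapIdle>>20, m.HeapReleased>>20)
}

func main() {
	// Allocate ~512 MiB, then drop the reference.
	bufs := make([][]byte, 512)
	for i := range bufs {
		bufs[i] = make([]byte, 1<<20)
	}
	printMem("allocated")

	bufs = nil
	runtime.GC()
	// The memory is now free from Go's point of view, but much of it stays
	// in HeapIdle rather than HeapReleased, so top still charges it to the process.
	printMem("after GC")

	debug.FreeOSMemory() // force the runtime to return idle pages to the OS
	printMem("after FreeOSMemory")
}
~~~

Additionally, on Linux Go 1.12-1.15 release memory with MADV_FREE, so even released pages can keep showing up in RSS until the kernel actually needs them; GODEBUG=madvdontneed=1 avoids that, and Go 1.16 switches the default back to MADV_DONTNEED (the change tracked in [1]).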

Comment 9 Ryan Phillips 2021-02-09 20:38:32 UTC
In 4.7 we are enabling an option to make crio and kubelet reclaim memory faster. I created a backport for 4.6 here: https://bugzilla.redhat.com/show_bug.cgi?id=1907929 

https://github.com/openshift/machine-config-operator/pull/2397
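
A generic sketch of the kind of faster-reclaim behavior described here (illustrative only; not necessarily what the linked PR implements): run garbage collection more often and periodically return idle heap pages to the kernel.

~~~
// Sketch only: one generic way for a Go daemon to reclaim memory faster.
// Not necessarily what the linked machine-config-operator PR does.
package main

import (
	"runtime/debug"
	"time"
)

func main() {
	debug.SetGCPercent(50) // default is 100; lower values trigger GC more often

	ticker := time.NewTicker(5 * time.Minute)
	defer ticker.Stop()
	for range ticker.C {
		debug.FreeOSMemory() // hand idle heap pages back to the kernel
	}
}
~~~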

*** This bug has been marked as a duplicate of bug 1907929 ***

Comment 12 hsbawa 2021-08-17 00:43:47 UTC
I am using OCP 4.7.22 and getting a similar error. Not sure what I may be missing.

Aug 16, 2021, 8:22 PM
System memory usage of 1.347G on infra3.hsb.local exceeds 90% of the reservation. Reserved memory ensures system processes can function even when the node is fully allocated and protects against workload out of memory events impacting the proper functioning of the node. The reservation may be increased (https://docs.openshift.com/container-platform/latest/nodes/nodes/nodes-nodes-managing.html) when running nodes with high numbers of pods.
Aug 16, 2021, 8:22 PM
System memory usage of 1.099G on infra1.hsb.local exceeds 90% of the reservation. Reserved memory ensures system processes can function even when the node is fully allocated and protects against workload out of memory events impacting the proper functioning of the node. The reservation may be increased (https://docs.openshift.com/container-platform/latest/nodes/nodes/nodes-nodes-managing.html) when running nodes with high numbers of pods.

Comment 13 transient.sepia 2021-08-19 06:51:03 UTC
Hitting the same error on both 4.6.z (4.6.40) and 4.8.z (4.8.3). Should this be looked at again?

Comment 15 Red Hat Bugzilla 2023-10-21 04:25:04 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

