Bug 1916501
| Summary: | systemReserved complains 90% of memory is used. | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | cshepher |
| Component: | Node | Assignee: | Harshal Patil <harpatil> |
| Node sub component: | Kubelet | QA Contact: | MinLi <minmli> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | medium | | |
| Priority: | unspecified | CC: | aos-bugs, dkulkarn, harpatil, hsbawa, mleonard, rkshirsa, rphillips, rsandu, transient.sepia, tsweeney, yaoli |
| Version: | 4.6 | Keywords: | Reopened |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-02-09 20:38:32 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
cshepher
2021-01-14 22:35:49 UTC
Memory manager deals with hugepages; I believe this belongs to the kubelet and conditions reporting.

The only thing I see in the etcd logs is a handful of "took too long" warnings, mostly on master-2, with a couple on master-0 and master-1.

~~~
2020-11-25T19:24:12.239620218Z 2020-11-25 19:24:12.239586 W | etcdserver: request "header:<ID:8010344485033188228 username:\"etcd\" auth_revision:1 > txn:<compare:<target:MOD key:\"/kubernetes.io/monitoring.coreos.com/servicemonitors/openshift-logging/monitor-elasticsearch-cluster\" mod_revision:0 > success:<request_put:<key:\"/kubernetes.io/monitoring.coreos.com/servicemonitors/openshift-logging/monitor-elasticsearch-cluster\" value_size:1509 >> failure:<>>" with result "size:7" took too long (303.371753ms) to execute
2020-11-25T19:24:12.240115924Z 2020-11-25 19:24:12.240017 W | etcdserver: read-only range request "key:\"/kubernetes.io/cronjobs/openshift-logging/curator\" " with result "range_response_count:1 size:3881" took too long (315.956509ms) to execute
2020-11-25T19:24:12.240115924Z 2020-11-25 19:24:12.240036 W | etcdserver: read-only range request "key:\"/kubernetes.io/roles/openshift-kube-scheduler/system:openshift:sa-listing-configmaps\" " with result "range_response_count:1 size:434" took too long (347.237496ms) to execute
2020-11-25T19:24:12.240431828Z 2020-11-25 19:24:12.240395 I | etcdserver/api/etcdhttp: /health OK (status code 200)
2020-11-25T19:24:12.241419640Z 2020-11-25 19:24:12.241396 W | etcdserver: read-only range request "key:\"/kubernetes.io/operator.openshift.io/openshiftcontrollermanagers/cluster\" " with result "range_response_count:1 size:2635" took too long (288.345967ms) to execute
2020-11-25T19:24:12.241640943Z 2020-11-25 19:24:12.241615 W | etcdserver: read-only range request "key:\"/kubernetes.io/monitoring.coreos.com/servicemonitors/\" range_end:\"/kubernetes.io/monitoring.coreos.com/servicemonitors0\" count_only:true " with result "range_response_count:0 size:9" took too long (222.174448ms) to execute
2020-11-25T19:24:12.241900346Z 2020-11-25 19:24:12.241826 W | etcdserver: read-only range request "key:\"/kubernetes.io/ingress/\" range_end:\"/kubernetes.io/ingress0\" count_only:true " with result "range_response_count:0 size:7" took too long (239.421262ms) to execute
2020-11-25T19:24:17.021161848Z 2020-11-25 19:24:17.021107 I | etcdserver/api/etcdhttp: /health OK (status code 200)
~~~

With the way that Go works, the allocator will not release memory back to the OS until the system is under memory pressure [1]. We put this alert in so that we can see when this happens in production clusters. The memory allocator is going to change in Go 1.16 (and thus in future versions of OpenShift). We highly recommend upgrading to 4.6.9+, since it includes a kernel patch for high-memory scenarios on cloud machines.

1. https://github.com/golang/go/issues/42330
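For illustration only (not taken from this cluster, and the 256 MiB allocation below is an arbitrary placeholder): a minimal Go sketch of the behavior described above. In runtime.MemStats, HeapIdle minus HeapReleased is heap memory the process still holds but has not handed back to the OS, and debug.FreeOSMemory forces an immediate return, which is roughly what the Go 1.16 default change (MADV_DONTNEED instead of MADV_FREE on Linux) makes happen more promptly on its own.

~~~
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

func main() {
	// Allocate and touch a large slice so the heap grows, then drop the
	// reference so the garbage collector can reclaim it.
	buf := make([]byte, 1<<28) // 256 MiB, purely illustrative
	for i := range buf {
		buf[i] = 1
	}
	buf = nil
	runtime.GC()

	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	// Memory the process holds but has not returned to the OS; this is what
	// keeps RSS high even though the Go heap is mostly idle.
	fmt.Printf("HeapIdle: %d MiB, HeapReleased: %d MiB, retained: %d MiB\n",
		m.HeapIdle>>20, m.HeapReleased>>20, (m.HeapIdle-m.HeapReleased)>>20)

	// Force the runtime to return idle pages to the OS right away.
	debug.FreeOSMemory()
	runtime.ReadMemStats(&m)
	fmt.Printf("after FreeOSMemory, retained: %d MiB\n",
		(m.HeapIdle-m.HeapReleased)>>20)
}
~~~

For binaries built with Go 1.12-1.15, setting GODEBUG=madvdontneed=1 has a similar effect without code changes.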
In 4.7 we are enabling an option to make CRI-O and the kubelet reclaim memory faster. I created a backport for 4.6 here:

https://bugzilla.redhat.com/show_bug.cgi?id=1907929
https://github.com/openshift/machine-config-operator/pull/2397

*** This bug has been marked as a duplicate of bug 1907929 ***

I am using OCP 4.7.22 and getting a similar error. Not sure what I may be missing.

Aug 16, 2021, 8:22 PM
System memory usage of 1.347G on infra3.hsb.local exceeds 90% of the reservation. Reserved memory ensures system processes can function even when the node is fully allocated and protects against workload out of memory events impacting the proper functioning of the node. The reservation may be increased (https://docs.openshift.com/container-platform/latest/nodes/nodes/nodes-nodes-managing.html) when running nodes with high numbers of pods.

Aug 16, 2021, 8:22 PM
System memory usage of 1.099G on infra1.hsb.local exceeds 90% of the reservation. Reserved memory ensures system processes can function even when the node is fully allocated and protects against workload out of memory events impacting the proper functioning of the node. The reservation may be increased (https://docs.openshift.com/container-platform/latest/nodes/nodes/nodes-nodes-managing.html) when running nodes with high numbers of pods.

Hitting the same error on both 4.6.z (4.6.40) and 4.8.z (4.8.3). Should this be looked at again?

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days
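The reservation the alerts above refer to is the kubelet's systemReserved setting, which on OpenShift is raised through a KubeletConfig custom resource, as described in the node-management docs linked in the alert. A minimal sketch, assuming a MachineConfigPool that carries a custom-kubelet: system-reserved label; the object name and the reserved values are placeholders, not recommendations from this bug:

~~~
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: system-reserved-memory          # placeholder name
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: system-reserved   # assumed label; must be set on the target MachineConfigPool
  kubeletConfig:
    systemReserved:
      memory: 2Gi                       # example value; size to the node's observed system overhead
      cpu: 500m
~~~

Once created with oc create -f, the Machine Config Operator rolls the change out to the nodes in the selected pool.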