Bug 1576547 - kube-state-metrics pods getting OoM killed @ 750 nodes
Summary: kube-state-metrics pods getting OoM killed @ 750 nodes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.2.0
Assignee: Frederic Branczyk
QA Contact: Mike Fiedler
URL:
Whiteboard: aos-scalability-310
Depends On:
Blocks:
 
Reported: 2018-05-09 17:26 UTC by jmencak
Modified: 2019-10-16 06:27 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:27:40 UTC
Target Upstream Version:


Attachments
OoM kills (dmesg), oc get pods, oc describe pod kube-state-metrics* (98.57 KB, application/x-gzip)
2018-05-09 17:26 UTC, jmencak


Links
GitHub: kubernetes kube-state-metrics issues 498 (last updated 2018-07-31 07:36:41 UTC)
Red Hat Product Errata: RHBA-2019:2922 (last updated 2019-10-16 06:27:56 UTC)

Description jmencak 2018-05-09 17:26:52 UTC
Created attachment 1433972
OoM kills (dmesg), oc get pods, oc describe pod kube-state-metrics*

$ oc version
oc v3.10.0-0.32.0
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://lb-0.scale-ci.example.com:8443
openshift v3.10.0-0.32.0
kubernetes v1.10.0+b81c8f8

Steps to Reproduce:
1. Install a large OCP cluster (about 750 nodes) and watch the kube-state-metrics pods get OoM killed.

Actual results:
Please see attachment.

Expected results:
No OoM kills.
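
One way to confirm the OoM kills in a dmesg capture like the attached one is to scan for the kernel's "Killed process" lines. The following is a minimal Python sketch, assuming the usual kernel wording and the 15-character truncated process name `kube-state-metr`. The helper name `find_oom_kills` and the sample log lines are hypothetical and are not taken from the attachment.

```python
import re

# Matches the kernel OOM-killer report line, for example:
# "Killed process 1234 (kube-state-metr) total-vm:...kB, anon-rss:...kB, ..."
# The anon-rss part is optional because its presence varies by kernel version.
KILLED_RE = re.compile(
    r"Killed process (\d+) \(([^)]*)\)(?:.*?anon-rss:(\d+)kB)?"
)


def find_oom_kills(dmesg_text, comm_prefix="kube-state-metr"):
    """Return (pid, comm, anon_rss_kb) for each OOM kill of a matching process.

    anon_rss_kb is None when the line carries no anon-rss field.
    """
    kills = []
    for line in dmesg_text.splitlines():
        m = KILLED_RE.search(line)
        if m and m.group(2).startswith(comm_prefix):
            rss = int(m.group(3)) if m.group(3) else None
            kills.append((int(m.group(1)), m.group(2), rss))
    return kills


# Illustrative sample only (hypothetical values, not from the attachment).
SAMPLE = """\
[12345.678901] Memory cgroup out of memory: Kill process 4321 (kube-state-metr) score 1000 or sacrifice child
[12345.678950] Killed process 4321 (kube-state-metr) total-vm:2000000kB, anon-rss:1000000kB, file-rss:0kB, shmem-rss:0kB
[12400.000000] Killed process 999 (other-proc) total-vm:1000kB, anon-rss:500kB, file-rss:0kB, shmem-rss:0kB
"""

if __name__ == "__main__":
    for pid, comm, rss in find_oom_kills(SAMPLE):
        print(pid, comm, rss)
```

In practice the input would be the text of the dmesg capture from an affected node rather than the inline sample.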

Comment 2 Frederic Branczyk 2019-02-06 13:45:55 UTC
We made some pretty massive improvements that landed in 4.0; hopefully they fix all of these. At the end of the day we will still use a lot of memory, as that is the purpose of kube-state-metrics (being a cache that can be read from super fast). In future versions we will also look into sharding kube-state-metrics, but for now the changes we have made have shown very significant gains that should at least raise this bar a lot. Please re-test with OpenShift 4.0.
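
To illustrate the sharding idea mentioned in this comment: each instance could keep only the objects whose hashed identifier falls into its own shard, so that no single instance caches the whole cluster. The following is a minimal Python sketch under assumed names (`shard_for`, `keep_object`, and the `pod-uid-N` identifiers are all hypothetical). It is a generic hash-partitioning example, not a description of how kube-state-metrics implements sharding.

```python
import hashlib


def shard_for(uid: str, total_shards: int) -> int:
    """Map an object UID to a shard index in [0, total_shards)."""
    digest = hashlib.sha256(uid.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as a stable integer hash.
    return int.from_bytes(digest[:8], "big") % total_shards


def keep_object(uid: str, shard: int, total_shards: int) -> bool:
    """True if the instance responsible for `shard` should cache this object."""
    return shard_for(uid, total_shards) == shard


if __name__ == "__main__":
    # Hypothetical object identifiers, used only to show the distribution.
    uids = [f"pod-uid-{i}" for i in range(1000)]
    total = 3
    counts = [
        sum(keep_object(u, s, total) for u in uids) for s in range(total)
    ]
    # Every object lands in exactly one shard, so the counts sum to 1000.
    print(counts)
```

Because the assignment depends only on the UID and the shard count, every instance can decide independently which objects to keep, with no coordination between instances.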

Comment 4 Junqi Zhao 2019-02-19 01:22:10 UTC
@Mike

Could you help test this aos-scalability bug?

Comment 9 Mike Fiedler 2019-09-13 16:41:48 UTC
Marking verified on 4.2. There won't be another 750+ node cluster run until post-4.2, and a new bz can be opened then if there is an issue. In a 250 node cluster on GCP, kube-state-metrics is using 278MB VSZ and 172MB RSS.

Comment 11 errata-xmlrpc 2019-10-16 06:27:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

