Created attachment 1433972
OOM kills (dmesg), oc get pods, oc describe pod kube-state-metrics*
$ oc version
features: Basic-Auth GSSAPI Kerberos SPNEGO
Steps to Reproduce:
1. Install a large OCP cluster and watch the kube-state-metrics pods get OOM killed
Please see attachment.
No OOM kills.
We made some substantial improvements that landed in 4.0; hopefully they fix all of these. At the end of the day kube-state-metrics will still use a lot of memory, since its purpose is to be a cache that can be read from very quickly. In future versions we will also look into sharding kube-state-metrics, but for now the changes we have made show very significant gains and should at least raise this bar considerably. Please re-test with OpenShift 4.0.
Could you help test this aos-scalability bug?
Marking verified on 4.2. There won't be another 750+ node cluster run until post-4.2, and a new bz can be opened then if there is an issue. In a 250-node cluster on GCP, kube-state-metrics is using 278MB VSZ and 172MB RSS.
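For anyone re-checking the VSZ/RSS figures on a later run, here is a minimal sketch of one way to read them, assuming you can get the contents of /proc/<pid>/status for the kube-state-metrics process (for example from the node). This is not how the numbers above were collected. The helper names are hypothetical and the sample text uses made-up round values, not data from this cluster. It also assumes the kernel's "kB" means 1024 bytes.

```python
import re

# Matches the two memory lines of interest in /proc/<pid>/status,
# e.g. "VmRSS:	  102400 kB".
_MEM_LINE = re.compile(r"^(VmSize|VmRSS):\s+(\d+)\s+kB$")


def parse_proc_status_memory(status_text):
    """Return a dict with VmSize (VSZ) and VmRSS (RSS) in kB.

    Keys are only present if the corresponding line was found.
    """
    fields = {}
    for line in status_text.splitlines():
        match = _MEM_LINE.match(line.strip())
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


def kb_to_mb(kb):
    """Convert kB (1024-byte units, as reported by /proc) to MB."""
    return kb / 1024.0


# Illustrative input only: made-up values, not taken from this bug.
SAMPLE_STATUS = """Name:\tkube-state-metr
VmSize:\t  204800 kB
VmRSS:\t  102400 kB
Threads:\t12
"""

if __name__ == "__main__":
    mem = parse_proc_status_memory(SAMPLE_STATUS)
    print("VSZ MB:", kb_to_mb(mem["VmSize"]))
    print("RSS MB:", kb_to_mb(mem["VmRSS"]))
```

On a real node the input would come from reading /proc/<pid>/status rather than SAMPLE_STATUS.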
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.