Created attachment 1433972
OoM kills (dmesg), oc get pods, oc describe pod kube-state-metrics*

$ oc version
oc v3.10.0-0.32.0
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://lb-0.scale-ci.example.com:8443
openshift v3.10.0-0.32.0
kubernetes v1.10.0+b81c8f8

Steps to Reproduce:
1. Install a larger OCP cluster and watch kube-state-metrics pods getting OoM killed

Actual results:
Please see attachment.

Expected results:
No OoM kills.
We made some pretty massive improvements that landed in 4.0; hopefully they fix all of these. At the end of the day we will still use a lot of memory, as that's the purpose of kube-state-metrics (being a cache that can be read from super fast). In future versions we will also look into sharding kube-state-metrics, but for now the changes we have made have shown very significant improvements, which should at least raise this bar a lot. Please re-test with OpenShift 4.0.
@Mike, could you help test this aos-scalability bug?
Marking verified on 4.2. There won't be another 750+ node cluster run until post-4.2; a new bz can be opened then if there is an issue. In a 250-node cluster on GCP, kube-state-metrics is using 278MB VSZ and 172MB RSS.
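For anyone re-checking figures like these on a later run, here is a minimal sketch of reading VSZ and RSS for a process from /proc on Linux. It is not necessarily how the numbers above were collected. It assumes the script runs somewhere the kube-state-metrics PID is visible (on the node or inside the container), and the function names are illustrative only. The kernel reports these values in kB (1024 bytes), so "MB" below means kB / 1024.

```python
def parse_vm_status(status_text):
    """Return (vsz_mb, rss_mb) parsed from the text of /proc/<pid>/status."""
    fields = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key in ("VmSize", "VmRSS"):
            # Values look like "  284672 kB"; keep the integer part.
            fields[key] = int(value.split()[0])
    # VmSize is the virtual size (VSZ), VmRSS is the resident set size (RSS).
    return fields["VmSize"] / 1024, fields["VmRSS"] / 1024


def read_process_memory(pid):
    """Read VSZ and RSS in MB for a live process (hypothetical helper, Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_vm_status(f.read())
```

Usage would be along the lines of `vsz_mb, rss_mb = read_process_memory(pid)`, where `pid` is the kube-state-metrics process ID found with whatever tooling is available on the node.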
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922