Created attachment 1433959
OoM kills (dmesg), oc get pods, oc describe pod prometheus-operator*

Description of problem:

$ oc version
oc v3.10.0-0.32.0
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://lb-0.scale-ci.example.com:8443
openshift v3.10.0-0.32.0
kubernetes v1.10.0+b81c8f8

Steps to Reproduce:
1. Install a larger OCP cluster and watch prometheus-operator getting OoM killed

Actual results:
Please see attachment.

Expected results:
No OoM kills.
We have made various scalability changes for 4.0, so this needs to be re-assessed in the 4.0 scope.
@Mike Could you also help test this aos-scalability bug?
I did not see this issue in a smaller cluster. I am not sure whether it would happen in a larger cluster.
Marking verified on 4.2. There won't be another 750+ node cluster run until post-4.2, and a new bz can be opened then if there is an issue. In a 250 node cluster on GCP, prometheus-operator is using 80MB VSZ and 2MB RSS.
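For anyone re-checking these numbers on a later large-scale run, here is a minimal sketch of one way to read VSZ and RSS for a process on Linux. It is not necessarily how the figures above were collected. It parses the `VmSize` and `VmRSS` fields of `/proc/<pid>/status`, which the kernel reports in kB. The function names are hypothetical, and it assumes you have already found the prometheus-operator PID on the node hosting the pod.

```python
import re


def parse_vm_kb(status_text):
    """Return (VmSize, VmRSS) in kB from the text of /proc/<pid>/status.

    A field that is missing from the text is returned as None.
    """
    fields = {}
    for line in status_text.splitlines():
        # Lines look like "VmRSS:      2048 kB"
        m = re.match(r"(VmSize|VmRSS):\s+(\d+)\s+kB", line)
        if m:
            fields[m.group(1)] = int(m.group(2))
    return fields.get("VmSize"), fields.get("VmRSS")


def read_vm_kb(pid):
    """Read VSZ and RSS (kB) for a running process; Linux only."""
    with open(f"/proc/{pid}/status") as f:
        return parse_vm_kb(f.read())
```

Calling `read_vm_kb(pid)` on the node returns a `(vsz_kb, rss_kb)` tuple. Sampling it periodically during a scale run would show whether memory grows with node count.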
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922