Created attachment 1433959
OoM kills (dmesg), oc get pods, oc describe pod prometheus-operator*
Description of problem:
$ oc version
features: Basic-Auth GSSAPI Kerberos SPNEGO
Steps to Reproduce:
1. Install a larger OCP cluster and watch prometheus-operator getting OoM killed
Please see attachment.
No OoM kills.
We have made various scalability changes for 4.0, so this needs to be re-assessed in the 4.0 scope.
Could you also help to test this aos-scalability bug?
Did not find this issue in a smaller cluster; not sure whether it would happen in a larger cluster.
Marking verified on 4.2. There won't be another 750+ node cluster run until post-4.2, and a new bz can be opened then if there is an issue. In a 250-node cluster on GCP, prometheus-operator is using 80MB VSZ and 2MB RSS.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.