Depending on the actual memory limits, you may or may not be affected by this, but other people have reported memory leaks with Prometheus 2.2.1 [1]. There's an open PR [2] that seems to fix the problem, but it may have surfaced other issues.

[1] https://github.com/prometheus/prometheus/issues/4095
[2] https://github.com/prometheus/prometheus/pull/4013
# oc edit cm -n openshift-monitoring cluster-monitoring-config

Add the resources line below:

  prometheusK8s:
    baseImage: quay.io/prometheus/prometheus
    resources: {}

Wait patiently; mine took about 5 minutes to stop/start both prometheus-k8s-N pods.

# oc get pod -n openshift-monitoring prometheus-k8s-1 -o yaml

Now you will see:

  name: prometheus
  resources:
    requests:
      memory: 2Gi

The RSS of these Prometheus processes is just about 2.2G right now, each using 0.2 cores (scale lab env, 100-node cluster, 600 pods). I think we need to disable the limits in-product, or at least bump them to something like 30G (number taken from starter clusters, see attached image).
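For context, here is a sketch of what the edited ConfigMap might look like as a whole object, assuming the monitoring configuration lives under a config.yaml data key as in a default openshift-monitoring install (exact contents will vary per cluster; only the resources override under prometheusK8s matters for this workaround):

  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: cluster-monitoring-config
    namespace: openshift-monitoring
  data:
    config.yaml: |
      prometheusK8s:
        baseImage: quay.io/prometheus/prometheus
        # empty resources stanza to override the default limits; per the pod
        # output above, a 2Gi memory request still shows up afterwards
        resources: {}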
https://github.com/openshift/openshift-ansible/pull/8442
(In reply to Dan Mace from comment #4)
> https://github.com/openshift/openshift-ansible/pull/8442

New upstream fix: https://github.com/openshift/cluster-monitoring-operator/pull/19

This will also require an openshift-ansible PR to pick up a new cluster-monitoring-operator release, which I'll link here.
The fix for this is ready; still trying to get a new cluster-monitoring-operator release pushed so I can open a new openshift-ansible PR.
https://github.com/openshift/openshift-ansible/pull/8514
Moving back to ASSIGNED based on comment 9.
This can be tested with the release of https://github.com/openshift/openshift-ansible/pull/8591
The PR has been merged into openshift-ansible-3.10.0-0.63.0; please check.
Verified on 3.10.0-0.64.0. prometheus-operator is now using a PV/PVC for persistence, and resource limits have been removed from the deployments/statefulsets/daemonsets.
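For anyone reproducing the verification, a quick spot check could look like the sketch below (assuming the default object names prometheus-k8s and prometheus-operator in the openshift-monitoring namespace; adjust for your cluster). Empty resources output means no requests/limits are set on the containers, and the PVC listing confirms persistence is backed by volumes:

  # dump container resources to confirm no limits remain
  oc get statefulset prometheus-k8s -n openshift-monitoring \
      -o jsonpath='{.spec.template.spec.containers[*].resources}'
  oc get deployment prometheus-operator -n openshift-monitoring \
      -o jsonpath='{.spec.template.spec.containers[*].resources}'

  # confirm the PVCs used for persistence exist and are bound
  oc get pvc -n openshift-monitoring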