Bug 1572587
Summary: | prometheus pods getting oomkilled @ 100 node scale | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Jeremy Eder <jeder> |
Component: | Monitoring | Assignee: | Dan Mace <dmace> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Mike Fiedler <mifiedle> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 3.10.0 | CC: | aos-bugs, dmace, fbranczy, jforrest, jmencak, mifiedle, spasquie, wsun |
Target Milestone: | --- | ||
Target Release: | 3.10.0 | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | aos-scalability-310 | ||
Fixed In Version: | Doc Type: | No Doc Update | |
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2018-12-20 21:12:30 UTC | Type: | Bug |
Comment 1
Simon Pasquier
2018-04-27 12:51:45 UTC
# oc edit cm -n openshift-monitoring cluster-monitoring-config

Add the resources line below:

    prometheusK8s:
      baseImage: quay.io/prometheus/prometheus
      resources: {}

Wait patiently; mine took about 5 minutes to stop and start both prometheus-k8s-N pods.

# oc get pod -n openshift-monitoring prometheus-k8s-1 -o yaml

Now you will see:

    name: prometheus
    resources:
      requests:
        memory: 2Gi

The RSS of these prometheus processes is about 2.2G right now, each using 0.2 cores (scale lab env, 100 node cluster, 600 pods).

I think we need to disable the limits in-product, or at least bump them to something like 30G (number taken from starter clusters, see attached image).

(In reply to Dan Mace from comment #4)
> https://github.com/openshift/openshift-ansible/pull/8442

New upstream fix: https://github.com/openshift/cluster-monitoring-operator/pull/19

Will also require an openshift-ansible PR to a new cluster-monitoring-operator release, which I'll link here.

Fix for this is ready, still trying to get a new cluster-monitoring-operator release pushed so I can open a new openshift-ansible PR.

Moving back to ASSIGNED based on comment 9.

This can be tested with the release of https://github.com/openshift/openshift-ansible/pull/8591

The PR has been merged to openshift-ansible-3.10.0-0.63.0, please check.

Verified on 3.10.0-0.64.0. prometheus-operator is now using PV/PVC for persistence, and resource limits have been removed from deployments/statefulsets/daemonsets.
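For a sense of scale, the memory figures quoted in this report can be compared in bytes. The following is only a rough sketch, not part of the fix: `parse_memory` and `headroom` are hypothetical helper names, the parser handles only a subset of Kubernetes quantity suffixes, and reading the reported "2.2G" RSS as 2.2Gi is an assumption.

```python
# Hypothetical helpers for comparing the memory figures quoted in this bug.
# Only a subset of Kubernetes quantity suffixes is handled.
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,  # binary
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,     # decimal
}


def parse_memory(quantity: str) -> int:
    """Convert a quantity string such as '2Gi' or '30G' to bytes."""
    # Check two-letter suffixes first so 'Gi' is not mistaken for 'G'.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            number = float(quantity[: -len(suffix)])
            return int(number * _SUFFIXES[suffix])
    return int(quantity)  # plain byte count, no suffix


def headroom(observed: str, threshold: str) -> int:
    """Bytes left before `observed` reaches `threshold` (negative if over)."""
    return parse_memory(threshold) - parse_memory(observed)


if __name__ == "__main__":
    # Assumption: the reported 2.2G RSS is read as 2.2Gi.
    print("vs 2Gi request:", headroom("2.2Gi", "2Gi"))
    print("vs 30G proposed limit:", headroom("2.2Gi", "30G"))
```

Under that assumption the observed RSS is already above the 2Gi request shown in the pod spec and far below the proposed 30G ceiling.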