Bug 1572587

Summary: prometheus pods getting oomkilled @ 100 node scale
Product: OpenShift Container Platform
Reporter: Jeremy Eder <jeder>
Component: Monitoring
Assignee: Dan Mace <dmace>
Status: CLOSED CURRENTRELEASE
QA Contact: Mike Fiedler <mifiedle>
Severity: urgent
Priority: urgent
Version: 3.10.0
CC: aos-bugs, dmace, fbranczy, jforrest, jmencak, mifiedle, spasquie, wsun
Target Milestone: ---
Target Release: 3.10.0
Hardware: All
OS: Linux
Whiteboard: aos-scalability-310
Doc Type: No Doc Update
Story Points: ---
Last Closed: 2018-12-20 21:12:30 UTC
Type: Bug

Comment 1 Simon Pasquier 2018-04-27 12:51:45 UTC
Depending on the actual memory limits, you may or may not be affected by this, but other people have reported memory leaks with Prometheus 2.2.1 [1]. There's an open PR [2] that seems to fix the problem, but it may have surfaced other issues.

[1] https://github.com/prometheus/prometheus/issues/4095
[2] https://github.com/prometheus/prometheus/pull/4013

Comment 2 Jeremy Eder 2018-04-27 16:09:08 UTC
# oc edit cm -n openshift-monitoring cluster-monitoring-config

Add the resources line shown below:

    prometheusK8s:
      baseImage: quay.io/prometheus/prometheus
      resources: {}
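
For context, a rough sketch of the full ConfigMap that edit lands in, assuming the operator reads its settings from a config.yaml data key (field names taken from the snippet above; the exact schema may differ between releases):

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: cluster-monitoring-config
      namespace: openshift-monitoring
    data:
      config.yaml: |
        prometheusK8s:
          baseImage: quay.io/prometheus/prometheus
          # empty resources block is intended to override the operator's defaults
          resources: {}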

Wait patiently; mine took about 5 minutes for both prometheus-k8s-N pods to stop and restart.

# oc get pod -n openshift-monitoring prometheus-k8s-1 -o yaml

Now you will see:

    name: prometheus
    resources:
      requests:
        memory: 2Gi
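
If you'd rather not scan the full YAML, a jsonpath spot-check pulls out the same data (just a convenience, not a different source of truth):

    # oc get pod -n openshift-monitoring prometheus-k8s-1 -o jsonpath='{.spec.containers[?(@.name=="prometheus")].resources}'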

The RSS of these prometheus processes is just about 2.2G right now, each using 0.2 cores (scale lab env, 100 node cluster, 600 pods).

I think we need to disable the limits in the product, or at least bump them to something like 30G (number taken from starter clusters; see attached image).
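
If we end up bumping the limits rather than removing them, the override in the same ConfigMap would presumably look something like this (30Gi is only the ballpark from the starter clusters, not a validated number):

    prometheusK8s:
      resources:
        requests:
          memory: 2Gi
        limits:
          memory: 30Gi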

Comment 5 Dan Mace 2018-05-21 17:16:31 UTC
(In reply to Dan Mace from comment #4)
> https://github.com/openshift/openshift-ansible/pull/8442

New upstream fix: https://github.com/openshift/cluster-monitoring-operator/pull/19

This will also require an openshift-ansible PR to pick up a new cluster-monitoring-operator release, which I'll link here.

Comment 6 Dan Mace 2018-05-23 14:48:59 UTC
The fix for this is ready; I'm still trying to get a new cluster-monitoring-operator release pushed so I can open a new openshift-ansible PR.

Comment 10 Mike Fiedler 2018-06-05 18:57:45 UTC
Moving back to ASSIGNED based on comment 9.

Comment 11 Dan Mace 2018-06-06 17:02:00 UTC
This can be tested with the release of https://github.com/openshift/openshift-ansible/pull/8591

Comment 12 Wei Sun 2018-06-08 02:01:29 UTC
The PR has been merged into openshift-ansible-3.10.0-0.63.0; please check.

Comment 13 Mike Fiedler 2018-06-11 15:46:22 UTC
Verified on 3.10.0-0.64.0. prometheus-operator is now using a PV/PVC for persistence, and resource limits have been removed from the deployments/statefulsets/daemonsets.
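
For anyone re-checking this later, a couple of spot-check commands (the statefulset name is inferred from the prometheus-k8s-N pod names above):

    # oc get pvc -n openshift-monitoring
    # oc get statefulset -n openshift-monitoring prometheus-k8s -o yaml | grep -A6 resources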