Bug 1572587

Summary: prometheus pods getting oomkilled @ 100 node scale
Product: OpenShift Container Platform
Reporter: Jeremy Eder <jeder>
Component: Monitoring
Assignee: Dan Mace <dmace>
Status: CLOSED CURRENTRELEASE
QA Contact: Mike Fiedler <mifiedle>
Severity: urgent
Priority: urgent
Version: 3.10.0
CC: aos-bugs, dmace, fbranczy, jforrest, jmencak, mifiedle, spasquie, wsun
Target Milestone: ---
Target Release: 3.10.0
Hardware: All
OS: Linux
Whiteboard: aos-scalability-310
Doc Type: No Doc Update
Story Points: ---
Last Closed: 2018-12-20 21:12:30 UTC
Type: Bug

Comment 1 Simon Pasquier 2018-04-27 12:51:45 UTC
Depending on the actual memory limits, you may or may not be affected by this, but other people have reported memory leaks with Prometheus 2.2.1 [1]. There's an open PR [2] that seems to fix the problem, but it may have surfaced other issues.

[1] https://github.com/prometheus/prometheus/issues/4095
[2] https://github.com/prometheus/prometheus/pull/4013

Comment 2 Jeremy Eder 2018-04-27 16:09:08 UTC
# oc edit cm -n openshift-monitoring cluster-monitoring-config

Add the resources line shown below:

    prometheusK8s:
      baseImage: quay.io/prometheus/prometheus
      resources: {}
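
For context, a rough sketch of the full ConfigMap that edit lands in, assuming the operator reads its settings from a config.yaml data key (field names taken from the snippet above; the exact schema may differ between releases):

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: cluster-monitoring-config
      namespace: openshift-monitoring
    data:
      config.yaml: |
        prometheusK8s:
          baseImage: quay.io/prometheus/prometheus
          # empty resources block is intended to override the operator's defaults
          resources: {}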

Wait patiently; mine took about 5 minutes for both prometheus-k8s-N pods to stop and restart.

# oc get pod -n openshift-monitoring prometheus-k8s-1 -o yaml

Now you will see:

    name: prometheus
    resources:
      requests:
        memory: 2Gi
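
If you'd rather not scan the full YAML, a jsonpath spot-check pulls out the same data (just a convenience, not a different source of truth):

    # oc get pod -n openshift-monitoring prometheus-k8s-1 -o jsonpath='{.spec.containers[?(@.name=="prometheus")].resources}'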

The RSS of these prometheus processes is just about 2.2G right now, each using 0.2 cores (scale lab env, 100 node cluster, 600 pods).

I think we need to disable the limits in the product, or at least bump them to something like 30G (number taken from starter clusters; see attached image).
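
If we end up bumping the limits rather than removing them, the override in the same ConfigMap would presumably look something like this (30Gi is only the ballpark from the starter clusters, not a validated number):

    prometheusK8s:
      resources:
        requests:
          memory: 2Gi
        limits:
          memory: 30Gi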

Comment 5 Dan Mace 2018-05-21 17:16:31 UTC
(In reply to Dan Mace from comment #4)
> https://github.com/openshift/openshift-ansible/pull/8442

New upstream fix: https://github.com/openshift/cluster-monitoring-operator/pull/19

This will also require an openshift-ansible PR to pick up a new cluster-monitoring-operator release, which I'll link here.

Comment 6 Dan Mace 2018-05-23 14:48:59 UTC
The fix for this is ready; I'm still trying to get a new cluster-monitoring-operator release pushed so I can open a new openshift-ansible PR.

Comment 10 Mike Fiedler 2018-06-05 18:57:45 UTC
Moving back to ASSIGNED based on comment 9.

Comment 11 Dan Mace 2018-06-06 17:02:00 UTC
This can be tested with the release of https://github.com/openshift/openshift-ansible/pull/8591

Comment 12 Wei Sun 2018-06-08 02:01:29 UTC
The PR has been merged into openshift-ansible-3.10.0-0.63.0; please check.

Comment 13 Mike Fiedler 2018-06-11 15:46:22 UTC
Verified on 3.10.0-0.64.0. prometheus-operator is now using a PV/PVC for persistence, and resource limits have been removed from the deployments/statefulsets/daemonsets.
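
For anyone re-checking this later, a couple of spot-check commands (the statefulset name is inferred from the prometheus-k8s-N pod names above):

    # oc get pvc -n openshift-monitoring
    # oc get statefulset -n openshift-monitoring prometheus-k8s -o yaml | grep -A6 resources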