Bug 1576543

Summary:

prometheus-operator pods getting OoM killed @ 750 nodes

Product:

OpenShift Container Platform

Reporter:

Jiří Mencák <jmencak>

Component:

Monitoring

Assignee:

Frederic Branczyk <fbranczy>

Status:

CLOSED ERRATA

QA Contact:

Mike Fiedler <mifiedle>

Severity:

high

Docs Contact:

Priority:

high

Version:

3.10.0

CC:

aos-bugs, byron.collins, dmace, jeder, juzhao, lcosic, mifiedle, spasquie

Target Milestone:

---

Target Release:

4.2.0

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

aos-scalability-310

Fixed In Version:

Doc Type:

No Doc Update

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2019-10-16 06:27:40 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
OoM kills (dmesg), oc get pods, oc describe pod prometheus-operator*	none

Description Jiří Mencák 2018-05-09 17:19:11 UTC

Created attachment 1433959
OoM kills (dmesg), oc get pods, oc describe pod prometheus-operator*

Description of problem:

$ oc version
oc v3.10.0-0.32.0
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://lb-0.scale-ci.example.com:8443
openshift v3.10.0-0.32.0
kubernetes v1.10.0+b81c8f8

Steps to Reproduce:
1. Install a large OCP cluster (about 750 nodes) and watch the prometheus-operator pods get OoM killed.

Actual results:
Please see attachment.

Expected results:
No OoM kills.
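
For anyone re-checking this on another cluster, here is a minimal sketch of how OoM kills could be pulled out of saved dmesg output. The message pattern is an assumption about the kernel's log wording (it differs between kernel versions), and the function name and sample lines are made up for illustration; they are not taken from the attachment.

```python
import re

# Assumed message shape; kernels differ, so adjust the pattern to the actual log.
OOM_RE = re.compile(
    r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)",
    re.IGNORECASE,
)


def find_oom_kills(dmesg_text):
    """Return (pid, command) pairs for OoM-kill lines found in dmesg output."""
    return [(int(m.group(1)), m.group(2)) for m in OOM_RE.finditer(dmesg_text)]


# Hypothetical sample lines, for illustration only.
sample = """\
[12345.678] Memory cgroup out of memory: Kill process 4242 (operator) score 1999 or sacrifice child
[12345.679] Killed process 4242 (operator) total-vm:1024kB, anon-rss:512kB, file-rss:0kB
[12400.000] eth0: link up
"""

print(find_oom_kills(sample))
```

Feeding it the text of `dmesg` captured on a node gives a quick list of which processes were killed, which can then be matched against the prometheus-operator containers.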

Comment 2 Frederic Branczyk 2019-02-06 17:21:32 UTC

We have made various scalability changes for 4.0, so this needs to be re-assessed in the 4.0 scope.

Comment 4 Junqi Zhao 2019-02-19 01:27:45 UTC

@Mike

Could you also help to test this aos-scalability bug?

Comment 6 Junqi Zhao 2019-04-11 00:58:23 UTC

Did not find this issue in a smaller cluster; not sure whether it would happen in a larger cluster.

Comment 9 Mike Fiedler 2019-09-13 16:39:59 UTC

Marking verified on 4.2. There won't be another 750+ node cluster run until post-4.2, and a new bz can be opened then if there is an issue. In a 250 node cluster on GCP, prometheus-operator is using 80MB VSZ and 2MB RSS.

Comment 11 errata-xmlrpc 2019-10-16 06:27:40 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922