Bug 1508059
| Summary: | Prometheus and AlertManager volumes grow infinitely | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | Scott Weiss <scweiss> |
| Component: | Hawkular | Assignee: | Paul Gier <pgier> |
| Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.6.1 | CC: | aos-bugs, bazulay, ccoleman, eparis, kgeorgie, pgier, pweil |
| Target Milestone: | --- | | |
| Target Release: | 3.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-11-28 22:20:29 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description Scott Weiss 2017-10-31 18:56:28 UTC
@pgier: is this something you should be looking into, or do we need to get someone from the OpenShift side to take this over?

I saw a similar phenomenon, but this time on the Prometheus PV on a 3.7 cluster.

[root@vm-49-57 exports]# oc version
oc v3.7.0-0.178.0
kubernetes v1.7.6+a08f5eeb62
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://baz-ocp-3.7-master01.10.35.49.57.nip.io:8443
openshift v3.7.0-0.178.0
kubernetes v1.7.6+a08f5eeb62

Images in use by the prometheus pod:
image: openshift/oauth-proxy:v1.0.0
image: openshift/prometheus:v2.0.0-dev.3
image: openshift/oauth-proxy:v1.0.0
image: openshift/prometheus-alert-buffer:v0.0.2
image: openshift/prometheus-alertmanager:v0.9.1

I have noticed that the Prometheus PV grew to 29G after the cluster had been up for only 2 days.

pgier - this is currently on the 3.7 blocker list. The growth rate here looks pretty severe. Please take a look, and if this isn't something we need to block the release on, please update the target release to 3.8.

I started investigating Scott's issue with the alertmanager, but I'm not sure yet why the disk usage is growing so much. Tried upgrading alertmanager to 0.9.1 as suggested in the upstream issue (https://github.com/prometheus/alertmanager/issues/1074), but there didn't seem to be any improvement.

This might be useful to get insight on metric counts, storage issues, etc.: https://github.com/kausalco/public/tree/master/promvt

Tested; the Prometheus and AlertManager volumes do not grow infinitely now.

# openshift version
openshift v3.7.0-0.198.0
kubernetes v1.7.6+a08f5eeb62
etcd 3.2.8

images
prometheus-alert-buffer/images/v3.7.2-1
oauth-proxy/images/v3.7.2-1
prometheus-alertmanager/images/v3.7.2-1
prometheus/images/v3.7.2-1

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188
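For anyone hitting similar growth before moving to the fixed images, below is a minimal sketch of how to watch the two volumes and the numbers that drive Prometheus disk usage. The project name (openshift-metrics), pod name (prometheus-0), container names, mount paths, and metric names are assumptions about a default 3.7 Prometheus deployment with a Prometheus 2.0 TSDB, not something confirmed in this bug; adjust them to match your cluster.

# Sketch only: project, pod, container names, and mount paths are assumed
# defaults for an openshift_prometheus install; adjust to your deployment.
oc exec -n openshift-metrics prometheus-0 -c prometheus -- df -h /prometheus
oc exec -n openshift-metrics prometheus-0 -c alertmanager -- df -h /alertmanager

# Forward the Prometheus API locally, then check the head series count and
# sample ingest rate, which are the main drivers of TSDB disk growth.
oc port-forward -n openshift-metrics prometheus-0 9090:9090 &
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=prometheus_tsdb_head_series'
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(prometheus_tsdb_head_samples_appended_total[5m])'

If growth needs to be bounded in the meantime, Prometheus 2.0's --storage.tsdb.retention flag (15d by default) caps how much history the TSDB keeps; whether the v2.0.0-dev.3 image in use here accepts the same flag name was not verified in this bug.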