Description of problem:
The data volume for Prometheus Alertmanager, which is attached to the prometheus pod (in the deployment created by https://github.com/openshift/openshift-ansible/tree/master/roles/openshift_prometheus), grows without limit. In my deployment it is growing at a rate of ~1 MB/minute. Growth does not stop, and after enough time (several days to a week) the NFS host (in my case the OCP master) crashes due to running out of disk space.

Version-Release number of selected component (if applicable):
oc v3.6.173.0.5
kubernetes v1.6.1+5115d708d7
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ocp-master01.10.35.48.138.nip.io:8443
openshift v3.6.173.0.5
kubernetes v1.6.1+5115d708d7

Images in use by the prometheus pod:
Image: openshift/oauth-proxy:v1.0.0
Image: openshift/prometheus:v2.0.0-dev
Image: openshift/oauth-proxy:v1.0.0
Image: openshift/prometheus-alert-buffer:v0.0.1
Image: openshift/prometheus-alertmanager:dev

How reproducible:
Every time

Steps to Reproduce:
1. Deploy prometheus using internal NFS and the openshift-ansible role
2. Wait a week (or simply long enough to watch the volume in /exports grow large enough)

Actual results:
PV grows to unbounded size

Expected results:
PV should have an upper limit; alertmanager should have a finite retention period.
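For anyone trying to reproduce this or measure the growth rate, a minimal sketch along these lines can be used to watch the volume; the export path, namespace, pod, container, and mount path are assumptions about a default openshift-ansible deployment and may need adjusting:

# On the NFS host (the OCP master here): watch the exported volumes grow.
# The exact export directory layout under /exports is an assumption.
watch -n 60 du -sh /exports/*

# Alternatively, check usage from inside the pod. Namespace
# (openshift-metrics), pod (prometheus-0), container (alertmanager),
# and mount path (/alertmanager) are assumed, not confirmed.
oc exec -n openshift-metrics prometheus-0 -c alertmanager -- du -sh /alertmanager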
@pgier: Is this something you should be looking into, or do we need to get someone from the OpenShift side to take this over?
I saw a similar phenomenon, but this time on the prometheus PV, on a 3.7 cluster.

[root@vm-49-57 exports]# oc version
oc v3.7.0-0.178.0
kubernetes v1.7.6+a08f5eeb62
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://baz-ocp-3.7-master01.10.35.49.57.nip.io:8443
openshift v3.7.0-0.178.0
kubernetes v1.7.6+a08f5eeb62

Images in use by the prometheus pod:
image: openshift/oauth-proxy:v1.0.0
image: openshift/prometheus:v2.0.0-dev.3
image: openshift/oauth-proxy:v1.0.0
image: openshift/prometheus-alert-buffer:v0.0.2
image: openshift/prometheus-alertmanager:v0.9.1

I noticed that the prometheus PV grew to 29G after the cluster had been up for only 2 days.
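To check whether that growth is coming from the Prometheus TSDB data itself, a disk-usage breakdown inside the pod can help; the namespace, pod, container, and /prometheus mount path below are assumptions rather than values confirmed for this deployment:

# Break down disk usage under the assumed Prometheus data mount.
# The glob is expanded inside the container via sh -c.
oc exec -n openshift-metrics prometheus-0 -c prometheus -- sh -c 'du -sh /prometheus/*'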
pgier - this is currently on the 3.7 blocker list. The growth rate here looks pretty severe. Please take a look, and if this isn't something we need to block the release on, please update the target release to 3.8.
I started investigating Scott's issue with the alertmanager, but I'm not sure yet why the disk usage is growing so much. I tried upgrading alertmanager to 0.9.1 as suggested in the upstream issue (https://github.com/prometheus/alertmanager/issues/1074), but there didn't seem to be any improvement.
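For the record, the 0.9.1 upgrade attempt can be approximated with something like the following; the workload kind (StatefulSet), its name, the container name, and the namespace are all assumptions about what the ansible role creates, so treat this as a sketch rather than the exact procedure used:

# Show which alertmanager image the pod is currently running
# (namespace and pod name are assumed).
oc get pod prometheus-0 -n openshift-metrics \
  -o jsonpath='{.spec.containers[?(@.name=="alertmanager")].image}'

# Point the alertmanager container at v0.9.1 (assumes a StatefulSet
# named "prometheus" with a container named "alertmanager").
oc set image statefulset/prometheus \
  alertmanager=openshift/prometheus-alertmanager:v0.9.1 \
  -n openshift-metrics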
Caused by https://github.com/kubernetes/kubernetes/pull/54921
This might be useful for getting insight into metric counts, storage issues, etc.: https://github.com/kausalco/public/tree/master/promvt
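Independently of promvt, a couple of ad-hoc queries against the Prometheus query API give a similar picture of where the series are coming from. Running them from inside the prometheus container avoids the oauth-proxy; the namespace/pod/container names, and the availability of curl in the image, are assumptions:

# Top 10 metric names by number of active time series.
oc exec -n openshift-metrics prometheus-0 -c prometheus -- \
  curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=topk(10, count by (__name__) ({__name__=~".+"}))'

# Total number of series currently held in the TSDB head block.
oc exec -n openshift-metrics prometheus-0 -c prometheus -- \
  curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=prometheus_tsdb_head_series'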
Tested; the Prometheus Alertmanager volumes no longer grow without limit.

# openshift version
openshift v3.7.0-0.198.0
kubernetes v1.7.6+a08f5eeb62
etcd 3.2.8

Images:
prometheus-alert-buffer/images/v3.7.2-1
oauth-proxy/images/v3.7.2-1
prometheus-alertmanager/images/v3.7.2-1
prometheus/images/v3.7.2-1
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:3188