Bug 1508059 - Prometheus and AlertManager volumes grows infinitely
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Hawkular
Version: 3.6.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 3.7.0
Assigned To: Paul Gier
QA Contact: Junqi Zhao
Docs Contact:
Depends On:
Blocks:
Reported: 2017-10-31 14:56 EDT by Scott Weiss
Modified: 2017-11-28 17:20 EST
CC List: 7 users

See Also:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-11-28 17:20:29 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


External Trackers
Tracker: Red Hat Product Errata
ID: RHSA-2017:3188
Priority: normal
Status: SHIPPED_LIVE
Summary: Moderate: Red Hat OpenShift Container Platform 3.7 security, bug, and enhancement update
Last Updated: 2017-11-28 21:34:54 EST

Description Scott Weiss 2017-10-31 14:56:28 EDT
Description of problem:
The data volume for the Prometheus Alertmanager, which is attached to the prometheus pod (in the deployment created by https://github.com/openshift/openshift-ansible/tree/master/roles/openshift_prometheus), grows without limit. In my deployment, it grows at a rate of ~1 MB per minute. Growth does not stop, and after enough time (several days to a week) the NFS host (in my case the OCP master) crashes because it runs out of disk space.
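
As a rough sanity check on the reported rate, the arithmetic can be sketched as below. This is only an illustration: the 10 GB of free space on the NFS host is an assumed figure (the report does not state it), and the helper name `minutes_until_full` is hypothetical.

```python
def minutes_until_full(free_bytes: int, growth_bytes_per_minute: float) -> float:
    """Return the minutes until free space runs out at a constant growth rate."""
    if growth_bytes_per_minute <= 0:
        raise ValueError("growth rate must be positive")
    return free_bytes / growth_bytes_per_minute


MB = 1000 ** 2
GB = 1000 ** 3

# ~1 MB/minute is the rate reported above.
rate = 1 * MB
# ASSUMPTION: 10 GB of free space on the NFS host; not stated in the report.
free_space = 10 * GB

per_day = rate * 60 * 24                      # bytes written per day
days = minutes_until_full(free_space, rate) / (60 * 24)

print(f"{per_day / GB:.2f} GB/day, full in {days:.1f} days")
# prints "1.44 GB/day, full in 6.9 days"
```

Under that assumption the host fills in about a week, which is in line with the "several days to a week" observed here.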

Version-Release number of selected component (if applicable):
oc v3.6.173.0.5
kubernetes v1.6.1+5115d708d7
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ocp-master01.10.35.48.138.nip.io:8443
openshift v3.6.173.0.5
kubernetes v1.6.1+5115d708d7

Images in use by the prometheus pod:
Image:              openshift/oauth-proxy:v1.0.0
Image:              openshift/prometheus:v2.0.0-dev
Image:              openshift/oauth-proxy:v1.0.0
Image:              openshift/prometheus-alert-buffer:v0.0.1
Image:              openshift/prometheus-alertmanager:dev

How reproducible:
Every time

Steps to Reproduce:
1. Deploy prometheus using internal NFS and the openshift-ansible role
2. Wait a week (or simply long enough to watch the volume in /exports grow large enough)
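
Step 2 does not have to take a week: sampling the size of the exported directory twice and dividing by the elapsed time gives the growth rate directly. The following is a minimal sketch, not part of the original report. The function names are hypothetical, and the directory to pass in depends on how the NFS export is laid out under /exports.

```python
import os
import time


def dir_size_bytes(path: str) -> int:
    """Sum the apparent sizes of all regular files and symlinks under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                # lstat reports apparent size, which can differ from what du shows.
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                pass  # file disappeared between listing and stat
    return total


def growth_rate(size_before: int, size_after: int, elapsed_seconds: float) -> float:
    """Return growth in bytes per minute between two size samples."""
    return (size_after - size_before) / elapsed_seconds * 60


def sample_growth(path: str, interval_seconds: float) -> float:
    """Measure path twice, interval_seconds apart, and return bytes per minute."""
    before = dir_size_bytes(path)
    time.sleep(interval_seconds)
    after = dir_size_bytes(path)
    return growth_rate(before, after, interval_seconds)


# Example (the directory name is an assumption, adjust to the actual export):
# print(sample_growth("/exports/prometheus-alertmanager", 300) / 1e6, "MB/minute")
```

A steadily positive rate across several samples would indicate the unbounded growth described in this report.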

Actual results:
PV grows to unbounded size

Expected results:
PV should have an upper limit; alertmanager should have a finite retention period.
Comment 1 Matt Wringe 2017-10-31 15:06:22 EDT
@pgier: is this something you should be looking into? or do we need to get someone from the OpenShift side to take this over?
Comment 2 Barak 2017-11-01 11:54:02 EDT
I saw a similar phenomenon, but this time on the Prometheus PV, on a 3.7 cluster.

[root@vm-49-57 exports]# oc version 
oc v3.7.0-0.178.0
kubernetes v1.7.6+a08f5eeb62
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://baz-ocp-3.7-master01.10.35.49.57.nip.io:8443
openshift v3.7.0-0.178.0
kubernetes v1.7.6+a08f5eeb62

Images in use by the prometheus pod:
      image: openshift/oauth-proxy:v1.0.0
      image: openshift/prometheus:v2.0.0-dev.3
      image: openshift/oauth-proxy:v1.0.0
      image: openshift/prometheus-alert-buffer:v0.0.2
      image: openshift/prometheus-alertmanager:v0.9.1


I noticed that the Prometheus PV grew to 29G after only 2 days of uptime.
Comment 3 Paul Weil 2017-11-02 16:27:09 EDT
pgier - this is currently on the 3.7 blocker list. The growth rate here looks pretty severe. Please take a look, and if this isn't something we need to block the release on, please update the target release to 3.8.
Comment 4 Paul Gier 2017-11-03 09:56:42 EDT
I started investigating Scott's issue with the alertmanager, but I'm not sure yet why the disk usage is growing so much. I tried upgrading alertmanager to 0.9.1 as suggested in the upstream issue (https://github.com/prometheus/alertmanager/issues/1074), but there didn't seem to be any improvement.
Comment 6 Clayton Coleman 2017-11-03 12:10:42 EDT
Caused by https://github.com/kubernetes/kubernetes/pull/54921
Comment 7 Krasi 2017-11-06 06:49:33 EST
This might be useful for getting insight into metric counts, storage issues, etc.
https://github.com/kausalco/public/tree/master/promvt
Comment 11 Junqi Zhao 2017-11-09 04:23:30 EST
Tested. The Prometheus and AlertManager volumes no longer grow infinitely.
# openshift version
openshift v3.7.0-0.198.0
kubernetes v1.7.6+a08f5eeb62
etcd 3.2.8

images
prometheus-alert-buffer/images/v3.7.2-1
oauth-proxy/images/v3.7.2-1
prometheus-alertmanager/images/v3.7.2-1
prometheus/images/v3.7.2-1
Comment 14 errata-xmlrpc 2017-11-28 17:20:29 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188
