Description of problem:
The data volume for Prometheus Alertmanager, which is attached to the prometheus pod (in the deployment created by https://github.com/openshift/openshift-ansible/tree/master/roles/openshift_prometheus), grows without limit. In my deployment it is growing at a rate of ~1 MB/minute. Growth does not stop, and after enough time (several days to a week) the NFS host (in my case the OCP master) crashes due to running out of disk space.

Version-Release number of selected component (if applicable):
oc v3.6.173.0.5
kubernetes v1.6.1+5115d708d7
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ocp-master01.10.35.48.138.nip.io:8443
openshift v3.6.173.0.5
kubernetes v1.6.1+5115d708d7

Images in use by the prometheus pod:
Image: openshift/oauth-proxy:v1.0.0
Image: openshift/prometheus:v2.0.0-dev
Image: openshift/oauth-proxy:v1.0.0
Image: openshift/prometheus-alert-buffer:v0.0.1
Image: openshift/prometheus-alertmanager:dev

How reproducible:
Every time

Steps to Reproduce:
1. Deploy prometheus using internal NFS and the openshift-ansible role
2. Wait a week (or simply long enough to watch the volume in /exports grow large enough)

Actual results:
PV grows to unbounded size

Expected results:
PV should have an upper limit; alertmanager should have a finite retention period.
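For anyone trying to reproduce this or measure the growth rate, a minimal sketch along these lines can be used to watch the volume; the export path, namespace, pod, container, and mount path are assumptions about a default openshift-ansible deployment and may need adjusting:

# On the NFS host (the OCP master here): watch the exported volumes grow.
# The exact export directory layout under /exports is an assumption.
watch -n 60 du -sh /exports/*

# Alternatively, check usage from inside the pod. Namespace
# (openshift-metrics), pod (prometheus-0), container (alertmanager),
# and mount path (/alertmanager) are assumed, not confirmed.
oc exec -n openshift-metrics prometheus-0 -c alertmanager -- du -sh /alertmanager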
@pgier: Is this something you should be looking into, or do we need to get someone from the OpenShift side to take this over?
I saw a similar phenomenon, but this time on the prometheus PV, on a 3.7 cluster.

[root@vm-49-57 exports]# oc version
oc v3.7.0-0.178.0
kubernetes v1.7.6+a08f5eeb62
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://baz-ocp-3.7-master01.10.35.49.57.nip.io:8443
openshift v3.7.0-0.178.0
kubernetes v1.7.6+a08f5eeb62

Images in use by the prometheus pod:
image: openshift/oauth-proxy:v1.0.0
image: openshift/prometheus:v2.0.0-dev.3
image: openshift/oauth-proxy:v1.0.0
image: openshift/prometheus-alert-buffer:v0.0.2
image: openshift/prometheus-alertmanager:v0.9.1

I noticed that the prometheus PV grew to 29G after the cluster had been up for only 2 days.
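To check whether that growth is coming from the Prometheus TSDB data itself, a disk-usage breakdown inside the pod can help; the namespace, pod, container, and /prometheus mount path below are assumptions rather than values confirmed for this deployment:

# Break down disk usage under the assumed Prometheus data mount.
# The glob is expanded inside the container via sh -c.
oc exec -n openshift-metrics prometheus-0 -c prometheus -- sh -c 'du -sh /prometheus/*'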
pgier - this is currently on the 3.7 blocker list. The growth rate here looks pretty severe. Please take a look, and if this isn't something we need to block the release on, please update the target release to 3.8.
I started investigating Scott's issue with the alertmanager, but I'm not sure yet why the disk usage is growing so much. I tried upgrading alertmanager to 0.9.1 as suggested in the upstream issue (https://github.com/prometheus/alertmanager/issues/1074), but there didn't seem to be any improvement.
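For the record, the 0.9.1 upgrade attempt can be approximated with something like the following; the workload kind (StatefulSet), its name, the container name, and the namespace are all assumptions about what the ansible role creates, so treat this as a sketch rather than the exact procedure used:

# Show which alertmanager image the pod is currently running
# (namespace and pod name are assumed).
oc get pod prometheus-0 -n openshift-metrics \
  -o jsonpath='{.spec.containers[?(@.name=="alertmanager")].image}'

# Point the alertmanager container at v0.9.1 (assumes a StatefulSet
# named "prometheus" with a container named "alertmanager").
oc set image statefulset/prometheus \
  alertmanager=openshift/prometheus-alertmanager:v0.9.1 \
  -n openshift-metrics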
Caused by https://github.com/kubernetes/kubernetes/pull/54921
This might be useful for getting insight into metric counts, storage issues, etc.: https://github.com/kausalco/public/tree/master/promvt
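Independently of promvt, a couple of ad-hoc queries against the Prometheus query API give a similar picture of where the series are coming from. Running them from inside the prometheus container avoids the oauth-proxy; the namespace/pod/container names, and the availability of curl in the image, are assumptions:

# Top 10 metric names by number of active time series.
oc exec -n openshift-metrics prometheus-0 -c prometheus -- \
  curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=topk(10, count by (__name__) ({__name__=~".+"}))'

# Total number of series currently held in the TSDB head block.
oc exec -n openshift-metrics prometheus-0 -c prometheus -- \
  curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=prometheus_tsdb_head_series'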
Tested; the Prometheus Alertmanager volumes no longer grow without limit.

# openshift version
openshift v3.7.0-0.198.0
kubernetes v1.7.6+a08f5eeb62
etcd 3.2.8

Images:
prometheus-alert-buffer/images/v3.7.2-1
oauth-proxy/images/v3.7.2-1
prometheus-alertmanager/images/v3.7.2-1
prometheus/images/v3.7.2-1
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:3188