Bug 1531096
Summary: | Prometheus fills up entire storage space | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Rajnikant <rkant>
Component: | Hawkular | Assignee: | Paul Gier <pgier>
Status: | CLOSED CURRENTRELEASE | QA Contact: | Junqi Zhao <juzhao>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 3.7.0 | CC: | aos-bugs, dennis.stritzke, jcantril, pdwyer, scortopa, smunilla
Target Milestone: | --- | |
Target Release: | 3.9.z | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | openshift v3.9.22 | Doc Type: | No Doc Update
Doc Text: | | |
Story Points: | --- | |
Clone Of: | | Environment: |
Last Closed: | 2018-06-18 18:19:30 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Rajnikant 2018-01-04 14:56:56 UTC
A possible workaround is to delete the series producing the error and then delete the .tmp directories: https://github.com/prometheus/prometheus/issues/3487#issuecomment-347491886

I can confirm that the workaround works. Unfortunately, the issue keeps happening over and over again with new series, so this is only a temporary workaround.

Are you using a custom value for storage.tsdb.min-block-duration? The openshift installer currently defaults to a setting of 2 minutes, but we found that the upstream default of 2h prevents some out-of-memory issues in some cases. Not sure whether this will also affect disk usage, but it should at least reduce the number of tsdb block directories that are created.

We are not setting storage.tsdb.min-block-duration. Just to be complete, here is the list of things that we are setting:
- '--storage.tsdb.retention=168h'
- '--config.file=/etc/prometheus/prometheus.yml'
- '--web.listen-address=:9090'
- '--storage.tsdb.path=/data'
- '--web.enable-admin-api'

Prometheus 2.1.0 was released this week and contains several fixes to the tsdb. Can you try the upstream prom/prometheus:v2.1.0 container image to see whether it resolves the storage issue?

Sorry for not keeping this issue up to date. I deployed the Prometheus 2.1 upstream image in parallel to our current setup. By Feb 13 I will have collected enough insight from a real usage pattern, and from provoking the issue as before.

I was able to verify that the storage issue is resolved with the 2.1 upstream image.

Great! We're planning to push out the 2.1.0 upgrade for openshift 3.7 and higher. PRs for upgrading prometheus in the examples and the installer:
https://github.com/openshift/origin/pull/18727
https://github.com/openshift/openshift-ansible/pull/7258

The master (3.10) and 3.9 branches of openshift have been updated to use prometheus 2.2.1, which should resolve this issue.
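The temporary workaround described above can be sketched as follows. This is a hedged example, not the exact commands from the linked issue: it assumes Prometheus 2.x is reachable on localhost:9090 with --web.enable-admin-api set (as in the flag list above), that the storage path is /data, and that `problem_metric` stands in for whichever series the error names.

```shell
# Delete the offending series via the TSDB admin API (available in
# Prometheus 2.x when --web.enable-admin-api is set):
curl -X POST 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]=problem_metric'

# Reclaim the disk space held by the deleted data:
curl -X POST 'http://localhost:9090/api/v1/admin/tsdb/clean_tombstones'

# Remove leftover .tmp block directories under the storage path:
find /data -maxdepth 1 -type d -name '*.tmp' -exec rm -rf {} +
```

As noted above, this only frees space until new series trigger the error again; the real fix is the 2.1.0+ upgrade.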
Tested with prometheus/images/v3.9.22-1; the prometheus version is now 2.2.1 in the prometheus 3.9 image, and it passed our sanity testing. Other images:
prometheus-alert-buffer/images/v3.9.22-1
prometheus-alertmanager/images/v3.9.22-1
oauth-proxy/images/v3.9.22-1

# openshift version
openshift v3.9.22
kubernetes v1.9.1+a0ce1bc657
etcd 3.2.16

(In reply to Dennis Stritzke from comment #6)
> We are not setting the storage.tsdb.min-block-duration.
>
> Just to be complete, here is the list of things that we are setting:
> - '--storage.tsdb.retention=168h'
> - '--config.file=/etc/prometheus/prometheus.yml'
> - '--web.listen-address=:9090'
> - '--storage.tsdb.path=/data'
> - '--web.enable-admin-api'

How can these settings be managed inside the prometheus pods, e.g. changing --storage.tsdb.retention from 15d to another value? As far as I can see, they are startup args for the containers. Thanks a lot.
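Since the flags in question are container startup args, changing one means editing the Prometheus pod template rather than anything inside the running pod. A hedged sketch follows; the namespace `openshift-metrics` and statefulset name `prometheus` are assumptions based on the default openshift-ansible deployment and may differ in your cluster.

```shell
# Show the current args of the prometheus container (names are assumptions):
oc -n openshift-metrics get statefulset prometheus \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="prometheus")].args}'

# Edit the pod template; change --storage.tsdb.retention to the desired
# value, then save. The statefulset controller recreates the pod with
# the new args:
oc -n openshift-metrics edit statefulset prometheus
```

Note that the retention value above (168h, i.e. 7d) comes from the flag list in comment #6; your deployment may use a different default such as 15d.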