1531096 – Prometheus fills up entire storage space

Summary: Prometheus fills up entire storage space

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Hawkular
Sub Component:
Version:	3.7.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	3.9.z
Assignee:	Paul Gier
QA Contact:	Junqi Zhao
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2018-01-04 14:56 UTC by Rajnikant
Modified:	2021-03-11 16:49 UTC
CC List:	6 users
Fixed In Version:	openshift v3.9.22
Doc Type:	No Doc Update
Doc Text:	undefined
Clone Of:
Environment:
Last Closed:	2018-06-18 18:19:30 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1541212	0	unspecified	CLOSED	prometheus fails compaction	2021-03-11 17:05:19 UTC
Red Hat Product Errata	RHSA-2018:2013	0	normal	SHIPPED_LIVE	Important: OpenShift Container Platform 3.9 security, bug fix, and enhancement update	2018-06-27 22:01:43 UTC

Internal Links: 1541212

Description Rajnikant 2018-01-04 14:56:56 UTC

Description of problem:

Prometheus fills up its entire storage space with hundreds of *.tmp files, even though the time series data itself uses only around 4 GB.

Version-Release number of selected component (if applicable):
3.7
registry.access.redhat.com/openshift3/prometheus:v3.7.14-5

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:
Prometheus fills up all of its storage space with hundreds of *.tmp files, even though the time series data itself uses only around 4 GB.

Expected results:


Additional info:

Comment 3 Paul Gier 2018-01-10 19:06:41 UTC

Possible workaround is to delete series with the error and then delete the .tmp directories:
https://github.com/prometheus/prometheus/issues/3487#issuecomment-347491886
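Before deleting anything, it can help to see how much space the leftover .tmp entries actually occupy. The following is a minimal sketch, not part of the linked workaround: the function name find_tmp_entries is hypothetical, and it assumes the leftovers sit directly under the Prometheus data directory with names ending in ".tmp". It only reports and never deletes.

```python
import os


def find_tmp_entries(data_dir):
    """Return a sorted list of (path, size_in_bytes) for every entry
    directly under data_dir whose name ends in '.tmp'.

    Assumption: leftover compaction output lives at the top level of
    the data directory. Nothing is deleted here.
    """
    results = []
    for entry in sorted(os.listdir(data_dir)):
        if not entry.endswith(".tmp"):
            continue
        path = os.path.join(data_dir, entry)
        if os.path.isdir(path):
            # Sum the sizes of all files below the .tmp directory.
            size = 0
            for root, _dirs, files in os.walk(path):
                for name in files:
                    size += os.path.getsize(os.path.join(root, name))
        else:
            size = os.path.getsize(path)
        results.append((path, size))
    return results
```

Calling find_tmp_entries on the directory passed to --storage.tsdb.path and summing the sizes gives a rough figure for how much of the volume is taken up by .tmp entries rather than by time series data.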

Comment 4 Dennis Stritzke 2018-01-12 13:19:48 UTC

I can confirm that the workaround works. Unfortunately, the issue keeps recurring with new series, so the workaround is only temporary.

Comment 5 Paul Gier 2018-01-17 16:45:46 UTC

Are you using a custom value for storage.tsdb.min-block-duration?  The OpenShift installer currently defaults to a setting of 2 minutes, but we found that the Prometheus default of 2h prevents out-of-memory issues in some cases.  I'm not sure whether this also affects disk usage, but it should at least reduce the number of tsdb block directories that are created.

Comment 6 Dennis Stritzke 2018-01-17 16:50:11 UTC

We are not setting the storage.tsdb.min-block-duration.

Just to be complete, here is the list of things that we are setting:
- '--storage.tsdb.retention=168h'
- '--config.file=/etc/prometheus/prometheus.yml'
- '--web.listen-address=:9090'
- '--storage.tsdb.path=/data'
- '--web.enable-admin-api'
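
For reference, a retention of 168h corresponds to 7 days. The following is a minimal sketch of how the argument list above could be read into a dictionary and the retention converted to seconds. The helper names parse_flags and duration_seconds are hypothetical, and the duration parser handles only a single integer followed by one unit (s, m, h, d, w, y), which is an assumption rather than the full Prometheus duration syntax.

```python
import re

# Seconds per unit; a year is taken as 365 days and a week as 7 days.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400,
                 "w": 604800, "y": 31536000}


def duration_seconds(text):
    """Convert a simple duration such as '168h' or '15d' to seconds."""
    match = re.fullmatch(r"(\d+)([smhdwy])", text)
    if match is None:
        raise ValueError("unsupported duration: %r" % text)
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def parse_flags(args):
    """Map '--name=value' arguments to {name: value}.

    Flags given without a value (for example '--web.enable-admin-api')
    are stored as True.
    """
    flags = {}
    for arg in args:
        name, sep, value = arg.lstrip("-").partition("=")
        flags[name] = value if sep else True
    return flags


args = [
    "--storage.tsdb.retention=168h",
    "--config.file=/etc/prometheus/prometheus.yml",
    "--web.listen-address=:9090",
    "--storage.tsdb.path=/data",
    "--web.enable-admin-api",
]
flags = parse_flags(args)
retention_days = duration_seconds(flags["storage.tsdb.retention"]) / 86400
```

With the list above, retention_days evaluates to 7.0, and "storage.tsdb.min-block-duration" is absent from flags, which matches the statement that it is not being set.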

Comment 7 Paul Gier 2018-01-25 16:55:40 UTC

Prometheus 2.1.0 was released this week and contains several fixes to the tsdb.
Can you try using the upstream prom/prometheus:v2.1.0 container image to see if it resolves the storage issue?

Comment 8 Dennis Stritzke 2018-02-07 13:09:43 UTC

Sorry for not keeping this issue up to date. I deployed the upstream Prometheus 2.1 image in parallel to our current setup. By Feb 13 we will have collected enough insight, both from real usage patterns and from provoking the issue as before.

Comment 9 Dennis Stritzke 2018-02-16 08:58:45 UTC

I was able to verify that the storage issue is resolved with the upstream 2.1 image.

Comment 10 Paul Gier 2018-02-22 01:04:21 UTC

Great!  We're planning to push out the 2.1.0 upgrade for openshift 3.7 and higher.

Comment 11 Paul Gier 2018-02-22 22:22:12 UTC

PRs for upgrading prometheus in examples and installer:
https://github.com/openshift/origin/pull/18727
https://github.com/openshift/openshift-ansible/pull/7258

Comment 13 Paul Gier 2018-04-18 18:06:03 UTC

The master (3.10) and 3.9 branches of openshift have been updated to use prometheus 2.2.1 which should resolve this issue.

Comment 14 Junqi Zhao 2018-04-19 09:42:34 UTC

Tested with prometheus/images/v3.9.22-1. The Prometheus version in the 3.9 image is now 2.2.1, and it passed our sanity testing.

Other images:
prometheus-alert-buffer/images/v3.9.22-1
prometheus-alertmanager/images/v3.9.22-1
oauth-proxy/images/v3.9.22-1


# openshift version
openshift v3.9.22
kubernetes v1.9.1+a0ce1bc657
etcd 3.2.16

Comment 15 Serena Cortopassi 2019-03-18 14:33:25 UTC

(In reply to Dennis Stritzke from comment #6)
> We are not setting the storage.tsdb.min-block-duration.
> 
> Just to be complete, here is the list of things that we are setting:
> - '--storage.tsdb.retention=168h'
> - '--config.file=/etc/prometheus/prometheus.yml'
> - '--web.listen-address=:9090'
> - '--storage.tsdb.path=/data'
> - '--web.enable-admin-api'

How can these settings be managed inside the Prometheus pods, e.g. changing --storage.tsdb.retention from 15d to another value?
As far as I can see, they are startup args for the containers.

Thanks a lot
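
If the settings are indeed container startup arguments, as the comment above suggests, changing one comes down to editing the args list in the pod template of whichever resource manages the Prometheus pod. The following is a minimal sketch of that list edit only. The function name with_flag is hypothetical, and how the edited list gets back into the cluster is left open because it depends on how Prometheus was deployed.

```python
def with_flag(args, name, value):
    """Return a copy of args in which '--name' is set to value.

    An existing '--name' or '--name=old' entry is replaced in place;
    duplicates are dropped. If the flag is absent it is appended.
    """
    prefix = "--" + name
    new_arg = prefix + "=" + value
    result = []
    replaced = False
    for arg in args:
        if arg == prefix or arg.startswith(prefix + "="):
            if not replaced:
                result.append(new_arg)
                replaced = True
        else:
            result.append(arg)
    if not replaced:
        result.append(new_arg)
    return result
```

For example, with_flag(["--storage.tsdb.retention=168h", "--storage.tsdb.path=/data"], "storage.tsdb.retention", "30d") returns the same list with the retention entry changed to "--storage.tsdb.retention=30d"; the value 30d is only an illustration.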
