Bug 1889710 - Prometheus metrics on disk take more space compared to OCP 4.5
Summary: Prometheus metrics on disk take more space compared to OCP 4.5
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Simon Pasquier
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks: 1889711
 
Reported: 2020-10-20 12:44 UTC by Simon Pasquier
Modified: 2021-02-24 15:27 UTC (History)
9 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1889711 (view as bug list)
Environment:
Last Closed: 2021-02-24 15:26:59 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
compression ratio on 4.6 (74.52 KB, image/png)
2020-10-23 13:47 UTC, Simon Pasquier
4.7 compression ratio (64.80 KB, image/png)
2020-10-29 08:25 UTC, Junqi Zhao


Links
System ID Private Priority Status Summary Last Updated
Github openshift prometheus pull 61 0 None closed Bug 1885235: bump Prometheus to v2.22.0 2021-01-13 09:39:46 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:27:26 UTC

Description Simon Pasquier 2020-10-20 12:44:52 UTC
Description of problem:
The timer implementation in Go >= 1.14 introduced a degradation in the compression of data on disk (see https://github.com/prometheus/prometheus/pull/7976 for details; users reported that block sizes increased by up to 50% compared to previous versions). Prometheus for OCP 4.5 (and earlier) is built with Go 1.13, so the only affected version is OCP 4.6.

Version-Release number of selected component (if applicable):
4.6

How reproducible:
Always

Steps to Reproduce:
Run Prometheus at least 4 hours and measure the sample compression with:

rate(prometheus_tsdb_compaction_chunk_size_bytes_sum[8h]) / rate(prometheus_tsdb_compaction_chunk_samples_sum[8h])
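For context, this expression divides two counter rates over the same window, which reduces to a ratio of increases: bytes of compacted chunk data written per sample stored. A minimal sketch of the arithmetic (the counter increases below are hypothetical, not taken from a real cluster):

```python
def bytes_per_sample(size_bytes_delta: float, samples_delta: float) -> float:
    """Ratio of the increase in prometheus_tsdb_compaction_chunk_size_bytes_sum
    to the increase in prometheus_tsdb_compaction_chunk_samples_sum over the
    same window -- i.e. average on-disk bytes per compacted sample."""
    return size_bytes_delta / samples_delta

# Hypothetical 8h increases: 3.4 MB of chunk data for 2.0M samples
print(bytes_per_sample(3_400_000, 2_000_000))  # 1.7 -> within the expected 1-2 range
```

A value above 2 bytes per sample, as seen on 4.6, indicates the degraded compression.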

Actual results:
More than 2 bytes on disk per sample.

Expected results:
Between 1 and 2 bytes on disk per sample.

Additional info:
See https://github.com/prometheus/prometheus/issues/7846

Comment 1 Simon Pasquier 2020-10-20 12:45:35 UTC
Fixed in https://github.com/openshift/prometheus/pull/61 by bumping Prometheus to v2.22.0.

Comment 3 Junqi Zhao 2020-10-23 06:53:05 UTC
Tested with 4.7.0-0.nightly-2020-10-22-141237 in a UPI-on-Azure cluster; prometheus version="2.22.0"
Element 	Value
prometheus_build_info{branch="rhaos-4.7-rhel-8",container="prometheus-proxy",endpoint="web",goversion="go1.15.0",instance="10.128.2.15:9091",job="prometheus-k8s",namespace="openshift-monitoring",pod="prometheus-k8s-1",revision="7014907b651c19701e46e21f622c2b113cae6cac",service="prometheus-k8s",version="2.22.0"}	1
prometheus_build_info{branch="rhaos-4.7-rhel-8",container="prometheus-proxy",endpoint="web",goversion="go1.15.0",instance="10.129.2.12:9091",job="prometheus-k8s",namespace="openshift-monitoring",pod="prometheus-k8s-0",revision="7014907b651c19701e46e21f622c2b113cae6cac",service="prometheus-k8s",version="2.22.0"}	1

Run Prometheus at least 4 hours and measure the sample compression with:
rate(prometheus_tsdb_compaction_chunk_size_bytes_sum[8h]) / rate(prometheus_tsdb_compaction_chunk_samples_sum[8h])
the result is greater than 2, not between 1 and 2 bytes per sample

Element 	Value
{container="prometheus-proxy",endpoint="web",instance="10.128.2.15:9091",job="prometheus-k8s",namespace="openshift-monitoring",pod="prometheus-k8s-1",service="prometheus-k8s"}	2.014416691936647
{container="prometheus-proxy",endpoint="web",instance="10.129.2.12:9091",job="prometheus-k8s",namespace="openshift-monitoring",pod="prometheus-k8s-0",service="prometheus-k8s"}	2.1713813708344802

Comment 4 Simon Pasquier 2020-10-23 13:46:53 UTC
How long did you wait before running the query? I guess that the results can be skewed if not enough compactions have happened. I'll paste a screenshot for 4.6 which shows that the second compaction has a better compression ratio than the first one.

Comment 5 Simon Pasquier 2020-10-23 13:47:34 UTC
Created attachment 1723776 [details]
compression ratio on 4.6

Comment 6 Junqi Zhao 2020-10-27 02:25:40 UTC
(In reply to Simon Pasquier from comment #4)
> How long have you waited before running the query? I guess that the results
> can be skewed if not enough compactions have happened. I'll paste a
> screenshot for 4.6 which show that the second compaction has a better
> compression than the first one.

I waited about 5 to 6 hours; I will wait longer and monitor again.

Comment 8 Junqi Zhao 2020-10-29 08:24:02 UTC
4.7.0-0.nightly-2020-10-27-051128: let the cluster run for 8 hours, then queried
rate(prometheus_tsdb_compaction_chunk_size_bytes_sum[8h]) / rate(prometheus_tsdb_compaction_chunk_samples_sum[8h])
The result is between 1 and 2 bytes per sample; see the attached picture.

Comment 9 Junqi Zhao 2020-10-29 08:25:01 UTC
Created attachment 1724998 [details]
4.7 compression ratio

Comment 13 errata-xmlrpc 2021-02-24 15:26:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

