Description of problem:
Timers in Go >= 1.14 introduced a degradation in the compression of data on disk (see https://github.com/prometheus/prometheus/pull/7976 for details; users reported that block sizes increased by up to 50% compared to previous versions). Prometheus for OCP 4.5 (and earlier) is built with Go 1.13, so the only affected version is OCP 4.6.

Version-Release number of selected component (if applicable):
4.6

How reproducible:
Always

Steps to Reproduce:
Run Prometheus for at least 4 hours and measure the sample compression with:
rate(prometheus_tsdb_compaction_chunk_size_bytes_sum[8h]) / rate(prometheus_tsdb_compaction_chunk_samples_sum[8h])

Actual results:
More than 2 bytes on disk per sample.

Expected results:
Between 1 and 2 bytes on disk per sample.

Additional info:
See https://github.com/prometheus/prometheus/issues/7846
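For anyone reproducing this outside the web UI, here is a minimal sketch (not part of the original report) of evaluating the same expression through the Prometheus HTTP API. It assumes the API is reachable at http://localhost:9090, e.g. via a port-forward to a prometheus-k8s pod; the endpoint URL is an assumption for illustration only.

package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"net/url"
)

func main() {
	// Assumption: prometheus-k8s is reachable locally, e.g. through a port-forward.
	promURL := "http://localhost:9090"
	// Average bytes stored on disk per sample over the last 8 hours.
	expr := "rate(prometheus_tsdb_compaction_chunk_size_bytes_sum[8h])" +
		" / rate(prometheus_tsdb_compaction_chunk_samples_sum[8h])"

	resp, err := http.Get(promURL + "/api/v1/query?query=" + url.QueryEscape(expr))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	// Healthy compression is between 1 and 2 bytes per sample; values
	// consistently above 2 match the regression described above.
	fmt.Println(string(body))
}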
Fixed in https://github.com/openshift/prometheus/pull/61 by bumping Prometheus to v2.22.0.
Tested with 4.7.0-0.nightly-2020-10-22-141237 on a UPI-on-Azure cluster; Prometheus version is 2.22.0:

prometheus_build_info{branch="rhaos-4.7-rhel-8",container="prometheus-proxy",endpoint="web",goversion="go1.15.0",instance="10.128.2.15:9091",job="prometheus-k8s",namespace="openshift-monitoring",pod="prometheus-k8s-1",revision="7014907b651c19701e46e21f622c2b113cae6cac",service="prometheus-k8s",version="2.22.0"}  1
prometheus_build_info{branch="rhaos-4.7-rhel-8",container="prometheus-proxy",endpoint="web",goversion="go1.15.0",instance="10.129.2.12:9091",job="prometheus-k8s",namespace="openshift-monitoring",pod="prometheus-k8s-0",revision="7014907b651c19701e46e21f622c2b113cae6cac",service="prometheus-k8s",version="2.22.0"}  1

Ran Prometheus for at least 4 hours and measured the sample compression with:
rate(prometheus_tsdb_compaction_chunk_size_bytes_sum[8h]) / rate(prometheus_tsdb_compaction_chunk_samples_sum[8h])

The result is greater than 2 bytes, not between 1 and 2:

{container="prometheus-proxy",endpoint="web",instance="10.128.2.15:9091",job="prometheus-k8s",namespace="openshift-monitoring",pod="prometheus-k8s-1",service="prometheus-k8s"}  2.014416691936647
{container="prometheus-proxy",endpoint="web",instance="10.129.2.12:9091",job="prometheus-k8s",namespace="openshift-monitoring",pod="prometheus-k8s-0",service="prometheus-k8s"}  2.1713813708344802
How long did you wait before running the query? I guess that the results can be skewed if not enough compactions have happened. I'll paste a screenshot for 4.6 which shows that the second compaction has a better compression ratio than the first one.
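As a rough sanity check (my addition, not part of the original comment), one could confirm how many compactions actually ran in the window before trusting the ratio, using the prometheus_tsdb_compactions_total counter against the same hypothetical local endpoint as in the earlier sketch:

package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"net/url"
)

func main() {
	// Assumption: same local endpoint as the earlier sketch.
	promURL := "http://localhost:9090"
	// Number of TSDB compactions over the last 8 hours; with fewer than two,
	// the compression ratio may still be skewed by the first compaction.
	expr := "increase(prometheus_tsdb_compactions_total[8h])"

	resp, err := http.Get(promURL + "/api/v1/query?query=" + url.QueryEscape(expr))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := ioutil.ReadAll(resp.Body)
	fmt.Println(string(body))
}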
Created attachment 1723776 [details] compression ratio on 4.6
(In reply to Simon Pasquier from comment #4)
> How long did you wait before running the query? I guess that the results
> can be skewed if not enough compactions have happened. I'll paste a
> screenshot for 4.6 which shows that the second compaction has a better
> compression ratio than the first one.

Waited for about 5-6 hours; will wait for a longer time and monitor again.
Tested with 4.7.0-0.nightly-2020-10-27-051128: let the cluster run for 8 hours, then ran "rate(prometheus_tsdb_compaction_chunk_size_bytes_sum[8h]) / rate(prometheus_tsdb_compaction_chunk_samples_sum[8h])". The result is between 1 and 2 bytes; see the attached picture.
Created attachment 1724998 [details] 4.7 compression ratio
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633