Bug 1804655 - Prometheus retention configuration get reset when minor upgrade is performed on OCP311
Summary: Prometheus retention configuration get reset when minor upgrade is performed ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.11.z
Assignee: Sergiusz Urbaniak
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-02-19 10:52 UTC by Daein Park
Modified: 2021-08-09 14:41 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: If retention settings are configured, those are overwritten by ansible playbooks. Consequence: Prometheus retention configuration is not applied. Fix: Ansible now supports setting the retention configuration. Result: Prometheus retention configuration is applied.
Clone Of:
Environment:
Last Closed: 2021-04-15 07:17:31 UTC
Target Upstream Version:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift openshift-ansible pull 12105 | 0 | None | closed | Bug 1804655: roles/openshift_cluster_monitoring_operator: configure retention | 2021-01-27 13:48:05 UTC
Red Hat Product Errata | RHBA-2020:0793 | 0 | None | None | None | 2020-03-20 00:12:55 UTC

Description Daein Park 2020-02-19 10:52:42 UTC
Description of problem:

We can configure the retention time for samples through PrometheusK8sConfig on OCP311 by following the guide at [0].
However, the upgrade process removes the value because it is not included in the cluster-monitoring-operator-config template.
As a result, Prometheus unexpectedly deletes older samples. If the operator allows only the default of 15 days, that is not reasonable for many modern systems.
Most systems need a longer retention time for samples, because 15 days is too short for a production system.

[0] PrometheusK8sConfig
    [https://github.com/openshift/cluster-monitoring-operator/blob/release-3.11/Documentation/user-guides/configuring-cluster-monitoring.md#prometheusk8sconfig]
~~~
Use PrometheusK8sConfig to customize the Prometheus instance used for cluster monitoring.

# retention time for samples.
retention: <string>
~~~

Version-Release number of selected component (if applicable):

This issue was reported when upgrading OCP311 from v3.11.135 to v3.11.157.

How reproducible:

It can always be reproduced as follows.

Steps to Reproduce:
1. Set 'retention: "25d"' in the cluster-monitoring-config configmap.
2. Run the upgrade playbooks or reinstall cluster-monitoring-operator.
3. The "retention" setting is removed completely.
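
For reference, step 1 might look like the fragment below. This is only a sketch: the openshift-monitoring namespace matches the one used in the verification command in this report, while the `config.yaml` data key and the `prometheusK8s` section name are assumptions taken from the configuration guide linked at [0] and should be checked against the cluster before applying.

```yaml
# Sketch of the cluster-monitoring-config configmap with a custom retention.
# Assumed (not shown in this report): the data key name and the
# prometheusK8s section, both based on the guide at [0].
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      # retention time for samples
      retention: 25d
```

After step 2, the `retention` line is the part that goes missing from this configmap.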

Actual results:

The configured "retention" is removed, which results in old samples being deleted unexpectedly.

Expected results:

After the upgrade, the configured "retention" remains as it was.

Additional info:

Configuring "retention" is an already implemented feature, so the upgrade should take this configuration into account.

Comment 2 Junqi Zhao 2020-02-27 12:45:32 UTC
The openshift_cluster_monitoring_operator_prometheus_retention parameter has been added.
Tested with:
# rpm -qa | grep ansible
openshift-ansible-3.11.176-1.git.0.abb9886.el7.noarch
ansible-2.6.20-1.el7ae.noarch
openshift-ansible-playbooks-3.11.176-1.git.0.abb9886.el7.noarch
openshift-ansible-docs-3.11.176-1.git.0.abb9886.el7.noarch
openshift-ansible-roles-3.11.176-1.git.0.abb9886.el7.noarch

Setting a value for openshift_cluster_monitoring_operator_prometheus_retention takes effect, for example:
openshift_cluster_monitoring_operator_prometheus_retention=12h

# oc -n openshift-monitoring get pod prometheus-k8s-0 -oyaml | grep -i "storage.tsdb.retention"
    - --storage.tsdb.retention=12h

Comment 5 errata-xmlrpc 2020-03-20 00:12:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0793

Comment 6 Mitchell Rollinson 2021-03-24 00:22:24 UTC
Hi team,

I would like to re-open this BZ, as I have an issue which I believe to be related.

Please advise if I need to create a new BZ.

This pertains to the retention of the following Prometheus customisations.

~~~
--storage.tsdb.retention=7d
--storage.tsdb.min-block-duration=30m
--storage.tsdb.max-block-duration=2h
~~~
AND
nodeSelector:
        #node-role.kubernetes.io/infra: "true"
        infrarole: prometheus   <<== There was a requirement to schedule to a dedicated node, post installation.

These changes were first attempted against statefulset.apps/prometheus-k8s and then against cm/cluster-monitoring-config, once it became apparent that the changes/customisations were lost.

The CU upgraded from 3.11.286 to 3.11.380.

I note the creation of the variable openshift_cluster_monitoring_operator_prometheus_retention, but can you advise how best (if possible) to ensure the other customisations are retained?

Many thanks

