Bug 1804655

Summary:	Prometheus retention configuration gets reset when a minor upgrade is performed on OCP311
Product:	OpenShift Container Platform	Reporter:	Daein Park <dapark>
Component:	Monitoring	Assignee:	Sergiusz Urbaniak <surbania>
Status:	CLOSED ERRATA	QA Contact:	Junqi Zhao <juzhao>
Severity:	medium	Docs Contact:
Priority:	unspecified
Version:	3.11.0	CC:	alegrand, anpicker, erooth, kakkoyun, lcosic, mirollin, mloibl, pkrupa, spasquie, surbania
Target Milestone:	---	Keywords:	Reopened
Target Release:	3.11.z
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Cause: If retention settings are configured, those are overwritten by ansible playbooks. Consequence: Prometheus retention configuration is not applied. Fix: Ansible now supports setting the retention configuration. Result: Prometheus retention configuration is applied.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-04-15 07:17:31 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Daein Park 2020-02-19 10:52:42 UTC

Description of problem:

The retention time for samples can be configured in PrometheusK8sConfig on OCP311 by following the guide at [0].
However, the upgrade process removes the value, because it is not included in the cluster-monitoring-operator-config template.
As a result, old samples are deleted unexpectedly after an upgrade. If the operator allows only the default of 15 days, that is not reasonable for many modern systems.
Most systems need a longer retention time for samples, because 15 days is too short for a production system.

[0] PrometheusK8sConfig
[https://github.com/openshift/cluster-monitoring-operator/blob/release-3.11/Documentation/user-guides/configuring-cluster-monitoring.md#prometheusk8sconfig]
~~~
Use PrometheusK8sConfig to customize the Prometheus instance used for cluster monitoring.

# retention time for samples.
retention: <string>
~~~

Version-Release number of selected component (if applicable):

This issue was reported when upgrading OCP311 from v3.11.135 to v3.11.157.

How reproducible:

Always reproducible, as follows.

Steps to Reproduce:
1. Set 'retention: "25d"' in the cluster-monitoring-config configmap.
2. Run the upgrade playbooks or reinstall cluster-monitoring-operator.
3. The "retention" setting is removed completely.
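
As a rough illustration of steps 1 and 3, the sketch below writes a sample config.yaml payload and reads the retention value back from it. It assumes that the setting sits under a prometheusK8s key, as the guide at [0] appears to describe. The helper name get_retention and the temporary file are made up for this example and are not part of any tool.

~~~shell
#!/bin/sh
# Sample config.yaml payload for the cluster-monitoring-config configmap.
# Assumption: retention sits under the prometheusK8s key, per the 3.11 guide.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
prometheusK8s:
  retention: "25d"
EOF

# Hypothetical helper: print the retention value found in config.yaml text
# read from stdin. Prints nothing when the key is absent.
get_retention() {
  sed -n 's/^[[:space:]]*retention:[[:space:]]*//p' | tr -d '"'
}

get_retention < "$cfg"   # prints 25d
~~~

Against a live cluster, the payload could probably be fed in with something like `oc -n openshift-monitoring get configmap cluster-monitoring-config -o jsonpath='{.data.config\.yaml}' | get_retention`, run once before and once after the upgrade. An empty result after the upgrade would match step 3.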

Actual results:

The configured "retention" is removed, which causes old samples to be deleted unexpectedly.

Expected results:

After the upgrade, the configured "retention" is preserved as it was.

Additional info:

Configuring "retention" is an already implemented feature, so the upgrade should take this configuration into account.

Comment 2 Junqi Zhao 2020-02-27 12:45:32 UTC

The openshift_cluster_monitoring_operator_prometheus_retention parameter has been added.
Tested with:
# rpm -qa | grep ansible
openshift-ansible-3.11.176-1.git.0.abb9886.el7.noarch
ansible-2.6.20-1.el7ae.noarch
openshift-ansible-playbooks-3.11.176-1.git.0.abb9886.el7.noarch
openshift-ansible-docs-3.11.176-1.git.0.abb9886.el7.noarch
openshift-ansible-roles-3.11.176-1.git.0.abb9886.el7.noarch

Setting a value for openshift_cluster_monitoring_operator_prometheus_retention takes effect, for example:
openshift_cluster_monitoring_operator_prometheus_retention=12h

# oc -n openshift-monitoring get pod prometheus-k8s-0 -oyaml | grep -i "storage.tsdb.retention"
    - --storage.tsdb.retention=12h
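
The grep check above can be narrowed to just the value, which is easier to compare in a script. The following is a minimal sketch, not part of the tested packages. The helper name retention_flag is hypothetical, and the sample input is the line from the verification output above.

~~~shell
#!/bin/sh
# Hypothetical helper: read pod YAML on stdin and print the value of the
# --storage.tsdb.retention argument, or nothing if the flag is not set.
retention_flag() {
  sed -n 's/^.*--storage\.tsdb\.retention=\([^[:space:]"]*\).*$/\1/p' | head -n 1
}

# Sample line taken from the verification output above.
printf '    - --storage.tsdb.retention=12h\n' | retention_flag   # prints 12h
~~~

On a cluster, the input would come from `oc -n openshift-monitoring get pod prometheus-k8s-0 -oyaml`, as in the command above.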

Comment 5 errata-xmlrpc 2020-03-20 00:12:43 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0793

Comment 6 Mitchell Rollinson 2021-03-24 00:22:24 UTC

Hi team,

I would like to re-open this BZ, as I have an issue which I believe to be related.

Please advise if I need to create a new BZ.

This pertains to the retention of the following Prometheus customisations.

~~~
--storage.tsdb.retention=7d
--storage.tsdb.min-block-duration=30m
--storage.tsdb.max-block-duration=2h
~~~
AND
nodeSelector:
        #node-role.kubernetes.io/infra: "true"
        infrarole: prometheus   <<==There was a requirement to schedule to a dedicated node, post installation.

These changes were attempted first against statefulset.apps/prometheus-k8s and then against cm/cluster-monitoring-config, once it became apparent that the changes/customisations were lost.
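
One way to see exactly which of the storage flags were lost is to snapshot them before and after the upgrade and compare the two lists. This is a sketch only. The helper name tsdb_flags is hypothetical, the "before" data is the three flags listed in this comment, and the "after" data is a made-up sample (not captured output) that assumes only the default 15d retention remains.

~~~shell
#!/bin/sh
# Hypothetical helper: read YAML on stdin and print every --storage.tsdb.*
# argument, one per line, sorted so that two snapshots can be compared.
tsdb_flags() {
  grep -o -e '--storage\.tsdb\.[A-Za-z.-]*=[^[:space:]"]*' | sort
}

before=$(mktemp)
after=$(mktemp)

# "before": the three customised flags from this comment.
printf '%s\n' \
  '- --storage.tsdb.retention=7d' \
  '- --storage.tsdb.min-block-duration=30m' \
  '- --storage.tsdb.max-block-duration=2h' | tsdb_flags > "$before"

# "after": illustrative sample only, assuming the default retention is back.
printf '%s\n' '- --storage.tsdb.retention=15d' | tsdb_flags > "$after"

# Lines present only in the "before" snapshot are the lost customisations.
lost=$(comm -23 "$before" "$after")
printf '%s\n' "$lost"
~~~

On a cluster, each snapshot would be taken by piping `oc -n openshift-monitoring get pod prometheus-k8s-0 -oyaml` into tsdb_flags.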

The CU upgraded from 3.11.286 to 3.11.380.

I note the creation of the variable openshift_cluster_monitoring_operator_prometheus_retention, but can you advise how best (if possible) to ensure that the other customisations are retained?

Many thanks