Description of problem:

On OCP 3.11, the retention time for samples can be configured in PrometheusK8sConfig, as described in the guide at [0]. However, the upgrade process removes the value because it is not included in the cluster-monitoring-operator-config template. As a result, Prometheus unexpectedly deletes older samples after an upgrade. Allowing only the default of 15 days is not reasonable for many modern systems: most sites need to increase the retention time, because 15 days is too short for production systems.

[0] PrometheusK8sConfig
    [https://github.com/openshift/cluster-monitoring-operator/blob/release-3.11/Documentation/user-guides/configuring-cluster-monitoring.md#prometheusk8sconfig]
~~~
Use PrometheusK8sConfig to customize the Prometheus instance used for cluster monitoring.

# retention time for samples.
retention: <string>
~~~

Version-Release number of selected component (if applicable):

This issue was reported when upgrading OCP 3.11 from v3.11.135 to v3.11.157.

How reproducible:

Always, using the steps below.

Steps to Reproduce:
1. Set 'retention: "25d"' in the cluster-monitoring-config configmap.
2. Run the upgrade playbooks or reinstall cluster-monitoring-operator.
3. The "retention" setting is removed completely.

Actual results:

The configured "retention" is removed, which results in older samples being deleted unexpectedly.

Expected results:

After the upgrade, the configured "retention" remains as it was.

Additional info:

Configuring "retention" is an already implemented feature, so the upgrade should preserve this setting.
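For reference, step 1 can be sketched as the configmap below. This is a minimal illustration, assuming the configmap lives in the openshift-monitoring namespace and uses the config.yaml / prometheusK8s layout described in the guide at [0]; only the "retention" value comes from this report, and the surrounding structure should be checked against the cluster before use.

~~~
# Minimal sketch of cluster-monitoring-config with a custom retention.
# Assumption: layout follows the release-3.11 configuration guide [0].
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      # Retention time for samples (the default is 15 days).
      retention: 25d
~~~

After step 2, comparing the configmap against this content shows whether the "retention" key survived the upgrade.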
The openshift_cluster_monitoring_operator_prometheus_retention parameter has been added.

Tested with:
# rpm -qa | grep ansible
openshift-ansible-3.11.176-1.git.0.abb9886.el7.noarch
ansible-2.6.20-1.el7ae.noarch
openshift-ansible-playbooks-3.11.176-1.git.0.abb9886.el7.noarch
openshift-ansible-docs-3.11.176-1.git.0.abb9886.el7.noarch
openshift-ansible-roles-3.11.176-1.git.0.abb9886.el7.noarch

Setting a value for openshift_cluster_monitoring_operator_prometheus_retention takes effect. Example:
openshift_cluster_monitoring_operator_prometheus_retention=12h

# oc -n openshift-monitoring get pod prometheus-k8s-0 -oyaml | grep -i "storage.tsdb.retention"
    - --storage.tsdb.retention=12h
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0793
Hi team,

I would like to re-open this BZ, as I have an issue which I believe to be related. Please advise if I need to create a new BZ.

This pertains to the retention of the following Prometheus customisations:
~~~
--storage.tsdb.retention=7d
--storage.tsdb.min-block-duration=30m
--storage.tsdb.max-block-duration=2h
~~~
AND

nodeSelector:
  #node-role.kubernetes.io/infra: "true"
  infrarole: prometheus   <<== There was a requirement to schedule to a dedicated node, post installation.

These changes were attempted first against statefulset.apps/prometheus-k8s and then subsequently against cm/cluster-monitoring-config, when it became apparent that the changes/customisations were lost.

The CU upgraded from 3.11.286 to 3.11.380.

I note the creation of the variable openshift_cluster_monitoring_operator_prometheus_retention, but can you advise how best (if possible) to ensure the other customisations are retained?

Many thanks
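To make the request concrete, the two customisations that appear to map onto documented PrometheusK8sConfig fields could be expressed in cm/cluster-monitoring-config roughly as follows. This is only a sketch, assuming the "retention" and "nodeSelector" fields from the release-3.11 configuration guide and the openshift-monitoring namespace; the two block-duration flags are left out because no matching field is known here. It does not by itself show whether an upgrade preserves these values.

~~~
# Sketch only: retention and node placement set through the configmap
# rather than by editing statefulset.apps/prometheus-k8s directly.
# Assumption: field names follow the release-3.11 PrometheusK8sConfig guide.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      retention: 7d
      nodeSelector:
        # Label taken from this comment; schedules Prometheus on the dedicated node.
        infrarole: prometheus
~~~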