Cause: the Cluster Monitoring Operator playbook resets the CMO ConfigMap every time it's executed.
Consequence: manual changes to the ConfigMap that enable etcd monitoring are lost.
Fix: etcd monitoring can be configured with Ansible.
Result: etcd monitoring is persisted when the CMO playbook is executed again.
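A minimal sketch of how the Ansible-driven configuration could look in the inventory, assuming a variable named along the lines of openshift_cluster_monitoring_operator_etcd_enabled (the exact variable name is an assumption; check the openshift-ansible inventory documentation for the release that ships the fix):

    [OSEv3:vars]
    # Existing openshift-ansible variable: install the Cluster Monitoring Operator.
    openshift_cluster_monitoring_operator_install=true
    # Hypothetical variable name: enable etcd monitoring through the playbook,
    # so the setting survives re-runs of the CMO playbook instead of being reset.
    openshift_cluster_monitoring_operator_etcd_enabled=true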
Description Chandra Sekar
2019-04-25 10:46:29 UTC
Description of problem:
The etcd monitoring configuration is completely reset when performing a minor upgrade from v3.11.88 to v3.11.98.
We followed the guide[1] to set up etcd monitoring, since it is not enabled by default when the OpenShift Monitoring Stack is installed. After etcd monitoring was set up successfully, upgrading the cluster to a minor version wiped out the entire etcd monitoring setup and reverted the cluster to the default OpenShift Monitoring Stack; as a result, all etcd targets show as down. Minor upgrades shouldn't do this, as etcd is a major component that requires continuous monitoring.
How reproducible: Always
Steps to Reproduce:
1. Set up the OpenShift Monitoring Stack on OpenShift v3.11.
2. Set up etcd monitoring as described in the guide[1].
3. Upgrade the cluster to the next minor version (e.g., with the upgrade playbook shown below): the OpenShift Monitoring Stack reverts to its original state and the etcd configuration goes missing.
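A sketch of the upgrade step that triggers the reset, assuming the standard 3.11 RPM install layout (the inventory path is illustrative):

    # Minor in-place upgrade of a 3.11 cluster; this re-runs the Cluster
    # Monitoring Operator playbook, which resets the cluster-monitoring-config
    # ConfigMap and with it the manual etcd monitoring configuration.
    ansible-playbook -i /path/to/inventory \
      /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_11/upgrade.yml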
Actual results:
The whole etcd monitoring setup disappears and the cluster reverts to the default OpenShift Monitoring Stack after a minor cluster update; as a result, all etcd targets show as down.
Expected results:
Minor upgrades shouldn't do this, as etcd is a major component that requires continuous monitoring. Minor cluster upgrades should persist the configuration when moving to the next version, unless major breaking changes are involved.
Additional info:
[1]- https://docs.openshift.com/container-platform/3.11/install_config/prometheus_cluster_monitoring.html#configuring-etcd-monitoring
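For reference, the setup from the guide amounts to adding an etcd section to the cluster-monitoring-config ConfigMap in the openshift-monitoring namespace, roughly as below; it is this manual change that the playbook resets (selector labels follow the 3.11 documentation and may differ per environment):

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: cluster-monitoring-config
      namespace: openshift-monitoring
    data:
      config.yaml: |
        etcd:
          targets:
            selector:
              openshift.io/component: etcd
              openshift.io/control-plane: "true"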
Comment 1 Frederic Branczyk
2019-04-25 11:58:19 UTC
Yes, I can see how this happens; this is indeed a bug. As a workaround for now, you can reapply the configuration without issue and you should get back into the expected state. Of course that's not how it should be, but it is a way for the customer to move forward in the immediate situation until we fix this. This needs a fix in the OpenShift Ansible playbooks.
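A sketch of the workaround, assuming the manual configuration was kept in a local file (the file name is illustrative):

    # Re-apply the manually maintained monitoring configuration that the
    # playbook run reset (cluster-monitoring-config.yaml is a local copy
    # of the ConfigMap containing the etcd section).
    oc -n openshift-monitoring apply -f cluster-monitoring-config.yaml

    # Verify that the etcd section is back in place.
    oc -n openshift-monitoring get configmap cluster-monitoring-config -o yaml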
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2020:0793