Description of problem (please be as detailed as possible and provide log snippets):

During verification of Bug 1974441 after an upgrade from OCS 4.7 to 4.8, I checked the MGR endpoint info in the CephCluster resource and observed that the monitoring spec was empty and had been reset. Although the monitoring spec was empty, I didn't notice any obvious warning/error in the ocs-dashboards UI. And as I noted in Bug 1974441#c15, the monitoring endpoint was also being updated on any MGR failover in the external RHCS. However, per Bug 1974441#c16, it is not expected behaviour for the spec to be empty and getting reset, hence this needs further investigation to identify why it's happening.

After upgrade from 4.7 to 4.8:

$ oc get cephcluster -o yaml
...
  status: {}
  logCollector: {}
  mgr: {}
  mon:
    count: 0
  monitoring: {}
  network:
    ipFamily: IPv4
  security:
    kms: {}
  storage: {}
...

Monitoring spec in a fresh OCS 4.7 deployment looks like this:

...
  mgr: {}
  mon: {}
  monitoring:
    enabled: true
    externalMgrEndpoints:
    - ip: 10.1.8.51
    externalMgrPrometheusPort: 9283
    rulesNamespace: openshift-storage
...

Monitoring spec in a fresh OCS 4.8 deployment looks like this:

...
  mgr: {}
  mon:
    count: 0
  monitoring:
    enabled: true
    externalMgrEndpoints:
    - ip: 10.1.8.51
    - ip: 10.1.8.106
    - ip: 10.1.8.29
    externalMgrPrometheusPort: 9283
    rulesNamespace: openshift-storage
...

Version of all relevant components (if applicable):
OCS: 4.8.0-452.ci

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
No

Is there any workaround available to the best of your knowledge?
-

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Can this issue be reproduced?
Yes, 2/2

Can this issue reproduce from the UI?
Yes

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
Although this was observed during upgrade, I was able to reproduce it by restarting the ocs-operator pod and then checking the monitoring section in the CephCluster resource.

Actual results:
The monitoring spec is empty.

Expected results:
The monitoring spec should not be empty.
RCA (TL;DR):
------------
The external cluster JSON is read by the OCS-Operator only when the checksum hash (a unique hex string) of the JSON content does NOT match the hash stored in storagecluster.spec.externalSecretHash.

When the OCS-Operator starts for the first time, storagecluster.spec.externalSecretHash is empty. Since the hash of the JSON content and the existing externalSecretHash won't match, the operator creates all external resources and [important] updates externalSecretHash in the StorageCluster spec. From the next reconcile onwards, both hashes (the JSON input's and the spec's) match, so there is no repeated creation of external resources.

*restart*
When the OCS-Operator restarts, the StorageCluster instance still carries the same externalSecretHash, which exactly matches the JSON input. Thus no new external resources are created (as the hashes match).

Why is the CephCluster's monitoring spec not being updated?
The monitoring details from the JSON input are passed to the CephCluster's monitoring spec through the StorageCluster's Reconciler object (not through any persistent resource like a ConfigMap or Secret). On an OCS-Operator restart, the Reconciler object is fresh and does not hold the monitoring values. And, as noted under *restart* above, the hashes match, so we never enter the external-resource creation path that would repopulate them.

Workaround:
-----------
In progress

FIX:
----
We need to fix the way monitoring details are passed to the CephCluster spec. Instead of passing them through the internal Reconciler object, we should populate the monitoring details through a ConfigMap and allow the CephCluster to access it.
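The hash gate described above can be sketched as follows. This is a minimal illustration with hypothetical names (`externalSecretChanged`), not the operator's actual code; the point is only that once the stored hash matches the secret's content, the creation path that would repopulate the in-memory monitoring details is skipped.

```go
package main

import (
	"crypto/sha512"
	"encoding/hex"
	"fmt"
)

// externalSecretChanged mimics, in simplified form, the reconcile gate:
// external resources are (re)created only when the checksum of the external
// cluster JSON differs from storagecluster.spec.externalSecretHash.
func externalSecretChanged(jsonData []byte, storedHash string) (changed bool, newHash string) {
	sum := sha512.Sum512(jsonData)
	h := hex.EncodeToString(sum[:])
	return h != storedHash, h
}

func main() {
	secretJSON := []byte(`[{"name":"monitoring-endpoint","data":{"MonitoringEndpoint":"10.1.8.51"}}]`)

	// First reconcile: spec.externalSecretHash is empty, so resources are
	// created and the new hash is persisted in the StorageCluster spec.
	changed, hash := externalSecretChanged(secretJSON, "")
	fmt.Println("first reconcile, create resources:", changed)

	// Operator restart: the StorageCluster still carries the same hash, so the
	// creation path is skipped -- but the fresh Reconciler object holds no
	// monitoring values, leaving CephCluster's monitoring spec empty.
	changed, _ = externalSecretChanged(secretJSON, hash)
	fmt.Println("after restart, create resources:", changed)
}
```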
A WIP PR is pushed: https://github.com/openshift/ocs-operator/pull/1284 Will remove the WIP tag once we test the changes in Sidhant's external cluster.
Summary: If the ocs-operator restarts, the monitoring spec becomes empty, but we are still able to update the monitoring endpoint, and we don't see any issues with the UI monitoring pages. Anmol, Sidhant, and Arun had a look at the latest external cluster and didn't see any impact on the UI functioning.

We need to document that the customer should update the secret after upgrading from 4.7 to 4.8 so that the details of all endpoints are updated in the secret. This is to avoid any future problems arising from the difference in the number of endpoints between fresh and upgraded clusters. Sidhant will raise a doc bug for this.

Details: https://chat.google.com/room/AAAAREGEba8/QWUHDIkEvW8
Updated the doc_text with some additional info. Please check.
Changed the logic slightly: instead of adding an extra resource (a ConfigMap), we now read the monitoring details directly from the external cluster secret. PR: https://github.com/openshift/ocs-operator/pull/1285
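The revised approach can be sketched roughly as below (hypothetical helper name and simplified secret shape; see the PR for the real change). Because the endpoints are re-derived from the persisted secret on every reconcile, rather than cached on the in-memory Reconciler, an operator restart loses nothing.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// ExternalResource is a simplified stand-in for one entry in the external
// cluster secret's JSON payload (illustrative shape, not the exact schema).
type ExternalResource struct {
	Name string            `json:"name"`
	Data map[string]string `json:"data"`
}

// monitoringIPs extracts the MGR monitoring endpoint IPs from the secret
// payload. Reading the persisted secret (instead of in-memory state) yields
// the same result before and after an operator restart.
func monitoringIPs(secretJSON []byte) ([]string, error) {
	var resources []ExternalResource
	if err := json.Unmarshal(secretJSON, &resources); err != nil {
		return nil, err
	}
	for _, r := range resources {
		if r.Name == "monitoring-endpoint" {
			// A comma-separated list covers the multi-endpoint 4.8 format.
			return strings.Split(r.Data["MonitoringEndpoint"], ","), nil
		}
	}
	return nil, fmt.Errorf("monitoring-endpoint entry not found")
}

func main() {
	secret := []byte(`[{"name":"monitoring-endpoint",
		"data":{"MonitoringEndpoint":"10.1.8.51,10.1.8.106,10.1.8.29"}}]`)
	ips, err := monitoringIPs(secret)
	if err != nil {
		panic(err)
	}
	// Each reconcile re-derives the endpoints from the secret before
	// populating CephCluster's monitoring spec.
	fmt.Println(ips)
}
```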
Please test with the latest build.
Arun, please add the doc text
Updated the docs, please check.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Data Foundation 4.9.0 enhancement, security, and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:5086