Bug 1984735

Summary: [External Mode] Monitoring spec is getting reset in CephCluster resource
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Sidhant Agrawal <sagrawal>
Component: ocs-operator
Assignee: arun kumar mohan <amohan>
Status: CLOSED ERRATA
QA Contact: Sidhant Agrawal <sagrawal>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.8
CC: amohan, ebenahar, kbg, madam, muagarwa, nberry, ocs-bugs, odf-bz-bot, rperiyas, sostapov, uchapaga
Target Milestone: ---
Keywords: AutomationBackLog
Target Release: ODF 4.9.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
.Monitoring spec is getting reset in CephCluster resource in external mode
Previously, when OpenShift Container Storage was upgraded, the monitoring endpoints would get reset in the external CephCluster's monitoring spec. This was not expected behavior and was caused by the way the monitoring endpoints were passed to the CephCluster. With this update, the way the endpoints are passed has changed: before the CephCluster is created, the endpoints are read directly from the JSON secret, `rook-ceph-external-cluster-details`, and the CephCluster spec is updated. As a result, the monitoring endpoint spec in the CephCluster is populated with the appropriate values even after an OpenShift Container Storage upgrade.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-12-13 17:44:54 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1966894, 2011326    

Description Sidhant Agrawal 2021-07-22 04:42:23 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

During verification of Bug 1974441 after an upgrade from OCS 4.7 to 4.8, I checked the MGR endpoint information in the CephCluster resource and observed that the monitoring spec was empty; it had been reset.

Although the monitoring spec was empty, I didn't notice any obvious warning/error in the OCS dashboards in the UI. And as I noted in Bug 1974441#c15, the monitoring endpoint was also being updated whenever there was an MGR failover in the external RHCS cluster.
However, as per Bug 1974441#c16, a spec that is empty and getting reset is not expected behaviour, so this needs further investigation to identify why it is happening.

> After upgrade from 4.7 to 4.8
$ oc get cephcluster -o yaml
...
        status: {}
    logCollector: {}
    mgr: {}
    mon:
      count: 0
    monitoring: {}
    network:
      ipFamily: IPv4
    security:
      kms: {}
    storage: {}
...


The monitoring spec in a fresh OCS 4.7 deployment looks like this:
...
    mgr: {}
    mon: {}
    monitoring:
      enabled: true
      externalMgrEndpoints:
      - ip: 10.1.8.51
      externalMgrPrometheusPort: 9283
      rulesNamespace: openshift-storage

...


The monitoring spec in a fresh OCS 4.8 deployment looks like this:
...
    mgr: {}
    mon:
      count: 0
    monitoring:
      enabled: true
      externalMgrEndpoints:
      - ip: 10.1.8.51
      - ip: 10.1.8.106
      - ip: 10.1.8.29
      externalMgrPrometheusPort: 9283
      rulesNamespace: openshift-storage

...


Version of all relevant components (if applicable):
OCS: 4.8.0-452.ci

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
No

Is there any workaround available to the best of your knowledge?
-

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes, 2/2

Can this issue be reproduced from the UI?
yes

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
Although this was observed during an upgrade, I was able to reproduce it by restarting the ocs-operator pod and then checking the monitoring section in the CephCluster resource, as in the example below.
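
(The pod label selector below is an assumption; adjust it to match your deployment.)

$ oc delete pod -n openshift-storage -l name=ocs-operator
$ oc get cephcluster -n openshift-storage -o jsonpath='{.items[0].spec.monitoring}'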

Actual results:
monitoring spec is empty

Expected results:
monitoring spec should not be empty

Comment 9 arun kumar mohan 2021-07-27 17:52:32 UTC
RCA (TL;DR):
----------

The external cluster JSON is read by the OCS operator only when the checksum hash (a unique hex string) of the JSON content does NOT match the hash stored in storagecluster.spec.externalSecretHash.

When the OCS operator starts for the first time, the hash in storagecluster.spec.externalSecretHash is empty. Since the hash of the JSON content and the existing externalSecretHash won't match, the operator creates all external resources and [important] updates externalSecretHash in the storagecluster spec.

From the next reconcile onwards, the two hashes (the JSON input's and the spec's) match, so there is no repeated creation of external resources.

*restart*
When the OCS operator restarts, the storagecluster instance still has the same externalSecretHash in its spec, which matches the JSON input exactly. Thus no new external resources are created (as the hashes match).

Why is the CephCluster's monitoring spec not being updated?
The monitoring details are passed from the JSON input to the CephCluster's monitoring spec through the StorageCluster's Reconciler object (not through any persistent resource such as a ConfigMap or Secret).
On a restart of the OCS operator, the reconciler object is fresh and won't have the monitoring values. And as the *restart* note above shows, the hashes are the same, so we never reach the external resource creation path.
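
A minimal Go sketch of the gate described above (the type names and the exact hash function are illustrative stand-ins, not the actual ocs-operator code):

package reconcile

import (
	"crypto/sha256"
	"encoding/hex"
)

// Minimal stand-ins for the relevant StorageCluster fields; the real
// types live in the ocs-operator API package.
type StorageClusterSpec struct {
	ExternalSecretHash string
}

type StorageCluster struct {
	Spec StorageClusterSpec
}

// shouldCreateExternalResources mirrors the gating described above:
// external resources are (re)created only when the checksum of the
// external cluster JSON differs from the hash stored in the spec.
func shouldCreateExternalResources(sc *StorageCluster, secretJSON []byte) bool {
	sum := sha256.Sum256(secretJSON)
	hash := hex.EncodeToString(sum[:])
	if hash == sc.Spec.ExternalSecretHash {
		// Hashes match, so creation is skipped. After an operator
		// restart the reconciler object is fresh and its in-memory
		// monitoring endpoints are gone, yet this branch is still
		// taken, leaving the CephCluster monitoring spec empty.
		return false
	}
	// First run (or changed JSON): create resources and persist the hash.
	sc.Spec.ExternalSecretHash = hash
	return true
}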



Workaround:
-----------
In progress



FIX
-------
We need to fix the way the monitoring details are passed to the CephCluster spec. Instead of passing them through the internal Reconciler object, we should populate the monitoring details through a ConfigMap and let the CephCluster access them from there.

Comment 10 arun kumar mohan 2021-07-27 20:52:04 UTC
A WIP PR is pushed: https://github.com/openshift/ocs-operator/pull/1284
Will remove the WIP tag once we test the changes on Sidhant's external cluster.

Comment 11 Mudit Agarwal 2021-07-28 11:47:17 UTC
Summary:
If the ocs-operator restarts, the monitoring spec becomes empty, but we are still able to update the monitoring endpoint, and we don't see any issues with the UI monitoring pages.
Anmol, Sidhant, and Arun had a look at the latest external cluster and didn't see any impact on the UI functioning.

We need to document that the customer should update the secret after upgrading from 4.7 to 4.8 so that the details of all endpoints are updated in the secret.
This is to avoid any future problems caused by the difference in the number of endpoints between fresh and upgraded clusters.
Sidhant will raise a doc bug for this.


Details: https://chat.google.com/room/AAAAREGEba8/QWUHDIkEvW8

Comment 14 arun kumar mohan 2021-07-28 14:49:28 UTC
Updated the doc_text with some additional info. Please check.

Comment 15 arun kumar mohan 2021-07-29 14:54:13 UTC
Changed the logic a bit: instead of adding an extra resource (a ConfigMap), the monitoring details are now read directly from the external cluster secret. A sketch of this approach follows the PR link below.
PR: https://github.com/openshift/ocs-operator/pull/1285
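
A minimal sketch of this approach, assuming the secret payload is a JSON array of name/kind/data entries and that the monitoring entry is named "monitoring-endpoint" with "MonitoringEndpoint" (comma-separated MGR IPs) and "MonitoringPort" keys; these names are assumptions based on the external cluster details format, not taken from the PR:

package reconcile

import (
	"encoding/json"
	"strconv"
	"strings"
)

// ExternalResource matches the assumed shape of each entry in the
// rook-ceph-external-cluster-details JSON payload.
type ExternalResource struct {
	Name string            `json:"name"`
	Kind string            `json:"kind"`
	Data map[string]string `json:"data"`
}

// MonitoringSpec is a simplified stand-in for the CephCluster
// monitoring spec shown in the YAML snippets above.
type MonitoringSpec struct {
	Enabled                   bool
	ExternalMgrEndpoints      []string
	ExternalMgrPrometheusPort int
}

// monitoringSpecFromSecret extracts the monitoring endpoints directly
// from the secret's JSON, so the CephCluster spec can be filled in
// before creation, independent of any in-memory reconciler state.
func monitoringSpecFromSecret(secretJSON []byte) (*MonitoringSpec, error) {
	var resources []ExternalResource
	if err := json.Unmarshal(secretJSON, &resources); err != nil {
		return nil, err
	}
	for _, r := range resources {
		if r.Name != "monitoring-endpoint" { // assumed entry name
			continue
		}
		spec := &MonitoringSpec{Enabled: true}
		for _, ip := range strings.Split(r.Data["MonitoringEndpoint"], ",") {
			if ip = strings.TrimSpace(ip); ip != "" {
				spec.ExternalMgrEndpoints = append(spec.ExternalMgrEndpoints, ip)
			}
		}
		if port, err := strconv.Atoi(r.Data["MonitoringPort"]); err == nil {
			spec.ExternalMgrPrometheusPort = port
		}
		return spec, nil
	}
	return nil, nil // no monitoring entry in the secret
}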

Comment 20 Mudit Agarwal 2021-09-08 06:29:38 UTC
Please test with the latest build.

Comment 26 Mudit Agarwal 2021-12-07 10:17:54 UTC
Arun, please add the doc text

Comment 27 arun kumar mohan 2021-12-07 14:12:13 UTC
Updated the docs, please check.

Comment 30 errata-xmlrpc 2021-12-13 17:44:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Data Foundation 4.9.0 enhancement, security, and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:5086