Bug 1984735 - [External Mode] Monitoring spec is getting reset in CephCluster resource
Summary: [External Mode] Monitoring spec is getting reset in CephCluster resource
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.9.0
Assignee: arun kumar mohan
QA Contact: Sidhant Agrawal
URL:
Whiteboard:
Depends On:
Blocks: 1966894 2011326
 
Reported: 2021-07-22 04:42 UTC by Sidhant Agrawal
Modified: 2023-08-09 17:00 UTC
CC List: 11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
.Monitoring spec is getting reset in CephCluster resource in external mode
Previously, when OpenShift Container Storage was upgraded, the monitoring endpoints would get reset in the external CephCluster's monitoring spec. This was not the expected behavior and was due to the way the monitoring endpoints were passed to the CephCluster. With this update, the way the endpoints are passed is changed: before the CephCluster is created, the endpoints are read directly from the JSON secret, `rook-ceph-external-cluster-details`, and the CephCluster spec is updated. As a result, the monitoring endpoint spec in the CephCluster is populated with the appropriate values even after an OpenShift Container Storage upgrade.
Clone Of:
Environment:
Last Closed: 2021-12-13 17:44:54 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift ocs-operator pull 1285 0 None open Changing the way external endpoint IP/Port being propagated 2021-07-29 14:54:13 UTC
Github red-hat-storage ocs-operator pull 1286 0 None None None 2021-09-06 08:35:13 UTC
Red Hat Product Errata RHSA-2021:5086 0 None None None 2021-12-13 17:45:16 UTC

Description Sidhant Agrawal 2021-07-22 04:42:23 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

During verification of Bug 1974441 after upgrade from OCS 4.7 to 4.8, I tried to check the information about the MGR endpoint in the CephCluster resource, but observed that the monitoring spec was empty and had been reset.

Although the monitoring spec was empty, I didn't notice any obvious warning/error in the UI ocs-dashboards. As I noted in Bug 1974441#c15, the monitoring endpoint was also being updated whenever there was an MGR failover in the external RHCS cluster.
However, as per Bug 1974441#c16, it is not expected behaviour for the spec to be empty and get reset. Hence this needs further investigation to identify why it is happening.

> After upgrade from 4.7 to 4.8
$ oc get cephcluster -o yaml
...
        status: {}
    logCollector: {}
    mgr: {}
    mon:
      count: 0
    monitoring: {}
    network:
      ipFamily: IPv4
    security:
      kms: {}
    storage: {}
...


Monitoring spec in a fresh OCS 4.7 deployment looks like this:
...
    mgr: {}
    mon: {}
    monitoring:
      enabled: true
      externalMgrEndpoints:
      - ip: 10.1.8.51
      externalMgrPrometheusPort: 9283
      rulesNamespace: openshift-storage

...


Monitoring spec in a fresh OCS 4.8 deployment looks like this:
...
    mgr: {}
    mon:
      count: 0
    monitoring:
      enabled: true
      externalMgrEndpoints:
      - ip: 10.1.8.51
      - ip: 10.1.8.106
      - ip: 10.1.8.29
      externalMgrPrometheusPort: 9283
      rulesNamespace: openshift-storage

...


Version of all relevant components (if applicable):
OCS: 4.8.0-452.ci

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
No

Is there any workaround available to the best of your knowledge?
-

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes (2/2)

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
Although this was observed during upgrade, I was able to reproduce it by restarting the ocs-operator pod and then checking the monitoring section in the CephCluster resource.

Actual results:
monitoring spec is empty

Expected results:
monitoring spec should not be empty

Comment 9 arun kumar mohan 2021-07-27 17:52:32 UTC
RCA (TL;DR):
----------

The external cluster JSON is read by the OCS-Operator only when the checksum hash (a unique hex string) of the JSON content does NOT match the hash stored in storagecluster.spec.externalSecretHash.

When the OCS-Operator starts for the first time, storagecluster.spec.externalSecretHash is empty. Since the hash of the JSON content and the existing externalSecretHash won't match, the operator creates all external resources and [important] updates externalSecretHash in the StorageCluster spec.

From the next reconcile onwards, both hashes (the JSON input's and the spec's) match, so there is no repeated creation of external resources.
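
A minimal sketch of this checksum gate (not the actual ocs-operator code; the hashing algorithm, createExternalResources and the sample JSON content are illustrative assumptions):

package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// storageClusterSpec stands in for the relevant part of the StorageCluster spec.
type storageClusterSpec struct {
	ExternalSecretHash string
}

// reconcileExternal re-creates external resources only when the hash of the
// external cluster JSON differs from the hash persisted in the spec.
func reconcileExternal(spec *storageClusterSpec, externalJSON []byte) {
	sum := sha256.Sum256(externalJSON)
	hash := hex.EncodeToString(sum[:])

	if hash == spec.ExternalSecretHash {
		// Hashes match: skip external resource creation entirely. This is the
		// branch taken right after an operator restart, because the persisted
		// hash still matches the unchanged JSON input.
		return
	}

	// Hashes differ (first reconcile, or the secret was updated): create the
	// external resources and persist the new hash in the spec.
	createExternalResources(externalJSON)
	spec.ExternalSecretHash = hash
}

func createExternalResources(externalJSON []byte) {
	fmt.Printf("creating external resources from %d bytes of JSON\n", len(externalJSON))
}

func main() {
	spec := &storageClusterSpec{}
	input := []byte(`[{"name":"monitoring-endpoint","kind":"CephCluster"}]`)

	reconcileExternal(spec, input) // first reconcile: creates resources, stores hash
	reconcileExternal(spec, input) // later reconciles (and restarts): hashes match, skipped
}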

*restart*
When the OCS-Operator restarts, the StorageCluster instance still has the same externalSecretHash in its spec, which matches the JSON input exactly. Thus new external resources won't be created (as the hashes match).

Why is CephCluster's monitoring spec not being updated?
The monitoring details from the JSON input are passed to the CephCluster's monitoring spec through the StorageCluster's Reconciler object (not through any persistent resource like a ConfigMap or Secret).
On a restart of the OCS-Operator, the reconciler object is fresh and does not have the monitoring values. As described under *restart* above, the hashes are the same, so we never get into the external resource creation path that would repopulate them.
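
To make the effect concrete, here is a small sketch (again not the actual implementation; the reconciler and monitoringIPs names are made up) of why the CephCluster monitoring spec ends up empty after a restart: the endpoint values live only in the in-memory reconciler, so a fresh process builds an empty spec.

package main

import "fmt"

type monitoringSpec struct {
	Enabled              bool
	ExternalMgrEndpoints []string
}

// reconciler stands in for the StorageCluster reconciler object; the monitoring
// endpoints are cached only here, not in any persistent resource.
type reconciler struct {
	monitoringIPs []string
}

func (r *reconciler) buildCephClusterMonitoring() monitoringSpec {
	return monitoringSpec{
		Enabled:              len(r.monitoringIPs) > 0,
		ExternalMgrEndpoints: r.monitoringIPs,
	}
}

func main() {
	// First run: the external JSON was parsed, so the in-memory cache is populated.
	r := &reconciler{monitoringIPs: []string{"10.1.8.51"}}
	fmt.Printf("before restart: %+v\n", r.buildCephClusterMonitoring())

	// After a restart the reconciler is recreated empty, and because the hash
	// gate skips re-reading the JSON, the cache is never repopulated.
	r = &reconciler{}
	fmt.Printf("after restart:  %+v\n", r.buildCephClusterMonitoring())
}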



Workaround:
-----------
In progress



FIX
-------
Need to fix the way monitoring details are passed to the CephCluster spec. Instead of passing them through the internal Reconciler object, we should populate the monitoring details through a ConfigMap and allow the CephCluster to access them.

Comment 10 arun kumar mohan 2021-07-27 20:52:04 UTC
A WIP PR has been pushed: https://github.com/openshift/ocs-operator/pull/1284
Will remove the WIP tag once we test the changes on Sidhant's external cluster.

Comment 11 Mudit Agarwal 2021-07-28 11:47:17 UTC
Summary:
If the ocs-operator restarts, the monitoring spec becomes empty, but we are still able to update the monitoring endpoint and we don't see any issues with the UI monitoring pages.
Anmol, Sidhant and Arun had a look at the latest external cluster and didn't see any impact on the UI functioning.

We need to document that the customer should update the secret after upgrading from 4.7 to 4.8 so that the details of all endpoints are updated in the secret.
This is to avoid any problems in the future arising from the difference in the number of endpoints between fresh and upgraded clusters.
Sidhant will raise a doc bug for this.


Details: https://chat.google.com/room/AAAAREGEba8/QWUHDIkEvW8

Comment 14 arun kumar mohan 2021-07-28 14:49:28 UTC
Updated the doc_text with some additional info. Please check.

Comment 15 arun kumar mohan 2021-07-29 14:54:13 UTC
Changed the logic a bit: instead of adding an extra resource (a ConfigMap), we now read the monitoring details directly from the external cluster secret.
PR: https://github.com/openshift/ocs-operator/pull/1285
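
Roughly, the fixed flow looks like the sketch below (an approximation under assumptions, not the exact PR code): before the CephCluster spec is built, the monitoring endpoint details are parsed straight out of the JSON stored in the rook-ceph-external-cluster-details secret instead of being taken from the in-memory reconciler, so they survive operator restarts. The JSON layout, resource name and field names here are illustrative.

package main

import (
	"encoding/json"
	"fmt"
)

// externalResource mirrors one entry of the external cluster JSON stored in the secret.
type externalResource struct {
	Name string            `json:"name"`
	Kind string            `json:"kind"`
	Data map[string]string `json:"data"`
}

type monitoringSpec struct {
	Enabled                   bool
	ExternalMgrEndpoints      []string
	ExternalMgrPrometheusPort int
}

// monitoringFromSecret extracts the MGR endpoint details from the raw JSON so the
// CephCluster spec can be populated on every reconcile, not just the first one.
func monitoringFromSecret(raw []byte) (monitoringSpec, error) {
	var resources []externalResource
	if err := json.Unmarshal(raw, &resources); err != nil {
		return monitoringSpec{}, err
	}
	spec := monitoringSpec{ExternalMgrPrometheusPort: 9283}
	for _, res := range resources {
		if res.Name == "monitoring-endpoint" {
			spec.Enabled = true
			spec.ExternalMgrEndpoints = append(spec.ExternalMgrEndpoints, res.Data["MonitoringEndpoint"])
		}
	}
	return spec, nil
}

func main() {
	// Raw JSON as it might appear in the secret (content is illustrative).
	raw := []byte(`[{"name":"monitoring-endpoint","kind":"CephCluster","data":{"MonitoringEndpoint":"10.1.8.51"}}]`)
	spec, err := monitoringFromSecret(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", spec)
}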

Comment 20 Mudit Agarwal 2021-09-08 06:29:38 UTC
Please test with the latest build.

Comment 26 Mudit Agarwal 2021-12-07 10:17:54 UTC
Arun, please add the doc text

Comment 27 arun kumar mohan 2021-12-07 14:12:13 UTC
Updated the docs, please check.

Comment 30 errata-xmlrpc 2021-12-13 17:44:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Data Foundation 4.9.0 enhancement, security, and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:5086

