Bug 2069795

Summary: cephcluster on consumer stays in Connecting state after MGR endpoint changes (goroutine not triggered)
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Neha Berry <nberry>
Component: rook
Assignee: Sébastien Han <shan>
Status: CLOSED CURRENTRELEASE
QA Contact: Neha Berry <nberry>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.10
CC: madam, muagarwa, ocs-bugs, odf-bz-bot, oviner, rperiyas, shan
Target Milestone: ---
Keywords: AutomationBackLog
Target Release: ODF 4.10.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: 4.10.0-217
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-04-21 09:12:54 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Neha Berry 2022-03-29 17:40:05 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
========================================================================
To verify Bug 2060273 with the MGR pod respin scenario, the MGR pod on the provider was restarted so that it moved to a new node.

As expected with the recent fixes, the MGR endpoint and the cephcluster monitoring endpoint on the consumer changed to the new node IP.

PVC creation also works. However, the cephcluster stays in the Connecting state, with no related updates seen in the rook-ceph-operator pod.

Had a live session with engineering and am raising this bug based on the discussion there.


Provider side
==================
date --utc; oc delete pod rook-ceph-mgr-a-6ddbf6bb5-kdx25; oc get pods -o wide|grep mgr
Tue Mar 29 05:15:36 PM UTC 2022
pod "rook-ceph-mgr-a-6ddbf6bb5-kdx25" deleted
rook-ceph-mgr-a-6ddbf6bb5-6g7kl                                   1/2     Running     0             3s      10.0.180.112   ip-10-0-180-112.us-east-2.compute.internal   <none>           <none>
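For reference, one way to make sure the respun MGR pod lands on a different node is to cordon its current node before deleting the pod. This is only a sketch: the node placeholder is not from the transcript above, and it assumes the current project is the ODF/Rook namespace.

oc adm cordon <node-currently-hosting-mgr-a>      # prevent rescheduling onto the same node
oc delete pod rook-ceph-mgr-a-6ddbf6bb5-kdx25     # the mgr deployment recreates the pod elsewhere
oc get pods -o wide | grep mgr                    # confirm the new pod, its node and IP
oc adm uncordon <node-currently-hosting-mgr-a>    # re-enable scheduling once the new pod is up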


Consumer side
==================
+++++++++++++++++++++++
cephcluster MGR
    monitoring:
      enabled: true
      externalMgrEndpoints:
      - ip: 10.0.180.112
      externalMgrPrometheusPort: 9283
+++++++++++
endpoint
rook-ceph-mgr-external                            10.0.180.112:9283                                                  75m
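The spec snippet and the endpoint above can be read on the consumer with something like the following (assuming the default openshift-storage namespace):

oc -n openshift-storage get cephcluster ocs-storagecluster-cephcluster -o yaml | grep -A4 'monitoring:'
oc -n openshift-storage get endpoints rook-ceph-mgr-external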


======= storagecluster ==========
NAME                 AGE   PHASE        EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   85m   Connecting   true       2022-03-29T16:10:08Z   
======= storagesystem ==========
NAME                               STORAGE-SYSTEM-KIND                  STORAGE-SYSTEM-NAME
ocs-storagecluster-storagesystem   storagecluster.ocs.openshift.io/v1   ocs-storagecluster
--------------
======= cephcluster ==========
NAME                             DATADIRHOSTPATH   MONCOUNT   AGE   PHASE        MESSAGE                                             HEALTH      EXTERNAL
ocs-storagecluster-cephcluster                                85m   Connecting   Attempting to connect to an external Ceph cluster   HEALTH_OK   true
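The status summaries above correspond to the usual CR listings, e.g. (again assuming the openshift-storage namespace):

oc -n openshift-storage get storagecluster
oc -n openshift-storage get storagesystem
oc -n openshift-storage get cephcluster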




Version of all relevant components (if applicable):
====================================================
OCP: 4.9.25
ODF: 4.10.0-206
Deployer: 2.0.0-5

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
===============================================================
No; only the status is not updated, and the cluster is still able to talk to the provider.

Is there any workaround available to the best of your knowledge?
==================================================================
Restart the rook-ceph-operator pod; the cephcluster status then changes to Connected since the goroutine is triggered.
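A minimal sketch of the workaround, assuming the default openshift-storage namespace and the standard app=rook-ceph-operator pod label:

oc -n openshift-storage delete pod -l app=rook-ceph-operator   # operator restarts and the connection goroutine runs again
oc -n openshift-storage get cephcluster -w                     # watch the phase move from Connecting to Connected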

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
===========================================================================

3

Is this issue reproducible?
==============================
Yes

Can this issue be reproduced from the UI?
======================================
NA

If this is a regression, please provide more details to justify this:
=======================================================================
Not sure

Steps to Reproduce:
======================
1. Create an add-on-based provider and consumer setup.
2. On the provider, respin the MGR pod and make sure it moves to another node (e.g. using the cordon/uncordon sketch above).
3. On the consumer, check the monitoring endpoint and the status of the cephcluster CR.


Actual results:
=================
cephcluster CR stays in Connecting state

However, the monitoring endpoint itself is now updated successfully.

Expected results:
===================
cephcluster status should be Connected and the storagecluster should be in the Ready phase.

Additional info:

Comment 6 Travis Nielsen 2022-04-04 15:19:40 UTC
*** Bug 2062853 has been marked as a duplicate of this bug. ***