Description of problem (please be as detailed as possible and provide log snippets):

When mirroring is disabled in the storagecluster, the mirroring peer is not removed in Ceph. This is not a problem until mirroring is enabled for a new peer. Then there are 2 peers and the rbd-mirror daemon is in an ERROR state.

The example below shows 2 peer sites. The first is for a new storagecluster on a new OCP cluster, and the second is for the prior, deleted storagecluster on the prior OCP cluster.

$ rbd -p ocs-storagecluster-cephblockpool mirror pool info --all
Mode: image
Site Name: 37d2b392-2763-4650-b5c4-fc53c6b90759

Peer Sites:

UUID: 5455bf37-6440-4227-81c5-d09cb2114b08
Name: 2e4f73f7-b57a-45ba-b18b-8fba26ca2746
Mirror UUID: f30f515b-5f83-4dd9-b87d-cf353a84f555
Direction: tx-only
Mon Host: 10.18.113.246:3300,10.18.246.159:3300,10.18.72.46:3300
Key: AQDj32dl+N5QNhAAdhnhpCVYC0vtWjaErymQfQ==

UUID: cd63acbb-77b9-42cf-bbdd-1f07e8259ec2
Name: a4e71591-8745-440f-b240-0058944f4c52
Mirror UUID: 0c7e37c4-66f6-4aa0-b2e1-4c65621cc13a
Direction: rx-tx
Client: client.rbd-mirror-peer
Mon Host: v2:10.18.46.194:3300/0,v2:10.18.198.162:3300/0,v2:10.18.96.38:3300/0
Key: AQBI7mRlBNuGKxAA5BxbZ6/EX3FIqHw5cVk2+w==

$ rbd -p ocs-storagecluster-cephblockpool mirror pool status
health: ERROR
daemon health: ERROR
image health: OK
images: 0 total

Version of all relevant components (if applicable):
OCP: Server Version: 4.14.1
ODF: 4.14.0
ACM: 2.8.3

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes, replication will not work for rbd images.

Is there any workaround available to the best of your knowledge?
Maybe. Try removing the old mirroring peer and restarting the rbd-mirror pods on both clusters.
$ rbd mirror pool peer remove {pool-name} {old-peer-uuid}

Example:
$ rbd mirror pool peer remove ocs-storagecluster-cephblockpool cd63acbb-77b9-42cf-bbdd-1f07e8259ec2

This worked with ODF 4.13 but does not work with ODF 4.14, where daemon health went from ERROR -> WARNING.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
5

Is this issue reproducible?
Yes

Steps to Reproduce:
1. Create RDR test environment.
2. Remove DR using cleanup-hub.sh
3. Remove Submariner
4. Detach one managedcluster using ACM console
5. Create new OCP cluster and import into ACM
6. Install ODF for RDR on new managedcluster
7. Install MCO operator and create new DRPolicy
8. Run "rbd -p ocs-storagecluster-cephblockpool mirror pool info --all" after logging into ceph on both managed clusters.

Actual results:
The cluster that was not recreated has 2 Peer Sites and the rbd-mirror daemon status is ERROR.

Expected results:
The cluster that was not recreated has 1 Peer Site and the rbd-mirror daemon status is OK.
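The check in step 8 could be scripted roughly as follows. This is only a sketch: it assumes the plain-text layout of `rbd mirror pool info --all` shown in this report, and the function names (`peer_uuids`, `peer_count`, `single_peer`) are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helpers; they parse the text output of
# `rbd mirror pool info --all` supplied on stdin.

# Print one peer UUID per line. "Mirror UUID:" lines are skipped
# because their first whitespace-separated field is "Mirror".
peer_uuids() {
    awk '$1 == "UUID:" { print $2 }'
}

# Print the number of peers listed in the output.
peer_count() {
    awk '$1 == "UUID:" { n++ } END { print n + 0 }'
}

# Succeed when at most one peer is listed; fail otherwise,
# which would point at a stale peer like the one described here.
single_peer() {
    [ "$(peer_count)" -le 1 ]
}
```

For example, `rbd -p ocs-storagecluster-cephblockpool mirror pool info --all | single_peer || echo "more than one peer site"` would flag the cluster that was not recreated.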
Additional info:

$ cat cleanup-hub.sh
#!/bin/bash

secrets=$(oc get secrets -n openshift-operators | grep Opaque | cut -d" " -f1)
echo $secrets
for secret in $secrets
do
    oc patch -n openshift-operators secret/$secret -p '{"metadata":{"finalizers":null}}' --type=merge
done

mirrorpeers=$(oc get mirrorpeer -o name)
echo $mirrorpeers
for mp in $mirrorpeers
do
    oc patch $mp -p '{"metadata":{"finalizers":null}}' --type=merge
    oc delete $mp
done

drpolicies=$(oc get drpolicy -o name)
echo $drpolicies
for drp in $drpolicies
do
    oc patch $drp -p '{"metadata":{"finalizers":null}}' --type=merge
    oc delete $drp
done

drclusters=$(oc get drcluster -o name)
echo $drclusters
for drp in $drclusters
do
    oc patch $drp -p '{"metadata":{"finalizers":null}}' --type=merge
    oc delete $drp
done

oc delete project openshift-operators

managedclusters=$(oc get managedclusters -o name | cut -d"/" -f2)
echo $managedclusters
for mc in $managedclusters
do
    secrets=$(oc get secrets -n $mc | grep multicluster.odf.openshift.io/secret-type | cut -d" " -f1)
    echo $secrets
    for secret in $secrets
    do
        set -x
        oc patch -n $mc secret/$secret -p '{"metadata":{"finalizers":null}}' --type=merge
        oc delete -n $mc secret/$secret
    done
done
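The script repeats the same patch-then-delete pair for each resource type. As a sketch only, that pair could be factored into one helper; `force_delete` is a hypothetical name and is not part of the script that was actually run.

```shell
#!/bin/bash
# Hypothetical helper: clear the finalizers on a resource, then delete it.
# $1 is a resource reference as printed by `oc get ... -o name`
# (or TYPE/NAME); any further arguments, such as -n NAMESPACE,
# are passed unchanged to both oc calls.
force_delete() {
    local resource=$1
    shift
    oc patch "$resource" "$@" -p '{"metadata":{"finalizers":null}}' --type=merge
    oc delete "$resource" "$@"
}
```

With this in place, the DRPolicy loop would reduce to `for drp in $(oc get drpolicy -o name); do force_delete "$drp"; done`, and the per-cluster secret loop to `force_delete "secret/$secret" -n "$mc"`.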
Also the peerSecretNames should be removed if mirroring is disabled.

mirroring:
  peerSecretNames:
  - 4446d6504fadb8362e6814327c7a1172b9735c0
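One way to check for leftover entries after disabling mirroring is to filter the CR's YAML. The sketch below assumes block-style YAML like the fragment above; `peer_secret_names` is a made-up helper name, and awk is used only to keep the example dependency-free (a real YAML parser would be more robust).

```shell
#!/bin/bash
# Hypothetical helper: print every entry listed under a
# "peerSecretNames:" key in block-style YAML read from stdin.
peer_secret_names() {
    awk '
        # Start collecting when the key is seen.
        /^[[:space:]]*peerSecretNames:/ { in_list = 1; next }
        # List items look like "- name"; print the name.
        in_list && /^[[:space:]]*-[[:space:]]/ { print $2; next }
        # Any other line ends the list.
        in_list { in_list = 0 }
    '
}
```

It could be fed with something like `oc get storagecluster ocs-storagecluster -n openshift-storage -o yaml | peer_secret_names`, where the resource name and namespace are assumed ODF defaults and are not taken from this report. Empty output would mean no peer secret names remain.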
Is this a regression?
Not a 4.15 blocker
So if I'm reading this correctly, the main issue looks resolved. But the following issues remain in the cephCluster CR.

1. PeerSecretName is still there after disabling mirroring.
- I think the entity that adds this peer secret name to the cephCluster CR should be responsible for removing it as well. Thoughts?

2. The CephCluster CR status is still showing mirroring info, even though mirroring is disabled and the peer site is removed.
- This I need to check again. But I believe we run a goroutine that keeps checking the mirroring health and adds it to the cephCluster status. Once mirroring is disabled, the goroutine is stopped and the status is no longer updated. It still shows the old mirroring info because nothing is checking the mirroring status anymore.

Annette, can you please check if there is any side effect of [2]? That is, if you try to enable mirroring again and add a new peer, does it work?
reading this correctly*
Is comment #7 a blocker for disabling mirroring and re-enabling it on another cluster? If not, then I think we should close this BZ (if the main issue is fixed) and open a new one for comment #7.

Also, from comment #7:
- Deleting the `peerSecretNames` from the spec should not be the responsibility of the OCS operator or Rook. It should be done by the entity that adds the `peerSecretNames` to the spec.
- The CephCluster CR status still showing the status of the old peer should be fixed in Rook. But I don't think it's a blocker, since it does not block re-enabling mirroring to another site.
No doc text is required for this.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:4591
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days