2252318 – Mirroring Peer is not removed in Ceph when mirroring is disabled in StorageCluster

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenShift Data Foundation
Classification:	Red Hat Storage
Component:	rook
Sub Component:
Version:	4.14
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	ODF 4.16.0
Assignee:	Santosh Pillai
QA Contact:	kmanohar
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2023-11-30 21:10 UTC by Annette Clewett
Modified:	2024-11-15 04:25 UTC
CC List:	11 users
Fixed In Version:	4.16.0-57
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2024-07-17 13:11:41 UTC
Embargoed:
Flags:	aclewett: needinfo-

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	rook rook pull 13905	0	None	open	[WIP]core: remove stale cluster peer	2024-03-14 15:27:51 UTC
Red Hat Product Errata	RHSA-2024:4591	0	None	None	None	2024-07-17 13:11:45 UTC

Description Annette Clewett 2023-11-30 21:10:52 UTC

Description of problem (please be detailed as possible and provide log
snippets):
When mirroring is disabled in the StorageCluster, the mirroring peer is not removed from Ceph. This causes no problem until mirroring is enabled for a new peer. At that point there are 2 peers and the rbd-mirror daemon is in an ERROR state.

The example below shows 2 Peer Sites. The first is for a new storagecluster on a new OCP cluster, and the second is for the prior, deleted storagecluster on the prior OCP cluster.

$ rbd -p ocs-storagecluster-cephblockpool mirror pool info --all
Mode: image
Site Name: 37d2b392-2763-4650-b5c4-fc53c6b90759

Peer Sites: 

UUID: 5455bf37-6440-4227-81c5-d09cb2114b08
Name: 2e4f73f7-b57a-45ba-b18b-8fba26ca2746
Mirror UUID: f30f515b-5f83-4dd9-b87d-cf353a84f555
Direction: tx-only
Mon Host: 10.18.113.246:3300,10.18.246.159:3300,10.18.72.46:3300
Key: AQDj32dl+N5QNhAAdhnhpCVYC0vtWjaErymQfQ==


UUID: cd63acbb-77b9-42cf-bbdd-1f07e8259ec2
Name: a4e71591-8745-440f-b240-0058944f4c52
Mirror UUID: 0c7e37c4-66f6-4aa0-b2e1-4c65621cc13a
Direction: rx-tx
Client: client.rbd-mirror-peer
Mon Host: v2:10.18.46.194:3300/0,v2:10.18.198.162:3300/0,v2:10.18.96.38:3300/0
Key: AQBI7mRlBNuGKxAA5BxbZ6/EX3FIqHw5cVk2+w==

$ rbd -p ocs-storagecluster-cephblockpool mirror pool status
health: ERROR
daemon health: ERROR
image health: OK
images: 0 total
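
To make the two-peer state easier to spot, the pool info text can be reduced to one line per peer. The following is a minimal sketch, not part of rbd or ODF: `list_peers` is a hypothetical helper name, and it assumes the output layout shown above (a `UUID:` line followed later by a `Direction:` line for each peer).

```shell
#!/bin/bash
# Hypothetical helper (not an rbd or ODF command).
# Reads `rbd mirror pool info --all` text on stdin and prints one
# "<peer-uuid> <direction>" line per peer site.
list_peers() {
    awk '
        # Peer UUID line; "Mirror UUID:" lines do not match this pattern.
        /^[[:space:]]*UUID:/      { uuid = $2 }
        # The Direction line completes the fields of interest for this peer.
        /^[[:space:]]*Direction:/ { print uuid, $2; uuid = "" }
    '
}
```

Assumed usage: `rbd -p ocs-storagecluster-cephblockpool mirror pool info --all | list_peers`. More than one output line on a cluster that should have a single peer indicates a stale entry.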


Version of all relevant components (if applicable):
OCP: Server Version: 4.14.1
ODF: 4.14.0
ACM: 2.8.3

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

Yes, replication will not work for rbd images.

Is there any workaround available to the best of your knowledge?

Maybe. Try removing the old mirroring peer and restart the rbd-mirror pods on both clusters.

$ rbd mirror pool peer remove {pool-name} {old-peer-uuid}

Example:
$ rbd mirror pool peer remove ocs-storagecluster-cephblockpool cd63acbb-77b9-42cf-bbdd-1f07e8259ec2

With ODF 4.13 this workaround worked, but with ODF 4.14 it does not fully work: daemon health only went from ERROR to WARNING.
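
The removal step of the workaround can be sketched as a dry run that only prints the commands to execute. This is an illustration under assumptions: `stale_peer_cmds` is a hypothetical helper name, the UUID of the peer to keep must be supplied by the operator, and the pool info text is expected on stdin in the format shown earlier in this report.

```shell
#!/bin/bash
# Hypothetical helper (not an rbd or ODF command).
# Prints, without running it, an `rbd mirror pool peer remove` command
# for every peer whose UUID differs from the one to keep.
#   $1 = pool name
#   $2 = UUID of the peer that should remain
#   stdin = text of `rbd mirror pool info --all`
stale_peer_cmds() {
    local pool=$1 keep=$2
    awk -v pool="$pool" -v keep="$keep" '
        # Match peer UUID lines only; "Mirror UUID:" lines are skipped.
        /^[[:space:]]*UUID:/ && $2 != keep {
            print "rbd mirror pool peer remove", pool, $2
        }
    '
}
```

Printing instead of executing leaves the final decision with the operator, since removing the wrong peer would break replication to the current site.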

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
5

Is this issue reproducible?
Yes


Steps to Reproduce:
1. Create RDR test environment.
2. Remove DR using cleanup-hub.sh
3. Remove Submariner
4. Detach one managedcluster using ACM console
5. Create new OCP cluster and import into ACM
6. Install ODF for RDR on new managedcluster
7. Install MCO operator and create new DRPolicy
8. Run "rbd -p ocs-storagecluster-cephblockpool mirror pool info --all" after logging into ceph on both managed clusters.


Actual results:

Cluster that was not recreated has 2 Peer Sites and rbd-mirror daemon status is ERROR.

Expected results:

Cluster that was not recreated has 1 Peer Site and rbd-mirror daemon status is OK.
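
The expected result can be expressed as a small check over the two command outputs. This is a sketch under assumptions: `check_mirror_state` is a hypothetical helper name, and it relies on the `UUID:` and `daemon health:` lines appearing as in the output quoted in this report.

```shell
#!/bin/bash
# Hypothetical helper (not an rbd or ODF command).
# Succeeds only when there is exactly one peer and daemon health is OK.
#   $1 = text of `rbd mirror pool info --all`
#   $2 = text of `rbd mirror pool status`
check_mirror_state() {
    local info=$1 status=$2 peers health
    # Count peer UUID lines; "Mirror UUID:" lines are not counted.
    peers=$(printf '%s\n' "$info" | grep -c '^[[:space:]]*UUID:')
    # Third field of the "daemon health: <STATE>" line.
    health=$(printf '%s\n' "$status" |
        awk '/^[[:space:]]*daemon health:/ { print $3 }')
    echo "peers=$peers daemon_health=$health"
    [ "$peers" -eq 1 ] && [ "$health" = "OK" ]
}
```

Assumed usage: capture both outputs into variables, for example `info=$(rbd -p ocs-storagecluster-cephblockpool mirror pool info --all)`, then call `check_mirror_state "$info" "$status"` and inspect the exit code.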

Additional info:

$ cat cleanup-hub.sh 
#!/bin/bash


secrets=$(oc get secrets -n openshift-operators | grep Opaque | cut -d" " -f1)
echo $secrets
for secret in $secrets
do
    oc patch -n openshift-operators secret/$secret -p '{"metadata":{"finalizers":null}}' --type=merge
done

mirrorpeers=$(oc get mirrorpeer -o name)
echo $mirrorpeers
for mp in $mirrorpeers
do
    oc patch $mp -p '{"metadata":{"finalizers":null}}' --type=merge
    oc delete $mp
done

drpolicies=$(oc get drpolicy -o name)
echo $drpolicies
for drp in $drpolicies
do
    oc patch $drp -p '{"metadata":{"finalizers":null}}' --type=merge
    oc delete $drp
done

drclusters=$(oc get drcluster -o name)
echo $drclusters
for drp in $drclusters
do
    oc patch $drp -p '{"metadata":{"finalizers":null}}' --type=merge
    oc delete $drp
done

oc delete project openshift-operators

managedclusters=$(oc get managedclusters -o name | cut -d"/" -f2)
echo $managedclusters
for mc in $managedclusters
do
    secrets=$(oc get secrets -n $mc | grep multicluster.odf.openshift.io/secret-type | cut -d" " -f1)
    echo $secrets
    for secret in $secrets
    do
        set -x
        oc patch -n $mc secret/$secret -p '{"metadata":{"finalizers":null}}' --type=merge
        oc delete -n $mc secret/$secret
    done
done

Comment 2 Annette Clewett 2023-11-30 22:17:29 UTC

Also the peerSecretNames should be removed if mirroring is disabled. 

  mirroring:
    peerSecretNames:
    - 4446d6504fadb8362e6814327c7a1172b9735c0

Comment 3 Santosh Pillai 2023-12-01 10:22:46 UTC

is this a regression?

Comment 5 Mudit Agarwal 2024-01-21 14:15:21 UTC

Not a 4.15 blocker

Comment 8 Santosh Pillai 2024-04-15 14:38:09 UTC

So if I'm reading this correctly, the main issue looks resolved. But the following issues remain in the CephCluster CR.

1. PeerSecretName is still there after disabling mirroring
  - I think the entity that adds this peer secret name to the CephCluster CR should be responsible for removing it as well. Thoughts?

2. The CephCluster CR status is still showing mirroring info, even though mirroring is disabled and the peer site is removed.
  - I need to check this again, but I believe we run a goroutine that keeps checking the mirroring health and adds it to the CephCluster status. Once mirroring is disabled, the goroutine is stopped and the status is no longer updated, so it still shows the old mirroring info.


Annette, can you please check if there is any side effect of [2]?  That is, if you try to enable mirroring again and add a new peer, does it work?

Comment 9 Santosh Pillai 2024-04-15 14:38:32 UTC

reading this correctly*

Comment 12 Santosh Pillai 2024-04-24 05:45:36 UTC

is comment #7 a blocker for disabling the mirroring and re-enabling it on another cluster? 
If not, then I think we should close this BZ (if the main issue is fixed) and open a new one for comment #7. 

Also, from Comment #7: 
- The deleting of the `peerSecretNames` from the spec should not be the responsibility of the OCS operator or Rook. It should come from the entity that is adding the `peerSecretNames` in the spec. 
- The CephCluster CR status still showing the status of the old peer should be fixed in Rook. But I don't think it's a blocker, since it does not block re-enabling mirroring to another site.

Comment 14 Santosh Pillai 2024-06-12 04:24:51 UTC

No doc text is required for this.

Comment 18 errata-xmlrpc 2024-07-17 13:11:41 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:4591

Comment 19 Red Hat Bugzilla 2024-11-15 04:25:13 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days