Bug 2180329 - [RDR][tracker for BZ 2215392] RBD images left behind in managed cluster after deleting the application
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.13
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ODF 4.14.0
Assignee: N Balachandran
QA Contact: kmanohar
URL:
Whiteboard:
Depends On: 2215392
Blocks:
 
Reported: 2023-03-21 08:07 UTC by Sidhant Agrawal
Modified: 2023-11-08 18:51 UTC
CC: 9 users

Fixed In Version: 4.14.0-110
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 2215392
Environment:
Last Closed: 2023-11-08 18:50:04 UTC
Embargoed:




Links
Red Hat Product Errata RHSA-2023:6832 (Last Updated: 2023-11-08 18:51:47 UTC)

Description Sidhant Agrawal 2023-03-21 08:07:06 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
On an RDR setup, after performing failover and relocate operations and then deleting the DR workload, it was observed that the RBD images were not deleted from the secondary managed cluster.

Version of all relevant components (if applicable):
OCP: 4.13.0-0.nightly-2023-03-14-053612
ODF: 4.13.0-107
Ceph: 17.2.5-75.el9cp (52c8ab07f1bc5423199eeb6ab5714bc30a930955) quincy (stable)
ACM: 2.7.2
Submariner: 0.14.2

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
RBD images are left behind in one of the managed clusters, and the mirroring status shows health and image_health in a WARNING state.

Is there any workaround available to the best of your knowledge?
Restart the RBD mirror daemon on the managed cluster where images were left behind.
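
A minimal sketch of the workaround, assuming the daemon runs as the rook-ceph-rbd-mirror-a deployment in the openshift-storage namespace (the deployment name may differ on a given cluster):

$ oc -n openshift-storage get deployments | grep rbd-mirror   # confirm the actual deployment name first
$ oc -n openshift-storage scale deployment rook-ceph-rbd-mirror-a --replicas=0
$ oc -n openshift-storage scale deployment rook-ceph-rbd-mirror-a --replicas=1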

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:
Yes, the issue was not observed in 4.12.

Steps to Reproduce:
1. Configure RDR setup
2. Deploy an application containing 20 PVCs/Pods on C1  
3. Wait for 10 minutes to run IOs  
4. Scale down RBD mirror daemon deployment to 0  
5. Initiate failover to C2  
6. Check PVC and pod resources are created on C2 successfully.
7. Scale up RBD mirror daemon deployment back to 1
8. Check application and replication resources deleted from C1
9. Check mirroring status 
cluster: sagrawal-c1
{'daemon_health': 'OK', 'health': 'OK', 'image_health': 'OK', 'states': {'replaying': 20}}
cluster: sagrawal-c2
{'daemon_health': 'OK', 'health': 'OK', 'image_health': 'OK', 'states': {'replaying': 20}}
10. Wait for 10 minutes to run IOs
11. Initiate Relocate to C1
12. Check mirroring status after relocate operation
cluster: sagrawal-c1
{'daemon_health': 'OK', 'health': 'OK', 'image_health': 'OK', 'states': {'replaying': 20}}
cluster: sagrawal-c2
{'daemon_health': 'OK', 'health': 'OK', 'image_health': 'OK', 'states': {'replaying': 20}}
13. Delete the application
14. Observe the mirroring status
cluster: sagrawal-c1
{"daemon_health":"OK","health":"OK","image_health":"OK","states":{}}

cluster: sagrawal-c2
{"daemon_health":"OK","health":"WARNING","image_health":"WARNING","states":{"unknown":15}}

Automated test:  
tests/disaster-recovery/regional-dr/test_failover_and_relocate.py


Actual results:
After deleting the application workload, the mirroring status is in WARNING and RBD images are left behind in the managed cluster.

Expected results:
Mirroring status should be OK and all RBD images should be deleted after deleting the application workload.
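
For reference, a minimal way to check for images left behind in the pool, again from the toolbox pod (the image name below is a placeholder):

sh-5.1$ rbd ls -p ocs-storagecluster-cephblockpool
sh-5.1$ rbd mirror image status ocs-storagecluster-cephblockpool/<image>   # replace <image> with a name from the listing above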

Comment 14 Karolin Seeger 2023-05-30 11:32:26 UTC
Moving this one to 4.14 as it's not reproducible in recent builds and RDR is in Tech Preview (TP) state.

Comment 20 Mudit Agarwal 2023-07-25 07:22:55 UTC
Fix is merged upstream, waiting for ceph builds with the fix.

Comment 27 kmanohar 2023-10-15 17:51:39 UTC
VERIFICATION COMMENTS:
=====================

Steps to Reproduce:
-------------------

1. Configure RDR setup
2. Deploy an application containing 20 PVCs/Pods on C1  
3. Wait for 10 minutes to run IOs  
4. Scale down RBD mirror daemon deployment to 0  
5. Initiate failover to C2  
6. Check PVC and pod resources are created on C2 successfully.
7. Scale up RBD mirror daemon deployment back to 1
8. Check application and replication resources deleted from C1
9. Check mirroring status 
cluster: sagrawal-c1
{'daemon_health': 'OK', 'health': 'OK', 'image_health': 'OK', 'states': {'replaying': 20}}
cluster: sagrawal-c2
{'daemon_health': 'OK', 'health': 'OK', 'image_health': 'OK', 'states': {'replaying': 20}}
10. Wait for 10 minutes to run IOs
11. Initiate Relocate to C1
12. Check mirroring status after relocate operation
cluster: sagrawal-c1
{'daemon_health': 'OK', 'health': 'OK', 'image_health': 'OK', 'states': {'replaying': 20}}
cluster: sagrawal-c2
{'daemon_health': 'OK', 'health': 'OK', 'image_health': 'OK', 'states': {'replaying': 20}}
13. Delete the application
14. Observe the mirroring status
cluster: sagrawal-c1
{"daemon_health":"OK","health":"OK","image_health":"OK","states":{}}

cluster: sagrawal-c2
{"daemon_health":"OK","health":"WARNING","image_health":"WARNING","states":{"unknown":15}}

Automated test:  
tests/disaster-recovery/regional-dr/test_failover_and_relocate.py


Actual results:
After deleting the application workload, the mirroring status is in WARNING and RBD images are left behind in the managed cluster.

Expected results:
Mirroring status should be OK and all RBD images should be deleted after deleting the application workload.

_____________________________________________________________________________________________________________


Output after deleting the application
-------------------------------------

On C1

sh-5.1$  rbd mirror pool status ocs-storagecluster-cephblockpool
health: OK
daemon health: OK
image health: OK
images: 0 total

sh-5.1$ for i in $(rbd ls -p ocs-storagecluster-cephblockpool); do echo $i; rbd snap ls ocs-storagecluster-cephblockpool/$i --all 2>/dev/null; echo "##########################################";done
csi-vol-e9ffb004-2730-4369-a916-dd45e29f2a41
##########################################

$ oc get pods
No resources found in busybox-workloads-1 namespace.

$ oc get pvc
No resources found in busybox-workloads-1 namespace.



On C2

sh-5.1$  rbd mirror pool status ocs-storagecluster-cephblockpool
health: OK
daemon health: OK
image health: OK
images: 0 total

sh-5.1$ for i in $(rbd ls -p ocs-storagecluster-cephblockpool); do echo $i; rbd snap ls ocs-storagecluster-cephblockpool/$i --all 2>/dev/null; echo "##########################################";done
csi-vol-7e98268e-2b2a-40c7-86cd-deeab148d0c9
##########################################


Verified on
------------

OCP - 4.14.0-0.nightly-2023-10-13-032002
ODF - 4.14.0-150
Ceph version - ceph version 17.2.6-146.el9cp (1d01c2b30b5fd39787bb8804707c4b2e52e30137) quincy (stable)
Submariner - 0.16.0
ACM - 2.9.0 (Image - 2.9.0-DOWNSTREAM-2023-10-03-20-08-35)

Must gather - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/bz-v/bz-2180329/

Comment 29 errata-xmlrpc 2023-11-08 18:50:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6832

