Description of problem (please be as detailed as possible and provide log snippets):

These two RBD-based workloads, which were in the Deployed state earlier, were failed over.

amagrawa:acm$ drpc
NAMESPACE             NAME                                   AGE     PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME             DURATION           PEER READY
busybox-workloads-2   busybox-workloads-2-placement-1-drpc   2d23h   amagrawa-c1        amagrawa-c2       Failover       FailedOver     Completed     2023-09-18T19:18:58Z   48m18.757117217s   True
openshift-gitops      busybox-workloads-1-placement-drpc     2d23h   amagrawa-c1        amagrawa-c2       Failover       FailedOver     Completed     2023-09-18T19:17:45Z   44m1.2498824s      True

Version of all relevant components (if applicable):
ODF 4.14.0-132.stable
OCP 4.14.0-0.nightly-2023-09-02-132842
ACM 2.9.0-DOWNSTREAM-2023-08-24-09-30-12
subctl version: v0.16.0
ceph version 17.2.6-138.el9cp (b488c8dad42b2ecffcd96f3d76eeeecce48b8590) quincy (stable)

Does this issue impact your ability to continue to work with the product (please explain in detail what the user impact is)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy AppSet- and Subscription-based DR-protected RBD workloads on an RDR setup and run IOs for a few days (4-5 days in this case).
2. Perform a Submariner connectivity check; ensure mirroring is working, lastGroupSyncTime is within the desired range, rbd image status is healthy, etc.
3. Bring the master nodes of the primary cluster down.
4. Perform failover of both the AppSet- and Subscription-based workloads once the cluster is marked unavailable in the ACM UI.
5. Wait for Pods and other resources to come up on the new primary and for the drpc progression state to change to Cleaning Up.
6. Bring the master nodes back up and let the cluster become reachable again.
7. Ensure the subctl verify connectivity check passes for Submariner; wait for cleanup to complete.
8. Check rbd image status again (rough commands for the checks in steps 2, 7 and 8 are sketched below).
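The drpc and mirror outputs in this report come from local shell aliases; roughly, the checks above map to commands like the following. This is a minimal sketch only: the pool name ocs-storagecluster-cephblockpool, the openshift-storage namespace, the rook-ceph-tools toolbox deployment, and the exact subctl arguments are assumptions and may differ per setup and version.

# DRPC state on the hub (wide output should include the PROGRESSION/PEER READY columns in this Ramen build)
$ oc get drpc -A -o wide

# Mirroring health summary reported on the CephBlockPool (what the "mirror" alias appears to print)
$ oc get cephblockpool ocs-storagecluster-cephblockpool -n openshift-storage -o jsonpath='{.status.mirroringStatus}'

# Per-image mirror status from the Ceph toolbox (assumes the toolbox is enabled)
$ oc exec -n openshift-storage deploy/rook-ceph-tools -- rbd mirror pool status ocs-storagecluster-cephblockpool --verbose

# lastGroupSyncTime per DRPC on the hub
$ oc get drpc -A -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.lastGroupSyncTime}{"\n"}{end}'

# Submariner connectivity verification (context/kubeconfig arguments depend on the subctl version)
$ subctl verify --only connectivity <context/kubeconfig args>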
Actual results:

A few rbd images remain stuck in starting_replay post failover; image_health also reports a WARNING.

C1-

amagrawa:~$ mirror
{
  "lastChecked": "2023-09-18T20:14:39Z",
  "summary": {
    "daemon_health": "OK",
    "health": "WARNING",
    "image_health": "WARNING",
    "states": {
      "replaying": 9,
      "starting_replay": 3
    }
  }
}

csi-vol-4d7b71bc-afa5-478b-8a1b-cf372eb5f297
csi-vol-4d7b71bc-afa5-478b-8a1b-cf372eb5f297:
  global_id:   1aec3522-725e-4e9c-8d1b-134a861d55b8
  state:       up+replaying
  description: replaying, {"bytes_per_second":0.0,"bytes_per_snapshot":52676608.0,"last_snapshot_bytes":63197184,"last_snapshot_sync_seconds":1,"local_snapshot_timestamp":1695067800,"remote_snapshot_timestamp":1695067800,"replay_state":"idle"}
  service:     a on compute-0
  last_update: 2023-09-18 20:16:30
  peer_sites:
    name: 724e0358-2cfc-4a0f-9a99-419999493584
    state: up+starting_replay
    description: starting replay
    last_update: 2023-09-18 20:16:18

##########################################

csi-vol-b3031480-5f84-4bf4-b976-6809c32fa1f8
csi-vol-b3031480-5f84-4bf4-b976-6809c32fa1f8:
  global_id:   6ba80b97-eb9c-4eff-8ffe-d691b14befee
  state:       up+replaying
  description: replaying, {"bytes_per_second":0.0,"bytes_per_snapshot":284784640.0,"last_snapshot_bytes":389464064,"last_snapshot_sync_seconds":7,"local_snapshot_timestamp":1695068100,"remote_snapshot_timestamp":1695068100,"replay_state":"idle"}
  last_update: 2023-09-18 20:16:24
  peer_sites:
    name: 724e0358-2cfc-4a0f-9a99-419999493584
    state: up+starting_replay
    description: starting replay
    last_update: 2023-09-18 20:16:18

##########################################

csi-vol-d4e319c8-2329-482c-9804-7bc023b59d22
csi-vol-d4e319c8-2329-482c-9804-7bc023b59d22:
  global_id:   aaddeb70-0e98-4389-9c3c-ab0e385e76de
  state:       up+replaying
  description: replaying, {"bytes_per_second":0.0,"bytes_per_snapshot":326821888.0,"last_snapshot_bytes":445595648,"last_snapshot_sync_seconds":8,"local_snapshot_timestamp":1695068100,"remote_snapshot_timestamp":1695068100,"replay_state":"idle"}
  last_update: 2023-09-18 20:16:24
  peer_sites:
    name: 724e0358-2cfc-4a0f-9a99-419999493584
    state: up+starting_replay
    description: starting replay
    last_update: 2023-09-18 20:16:18

##########################################

C2-

amagrawa:c2$ mirror
{
  "lastChecked": "2023-09-18T20:34:46Z",
  "summary": {
    "daemon_health": "OK",
    "health": "WARNING",
    "image_health": "WARNING",
    "states": {
      "replaying": 9,
      "starting_replay": 3
    }
  }
}

Must-gather logs are kept here-
http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/19sept23-1/

subctl verify logs collected post failover, once the older primary cluster became reachable-
http://pastebin.test.redhat.com/1109565
All checks passed.

Expected results:

rbd images should go back to the replaying state post failover and report healthy.

Additional info:
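To narrow down one of the stuck images, the per-image status can be queried directly from the toolbox on either cluster, and the rbd-mirror daemon pod can be checked alongside it. A sketch, assuming the default pool name, namespace, toolbox deployment, and the app=rook-ceph-rbd-mirror pod label:

# Status of one of the images reported in starting_replay above
$ oc exec -n openshift-storage deploy/rook-ceph-tools -- rbd mirror image status ocs-storagecluster-cephblockpool/csi-vol-4d7b71bc-afa5-478b-8a1b-cf372eb5f297

# rbd-mirror daemon pod health on the same cluster
$ oc get pods -n openshift-storage -l app=rook-ceph-rbd-mirror
$ oc exec -n openshift-storage deploy/rook-ceph-tools -- ceph -s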
Failover was performed from C1 to C2.
As confirmed by Ilya in https://chat.google.com/room/AAAAqWkMm2s/wxV5kxtqX1g nearfull isn't the same as full, so it shouldn't be a problem.
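For context on the nearfull point: nearfull only raises a health warning, while writes are blocked only once an OSD crosses full_ratio. The thresholds and any related warnings can be checked from the toolbox; a sketch, assuming the rook-ceph-tools toolbox deployment:

$ oc exec -n openshift-storage deploy/rook-ceph-tools -- ceph df
$ oc exec -n openshift-storage deploy/rook-ceph-tools -- ceph health detail
# nearfull_ratio / backfillfull_ratio / full_ratio thresholds
$ oc exec -n openshift-storage deploy/rook-ceph-tools -- sh -c 'ceph osd dump | grep -i ratio'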
The issue wasn't reproduced when testing with:

ODF 4.14.0-150.stable
ACM 2.9.0-DOWNSTREAM-2023-10-12-14-53-11
advanced-cluster-management.v2.9.0-187
Submariner brew.registry.redhat.io/rh-osbs/iib:594788
OCP 4.14.0-0.nightly-2023-10-14-061428
ceph version 17.2.6-146.el9cp (1d01c2b30b5fd39787bb8804707c4b2e52e30137) quincy (stable)

Therefore, marking it as Verified.
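For reference, the version set above can be collected roughly as follows (default namespaces and the toolbox deployment are assumptions; the ACM and Submariner build identifiers come from the deployed bundles):

$ oc get clusterversion version                                        # OCP
$ oc get csv -n openshift-storage                                      # ODF operators
$ oc get csv -n open-cluster-management                                # ACM (hub)
$ subctl version                                                       # Submariner
$ oc exec -n openshift-storage deploy/rook-ceph-tools -- ceph version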
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2023:6832