Bug 2019161 - [RDR] [Tracker for BZ #2020618] RBD Mirror snapshot processing stops after a few Failover-Relocate operations
Summary: [RDR] [Tracker for BZ #2020618] RBD Mirror snapshot processing stops after a f...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ODF 4.12.0
Assignee: Ilya Dryomov
QA Contact: Pratik Surve
URL:
Whiteboard:
Depends On: 2020618 2100519
Blocks: 2024792 2030749
 
Reported: 2021-11-01 17:53 UTC by Jean-Charles Lopez
Modified: 2023-08-09 16:37 UTC
20 users

Fixed In Version: 4.11.0-107
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 2020618 2024792 2030749
Environment:
Last Closed: 2023-01-31 00:19:18 UTC
Embargoed:


Attachments
Log captured by Shyam (5.93 MB, application/x-tar)
2021-11-01 17:53 UTC, Jean-Charles Lopez
Logs gathered after the reproduction of the bug (2.10 MB, application/zip)
2021-11-09 22:31 UTC, Benamar Mekhissi
rbdmirror with debug log. Last snap taken at: Wed Nov 10 12:41:02 (11.22 MB, text/plain)
2021-11-10 13:39 UTC, Benamar Mekhissi
ceph manager log (1.49 MB, application/zip)
2021-11-10 13:42 UTC, Benamar Mekhissi


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2023:0551 0 None None None 2023-01-31 00:19:42 UTC

Description Jean-Charles Lopez 2021-11-01 17:53:44 UTC
Created attachment 1838987 [details]
Log captured by Shyam

Description of problem (please be as detailed as possible and provide log
snippets):
Deployed ODF and ODR operators.
Failed over and relocated the busybox application many times successfully.
At some point we noticed that the snapshots used for mirroring were no longer being created.

Version of all relevant components (if applicable):
Deployed ODF 4.9 build 203
Over OCP 4.9.0


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?
Not identified yet

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?
Not sure at this point

Can this issue be reproduced from the UI?
No

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy ODF and ODR
2. Deploy the busybox test app
3. Failover and relocate the app multiple times (our policy is set to 5 minutes)
4. Check the timestamps recorded on the file on the PV to check the lag
5. The lag goes far above 5 minutes (in our case over 14 minutes)
6. Use the rbd command to check the snapshots created for the test RBD
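The lag check in steps 4-5 can be sketched as a small helper: compare the last timestamp that made it to the secondary against the current time and flag anything beyond the policy interval. This is a minimal illustration, not part of the reproducer; the timestamp format and values are made up for the example.

```python
from datetime import datetime, timedelta

def replication_lag(last_synced: str, now: str) -> timedelta:
    """Return how far the mirrored copy trails the primary.

    Both arguments use an illustrative "YYYY-MM-DD HH:MM:SS" format,
    standing in for the timestamps recorded on the file on the PV.
    """
    fmt = "%Y-%m-%d %H:%M:%S"
    return datetime.strptime(now, fmt) - datetime.strptime(last_synced, fmt)

def lag_exceeds_policy(last_synced: str, now: str, policy_minutes: int = 5) -> bool:
    """True when the lag is beyond the scheduling policy interval."""
    return replication_lag(last_synced, now) > timedelta(minutes=policy_minutes)

# With a 5-minute schedule, a 14-minute lag (as seen in step 5) is a red flag.
print(lag_exceeds_policy("2021-11-01 17:39:00", "2021-11-01 17:53:00"))  # True
```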


Actual results:


Expected results:


Additional info:

Comment 2 Jean-Charles Lopez 2021-11-01 19:14:54 UTC
RBD command to check on the snapshots for an RBD image

rbd -p {pool} snap ls {rbd image name} --all
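When checking many images, eyeballing the `snap ls --all` output gets tedious. A hedged sketch of a parser that pulls the newest mirror snapshot (name and timestamp) out of that listing, assuming the column layout shown in the comments below; the truncated sample listing reuses lines from this bug report.

```python
import re
from datetime import datetime

# Matches ctime-style timestamps like "Wed Nov 10 10:33:30 2021".
SNAP_TS = re.compile(r"[A-Z][a-z]{2} [A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2} \d{4}")

def latest_mirror_snapshot(listing: str):
    """Return (name, timestamp) of the newest mirror snapshot found in the
    output of `rbd -p {pool} snap ls {image} --all`, or None if there is none."""
    latest = None
    for line in listing.splitlines():
        if ".mirror." not in line:
            continue  # skip the header row and non-mirror snapshots
        m = SNAP_TS.search(line)
        if not m:
            continue
        ts = datetime.strptime(m.group(), "%a %b %d %H:%M:%S %Y")
        name = line.split()[1]  # columns: SNAPID NAME SIZE ...
        if latest is None or ts > latest[1]:
            latest = (name, ts)
    return latest

listing = """\
SNAPID  NAME  SIZE  PROTECTED  TIMESTAMP  NAMESPACE
  1997  .mirror.non_primary.26b302a2-f175-40a7-a2ae-95da7941d92a.e505cd41-36e5-4a29-9ad6-f4a27cd3319d  1 GiB  Wed Nov 10 10:48:43 2021  mirror (demoted)
  1999  .mirror.primary.26b302a2-f175-40a7-a2ae-95da7941d92a.45ed5d7f-7128-4239-a327-8c7b6bf44155  1 GiB  Wed Nov 10 10:49:04 2021  mirror (primary)
"""
name, ts = latest_mirror_snapshot(listing)
```

Comparing `ts` against the current time then tells you whether the schedule is still firing.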

Comment 11 Benamar Mekhissi 2021-11-09 22:31:27 UTC
Created attachment 1840965 [details]
Logs gathered after the reproduction of the bug

Comment 12 Benamar Mekhissi 2021-11-09 22:33:55 UTC
Annette and I were able to reproduce the problem simply by failing over and then relocating. I have gathered the logs and attached them.
We failed over from perf1 to perf2 and relocated back from perf2 to perf1. In each directory, you'll find the necessary logs. I have not analyzed the logs yet; that will be my next step. But here is a summary of the problem:

1. Failover started on: 2021-11-09T18:25:21.340Z (no issues)
2. Failback (relocation) started at: 2021-11-09T18:47:15.078Z (snaps issue)
3. Snaps stopped being generated at 18:48:30 after the failback.  You'll find more info in perf1/snap-ls.log and perf2/snap-ls.log
4. Snaps started being generated again once we attempted another failback at 2021-11-09T19:48:14.260Z.

Comment 13 Benamar Mekhissi 2021-11-10 11:11:26 UTC
So I eliminated all the K8s and DR components and used just the rbd commands (based on Shyam's script above) to reproduce the issue. Here are the steps I followed:
1. Created an RBD image on `perf1`
2. Enabled the image for mirroring
3. Added a 2m schedule, but it didn't take effect (I think because there was already a schedule set for that pool at a 5m interval)
4. Performed IO on `perf1` and then looked at the following:
Using project "default".
===   Mirror image status default/api-perf1-chris-ocs-ninja:6443/kube:admin   ===
test-1:
  global_id:   26b302a2-f175-40a7-a2ae-95da7941d92a
  state:       up+stopped
  description: local image is primary
  service:     a on perf1-tnd5f-ocs-ggwcm
  last_update: 2021-11-10 10:33:34
  peer_sites:
    name: 9b446270-c120-4c2b-abcc-d8a2e9e80dd9
    state: up+replaying
    description: replaying, {"bytes_per_second":0.0,"bytes_per_snapshot":0.0,"local_snapshot_timestamp":1636540410,"remote_snapshot_timestamp":1636540410,"replay_state":"idle"}
    last_update: 2021-11-10 10:33:59
  snapshots:
    1980 .mirror.primary.26b302a2-f175-40a7-a2ae-95da7941d92a.2cce20c3-e82f-4561-a281-822d826f3ae4 (peer_uuids:[634c80d3-65ae-4e02-9862-de97b73ccb2e])

===   Mirror snapshot list default/api-perf1-chris-ocs-ninja:6443/kube:admin   ===
SNAPID  NAME                                                                                       SIZE   PROTECTED  TIMESTAMP                 NAMESPACE                                                         
  1980  .mirror.primary.26b302a2-f175-40a7-a2ae-95da7941d92a.2cce20c3-e82f-4561-a281-822d826f3ae4  1 GiB             Wed Nov 10 10:33:30 2021  mirror (primary peer_uuids:[634c80d3-65ae-4e02-9862-de97b73ccb2e])

5. Force-promoted `perf2` to simulate a failover
6. Performed IO on `perf2`
7. Performed IO on `perf1` in order to cause a split-brain
8. Demoted `perf1`
9. Resynced `perf1`
10. Performed IO on `perf2`, then took the following samples
    Using project "default".
===   Mirror image status default/api-perf2-chris-ocs-ninja:6443/kube:admin   ===
test-1:
  global_id:   26b302a2-f175-40a7-a2ae-95da7941d92a
  state:       up+stopped
  description: local image is primary
  service:     a on perf2-m7t6r-ocs-5wkmj
  last_update: 2021-11-10 10:40:29
  peer_sites:
    name: 23156624-cc9c-4375-8a0c-91a015e4711e
    state: up+stopped
    description: local image is primary
    last_update: 2021-11-10 10:40:34
  snapshots:
    445 .mirror.primary.26b302a2-f175-40a7-a2ae-95da7941d92a.cd6165c8-a3d1-4a8b-87dc-4c513fa0a77a (peer_uuids:[ffe68f70-3873-434b-a134-17c00e95c179])

===   Mirror snapshot list default/api-perf2-chris-ocs-ninja:6443/kube:admin   ===
SNAPID  NAME                                                                                           SIZE   PROTECTED  TIMESTAMP                 NAMESPACE                                                         
   443  .mirror.non_primary.26b302a2-f175-40a7-a2ae-95da7941d92a.8de402ef-e283-436f-9c3f-49f8e90d8b76  1 GiB             Wed Nov 10 10:36:25 2021  mirror (non-primary peer_uuids:[] :18446744073709551614 copied)   
   445  .mirror.primary.26b302a2-f175-40a7-a2ae-95da7941d92a.cd6165c8-a3d1-4a8b-87dc-4c513fa0a77a      1 GiB             Wed Nov 10 10:36:29 2021  mirror (primary peer_uuids:[ffe68f70-3873-434b-a134-17c00e95c179])

11. Demoted `perf2` to simulate a failback
12. Promoted `perf1`. One snap was taken, and no more after that
Image promoted to primary

Wed Nov 10 05:49:04 EST 2021
Login successful.

You have access to 72 projects, the list has been suppressed. You can list all projects with 'oc projects'

Using project "default".
===   Mirror image status default/api-perf1-chris-ocs-ninja:6443/kube:admin   ===
test-1:
  global_id:   26b302a2-f175-40a7-a2ae-95da7941d92a
  state:       up+stopped
  description: local image is primary
  service:     a on perf1-tnd5f-ocs-ggwcm
  last_update: 2021-11-10 10:49:04
  peer_sites:
    name: 9b446270-c120-4c2b-abcc-d8a2e9e80dd9
    state: up+unknown
    description: remote image demoted
    last_update: 2021-11-10 10:49:00
  snapshots:
    1999 .mirror.primary.26b302a2-f175-40a7-a2ae-95da7941d92a.45ed5d7f-7128-4239-a327-8c7b6bf44155 (peer_uuids:[634c80d3-65ae-4e02-9862-de97b73ccb2e])

===   Mirror snapshot list default/api-perf1-chris-ocs-ninja:6443/kube:admin   ===
SNAPID  NAME                                                                                           SIZE   PROTECTED  TIMESTAMP                 NAMESPACE                                                                                                         
  1997  .mirror.non_primary.26b302a2-f175-40a7-a2ae-95da7941d92a.e505cd41-36e5-4a29-9ad6-f4a27cd3319d  1 GiB             Wed Nov 10 10:48:43 2021  mirror (demoted peer_uuids:[634c80d3-65ae-4e02-9862-de97b73ccb2e] b5334237-56b7-47ab-a3ec-4c4cbe95e855:450 copied)
  1999  .mirror.primary.26b302a2-f175-40a7-a2ae-95da7941d92a.45ed5d7f-7128-4239-a327-8c7b6bf44155      1 GiB             Wed Nov 10 10:49:04 2021  mirror (primary peer_uuids:[634c80d3-65ae-4e02-9862-de97b73ccb2e]) 

13. Performed IO on `perf1` but no new snaps
===   Perform IO on default/api-perf1-chris-ocs-ninja:6443/kube:admin   ===
bench  type write io_size 4096 io_threads 2 bytes 15728640 pattern sequential
  SEC       OPS   OPS/SEC   BYTES/SEC
elapsed: 0   ops: 3840   ops/sec: 25771.4   bytes/sec: 101 MiB/s
writing 1 iteration
bench  type write io_size 4096 io_threads 2 bytes 15728640 pattern sequential
  SEC       OPS   OPS/SEC   BYTES/SEC
elapsed: 0   ops: 3840   ops/sec: 28871.7   bytes/sec: 113 MiB/s
writing 2 iteration
bench  type write io_size 4096 io_threads 2 bytes 15728640 pattern sequential
  SEC       OPS   OPS/SEC   BYTES/SEC
elapsed: 0   ops: 3840   ops/sec: 30967.2   bytes/sec: 121 MiB/s
writing 3 iteration
bench  type write io_size 4096 io_threads 2 bytes 15728640 pattern sequential
  SEC       OPS   OPS/SEC   BYTES/SEC
elapsed: 0   ops: 3840   ops/sec: 28656.2   bytes/sec: 112 MiB/s
writing 4 iteration
bench  type write io_size 4096 io_threads 2 bytes 15728640 pattern sequential
  SEC       OPS   OPS/SEC   BYTES/SEC
elapsed: 0   ops: 3840   ops/sec: 25430   bytes/sec: 99 MiB/s
writing 5 iteration


14. Here is the schedule
rbd -p ocs-storagecluster-cephblockpool mirror snapshot schedule status
SCHEDULE TIME        IMAGE                                                                        
2021-11-10 10:55:00  ocs-storagecluster-cephblockpool/csi-vol-d47636e8-41ab-11ec-af18-0a580a060e13

date: Wed Nov 10 10:56:17 UTC 2021

rbd -p ocs-storagecluster-cephblockpool snap ls test-1 --all
SNAPID  NAME                                                                                       SIZE   PROTECTED  TIMESTAMP                 NAMESPACE                                                         
  1999  .mirror.primary.26b302a2-f175-40a7-a2ae-95da7941d92a.45ed5d7f-7128-4239-a327-8c7b6bf44155  1 GiB             Wed Nov 10 10:49:04 2021  mirror (primary peer_uuids:[634c80d3-65ae-4e02-9862-de97b73ccb2e])

15. The scheduled time keeps moving forward without any snapshot being taken
2021-11-10 11:00:00  ocs-storagecluster-cephblockpool/csi-vol-d47636e8-41ab-11ec-af18-0a580a060e13

Comment 14 Benamar Mekhissi 2021-11-10 13:39:51 UTC
Created attachment 1841073 [details]
rbdmirror with debug log. Last snap taken at: Wed Nov 10 12:41:02

Comment 15 Benamar Mekhissi 2021-11-10 13:42:57 UTC
Created attachment 1841076 [details]
ceph manager log

Comment 16 Benamar Mekhissi 2021-11-10 18:19:11 UTC
The issue has been reproduced in Pratik's clusters. The last snap was taken on Wed Nov 10 17:49:49 from the preferredCluster vmware-dccp-one-ocsqe-lab-eng-rdu2-redhat-com. The bottom of this output also shows the snap listing from the same cluster 10 minutes later, with no change.


>>>>>>>>>>>>>>>>>>> Failing back to busybox-workloads/api-vmware-dccp-one-ocsqe-lab-eng-rdu2-redhat-com:6443/system:admin
===   Demoting busybox-workloads/api-prsurve-vm-dev-qe-rh-ocs-com:6443/system:admin   ===
Image demoted to non-primary
===   Promote on busybox-workloads/api-vmware-dccp-one-ocsqe-lab-eng-rdu2-redhat-com:6443/system:admin   ===
Image promoted to primary

Wed Nov 10 12:49:49 EST 2021
===   Mirror image status busybox-workloads/api-vmware-dccp-one-ocsqe-lab-eng-rdu2-redhat-com:6443/system:admin   ===
test-1:
  global_id:   3ac4a0a2-52ec-44e0-86e1-c783fc936b12
  state:       up+replaying
  description: replaying, {"bytes_per_second":0.0,"bytes_per_snapshot":0.0,"local_snapshot_timestamp":1636566015,"remote_snapshot_timestamp":1636566015,"replay_state":"idle"}
  service:     a on vmware-dccp-one-6nzb5-worker-xvllk
  last_update: 2021-11-10 17:49:44
  peer_sites:
    name: 85826cdf-6a26-4620-aa46-df3323cb5641
    state: up+stopped
    description: local image is primary
    last_update: 2021-11-10 17:49:34
  snapshots:
    2772 .mirror.primary.3ac4a0a2-52ec-44e0-86e1-c783fc936b12.ee221a0a-48eb-43f9-bad3-7946822aebb4 (peer_uuids:[f647fbb6-8729-4d53-8b13-d2dcb5bbb073])

===   Mirror snapshot list busybox-workloads/api-vmware-dccp-one-ocsqe-lab-eng-rdu2-redhat-com:6443/system:admin   ===
SNAPID  NAME                                                                                           SIZE   PROTECTED  TIMESTAMP                 NAMESPACE                                                                                                          
  2770  .mirror.non_primary.3ac4a0a2-52ec-44e0-86e1-c783fc936b12.feb78f21-90c9-4e39-b6ee-1cc12c0219fc  1 GiB             Wed Nov 10 17:44:50 2021  mirror (non-primary peer_uuids:[] 67843ffe-f344-485f-b91d-0af0b122ed4f:2769 copied)                                
  2771  .mirror.non_primary.3ac4a0a2-52ec-44e0-86e1-c783fc936b12.4dd08ff4-e5d7-42e1-91dc-aab1dbebbf9d  1 GiB             Wed Nov 10 17:49:42 2021  mirror (demoted peer_uuids:[f647fbb6-8729-4d53-8b13-d2dcb5bbb073] 67843ffe-f344-485f-b91d-0af0b122ed4f:2770 copied)
  2772  .mirror.primary.3ac4a0a2-52ec-44e0-86e1-c783fc936b12.ee221a0a-48eb-43f9-bad3-7946822aebb4      1 GiB             Wed Nov 10 17:49:49 2021  mirror (primary peer_uuids:[f647fbb6-8729-4d53-8b13-d2dcb5bbb073])                                                 

===   Mirror image status busybox-workloads/api-prsurve-vm-dev-qe-rh-ocs-com:6443/system:admin   ===
test-1:
  global_id:   3ac4a0a2-52ec-44e0-86e1-c783fc936b12
  state:       up+stopped
  description: local image is primary
  service:     a on prsurve-vm-dev-88rv4-worker-vh6hb
  last_update: 2021-11-10 17:49:34
  peer_sites:
    name: 73dfaa19-44bd-41a9-acae-4985e6b05d05
    state: up+stopped
    description: force promoted
    last_update: 2021-11-10 17:49:59

===   Mirror snapshot list busybox-workloads/api-prsurve-vm-dev-qe-rh-ocs-com:6443/system:admin   ===
SNAPID  NAME                                                                                           SIZE   PROTECTED  TIMESTAMP                 NAMESPACE                                                         
  2767  .mirror.non_primary.3ac4a0a2-52ec-44e0-86e1-c783fc936b12.6fef8d63-02b2-475b-b5dc-17283bb13af8  1 GiB             Wed Nov 10 17:40:13 2021  mirror (non-primary peer_uuids:[] :18446744073709551614 copied)   
  2770  .mirror.primary.3ac4a0a2-52ec-44e0-86e1-c783fc936b12.bc48dfb0-d3a2-4881-990e-1f4e4bc1e20a      1 GiB             Wed Nov 10 17:49:40 2021  mirror (demoted peer_uuids:[25e3849e-fd00-4458-9347-c4bf67f19bcb])


>>>>>>>> 10 minutes later no more snaps >>>>>>>>>

./tbox.sh busybox-workloads/api-vmware-dccp-one-ocsqe-lab-eng-rdu2-redhat-com:6443/system:admin rbd -p ocs-storagecluster-cephblockpool snap ls test-1 --all
SNAPID  NAME                                                                                       SIZE   PROTECTED  TIMESTAMP                 NAMESPACE                                                         
  2772  .mirror.primary.3ac4a0a2-52ec-44e0-86e1-c783fc936b12.ee221a0a-48eb-43f9-bad3-7946822aebb4  1 GiB             Wed Nov 10 17:49:49 2021  mirror (primary peer_uuids:[f647fbb6-8729-4d53-8b13-d2dcb5bbb073])

Comment 17 Benamar Mekhissi 2021-11-11 16:14:03 UTC
I used a pool that already had a snapshot schedule set up. I repeated the test and got the same result. This is the last snap after I failed over and then relocated.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
SNAPID  NAME                                                                                       SIZE   PROTECTED  TIMESTAMP                 NAMESPACE
  6383  .mirror.primary.62fd69cc-c42a-423e-85b8-a5ba1b352d7b.1b139844-9cfd-4945-867e-6866457562b7  1 GiB             Thu Nov 11 15:48:59 2021  mirror (primary peer_uuids:[f647fbb6-8729-4d53-8b13-d2dcb5bbb073])
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Snapshot creation stopped as before. The interesting thing is that if, at this point, I re-add the schedule, snapshot creation resumes:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
$ rbd mirror snapshot schedule add -p ocs-storagecluster-cephblockpool --image test-6 5m

$ rbd mirror snapshot schedule status -p ocs-storagecluster-cephblockpool
SCHEDULE TIME        IMAGE                                  
2021-11-11 16:05:00  ocs-storagecluster-cephblockpool/test-6

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
$ rbd snap ls ocs-storagecluster-cephblockpool/test-6 --all

SNAPID  NAME                                                                                       SIZE   PROTECTED  TIMESTAMP                 NAMESPACE                                                         
  6383  .mirror.primary.62fd69cc-c42a-423e-85b8-a5ba1b352d7b.1b139844-9cfd-4945-867e-6866457562b7  1 GiB             Thu Nov 11 15:48:59 2021  mirror (primary peer_uuids:[f647fbb6-8729-4d53-8b13-d2dcb5bbb073])
  6465  .mirror.primary.62fd69cc-c42a-423e-85b8-a5ba1b352d7b.b372bccd-11ae-4cdc-99e1-e3febf35ea6f  1 GiB             Thu Nov 11 16:05:01 2021  mirror (primary peer_uuids:[f647fbb6-8729-4d53-8b13-d2dcb5bbb073])

Comment 19 Benamar Mekhissi 2021-11-12 11:31:33 UTC
I have confirmed (after my last comment) that re-adding the schedule will resume snap creation.
Note: My experiment was done strictly with the script that uses rbd commands; I didn't verify this through Ramen.

I also observed that the issue happens on the second failback as well. Here are the detailed steps:
1. Deploy to West (snaps get created at regular intervals)
2. Failover to East (snaps get created at regular intervals)
3. Relocate to West (one snap gets created after we promote, then no more)
4. Failover again to East (one snap gets created after we promote, then no more)
5. To start generating snaps again, re-add the schedule.

Comment 20 Shyamsundar 2021-11-12 14:13:11 UTC
Notes from the Triage meeting (Nov, 12th Friday 8:00 AM Eastern TZ):

- The code workaround seems to be the following:
  - [Madhu] Set the image schedule unconditionally post-promote in ceph-csi
    - Empirically determined that setting the schedule is an idempotent operation, hence repeating it unconditionally does not cause unwanted effects
  - Set the schedule to time_NOW plus a dither factor, as snapshot schedules are not started immediately by RBD; it at times takes 15-60 minutes for a 1-minute schedule to kick in
    - [Benamar] Testing the dither start time factor on the setups
    - [Sunny/Deepika] Looking at the code to determine whether there is a root cause for this
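The "time_NOW + dither" idea from the triage notes can be sketched as below. This is only an illustration of the proposed workaround, not the ceph-csi implementation; the delay and dither values are assumptions.

```python
import random
from datetime import datetime, timedelta

def dithered_start(now: datetime, base_delay_s: int = 60, dither_s: int = 120) -> datetime:
    """Pick a schedule start time slightly after 'now', with random dither,
    so a freshly (re)added schedule doesn't align with the stale one.
    base_delay_s and dither_s are illustrative values, not from the bug."""
    return now + timedelta(seconds=base_delay_s + random.randint(0, dither_s))

now = datetime(2021, 11, 12, 8, 0, 0)
start = dithered_start(now)  # somewhere in (now+60s, now+180s]
```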

Comment 26 Benamar Mekhissi 2021-11-16 23:36:38 UTC
There is a workaround for this issue. It is not attractive, but it solves the issue and is simple to implement.
It involves creating a dummy image in each cluster. Every time we failover to a cluster, we enable mirroring for that cluster's dummy image; we do the same thing when we failback. The following steps show this using rbd commands.
In these steps, we use a "test_1" image that we want to DR protect, plus two dummy images (one in each cluster) to refresh the snap schedule every time we failover/failback. The dummy images can be created either ahead of time or at the time of the first image creation. In these steps, I'll show the creation of the dummy images at the time they are needed.

```
Cluster1:
---------
 1. rbd create ocs-storagecluster-cephblockpool/test_1 --size 1G
 2. rbd mirror image enable ocs-storagecluster-cephblockpool/test_1 snapshot
 3. rbd mirror snapshot schedule add --pool ocs-storagecluster-cephblockpool --image test_1 2m
 
Create a dummy image for cluster1
-----------------------------------
 1. rbd create ocs-storagecluster-cephblockpool/dummy1 --size 1M
 Note: There is no need to enable the image for mirroring at this time

```

Test Case 1: Failover to Cluster2
==================================
```
Cluster1
---------
 1. rbd mirror image demote ocs-storagecluster-cephblockpool/test_1

Cluster2
---------
 1. rbd mirror image promote ocs-storagecluster-cephblockpool/test_1
 2. rbd create ocs-storagecluster-cephblockpool/dummy2 --size 1M
 Note: No need to enable mirroring for dummy image here as snaps for the first failover will work just fine
```

Test Case 2: Failback to Cluster1:
==================================
```
Cluster2:
---------
 1. rbd mirror image demote ocs-storagecluster-cephblockpool/test_1

Cluster1:
---------
 1. rbd mirror image promote ocs-storagecluster-cephblockpool/test_1
 2. rbd mirror image enable ocs-storagecluster-cephblockpool/dummy1 snapshot
 Note: We enabled the dummy1 image in order for the snap schedule to be refreshed
```


Test Case 3: Failover to Cluster2 again
=======================================
```
Cluster1
---------
 1. rbd mirror image demote ocs-storagecluster-cephblockpool/test_1
 2. rbd mirror image disable ocs-storagecluster-cephblockpool/dummy1
 Note: We demote the real image and disable the dummy one to make it ready for the failback

Cluster2
--------
 1. rbd mirror image promote ocs-storagecluster-cephblockpool/test_1
 2. rbd mirror image enable ocs-storagecluster-cephblockpool/dummy2 snapshot
 Note: Enabling mirroring for the dummy image will force the snap schedule to be refreshed
```

Test Case 4: Failback to Cluster1 again
========================================
```
Cluster2:
---------
 1. rbd mirror image demote ocs-storagecluster-cephblockpool/test_1
 2. rbd mirror image disable ocs-storagecluster-cephblockpool/dummy2

Cluster1:
---------
 1. rbd mirror image promote ocs-storagecluster-cephblockpool/test_1
 2. rbd mirror image enable ocs-storagecluster-cephblockpool/dummy1 snapshot
```


From this point on, every time you failover/failback, you promote the real image and enable the dummy image in the target cluster, and you demote the real image and disable the dummy one in the source cluster, as in Test Case 4.

If we create the dummy images upfront, say at the time of pool creation, then Test Case 4 is all that's needed.
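The steady-state routine above can be sketched as a hypothetical helper that emits the rbd command sequence for one switch; pool and image names are taken from the workaround steps, and the helper itself is illustrative (it only builds the command strings, it does not run them).

```python
POOL = "ocs-storagecluster-cephblockpool"

def failover_commands(image: str, src_dummy: str, dst_dummy: str):
    """Build the rbd commands for one failover/failback in the steady state:
    demote the real image and disable the dummy on the source cluster,
    then promote the real image and enable the dummy on the target cluster."""
    source = [
        f"rbd mirror image demote {POOL}/{image}",
        f"rbd mirror image disable {POOL}/{src_dummy}",
    ]
    target = [
        f"rbd mirror image promote {POOL}/{image}",
        f"rbd mirror image enable {POOL}/{dst_dummy} snapshot",
    ]
    return source, target

# Failback to Cluster1 (Test Case 4): Cluster2 is the source, Cluster1 the target.
src, dst = failover_commands("test_1", "dummy2", "dummy1")
```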

Comment 29 Benamar Mekhissi 2021-11-17 08:29:57 UTC
(In reply to Benamar Mekhissi from comment #26)



Test Case 1 needs a schedule added for cluster2; I didn't provide that above. Here it is, corrected.

Test Case 1: Failover to Cluster2
==================================
```
Cluster1
---------
 1. rbd mirror image demote ocs-storagecluster-cephblockpool/test_1

Cluster2
---------
 1. rbd mirror image promote ocs-storagecluster-cephblockpool/test_1
 2. rbd mirror snapshot schedule add --pool ocs-storagecluster-cephblockpool --image test_1 2m
 3. rbd create ocs-storagecluster-cephblockpool/dummy2 --size 1M
 Note: No need to enable mirroring for dummy image here as snaps for the first failover will work just fine
```

Comment 31 Mudit Agarwal 2022-01-27 03:09:41 UTC
Providing dev ack given that the ceph BZ is already approved for RHCS 5.1

Comment 71 errata-xmlrpc 2023-01-31 00:19:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.12.0 enhancement and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:0551

