Bug 2093690

Summary: [Tracker for Ceph BZ #2081715] [RDR] Relocate operation gets stuck with message "waiting for complete data resync to happen to target cluster" and takes ~1hour to complete for 20 Pods/PVCs
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Aman Agrawal <amagrawa>
Component: ceph
Sub component: RBD-Mirror
Assignee: Ilya Dryomov <idryomov>
QA Contact: Aman Agrawal <amagrawa>
Status: CLOSED ERRATA
Severity: urgent
Priority: unspecified
CC: bmekhiss, bniver, idryomov, kramdoss, madam, mmuench, mrajanna, muagarwa, ocs-bugs, odf-bz-bot, sheggodu, srangana, vashastr
Version: 4.10
Keywords: Regression
Target Release: ODF 4.12.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: 4.11.0-107
Doc Type: No Doc Update
Clones: 2102253 (view as bug list)
Last Closed: 2023-01-31 00:19:21 UTC
Type: Bug
Bug Depends On: 2081715    

Comment 6 Benamar Mekhissi 2022-06-06 12:02:49 UTC
Relocation cannot proceed without complete coordination. In other words, we should not experience any data loss: the DRPC makes sure the workload is not started on the PreferredCluster until all the data has been synced from the FailoverCluster to the PreferredCluster. The error above tells us that the DRPC is doing the right thing.
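That coordination can be watched from the hub by checking the DRPlacementControl conditions. A minimal sketch, assuming the workload namespace used in this test; exact condition names may vary by version:
```
# Sketch: inspect DRPC progression and conditions from the hub cluster.
oc get drplacementcontrol -A
oc -n busybox-workloads-1 get drplacementcontrol -o yaml | grep -B2 -A8 'conditions:'
```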

The DRPC was waiting for the VRG condition on C1 to show that data is protected, i.e. that the sync from C1 to C2 is complete. That did not happen within the time mentioned above. The VRG was waiting for the VR to complete the sync, but the VR was reporting the error below:
```
2022-06-05T10:37:20.567071687Z {"level":"info","timestamp":"2022-06-05T10:37:20.566Z","logger":"controllers.VolumeReplication","caller":"controller/controller.go:298","msg":"volume handle","Request.Name":"busybox-pvc-8","Request.Namespace":"busybox-workloads-1","VolumeHandleName":"0001-0011-openshift-storage-0000000000000001-ef07717d-e20a-11ec-a285-0a580a830016"}
2022-06-05T10:37:20.763626807Z {"level":"error","timestamp":"2022-06-05T10:37:20.763Z","logger":"controllers.VolumeReplication","caller":"controllers/volumereplication_controller.go:275","msg":"failed to resync volume","Request.Name":"busybox-pvc-8","Request.Namespace":"busybox-workloads-1","error":"rpc error: code = Internal desc = not resyncing. image is in \"stopped\" state"}
2022-06-05T10:37:20.763626807Z {"level":"error","timestamp":"2022-06-05T10:37:20.763Z","logger":"controllers.VolumeReplication","caller":"controller/controller.go:298","msg":"failed to Replicate","Request.Name":"busybox-pvc-8","Request.Namespace":"busybox-workloads-1","ReplicationState":"secondary","error":"rpc error: code = Internal desc = not resyncing. image is in \"stopped\" state"}
```
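The RBD image state behind that "stopped" error can also be inspected directly from the Ceph toolbox. A sketch, assuming the default ODF pool name, the rook-ceph-tools deployment, and that the image name csi-vol-ef07717d-... is derived from the volume handle above:
```
# Sketch: inspect the mirror state of the image backing busybox-pvc-8.
oc -n openshift-storage exec deploy/rook-ceph-tools -- \
  rbd mirror image status ocs-storagecluster-cephblockpool/csi-vol-ef07717d-e20a-11ec-a285-0a580a830016
# If the image stays stuck, a resync can be requested manually:
#   rbd mirror image resync ocs-storagecluster-cephblockpool/csi-vol-ef07717d-e20a-11ec-a285-0a580a830016
```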

We are investigating the VR now.

Comment 8 Shyamsundar 2022-06-06 19:51:26 UTC
Observations:

1. All VolRep resources are marked to be demoted within ~52 seconds
  - See the difference in timestamps between the first and the last initial request below

2. After ~25m 6s, 2 PVCs have still not completed the sync of their demoted snapshot
  - See the difference in timestamps between the initial request and the pending PVCs that are not yet marked Secondary below

Overall, relocate is waiting for the demoted snapshot sync to complete before starting the workload on the preferredCluster. This ensures there is no data loss due to relocation, as noted in comment #6.

The time to relocate is hence dominated by the duration of the demoted snapshot sync.
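Which PVCs are still pending can be listed from their VolumeReplication resources (named after the PVCs in this setup). A sketch, assuming the csi-addons VolumeReplication field names:
```
# Sketch: list per-PVC VolumeReplication resources with desired vs. current state.
oc -n busybox-workloads-1 get volumereplication \
  -o custom-columns=NAME:.metadata.name,DESIRED:.spec.replicationState,CURRENT:.status.state
```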

As this is 4.10, the additional metrics needed to understand the sync duration and its slowness in this situation are not available. It may be feasible to follow the sync progression using the collected RBD logs and by looking at these RBD images (see the sketch after the list):
  - csi-vol-ef44851e-e20a-11ec-a285-0a580a830016
  - csi-vol-ec8a1b56-e20a-11ec-a285-0a580a830016
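A sketch of how the sync state of those two images can be checked on a live cluster, assuming the default ODF pool name and the rook-ceph-tools toolbox:
```
for img in csi-vol-ef44851e-e20a-11ec-a285-0a580a830016 \
           csi-vol-ec8a1b56-e20a-11ec-a285-0a580a830016; do
  oc -n openshift-storage exec deploy/rook-ceph-tools -- \
    rbd mirror image status ocs-storagecluster-cephblockpool/"$img" --format json
done
```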

<more on the above in the next comment>


======================
Log: must gather c1 -> csi-rbdplugin-provisioner-5888864ccc-4bdnp current.log

First initial request (for PVC mysql-pv-claim):
2022-06-05T10:36:15.349059638Z {"level":"info","timestamp":"2022-06-05T10:36:15.348Z","logger":"controllers.VolumeReplication","caller":"controller/controller.go:298","msg":"volume handle","Request.Name":"mysql-pv-claim","Request.Namespace":"busybox-workloads-1","VolumeHandleName":"0001-0011-openshift-storage-0000000000000001-ef44851e-e20a-11ec-a285-0a580a830016"}

Last initial request (for PVC busybox-pvc-20):
2022-06-05T10:37:07.746357139Z {"level":"info","timestamp":"2022-06-05T10:37:07.746Z","logger":"controllers.VolumeReplication","caller":"controller/controller.go:298","msg":"volume handle","Request.Name":"busybox-pvc-20","Request.Namespace":"busybox-workloads-1","VolumeHandleName":"0001-0011-openshift-storage-0000000000000001-ee31915c-e20a-11ec-a285-0a580a830016"}

Pending PVCs that still are not marked Secondary (busybox-pvc-1 and mysql-pv-claim):
2022-06-05T11:01:19.337599712Z {"level":"info","timestamp":"2022-06-05T11:01:19.337Z","logger":"controllers.VolumeReplication","caller":"controller/controller.go:298","msg":"volume handle","Request.Name":"busybox-pvc-1","Request.Namespace":"busybox-workloads-1","VolumeHandleName":"0001-0011-openshift-storage-0000000000000001-ec8a1b56-e20a-11ec-a285-0a580a830016"}
2022-06-05T11:01:19.625787812Z {"level":"info","timestamp":"2022-06-05T11:01:19.625Z","logger":"controllers.VolumeReplication","caller":"controller/controller.go:298","msg":"volume is not ready to use, requeuing for resync","Request.Name":"busybox-pvc-1","Request.Namespace":"busybox-workloads-1"}
2022-06-05T11:01:21.734996080Z {"level":"info","timestamp":"2022-06-05T11:01:21.734Z","logger":"controllers.VolumeReplication","caller":"controller/controller.go:298","msg":"volume handle","Request.Name":"mysql-pv-claim","Request.Namespace":"busybox-workloads-1","VolumeHandleName":"0001-0011-openshift-storage-0000000000000001-ef44851e-e20a-11ec-a285-0a580a830016"}
2022-06-05T11:01:21.904551317Z {"level":"info","timestamp":"2022-06-05T11:01:21.904Z","logger":"controllers.VolumeReplication","caller":"controller/controller.go:298","msg":"volume is not ready to use, requeuing for resync","Request.Name":"mysql-pv-claim","Request.Namespace":"busybox-workloads-1"}
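A sketch of one way to pull these per-PVC timestamps out of the provisioner log (the log file path is illustrative):
```
# Sketch: list "volume handle" requests per PVC with their timestamps.
grep '"msg":"volume handle"' current.log \
  | cut -d' ' -f2- \
  | jq -r '[.timestamp, ."Request.Name"] | @tsv' \
  | sort
```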

Comment 9 Shyamsundar 2022-06-06 20:13:58 UTC
(In reply to Shyamsundar from comment #8)
> As this is 4.10, the additional metrics needed to understand the sync
> duration and its slowness in this situation are not available. It may be
> feasible to follow the sync progression using the collected RBD logs and
> by looking at these RBD images:
>   - csi-vol-ef44851e-e20a-11ec-a285-0a580a830016
>   - csi-vol-ec8a1b56-e20a-11ec-a285-0a580a830016

Looking at the rbd mirror image status for the above 2 images (files from the c1 must-gather in comment #3, http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/5jun22/bz2093690.zip ), they report:
    "peer_sites": [
        {
            "site_name": "7e1f194b-d347-4808-94fc-77ee3e4ad18b",
            "mirror_uuids": "55136b98-327f-4cdf-a0a8-a881583058fe",
            "state": "up+error",
            "description": "failed to unlink local peer from remote image",

As of the time the must-gather was collected, the status remained the same (i.e. "up+error"). Since relocation proceeded later (as per comment #5) without further errors, it is suspected that an intermediate RBD mirror daemon restart cleared the erroneous state, C1 was able to complete its sync of the demoted snapshot, and relocation proceeded as required.
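A sketch of how that suspicion can be checked on a live cluster: when the rbd-mirror daemon pod last (re)started, and whether the images have since left the up+error state. The app=rook-ceph-rbd-mirror label, pool name, and toolbox deployment are assumptions for a typical Rook/ODF deployment:
```
oc -n openshift-storage get pods -l app=rook-ceph-rbd-mirror \
  -o custom-columns=NAME:.metadata.name,STARTED:.status.startTime,RESTARTS:.status.containerStatuses[0].restartCount
oc -n openshift-storage exec deploy/rook-ceph-tools -- \
  rbd mirror pool status ocs-storagecluster-cephblockpool --verbose
```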

------------------------------

From the rbd mirror pod logs on c2 (file: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/5jun22/c2-ceph.tar.gz ), i.e. the daemon that should rectify the situation on c2 that is reporting the error, the restarts appear to be at these timestamps:

2022-06-05T10:49:42.190+0000 7f1767629540  1 --  start start
2022-06-05T10:49:42.193+0000 7f1767629540  1 --  start start
2022-06-05T10:49:42.222+0000 7f1767629540  1 --  start start

2022-06-05T11:49:44.151+0000 7f9f5441b540  1 --  start start
2022-06-05T11:49:44.154+0000 7f9f5441b540  1 --  start start
2022-06-05T11:49:44.192+0000 7f9f5441b540  1 --  start start
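A sketch of how these (re)starts can be located in the collected log (the file name is illustrative):
```
# Sketch: list the timestamps of rbd-mirror messenger start lines to spot daemon (re)starts.
grep ' start start' rbd-mirror.log | awk '{print $1}' | sort -u
```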

Here are other related messages from the log carrying the said error message, which stop at 2022-06-05T11:27:47.779 (the "global_id" in these messages matches the images reported above as "up+error"):

2022-06-05T11:27:47.777+0000 7f17559db700 15 rbd::mirror::ImageReplayer: 0x562605572280 [1/7d12e52b-aa82-4578-bb63-07586cbdd103] set_mirror_image_status_update: status={state=up+error, description=failed to unlink local peer from remote image, last_update=0.000000]}
2022-06-05T11:27:47.777+0000 7f17559db700 15 rbd::mirror::MirrorStatusUpdater 0x562604745680 set_mirror_image_status: global_image_id=7d12e52b-aa82-4578-bb63-07586cbdd103, mirror_image_site_status={state=up+error, description=failed to unlink local peer from remote image, last_update=0.000000]}
2022-06-05T11:27:47.777+0000 7f17559db700 15 rbd::mirror::MirrorStatusUpdater 0x562603a82120 set_mirror_image_status: global_image_id=7d12e52b-aa82-4578-bb63-07586cbdd103, mirror_image_site_status={state=up+error, description=failed to unlink local peer from remote image, last_update=0.000000]}
2022-06-05T11:27:47.779+0000 7f17551da700 15 rbd::mirror::ImageReplayer: 0x56260553a500 [1/b856aa2c-260a-476f-a24d-8a2daae150bb] set_mirror_image_status_update: status={state=up+error, description=failed to unlink local peer from remote image, last_update=0.000000]}
2022-06-05T11:27:47.779+0000 7f17551da700 15 rbd::mirror::MirrorStatusUpdater 0x562604745680 set_mirror_image_status: global_image_id=b856aa2c-260a-476f-a24d-8a2daae150bb, mirror_image_site_status={state=up+error, description=failed to unlink local peer from remote image, last_update=0.000000]}
2022-06-05T11:27:47.779+0000 7f17551da700 15 rbd::mirror::MirrorStatusUpdater 0x562603a82120 set_mirror_image_status: global_image_id=b856aa2c-260a-476f-a24d-8a2daae150bb, mirror_image_site_status={state=up+error, description=failed to unlink local peer from remote image, last_update=0.000000]}
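A sketch of how the status history of one of these images can be followed in the same log by its global image id (the file name is illustrative):
```
grep 'set_mirror_image_status' rbd-mirror.log \
  | grep '7d12e52b-aa82-4578-bb63-07586cbdd103' \
  | tail -n 20
```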

Comment 16 Ilya Dryomov 2022-06-13 19:56:18 UTC
*** Bug 2095510 has been marked as a duplicate of this bug. ***

Comment 45 errata-xmlrpc 2023-01-31 00:19:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.12.0 enhancement and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:0551