Bug 2093690 - [Tracker for Ceph BZ #2081715] [RDR] Relocate operation gets stuck with message "waiting for complete data resync to happen to target cluster" and takes ~1hour to complete for 20 Pods/PVCs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ODF 4.12.0
Assignee: Ilya Dryomov
QA Contact: Aman Agrawal
URL:
Whiteboard:
Duplicates: 2095510
Depends On: 2081715
Blocks:
 
Reported: 2022-06-05 11:13 UTC by Aman Agrawal
Modified: 2023-08-09 16:37 UTC
CC List: 13 users

Fixed In Version: 4.11.0-107
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 2102253
Environment:
Last Closed: 2023-01-31 00:19:21 UTC
Embargoed:




Links:
Red Hat Product Errata RHBA-2023:0551 (last updated 2023-01-31 00:20:04 UTC)

Comment 6 Benamar Mekhissi 2022-06-06 12:02:49 UTC
Relocation cannot proceed without complete coordination. In other words, we should not experience any data loss, as the DRPC makes sure that the workload is not started on the PreferredCluster until all the data has been synced from the FailoverCluster to the PreferredCluster. The error above tells us that the DRPC is doing the right thing.
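
For reference, this wait can be observed from the cluster APIs. A minimal sketch, assuming the workload namespace busybox-workloads-1 and an illustrative DRPC name busybox-drpc (the actual DRPC name and the drpc short name may differ by environment):
```
# On the hub cluster: inspect the DRPC phase/conditions driving the relocate
# ("busybox-drpc" is an assumed, illustrative name).
oc get drpc busybox-drpc -n busybox-workloads-1 -o yaml

# On the managed cluster C1: inspect the VolumeReplicationGroup conditions
# (the "data protected" condition is what the DRPC is waiting on here).
oc get volumereplicationgroup -n busybox-workloads-1 -o yaml
```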

The DRPC was waiting for the VRG condition on C1 to show that the data is protected, in other words, that the sync from C1 to C2 is complete. That did not happen within the time mentioned above. The VRG was waiting for the VR to complete the sync, but the VR was showing the error below:
```
2022-06-05T10:37:20.567071687Z {"level":"info","timestamp":"2022-06-05T10:37:20.566Z","logger":"controllers.VolumeReplication","caller":"controller/controller.go:298","msg":"volume handle","Request.Name":"busybox-pvc-8","Request.Namespace":"busybox-workloads-1","VolumeHandleName":"0001-0011-openshift-storage-0000000000000001-ef07717d-e20a-11ec-a285-0a580a830016"}
2022-06-05T10:37:20.763626807Z {"level":"error","timestamp":"2022-06-05T10:37:20.763Z","logger":"controllers.VolumeReplication","caller":"controllers/volumereplication_controller.go:275","msg":"failed to resync volume","Request.Name":"busybox-pvc-8","Request.Namespace":"busybox-workloads-1","error":"rpc error: code = Internal desc = not resyncing. image is in \"stopped\" state"}
2022-06-05T10:37:20.763626807Z {"level":"error","timestamp":"2022-06-05T10:37:20.763Z","logger":"controllers.VolumeReplication","caller":"controller/controller.go:298","msg":"failed to Replicate","Request.Name":"busybox-pvc-8","Request.Namespace":"busybox-workloads-1","ReplicationState":"secondary","error":"rpc error: code = Internal desc = not resyncing. image is in \"stopped\" state"}
```
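
To see which PVCs are stuck like this, the per-PVC VolumeReplication resources in the workload namespace can be listed; a minimal sketch (the VR resources are named after their PVCs, as in the logs above):
```
# On C1, list the VolumeReplication CRs for the workload and check which
# ones have not reached a healthy/completed state.
oc get volumereplication -n busybox-workloads-1

# Inspect the conditions of one stuck PVC (busybox-pvc-8 is the one
# reporting the error above).
oc describe volumereplication busybox-pvc-8 -n busybox-workloads-1
```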

We are investigating the VR now.

Comment 8 Shyamsundar 2022-06-06 19:51:26 UTC
Observations:

1. All VolRep resources are marked to be demoted in ~52 seconds
  - See the difference in timestamps between the first and the last initial request below

2. After ~25m 6s there are still 2 PVCs yet to complete the sync of their demoted snapshot
  - See the difference in timestamps between the initial request and the pending PVCs that are not yet marked Secondary below

Overall, relocate is waiting for the demoted snapshot sync to complete before starting the workload on the preferredCluster. This is to ensure there is no data loss due to relocation, as explained in comment #6.

The time to relocate is long because of the wait for the demoted snapshot sync to complete (or for an equivalent report that it has completed).

As this is 4.10, additional metrics are not available to understand the sync duration and why it is slow in this situation. It may be feasible to understand the sync progression using the RBD logs that were collected and by looking at the RBD images:
  - csi-vol-ef44851e-e20a-11ec-a285-0a580a830016
  - csi-vol-ec8a1b56-e20a-11ec-a285-0a580a830016

<more on the above in the next comment>


======================
Log: must gather c1 -> csi-rbdplugin-provisioner-5888864ccc-4bdnp current.log

First initial request (for PVC mysql-pv-claim):
2022-06-05T10:36:15.349059638Z {"level":"info","timestamp":"2022-06-05T10:36:15.348Z","logger":"controllers.VolumeReplication","caller":"controller/controller.go:298","msg":"volume handle","Request.Name":"mysql-pv-claim","Request.Namespace":"busybox-workloads-1","VolumeHandleName":"0001-0011-openshift-storage-0000000000000001-ef44851e-e20a-11ec-a285-0a580a830016"}

Last initial request (for PVC busybox-pvc-20):
2022-06-05T10:37:07.746357139Z {"level":"info","timestamp":"2022-06-05T10:37:07.746Z","logger":"controllers.VolumeReplication","caller":"controller/controller.go:298","msg":"volume handle","Request.Name":"busybox-pvc-20","Request.Namespace":"busybox-workloads-1","VolumeHandleName":"0001-0011-openshift-storage-0000000000000001-ee31915c-e20a-11ec-a285-0a580a830016"}

Pending PVCs that still are not marked Secondary (busybox-pvc-1 and mysql-pv-claim):
2022-06-05T11:01:19.337599712Z {"level":"info","timestamp":"2022-06-05T11:01:19.337Z","logger":"controllers.VolumeReplication","caller":"controller/controller.go:298","msg":"volume handle","Request.Name":"busybox-pvc-1","Request.Namespace":"busybox-workloads-1","VolumeHandleName":"0001-0011-openshift-storage-0000000000000001-ec8a1b56-e20a-11ec-a285-0a580a830016"}
2022-06-05T11:01:19.625787812Z {"level":"info","timestamp":"2022-06-05T11:01:19.625Z","logger":"controllers.VolumeReplication","caller":"controller/controller.go:298","msg":"volume is not ready to use, requeuing for resync","Request.Name":"busybox-pvc-1","Request.Namespace":"busybox-workloads-1"}
2022-06-05T11:01:21.734996080Z {"level":"info","timestamp":"2022-06-05T11:01:21.734Z","logger":"controllers.VolumeReplication","caller":"controller/controller.go:298","msg":"volume handle","Request.Name":"mysql-pv-claim","Request.Namespace":"busybox-workloads-1","VolumeHandleName":"0001-0011-openshift-storage-0000000000000001-ef44851e-e20a-11ec-a285-0a580a830016"}
2022-06-05T11:01:21.904551317Z {"level":"info","timestamp":"2022-06-05T11:01:21.904Z","logger":"controllers.VolumeReplication","caller":"controller/controller.go:298","msg":"volume is not ready to use, requeuing for resync","Request.Name":"mysql-pv-claim","Request.Namespace":"busybox-workloads-1"}

Comment 9 Shyamsundar 2022-06-06 20:13:58 UTC
(In reply to Shyamsundar from comment #8)
> As this is 4.10, additional metrics are not available to understand the sync
> duration and why it is slow in this situation. It may be feasible to understand
> the sync progression using the RBD logs that were collected and by looking at
> the RBD images:
>   - csi-vol-ef44851e-e20a-11ec-a285-0a580a830016
>   - csi-vol-ec8a1b56-e20a-11ec-a285-0a580a830016

Looking at the rbd mirror image status information for the above 2 images (files from the c1 must-gather in comment #3, http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/5jun22/bz2093690.zip), it is seen that they are reporting:
    "peer_sites": [
        {
            "site_name": "7e1f194b-d347-4808-94fc-77ee3e4ad18b",
            "mirror_uuids": "55136b98-327f-4cdf-a0a8-a881583058fe",
            "state": "up+error",
            "description": "failed to unlink local peer from remote image",

As of the instant the must-gather was collected, the status remained the same (i.e. "up+error"). As relocation later proceeded without further errors (as per comment #5), it is suspected that an intermediate rbd-mirror daemon restart cleared the erroneous state, C1 was able to complete its sync of the demoted snapshot, and relocation was able to proceed as required.
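
For reference, the status above can be retrieved from the Ceph toolbox pod on C1 with something like the following. This is a hedged sketch: it assumes the rook-ceph-tools deployment is present and that the images live in the usual ODF default pool ocs-storagecluster-cephblockpool.
```
# Open a shell in the toolbox pod (assumes the toolbox deployment exists).
oc rsh -n openshift-storage deployment/rook-ceph-tools

# From inside the toolbox shell, query mirroring status for the two images
# from comment #8 (pool name is assumed; adjust to the actual CephBlockPool).
rbd mirror image status ocs-storagecluster-cephblockpool/csi-vol-ef44851e-e20a-11ec-a285-0a580a830016 --format json --pretty-format
rbd mirror image status ocs-storagecluster-cephblockpool/csi-vol-ec8a1b56-e20a-11ec-a285-0a580a830016 --format json --pretty-format
```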

------------------------------

From the rbd-mirror pod logs on c2 (file: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/5jun22/c2-ceph.tar.gz), that being the daemon that should rectify the situation on c2, which is reporting the error, the restarts appear to be at these timestamps:

2022-06-05T10:49:42.190+0000 7f1767629540  1 --  start start
2022-06-05T10:49:42.193+0000 7f1767629540  1 --  start start
2022-06-05T10:49:42.222+0000 7f1767629540  1 --  start start

2022-06-05T11:49:44.151+0000 7f9f5441b540  1 --  start start
2022-06-05T11:49:44.154+0000 7f9f5441b540  1 --  start start
2022-06-05T11:49:44.192+0000 7f9f5441b540  1 --  start start
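
For completeness, the rbd-mirror daemon restart that is suspected to have cleared the error can also be forced manually. A hedged sketch, assuming the default Rook deployment name rook-ceph-rbd-mirror-a in the openshift-storage namespace (verify the name first):
```
# On c2, find the rbd-mirror deployment and restart it
# (the deployment name below is an assumption).
oc get deployment -n openshift-storage | grep rbd-mirror
oc rollout restart deployment/rook-ceph-rbd-mirror-a -n openshift-storage
```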

Here are other related messages from the log carrying the said error; they stop at 2022-06-05T11:27:47.779 (the "global_image_id" in these messages matches the images that are reported above as "up+error"):

2022-06-05T11:27:47.777+0000 7f17559db700 15 rbd::mirror::ImageReplayer: 0x562605572280 [1/7d12e52b-aa82-4578-bb63-07586cbdd103] set_mirror_image_status_update: status={state=up+error, description=failed to unlink local peer from remote image, last_update=0.000000]}
2022-06-05T11:27:47.777+0000 7f17559db700 15 rbd::mirror::MirrorStatusUpdater 0x562604745680 set_mirror_image_status: global_image_id=7d12e52b-aa82-4578-bb63-07586cbdd103, mirror_image_site_status={state=up+error, description=failed to unlink local peer from remote image, last_update=0.000000]}
2022-06-05T11:27:47.777+0000 7f17559db700 15 rbd::mirror::MirrorStatusUpdater 0x562603a82120 set_mirror_image_status: global_image_id=7d12e52b-aa82-4578-bb63-07586cbdd103, mirror_image_site_status={state=up+error, description=failed to unlink local peer from remote image, last_update=0.000000]}
2022-06-05T11:27:47.779+0000 7f17551da700 15 rbd::mirror::ImageReplayer: 0x56260553a500 [1/b856aa2c-260a-476f-a24d-8a2daae150bb] set_mirror_image_status_update: status={state=up+error, description=failed to unlink local peer from remote image, last_update=0.000000]}
2022-06-05T11:27:47.779+0000 7f17551da700 15 rbd::mirror::MirrorStatusUpdater 0x562604745680 set_mirror_image_status: global_image_id=b856aa2c-260a-476f-a24d-8a2daae150bb, mirror_image_site_status={state=up+error, description=failed to unlink local peer from remote image, last_update=0.000000]}
2022-06-05T11:27:47.779+0000 7f17551da700 15 rbd::mirror::MirrorStatusUpdater 0x562603a82120 set_mirror_image_status: global_image_id=b856aa2c-260a-476f-a24d-8a2daae150bb, mirror_image_site_status={state=up+error, description=failed to unlink local peer from remote image, last_update=0.000000]}

Comment 16 Ilya Dryomov 2022-06-13 19:56:18 UTC
*** Bug 2095510 has been marked as a duplicate of this bug. ***

Comment 45 errata-xmlrpc 2023-01-31 00:19:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.12.0 enhancement and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:0551

