Bug 2093690
Summary: [Tracker for Ceph BZ #2081715] [RDR] Relocate operation gets stuck with message "waiting for complete data resync to happen to target cluster" and takes ~1 hour to complete for 20 Pods/PVCs

| Field | Value | Field | Value |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Aman Agrawal <amagrawa> |
| Component: | ceph | Assignee: | Ilya Dryomov <idryomov> |
| Ceph sub component: | RBD-Mirror | QA Contact: | Aman Agrawal <amagrawa> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | urgent | Priority: | unspecified |
| CC: | bmekhiss, bniver, idryomov, kramdoss, madam, mmuench, mrajanna, muagarwa, ocs-bugs, odf-bz-bot, sheggodu, srangana, vashastr | Keywords: | Regression |
| Version: | 4.10 | Target Milestone: | --- |
| Target Release: | ODF 4.12.0 | Hardware: | Unspecified |
| OS: | Unspecified | Whiteboard: | |
| Fixed In Version: | 4.11.0-107 | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Clones: | 2102253 (view as bug list) |
| Environment: | | Last Closed: | 2023-01-31 00:19:21 UTC |
| Type: | Bug | Regression: | --- |
| Mount Type: | --- | Documentation: | --- |
| CRM: | | Verified Versions: | |
| Category: | --- | oVirt Team: | --- |
| RHEL 7.3 requirements from Atomic Host: | | Cloudforms Team: | --- |
| Target Upstream Version: | | Embargoed: | |
| Bug Depends On: | 2081715 | Bug Blocks: | |
Comment 6
Benamar Mekhissi
2022-06-06 12:02:49 UTC
Observations:

1. All VolRep resources are marked to be demoted in ~52 seconds - see the difference in timestamps between the first initial request and the last initial request below.
2. After ~25m 06s there are still 2 PVCs that have yet to complete the sync of their demoted snapshot - see the difference in timestamps between the initial request and the pending PVCs that are not yet marked Secondary below.

Overall, relocate waits for the demoted snapshot sync to complete before starting the workload on the preferredCluster. This ensures there is no data loss due to relocation, as noted in comment #6.

The time to relocate is slow because of the wait for the demoted snapshot sync to finish (or an equivalent report). As this is 4.10, the additional metrics needed to understand the sync duration and its slowness in this situation are not available. It may be feasible to understand the sync progression using the RBD logs that were collected, looking at the RBD images:
- csi-vol-ef44851e-e20a-11ec-a285-0a580a830016
- csi-vol-ec8a1b56-e20a-11ec-a285-0a580a830016
<more on the above in the next comment>

======================

Log: must gather c1 -> csi-rbdplugin-provisioner-5888864ccc-4bdnp current.log

First initial request (for PVC mysql-pv-claim):

2022-06-05T10:36:15.349059638Z {"level":"info","timestamp":"2022-06-05T10:36:15.348Z","logger":"controllers.VolumeReplication","caller":"controller/controller.go:298","msg":"volume handle","Request.Name":"mysql-pv-claim","Request.Namespace":"busybox-workloads-1","VolumeHandleName":"0001-0011-openshift-storage-0000000000000001-ef44851e-e20a-11ec-a285-0a580a830016"}

Last initial request (for PVC busybox-pvc-20):

2022-06-05T10:37:07.746357139Z {"level":"info","timestamp":"2022-06-05T10:37:07.746Z","logger":"controllers.VolumeReplication","caller":"controller/controller.go:298","msg":"volume handle","Request.Name":"busybox-pvc-20","Request.Namespace":"busybox-workloads-1","VolumeHandleName":"0001-0011-openshift-storage-0000000000000001-ee31915c-e20a-11ec-a285-0a580a830016"}

Pending PVCs that are still not marked Secondary (busybox-pvc-1 and mysql-pv-claim):

2022-06-05T11:01:19.337599712Z {"level":"info","timestamp":"2022-06-05T11:01:19.337Z","logger":"controllers.VolumeReplication","caller":"controller/controller.go:298","msg":"volume handle","Request.Name":"busybox-pvc-1","Request.Namespace":"busybox-workloads-1","VolumeHandleName":"0001-0011-openshift-storage-0000000000000001-ec8a1b56-e20a-11ec-a285-0a580a830016"}
2022-06-05T11:01:19.625787812Z {"level":"info","timestamp":"2022-06-05T11:01:19.625Z","logger":"controllers.VolumeReplication","caller":"controller/controller.go:298","msg":"volume is not ready to use, requeuing for resync","Request.Name":"busybox-pvc-1","Request.Namespace":"busybox-workloads-1"}
2022-06-05T11:01:21.734996080Z {"level":"info","timestamp":"2022-06-05T11:01:21.734Z","logger":"controllers.VolumeReplication","caller":"controller/controller.go:298","msg":"volume handle","Request.Name":"mysql-pv-claim","Request.Namespace":"busybox-workloads-1","VolumeHandleName":"0001-0011-openshift-storage-0000000000000001-ef44851e-e20a-11ec-a285-0a580a830016"}
2022-06-05T11:01:21.904551317Z {"level":"info","timestamp":"2022-06-05T11:01:21.904Z","logger":"controllers.VolumeReplication","caller":"controller/controller.go:298","msg":"volume is not ready to use, requeuing for resync","Request.Name":"mysql-pv-claim","Request.Namespace":"busybox-workloads-1"}
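For reference, a minimal sketch (not taken from the must-gather itself) of how the timestamps above could be pulled out of the csi-rbdplugin-provisioner current.log to gauge demotion/sync progress; the grep invocations are illustrative assumptions, while the message strings match the VolumeReplication controller log lines quoted above.

```sh
# All reconciles that resolved a volume handle, in order; the first entry corresponds to
# the first initial request (10:36:15) and scanning the list shows the last initial
# request (10:37:07), giving the ~52s window for marking all VolReps for demotion.
grep '"msg":"volume handle"' current.log

# Reconciles for PVCs whose volume is still not ready to use (i.e. still not Secondary),
# such as busybox-pvc-1 and mysql-pv-claim at 11:01.
grep 'requeuing for resync' current.log | tail -n 5
```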
(In reply to Shyamsundar from comment #8)
> As this is 4.10, the additional metrics are not available to understand the sync
> duration and its slowness in this situation. It may be feasible to understand
> the sync progression using the RBD logs that are collected and looking at the
> RBD images:
> - csi-vol-ef44851e-e20a-11ec-a285-0a580a830016
> - csi-vol-ec8a1b56-e20a-11ec-a285-0a580a830016

Looking at the rbd mirror image status information for the above 2 images, it is seen that they are reporting (files from the c1 must-gather in comment #3, http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/5jun22/bz2093690.zip ):

  "peer_sites": [
    {
      "site_name": "7e1f194b-d347-4808-94fc-77ee3e4ad18b",
      "mirror_uuids": "55136b98-327f-4cdf-a0a8-a881583058fe",
      "state": "up+error",
      "description": "failed to unlink local peer from remote image",

As of the instant the must-gather was collected, the status remained the same (i.e. "up+error"). As relocation proceeded later (as per comment #5) without further errors, it is suspected that an intermediate RBD mirror daemon restart cleared the erroneous state, after which C1 was able to complete its sync of the demoted snapshot and relocation was able to proceed as required.

------------------------------

From the rbd mirror pod logs in c2 (file: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/5jun22/c2-ceph.tar.gz ), which is the cluster that should rectify the situation it is reporting the error for, the restarts appear to be at these timestamps:

2022-06-05T10:49:42.190+0000 7f1767629540 1 -- start start
2022-06-05T10:49:42.193+0000 7f1767629540 1 -- start start
2022-06-05T10:49:42.222+0000 7f1767629540 1 -- start start
2022-06-05T11:49:44.151+0000 7f9f5441b540 1 -- start start
2022-06-05T11:49:44.154+0000 7f9f5441b540 1 -- start start
2022-06-05T11:49:44.192+0000 7f9f5441b540 1 -- start start

Here are other related messages from the log carrying the said error message, which stop at 2022-06-05T11:27:47.779 (the "global_id" in these messages matches the images reported above as "up+error"):

2022-06-05T11:27:47.777+0000 7f17559db700 15 rbd::mirror::ImageReplayer: 0x562605572280 [1/7d12e52b-aa82-4578-bb63-07586cbdd103] set_mirror_image_status_update: status={state=up+error, description=failed to unlink local peer from remote image, last_update=0.000000]}
2022-06-05T11:27:47.777+0000 7f17559db700 15 rbd::mirror::MirrorStatusUpdater 0x562604745680 set_mirror_image_status: global_image_id=7d12e52b-aa82-4578-bb63-07586cbdd103, mirror_image_site_status={state=up+error, description=failed to unlink local peer from remote image, last_update=0.000000]}
2022-06-05T11:27:47.777+0000 7f17559db700 15 rbd::mirror::MirrorStatusUpdater 0x562603a82120 set_mirror_image_status: global_image_id=7d12e52b-aa82-4578-bb63-07586cbdd103, mirror_image_site_status={state=up+error, description=failed to unlink local peer from remote image, last_update=0.000000]}
2022-06-05T11:27:47.779+0000 7f17551da700 15 rbd::mirror::ImageReplayer: 0x56260553a500 [1/b856aa2c-260a-476f-a24d-8a2daae150bb] set_mirror_image_status_update: status={state=up+error, description=failed to unlink local peer from remote image, last_update=0.000000]}
2022-06-05T11:27:47.779+0000 7f17551da700 15 rbd::mirror::MirrorStatusUpdater 0x562604745680 set_mirror_image_status: global_image_id=b856aa2c-260a-476f-a24d-8a2daae150bb, mirror_image_site_status={state=up+error, description=failed to unlink local peer from remote image, last_update=0.000000]}
2022-06-05T11:27:47.779+0000 7f17551da700 15 rbd::mirror::MirrorStatusUpdater 0x562603a82120 set_mirror_image_status: global_image_id=b856aa2c-260a-476f-a24d-8a2daae150bb, mirror_image_site_status={state=up+error, description=failed to unlink local peer from remote image, last_update=0.000000]}
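A minimal sketch of how the mirror state above could be re-checked and how the suspected daemon restart could be triggered manually; the pool name and the pod label selector below are assumptions for illustration, not commands taken from this case.

```sh
# Assumed pool name: ocs-storagecluster-cephblockpool (run from the Ceph toolbox).
# Check the mirror status of the two affected images; a healthy image should no longer
# show "up+error" / "failed to unlink local peer from remote image".
rbd mirror image status ocs-storagecluster-cephblockpool/csi-vol-ef44851e-e20a-11ec-a285-0a580a830016
rbd mirror image status ocs-storagecluster-cephblockpool/csi-vol-ec8a1b56-e20a-11ec-a285-0a580a830016

# Restart the rbd-mirror daemon pod on the cluster reporting "up+error" (c2); the label
# selector is an assumption based on the usual Rook naming.
oc -n openshift-storage delete pod -l app=rook-ceph-rbd-mirror
```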
*** Bug 2095510 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.12.0 enhancement and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:0551