Description of problem (please be detailed as possible and provide log snippets):

[DR] When the Relocate action is performed, PVCs are stuck in the Terminating state and the workload is not moved to the preferred cluster.

Version of all relevant components (if applicable):
ODF version: 4.9.0-212.ci
OCP version: 4.9.0-0.nightly-2021-10-28-052726

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy a DR cluster
2. Deploy workloads
3. Perform failover of the workloads
4. Relocate the workloads
5. Check the PVC status and workload state

Actual results:
```
# oc get pvc -n busybox-workloads
NAME             STATUS        VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
busybox-pvc      Terminating   pvc-73e64578-88e5-4b82-bde8-4831a9187e75   1Gi        RWO            ocs-storagecluster-ceph-rbd   28h
busybox-pvc-1    Terminating   pvc-0d39ff58-2bb1-4c44-92ab-6058753a6f45   5Gi        RWO            ocs-storagecluster-ceph-rbd   28h
busybox-pvc-2    Terminating   pvc-ad7fa6ce-9dd6-4cf7-809b-59810314a4fa   8Gi        RWO            ocs-storagecluster-ceph-rbd   28h
busybox-pvc-3    Terminating   pvc-05ae2d0a-77c1-4806-a969-d876bae1a6e9   10Gi       RWO            ocs-storagecluster-ceph-rbd   28h
busybox-pvc-4    Terminating   pvc-5b31ab80-cf45-47be-9958-bc971d8dd0b7   3Gi        RWO            ocs-storagecluster-ceph-rbd   28h
busybox-pvc-5    Terminating   pvc-209e19ac-6f2e-4b4c-864b-f150a3d61d27   8Gi        RWO            ocs-storagecluster-ceph-rbd   28h
mysql-pv-claim   Terminating   pvc-76502a33-701c-423c-a097-1d1078a29fd6   24Gi       RWO            ocs-storagecluster-ceph-rbd   28h
```

Expected results:
Workloads should be relocated to the preferred cluster with no issues.

Additional info:
The user requested that the workload be relocated to the preferred cluster. In summary, during relocation the VRG requests the VR to move its state to secondary. As part of that process, we demote the image, resync if necessary, and then report success. Our investigation revealed that the primary image was deleted by a resync operation: the VR requested a resync due to a split-brain error, as seen in the VR log snippet from the primary. Of the 7 PVCs, the image for the mysql-pv-claim PVC failed with the split-brain error.

```
"2021-11-02T11:39:10.473Z","logger":"controllers.VolumeReplication","caller":"controllers/volumereplication_controller.go:164","msg":"adding finalizer to PersistentVolumeClaim object","Request.Name":"mysql-pv-claim","Request.Namespace":"busybox-workloads","Finalizer":"replication.storage.openshift.io/pvc-protection"}
{"level":"info","timestamp":"2021-11-02T11:39:11.490Z","logger":"controllers.VolumeReplication","caller":"controller/controller.go:298","msg":"volume is marked primary","Request.Name":"mysql-pv-claim","Request.Namespace":"busybox-workloads"}
"2021-11-02T13:56:21.002Z","logger":"controllers.VolumeReplication","caller":"controller/controller.go:298","msg":"volume handle","Request.Name":"mysql-pv-claim","Request.Namespace":"busybox-workloads","VolumeHandleName":"0001-0011-openshift-storage-0000000000000001-f038968b-389c-11ec-bca9-0a580a800241"}
{"level":"info","timestamp":"2021-11-02T13:56:22.384Z","logger":"controllers.VolumeReplication","caller":"controller/controller.go:298","msg":"volume is not ready to use","Request.Name":"mysql-pv-claim","Request.Namespace":"busybox-workloads"}
"2021-11-02T13:56:37.875Z","logger":"controllers.VolumeReplication","caller":"controllers/volumereplication_controller.go:236","msg":"failed to resync volume","Request.Name":"mysql-pv-claim","Request.Namespace":"busybox-workloads","error":"rpc error: code = Unavailable desc = awaiting initial resync due to split brain"}
```

We believe that the
resync ended up deleting the local image, and the VR ended up in a "kind of" deadlock, waiting for the image to reappear with no further progress possible, since there is no primary.

```
"2021-11-02T13:57:13.199Z","logger":"controllers.VolumeReplication","caller":"controllers/volumereplication_controller.go:198","msg":"failed to enable volume replication","Request.Name":"mysql-pv-claim","Request.Namespace":"busybox-workloads","error":"rpc error: code = NotFound desc = volume 0001-0011-openshift-storage-0000000000000001-f038968b-389c-11ec-bca9-0a580a800241 not found"}
```

The rbd-mirror log shows the following:

```
debug 2021-11-02T13:56:35.104+0000 7fa98c2ac700 -1 rbd::mirror::image_replayer::snapshot::Replayer: 0x564bbfe2b400 scan_remote_mirror_snapshots: split-brain detected: failed to find matching non-primary snapshot in remote image: local_snap_id_start=16491, local_snap_ns=[mirror state=primary (demoted), complete=1, mirror_peer_uuids=d3aaa712-4bc9-4aef-967e-56dcf3ae3585, primary_mirror_uuid=, primary_snap_id=head, last_copied_object_number=0, snap_seqs={}]
```

The error is "failed to find matching non-primary snapshot in remote image". Running

```
rbd snap ls csi-vol-f038968b-389c-11ec-bca9-0a580a800241 --pool=ocs-storagecluster-cephblockpool
```

shows no snapshots on either the remote cluster or the local cluster. Next, we will investigate RBD to determine the reason behind the split-brain error.

One last thing to note: on the primary cluster, the VR requested a resync due to the split-brain error (which ended up deleting the primary image). We believe that's wrong; a resync should be requested from the secondary cluster, not the primary.

-- Log files attached ---
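The asymmetry called out above (a resync must come from the secondary side, never the primary) can be condensed into a guard condition. The following is an illustrative sketch only, not ceph-csi or csi-addons source; the function and parameter names are hypothetical.

```python
# Illustrative sketch of the safety rule argued above. Not actual ceph-csi
# code; all names here are hypothetical.

def is_resync_safe(local_is_primary: bool, peer_primary_exists: bool) -> bool:
    """A resync discards the local image data and re-copies it from the primary.

    It is therefore only safe when the local image is a secondary (otherwise
    the resync deletes the last authoritative copy) and a primary exists on
    a peer cluster to sync from.
    """
    return (not local_is_primary) and peer_primary_exists


# The failure mode in this bug: the resync was honored on the (demoted)
# primary side with no peer primary available, so the image was deleted
# and nothing was left to resync from.
print(is_resync_safe(local_is_primary=True, peer_primary_exists=False))   # False
print(is_resync_safe(local_is_primary=False, peer_primary_exists=True))   # True
```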
Created attachment 1839887 [details] rbd-mirror.log (from primary)
Created attachment 1839888 [details] VR log from Primary
Created attachment 1839889 [details] VR log from secondary
Created attachment 1839890 [details] rbdplugin log from primary
Created attachment 1839891 [details] csi-rbdplugin from secondary
Adding one more observation from the csi-rbdplugin logs on the current primary (comment #6):

The following request is the one that detects the split-brain report from RBD and issues a resync, which ends up deleting the image with no known primary to sync from.

```
I1102 13:56:37.529982 1 utils.go:177] ID: 6096 Req-ID: 0001-0011-openshift-storage-0000000000000001-f038968b-389c-11ec-bca9-0a580a800241 GRPC call: /replication.Controller/ResyncVolume
I1102 13:56:37.530223 1 utils.go:179] ID: 6096 Req-ID: 0001-0011-openshift-storage-0000000000000001-f038968b-389c-11ec-bca9-0a580a800241 GRPC request: {"parameters":{"mirroringMode":"snapshot","schedulingInterval":"5m"},"secrets":"***stripped***","volume_id":"0001-0011-openshift-storage-0000000000000001-f038968b-389c-11ec-bca9-0a580a800241"}
I1102 13:56:37.539866 1 omap.go:87] ID: 6096 Req-ID: 0001-0011-openshift-storage-0000000000000001-f038968b-389c-11ec-bca9-0a580a800241 got omap values: (pool="ocs-storagecluster-cephblockpool", namespace="", name="csi.volume.f038968b-389c-11ec-bca9-0a580a800241"): map[csi.imageid:63cb7d9a035e csi.imagename:csi-vol-f038968b-389c-11ec-bca9-0a580a800241 csi.volname:pvc-76502a33-701c-423c-a097-1d1078a29fd6 csi.volume.owner:busybox-workloads]
I1102 13:56:37.611589 1 replicationcontrollerserver.go:612] ID: 6096 Req-ID: 0001-0011-openshift-storage-0000000000000001-f038968b-389c-11ec-bca9-0a580a800241 local status: daemon up=true, image mirroring state="error", description="split-brain" and lastUpdate=2021-11-02 13:56:35 +0000 UTC
```

===> NOTE: The response from RBD on the image status reads "up+error" with a description of split-brain, causing the resync to be issued.

```
E1102 13:56:37.874855 1 utils.go:186] ID: 6096 Req-ID: 0001-0011-openshift-storage-0000000000000001-f038968b-389c-11ec-bca9-0a580a800241 GRPC error: rpc error: code = Unavailable desc = awaiting initial resync due to split brain
```

Further notes:
- As the image was demoted and later promoted on a peer cluster, the subsequent demote on the peer cluster should not result in a split-brain report from RBD. The status should ideally have been up+stopped (after syncing the last snapshot).
- As Benamar points out in comment #2, the rbd-mirror daemon decides the image is in split-brain due to missing snapshots, which seems to be the root cause of the problem here.
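The trigger visible in the log above (image status state "error" with a split-brain description leading straight to a resync) can be condensed into a small predicate. This is a hedged illustration of the observed behavior, not the actual csi-rbdplugin logic; the names are made up.

```python
# Condensed illustration of the trigger observed in the logs above.
# Hypothetical names; not the real replicationcontrollerserver.go code.

def triggers_resync(mirror_state: str, description: str) -> bool:
    """The logs show ResyncVolume proceeding once the local image status
    reports state "error" with a "split-brain" description."""
    return mirror_state == "error" and "split-brain" in description


# Matches the status seen at 13:56:37 on the primary:
print(triggers_resync("error", "split-brain"))                # True
# A healthy post-demote status would not trip the resync:
print(triggers_resync("stopped", "local image is primary"))   # False
```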
For doc text confirmation.
Pratik, please confirm that this works. It can be tested with 4.10.
So this issue is no longer reproducible? Removing the test blocker flag. If this issue is no longer reproducible, please close it as such and reopen it if it occurs again.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.