Description of problem (please be as detailed as possible and provide log snippets):

A mirror-enabled RBD image is left in split-brain on the old primary after being marked Secondary in its VolumeReplication CR, even though the VolumeReplication CR reports "Degraded: true" and "Resyncing: true". This occurs after the image has been force-promoted on the peer cluster, which is why a resync is required to avoid the split-brain.

Order of operations:
- VolRep created as primary on cluster "one" (clusters are named as per the attached log files)
- Later, VolRep created as primary on cluster "two"
- This force-promotes the volume on "two"
- VolRep on "one" marked Secondary
- At this point the image technically enters split-brain, as the remote is also primary, and a resync should be issued once this is detected (an rbd CLI sketch of this sequence follows the image status output below)
- VolRep reports "Degraded: true" and "Resyncing: true", indicating the image is now resyncing
- The image is nevertheless still left in split-brain on "one"

RBD image name investigated: ocs-storagecluster-cephblockpool/csi-vol-97e46d10-25cf-11ec-9fa9-0a580a8103a8
PVC name: busybox-pvc-2
PV name: pvc-6a03c816-27a2-43cf-9486-2e1d74e8276c

RBD mirror image status on cluster "one", where the volume was previously primary and has been demoted to Secondary (the volume split-brained because it was force-promoted on the remote peer):

bash-4.4$ rbd mirror image status ocs-storagecluster-cephblockpool/csi-vol-97e46d10-25cf-11ec-9fa9-0a580a8103a8
csi-vol-97e46d10-25cf-11ec-9fa9-0a580a8103a8:
  global_id:   a991b001-0470-492e-93d5-2fd0ccfa4063
  state:       up+stopping_replay
  description: split-brain
  last_update: 2021-10-05 13:08:02
  peer_sites:
    name: afc6aaac-199c-472e-bf35-390eb2799b3e
    state: up+stopped
    description: local image is primary
    last_update: 2021-10-05 13:07:44
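For reference, a minimal sketch of the rbd CLI sequence that roughly corresponds to the order of operations above, plus the manual resync that would be expected to clear the split-brain. This is not taken from the attached logs; it assumes the commands are run against the respective Ceph clusters (for example from the rook-ceph toolbox) and uses the image name investigated in this report:

# On cluster "two": force-promote the non-primary copy (what creating a primary
# VolRep on "two" effectively does while "one" is still primary).
rbd mirror image promote --force ocs-storagecluster-cephblockpool/csi-vol-97e46d10-25cf-11ec-9fa9-0a580a8103a8

# On cluster "one": demote the old primary (what marking the VolRep Secondary does).
rbd mirror image demote ocs-storagecluster-cephblockpool/csi-vol-97e46d10-25cf-11ec-9fa9-0a580a8103a8

# On cluster "one": request a resync from the peer; this is the step that should
# clear the split-brain and is what "Resyncing: true" implies was issued.
rbd mirror image resync ocs-storagecluster-cephblockpool/csi-vol-97e46d10-25cf-11ec-9fa9-0a580a8103a8

# On cluster "one": verify; the description should move away from "split-brain"
# once replay restarts.
rbd mirror image status ocs-storagecluster-cephblockpool/csi-vol-97e46d10-25cf-11ec-9fa9-0a580a8103a8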
{"level":"info","timestamp":"2021-10-05T12:20:51.661Z","logger":"controllers.VolumeReplication","caller":"controller/controller.go:298","msg":"volume is not ready to use","Request.Name":"busybox-pvc-2","Request.Namespace":"busybox-pratik"} {"level":"info","timestamp":"2021-10-05T12:21:15.562Z","logger":"controllers.VolumeReplication","caller":"controller/controller.go:298","msg":"volume handle","Request.Name":"busybox-pvc-2","Request.Namespace":"busybox-pratik","VolumeHandleName":"0001-0011-openshift-storage-0000000000000001-97e46d10-25cf-11ec-9fa9-0a580a8103a8"} {"level":"info","timestamp":"2021-10-05T12:21:15.698Z","logger":"controllers.VolumeReplication","caller":"controller/controller.go:298","msg":"volume is not ready to use, requeuing for resync","Request.Name":"busybox-pvc-2","Request.Namespace":"busybox-pratik"} Disable and delete: {"level":"info","timestamp":"2021-10-05T12:21:56.971Z","logger":"controllers.VolumeReplication","caller":"controller/controller.go:298","msg":"volume handle","Request.Name":"busybox-pvc-2","Request.Namespace":"busybox-pratik","VolumeHandleName":"0001-0011-openshift-storage-0000000000000001-97e46d10-25cf-11ec-9fa9-0a580a8103a8"} {"level":"error","timestamp":"2021-10-05T12:21:57.025Z","logger":"controllers.VolumeReplication","caller":"controllers/volumereplication_controller.go:170","msg":"failed to disable volume replication","Request.Name":"busybox-pvc-2","Request.Namespace":"busybox-pratik","error":"rpc error: code = InvalidArgument desc = secondary image status is up=true and state=starting_replay"} {"level":"error","timestamp":"2021-10-05T12:21:57.025Z","logger":"controllers.VolumeReplication","caller":"controller/controller.go:298","msg":"failed to disable replication","Request.Name":"busybox-pvc-2","Request.Namespace":"busybox-pratik","error":"rpc error: code = InvalidArgument desc = secondary image status is up=true and state=starting_replay"} {"level":"error","timestamp":"2021-10-05T12:21:57.025Z","logger":"controller-runtime.manager.controller.volumereplication","caller":"controller/controller.go:253","msg":"Reconciler error","reconciler group":"replication.storage.openshift.io","reconciler kind":"VolumeReplication","name":"busybox-pvc-2","namespace":"busybox-pratik","error":"rpc error: code = InvalidArgument desc = secondary image status is up=true and state=starting_replay"} {"level":"info","timestamp":"2021-10-05T12:23:22.506Z","logger":"controllers.VolumeReplication","caller":"controller/controller.go:298","msg":"volume handle","Request.Name":"busybox-pvc-2","Request.Namespace":"busybox-pratik","VolumeHandleName":"0001-0011-openshift-storage-0000000000000001-97e46d10-25cf-11ec-9fa9-0a580a8103a8"} {"level":"info","timestamp":"2021-10-05T12:23:22.561Z","logger":"controllers.VolumeReplication","caller":"controllers/volumereplication_controller.go:175","msg":"removing finalizer from PersistentVolumeClaim object","Request.Name":"busybox-pvc-2","Request.Namespace":"busybox-pratik","Finalizer":"replication.storage.openshift.io/pvc-protection"} {"level":"info","timestamp":"2021-10-05T12:23:22.572Z","logger":"controllers.VolumeReplication","caller":"controllers/volumereplication_controller.go:182","msg":"removing finalizer from volumeReplication object","Request.Name":"busybox-pvc-2","Request.Namespace":"busybox-pratik","Finalizer":"replication.storage.openshift.io"} {"level":"info","timestamp":"2021-10-05T12:23:22.585Z","logger":"controllers.VolumeReplication","caller":"controller/controller.go:298","msg":"volumeReplication object is terminated, 
skipping reconciliation","Request.Name":"busybox-pvc-2","Request.Namespace":"busybox-pratik"} {"level":"info","timestamp":"2021-10-05T12:23:22.585Z","logger":"controllers.VolumeReplication","caller":"controller/controller.go:298","msg":"volumeReplication resource not found","Request.Name":"busybox-pvc-2","Request.Namespace":"busybox-pratik"} The issue seems to be VolRep (and/or ceph-csi) reporting moved to Secondary and Resyncing true and Degraded true, based on image state which possibly does not cover state=starting_replay or up+stopping_replay? Also, If disable failed, I am wondering why delete finally succeeded? List of log files attached: csi-volrep-one-265t7.txt (CSI volume replication operator sidecar from cluster "one") csi-rbdplugin-one-265t7.txt (CSI provisioner logs from rbdplugin container, in cluster "one") csi-volrep-two-dcqd2.txt (CSI volume replication operator sidecar from cluster "two", for reference) csi-rbdplugin-two-dcqd2.txt (CSI provisioner logs from rbdplugin container, in cluster "two", for reference) rbd-mirror-one.log (RBD mirror daemon logs from cluster "one") rbd-mirror-two.log (RBD mirror daemon logs from cluster "two") Version of all relevant components (if applicable): 4.9 build #164 Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)? This is a silent error, and hence undetectable and would cause data corrpution in the future if the workload is moved to cluster "one". Is there any workaround available to the best of your knowledge? None Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)? 3 The issue was reproduced in a mirroring setup across BLR and RDU, which hence has high latency interconnects and may impact the ability to reproduce the problem. Can this issue reproducible? Unsure as yet, other instance of split-brain have been root caused to different issues, so it does occur, but with the above RBD mirroring states is unsure.
Automation priority is low since DR automation is yet to be implemented.