Description of problem (please be as detailed as possible and provide log snippets):

Created a test application for initial ODR testing. The application consists of two jobs running fast, uninterrupted loops that keep writing to RBD-based PVCs. When failing over the application, it appears that on the remote cluster the PVs are successfully attached to the pods, but the filesystems cannot be mounted, and an error message on the pod shows "still in use".

Version of all relevant components (if applicable):
OCP 4.8
ODF 4.9 Build 140-ci

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?
Not identified yet

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
5

Is this issue reproducible?
Do not know at this point

Can this issue be reproduced from the UI?
Do not know at this point

If this is a regression, please provide more details to justify this:
No, as this is a new feature

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

On the secondary cluster, where the application has been failed over, the following message can be found when running "oc describe" on the pod:

Events:
  Type     Reason                  Age   From                     Message
  ----     ------                  ----  ----                     -------
  Normal   Scheduled               113s  default-scheduler        Successfully assigned rbdloop/batch-4k-ll6mg to perf2-lcb2j-ocs-qvxhp
  Normal   SuccessfulAttachVolume  112s  attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-16b867bd-95f6-401b-8760-077477c90d4a"
  Warning  FailedMount             46s   kubelet                  MountVolume.MountDevice failed for volume "pvc-16b867bd-95f6-401b-8760-077477c90d4a" : rpc error: code = Internal desc = rbd image ocs-storagecluster-cephblockpool/csi-vol-6ec8cc65-1c05-11ec-8cf6-0a580a050e55 is still being used

On the secondary cluster after failover:

oc get vr -A
NAMESPACE   NAME             AGE     VOLUMEREPLICATIONCLASS       PVCNAME          DESIREDSTATE   CURRENTSTATE
rbdloop     pvc-cephrbd-4k   5h27m   rbd-volumereplicationclass   pvc-cephrbd-4k   primary        Unknown
rbdloop     pvc-cephrbd-4m   5h27m   rbd-volumereplicationclass   pvc-cephrbd-4m   primary        Unknown

From the VR status:

  lastStartTime: "2021-09-23T00:45:22Z"
  message: 'rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-6ec8cc65-1c05-11ec-8cf6-0a580a050e55 already exists'

rbd mirror image status csi-vol-6ec8cc65-1c05-11ec-8cf6-0a580a050e55 -p ocs-storagecluster-cephblockpool
csi-vol-6ec8cc65-1c05-11ec-8cf6-0a580a050e55:
  global_id:   bb22f7df-b48f-4633-8e58-28e368725e83
  state:       up+error
  description: local image linked to unknown peer
  last_update: 2021-09-23 06:36:27
  peer_sites:
    name: 5eec6e62-1219-4a2a-84f0-e6c26367e09a
    state: up+replaying
    description: replaying, {"bytes_per_second":0.0,"bytes_per_snapshot":0.0,"remote_snapshot_timestamp":1632358077,"replay_state":"idle"}
    last_update: 2021-09-23 06:36:43

rbd mirror image promote csi-vol-6ec8cc65-1c05-11ec-8cf6-0a580a050e55 -p ocs-storagecluster-cephblockpool
2021-09-23T06:56:44.468+0000 7faa169af700 -1 librbd::mirror::PromoteRequest: 0x55afe3dedc10 handle_get_info: image is still primary within a remote cluster
rbd: error promoting image to primary
2021-09-23T06:56:44.468+0000 7faa29a6e2c0 -1 librbd::api::Mirror: image_promote: failed to promote image

When I tried to manually promote the image, it reported that the image is primary on the remote cluster,
but it is not:

sh-4.4$ rbd mirror image status csi-vol-6ec8cc65-1c05-11ec-8cf6-0a580a050e55 -p ocs-storagecluster-cephblockpool
csi-vol-6ec8cc65-1c05-11ec-8cf6-0a580a050e55:
  global_id:   bb22f7df-b48f-4633-8e58-28e368725e83
  state:       up+replaying
  description: replaying, {"bytes_per_second":0.0,"bytes_per_snapshot":0.0,"remote_snapshot_timestamp":1632358077,"replay_state":"idle"}
  last_update: 2021-09-23 06:56:13
  peer_sites:
    name: 930034e0-0f34-4c23-98bf-89a3ffe27501
    state: up+error
    description: local image linked to unknown peer
    last_update: 2021-09-23 06:55:57

sh-4.4$ rbd info csi-vol-6ec8cc65-1c05-11ec-8cf6-0a580a050e55 -p ocs-storagecluster-cephblockpool
rbd image 'csi-vol-6ec8cc65-1c05-11ec-8cf6-0a580a050e55':
        size 100 GiB in 25600 objects
        order 22 (4 MiB objects)
        snapshot_count: 3
        id: 5656c9f3afb7b
        block_name_prefix: rbd_data.5656c9f3afb7b
        format: 2
        features: layering, non-primary
        op_features:
        flags:
        create_timestamp: Thu Sep 23 00:30:20 2021
        access_timestamp: Thu Sep 23 00:49:42 2021
        modify_timestamp: Thu Sep 23 00:30:20 2021
        mirroring state: enabled
        mirroring mode: snapshot
        mirroring global id: bb22f7df-b48f-4633-8e58-28e368725e83
        mirroring primary: false
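For reference, a minimal diagnostic sketch that can be run from the rook-ceph toolbox on the secondary cluster, using the pool and image names from the output above. This is a sketch, not a validated workaround: whether the forced promotion is acceptable depends on tolerating the loss of any unreplicated delta.

# Check for lingering watchers on the image; a stale watcher would explain
# the "still being used" error reported by the CSI driver.
rbd status csi-vol-6ec8cc65-1c05-11ec-8cf6-0a580a050e55 -p ocs-storagecluster-cephblockpool

# Since the normal promote fails with "image is still primary within a remote
# cluster", a forced promotion is the usual escape hatch. It can discard
# unreplicated data, so it should only be used on a confirmed site failure.
rbd mirror image promote --force csi-vol-6ec8cc65-1c05-11ec-8cf6-0a580a050e55 -p ocs-storagecluster-cephblockpool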
Moving this to 4.9.z while Pratik tries out the workaround suggested by Shyam. The fix would be delivered in Ceph 5.0z2, which means this won't be done within the 4.9 timeline.
Removing the 4.9.0 flag and adding it as a known issue (the feature moved to DP).
I reproduced the "still being used" issue in my minikube clusters; here are the steps I used to reproduce it, along with my observations. Note that I did no special configuration to reproduce it other than the Ramen patch, and I also added one more PVC at a size of 10T to simulate (kind of) latency. The Ramen patch has a bug that expedited the reproduction of the issue. The bottom line is that I believe the combination of the Ramen fix and the fast-diff fix will resolve this issue for good.

1. I deployed an application to C1. Here are the 2 PVCs:

------------- pvc on C1 -----------------
NAMESPACE        NAME            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
busybox-sample   busybox-pvc     Bound    pvc-31814e59-6603-45e2-b03c-fd92a04fbb2c   1Gi        RWO            rook-ceph-block   83s
busybox-sample   busybox-pvc-2   Bound    pvc-3ebc31d0-1101-4592-a289-632f8654f7d2   10000Gi    RWO            rook-ceph-block   82s

2. I failed over, with two windows open watching the parallel execution:

Every 2.0s: kubectl get pvc -A --context C1
NAMESPACE        NAME            STATUS        VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
busybox-sample   busybox-pvc     Terminating   pvc-31814e59-6603-45e2-b03c-fd92a04fbb2c   1Gi        RWO            rook-ceph-block   142m
busybox-sample   busybox-pvc-2   Terminating   pvc-3ebc31d0-1101-4592-a289-632f8654f7d2   10000Gi    RWO            rook-ceph-block   142

Every 2.0s: kubectl get pvc -A --context C2
NAMESPACE        NAME            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
busybox-sample   busybox-pvc     Bound    pvc-31814e59-6603-45e2-b03c-fd92a04fbb2c   1Gi        RWO            rook-ceph-block   126m
busybox-sample   busybox-pvc-2   Bound    pvc-3ebc31d0-1101-4592-a289-632f8654f7d2   10000Gi    RWO            rook-ceph-block   126m

I briefly caught the C1 PVCs going into Terminating state first, but I am going to ignore that for now.

3. Because of the Ramen bug, the VRG reported DataReady the first time we checked, where it shouldn't have.

4. The VR log started reporting the following error every second, and then the retry interval kept increasing exponentially:

grep "force promoting volume due to known grpc error" vr.log
{"level":"...","logger":"controllers.VolumeReplication","caller":"controllers/volumereplication_controller.go:215","msg":"force promoting volume due to known grpc error","Request.Name":"busybox-pvc-2","Request.Namespace":"busybox-sample","error":"rpc error: code = FailedPrecondition desc = failed to promote image \"replicapool/csi-vol-cc146afa-5348-11ec-99db-0242ac11000a\" with error: rbd: ret=-16, Device or resource busy"}

5. At this point, describing the busybox pod shows the *still being used* error. The interesting thing is that it shows up on the volume whose snapshot has already been fully copied. Nevertheless, I believe they are all related.

kubectl describe -nbusybox-sample pod busybox
...
...
...
Events:
  Type     Reason                  Age                  From                     Message
  ----     ------                  ----                 ----                     -------
  Normal   Scheduled               20m                  default-scheduler        Successfully assigned busybox-sample/busybox to hub
  Normal   CreateResource          20m                  subscription             Synchronizer created resource busybox of gvk:/v1, Kind=Pod
  Normal   SuccessfulAttachVolume  20m                  attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-3ebc31d0-1101-4592-a289-632f8654f7d2"
  Normal   SuccessfulAttachVolume  20m                  attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-31814e59-6603-45e2-b03c-fd92a04fbb2c"
  Warning  FailedMount             18m                  kubelet                  MountVolume.MountDevice failed for volume "pvc-31814e59-6603-45e2-b03c-fd92a04fbb2c" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  FailedMount             17m (x6 over 18m)    kubelet                  MountVolume.MountDevice failed for volume "pvc-31814e59-6603-45e2-b03c-fd92a04fbb2c" : rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0009-rook-ceph-0000000000000001-cc1467cf-5348-11ec-99db-0242ac11000a already exists
  Warning  FailedMount             2m43s (x6 over 16m)  kubelet                  MountVolume.MountDevice failed for volume "pvc-31814e59-6603-45e2-b03c-fd92a04fbb2c" : rpc error: code = Internal desc = rbd image replicapool/csi-vol-cc1467cf-5348-11ec-99db-0242ac11000a is still being used

6. An rbd info on the image on C2 shows that it is still reporting non-primary:

c2 -- rbd info replicapool/csi-vol-cc146afa-5348-11ec-99db-0242ac11000a 2>/dev/null
rbd image 'csi-vol-cc146afa-5348-11ec-99db-0242ac11000a':
        size 9.8 TiB in 2560000 objects
        order 22 (4 MiB objects)
        snapshot_count: 1
        id: 32b8fabdf189de
        block_name_prefix: rbd_data.32b8fabdf189de
        format: 2
        features: layering, non-primary
        op_features:
        flags:
        ...
        mirroring state: enabled
        mirroring mode: snapshot
        mirroring global id: 8693ca05-40d4-4f51-a982-fe9f9024ccc3
        mirroring primary: false

7. An rbd info on the image on C1 shows that it is non-primary as well (with the right Ramen fix, this should not happen):

c1 -- rbd info replicapool/csi-vol-cc146afa-5348-11ec-99db-0242ac11000a
rbd image 'csi-vol-cc146afa-5348-11ec-99db-0242ac11000a':
        size 9.8 TiB in 2560000 objects
        order 22 (4 MiB objects)
        snapshot_count: 4
        id: 1e9e167979b7a5
        block_name_prefix: rbd_data.1e9e167979b7a5
        format: 2
        features: layering, non-primary
        op_features:
        flags:
        ...
        mirroring state: enabled
        mirroring mode: snapshot
        mirroring global id: 8693ca05-40d4-4f51-a982-fe9f9024ccc3
        mirroring primary: false

8. On C2, the snapshot for the large image shows that it is still copying:

c2 -- rbd snap ls replicapool/csi-vol-cc146afa-5348-11ec-99db-0242ac11000a --all 2>/dev/null
SNAPID  NAME                                                                                           SIZE     PROTECTED  NAMESPACE
231307  .mirror.non_primary.8693ca05-40d4-4f51-a982-fe9f9024ccc3.d03e40c4-3b36-466c-84c3-2cbc42e5bb2a  9.8 TiB             mirror (non-primary peer_uuids:[] 0d463db9-d6d5-496d-af45-16d780db76f6:229056 36% copied)

9. Force promote will not work while (8) is true, which means (4) keeps reporting the same error. As of this writing, the exponential backoff gap has reached 17 minutes; I think this is the max for the reconciler.

10. The interesting thing, though, is that the image is non-primary in both clusters. I thought the snapshot copy had stopped, but after 2 hours, while I wasn't watching, the copy moved from 36% complete to 40% complete. It is still progressing, but very, very slowly; this never happens when things are functioning properly. I expect that after the snapshot is copied fully, the "still being used" error will go away.
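To make the state in (8) easier to track, here is a minimal sketch of the loop I would run from the toolbox on C2 to watch the snapshot copy progress until it completes (the image name is the one from above; the grep pattern just extracts the "% copied" field from the snap listing):

# Sketch: poll the non-primary mirror snapshot copy progress once a minute.
IMAGE=replicapool/csi-vol-cc146afa-5348-11ec-99db-0242ac11000a
while true; do
    date
    rbd snap ls "$IMAGE" --all 2>/dev/null | grep -o '[0-9]*% copied'
    sleep 60
done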
If I am not mistaken, I remember that Annette, JC, and Pratik reported the same issue (where the "still being used" error gets cleared after a long time).
Providing dev ack, given that the Ceph BZ is already approved for RHCS 5.1.
The Ceph bug has been retargeted to 5.3.
*** Bug 2155564 has been marked as a duplicate of this bug. ***