Bug 2007376 - [RDR] [Tracker for bz #2009735] After failing over an application, the mirrored RBDs can be attached but the filesystems can not be mounted (still in use)
Summary: [RDR] [Tracker for bz #2009735] After failing over an application, the mirro...
Keywords:
Status: NEW
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Ilya Dryomov
QA Contact: Elad
URL:
Whiteboard:
Duplicates: 2155564
Depends On: 2009735
Blocks: 2011326 2030752 2155564
 
Reported: 2021-09-23 17:22 UTC by Jean-Charles Lopez
Modified: 2024-01-04 11:16 UTC
CC List: 20 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
.Failover action reports RADOS block device image mount failed on the pod with RPC error `still in use`
Failing over a disaster recovery (DR) protected workload may result in pods that use the volume on the failover cluster becoming stuck reporting that the RADOS block device (RBD) image is still in use. This prevents the pods from starting up for a long duration (up to several hours).
Clone Of:
: 2009735 2030752 (view as bug list)
Environment:
Last Closed:
Embargoed:



Description Jean-Charles Lopez 2021-09-23 17:22:50 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
Created a test application for initial ODR testing. The application consists of two jobs running fast, uninterrupted loops that keep writing to RBD-based PVCs. When failing over the application, it appears that on the remote cluster the PVs are successfully attached to the pods, but the filesystems cannot be mounted and an error message on the pod shows "still in use".
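For reference, the write pattern inside each job was roughly of the following shape (a minimal sketch; the actual job spec is not part of this report, and the mount path /mnt/data is an assumption):

# Hypothetical inner loop of each test job: continuous small direct writes
# to the RBD-backed PVC mounted at /mnt/data (path is an assumption).
while true; do
  dd if=/dev/zero of=/mnt/data/loopfile bs=4k count=1000 oflag=direct conv=fsync
done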


Version of all relevant components (if applicable):
OCP 4.8
ODF 4.9 Build 140-ci

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?
Not identified yet

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
5

Is this issue reproducible?
Do not know at this point

Can this issue be reproduced from the UI?
Do not know at this point

If this is a regression, please provide more details to justify this:
No, as this is a new feature.

Steps to Reproduce:
1.
2.
3.


Actual results:


Expected results:


Additional info:
On the secondary cluster, where the application has been failed over, the following message can be found when running an "oc describe" on the pod:

Events:
  Type     Reason                  Age   From                     Message
  ----     ------                  ----  ----                     -------
  Normal   Scheduled               113s  default-scheduler        Successfully assigned rbdloop/batch-4k-ll6mg to perf2-lcb2j-ocs-qvxhp
  Normal   SuccessfulAttachVolume  112s  attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-16b867bd-95f6-401b-8760-077477c90d4a"
  Warning  FailedMount             46s   kubelet                  MountVolume.MountDevice failed for volume "pvc-16b867bd-95f6-401b-8760-077477c90d4a" : rpc error: code = Internal desc = rbd image ocs-storagecluster-cephblockpool/csi-vol-6ec8cc65-1c05-11ec-8cf6-0a580a050e55 is still being used
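One way to check whether anything actually holds the image open is to list its watchers from the ceph toolbox on the failover cluster (a sketch; this output was not captured for this report):

# List any clients still watching the image reported as "still being used";
# an active watcher would explain the mount failure above.
rbd status csi-vol-6ec8cc65-1c05-11ec-8cf6-0a580a050e55 -p ocs-storagecluster-cephblockpool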

On the secondary cluster after failover:
oc get vr -A
NAMESPACE   NAME             AGE     VOLUMEREPLICATIONCLASS       PVCNAME          DESIREDSTATE   CURRENTSTATE
rbdloop     pvc-cephrbd-4k   5h27m   rbd-volumereplicationclass   pvc-cephrbd-4k   primary        Unknown
rbdloop     pvc-cephrbd-4m   5h27m   rbd-volumereplicationclass   pvc-cephrbd-4m   primary        Unknown

lastStartTime: "2021-09-23T00:45:22Z"
  message: 'rpc error: code = Aborted desc = an operation with the given Volume ID
    0001-0011-openshift-storage-0000000000000001-6ec8cc65-1c05-11ec-8cf6-0a580a050e55
    already exists'
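The full VolumeReplication status and conditions behind the Unknown CURRENTSTATE can be inspected with the following (a sketch; the output was not captured here):

# Describe one of the VolumeReplication resources listed above to see its
# conditions and the last reported error.
oc describe vr pvc-cephrbd-4k -n rbdloop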

rbd mirror image status csi-vol-6ec8cc65-1c05-11ec-8cf6-0a580a050e55 -p ocs-storagecluster-cephblockpool
csi-vol-6ec8cc65-1c05-11ec-8cf6-0a580a050e55:
  global_id:   bb22f7df-b48f-4633-8e58-28e368725e83
  state:       up+error
  description: local image linked to unknown peer
  last_update: 2021-09-23 06:36:27
  peer_sites:
    name: 5eec6e62-1219-4a2a-84f0-e6c26367e09a
    state: up+replaying
    description: replaying, {"bytes_per_second":0.0,"bytes_per_snapshot":0.0,"remote_snapshot_timestamp":1632358077,"replay_state":"idle"}
    last_update: 2021-09-23 06:36:43

rbd mirror image promote csi-vol-6ec8cc65-1c05-11ec-8cf6-0a580a050e55 -p ocs-storagecluster-cephblockpool
2021-09-23T06:56:44.468+0000 7faa169af700 -1 librbd::mirror::PromoteRequest: 0x55afe3dedc10 handle_get_info: image is still primary within a remote cluster
rbd: error promoting image to primary
2021-09-23T06:56:44.468+0000 7faa29a6e2c0 -1 librbd::api::Mirror: image_promote: failed to promote image
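For completeness, rbd also supports a forced promotion of a non-primary image (a sketch only; not validated here as a workaround, and the result of running it was not captured in this report):

# Force-promote the image, bypassing the normal primary/non-primary handshake.
rbd mirror image promote --force csi-vol-6ec8cc65-1c05-11ec-8cf6-0a580a050e55 -p ocs-storagecluster-cephblockpool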

When I tried to manually promote the image, it said the image is primary on the remote cluster, but it is not:
sh-4.4$ rbd mirror image status csi-vol-6ec8cc65-1c05-11ec-8cf6-0a580a050e55 -p ocs-storagecluster-cephblockpool
csi-vol-6ec8cc65-1c05-11ec-8cf6-0a580a050e55:
  global_id:   bb22f7df-b48f-4633-8e58-28e368725e83
  state:       up+replaying
  description: replaying, {"bytes_per_second":0.0,"bytes_per_snapshot":0.0,"remote_snapshot_timestamp":1632358077,"replay_state":"idle"}
  last_update: 2021-09-23 06:56:13
  peer_sites:
    name: 930034e0-0f34-4c23-98bf-89a3ffe27501
    state: up+error
    description: local image linked to unknown peer
    last_update: 2021-09-23 06:55:57
sh-4.4$ rbd info csi-vol-6ec8cc65-1c05-11ec-8cf6-0a580a050e55 -p ocs-storagecluster-cephblockpool
rbd image 'csi-vol-6ec8cc65-1c05-11ec-8cf6-0a580a050e55':
	size 100 GiB in 25600 objects
	order 22 (4 MiB objects)
	snapshot_count: 3
	id: 5656c9f3afb7b
	block_name_prefix: rbd_data.5656c9f3afb7b
	format: 2
	features: layering, non-primary
	op_features: 
	flags: 
	create_timestamp: Thu Sep 23 00:30:20 2021
	access_timestamp: Thu Sep 23 00:49:42 2021
	modify_timestamp: Thu Sep 23 00:30:20 2021
	mirroring state: enabled
	mirroring mode: snapshot
	mirroring global id: bb22f7df-b48f-4633-8e58-28e368725e83
	mirroring primary: false
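For a broader view of mirroring health on this cluster, the pool-level status can also be checked from the toolbox (a sketch; not captured in this report):

# Summarize mirror daemon health and per-image replication states for the pool.
rbd mirror pool status ocs-storagecluster-cephblockpool --verbose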

Comment 17 Mudit Agarwal 2021-11-02 14:53:27 UTC
Moving this to 4.9.z while Pratik tries out the workaround suggested by Shyam; the fix would be delivered in Ceph 5.0z2, which means this won't be done within the 4.9 timeline.

Comment 28 Mudit Agarwal 2021-11-30 13:26:55 UTC
Removing the 4.9.0 flag and adding this as a known issue (the feature moved to DP).

Comment 29 Benamar Mekhissi 2021-12-02 12:04:48 UTC
I reproduced the "still being used" issue in my minikube clusters. Here are the steps I used to reproduce it, along with my observations. Note that I did no special configuration to reproduce it other than the Ramen patch, and I also added one more PVC with a size of 10T to simulate (roughly) latency. The Ramen patch has a bug that expedited the reproduction of the issue.
The bottom line is that I believe the combination of the Ramen fix and the fast-diff fix will resolve this issue for good.

1. I deployed an application to C1. Here are the two PVCs:
------------- pvc on C1 -----------------
NAMESPACE        NAME                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
busybox-sample   busybox-pvc         Bound    pvc-31814e59-6603-45e2-b03c-fd92a04fbb2c   1Gi        RWO            rook-ceph-block   83s
busybox-sample   busybox-pvc-2       Bound    pvc-3ebc31d0-1101-4592-a289-632f8654f7d2   10000Gi    RWO            rook-ceph-block   82s

2. I failed over, and had two windows open to watch both clusters in parallel:
Every 2.0s: kubectl get pvc -A --context C1                                                                                                  

NAMESPACE        NAME                STATUS        VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
busybox-sample   busybox-pvc         Terminating   pvc-31814e59-6603-45e2-b03c-fd92a04fbb2c   1Gi        RWO            rook-ceph-block   142m
busybox-sample   busybox-pvc-2       Terminating   pvc-3ebc31d0-1101-4592-a289-632f8654f7d2   10000Gi    RWO            rook-ceph-block   142


Every 2.0s: kubectl get pvc -A --context C2

NAMESPACE        NAME                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
busybox-sample   busybox-pvc         Bound    pvc-31814e59-6603-45e2-b03c-fd92a04fbb2c   1Gi        RWO            rook-ceph-block   126m
busybox-sample   busybox-pvc-2       Bound    pvc-3ebc31d0-1101-4592-a289-632f8654f7d2   10000Gi    RWO            rook-ceph-block   126m

I briefly caught the C1 PVCs going to the Terminating state first, but I am going to ignore that for now.

3. Because of the Ramen bug, the VRG reported DataReady the first time we checked, when it shouldn't have.
4. The VR log started reporting the following error every second, and then the retry interval started increasing exponentially:
grep "force promoting volume due to known grpc error" vr.log
{"level”:”…”,"logger":"controllers.VolumeReplication","caller":"controllers/volumereplication_controller.go:215","msg":"force promoting volume due to known grpc error","Request.Name":"busybox-pvc-2","Request.Namespace":"busybox-sample","error":"rpc error: code = FailedPrecondition desc = failed to promote image \"replicapool/csi-vol-cc146afa-5348-11ec-99db-0242ac11000a\" with error: rbd: ret=-16, Device or resource busy"}

5. At this point, describing the busybox pod shows the *still being used* error. Interestingly, it is showing on the volume whose snap has already been fully copied. Nevertheless, I believe they are all related.
kubectl describe -nbusybox-sample pod   busybox
...
...
...
Events:
  Type     Reason                  Age                  From                     Message
  ----     ------                  ----                 ----                     -------
  Normal   Scheduled               20m                  default-scheduler        Successfully assigned busybox-sample/busybox to hub
  Normal   CreateResource          20m                  subscription             Synchronizer created resource busybox of gvk:/v1, Kind=Pod
  Normal   SuccessfulAttachVolume  20m                  attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-3ebc31d0-1101-4592-a289-632f8654f7d2"
  Normal   SuccessfulAttachVolume  20m                  attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-31814e59-6603-45e2-b03c-fd92a04fbb2c"
  Warning  FailedMount             18m                  kubelet                  MountVolume.MountDevice failed for volume "pvc-31814e59-6603-45e2-b03c-fd92a04fbb2c" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  FailedMount             17m (x6 over 18m)    kubelet                  MountVolume.MountDevice failed for volume "pvc-31814e59-6603-45e2-b03c-fd92a04fbb2c" : rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0009-rook-ceph-0000000000000001-cc1467cf-5348-11ec-99db-0242ac11000a already exists
  Warning  FailedMount             2m43s (x6 over 16m)  kubelet                  MountVolume.MountDevice failed for volume "pvc-31814e59-6603-45e2-b03c-fd92a04fbb2c" : rpc error: code = Internal desc = rbd image replicapool/csi-vol-cc1467cf-5348-11ec-99db-0242ac11000a is still being used

6. An RBD info on the image on C2 shows that it is still non-primary:
c2 — rbd info replicapool/csi-vol-cc146afa-5348-11ec-99db-0242ac11000a 2>/dev/null
rbd image 'csi-vol-cc146afa-5348-11ec-99db-0242ac11000a':
	size 9.8 TiB in 2560000 objects
	order 22 (4 MiB objects)
	snapshot_count: 1
	id: 32b8fabdf189de
	block_name_prefix: rbd_data.32b8fabdf189de
	format: 2
	features: layering, non-primary
	op_features: 
	flags: 
...
	mirroring state: enabled
	mirroring mode: snapshot
	mirroring global id: 8693ca05-40d4-4f51-a982-fe9f9024ccc3
	mirroring primary: false

7. An RBD info on the image on C1 shows that it is non-primary (with the right Ramen fix, this should not happen):
c1 — rbd info replicapool/csi-vol-cc146afa-5348-11ec-99db-0242ac11000a
rbd image 'csi-vol-cc146afa-5348-11ec-99db-0242ac11000a':
	size 9.8 TiB in 2560000 objects
	order 22 (4 MiB objects)
	snapshot_count: 4
	id: 1e9e167979b7a5
	block_name_prefix: rbd_data.1e9e167979b7a5
	format: 2
	features: layering, non-primary
	op_features: 
	flags: 
...
	mirroring state: enabled
	mirroring mode: snapshot
	mirroring global id: 8693ca05-40d4-4f51-a982-fe9f9024ccc3
	mirroring primary: false


8. On C2, the snap for the large image shows that it is still copying:
c2 -- rbd snap ls replicapool/csi-vol-cc146afa-5348-11ec-99db-0242ac11000a --all 2>/dev/null
SNAPID  NAME                                                                                           SIZE     PROTECTED   NAMESPACE                                                                                
231307  .mirror.non_primary.8693ca05-40d4-4f51-a982-fe9f9024ccc3.d03e40c4-3b36-466c-84c3-2cbc42e5bb2a  9.8 TiB              mirror (non-primary peer_uuids:[] 0d463db9-d6d5-496d-af45-16d780db76f6:229056 36% copied)


9. Force promote will not work while (8) is true, which means (4) keeps reporting the same error. As of this writing, the exponential backoff gap has reached 17 minutes; I think this is the max for the reconciler.
10. The interesting thing, though, is that the image is non-primary in both clusters. I thought the snap copy had stopped, but after 2 hours, while I wasn't watching, the snap copy moved from 36% to 40% complete (it can be tracked as sketched below). It is still progressing, but very, very slowly. This never happens when things are functioning properly.

I expect that after the snap is fully copied, the "still being used" error will go away. If I am not mistaken, Annette, JC, and Pratik reported the same behavior (the "still being used" error getting cleared after a long time).
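For reference, the snapshot copy progress described in steps 8 and 10 can be tracked with something like the following from the C2 toolbox (a sketch, assuming the same pool and image names as above):

# Re-list the mirror snapshot every minute to watch the "NN% copied" figure advance.
watch -n 60 'rbd snap ls replicapool/csi-vol-cc146afa-5348-11ec-99db-0242ac11000a --all'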

Comment 35 Mudit Agarwal 2022-01-27 03:12:38 UTC
Providing dev ack given that the ceph BZ is already approved for RHCS 5.1

Comment 45 Mudit Agarwal 2022-06-16 01:24:45 UTC
Ceph bug is retargeted for 5.3

Comment 85 Aman Agrawal 2023-03-23 15:43:42 UTC
*** Bug 2155564 has been marked as a duplicate of this bug. ***

