Bug 2239776 - [Tracker ACM-7600][RDR] Source pods remain stuck on the primary cluster and sync stops for cephfs workloads
Summary: [Tracker ACM-7600][RDR] Source pods remain stuck on the primary cluster and sync stops for cephfs workloads
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-dr
Version: 4.14
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ODF 4.14.0
Assignee: Benamar Mekhissi
QA Contact: kmanohar
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-09-20 06:46 UTC by Aman Agrawal
Modified: 2023-11-08 18:56 UTC
CC: 4 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-11-08 18:54:58 UTC
Embargoed:




Links
Red Hat Issue Tracker ACM-7600 (last updated 2023-09-21 12:18:55 UTC)
Red Hat Product Errata RHSA-2023:6832 (last updated 2023-11-08 18:56:32 UTC)

Description Aman Agrawal 2023-09-20 06:46:21 UTC
Description of problem (please be as detailed as possible and provide log
snippets):


Version of all relevant components (if applicable):
ODF 4.14.0-132.stable
OCP 4.14.0-0.nightly-2023-09-02-132842
ACM 2.9.0-DOWNSTREAM-2023-08-24-09-30-12
subctl version: v0.16.0
ceph version 17.2.6-138.el9cp (b488c8dad42b2ecffcd96f3d76eeeecce48b8590) quincy (stable)

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. On an RDR setup, deploy CephFS-based DR-protected workloads on both the primary and secondary clusters. Do **not** perform any failover/relocate operations on the workloads.
2. Run I/Os for a week or so and keep monitoring the pod/PVC status on the primary and secondary managed clusters, lastGroupSyncTime on the hub, etc. (see the monitoring sketch after this list).
3.
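
One way to do the monitoring in step 2 is sketched below. The namespace and resource names are taken from the outputs in this report; the drpc short name and the custom-columns query are assumptions about the Ramen CRDs in this environment rather than commands from the original report.

# On each managed cluster: workload pods, VolSync mover pods and PVCs
$ oc get pods,pvc -n busybox-workloads-5 -o wide

# On each managed cluster: VRG state for the protected namespace
$ oc get vrg -n busybox-workloads-5

# On the hub: lastGroupSyncTime reported by the DRPlacementControl
$ oc get drpc -A -o custom-columns='NAME:.metadata.name,LAST_SYNC:.status.lastGroupSyncTime'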


Actual results: Source pods remain stuck on the primary cluster and sync stops for cephfs workloads

amagrawa:c2$ busybox-5
Now using project "busybox-workloads-5" on server "https://api.amagrawa-c2.qe.rh-ocs.com:6443".
NAME                                            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                    AGE    VOLUMEMODE
persistentvolumeclaim/dd-io-pvc-1               Bound    pvc-63d64bd6-1524-487e-83be-76773c05a906   117Gi      RWO            ocs-storagecluster-cephfs       5d3h   Filesystem
persistentvolumeclaim/dd-io-pvc-2               Bound    pvc-935999f5-ab46-404f-ac81-d713cdcd9d4a   143Gi      RWO            ocs-storagecluster-cephfs       5d3h   Filesystem
persistentvolumeclaim/dd-io-pvc-3               Bound    pvc-628fd825-c9e7-4959-9ded-c8107efee004   134Gi      RWO            ocs-storagecluster-cephfs       5d3h   Filesystem
persistentvolumeclaim/dd-io-pvc-4               Bound    pvc-8f458cd3-5a71-4357-8c1c-eb59af04b68f   106Gi      RWO            ocs-storagecluster-cephfs       5d3h   Filesystem
persistentvolumeclaim/dd-io-pvc-5               Bound    pvc-b52b6e64-4be6-4865-ae74-5524fc398f97   115Gi      RWO            ocs-storagecluster-cephfs       5d3h   Filesystem
persistentvolumeclaim/dd-io-pvc-6               Bound    pvc-e3575099-8433-4f28-9360-e3d7865c23b2   129Gi      RWO            ocs-storagecluster-cephfs       5d3h   Filesystem
persistentvolumeclaim/dd-io-pvc-7               Bound    pvc-77563d35-cb7a-46fd-83cc-f929c52dcdd3   149Gi      RWO            ocs-storagecluster-cephfs       5d3h   Filesystem
persistentvolumeclaim/volsync-dd-io-pvc-1-src   Bound    pvc-601f29dd-6cf3-4c0a-865d-d82398f9e324   117Gi      ROX            ocs-storagecluster-cephfs-vrg   15h    Filesystem
persistentvolumeclaim/volsync-dd-io-pvc-2-src   Bound    pvc-4cfc6648-9b1a-422f-a2db-c2b2ed96146d   143Gi      ROX            ocs-storagecluster-cephfs-vrg   4s     Filesystem
persistentvolumeclaim/volsync-dd-io-pvc-3-src   Bound    pvc-1c736f2e-a7e8-4175-90c9-9dbeb3660952   134Gi      ROX            ocs-storagecluster-cephfs-vrg   3s     Filesystem
persistentvolumeclaim/volsync-dd-io-pvc-4-src   Bound    pvc-a99c1bea-46bb-473b-9525-a294ec075663   106Gi      ROX            ocs-storagecluster-cephfs-vrg   15h    Filesystem
persistentvolumeclaim/volsync-dd-io-pvc-5-src   Bound    pvc-14da1d92-99b4-4e28-a1c8-12ae885684ed   115Gi      ROX            ocs-storagecluster-cephfs-vrg   15h    Filesystem
persistentvolumeclaim/volsync-dd-io-pvc-6-src   Bound    pvc-6ccc5e3c-2815-43a3-a266-14287dfc2d39   129Gi      ROX            ocs-storagecluster-cephfs-vrg   15h    Filesystem
persistentvolumeclaim/volsync-dd-io-pvc-7-src   Bound    pvc-4bbb501e-8543-4292-a53a-d97febfa032c   149Gi      ROX            ocs-storagecluster-cephfs-vrg   15h    Filesystem

NAME                                                                               DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/busybox-workloads-5-placement-1-drpc   primary        Primary

NAME                                          READY   STATUS              RESTARTS   AGE     IP             NODE        NOMINATED NODE   READINESS GATES
pod/dd-io-1-5dbcfccf76-6bnwn                  1/1     Running             0          5d3h    10.131.0.208   compute-0   <none>           <none>
pod/dd-io-2-684fc84b64-jfxzh                  1/1     Running             0          5d3h    10.131.0.209   compute-0   <none>           <none>
pod/dd-io-3-68bf99586d-kznw8                  1/1     Running             0          5d3h    10.129.3.15    compute-1   <none>           <none>
pod/dd-io-4-757c8d8b7b-s5ld2                  1/1     Running             0          5d3h    10.131.0.207   compute-0   <none>           <none>
pod/dd-io-5-74768ccf84-bqk45                  1/1     Running             0          5d3h    10.128.2.136   compute-2   <none>           <none>
pod/dd-io-6-68d5769c76-5wczl                  1/1     Running             0          5d3h    10.129.3.16    compute-1   <none>           <none>
pod/dd-io-7-67d87688b4-ffwmr                  1/1     Running             0          5d3h    10.131.0.206   compute-0   <none>           <none>
pod/volsync-rsync-tls-src-dd-io-pvc-1-xznbk   0/1     ContainerCreating   0          78s     <none>         compute-0   <none>           <none>
pod/volsync-rsync-tls-src-dd-io-pvc-2-fkb48   0/1     ContainerCreating   0          5s      <none>         compute-2   <none>           <none>
pod/volsync-rsync-tls-src-dd-io-pvc-3-nzj5k   0/1     ContainerCreating   0          4s      <none>         compute-0   <none>           <none>
pod/volsync-rsync-tls-src-dd-io-pvc-4-2zlkv   1/1     Running             0          28s     10.128.2.224   compute-2   <none>           <none>
pod/volsync-rsync-tls-src-dd-io-pvc-4-5fbtg   0/1     Error               0          3m48s   10.128.2.202   compute-2   <none>           <none>
pod/volsync-rsync-tls-src-dd-io-pvc-5-rm7l4   1/1     Running             0          85s     10.131.1.79    compute-0   <none>           <none>
pod/volsync-rsync-tls-src-dd-io-pvc-6-vt4dl   0/1     Error               0          3m17s   10.131.1.64    compute-0   <none>           <none>
pod/volsync-rsync-tls-src-dd-io-pvc-6-wjh8l   0/1     ContainerCreating   0          4s      <none>         compute-0   <none>           <none>
pod/volsync-rsync-tls-src-dd-io-pvc-7-2dspj   1/1     Running             0          83s     10.131.1.82    compute-0   <none>           <none>
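
For triage of the stuck mover pods above, a minimal sketch (pod and namespace names copied from the output above; the actual root cause was tracked in ACM-7600, so these commands are illustrative only):

# Why is the source mover stuck in ContainerCreating?
$ oc describe pod volsync-rsync-tls-src-dd-io-pvc-1-xznbk -n busybox-workloads-5

# What did the failed mover report before exiting?
$ oc logs pod/volsync-rsync-tls-src-dd-io-pvc-6-vt4dl -n busybox-workloads-5

# Recent events in the workload namespace (scheduling, mount or TLS connection failures)
$ oc get events -n busybox-workloads-5 --sort-by=.lastTimestamp | tail -n 20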



Expected results: Source pods should reach the Running state on the primary cluster and should not remain stuck. Sync should not stop for cephfs workloads on either of the managed clusters where the workloads are running.


Additional info:

Comment 5 Karolin Seeger 2023-09-22 05:56:27 UTC
The ACM team addressed this issue; fixes are included in submariner-operator-bundle-container-v0.16.0-23 (and later).
-> ON_QA

Comment 6 Karolin Seeger 2023-09-22 07:50:33 UTC
A newer version is available; please use submariner-operator-bundle-container-v0.16.0-25 (and later).
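
A minimal sketch of how the deployed Submariner build can be confirmed on a managed cluster before verification (the submariner-operator namespace is an assumption based on a default Submariner deployment):

# Operator CSV/bundle version installed on the managed cluster
$ oc get csv -n submariner-operator

# Component versions as reported by subctl
$ subctl show versions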

Comment 11 kmanohar 2023-10-30 04:55:49 UTC
VERIFICATION COMMENTS 
=====================


Steps to Reproduce:
1. On an RDR setup, deploy CephFS-based DR-protected workloads on both the primary and secondary clusters. Do **not** perform any failover/relocate operations on the workloads.
2. Run I/Os for a week or so and keep monitoring the pod/PVC status on the primary and secondary managed clusters, lastGroupSyncTime on the hub, etc.

Actual issue: Source pods remain stuck on the primary cluster and sync stops for cephfs workloads

With the fix, the above behavior was not observed.
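
One hedged way to confirm this during the soak is to watch the sync times keep advancing and check that no source mover pods linger (resource names as used elsewhere in this report):

# LAST SYNC / NEXT SYNC should keep advancing on the primary cluster
$ oc get replicationsources.volsync.backube -A -w

# Source mover pods should come and go rather than sit in ContainerCreating/Error
$ oc get pods -A | grep volsync-rsync-tls-src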
 
Output on C1
------------

$ oc get replicationsources.volsync.backube
NAME          SOURCE        LAST SYNC              DURATION          NEXT SYNC
dd-io-pvc-1   dd-io-pvc-1   2023-10-30T04:41:28Z   1m28.064760303s   2023-10-30T04:50:00Z
dd-io-pvc-2   dd-io-pvc-2   2023-10-30T04:41:27Z   1m27.962788093s   2023-10-30T04:50:00Z
dd-io-pvc-3   dd-io-pvc-3   2023-10-30T04:41:21Z   1m21.964984125s   2023-10-30T04:50:00Z
dd-io-pvc-4   dd-io-pvc-4   2023-10-30T04:41:24Z   1m24.495460567s   2023-10-30T04:50:00Z
dd-io-pvc-5   dd-io-pvc-5   2023-10-30T04:41:22Z   1m22.898981791s   2023-10-30T04:50:00Z
dd-io-pvc-6   dd-io-pvc-6   2023-10-30T04:41:29Z   1m29.050621317s   2023-10-30T04:50:00Z
dd-io-pvc-7   dd-io-pvc-7   2023-10-30T04:41:30Z   1m30.858916464s   2023-10-30T04:50:00Z

$ pods
NAME                       READY   STATUS    RESTARTS   AGE
dd-io-1-5dbcfccf76-q4twv   1/1     Running   3          4d20h
dd-io-2-684fc84b64-f4ztj   1/1     Running   2          2d17h
dd-io-3-68bf99586d-7czc4   1/1     Running   3          4d20h
dd-io-4-757c8d8b7b-2xgm2   1/1     Running   3          4d20h
dd-io-5-74768ccf84-s9gqr   1/1     Running   3          4d20h
dd-io-6-68d5769c76-qkfvm   1/1     Running   3          4d20h
dd-io-7-67d87688b4-kpnnm   1/1     Running   2          2d17h

$ pvc
NAME          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE
dd-io-pvc-1   Bound    pvc-e479369e-8ea3-416c-8f51-1bee7c26b471   117Gi      RWO            ocs-storagecluster-cephfs   4d20h
dd-io-pvc-2   Bound    pvc-e17ca78e-e2bf-466e-be8e-d44ba14cc14d   143Gi      RWO            ocs-storagecluster-cephfs   4d20h
dd-io-pvc-3   Bound    pvc-f00f84ed-c662-4b91-a781-8cf27912e54f   134Gi      RWO            ocs-storagecluster-cephfs   4d20h
dd-io-pvc-4   Bound    pvc-9fc0cce1-18c7-4612-84c4-8cf4b839cc49   106Gi      RWO            ocs-storagecluster-cephfs   4d20h
dd-io-pvc-5   Bound    pvc-e7bf1a2d-3d77-4b76-92c5-1bce1854072e   115Gi      RWO            ocs-storagecluster-cephfs   4d20h
dd-io-pvc-6   Bound    pvc-9f1c932d-69d0-4f89-a938-717f7beaf516   129Gi      RWO            ocs-storagecluster-cephfs   4d20h
dd-io-pvc-7   Bound    pvc-b07966c2-83bf-489d-9c32-0656f3ea8622   149Gi      RWO            ocs-storagecluster-cephfs   4d20h



$ oc get vrg
NAME                                 DESIREDSTATE   CURRENTSTATE
busybox-1-cephfs-c1-placement-drpc   primary        Primary



On C2
-----

$ oc get replicationdestinations.volsync.backube
NAME          LAST SYNC              DURATION           NEXT SYNC
dd-io-pvc-1   2023-10-30T04:41:32Z   9m43.583270288s    
dd-io-pvc-2   2023-10-30T04:41:36Z   9m53.64786527s     
dd-io-pvc-3   2023-10-30T04:41:23Z   9m38.1503665s      
dd-io-pvc-4   2023-10-30T04:41:29Z   10m14.653941103s   
dd-io-pvc-5   2023-10-30T04:41:23Z   9m40.087622682s    
dd-io-pvc-6   2023-10-30T04:41:35Z   10m3.789656323s    
dd-io-pvc-7   2023-10-30T04:41:39Z   10m29.950678547s
   
$ pods
NAME                                      READY   STATUS    RESTARTS   AGE
volsync-rsync-tls-dst-dd-io-pvc-1-q6nhw   1/1     Running   0          8m3s
volsync-rsync-tls-dst-dd-io-pvc-2-z8vk2   1/1     Running   0          7m58s
volsync-rsync-tls-dst-dd-io-pvc-3-ssmn6   1/1     Running   0          8m12s
volsync-rsync-tls-dst-dd-io-pvc-4-m5qnn   1/1     Running   0          8m6s
volsync-rsync-tls-dst-dd-io-pvc-5-mk64p   1/1     Running   0          8m12s
volsync-rsync-tls-dst-dd-io-pvc-6-7n4jh   1/1     Running   0          7m59s
volsync-rsync-tls-dst-dd-io-pvc-7-798gl   1/1     Running   0          7m56s

Verified on 
-----------

ODF - 4.14.0-150
OCP - 4.14.0-0.nightly-2023-10-17-113123
MCO - 4.14.0-150
Submariner - 0.16.0 (brew.registry.redhat.io/rh-osbs/iib:599799)
ACM - 2.9.0 (2.9.0-DOWNSTREAM-2023-10-03-20-08-35)

Must gather
-----------

C1 - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/bz-v/bz-2239776/c1/

C2 - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/bz-v/bz-2239776/c2/

HUB - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/bz-v/bz-2239776/hub/

Comment 13 errata-xmlrpc 2023-11-08 18:54:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6832

