OCP logs- http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/29oct23/
Moving to 4.14.z based on offline discussion
Hi Venky, A similar/same issue was hit again. Do we have any updates on this BZ?
With ACM 2.9.2 GA'ed
ODF 4.14.4-2
ceph version 17.2.6-170.el9cp (59bbeb8815ec3aeb3c8bba1e1866f8f6729eb840) quincy (stable)

We are hitting this issue again on an RDR setup.

After failover from C1 to C2, once the older primary cluster (which had been down) was recovered and its cleanup completed successfully, sync did not progress for almost all CephFS workloads deployed on this setup, and the dst pods have the same mount issue.

busybox-workloads-11 was in Deployed state on C2 (cluster amagrawa-c123j-140) and no action was performed on it.

From C2-

amagrawa:c2$ oc get pvc,vrg,vr,pods -o wide -n busybox-workloads-11
NAME                                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE   VOLUMEMODE
persistentvolumeclaim/busybox-pvc-1   Bound    pvc-3e217751-97d3-412d-9fba-b5c350add790   94Gi       RWX            ocs-storagecluster-cephfs   38h   Filesystem
persistentvolumeclaim/busybox-pvc-2   Bound    pvc-bbaca5f0-b899-4fd9-aa46-ad739eb24a61   44Gi       RWX            ocs-storagecluster-cephfs   38h   Filesystem
persistentvolumeclaim/busybox-pvc-3   Bound    pvc-25536498-92f2-4923-81a4-ae93504f2436   76Gi       RWX            ocs-storagecluster-cephfs   38h   Filesystem
persistentvolumeclaim/busybox-pvc-4   Bound    pvc-366082c9-d2dc-4b0f-9b3a-ff4d2b00d6a2   144Gi      RWX            ocs-storagecluster-cephfs   38h   Filesystem

NAME                                                                                DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/cephfs-appset-busybox11-placement-drpc   secondary      Secondary

NAME                                            READY   STATUS                   RESTARTS   AGE    IP             NODE        NOMINATED NODE   READINESS GATES
pod/volsync-rsync-tls-dst-busybox-pvc-1-q6jrz   1/1     Running                  0          27m    10.128.2.76    compute-1   <none>           <none>
pod/volsync-rsync-tls-dst-busybox-pvc-2-r8knw   1/1     Running                  0          26m    10.128.2.78    compute-1   <none>           <none>
pod/volsync-rsync-tls-dst-busybox-pvc-3-4mtcd   0/1     ContainerStatusUnknown   1          26h    <none>         compute-1   <none>           <none>
pod/volsync-rsync-tls-dst-busybox-pvc-3-ngprd   0/1     ContainerCreating        0          4h6m   <none>         compute-1   <none>           <none>
pod/volsync-rsync-tls-dst-busybox-pvc-4-cpxk9   1/1     Running                  0          26m    10.129.2.159   compute-0   <none>           <none>

amagrawa:c2$ oc describe pod/volsync-rsync-tls-dst-busybox-pvc-3-ngprd -n busybox-workloads-11
Name:             volsync-rsync-tls-dst-busybox-pvc-3-ngprd
Namespace:        busybox-workloads-11
Priority:         0
Service Account:  volsync-dst-busybox-pvc-3
Node:             compute-1/10.1.114.154
Start Time:       Thu, 25 Jan 2024 11:51:30 +0530
Labels:           app.kubernetes.io/component=rsync-tls-mover
                  app.kubernetes.io/created-by=volsync
                  app.kubernetes.io/name=dst-busybox-pvc-3
                  app.kubernetes.io/part-of=volsync
                  batch.kubernetes.io/controller-uid=787251d2-486c-4168-b9e8-4adfbe174a2d
                  batch.kubernetes.io/job-name=volsync-rsync-tls-dst-busybox-pvc-3
                  controller-uid=787251d2-486c-4168-b9e8-4adfbe174a2d
                  job-name=volsync-rsync-tls-dst-busybox-pvc-3
Annotations:      k8s.ovn.org/pod-networks:
                    {"default":{"ip_addresses":["10.128.2.15/23"],"mac_address":"0a:58:0a:80:02:0f","gateway_ips":["10.128.2.1"],"routes":[{"dest":"10.128.0.0...
                  openshift.io/scc: restricted-v2
                  seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:           Pending
SeccompProfile:   RuntimeDefault
IP:
IPs:              <none>
Controlled By:    Job/volsync-rsync-tls-dst-busybox-pvc-3
Containers:
  rsync-tls:
    Container ID:
    Image:         registry.redhat.io/rhacm2/volsync-rhel8@sha256:e01c2278c966ba2b55c35e3f9b17736ef33b9543efd18c98dbb18db9d58bc2c6
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
      /mover-rsync-tls/server.sh
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:
      PRIVILEGED_MOVER:  0
    Mounts:
      /data from data (rw)
      /keys from keys (rw)
      /tmp from tempdir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-nmd4w (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  busybox-pvc-3
    ReadOnly:   false
  keys:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cephfs-appset-busybox11-placement-drpc-vs-secret
    Optional:    false
  tempdir:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  kube-api-access-nmd4w:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason       Age                   From     Message
  ----     ------       ----                  ----     -------
  Warning  FailedMount  11m (x114 over 4h4m)  kubelet  Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition
  Warning  FailedMount  74s (x128 over 4h5m)  kubelet  MountVolume.SetUp failed for volume "pvc-25536498-92f2-4923-81a4-ae93504f2436" : rpc error: code = Internal desc = staging path /var/lib/kubelet/plugins/kubernetes.io/csi/openshift-storage.cephfs.csi.ceph.com/64dba363a883b5b01a554e5634675de39ce11f4a3c4e49396fb246f53821491d/globalmount for volume 0001-0011-openshift-storage-0000000000000001-136ea1ee-b977-47a7-84e1-860763360dff is not a mountpoint

I am attaching the must-gather logs from this setup- http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/24jan24/

C1 and C2 are managed clusters where ODF is installed.

For the open questions in #comment16, seeking inputs from Benamar.

Please note, this issue is critical for RDR, and we should prioritize its fix.
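
To pick out the affected destination pods from `oc get` output like that in this comment, a minimal Python sketch follows. It assumes the plain-text table format of `oc get pvc,vrg,vr,pods -o wide`, where pod rows carry a `pod/` prefix because several resource types were requested. The function name `stuck_pods` and the `SAMPLE` text (three rows copied from this comment) are illustrative only and not part of any tool.

```python
# Three pod rows copied from the `oc get` output in this comment.
SAMPLE = """\
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/volsync-rsync-tls-dst-busybox-pvc-1-q6jrz 1/1 Running 0 27m 10.128.2.76 compute-1 <none> <none>
pod/volsync-rsync-tls-dst-busybox-pvc-3-4mtcd 0/1 ContainerStatusUnknown 1 26h <none> compute-1 <none> <none>
pod/volsync-rsync-tls-dst-busybox-pvc-3-ngprd 0/1 ContainerCreating 0 4h6m <none> compute-1 <none> <none>
"""


def stuck_pods(oc_output):
    """Return (pod name, status) for every pod row that is not Running or Completed."""
    stuck = []
    for line in oc_output.splitlines():
        fields = line.split()
        # Pod rows start with NAME, READY, STATUS; skip headers and other resource types.
        if len(fields) < 3 or not fields[0].startswith("pod/"):
            continue
        name, _ready, status = fields[:3]
        if status not in ("Running", "Completed"):
            stuck.append((name, status))
    return stuck


if __name__ == "__main__":
    for name, status in stuck_pods(SAMPLE):
        print(f"{name}: {status}")
```

Each pod reported this way can then be passed to `oc describe` to check its events for the FailedMount error shown above.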
Moving the bug out to 4.16 for verification once we have a fix or workaround for BZ2270064.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:4591
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days