Bug 2246834 - [RDR] [Node failure] [CephFS] Relocate remains stuck forever with MountVolume.SetUp failed error when one of the three worker nodes was rebooted during the relocation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.14
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ODF 4.16.0
Assignee: Venky Shankar
QA Contact: Aman Agrawal
URL:
Whiteboard: verification-blocked
Depends On:
Blocks:
Reported: 2023-10-29 12:54 UTC by Aman Agrawal
Modified: 2024-11-15 04:25 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-07-17 13:10:14 UTC
Embargoed:




Links
System                  ID              Private  Priority  Status  Summary  Last Updated
Red Hat Product Errata  RHSA-2024:4591  0        None      None    None     2024-07-17 13:10:16 UTC

Comment 7 Mudit Agarwal 2023-11-07 11:38:01 UTC
Moving to 4.14.z based on offline discussion

Comment 10 Aman Agrawal 2023-12-17 16:18:51 UTC
Hi Venky, 

A similar/same issue was hit again. Do we have any updates on this BZ?

Comment 41 Aman Agrawal 2024-01-25 10:38:32 UTC
With ACM 2.9.2 GA'ed
ODF 4.14.4-2
ceph version 17.2.6-170.el9cp (59bbeb8815ec3aeb3c8bba1e1866f8f6729eb840) quincy (stable)

We are hitting this issue again on an RDR setup.
After failover from C1 to C2, the older primary cluster (which had been down) was recovered and cleaned up successfully. After that, sync did not progress for almost all CephFS workloads deployed on this setup, and the dst pods have the same mount issue.

busybox-workloads-11 was in deployed state on C2 (cluster amagrawa-c123j-140) and no action was performed on it.

From C2:

amagrawa:c2$ oc get pvc,vrg,vr,pods -o wide -n busybox-workloads-11
NAME                                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE   VOLUMEMODE
persistentvolumeclaim/busybox-pvc-1   Bound    pvc-3e217751-97d3-412d-9fba-b5c350add790   94Gi       RWX            ocs-storagecluster-cephfs   38h   Filesystem
persistentvolumeclaim/busybox-pvc-2   Bound    pvc-bbaca5f0-b899-4fd9-aa46-ad739eb24a61   44Gi       RWX            ocs-storagecluster-cephfs   38h   Filesystem
persistentvolumeclaim/busybox-pvc-3   Bound    pvc-25536498-92f2-4923-81a4-ae93504f2436   76Gi       RWX            ocs-storagecluster-cephfs   38h   Filesystem
persistentvolumeclaim/busybox-pvc-4   Bound    pvc-366082c9-d2dc-4b0f-9b3a-ff4d2b00d6a2   144Gi      RWX            ocs-storagecluster-cephfs   38h   Filesystem

NAME                                                                                 DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/cephfs-appset-busybox11-placement-drpc   secondary      Secondary

NAME                                            READY   STATUS                   RESTARTS   AGE    IP             NODE        NOMINATED NODE   READINESS GATES
pod/volsync-rsync-tls-dst-busybox-pvc-1-q6jrz   1/1     Running                  0          27m    10.128.2.76    compute-1   <none>           <none>
pod/volsync-rsync-tls-dst-busybox-pvc-2-r8knw   1/1     Running                  0          26m    10.128.2.78    compute-1   <none>           <none>
pod/volsync-rsync-tls-dst-busybox-pvc-3-4mtcd   0/1     ContainerStatusUnknown   1          26h    <none>         compute-1   <none>           <none>
pod/volsync-rsync-tls-dst-busybox-pvc-3-ngprd   0/1     ContainerCreating        0          4h6m   <none>         compute-1   <none>           <none>
pod/volsync-rsync-tls-dst-busybox-pvc-4-cpxk9   1/1     Running                  0          26m    10.129.2.159   compute-0   <none>           <none>
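
When many workloads are affected, it can help to filter a listing like the one above down to the pods that are not healthy. The following is only a minimal sketch, assuming the text output of `oc get pods -o wide` has been captured into a string and that columns are separated by two or more spaces; `stuck_pods` and `SAMPLE_TABLE` are hypothetical names for illustration, not part of any tool used in this bug.

```python
import re

# Pod rows copied from the listing above (captured `oc get pods -o wide` text).
SAMPLE_TABLE = """
NAME                                            READY   STATUS                   RESTARTS   AGE    IP             NODE        NOMINATED NODE   READINESS GATES
pod/volsync-rsync-tls-dst-busybox-pvc-1-q6jrz   1/1     Running                  0          27m    10.128.2.76    compute-1   <none>           <none>
pod/volsync-rsync-tls-dst-busybox-pvc-2-r8knw   1/1     Running                  0          26m    10.128.2.78    compute-1   <none>           <none>
pod/volsync-rsync-tls-dst-busybox-pvc-3-4mtcd   0/1     ContainerStatusUnknown   1          26h    <none>         compute-1   <none>           <none>
pod/volsync-rsync-tls-dst-busybox-pvc-3-ngprd   0/1     ContainerCreating        0          4h6m   <none>         compute-1   <none>           <none>
pod/volsync-rsync-tls-dst-busybox-pvc-4-cpxk9   1/1     Running                  0          26m    10.129.2.159   compute-0   <none>           <none>
"""


def stuck_pods(table):
    """Return (name, status, age) for every pod whose STATUS is not Running or Completed."""
    rows = [line.strip() for line in table.splitlines() if line.strip()]
    # Assumption: columns are separated by at least two spaces, so a header
    # such as "NOMINATED NODE" (single space) stays in one cell.
    header = re.split(r"\s{2,}", rows[0])
    col = {name: i for i, name in enumerate(header)}
    result = []
    for row in rows[1:]:
        cells = re.split(r"\s{2,}", row)
        status = cells[col["STATUS"]]
        if status not in ("Running", "Completed"):
            result.append((cells[col["NAME"]], status, cells[col["AGE"]]))
    return result


if __name__ == "__main__":
    for name, status, age in stuck_pods(SAMPLE_TABLE):
        print(f"{name}: {status} for {age}")
```

For this listing, the sketch reports only the two busybox-pvc-3 dst pods, which matches the pods with the mount problem.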


amagrawa:c2$ oc describe pod/volsync-rsync-tls-dst-busybox-pvc-3-ngprd -n busybox-workloads-11
Name:             volsync-rsync-tls-dst-busybox-pvc-3-ngprd
Namespace:        busybox-workloads-11
Priority:         0
Service Account:  volsync-dst-busybox-pvc-3
Node:             compute-1/10.1.114.154
Start Time:       Thu, 25 Jan 2024 11:51:30 +0530
Labels:           app.kubernetes.io/component=rsync-tls-mover
                  app.kubernetes.io/created-by=volsync
                  app.kubernetes.io/name=dst-busybox-pvc-3
                  app.kubernetes.io/part-of=volsync
                  batch.kubernetes.io/controller-uid=787251d2-486c-4168-b9e8-4adfbe174a2d
                  batch.kubernetes.io/job-name=volsync-rsync-tls-dst-busybox-pvc-3
                  controller-uid=787251d2-486c-4168-b9e8-4adfbe174a2d
                  job-name=volsync-rsync-tls-dst-busybox-pvc-3
Annotations:      k8s.ovn.org/pod-networks:
                    {"default":{"ip_addresses":["10.128.2.15/23"],"mac_address":"0a:58:0a:80:02:0f","gateway_ips":["10.128.2.1"],"routes":[{"dest":"10.128.0.0...
                  openshift.io/scc: restricted-v2
                  seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:           Pending
SeccompProfile:   RuntimeDefault
IP:               
IPs:              <none>
Controlled By:    Job/volsync-rsync-tls-dst-busybox-pvc-3
Containers:
  rsync-tls:
    Container ID:  
    Image:         registry.redhat.io/rhacm2/volsync-rhel8@sha256:e01c2278c966ba2b55c35e3f9b17736ef33b9543efd18c98dbb18db9d58bc2c6
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
      /mover-rsync-tls/server.sh
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:
      PRIVILEGED_MOVER:  0
    Mounts:
      /data from data (rw)
      /keys from keys (rw)
      /tmp from tempdir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-nmd4w (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  busybox-pvc-3
    ReadOnly:   false
  keys:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cephfs-appset-busybox11-placement-drpc-vs-secret
    Optional:    false
  tempdir:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  kube-api-access-nmd4w:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason       Age                   From     Message
  ----     ------       ----                  ----     -------
  Warning  FailedMount  11m (x114 over 4h4m)  kubelet  Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition
  Warning  FailedMount  74s (x128 over 4h5m)  kubelet  MountVolume.SetUp failed for volume "pvc-25536498-92f2-4923-81a4-ae93504f2436" : rpc error: code = Internal desc = staging path /var/lib/kubelet/plugins/kubernetes.io/csi/openshift-storage.cephfs.csi.ceph.com/64dba363a883b5b01a554e5634675de39ce11f4a3c4e49396fb246f53821491d/globalmount for volume 0001-0011-openshift-storage-0000000000000001-136ea1ee-b977-47a7-84e1-860763360dff is not a mountpoint
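
To correlate this failure across pods and nodes, the PV name, staging path, and CSI volume handle can be pulled out of the kubelet event message. The following is a rough sketch, assuming the message keeps the exact wording shown in the event above; `parse_failed_mount` and `EVENT_MESSAGE` are hypothetical names, and the regular expression is derived only from this one message, not from ceph-csi or kubelet source.

```python
import re

# Event message copied from the FailedMount event above.
EVENT_MESSAGE = (
    'MountVolume.SetUp failed for volume "pvc-25536498-92f2-4923-81a4-ae93504f2436" : '
    "rpc error: code = Internal desc = staging path "
    "/var/lib/kubelet/plugins/kubernetes.io/csi/openshift-storage.cephfs.csi.ceph.com/"
    "64dba363a883b5b01a554e5634675de39ce11f4a3c4e49396fb246f53821491d/globalmount "
    "for volume 0001-0011-openshift-storage-0000000000000001-136ea1ee-b977-47a7-84e1-860763360dff "
    "is not a mountpoint"
)

# Assumption: the wording matches the event above; other failure modes will not match.
FAILED_MOUNT = re.compile(
    r'MountVolume\.SetUp failed for volume "(?P<pv>[^"]+)" : '
    r"rpc error: code = (?P<code>\w+) desc = "
    r"staging path (?P<staging_path>\S+) "
    r"for volume (?P<volume_handle>\S+) is not a mountpoint"
)


def parse_failed_mount(message):
    """Return a dict with pv, code, staging_path and volume_handle, or None if no match."""
    match = FAILED_MOUNT.search(message)
    return match.groupdict() if match else None


if __name__ == "__main__":
    fields = parse_failed_mount(EVENT_MESSAGE)
    if fields is not None:
        for key, value in fields.items():
            print(f"{key}: {value}")
```

On the affected node itself, Python's `os.path.ismount(path)` performs a comparable check on the extracted staging path, although it is not necessarily the same test the CSI driver runs.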

I am attaching the must-gather logs from this setup: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/24jan24/

C1 and C2 are managed clusters where ODF is installed. 

For the open questions in #comment16, I am seeking input from Benamar.

Please note that this issue is critical for RDR, and we should prioritize its fix.

Comment 66 krishnaram Karthick 2024-03-18 15:21:23 UTC
Moving the bug out to 4.16 for verification once we have a fix or workaround for BZ2270064

Comment 75 errata-xmlrpc 2024-07-17 13:10:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:4591

Comment 76 Red Hat Bugzilla 2024-11-15 04:25:10 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

