Bug 2184068

Summary: [Fusion-aaS] Failed to mount CephFS volumes while creating pods
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Component: ocs-operator
Version: 4.13
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Keywords: TestBlocker
Target Milestone: ---
Target Release: ODF 4.13.0
Hardware: Unspecified
OS: Unspecified
Reporter: Jilju Joy <jijoy>
Assignee: Malay Kumar parida <mparida>
QA Contact: Jilju Joy <jijoy>
Docs Contact:
CC: mparida, muagarwa, nberry, ocs-bugs, odf-bz-bot
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
: 2187804 (view as bug list)
Environment:
Last Closed: 2023-06-21 15:25:02 UTC
Type: Bug
Regression: ---
Mount Type: ---
Embargoed:
Bug Depends On:
Bug Blocks: 2187804

Description Jilju Joy 2023-04-03 14:51:52 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
Pods with CephFS PVCs are not reaching the Running state due to the error given below.

$ oc describe pod -n test pod-pvc-cephfs2 | grep "Events:" -A 50
Events:
  Type     Reason                  Age                   From                     Message
  ----     ------                  ----                  ----                     -------
  Normal   Scheduled               16m                   default-scheduler        Successfully assigned test/pod-pvc-cephfs2 to ip-10-0-23-231.us-east-2.compute.internal
  Normal   SuccessfulAttachVolume  16m                   attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-c2a6475a-b82d-4a95-bbf9-a8e6f02416b2"
  Warning  FailedMount             15m                   kubelet                  MountVolume.MountDevice failed for volume "pvc-c2a6475a-b82d-4a95-bbf9-a8e6f02416b2" : rpc error: code = Internal desc = an error (exit status 32) occurred while running mount args: [-t ceph 10.0.14.207:3300,10.0.17.36:3300,10.0.21.79:3300:/volumes/cephfilesystemsubvolumegroup-storageconsumer-6c222074-f5b2-4925-a227-8957e0353725/csi-vol-9fcf671e-6b8d-49ca-9bc8-9d7ce256d567/fb7d71c0-5125-40d5-9d46-59ce5698b67d /var/lib/kubelet/plugins/kubernetes.io/csi/odf-storage.cephfs.csi.ceph.com/531e47114f2308a7459c194c6cfdf117b9fe20a1c8bff1657fc985455db52636/globalmount -o name=62972e1ff4714b87ce4dea5c0107f05d,secretfile=/tmp/csi/keys/keyfile-3413363383,mds_namespace=ocs-storagecluster-cephfilesystem,_netdev] stderr: mount error 110 = Connection timed out
  Warning  FailedMount             14m                   kubelet                  MountVolume.MountDevice failed for volume "pvc-c2a6475a-b82d-4a95-bbf9-a8e6f02416b2" : rpc error: code = Internal desc = an error (exit status 32) occurred while running mount args: [-t ceph 10.0.14.207:3300,10.0.17.36:3300,10.0.21.79:3300:/volumes/cephfilesystemsubvolumegroup-storageconsumer-6c222074-f5b2-4925-a227-8957e0353725/csi-vol-9fcf671e-6b8d-49ca-9bc8-9d7ce256d567/fb7d71c0-5125-40d5-9d46-59ce5698b67d /var/lib/kubelet/plugins/kubernetes.io/csi/odf-storage.cephfs.csi.ceph.com/531e47114f2308a7459c194c6cfdf117b9fe20a1c8bff1657fc985455db52636/globalmount -o name=62972e1ff4714b87ce4dea5c0107f05d,secretfile=/tmp/csi/keys/keyfile-3942104566,mds_namespace=ocs-storagecluster-cephfilesystem,_netdev] stderr: mount error 110 = Connection timed out

........
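The remaining FailedMount events are identical. A minimal sketch for pulling them all at once (the namespace name is taken from the command above; the options are standard oc/kubectl flags):

# List every FailedMount event in the test namespace, newest last
$ oc get events -n test --field-selector reason=FailedMount --sort-by=.lastTimestamp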

Testing was done in an ODF-to-ODF on ROSA configuration. The StorageClient was created in the consumer cluster. The StorageCluster and the StorageClient are created in the namespace odf-storage on the provider and the consumer respectively.



Provider storagecluster:

$ oc get storagecluster -n odf-storage ocs-storagecluster -o yaml
apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  annotations:
    uninstall.ocs.openshift.io/cleanup-policy: delete
    uninstall.ocs.openshift.io/mode: graceful
  creationTimestamp: "2023-04-03T10:43:31Z"
  finalizers:
  - storagecluster.ocs.openshift.io
  generation: 1
  name: ocs-storagecluster
  namespace: odf-storage
  ownerReferences:
  - apiVersion: ocs.openshift.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: ManagedOCS
    name: managedocs
    uid: e283f807-2d01-4bdb-9c56-1521bb7fb143
  - apiVersion: odf.openshift.io/v1alpha1
    kind: StorageSystem
    name: ocs-storagecluster-storagesystem
    uid: 6bb41373-a6dd-4cd0-b673-8690e1627c0b
  resourceVersion: "409263"
  uid: e756a472-80ad-4b47-b08d-7a52bab3eea3
spec:
  allowRemoteStorageConsumers: true
  arbiter: {}
  defaultStorageProfile: default
  encryption:
    kms: {}
  externalStorage: {}
  hostNetwork: true
  labelSelector:
    matchExpressions:
    - key: node-role.kubernetes.io/worker
      operator: Exists
    - key: node-role.kubernetes.io/infra
      operator: DoesNotExist
  managedResources:
    cephBlockPools:
      disableSnapshotClass: true
      disableStorageClass: true
      reconcileStrategy: ignore
    cephCluster: {}
    cephConfig: {}
    cephDashboard: {}
    cephFilesystems:
      disableSnapshotClass: true
      disableStorageClass: true
    cephNonResilientPools: {}
    cephObjectStoreUsers: {}
    cephObjectStores: {}
    cephToolbox: {}
  mirroring: {}
  monPVCTemplate:
    metadata: {}
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 50Gi
      storageClassName: gp2
    status: {}
  multiCloudGateway:
    reconcileStrategy: ignore
  resources:
    crashcollector:
      limits:
        cpu: 50m
        memory: 80Mi
      requests:
        cpu: 50m
        memory: 80Mi
    mds:
      limits:
        cpu: 1500m
        memory: 8Gi
      requests:
        cpu: 1500m
        memory: 8Gi
    mgr:
      limits:
        cpu: "1"
        memory: 3Gi
      requests:
        cpu: "1"
        memory: 3Gi
    mon:
      limits:
        cpu: "1"
        memory: 2Gi
      requests:
        cpu: "1"
        memory: 2Gi
  storageDeviceSets:
  - config: {}
    count: 1
    dataPVCTemplate:
      metadata: {}
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 4Ti
        storageClassName: gp2
        volumeMode: Block
      status: {}
    deviceClass: ssd
    name: default
    placement:
      topologySpreadConstraints:
      - labelSelector:
          matchExpressions:
          - key: ceph.rook.io/pvc
            operator: Exists
        maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
    portable: true
    preparePlacement:
      topologySpreadConstraints:
      - labelSelector:
          matchExpressions:
          - key: ceph.rook.io/pvc
            operator: Exists
        maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
      - labelSelector:
          matchExpressions:
          - key: ceph.rook.io/pvc
            operator: Exists
        maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
    replica: 3
    resources:
      limits:
        cpu: 1650m
        memory: 6Gi
      requests:
        cpu: 1650m
        memory: 6Gi
  storageProfiles:
  - blockPoolConfiguration:
      parameters:
        pg_autoscale_mode: "on"
        pg_num: "128"
        pgp_num: "128"
    deviceClass: ssd
    name: default
    sharedFilesystemConfiguration:
      parameters:
        pg_autoscale_mode: "on"
        pg_num: "128"
        pgp_num: "128"
status:
  conditions:
  - lastHeartbeatTime: "2023-04-03T10:43:31Z"
    lastTransitionTime: "2023-04-03T10:43:31Z"
    message: Version check successful
    reason: VersionMatched
    status: "False"
    type: VersionMismatch
  - lastHeartbeatTime: "2023-04-03T14:43:24Z"
    lastTransitionTime: "2023-04-03T10:43:42Z"
    message: Reconcile completed successfully
    reason: ReconcileCompleted
    status: "True"
    type: ReconcileComplete
  - lastHeartbeatTime: "2023-04-03T14:43:24Z"
    lastTransitionTime: "2023-04-03T10:47:26Z"
    message: Reconcile completed successfully
    reason: ReconcileCompleted
    status: "True"
    type: Available
  - lastHeartbeatTime: "2023-04-03T14:43:24Z"
    lastTransitionTime: "2023-04-03T10:57:55Z"
    message: Reconcile completed successfully
    reason: ReconcileCompleted
    status: "False"
    type: Progressing
  - lastHeartbeatTime: "2023-04-03T14:43:24Z"
    lastTransitionTime: "2023-04-03T10:43:31Z"
    message: Reconcile completed successfully
    reason: ReconcileCompleted
    status: "False"
    type: Degraded
  - lastHeartbeatTime: "2023-04-03T14:43:24Z"
    lastTransitionTime: "2023-04-03T10:57:55Z"
    message: Reconcile completed successfully
    reason: ReconcileCompleted
    status: "True"
    type: Upgradeable
  externalStorage:
    grantedCapacity: "0"
  failureDomain: zone
  failureDomainKey: topology.kubernetes.io/zone
  failureDomainValues:
  - us-east-2a
  - us-east-2b
  - us-east-2c
  images:
    ceph:
      actualImage: quay.io/rhceph-dev/rhceph@sha256:f916da02f59b8f73ad18eb65310333d1e3cbd1a54678ff50bf27ed9618719b63
      desiredImage: quay.io/rhceph-dev/rhceph@sha256:f916da02f59b8f73ad18eb65310333d1e3cbd1a54678ff50bf27ed9618719b63
    noobaaCore:
      desiredImage: quay.io/rhceph-dev/odf4-mcg-core-rhel9@sha256:08e5ae2e9869d434bd7d522dd81bb9900c71b05d63bbfde474d8d2250dd5fae9
    noobaaDB:
      desiredImage: quay.io/rhceph-dev/rhel8-postgresql-12@sha256:e88838e08efabb25d2450f3c178ffe0a9e1be1a14c5bc998bde0e807c9a3f15f
  kmsServerConnection: {}
  nodeTopologies:
    labels:
      kubernetes.io/hostname:
      - ip-10-0-14-207.us-east-2.compute.internal
      - ip-10-0-17-36.us-east-2.compute.internal
      - ip-10-0-21-79.us-east-2.compute.internal
      topology.kubernetes.io/region:
      - us-east-2
      topology.kubernetes.io/zone:
      - us-east-2a
      - us-east-2b
      - us-east-2c
  phase: Ready
  relatedObjects:
  - apiVersion: ceph.rook.io/v1
    kind: CephCluster
    name: ocs-storagecluster-cephcluster
    namespace: odf-storage
    resourceVersion: "409255"
    uid: 912aba9f-1f08-4420-9011-01913598dca8
  storageProviderEndpoint: a863bd3bee9ab41bcbe001f7740ae36c-2072636220.us-east-2.elb.amazonaws.com:50051
  version: 4.13.0





Consumer storageclient:

$ oc -n odf-storage get storageclient ocs-storageclient -o yaml
apiVersion: ocs.openshift.io/v1alpha1
kind: StorageClient
metadata:
  creationTimestamp: "2023-04-03T11:14:23Z"
  finalizers:
  - storageclient.ocs.openshift.io
  generation: 1
  name: ocs-storageclient
  namespace: odf-storage
  resourceVersion: "211583"
  uid: 9c2e0d4d-be0c-4233-a872-ed946d59d2d0
spec:
  onboardingTicket: eyJpZCI6ImRhMmI1OTU0LTcwZWEtNGY3Ny1hZDU5LWU3MDAzNmQyZjk2NCIsImV4cGlyYXRpb25EYXRlIjoiMTY4MDY5MTI5NCJ9.YEfkN4KsKfeYTNVHl5L83SQMwo4zOkDLoTJlHmsj2L6lxxzJLpzGlkjHy437mRU3HWdkiBHCcOWX/7B00DJDsGUPxwfo7mM1dVvUklDGzHIw9POW/CwhSf5qzuqf5mfqer/KyLMoIKIzBHLrhF3K8mwKncBEbkxQQlhHY1D4ALYnho6MJoQEj8Qhp9YMP4/k321WrGYcoJNEDXVU3vXpRW0uFsZDDl8/XdpIKA7aA/V5lfWwrV1OP8haSDjU3p9lrC16Y2dA1X5nfnwSUq3b1+h9kzawhuv58TnMl5nk+Y49OT24hCRwpvTnZWYU26J6nXosLFGc9MekB0V5a3oI6n/mEztPrgqmftaHBY8ZydVha7TYw/IDW41TQ3odAhGx6eWqPVe/8to1u6vfrUxHNTN9faoPK0cDro64wmcD1VViDfvHNA1mb8QyfA3kEQILXbdgh3Xm8WJj9o80jtebgmJgqv8OCnQ+FSjHnCwCfTvkiLdeOzIf9VSR/EvGzsbtvZ+Z8/RIR+FQHBa/4jLI6WuUt0yQwwxpBfO1XZVtBFBH85VH8lJp8ERBxl/luXXIhLW9U+O3jETrFtjafVy+GlxI73Y7MeRKH3t0FNUmtxQRIA/nYHMsgzX4rUK7ydT8eGYpB89UJgnOtzy1W0ZmEWS6Ht/zoF2GiioqIHbyJ4U=
  storageProviderEndpoint: a863bd3bee9ab41bcbe001f7740ae36c-2072636220.us-east-2.elb.amazonaws.com:50051
status:
  id: 6c222074-f5b2-4925-a227-8957e0353725
  phase: Connected





==============================================================================

Version of all relevant components (if applicable):
OCP 4.12.8
ODF 4.13.0-121.stable
======================================================================

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes, pods cannot be created.
=====================================================================

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:
Creation of pods was working in Managed Services (ODF 4.10) clusters.

Steps to Reproduce:
1. Create provider and consumer clusters in an ODF-to-ODF on ROSA configuration.
2. Create a StorageClient on the consumer.
3. Create a CephFS PVC and attach it to a pod (a minimal sketch follows this list).
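A minimal sketch of step 3, assuming the consumer exposes the ocs-storagecluster-cephfs storage class seen in the verification below; the pod name, image, and mount path are illustrative:

$ cat <<EOF | oc create -n test -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-cephfs1
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: ocs-storagecluster-cephfs
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-pvc-cephfs1
spec:
  containers:
  - name: web-server
    # illustrative image; any image that can write to the mounted path works
    image: registry.access.redhat.com/ubi8/nginx-120
    volumeMounts:
    - name: data
      mountPath: /var/lib/www/html
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: pvc-cephfs1
EOF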


Actual results:
Pod is not reaching the "Running" state due to the mount failure.

Expected results:
Pod should reach the Running state.

Additional info:

Comment 3 Madhu Rajanna 2023-04-03 14:59:48 UTC
The Rook CephCluster is created with RequireMsgr2 for the managed service, and it should not be. We need to add a check in ocs-operator to use v1 ports for the managed-service provider cluster.

This is because the consumer cluster is supposed to work with multiple providers, and we cannot set a global configuration at the CSI level.
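A quick way to confirm this on the provider, sketched under the assumption that the Rook CephCluster CRD exposes the network.connections.requireMsgr2 field (the namespace and CephCluster name are taken from the StorageCluster status above):

$ oc get cephcluster -n odf-storage ocs-storagecluster-cephcluster \
    -o jsonpath='{.spec.network.connections.requireMsgr2}{"\n"}'

When this prints true, the mons are expected to advertise only the msgr2 port 3300, which matches the mount args in the events above.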

Comment 7 Mudit Agarwal 2023-04-03 15:45:05 UTC
This bug is because of enabling msgr-v2; I don't think it is relevant for odf-4.12?

Comment 9 Malay Kumar parida 2023-04-03 18:01:14 UTC
@mrajanna Is there any existing flag already available that we can use to identify whether a provider storagecluster is managed by MS? Or do we need to introduce some new mechanism for that?
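One signal that is at least visible in this report, purely as an illustration and not necessarily the mechanism chosen for the fix: the provider StorageCluster above is owned by a ManagedOCS object, so its owner references could be inspected, e.g.:

$ oc get storagecluster -n odf-storage ocs-storagecluster \
    -o jsonpath='{.metadata.ownerReferences[*].kind}{"\n"}'
ManagedOCS StorageSystem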

Comment 12 Jilju Joy 2023-04-10 14:49:54 UTC
Verified in version:

ocs-client-operator.v4.13.0-130.stable             
odf-csi-addons-operator.v4.13.0-130.stable         

OCP 4.12.9
----------------------------------------------------
The CephFS PVC was Bound. Attached the PVC to an app pod, and the pod reached the "Running" state.

$ oc get pvc pvc-cephfs1 
NAME          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE
pvc-cephfs1   Bound    pvc-d37321bd-9a8e-4564-9dbe-c466500bb191   10Gi       RWO            ocs-storagecluster-cephfs   6m44s


$ oc get pod
NAME              READY   STATUS    RESTARTS   AGE
pod-pvc-cephfs1   1/1     Running   0          5m36s

$ oc get pod pod-pvc-cephfs1 -o yaml | grep claimName
      claimName: pvc-cephfs1

Created file in pod.
$ oc rsh pod-pvc-cephfs1 cat /var/lib/www/html/f1.txt
123
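For reference, the file read back above can be created with an illustrative command such as:

$ oc rsh pod-pvc-cephfs1 sh -c 'echo 123 > /var/lib/www/html/f1.txt'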


Testing was done in an ODF-to-ODF on ROSA configuration without the agent. Installation of ocs-client-operator and creation of the StorageClient were manual processes. The StorageClassClaims were also created manually (see the sketch below).
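A hedged sketch of checking those manually created resources on the consumer (the resource name assumes the ocs-client-operator StorageClassClaim CRD in 4.13; the grep pattern comes from the storage class name shown above):

# List the manually created claims and the storage classes they produced
$ oc get storageclassclaim
$ oc get storageclass | grep ocs-storagecluster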

Comment 16 errata-xmlrpc 2023-06-21 15:25:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:3742