Bug 2184068 - [Fusion-aaS] Failed to mount CephFS volumes while creating pods
Summary: [Fusion-aaS] Failed to mount CephFS volumes while creating pods
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.13
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.13.0
Assignee: Malay Kumar parida
QA Contact: Jilju Joy
URL:
Whiteboard:
Depends On:
Blocks: 2187804
 
Reported: 2023-04-03 14:51 UTC by Jilju Joy
Modified: 2023-08-09 17:00 UTC
CC List: 5 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 2187804 (view as bug list)
Environment:
Last Closed: 2023-06-21 15:25:02 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github red-hat-storage ocs-operator pull 1993 0 None open Don't require msgr2 port for provider/consumer clusters 2023-04-04 13:01:37 UTC
Github red-hat-storage ocs-operator pull 1994 0 None open Bug 2184068: [release-4.13] Don't require msgr2 port for provider/consumer clusters 2023-04-05 09:42:33 UTC
Red Hat Product Errata RHBA-2023:3742 0 None None None 2023-06-21 15:25:32 UTC

Description Jilju Joy 2023-04-03 14:51:52 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
Pods with CephFS PVCs are not reaching the Running state due to the error given below.

$ oc describe pod -n test pod-pvc-cephfs2 | grep "Events:" -A 50
Events:
  Type     Reason                  Age                   From                     Message
  ----     ------                  ----                  ----                     -------
  Normal   Scheduled               16m                   default-scheduler        Successfully assigned test/pod-pvc-cephfs2 to ip-10-0-23-231.us-east-2.compute.internal
  Normal   SuccessfulAttachVolume  16m                   attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-c2a6475a-b82d-4a95-bbf9-a8e6f02416b2"
  Warning  FailedMount             15m                   kubelet                  MountVolume.MountDevice failed for volume "pvc-c2a6475a-b82d-4a95-bbf9-a8e6f02416b2" : rpc error: code = Internal desc = an error (exit status 32) occurred while running mount args: [-t ceph 10.0.14.207:3300,10.0.17.36:3300,10.0.21.79:3300:/volumes/cephfilesystemsubvolumegroup-storageconsumer-6c222074-f5b2-4925-a227-8957e0353725/csi-vol-9fcf671e-6b8d-49ca-9bc8-9d7ce256d567/fb7d71c0-5125-40d5-9d46-59ce5698b67d /var/lib/kubelet/plugins/kubernetes.io/csi/odf-storage.cephfs.csi.ceph.com/531e47114f2308a7459c194c6cfdf117b9fe20a1c8bff1657fc985455db52636/globalmount -o name=62972e1ff4714b87ce4dea5c0107f05d,secretfile=/tmp/csi/keys/keyfile-3413363383,mds_namespace=ocs-storagecluster-cephfilesystem,_netdev] stderr: mount error 110 = Connection timed out
  Warning  FailedMount             14m                   kubelet                  MountVolume.MountDevice failed for volume "pvc-c2a6475a-b82d-4a95-bbf9-a8e6f02416b2" : rpc error: code = Internal desc = an error (exit status 32) occurred while running mount args: [-t ceph 10.0.14.207:3300,10.0.17.36:3300,10.0.21.79:3300:/volumes/cephfilesystemsubvolumegroup-storageconsumer-6c222074-f5b2-4925-a227-8957e0353725/csi-vol-9fcf671e-6b8d-49ca-9bc8-9d7ce256d567/fb7d71c0-5125-40d5-9d46-59ce5698b67d /var/lib/kubelet/plugins/kubernetes.io/csi/odf-storage.cephfs.csi.ceph.com/531e47114f2308a7459c194c6cfdf117b9fe20a1c8bff1657fc985455db52636/globalmount -o name=62972e1ff4714b87ce4dea5c0107f05d,secretfile=/tmp/csi/keys/keyfile-3942104566,mds_namespace=ocs-storagecluster-cephfilesystem,_netdev] stderr: mount error 110 = Connection timed out

........

Testing was done on an ODF-to-ODF on ROSA configuration. The StorageClient was created in the consumer cluster. The StorageCluster and StorageClient are created in the namespace odf-storage on the provider and consumer, respectively.



Provider storagecluster:

$ oc get storagecluster -n odf-storage ocs-storagecluster -o yaml
apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  annotations:
    uninstall.ocs.openshift.io/cleanup-policy: delete
    uninstall.ocs.openshift.io/mode: graceful
  creationTimestamp: "2023-04-03T10:43:31Z"
  finalizers:
  - storagecluster.ocs.openshift.io
  generation: 1
  name: ocs-storagecluster
  namespace: odf-storage
  ownerReferences:
  - apiVersion: ocs.openshift.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: ManagedOCS
    name: managedocs
    uid: e283f807-2d01-4bdb-9c56-1521bb7fb143
  - apiVersion: odf.openshift.io/v1alpha1
    kind: StorageSystem
    name: ocs-storagecluster-storagesystem
    uid: 6bb41373-a6dd-4cd0-b673-8690e1627c0b
  resourceVersion: "409263"
  uid: e756a472-80ad-4b47-b08d-7a52bab3eea3
spec:
  allowRemoteStorageConsumers: true
  arbiter: {}
  defaultStorageProfile: default
  encryption:
    kms: {}
  externalStorage: {}
  hostNetwork: true
  labelSelector:
    matchExpressions:
    - key: node-role.kubernetes.io/worker
      operator: Exists
    - key: node-role.kubernetes.io/infra
      operator: DoesNotExist
  managedResources:
    cephBlockPools:
      disableSnapshotClass: true
      disableStorageClass: true
      reconcileStrategy: ignore
    cephCluster: {}
    cephConfig: {}
    cephDashboard: {}
    cephFilesystems:
      disableSnapshotClass: true
      disableStorageClass: true
    cephNonResilientPools: {}
    cephObjectStoreUsers: {}
    cephObjectStores: {}
    cephToolbox: {}
  mirroring: {}
  monPVCTemplate:
    metadata: {}
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 50Gi
      storageClassName: gp2
    status: {}
  multiCloudGateway:
    reconcileStrategy: ignore
  resources:
    crashcollector:
      limits:
        cpu: 50m
        memory: 80Mi
      requests:
        cpu: 50m
        memory: 80Mi
    mds:
      limits:
        cpu: 1500m
        memory: 8Gi
      requests:
        cpu: 1500m
        memory: 8Gi
    mgr:
      limits:
        cpu: "1"
        memory: 3Gi
      requests:
        cpu: "1"
        memory: 3Gi
    mon:
      limits:
        cpu: "1"
        memory: 2Gi
      requests:
        cpu: "1"
        memory: 2Gi
  storageDeviceSets:
  - config: {}
    count: 1
    dataPVCTemplate:
      metadata: {}
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 4Ti
        storageClassName: gp2
        volumeMode: Block
      status: {}
    deviceClass: ssd
    name: default
    placement:
      topologySpreadConstraints:
      - labelSelector:
          matchExpressions:
          - key: ceph.rook.io/pvc
            operator: Exists
        maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
    portable: true
    preparePlacement:
      topologySpreadConstraints:
      - labelSelector:
          matchExpressions:
          - key: ceph.rook.io/pvc
            operator: Exists
        maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
      - labelSelector:
          matchExpressions:
          - key: ceph.rook.io/pvc
            operator: Exists
        maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
    replica: 3
    resources:
      limits:
        cpu: 1650m
        memory: 6Gi
      requests:
        cpu: 1650m
        memory: 6Gi
  storageProfiles:
  - blockPoolConfiguration:
      parameters:
        pg_autoscale_mode: "on"
        pg_num: "128"
        pgp_num: "128"
    deviceClass: ssd
    name: default
    sharedFilesystemConfiguration:
      parameters:
        pg_autoscale_mode: "on"
        pg_num: "128"
        pgp_num: "128"
status:
  conditions:
  - lastHeartbeatTime: "2023-04-03T10:43:31Z"
    lastTransitionTime: "2023-04-03T10:43:31Z"
    message: Version check successful
    reason: VersionMatched
    status: "False"
    type: VersionMismatch
  - lastHeartbeatTime: "2023-04-03T14:43:24Z"
    lastTransitionTime: "2023-04-03T10:43:42Z"
    message: Reconcile completed successfully
    reason: ReconcileCompleted
    status: "True"
    type: ReconcileComplete
  - lastHeartbeatTime: "2023-04-03T14:43:24Z"
    lastTransitionTime: "2023-04-03T10:47:26Z"
    message: Reconcile completed successfully
    reason: ReconcileCompleted
    status: "True"
    type: Available
  - lastHeartbeatTime: "2023-04-03T14:43:24Z"
    lastTransitionTime: "2023-04-03T10:57:55Z"
    message: Reconcile completed successfully
    reason: ReconcileCompleted
    status: "False"
    type: Progressing
  - lastHeartbeatTime: "2023-04-03T14:43:24Z"
    lastTransitionTime: "2023-04-03T10:43:31Z"
    message: Reconcile completed successfully
    reason: ReconcileCompleted
    status: "False"
    type: Degraded
  - lastHeartbeatTime: "2023-04-03T14:43:24Z"
    lastTransitionTime: "2023-04-03T10:57:55Z"
    message: Reconcile completed successfully
    reason: ReconcileCompleted
    status: "True"
    type: Upgradeable
  externalStorage:
    grantedCapacity: "0"
  failureDomain: zone
  failureDomainKey: topology.kubernetes.io/zone
  failureDomainValues:
  - us-east-2a
  - us-east-2b
  - us-east-2c
  images:
    ceph:
      actualImage: quay.io/rhceph-dev/rhceph@sha256:f916da02f59b8f73ad18eb65310333d1e3cbd1a54678ff50bf27ed9618719b63
      desiredImage: quay.io/rhceph-dev/rhceph@sha256:f916da02f59b8f73ad18eb65310333d1e3cbd1a54678ff50bf27ed9618719b63
    noobaaCore:
      desiredImage: quay.io/rhceph-dev/odf4-mcg-core-rhel9@sha256:08e5ae2e9869d434bd7d522dd81bb9900c71b05d63bbfde474d8d2250dd5fae9
    noobaaDB:
      desiredImage: quay.io/rhceph-dev/rhel8-postgresql-12@sha256:e88838e08efabb25d2450f3c178ffe0a9e1be1a14c5bc998bde0e807c9a3f15f
  kmsServerConnection: {}
  nodeTopologies:
    labels:
      kubernetes.io/hostname:
      - ip-10-0-14-207.us-east-2.compute.internal
      - ip-10-0-17-36.us-east-2.compute.internal
      - ip-10-0-21-79.us-east-2.compute.internal
      topology.kubernetes.io/region:
      - us-east-2
      topology.kubernetes.io/zone:
      - us-east-2a
      - us-east-2b
      - us-east-2c
  phase: Ready
  relatedObjects:
  - apiVersion: ceph.rook.io/v1
    kind: CephCluster
    name: ocs-storagecluster-cephcluster
    namespace: odf-storage
    resourceVersion: "409255"
    uid: 912aba9f-1f08-4420-9011-01913598dca8
  storageProviderEndpoint: a863bd3bee9ab41bcbe001f7740ae36c-2072636220.us-east-2.elb.amazonaws.com:50051
  version: 4.13.0





Consumer storageclient:

$ oc -n odf-storage get storageclient ocs-storageclient -o yaml
apiVersion: ocs.openshift.io/v1alpha1
kind: StorageClient
metadata:
  creationTimestamp: "2023-04-03T11:14:23Z"
  finalizers:
  - storageclient.ocs.openshift.io
  generation: 1
  name: ocs-storageclient
  namespace: odf-storage
  resourceVersion: "211583"
  uid: 9c2e0d4d-be0c-4233-a872-ed946d59d2d0
spec:
  onboardingTicket: eyJpZCI6ImRhMmI1OTU0LTcwZWEtNGY3Ny1hZDU5LWU3MDAzNmQyZjk2NCIsImV4cGlyYXRpb25EYXRlIjoiMTY4MDY5MTI5NCJ9.YEfkN4KsKfeYTNVHl5L83SQMwo4zOkDLoTJlHmsj2L6lxxzJLpzGlkjHy437mRU3HWdkiBHCcOWX/7B00DJDsGUPxwfo7mM1dVvUklDGzHIw9POW/CwhSf5qzuqf5mfqer/KyLMoIKIzBHLrhF3K8mwKncBEbkxQQlhHY1D4ALYnho6MJoQEj8Qhp9YMP4/k321WrGYcoJNEDXVU3vXpRW0uFsZDDl8/XdpIKA7aA/V5lfWwrV1OP8haSDjU3p9lrC16Y2dA1X5nfnwSUq3b1+h9kzawhuv58TnMl5nk+Y49OT24hCRwpvTnZWYU26J6nXosLFGc9MekB0V5a3oI6n/mEztPrgqmftaHBY8ZydVha7TYw/IDW41TQ3odAhGx6eWqPVe/8to1u6vfrUxHNTN9faoPK0cDro64wmcD1VViDfvHNA1mb8QyfA3kEQILXbdgh3Xm8WJj9o80jtebgmJgqv8OCnQ+FSjHnCwCfTvkiLdeOzIf9VSR/EvGzsbtvZ+Z8/RIR+FQHBa/4jLI6WuUt0yQwwxpBfO1XZVtBFBH85VH8lJp8ERBxl/luXXIhLW9U+O3jETrFtjafVy+GlxI73Y7MeRKH3t0FNUmtxQRIA/nYHMsgzX4rUK7ydT8eGYpB89UJgnOtzy1W0ZmEWS6Ht/zoF2GiioqIHbyJ4U=
  storageProviderEndpoint: a863bd3bee9ab41bcbe001f7740ae36c-2072636220.us-east-2.elb.amazonaws.com:50051
status:
  id: 6c222074-f5b2-4925-a227-8957e0353725
  phase: Connected





==============================================================================

Version of all relevant components (if applicable):
OCP 4.12.8
ODF 4.13.0-121.stable
======================================================================

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes, pods cannot be created.
=====================================================================

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:
Creation of pods was working in Managed Services (ODF 4.10) clusters.

Steps to Reproduce:
1. Create provider and consumer clusters in an ODF-to-ODF on ROSA configuration.
2. Create storageclient on consumer.
3. Create CephFS PVC and attach it to pod.
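
Step 3 can be reproduced with a minimal PVC-plus-pod manifest along these lines. This is an illustrative sketch, not the manifests from the original run: the pod name and mount path come from the outputs above, but the image, PVC size, and namespace are assumptions to adjust as needed.

```yaml
# Illustrative manifests for step 3. The storage class is the consumer-side
# CephFS class seen in the verification output; the image and size are
# placeholders.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-cephfs2
  namespace: test
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: ocs-storagecluster-cephfs
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-pvc-cephfs2
  namespace: test
spec:
  containers:
  - name: app
    image: registry.access.redhat.com/ubi8/ubi-minimal   # placeholder image
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: data
      mountPath: /var/lib/www/html
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: pvc-cephfs2
```

With the bug present, the pod stays in ContainerCreating and `oc describe pod` shows the MountDevice timeout events quoted above.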


Actual results:
Pod is not reaching the "Running" state due to a mount failure error.

Expected results:
Pod should reach the Running state.

Additional info:

Comment 3 Madhu Rajanna 2023-04-03 14:59:48 UTC
The Rook CephCluster is created with RequireMsgr2 for the managed service, but it should not be; we need to add a check in ocs-operator to use v1 ports for the managed-service provider cluster.

This is because the consumer cluster is supposed to work with multiple providers, and we cannot set a global configuration at the CSI level.
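
The fix direction (the linked PR "Don't require msgr2 port for provider/consumer clusters") maps to Rook's CephCluster network settings. A sketch of the relevant fragment for a provider/consumer cluster — illustrative, not the literal operator output:

```yaml
# Illustrative CephCluster fragment: for provider/consumer (managed-service)
# clusters the operator should not force msgr2, so that mons keep serving the
# msgr1 port (6789) alongside msgr2 (3300) and consumer-side CephFS mounts
# are not pointed at msgr2-only endpoints.
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: ocs-storagecluster-cephcluster
  namespace: odf-storage
spec:
  network:
    connections:
      requireMsgr2: false   # must not be true for provider/consumer clusters
```

This matches the failure signature above: the kernel CephFS mount was handed only the 3300 (msgr2) mon endpoints and timed out.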

Comment 7 Mudit Agarwal 2023-04-03 15:45:05 UTC
This bug is because of enabling msgr-v2; I don't think it is relevant for ODF 4.12?

Comment 9 Malay Kumar parida 2023-04-03 18:01:14 UTC
@mrajanna Are there any existing flags available that we can use to identify whether a provider StorageCluster is managed by MS? Or do we need to introduce a new mechanism for that?

Comment 12 Jilju Joy 2023-04-10 14:49:54 UTC
Verified in version:

ocs-client-operator.v4.13.0-130.stable             
odf-csi-addons-operator.v4.13.0-130.stable         

OCP 4.12.9
----------------------------------------------------
CephFS PVC was Bound. Attached the PVC to an app pod and the pod reached the state "Running".

$ oc get pvc pvc-cephfs1 
NAME          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE
pvc-cephfs1   Bound    pvc-d37321bd-9a8e-4564-9dbe-c466500bb191   10Gi       RWO            ocs-storagecluster-cephfs   6m44s


$ oc get pod
NAME              READY   STATUS    RESTARTS   AGE
pod-pvc-cephfs1   1/1     Running   0          5m36s

$ oc get pod pod-pvc-cephfs1 -o yaml | grep claimName
      claimName: pvc-cephfs1

Created file in pod.
$ oc rsh pod-pvc-cephfs1 cat /var/lib/www/html/f1.txt
123


Testing was done on an ODF-to-ODF on ROSA configuration without the agent. Installation of ocs-client-operator and creation of the StorageClient were manual processes. The StorageClassClaims were also created manually.

Comment 16 errata-xmlrpc 2023-06-21 15:25:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:3742

