Description of problem (please be as detailed as possible and provide log snippets):

Pods with CephFS PVCs are not reaching the Running state due to the error given below.

$ oc describe pod -n test pod-pvc-cephfs2 | grep "Events:" -A 50
Events:
  Type     Reason                  Age   From                     Message
  ----     ------                  ----  ----                     -------
  Normal   Scheduled               16m   default-scheduler        Successfully assigned test/pod-pvc-cephfs2 to ip-10-0-23-231.us-east-2.compute.internal
  Normal   SuccessfulAttachVolume  16m   attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-c2a6475a-b82d-4a95-bbf9-a8e6f02416b2"
  Warning  FailedMount             15m   kubelet                  MountVolume.MountDevice failed for volume "pvc-c2a6475a-b82d-4a95-bbf9-a8e6f02416b2" : rpc error: code = Internal desc = an error (exit status 32) occurred while running mount args: [-t ceph 10.0.14.207:3300,10.0.17.36:3300,10.0.21.79:3300:/volumes/cephfilesystemsubvolumegroup-storageconsumer-6c222074-f5b2-4925-a227-8957e0353725/csi-vol-9fcf671e-6b8d-49ca-9bc8-9d7ce256d567/fb7d71c0-5125-40d5-9d46-59ce5698b67d /var/lib/kubelet/plugins/kubernetes.io/csi/odf-storage.cephfs.csi.ceph.com/531e47114f2308a7459c194c6cfdf117b9fe20a1c8bff1657fc985455db52636/globalmount -o name=62972e1ff4714b87ce4dea5c0107f05d,secretfile=/tmp/csi/keys/keyfile-3413363383,mds_namespace=ocs-storagecluster-cephfilesystem,_netdev] stderr: mount error 110 = Connection timed out
  Warning  FailedMount             14m   kubelet                  MountVolume.MountDevice failed for volume "pvc-c2a6475a-b82d-4a95-bbf9-a8e6f02416b2" : rpc error: code = Internal desc = an error (exit status 32) occurred while running mount args: [-t ceph 10.0.14.207:3300,10.0.17.36:3300,10.0.21.79:3300:/volumes/cephfilesystemsubvolumegroup-storageconsumer-6c222074-f5b2-4925-a227-8957e0353725/csi-vol-9fcf671e-6b8d-49ca-9bc8-9d7ce256d567/fb7d71c0-5125-40d5-9d46-59ce5698b67d /var/lib/kubelet/plugins/kubernetes.io/csi/odf-storage.cephfs.csi.ceph.com/531e47114f2308a7459c194c6cfdf117b9fe20a1c8bff1657fc985455db52636/globalmount -o name=62972e1ff4714b87ce4dea5c0107f05d,secretfile=/tmp/csi/keys/keyfile-3942104566,mds_namespace=ocs-storagecluster-cephfilesystem,_netdev] stderr: mount error 110 = Connection timed out
  ........

Testing was done on the ODF to ODF on ROSA configuration. The StorageClient was created in the consumer cluster. The StorageCluster and StorageClient are created in the namespace odf-storage on the provider and consumer respectively.
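Note that the mount string in the events uses only port 3300 on every mon; 3300 is the msgr2 port, while legacy msgr1 clients use 6789. As a quick check (a sketch; the toolbox deployment name is an assumption and the output below is illustrative, not captured from this cluster), the monmap shows which protocol versions the mons advertise:

$ oc -n odf-storage rsh deploy/rook-ceph-tools ceph mon dump
...
0: v2:10.0.14.207:3300/0 mon.a
1: v2:10.0.17.36:3300/0 mon.b
2: v2:10.0.21.79:3300/0 mon.c

If the mons advertise v2-only addresses, a client that cannot negotiate msgr2 times out exactly as seen in the events above.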
Provider storagecluster:

$ oc get storagecluster -n odf-storage ocs-storagecluster -o yaml
apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  annotations:
    uninstall.ocs.openshift.io/cleanup-policy: delete
    uninstall.ocs.openshift.io/mode: graceful
  creationTimestamp: "2023-04-03T10:43:31Z"
  finalizers:
  - storagecluster.ocs.openshift.io
  generation: 1
  name: ocs-storagecluster
  namespace: odf-storage
  ownerReferences:
  - apiVersion: ocs.openshift.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: ManagedOCS
    name: managedocs
    uid: e283f807-2d01-4bdb-9c56-1521bb7fb143
  - apiVersion: odf.openshift.io/v1alpha1
    kind: StorageSystem
    name: ocs-storagecluster-storagesystem
    uid: 6bb41373-a6dd-4cd0-b673-8690e1627c0b
  resourceVersion: "409263"
  uid: e756a472-80ad-4b47-b08d-7a52bab3eea3
spec:
  allowRemoteStorageConsumers: true
  arbiter: {}
  defaultStorageProfile: default
  encryption:
    kms: {}
  externalStorage: {}
  hostNetwork: true
  labelSelector:
    matchExpressions:
    - key: node-role.kubernetes.io/worker
      operator: Exists
    - key: node-role.kubernetes.io/infra
      operator: DoesNotExist
  managedResources:
    cephBlockPools:
      disableSnapshotClass: true
      disableStorageClass: true
      reconcileStrategy: ignore
    cephCluster: {}
    cephConfig: {}
    cephDashboard: {}
    cephFilesystems:
      disableSnapshotClass: true
      disableStorageClass: true
    cephNonResilientPools: {}
    cephObjectStoreUsers: {}
    cephObjectStores: {}
    cephToolbox: {}
    mirroring: {}
  monPVCTemplate:
    metadata: {}
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 50Gi
      storageClassName: gp2
    status: {}
  multiCloudGateway:
    reconcileStrategy: ignore
  resources:
    crashcollector:
      limits:
        cpu: 50m
        memory: 80Mi
      requests:
        cpu: 50m
        memory: 80Mi
    mds:
      limits:
        cpu: 1500m
        memory: 8Gi
      requests:
        cpu: 1500m
        memory: 8Gi
    mgr:
      limits:
        cpu: "1"
        memory: 3Gi
      requests:
        cpu: "1"
        memory: 3Gi
    mon:
      limits:
        cpu: "1"
        memory: 2Gi
      requests:
        cpu: "1"
        memory: 2Gi
  storageDeviceSets:
  - config: {}
    count: 1
    dataPVCTemplate:
      metadata: {}
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 4Ti
        storageClassName: gp2
        volumeMode: Block
      status: {}
    deviceClass: ssd
    name: default
    placement:
      topologySpreadConstraints:
      - labelSelector:
          matchExpressions:
          - key: ceph.rook.io/pvc
            operator: Exists
        maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
    portable: true
    preparePlacement:
      topologySpreadConstraints:
      - labelSelector:
          matchExpressions:
          - key: ceph.rook.io/pvc
            operator: Exists
        maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
      - labelSelector:
          matchExpressions:
          - key: ceph.rook.io/pvc
            operator: Exists
        maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
    replica: 3
    resources:
      limits:
        cpu: 1650m
        memory: 6Gi
      requests:
        cpu: 1650m
        memory: 6Gi
  storageProfiles:
  - blockPoolConfiguration:
      parameters:
        pg_autoscale_mode: "on"
        pg_num: "128"
        pgp_num: "128"
    deviceClass: ssd
    name: default
    sharedFilesystemConfiguration:
      parameters:
        pg_autoscale_mode: "on"
        pg_num: "128"
        pgp_num: "128"
status:
  conditions:
  - lastHeartbeatTime: "2023-04-03T10:43:31Z"
    lastTransitionTime: "2023-04-03T10:43:31Z"
    message: Version check successful
    reason: VersionMatched
    status: "False"
    type: VersionMismatch
  - lastHeartbeatTime: "2023-04-03T14:43:24Z"
    lastTransitionTime: "2023-04-03T10:43:42Z"
    message: Reconcile completed successfully
    reason: ReconcileCompleted
    status: "True"
    type: ReconcileComplete
  - lastHeartbeatTime: "2023-04-03T14:43:24Z"
    lastTransitionTime: "2023-04-03T10:47:26Z"
    message: Reconcile completed successfully
    reason: ReconcileCompleted
    status: "True"
    type: Available
  - lastHeartbeatTime: "2023-04-03T14:43:24Z"
    lastTransitionTime: "2023-04-03T10:57:55Z"
    message: Reconcile completed successfully
    reason: ReconcileCompleted
    status: "False"
    type: Progressing
  - lastHeartbeatTime: "2023-04-03T14:43:24Z"
    lastTransitionTime: "2023-04-03T10:43:31Z"
    message: Reconcile completed successfully
    reason: ReconcileCompleted
    status: "False"
    type: Degraded
  - lastHeartbeatTime: "2023-04-03T14:43:24Z"
    lastTransitionTime: "2023-04-03T10:57:55Z"
    message: Reconcile completed successfully
    reason: ReconcileCompleted
    status: "True"
    type: Upgradeable
  externalStorage:
    grantedCapacity: "0"
  failureDomain: zone
  failureDomainKey: topology.kubernetes.io/zone
  failureDomainValues:
  - us-east-2a
  - us-east-2b
  - us-east-2c
  images:
    ceph:
      actualImage: quay.io/rhceph-dev/rhceph@sha256:f916da02f59b8f73ad18eb65310333d1e3cbd1a54678ff50bf27ed9618719b63
      desiredImage: quay.io/rhceph-dev/rhceph@sha256:f916da02f59b8f73ad18eb65310333d1e3cbd1a54678ff50bf27ed9618719b63
    noobaaCore:
      desiredImage: quay.io/rhceph-dev/odf4-mcg-core-rhel9@sha256:08e5ae2e9869d434bd7d522dd81bb9900c71b05d63bbfde474d8d2250dd5fae9
    noobaaDB:
      desiredImage: quay.io/rhceph-dev/rhel8-postgresql-12@sha256:e88838e08efabb25d2450f3c178ffe0a9e1be1a14c5bc998bde0e807c9a3f15f
  kmsServerConnection: {}
  nodeTopologies:
    labels:
      kubernetes.io/hostname:
      - ip-10-0-14-207.us-east-2.compute.internal
      - ip-10-0-17-36.us-east-2.compute.internal
      - ip-10-0-21-79.us-east-2.compute.internal
      topology.kubernetes.io/region:
      - us-east-2
      topology.kubernetes.io/zone:
      - us-east-2a
      - us-east-2b
      - us-east-2c
  phase: Ready
  relatedObjects:
  - apiVersion: ceph.rook.io/v1
    kind: CephCluster
    name: ocs-storagecluster-cephcluster
    namespace: odf-storage
    resourceVersion: "409255"
    uid: 912aba9f-1f08-4420-9011-01913598dca8
  storageProviderEndpoint: a863bd3bee9ab41bcbe001f7740ae36c-2072636220.us-east-2.elb.amazonaws.com:50051
  version: 4.13.0

Consumer storageclient:

$ oc -n odf-storage get storageclient ocs-storageclient -o yaml
apiVersion: ocs.openshift.io/v1alpha1
kind: StorageClient
metadata:
  creationTimestamp: "2023-04-03T11:14:23Z"
  finalizers:
  - storageclient.ocs.openshift.io
  generation: 1
  name: ocs-storageclient
  namespace: odf-storage
  resourceVersion: "211583"
  uid: 9c2e0d4d-be0c-4233-a872-ed946d59d2d0
spec:
  onboardingTicket: eyJpZCI6ImRhMmI1OTU0LTcwZWEtNGY3Ny1hZDU5LWU3MDAzNmQyZjk2NCIsImV4cGlyYXRpb25EYXRlIjoiMTY4MDY5MTI5NCJ9.YEfkN4KsKfeYTNVHl5L83SQMwo4zOkDLoTJlHmsj2L6lxxzJLpzGlkjHy437mRU3HWdkiBHCcOWX/7B00DJDsGUPxwfo7mM1dVvUklDGzHIw9POW/CwhSf5qzuqf5mfqer/KyLMoIKIzBHLrhF3K8mwKncBEbkxQQlhHY1D4ALYnho6MJoQEj8Qhp9YMP4/k321WrGYcoJNEDXVU3vXpRW0uFsZDDl8/XdpIKA7aA/V5lfWwrV1OP8haSDjU3p9lrC16Y2dA1X5nfnwSUq3b1+h9kzawhuv58TnMl5nk+Y49OT24hCRwpvTnZWYU26J6nXosLFGc9MekB0V5a3oI6n/mEztPrgqmftaHBY8ZydVha7TYw/IDW41TQ3odAhGx6eWqPVe/8to1u6vfrUxHNTN9faoPK0cDro64wmcD1VViDfvHNA1mb8QyfA3kEQILXbdgh3Xm8WJj9o80jtebgmJgqv8OCnQ+FSjHnCwCfTvkiLdeOzIf9VSR/EvGzsbtvZ+Z8/RIR+FQHBa/4jLI6WuUt0yQwwxpBfO1XZVtBFBH85VH8lJp8ERBxl/luXXIhLW9U+O3jETrFtjafVy+GlxI73Y7MeRKH3t0FNUmtxQRIA/nYHMsgzX4rUK7ydT8eGYpB89UJgnOtzy1W0ZmEWS6Ht/zoF2GiioqIHbyJ4U=
  storageProviderEndpoint: a863bd3bee9ab41bcbe001f7740ae36c-2072636220.us-east-2.elb.amazonaws.com:50051
status:
  id: 6c222074-f5b2-4925-a227-8957e0353725
  phase: Connected

==============================================================================
Version of all relevant components (if applicable):
OCP 4.12.8
ODF 4.13.0-121.stable
======================================================================
Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes, pods cannot be created.

=====================================================================
Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Can this issue be reproduced?
Yes

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:
Creation of pods was working in Managed Services (ODF 4.10) clusters.

Steps to Reproduce:
1. Create provider and consumer clusters in the ODF to ODF on ROSA configuration.
2. Create a storageclient on the consumer.
3. Create a CephFS PVC and attach it to a pod.

Actual results:
The pod is not reaching the Running state due to the mount failure.

Expected results:
The pod reaches the Running state.

Additional info:
The Rook CephCluster is created with RequireMsgr2 for the managed service, and it should not be. We need to add a check in ocs-operator to use the v1 ports for the managed-service provider cluster. This is because the consumer cluster is supposed to work with multiple providers, and we cannot set a global configuration at the CSI level.
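For reference, the setting in question is on the Rook CephCluster network spec; a minimal sketch (field names from Rook's CephCluster CRD, values illustrative of the broken state):

spec:
  network:
    connections:
      # true makes the mons require msgr2 and advertise only port 3300,
      # which is what the consumer's kernel mount above times out against
      requireMsgr2: true

The proposed ocs-operator check would leave requireMsgr2 at false (the default) for managed-service provider clusters, so the mons keep exposing the v1 port (6789) that consumer clusters can reach.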
This bug is caused by enabling msgr2; I don't think it is relevant for odf-4.12?
@mrajanna Are there any existing flags already available that we can use to identify whether a provider storagecluster is managed by MS? Or do we need to introduce some new mechanism for that?
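One possible signal is already visible in the provider CR above: the StorageCluster carries an ownerReference to a ManagedOCS resource. As a sketch only (not necessarily the check that will be implemented), that owner reference can be queried directly:

$ oc get storagecluster -n odf-storage ocs-storagecluster \
    -o jsonpath='{.metadata.ownerReferences[?(@.kind=="ManagedOCS")].name}'
managedocs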
Verified in version:
ocs-client-operator.v4.13.0-130.stable
odf-csi-addons-operator.v4.13.0-130.stable
OCP 4.12.9
----------------------------------------------------
The CephFS PVC was Bound. Attached the PVC to an app pod, and the pod reached the Running state.

$ oc get pvc pvc-cephfs1
NAME          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE
pvc-cephfs1   Bound    pvc-d37321bd-9a8e-4564-9dbe-c466500bb191   10Gi       RWO            ocs-storagecluster-cephfs   6m44s

$ oc get pod
NAME              READY   STATUS    RESTARTS   AGE
pod-pvc-cephfs1   1/1     Running   0          5m36s

$ oc get pod pod-pvc-cephfs1 -o yaml | grep claimName
      claimName: pvc-cephfs1

Created a file in the pod:

$ oc rsh pod-pvc-cephfs1 cat /var/lib/www/html/f1.txt
123

Testing was done on the ODF to ODF on ROSA configuration without the agent. Installation of ocs-client-operator and creation of the storageclient were manual processes. The storageclassclaims were created manually.
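For anyone re-running this verification, a minimal PVC/pod pair of the kind used above (the PVC name, size, access mode, and storage class come from the output; the pod image and mount path are assumptions, with the path chosen to match the oc rsh command above):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-cephfs1
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: ocs-storagecluster-cephfs
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-pvc-cephfs1
spec:
  containers:
  - name: web-server
    image: nginx   # assumed image; any image that can write to the mount works
    volumeMounts:
    - name: data
      mountPath: /var/lib/www/html   # matches the file path read via oc rsh
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: pvc-cephfs1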
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:3742