+++ This bug was initially created as a clone of Bug #2184068 +++

Description of problem (please be as detailed as possible and provide log snippets):

Pods with CephFS PVCs are not reaching the Running state due to the error given below.

$ oc describe pod -n test pod-pvc-cephfs2 | grep "Events:" -A 50
Events:
  Type     Reason                  Age  From                     Message
  ----     ------                  ---  ----                     -------
  Normal   Scheduled               16m  default-scheduler        Successfully assigned test/pod-pvc-cephfs2 to ip-10-0-23-231.us-east-2.compute.internal
  Normal   SuccessfulAttachVolume  16m  attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-c2a6475a-b82d-4a95-bbf9-a8e6f02416b2"
  Warning  FailedMount             15m  kubelet                  MountVolume.MountDevice failed for volume "pvc-c2a6475a-b82d-4a95-bbf9-a8e6f02416b2" : rpc error: code = Internal desc = an error (exit status 32) occurred while running mount args: [-t ceph 10.0.14.207:3300,10.0.17.36:3300,10.0.21.79:3300:/volumes/cephfilesystemsubvolumegroup-storageconsumer-6c222074-f5b2-4925-a227-8957e0353725/csi-vol-9fcf671e-6b8d-49ca-9bc8-9d7ce256d567/fb7d71c0-5125-40d5-9d46-59ce5698b67d /var/lib/kubelet/plugins/kubernetes.io/csi/odf-storage.cephfs.csi.ceph.com/531e47114f2308a7459c194c6cfdf117b9fe20a1c8bff1657fc985455db52636/globalmount -o name=62972e1ff4714b87ce4dea5c0107f05d,secretfile=/tmp/csi/keys/keyfile-3413363383,mds_namespace=ocs-storagecluster-cephfilesystem,_netdev] stderr: mount error 110 = Connection timed out
  Warning  FailedMount             14m  kubelet                  MountVolume.MountDevice failed for volume "pvc-c2a6475a-b82d-4a95-bbf9-a8e6f02416b2" : rpc error: code = Internal desc = an error (exit status 32) occurred while running mount args: [-t ceph 10.0.14.207:3300,10.0.17.36:3300,10.0.21.79:3300:/volumes/cephfilesystemsubvolumegroup-storageconsumer-6c222074-f5b2-4925-a227-8957e0353725/csi-vol-9fcf671e-6b8d-49ca-9bc8-9d7ce256d567/fb7d71c0-5125-40d5-9d46-59ce5698b67d /var/lib/kubelet/plugins/kubernetes.io/csi/odf-storage.cephfs.csi.ceph.com/531e47114f2308a7459c194c6cfdf117b9fe20a1c8bff1657fc985455db52636/globalmount -o name=62972e1ff4714b87ce4dea5c0107f05d,secretfile=/tmp/csi/keys/keyfile-3942104566,mds_namespace=ocs-storagecluster-cephfilesystem,_netdev] stderr: mount error 110 = Connection timed out
  ........

Testing was done on the ODF to ODF on ROSA configuration. The StorageClient was created in the consumer cluster. The StorageCluster and StorageClient are created in the namespace odf-storage on the provider and consumer respectively.
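For reference, a minimal PVC/pod pair of the kind used in this test is sketched below. The storage class, namespace, pod name, size, access mode, and mount path are taken from the outputs in this bug; the PVC name, container name, and image are illustrative assumptions only.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-cephfs2            # assumed name; the actual PVC name is not shown in the events above
  namespace: test
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: ocs-storagecluster-cephfs
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-pvc-cephfs2
  namespace: test
spec:
  containers:
  - name: app                                    # assumed container name
    image: registry.access.redhat.com/ubi9/ubi   # assumed image; any image that keeps the pod running works
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: data
      mountPath: /var/lib/www/html
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: pvc-cephfs2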
Provider storagecluster:

$ oc get storagecluster -n odf-storage ocs-storagecluster -o yaml
apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  annotations:
    uninstall.ocs.openshift.io/cleanup-policy: delete
    uninstall.ocs.openshift.io/mode: graceful
  creationTimestamp: "2023-04-03T10:43:31Z"
  finalizers:
  - storagecluster.ocs.openshift.io
  generation: 1
  name: ocs-storagecluster
  namespace: odf-storage
  ownerReferences:
  - apiVersion: ocs.openshift.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: ManagedOCS
    name: managedocs
    uid: e283f807-2d01-4bdb-9c56-1521bb7fb143
  - apiVersion: odf.openshift.io/v1alpha1
    kind: StorageSystem
    name: ocs-storagecluster-storagesystem
    uid: 6bb41373-a6dd-4cd0-b673-8690e1627c0b
  resourceVersion: "409263"
  uid: e756a472-80ad-4b47-b08d-7a52bab3eea3
spec:
  allowRemoteStorageConsumers: true
  arbiter: {}
  defaultStorageProfile: default
  encryption:
    kms: {}
  externalStorage: {}
  hostNetwork: true
  labelSelector:
    matchExpressions:
    - key: node-role.kubernetes.io/worker
      operator: Exists
    - key: node-role.kubernetes.io/infra
      operator: DoesNotExist
  managedResources:
    cephBlockPools:
      disableSnapshotClass: true
      disableStorageClass: true
      reconcileStrategy: ignore
    cephCluster: {}
    cephConfig: {}
    cephDashboard: {}
    cephFilesystems:
      disableSnapshotClass: true
      disableStorageClass: true
    cephNonResilientPools: {}
    cephObjectStoreUsers: {}
    cephObjectStores: {}
    cephToolbox: {}
  mirroring: {}
  monPVCTemplate:
    metadata: {}
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 50Gi
      storageClassName: gp2
    status: {}
  multiCloudGateway:
    reconcileStrategy: ignore
  resources:
    crashcollector:
      limits:
        cpu: 50m
        memory: 80Mi
      requests:
        cpu: 50m
        memory: 80Mi
    mds:
      limits:
        cpu: 1500m
        memory: 8Gi
      requests:
        cpu: 1500m
        memory: 8Gi
    mgr:
      limits:
        cpu: "1"
        memory: 3Gi
      requests:
        cpu: "1"
        memory: 3Gi
    mon:
      limits:
        cpu: "1"
        memory: 2Gi
      requests:
        cpu: "1"
        memory: 2Gi
  storageDeviceSets:
  - config: {}
    count: 1
    dataPVCTemplate:
      metadata: {}
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 4Ti
        storageClassName: gp2
        volumeMode: Block
      status: {}
    deviceClass: ssd
    name: default
    placement:
      topologySpreadConstraints:
      - labelSelector:
          matchExpressions:
          - key: ceph.rook.io/pvc
            operator: Exists
        maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
    portable: true
    preparePlacement:
      topologySpreadConstraints:
      - labelSelector:
          matchExpressions:
          - key: ceph.rook.io/pvc
            operator: Exists
        maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
      - labelSelector:
          matchExpressions:
          - key: ceph.rook.io/pvc
            operator: Exists
        maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
    replica: 3
    resources:
      limits:
        cpu: 1650m
        memory: 6Gi
      requests:
        cpu: 1650m
        memory: 6Gi
  storageProfiles:
  - blockPoolConfiguration:
      parameters:
        pg_autoscale_mode: "on"
        pg_num: "128"
        pgp_num: "128"
    deviceClass: ssd
    name: default
    sharedFilesystemConfiguration:
      parameters:
        pg_autoscale_mode: "on"
        pg_num: "128"
        pgp_num: "128"
status:
  conditions:
  - lastHeartbeatTime: "2023-04-03T10:43:31Z"
    lastTransitionTime: "2023-04-03T10:43:31Z"
    message: Version check successful
    reason: VersionMatched
    status: "False"
    type: VersionMismatch
  - lastHeartbeatTime: "2023-04-03T14:43:24Z"
    lastTransitionTime: "2023-04-03T10:43:42Z"
    message: Reconcile completed successfully
    reason: ReconcileCompleted
    status: "True"
    type: ReconcileComplete
  - lastHeartbeatTime: "2023-04-03T14:43:24Z"
    lastTransitionTime: "2023-04-03T10:47:26Z"
    message: Reconcile completed successfully
    reason: ReconcileCompleted
status: "True" type: Available - lastHeartbeatTime: "2023-04-03T14:43:24Z" lastTransitionTime: "2023-04-03T10:57:55Z" message: Reconcile completed successfully reason: ReconcileCompleted status: "False" type: Progressing - lastHeartbeatTime: "2023-04-03T14:43:24Z" lastTransitionTime: "2023-04-03T10:43:31Z" message: Reconcile completed successfully reason: ReconcileCompleted status: "False" type: Degraded - lastHeartbeatTime: "2023-04-03T14:43:24Z" lastTransitionTime: "2023-04-03T10:57:55Z" message: Reconcile completed successfully reason: ReconcileCompleted status: "True" type: Upgradeable externalStorage: grantedCapacity: "0" failureDomain: zone failureDomainKey: topology.kubernetes.io/zone failureDomainValues: - us-east-2a - us-east-2b - us-east-2c images: ceph: actualImage: quay.io/rhceph-dev/rhceph@sha256:f916da02f59b8f73ad18eb65310333d1e3cbd1a54678ff50bf27ed9618719b63 desiredImage: quay.io/rhceph-dev/rhceph@sha256:f916da02f59b8f73ad18eb65310333d1e3cbd1a54678ff50bf27ed9618719b63 noobaaCore: desiredImage: quay.io/rhceph-dev/odf4-mcg-core-rhel9@sha256:08e5ae2e9869d434bd7d522dd81bb9900c71b05d63bbfde474d8d2250dd5fae9 noobaaDB: desiredImage: quay.io/rhceph-dev/rhel8-postgresql-12@sha256:e88838e08efabb25d2450f3c178ffe0a9e1be1a14c5bc998bde0e807c9a3f15f kmsServerConnection: {} nodeTopologies: labels: kubernetes.io/hostname: - ip-10-0-14-207.us-east-2.compute.internal - ip-10-0-17-36.us-east-2.compute.internal - ip-10-0-21-79.us-east-2.compute.internal topology.kubernetes.io/region: - us-east-2 topology.kubernetes.io/zone: - us-east-2a - us-east-2b - us-east-2c phase: Ready relatedObjects: - apiVersion: ceph.rook.io/v1 kind: CephCluster name: ocs-storagecluster-cephcluster namespace: odf-storage resourceVersion: "409255" uid: 912aba9f-1f08-4420-9011-01913598dca8 storageProviderEndpoint: a863bd3bee9ab41bcbe001f7740ae36c-2072636220.us-east-2.elb.amazonaws.com:50051 version: 4.13.0 Consumer storageclient: $ oc -n odf-storage get storageclient ocs-storageclient -o yaml apiVersion: ocs.openshift.io/v1alpha1 kind: StorageClient metadata: creationTimestamp: "2023-04-03T11:14:23Z" finalizers: - storageclient.ocs.openshift.io generation: 1 name: ocs-storageclient namespace: odf-storage resourceVersion: "211583" uid: 9c2e0d4d-be0c-4233-a872-ed946d59d2d0 spec: onboardingTicket: eyJpZCI6ImRhMmI1OTU0LTcwZWEtNGY3Ny1hZDU5LWU3MDAzNmQyZjk2NCIsImV4cGlyYXRpb25EYXRlIjoiMTY4MDY5MTI5NCJ9.YEfkN4KsKfeYTNVHl5L83SQMwo4zOkDLoTJlHmsj2L6lxxzJLpzGlkjHy437mRU3HWdkiBHCcOWX/7B00DJDsGUPxwfo7mM1dVvUklDGzHIw9POW/CwhSf5qzuqf5mfqer/KyLMoIKIzBHLrhF3K8mwKncBEbkxQQlhHY1D4ALYnho6MJoQEj8Qhp9YMP4/k321WrGYcoJNEDXVU3vXpRW0uFsZDDl8/XdpIKA7aA/V5lfWwrV1OP8haSDjU3p9lrC16Y2dA1X5nfnwSUq3b1+h9kzawhuv58TnMl5nk+Y49OT24hCRwpvTnZWYU26J6nXosLFGc9MekB0V5a3oI6n/mEztPrgqmftaHBY8ZydVha7TYw/IDW41TQ3odAhGx6eWqPVe/8to1u6vfrUxHNTN9faoPK0cDro64wmcD1VViDfvHNA1mb8QyfA3kEQILXbdgh3Xm8WJj9o80jtebgmJgqv8OCnQ+FSjHnCwCfTvkiLdeOzIf9VSR/EvGzsbtvZ+Z8/RIR+FQHBa/4jLI6WuUt0yQwwxpBfO1XZVtBFBH85VH8lJp8ERBxl/luXXIhLW9U+O3jETrFtjafVy+GlxI73Y7MeRKH3t0FNUmtxQRIA/nYHMsgzX4rUK7ydT8eGYpB89UJgnOtzy1W0ZmEWS6Ht/zoF2GiioqIHbyJ4U= storageProviderEndpoint: a863bd3bee9ab41bcbe001f7740ae36c-2072636220.us-east-2.elb.amazonaws.com:50051 status: id: 6c222074-f5b2-4925-a227-8957e0353725 phase: Connected ============================================================================== Version of all relevant components (if applicable): OCP 4.12.8 ODF 4.13.0-121.stable ====================================================================== Does this issue impact your ability to continue to work with 
Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes, cannot create pods.

=====================================================================

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:
Creation of pods was working in Managed Services (ODF 4.10) clusters.

Steps to Reproduce:
1. Create provider and consumer clusters in the ODF to ODF on ROSA configuration.
2. Create the storageclient on the consumer.
3. Create a CephFS PVC and attach it to a pod.

Actual results:
Pod is not reaching the "Running" state due to the mount failure error.

Expected results:
Pod should reach the Running state.

Additional info:

--- Additional comment from RHEL Program Management on 2023-04-03 14:52:01 UTC ---

This bug, having no release flag set previously, is now set with release flag 'odf-4.13.0' to '?', and so is being proposed to be fixed at the ODF 4.13.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any previously set while the release flag was missing, have now been reset since the Acks are to be set against a release flag.

--- Additional comment from RHEL Program Management on 2023-04-03 14:58:35 UTC ---

This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being proposed as a blocker for this release. Please resolve ASAP.

--- Additional comment from Madhu Rajanna on 2023-04-03 14:59:48 UTC ---

The Rook cephcluster is created with RequireMsgr2 for the managed service, and it should not be. We need to add a check in ocs-operator to use v1 ports for the managed-service provider cluster, because the consumer cluster is supposed to work with multiple providers and we cannot set a global configuration at the CSI level.

--- Additional comment from RHEL Program Management on 2023-04-03 14:59:57 UTC ---

This BZ is being approved for the ODF 4.13.0 release, upon receipt of the 3 ACKs (PM, Devel, QA) for the release flag 'odf-4.13.0'.

--- Additional comment from RHEL Program Management on 2023-04-03 14:59:57 UTC ---

Since this bug has been approved for the ODF 4.13.0 release, through release flag 'odf-4.13.0+', the Target Release is being set to 'ODF 4.13.0'.

--- Additional comment from Jilju Joy on 2023-04-03 15:09:51 UTC ---

must-gather logs from the provider cluster, which contain logs from the namespace odf-storage where the storagecluster is present

--- Additional comment from Mudit Agarwal on 2023-04-03 15:45:05 UTC ---

This bug is because of enabling msgr-v2; I don't think it is relevant for odf-4.12?

--- Additional comment from Madhu Rajanna on 2023-04-03 15:48:10 UTC ---

(In reply to Mudit Agarwal from comment #7)
> This bug is because of enabling msgr-v2, I don't think it is relevant for
> odf-4.12?

Yes, correct, it is because of ODF 4.13; for the managed service, an ODF 4.13 provider cluster is used here.

--- Additional comment from Malay Kumar parida on 2023-04-03 18:01:14 UTC ---

@mrajanna Is there any existing flag already available that we can use to identify whether a provider storagecluster is managed by MS? Or do we need to introduce some new mechanism for that?
--- Additional comment from Madhu Rajanna on 2023-04-04 06:55:26 UTC ---

You can check for the AllowRemoteStorageConsumers flag (https://github.com/red-hat-storage/ocs-operator/blob/main/api/v1/storagecluster_types.go#L87) and not set the RequireMsgr2 flag.

--- Additional comment from Mudit Agarwal on 2023-04-07 05:08:14 UTC ---

Fixed in version: 4.13.0-130

--- Additional comment from Jilju Joy on 2023-04-10 14:49:54 UTC ---

Verified in version:
ocs-client-operator.v4.13.0-130.stable
odf-csi-addons-operator.v4.13.0-130.stable
OCP 4.12.9

----------------------------------------------------

The CephFS PVC was Bound. Attached the PVC to an app pod and the pod reached the "Running" state.

$ oc get pvc pvc-cephfs1
NAME          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE
pvc-cephfs1   Bound    pvc-d37321bd-9a8e-4564-9dbe-c466500bb191   10Gi       RWO            ocs-storagecluster-cephfs   6m44s

$ oc get pod
NAME              READY   STATUS    RESTARTS   AGE
pod-pvc-cephfs1   1/1     Running   0          5m36s

$ oc get pod pod-pvc-cephfs1 -o yaml | grep claimName
      claimName: pvc-cephfs1

Created a file in the pod.

$ oc rsh pod-pvc-cephfs1 cat /var/lib/www/html/f1.txt
123

Testing was done on the ODF to ODF on ROSA configuration without the agent. Installation of ocs-client-operator and creation of the storageclient were manual processes. The storageclassclaims were created manually.

--- Additional comment from errata-xmlrpc on 2023-04-17 13:27:10 UTC ---

This bug has been added to advisory RHBA-2023:108078 by Boris Ranto (branto).
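For context on the fix discussed above: msgr v1 uses port 6789 and msgr v2 uses port 3300, and the failing kernel CephFS mount in the description was pointed only at the mons' 3300 ports. A minimal sketch of the Rook setting involved is below, assuming Rook's CephCluster spec.network.connections API; the metadata is copied from the provider StorageCluster status above, the rest is trimmed, and this is an illustration rather than the actual generated CephCluster.

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: ocs-storagecluster-cephcluster   # name from relatedObjects in the provider StorageCluster status
  namespace: odf-storage
spec:
  network:
    connections:
      # When true, connections are required to use msgr v2 (port 3300) and the
      # msgr v1 port (6789) is not offered to clients. Per the comments above,
      # ocs-operator should leave this unset/false when
      # StorageCluster.spec.allowRemoteStorageConsumers is true.
      requireMsgr2: false

Per the comments above, the ocs-operator fix keys off the AllowRemoteStorageConsumers flag and avoids setting RequireMsgr2 on managed-service provider clusters, keeping the v1 ports available to consumer clusters.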
@mrajanna, I am confused as to why this issue is being reported on 4.12. The initial bug of which this is a clone was due to the requirement of msgr v2 ports, which exists only in ODF 4.13 and has no relevance in 4.12. So I am not sure what's happening here. Am I missing something?
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days