Bug 2187804 - [Fusion-aaS] [Backport to 4.12.3] Failed to mount CephFS volumes while creating pods
Summary: [Fusion-aaS] [Backport to 4.12.3] Failed to mount CephFS volumes while creating pods
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.13
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Malay Kumar parida
QA Contact: Elad
URL:
Whiteboard:
Depends On: 2184068
Blocks:
 
Reported: 2023-04-18 17:24 UTC by Neha Berry
Modified: 2023-12-08 04:33 UTC
CC: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2184068
Environment:
Last Closed: 2023-04-19 09:59:10 UTC
Embargoed:



Description Neha Berry 2023-04-18 17:24:01 UTC
+++ This bug was initially created as a clone of Bug #2184068 +++

Description of problem (please be as detailed as possible and provide log
snippets):
Pods with CephFS PVCs are not reaching the Running state due to the error given below.

$ oc describe pod -n test pod-pvc-cephfs2 | grep "Events:" -A 50
Events:
  Type     Reason                  Age                   From                     Message
  ----     ------                  ----                  ----                     -------
  Normal   Scheduled               16m                   default-scheduler        Successfully assigned test/pod-pvc-cephfs2 to ip-10-0-23-231.us-east-2.compute.internal
  Normal   SuccessfulAttachVolume  16m                   attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-c2a6475a-b82d-4a95-bbf9-a8e6f02416b2"
  Warning  FailedMount             15m                   kubelet                  MountVolume.MountDevice failed for volume "pvc-c2a6475a-b82d-4a95-bbf9-a8e6f02416b2" : rpc error: code = Internal desc = an error (exit status 32) occurred while running mount args: [-t ceph 10.0.14.207:3300,10.0.17.36:3300,10.0.21.79:3300:/volumes/cephfilesystemsubvolumegroup-storageconsumer-6c222074-f5b2-4925-a227-8957e0353725/csi-vol-9fcf671e-6b8d-49ca-9bc8-9d7ce256d567/fb7d71c0-5125-40d5-9d46-59ce5698b67d /var/lib/kubelet/plugins/kubernetes.io/csi/odf-storage.cephfs.csi.ceph.com/531e47114f2308a7459c194c6cfdf117b9fe20a1c8bff1657fc985455db52636/globalmount -o name=62972e1ff4714b87ce4dea5c0107f05d,secretfile=/tmp/csi/keys/keyfile-3413363383,mds_namespace=ocs-storagecluster-cephfilesystem,_netdev] stderr: mount error 110 = Connection timed out
  Warning  FailedMount             14m                   kubelet                  MountVolume.MountDevice failed for volume "pvc-c2a6475a-b82d-4a95-bbf9-a8e6f02416b2" : rpc error: code = Internal desc = an error (exit status 32) occurred while running mount args: [-t ceph 10.0.14.207:3300,10.0.17.36:3300,10.0.21.79:3300:/volumes/cephfilesystemsubvolumegroup-storageconsumer-6c222074-f5b2-4925-a227-8957e0353725/csi-vol-9fcf671e-6b8d-49ca-9bc8-9d7ce256d567/fb7d71c0-5125-40d5-9d46-59ce5698b67d /var/lib/kubelet/plugins/kubernetes.io/csi/odf-storage.cephfs.csi.ceph.com/531e47114f2308a7459c194c6cfdf117b9fe20a1c8bff1657fc985455db52636/globalmount -o name=62972e1ff4714b87ce4dea5c0107f05d,secretfile=/tmp/csi/keys/keyfile-3942104566,mds_namespace=ocs-storagecluster-cephfilesystem,_netdev] stderr: mount error 110 = Connection timed out

........

Testing was done on an ODF-to-ODF on ROSA configuration. The StorageClient was created in the consumer cluster. The StorageCluster and StorageClient are created in the namespace odf-storage on the provider and consumer, respectively.



Provider storagecluster:

$ oc get storagecluster -n odf-storage ocs-storagecluster -o yaml
apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  annotations:
    uninstall.ocs.openshift.io/cleanup-policy: delete
    uninstall.ocs.openshift.io/mode: graceful
  creationTimestamp: "2023-04-03T10:43:31Z"
  finalizers:
  - storagecluster.ocs.openshift.io
  generation: 1
  name: ocs-storagecluster
  namespace: odf-storage
  ownerReferences:
  - apiVersion: ocs.openshift.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: ManagedOCS
    name: managedocs
    uid: e283f807-2d01-4bdb-9c56-1521bb7fb143
  - apiVersion: odf.openshift.io/v1alpha1
    kind: StorageSystem
    name: ocs-storagecluster-storagesystem
    uid: 6bb41373-a6dd-4cd0-b673-8690e1627c0b
  resourceVersion: "409263"
  uid: e756a472-80ad-4b47-b08d-7a52bab3eea3
spec:
  allowRemoteStorageConsumers: true
  arbiter: {}
  defaultStorageProfile: default
  encryption:
    kms: {}
  externalStorage: {}
  hostNetwork: true
  labelSelector:
    matchExpressions:
    - key: node-role.kubernetes.io/worker
      operator: Exists
    - key: node-role.kubernetes.io/infra
      operator: DoesNotExist
  managedResources:
    cephBlockPools:
      disableSnapshotClass: true
      disableStorageClass: true
      reconcileStrategy: ignore
    cephCluster: {}
    cephConfig: {}
    cephDashboard: {}
    cephFilesystems:
      disableSnapshotClass: true
      disableStorageClass: true
    cephNonResilientPools: {}
    cephObjectStoreUsers: {}
    cephObjectStores: {}
    cephToolbox: {}
  mirroring: {}
  monPVCTemplate:
    metadata: {}
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 50Gi
      storageClassName: gp2
    status: {}
  multiCloudGateway:
    reconcileStrategy: ignore
  resources:
    crashcollector:
      limits:
        cpu: 50m
        memory: 80Mi
      requests:
        cpu: 50m
        memory: 80Mi
    mds:
      limits:
        cpu: 1500m
        memory: 8Gi
      requests:
        cpu: 1500m
        memory: 8Gi
    mgr:
      limits:
        cpu: "1"
        memory: 3Gi
      requests:
        cpu: "1"
        memory: 3Gi
    mon:
      limits:
        cpu: "1"
        memory: 2Gi
      requests:
        cpu: "1"
        memory: 2Gi
  storageDeviceSets:
  - config: {}
    count: 1
    dataPVCTemplate:
      metadata: {}
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 4Ti
        storageClassName: gp2
        volumeMode: Block
      status: {}
    deviceClass: ssd
    name: default
    placement:
      topologySpreadConstraints:
      - labelSelector:
          matchExpressions:
          - key: ceph.rook.io/pvc
            operator: Exists
        maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
    portable: true
    preparePlacement:
      topologySpreadConstraints:
      - labelSelector:
          matchExpressions:
          - key: ceph.rook.io/pvc
            operator: Exists
        maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
      - labelSelector:
          matchExpressions:
          - key: ceph.rook.io/pvc
            operator: Exists
        maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
    replica: 3
    resources:
      limits:
        cpu: 1650m
        memory: 6Gi
      requests:
        cpu: 1650m
        memory: 6Gi
  storageProfiles:
  - blockPoolConfiguration:
      parameters:
        pg_autoscale_mode: "on"
        pg_num: "128"
        pgp_num: "128"
    deviceClass: ssd
    name: default
    sharedFilesystemConfiguration:
      parameters:
        pg_autoscale_mode: "on"
        pg_num: "128"
        pgp_num: "128"
status:
  conditions:
  - lastHeartbeatTime: "2023-04-03T10:43:31Z"
    lastTransitionTime: "2023-04-03T10:43:31Z"
    message: Version check successful
    reason: VersionMatched
    status: "False"
    type: VersionMismatch
  - lastHeartbeatTime: "2023-04-03T14:43:24Z"
    lastTransitionTime: "2023-04-03T10:43:42Z"
    message: Reconcile completed successfully
    reason: ReconcileCompleted
    status: "True"
    type: ReconcileComplete
  - lastHeartbeatTime: "2023-04-03T14:43:24Z"
    lastTransitionTime: "2023-04-03T10:47:26Z"
    message: Reconcile completed successfully
    reason: ReconcileCompleted
    status: "True"
    type: Available
  - lastHeartbeatTime: "2023-04-03T14:43:24Z"
    lastTransitionTime: "2023-04-03T10:57:55Z"
    message: Reconcile completed successfully
    reason: ReconcileCompleted
    status: "False"
    type: Progressing
  - lastHeartbeatTime: "2023-04-03T14:43:24Z"
    lastTransitionTime: "2023-04-03T10:43:31Z"
    message: Reconcile completed successfully
    reason: ReconcileCompleted
    status: "False"
    type: Degraded
  - lastHeartbeatTime: "2023-04-03T14:43:24Z"
    lastTransitionTime: "2023-04-03T10:57:55Z"
    message: Reconcile completed successfully
    reason: ReconcileCompleted
    status: "True"
    type: Upgradeable
  externalStorage:
    grantedCapacity: "0"
  failureDomain: zone
  failureDomainKey: topology.kubernetes.io/zone
  failureDomainValues:
  - us-east-2a
  - us-east-2b
  - us-east-2c
  images:
    ceph:
      actualImage: quay.io/rhceph-dev/rhceph@sha256:f916da02f59b8f73ad18eb65310333d1e3cbd1a54678ff50bf27ed9618719b63
      desiredImage: quay.io/rhceph-dev/rhceph@sha256:f916da02f59b8f73ad18eb65310333d1e3cbd1a54678ff50bf27ed9618719b63
    noobaaCore:
      desiredImage: quay.io/rhceph-dev/odf4-mcg-core-rhel9@sha256:08e5ae2e9869d434bd7d522dd81bb9900c71b05d63bbfde474d8d2250dd5fae9
    noobaaDB:
      desiredImage: quay.io/rhceph-dev/rhel8-postgresql-12@sha256:e88838e08efabb25d2450f3c178ffe0a9e1be1a14c5bc998bde0e807c9a3f15f
  kmsServerConnection: {}
  nodeTopologies:
    labels:
      kubernetes.io/hostname:
      - ip-10-0-14-207.us-east-2.compute.internal
      - ip-10-0-17-36.us-east-2.compute.internal
      - ip-10-0-21-79.us-east-2.compute.internal
      topology.kubernetes.io/region:
      - us-east-2
      topology.kubernetes.io/zone:
      - us-east-2a
      - us-east-2b
      - us-east-2c
  phase: Ready
  relatedObjects:
  - apiVersion: ceph.rook.io/v1
    kind: CephCluster
    name: ocs-storagecluster-cephcluster
    namespace: odf-storage
    resourceVersion: "409255"
    uid: 912aba9f-1f08-4420-9011-01913598dca8
  storageProviderEndpoint: a863bd3bee9ab41bcbe001f7740ae36c-2072636220.us-east-2.elb.amazonaws.com:50051
  version: 4.13.0





Consumer storageclient:

$ oc -n odf-storage get storageclient ocs-storageclient -o yaml
apiVersion: ocs.openshift.io/v1alpha1
kind: StorageClient
metadata:
  creationTimestamp: "2023-04-03T11:14:23Z"
  finalizers:
  - storageclient.ocs.openshift.io
  generation: 1
  name: ocs-storageclient
  namespace: odf-storage
  resourceVersion: "211583"
  uid: 9c2e0d4d-be0c-4233-a872-ed946d59d2d0
spec:
  onboardingTicket: eyJpZCI6ImRhMmI1OTU0LTcwZWEtNGY3Ny1hZDU5LWU3MDAzNmQyZjk2NCIsImV4cGlyYXRpb25EYXRlIjoiMTY4MDY5MTI5NCJ9.YEfkN4KsKfeYTNVHl5L83SQMwo4zOkDLoTJlHmsj2L6lxxzJLpzGlkjHy437mRU3HWdkiBHCcOWX/7B00DJDsGUPxwfo7mM1dVvUklDGzHIw9POW/CwhSf5qzuqf5mfqer/KyLMoIKIzBHLrhF3K8mwKncBEbkxQQlhHY1D4ALYnho6MJoQEj8Qhp9YMP4/k321WrGYcoJNEDXVU3vXpRW0uFsZDDl8/XdpIKA7aA/V5lfWwrV1OP8haSDjU3p9lrC16Y2dA1X5nfnwSUq3b1+h9kzawhuv58TnMl5nk+Y49OT24hCRwpvTnZWYU26J6nXosLFGc9MekB0V5a3oI6n/mEztPrgqmftaHBY8ZydVha7TYw/IDW41TQ3odAhGx6eWqPVe/8to1u6vfrUxHNTN9faoPK0cDro64wmcD1VViDfvHNA1mb8QyfA3kEQILXbdgh3Xm8WJj9o80jtebgmJgqv8OCnQ+FSjHnCwCfTvkiLdeOzIf9VSR/EvGzsbtvZ+Z8/RIR+FQHBa/4jLI6WuUt0yQwwxpBfO1XZVtBFBH85VH8lJp8ERBxl/luXXIhLW9U+O3jETrFtjafVy+GlxI73Y7MeRKH3t0FNUmtxQRIA/nYHMsgzX4rUK7ydT8eGYpB89UJgnOtzy1W0ZmEWS6Ht/zoF2GiioqIHbyJ4U=
  storageProviderEndpoint: a863bd3bee9ab41bcbe001f7740ae36c-2072636220.us-east-2.elb.amazonaws.com:50051
status:
  id: 6c222074-f5b2-4925-a227-8957e0353725
  phase: Connected





==============================================================================

Version of all relevant components (if applicable):
OCP 4.12.8
ODF 4.13.0-121.stable
======================================================================

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes, pods cannot be created.
=====================================================================

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:
Creation of pods was working in Managed Services (ODF 4.10) clusters.

Steps to Reproduce:
1. Create provider and consumer clusters in an ODF-to-ODF on ROSA configuration.
2. Create a storageclient on the consumer.
3. Create a CephFS PVC and attach it to a pod (example manifests are sketched below).
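
For reference, a minimal sketch of the manifests for step 3. The PVC/pod names, storage class, and mount path follow the objects referenced elsewhere in this report; the container image and command are illustrative assumptions:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-cephfs2
  namespace: test
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: ocs-storagecluster-cephfs
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-pvc-cephfs2
  namespace: test
spec:
  containers:
  - name: web-server
    # Illustrative image and command (assumption); any image that keeps the pod running works.
    image: registry.access.redhat.com/ubi8/ubi
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: data
      mountPath: /var/lib/www/html
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: pvc-cephfs2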


Actual results:
Pod is not reaching the "Running" state due to the mount failure error shown above.

Expected results:
Pod should reach the Running state.

Additional info:

--- Additional comment from RHEL Program Management on 2023-04-03 14:52:01 UTC ---

This bug previously had no release flag set; the release flag 'odf-4.13.0' is now set to '?', so the bug is being proposed to be fixed in the ODF 4.13.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any were set while the release flag was missing, have now been reset, since Acks must be set against a release flag.

--- Additional comment from RHEL Program Management on 2023-04-03 14:58:35 UTC ---

This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being proposed as a blocker for this release. Please resolve ASAP.

--- Additional comment from Madhu Rajanna on 2023-04-03 14:59:48 UTC ---

The Rook cephcluster is created with RequireMsgr2 for the managed service, and it should not be. We need to add a check in ocs-operator to use v1 ports for the managed-service provider cluster.

This is because a consumer cluster is supposed to work with multiple providers, and we cannot set a global configuration at the CSI level.
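
For context, the flag in question lives on the Rook CephCluster CR. A minimal sketch of the problematic provider-side setting (the field path follows the Rook CephCluster API; the excerpt is an assumption about what the operator currently generates, not output captured from this cluster):

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: ocs-storagecluster-cephcluster
  namespace: odf-storage
spec:
  network:
    connections:
      # Requiring msgr2 makes the mons serve only the v2 protocol (port 3300),
      # which per the comment above should not be done for a managed-service provider.
      requireMsgr2: true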

--- Additional comment from RHEL Program Management on 2023-04-03 14:59:57 UTC ---

This BZ is being approved for the ODF 4.13.0 release, upon receipt of the 3 ACKs (PM, Devel, QA) for the release flag 'odf-4.13.0'.

--- Additional comment from RHEL Program Management on 2023-04-03 14:59:57 UTC ---

Since this bug has been approved for the ODF 4.13.0 release through release flag 'odf-4.13.0+', the Target Release is being set to 'ODF 4.13.0'.

--- Additional comment from Jilju Joy on 2023-04-03 15:09:51 UTC ---

must-gather logs from the provider cluster, which contain logs from the namespace odf-storage where the storagecluster is present

--- Additional comment from Mudit Agarwal on 2023-04-03 15:45:05 UTC ---

This bug is because of enabling msgr-v2, I don't think it is relevant for odf-4.12?

--- Additional comment from Madhu Rajanna on 2023-04-03 15:48:10 UTC ---

(In reply to Mudit Agarwal from comment #7)
> This bug is because of enabling msgr-v2, I don't think it is relevant for
> odf-4.12?

Yes, correct, it's because of ODF 4.13; for the managed service, the provider cluster here uses ODF 4.13.

--- Additional comment from Malay Kumar parida on 2023-04-03 18:01:14 UTC ---

@mrajanna Is there an existing flag already available that we can use to identify whether a provider storagecluster is managed by MS? Or do we need to introduce some new mechanism for that?

--- Additional comment from Madhu Rajanna on 2023-04-04 06:55:26 UTC ---

You can check for the AllowRemoteStorageConsumers flag (https://github.com/red-hat-storage/ocs-operator/blob/main/api/v1/storagecluster_types.go#L87) and not set the RequireMsgr2 flag.
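
Putting the two together, a sketch of the intended behavior (the StorageCluster excerpt matches the provider spec shown earlier in this report; the CephCluster excerpt is an assumption about the desired generated output, not the actual code change):

# Provider StorageCluster opts in to remote consumers:
apiVersion: ocs.openshift.io/v1
kind: StorageCluster
spec:
  allowRemoteStorageConsumers: true
---
# Desired CephCluster generated for such a provider (sketch/assumption):
apiVersion: ceph.rook.io/v1
kind: CephCluster
spec:
  network:
    connections:
      requireMsgr2: false   # leave the msgr v1 port usable for remote consumers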

--- Additional comment from Mudit Agarwal on 2023-04-07 05:08:14 UTC ---

Fixed in version: 4.13.0-130

--- Additional comment from Jilju Joy on 2023-04-10 14:49:54 UTC ---

Verified in version:

ocs-client-operator.v4.13.0-130.stable             
odf-csi-addons-operator.v4.13.0-130.stable         

OCP 4.12.9
----------------------------------------------------
CephFS PVC was Bound. Attached the PVC to an app pod and the pod reached the state "Running".

$ oc get pvc pvc-cephfs1 
NAME          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE
pvc-cephfs1   Bound    pvc-d37321bd-9a8e-4564-9dbe-c466500bb191   10Gi       RWO            ocs-storagecluster-cephfs   6m44s


$ oc get pod
NAME              READY   STATUS    RESTARTS   AGE
pod-pvc-cephfs1   1/1     Running   0          5m36s

$ oc get pod pod-pvc-cephfs1 -o yaml | grep claimName
      claimName: pvc-cephfs1

Created file in pod.
$ oc rsh pod-pvc-cephfs1 cat /var/lib/www/html/f1.txt
123


Testing was done on an ODF-to-ODF on ROSA configuration without the agent. Installation of ocs-client-operator and creation of the storageclient were manual processes. The storageclassclaims were created manually.

--- Additional comment from errata-xmlrpc on 2023-04-17 13:27:10 UTC ---

This bug has been added to advisory RHBA-2023:108078 by Boris Ranto (branto)

Comment 4 Malay Kumar parida 2023-04-19 06:00:53 UTC
@mrajanna, I am confused as to why this issue is being reported on 4.12. The initial bug, of which this is a clone, was due to the requirement of msgrv2 ports, which is a thing only in OCS 4.13 and has no relevance in 4.12. So I am not sure what's happening here. Am I missing something?

Comment 7 Red Hat Bugzilla 2023-12-08 04:33:13 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

