Description of problem (please be as detailed as possible and provide log snippets):

Deployment with an external cluster fails when we try to change the backing store for the image registry to use cephfs:

Warning  FailedMount  86s (x49 over 85m)  kubelet  (combined from similar events): MountVolume.MountDevice failed for volume "pvc-e9292a14-7e28-480c-87b2-1f5c76010ede" : rpc error: code = Internal desc = an error (exit status 32) occurred while running mount args: [-t ceph 10.1.115.103:6789,10.1.115.104:6789,10.1.115.107:6789:/volumes/csi/csi-vol-106edc71-5e25-4471-9e90-974cde7f534d/c69a39ce-c081-454c-804a-fbdbad6d6b68 /var/lib/kubelet/plugins/kubernetes.io/csi/openshift-storage.cephfs.csi.ceph.com/2db2b6b2b97567badc67cd296bf9d86463c8b9914ef42235b5342aed3f1f8c02/globalmount -o name=csi-cephfs-node,secretfile=/tmp/csi/keys/keyfile-4098800741,mds_namespace=cephfs,ms_mode=prefer-crc,_netdev] stderr: unable to get monitor info from DNS SRV with service name: ceph-mon
2023-03-22T12:28:10.970+0000 7f8ec20b6d40 -1 failed for service _ceph-mon._tcp
mount error: no mds server is up or the cluster is laggy

Version of all relevant components (if applicable):
ODF v4.13.0-109
OCP 4.13 latest nightly (4.13.0-0.nightly-2023-03-19-052243)

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes, the cephfs StorageClass cannot be used.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Can this issue be reproduced?
Yes - 4 times already

Can this issue be reproduced from the UI?
Haven't tried

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Install ODF 4.13 in external mode, against an external RHCS cluster.
2. Once deployed, change the StorageClass used for the image registry backing store to cephfs.

Actual results:
image-registry-9f957f57d-zzldc  0/1  ContainerCreating  0  80m

Events:
  Type     Reason       Age                From     Message
  ----     ------       ----               ----     -------
  Warning  FailedMount  26m                kubelet  Unable to attach or mount volumes: unmounted volumes=[registry-storage], unattached volumes=[registry-storage registry-tls ca-trust-extracted registry-certificates trusted-ca installation-pull-secrets bound-sa-token kube-api-access-9c9wq]: timed out waiting for the condition
  Warning  FailedMount  20m (x3 over 76m)  kubelet  Unable to attach or mount volumes: unmounted volumes=[registry-storage], unattached volumes=[kube-api-access-9c9wq registry-storage registry-tls ca-trust-extracted registry-certificates trusted-ca installation-pull-secrets bound-sa-token]: timed out waiting for the condition
  Warning  FailedMount  5s (x44 over 70m)  kubelet  (combined from similar events): Unable to attach or mount volumes: unmounted volumes=[registry-storage], unattached volumes=[ca-trust-extracted registry-certificates trusted-ca installation-pull-secrets bound-sa-token kube-api-access-9c9wq registry-storage registry-tls]: timed out waiting for the condition

pbalogh@MacBook-Pro external $ oc get pvc -n openshift-image-registry
NAME                      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                         AGE
registry-cephfs-rwx-pvc   Bound    pvc-e9292a14-7e28-480c-87b2-1f5c76010ede   100Gi      RWX            ocs-external-storagecluster-cephfs   78m

Expected results:
The image-registry pod should be running.

Additional info:
Ceph health looks OK:
$ oc rsh -n openshift-storage rook-ceph-tools-external-7f77c6d78c-tq82d ceph health detail
HEALTH_OK
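One more data point on the failure mode: the mount args above combine monitor endpoints on the msgr1 port (6789) with ms_mode=prefer-crc, and as far as I understand any ms_mode value other than legacy makes the cephfs kernel client speak msgr2, which would explain the misleading "no mds server is up or the cluster is laggy" error. A minimal diagnostic sketch (toolbox pod name and IPs taken from this report):

# Confirm which protocols/ports the external mons expose (v1=6789, v2=3300):
oc rsh -n openshift-storage rook-ceph-tools-external-7f77c6d78c-tq82d ceph mon dump
# The dump should show dual addresses along the lines of:
#   [v2:10.1.115.103:3300/0,v1:10.1.115.103:6789/0] mon.rhcs-1-node-3

# Check which endpoints the CSI driver will hand to the kernel client:
oc -n openshift-storage get cm rook-ceph-csi-config -o jsonpath='{.data.csi-cluster-config-json}'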
The csi configmap is getting the msgr2 ports:

% oc get cm rook-ceph-csi-config -o yaml
apiVersion: v1
data:
  csi-cluster-config-json: '[{"clusterID":"openshift-storage","monitors":["10.1.115.103:3300","10.1.115.104:3300","10.1.115.107:3300"],"namespace":"openshift-storage"}]'

While the mon endpoints configmap correctly still contains the v1 ports:

% oc get cm rook-ceph-mon-endpoints -o yaml
apiVersion: v1
data:
  csi-cluster-config-json: '[{"clusterID":"openshift-storage","monitors":["10.1.115.104:6789","10.1.115.107:6789","10.1.115.103:6789"],"namespace":""}]'
  data: rhcs-1-node-1=10.1.115.104:6789,rhcs-1-node-2=10.1.115.107:6789,rhcs-1-node-3=10.1.115.103:6789

The switch to the msgr2 endpoints comes from this call [1]:

  monEndpoints := csi.MonEndpoints(cluster.ClusterInfo.Monitors, cluster.Spec.RequireMsgr2())

For external clusters, rook should really ignore the RequireMsgr2 setting and just use the same endpoints that were given to connect to the provider cluster. There is no scenario where we would need to change these to the msgr2 ports.

Madhu, how about if we just change this param to false?

  monEndpoints := csi.MonEndpoints(cluster.ClusterInfo.Monitors, false)

We may also need a fix in the ocs operator to not set ms_mode=prefer-crc as a cephfs mount option for external clusters.

[1] https://github.com/rook/rook/blob/master/pkg/operator/ceph/cluster/cluster_external.go#L116
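For reference, the RequireMsgr2 knob being read here lives in the CephCluster spec. Assuming the rook version shipped with ODF 4.13 (the field path below is per the rook v1.10+ CephCluster CRD), its current value can be inspected with:

# Print the requireMsgr2 setting the operator is acting on:
oc -n openshift-storage get cephcluster -o jsonpath='{.items[0].spec.network.connections.requireMsgr2}'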
Yes Madhu, the v2 port addresses you see in the rook-ceph-csi-config cm were changed by me. It had the v1 port (6789) and I changed it manually to 3300. Then the mounts worked and the pod got to Running.
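For anyone else hitting this before a fix lands, the equivalent manual workaround can be applied with a patch along these lines (a sketch only, using the monitor IPs from this report; note the operator may reconcile the configmap back on its next sync):

oc -n openshift-storage patch cm rook-ceph-csi-config --type merge -p \
  '{"data":{"csi-cluster-config-json":"[{\"clusterID\":\"openshift-storage\",\"monitors\":[\"10.1.115.103:3300\",\"10.1.115.104:3300\",\"10.1.115.107:3300\"],\"namespace\":\"openshift-storage\"}]"}}'

The CSI plugin pods consume this configmap as a mounted volume, so subsequent mount retries should pick up the new endpoints without recreating the application pod.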
The relevant discussion for this bug is happening here: https://chat.google.com/room/AAAAREGEba8/QSwFEz4GICM

We are mostly investigating why rook is not setting the csi configmap to the v2 ports on external clusters the way it does on internal mode clusters. If we are able to debug that, we will fix it; if we can't, we will turn off RequireMsgr2 for external Ceph clusters and won't pass any kernel_mount_options.
Per discussion with the csi team, for external clusters it will be less risky to set RequireMsgr2: false, in case the provider cluster still needs to use msgr1. This means ms_mode cannot be set for cephfs on external clusters either.
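Once that change lands, a quick sanity check on an external-mode cluster would be the following (StorageClass and configmap names as used elsewhere in this bug; kernelMountOptions is the storageclass parameter cephcsi reads kernel mount options from):

# The external cephfs StorageClass should no longer carry ms_mode:
oc get sc ocs-external-storagecluster-cephfs -o yaml | grep -i ms_mode || echo "ms_mode not set"

# And the CSI config should keep the original v1 (6789) endpoints:
oc -n openshift-storage get cm rook-ceph-csi-config -o jsonpath='{.data.csi-cluster-config-json}'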
*** Bug 2183073 has been marked as a duplicate of this bug. ***
Changing the backing store for the image registry to use cephfs works as expected:

2023-04-13 22:49:03  19:49:02 - MainThread - ocs_ci.ocs.resources.ocs - INFO - Adding PersistentVolumeClaim with name registry-cephfs-rwx-pvc
2023-04-13 22:49:03  19:49:02 - MainThread - ocs_ci.utility.templating - INFO - apiVersion: v1
2023-04-13 22:49:03  kind: PersistentVolumeClaim
2023-04-13 22:49:03  metadata:
2023-04-13 22:49:03    name: registry-cephfs-rwx-pvc
2023-04-13 22:49:03    namespace: openshift-image-registry
2023-04-13 22:49:03  spec:
2023-04-13 22:49:03    accessModes:
2023-04-13 22:49:03    - ReadWriteMany
2023-04-13 22:49:03    resources:
2023-04-13 22:49:03      requests:
2023-04-13 22:49:03        storage: 100Gi
2023-04-13 22:49:03    storageClassName: ocs-external-storagecluster-cephfs
2023-04-13 22:49:03  19:49:02 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-image-registry create -f /tmp/PersistentVolumeClaim1zygleng -o yaml
2023-04-13 22:49:03  19:49:03 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc patch configs.imageregistry.operator.openshift.io/cluster -p '[{"op": "add", "path": "/spec/storage", "value": {"pvc": {"claim": "registry-cephfs-rwx-pvc"}}}]' --type json
2023-04-13 22:50:15  19:50:15 - MainThread - ocs_ci.ocs.registry - INFO - Verified pvc is mounted on image-registry-868f8dc7c-8gc7j pod

=====================

Verified with:
ODF 4.13.0-162
Ceph Version 16.2.8-85.el8cp (0bdc6db9a80af40dd496b05674a938d406a9f6f5) pacific (stable)
Cluster Version 4.13.0-0.nightly-2023-04-13-122023
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:3742