Description of problem (please be as detailed as possible and provide log snippets):

In ODF 4.13 we use the msgr v2 port by default for new installations, so for cephfs volumes we need to pass the ms_mode mount option. But when someone upgrades from 4.12 or an earlier version to 4.13, the mons are still using the v1 ports, and because the mount options are passed while connecting to the v1 ports, the volumes fail to mount.

This is the error that comes up:

MountVolume.MountDevice failed for volume "pvc-f667660d-7974-45e3-87ee-2fafeae55cee" : rpc error: code = Internal desc = an error (exit status 32) occurred while running mount args: [-t ceph 172.30.179.155:6789,172.30.209.170:6789,172.30.225.118:6789:/volumes/csi/csi-vol-67ebb9c3-d12d-4dfa-822e-73cbc47ecd42/29d20f1e-22fd-4d78-9ec7-69debed4c1a5 /var/lib/kubelet/plugins/kubernetes.io/csi/openshift-storage.cephfs.csi.ceph.com/181fb2841dea7dfea558abbf4d771e4cb0284b3e258bbb9be6eb694a9c13d895/globalmount -o name=csi-cephfs-node,secretfile=/tmp/csi/keys/keyfile-3109566895,mds_namespace=ocs-storagecluster-cephfilesystem,ms_mode=prefer-crc,_netdev] stderr: unable to get monitor info from DNS SRV with service name: ceph-mon
2023-03-06T13:56:28.070+0000 7f0c7ad26d40 -1 failed for service _ceph-mon._tcp
mount error 110 = Connection timed out

Version of all relevant components (if applicable):

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes. In any cluster upgraded from 4.12 we will not be able to mount any new cephfs volumes.

Is there any workaround available to the best of your knowledge?
Failing over the mons one by one brings up new mons on the v2 port, which can handle the mount options, and the volume mount then succeeds.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Can this issue be reproduced?
Yes

Can this issue reproduce from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Install ODF 4.12 rhceph-dev build
2. Upgrade to ODF 4.13 rhceph-dev build
3. Try to create a cephfs PVC & a pod to mount the PVC

Actual results:
The volume is not mounted & the pod is stuck in the ContainerCreating phase.

Expected results:
The volume should be mounted & the pod should be running.

Additional info:
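To check which messenger ports the mons actually answer on (for example before and after the mon-failover workaround), here is a minimal hypothetical probe, not part of ODF or Rook; the mon IPs are placeholders copied from the log above, and 6789/3300 are the standard Ceph msgr v1/v2 ports:

```go
// monprobe.go - hypothetical helper that dials each mon on the msgr v1 and
// msgr v2 ports and reports which ones are reachable.
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	mons := []string{"172.30.179.155", "172.30.209.170", "172.30.225.118"} // placeholder IPs
	ports := []string{"6789", "3300"}                                      // msgr v1 and msgr v2 default ports

	for _, mon := range mons {
		for _, port := range ports {
			addr := net.JoinHostPort(mon, port)
			conn, err := net.DialTimeout("tcp", addr, 3*time.Second)
			if err != nil {
				fmt.Printf("%s: not reachable (%v)\n", addr, err)
				continue
			}
			conn.Close()
			fmt.Printf("%s: reachable\n", addr)
		}
	}
}
```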
The ms_mode=prefer-crc for cephfs is set in the storage class, so the ocs operator will need to fix this scenario for cephfs. Upgraded clusters should not set ms_mode on the cephfs storageclass. Perhaps the simplest approach is to detect whether the storage class exists yet: if it doesn't, assume it's a new cluster and generate the storage class with the ms_mode attribute; if it already exists, don't update it with the ms_mode property. If upgraded clusters want to enable it, we would need to document adding that property.

Malay, did you see any issue with rbd volumes after the upgrade? Rook would need to fix that case, since the ms_mode for rbd is set by the rook operator. In that case, please open a new BZ for rbd. Let's keep this issue for cephfs in the ocs operator.
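A minimal sketch of that "create only if missing" detection logic, assuming a client-go based flow; ensureCephFSStorageClass and msModeMountOption are illustrative names, not actual ocs-operator code, and whether the option ultimately belongs in mountOptions or a CSI parameter is glossed over here:

```go
// Hypothetical sketch: only add the ms_mode option when the cephfs storage
// class does not exist yet (a fresh install). Storage classes already present
// on upgraded clusters are left untouched.
package reconcile

import (
	"context"

	storagev1 "k8s.io/api/storage/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const msModeMountOption = "ms_mode=prefer-crc" // placeholder for the msgr2-only option

func ensureCephFSStorageClass(ctx context.Context, cs kubernetes.Interface, desired *storagev1.StorageClass) error {
	_, err := cs.StorageV1().StorageClasses().Get(ctx, desired.Name, metav1.GetOptions{})
	switch {
	case apierrors.IsNotFound(err):
		// New cluster: msgr2 is required from the start, so include ms_mode.
		desired.MountOptions = append(desired.MountOptions, msModeMountOption)
		_, err = cs.StorageV1().StorageClasses().Create(ctx, desired, metav1.CreateOptions{})
		return err
	case err != nil:
		return err
	default:
		// Upgraded cluster: the storage class already exists; do not add
		// ms_mode. Enabling it later would be a documented manual step.
		return nil
	}
}
```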
The work to set the map/mount options was done on the Rook side: for rbd - https://github.com/rook/rook/pull/11523, for cephfs - https://github.com/rook/rook/pull/11625. So I am not sure whether the fix needs to go in ocs or in Rook. BTW, another question: the problem arises because the mons are still using the v1 port, even though we set requireMsgr2 to true. Can't we do something so that after the upgrade to ODF 4.13 the mons start using the msgr v2 port instead of the v1 port? I think that would be a more complete solution, as we don't want customers to be left on the v1 port forever.
Oh right, the cephfs options were also done in rook, I was thinking they were done in the storageclass. Moving back to rook.
I'll take a look while Madhu is out
Testing this change: https://github.com/rook/rook/pull/11859
When msgr2 is required, ensure the mon endpoints passed to the csi configmap are on port 3300. The mons will still be listening on both 6789 and 3300 if they were created before msgr2 was required, so existing volumes can continue using port 6789 while new volumes will use port 3300.
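The endpoint change boils down to rewriting the mon addresses handed to the csi configmap; here is a minimal sketch of that idea (not the actual code from the PR; csiMonEndpoints and the constants are illustrative):

```go
// Hypothetical sketch of the endpoint rewrite: when msgr2 is required, point
// the CSI driver at the mons' v2 port (3300) instead of the legacy v1 port
// (6789). Mons created before msgr2 was required still listen on both ports,
// so volumes already mounted via 6789 keep working.
package main

import (
	"fmt"
	"net"
)

const (
	msgr1Port = "6789"
	msgr2Port = "3300"
)

func csiMonEndpoints(monAddrs []string, requireMsgr2 bool) []string {
	out := make([]string, 0, len(monAddrs))
	for _, addr := range monAddrs {
		host, port, err := net.SplitHostPort(addr)
		if err != nil {
			// Keep entries we cannot parse unchanged.
			out = append(out, addr)
			continue
		}
		if requireMsgr2 && port == msgr1Port {
			port = msgr2Port
		}
		out = append(out, net.JoinHostPort(host, port))
	}
	return out
}

func main() {
	mons := []string{"172.30.179.155:6789", "172.30.209.170:6789", "172.30.225.118:6789"}
	fmt.Println(csiMonEndpoints(mons, true))
	// Output: [172.30.179.155:3300 172.30.209.170:3300 172.30.225.118:3300]
}
```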
This fix is merged downstream to 4.13 with https://github.com/red-hat-storage/rook/pull/452. This is expected to fix the issue with cephfs volumes failing to mount after the upgrade from 4.12 to 4.13.
Fixed in version: Any latest stable 4.13 build
@nberry Yes. Creating cephfs volumes & mounting them into a pod in an upgraded cluster will suffice for our case.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:3742
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days