Bug 2175867 - Rook sets cephfs kernel mount options even when mon is using v1 port
Summary: Rook sets cephfs kernel mount options even when mon is using v1 port
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.13
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.13.0
Assignee: Travis Nielsen
QA Contact: Petr Balogh
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-03-06 16:51 UTC by Malay Kumar parida
Modified: 2023-12-08 04:32 UTC
CC List: 5 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-06-21 15:24:25 UTC
Embargoed:




Links
System                  ID              Private  Priority  Status  Summary                                         Last Updated
Github rook/rook        pull 11859      0        None      open    csi: Update port to 3300 if msgr2 is required   2023-03-08 22:53:49 UTC
Red Hat Product Errata  RHBA-2023:3742  0        None      None    None                                            2023-06-21 15:24:52 UTC

Description Malay Kumar parida 2023-03-06 16:51:02 UTC
Description of problem (please be as detailed as possible and provide log snippets):
In ODF 4.13 we use the msgr2 port by default for new installations, so for CephFS volumes we need to pass the msgr2 mount options. But when someone upgrades from 4.12 or an earlier version to 4.13, the mons are still using the v1 port, and because we pass the mount options while pointing at the v1 port, the volumes fail to mount.

This is the error that comes up:
`
MountVolume.MountDevice failed for volume "pvc-f667660d-7974-45e3-87ee-2fafeae55cee" : rpc error: code = Internal desc = an error (exit status 32) occurred while running mount args: [-t ceph 172.30.179.155:6789,172.30.209.170:6789,172.30.225.118:6789:/volumes/csi/csi-vol-67ebb9c3-d12d-4dfa-822e-73cbc47ecd42/29d20f1e-22fd-4d78-9ec7-69debed4c1a5 /var/lib/kubelet/plugins/kubernetes.io/csi/openshift-storage.cephfs.csi.ceph.com/181fb2841dea7dfea558abbf4d771e4cb0284b3e258bbb9be6eb694a9c13d895/globalmount -o name=csi-cephfs-node,secretfile=/tmp/csi/keys/keyfile-3109566895,mds_namespace=ocs-storagecluster-cephfilesystem,ms_mode=prefer-crc,_netdev] stderr: unable to get monitor info from DNS SRV with service name: ceph-mon 2023-03-06T13:56:28.070+0000 7f0c7ad26d40 -1 failed for service _ceph-mon._tcp mount error 110 = Connection timed out
`
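For context, a rough sketch of reproducing the mismatch by hand from a node (the monitor IP is the one from the error above; the subvolume path and secret file are placeholders): ms_mode is a msgr2-only option, so the same mount options that time out against the v1 port 6789 are expected to succeed against the msgr2 port 3300, which the mons also listen on.

mkdir -p /mnt/test

# Times out: ms_mode=prefer-crc makes the kernel client speak msgr2,
# but 6789 is the msgr1 (v1) port
mount -t ceph 172.30.179.155:6789:/volumes/csi/<subvolume-path> /mnt/test \
  -o name=csi-cephfs-node,secretfile=/tmp/csi/keys/<keyfile>,mds_namespace=ocs-storagecluster-cephfilesystem,ms_mode=prefer-crc

# Expected to work: 3300 is the msgr2 (v2) port, which understands ms_mode
mount -t ceph 172.30.179.155:3300:/volumes/csi/<subvolume-path> /mnt/test \
  -o name=csi-cephfs-node,secretfile=/tmp/csi/keys/<keyfile>,mds_namespace=ocs-storagecluster-cephfilesystem,ms_mode=prefer-crc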

Version of all relevant components (if applicable):


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

Yes. In any cluster that is upgraded from 4.12, we will not be able to mount any new CephFS volumes.

Is there any workaround available to the best of your knowledge?
Failing over the mons one by one brings up new mons on the v2 port, which can handle the mount options, and the volume mount then succeeds.
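For the record, a rough sketch of that failover (the openshift-storage namespace and the rook-ceph-mon-a deployment name are assumptions about a default ODF install; do one mon at a time and wait for quorum to be restored before the next):

# Take one mon down; after Rook's mon failover timeout the operator
# replaces it with a new mon that registers on the msgr2 port 3300
oc -n openshift-storage scale deployment rook-ceph-mon-a --replicas=0

# Watch the replacement mon come up and quorum be restored
oc -n openshift-storage get pods -l app=rook-ceph-mon -w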

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install an ODF 4.12 rhceph-dev build.
2. Upgrade to an ODF 4.13 rhceph-dev build.
3. Try to create a CephFS PVC and a pod that mounts the PVC (a minimal example is sketched below).
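A minimal sketch for step 3, assuming the default ODF CephFS storage class name ocs-storagecluster-cephfs (the PVC/pod names and image are arbitrary):

oc apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cephfs-upgrade-test
spec:
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: 1Gi
  storageClassName: ocs-storagecluster-cephfs
---
apiVersion: v1
kind: Pod
metadata:
  name: cephfs-upgrade-test
spec:
  containers:
  - name: app
    image: registry.access.redhat.com/ubi8/ubi-minimal
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: cephfs-upgrade-test
EOF

On an affected cluster the pod is expected to stay in ContainerCreating with the mount error above; on a fixed cluster it should go Running.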


Actual results:
The volume is not mounted and the pod is stuck in the ContainerCreating phase.

Expected results:
The volume should be mounted and the pod should be running.

Additional info:

Comment 1 Travis Nielsen 2023-03-06 22:15:50 UTC
The ms_mode=prefer-crc option for CephFS is set in the storage class, so the ocs-operator will need to fix this scenario for CephFS.

Upgraded clusters should not set this ms_mode setting on the CephFS storageclass.
Perhaps the simplest approach is to detect whether the storage class exists yet: if it doesn't, assume it's a new cluster and generate the storage class with the ms_mode attribute.
But if the storage class already exists, don't update it with the ms_mode property.
If upgraded clusters want to enable it, we would need to document how to add that property.

Malay, did you see any issue with RBD volumes after the upgrade? Rook would need to fix that case, since the ms_mode for RBD is set by the Rook operator. If so, please open a new BZ for RBD. Let's keep this issue for CephFS in the ocs-operator.

Comment 2 Malay Kumar parida 2023-03-07 06:15:34 UTC
The work to set the map/mount options was done on the Rook side:
for rbd - https://github.com/rook/rook/pull/11523
for cephfs - https://github.com/rook/rook/pull/11625

So I am not sure whether the fix needs to go into ocs-operator or into Rook.

BTW, another question: the problem arises because the mons are still using the v1 port even though we set requireMsgr2 to true.
So I was asking: can't we do something so that, after the upgrade to ODF 4.13, the mons start using the msgr2 port instead of the v1 port?

I think that would be a more complete solution, as we don't want customers to be left on the v1 port forever.
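For anyone who wants to check this on an upgraded cluster, a couple of sketch commands (the resource names and namespace are the ODF defaults; the toolbox needs to be enabled for the second one):

# Is msgr2 required in the CephCluster spec?
oc -n openshift-storage get cephcluster ocs-storagecluster-cephcluster \
  -o jsonpath='{.spec.network.connections.requireMsgr2}'

# Which ports do the mons actually advertise? Look for v2:<ip>:3300 vs v1:<ip>:6789
oc -n openshift-storage rsh deploy/rook-ceph-tools ceph mon dump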

Comment 3 Travis Nielsen 2023-03-07 14:45:28 UTC
Oh right, the cephfs options were also done in Rook; I was thinking they were set in the storageclass. Moving this back to rook.

Comment 4 Travis Nielsen 2023-03-08 16:31:29 UTC
I'll take a look while Madhu is out

Comment 5 Travis Nielsen 2023-03-08 22:53:49 UTC
Testing this change: https://github.com/rook/rook/pull/11859

When msgr2 is required, ensure the mon endpoints passed to the csi configmap are on port 3300. The mons will still be listening on both 6789 and 3300 if they were created before msgr2 was required. Existing volumes can continue using port 6789 while new volumes will use port 3300.
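For anyone verifying the change, one sketch of a check (the rook-ceph-csi-config configmap name and key are Rook defaults, and the namespace is an assumption) is that the monitor list handed to the CSI driver now uses port 3300:

oc -n openshift-storage get cm rook-ceph-csi-config \
  -o jsonpath='{.data.csi-cluster-config-json}'
# Expect the "monitors" entries to end in :3300 once msgr2 is required;
# volumes provisioned before the change keep using :6789.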

Comment 6 Travis Nielsen 2023-03-10 23:25:20 UTC
This fix is merged downstream to 4.13 with https://github.com/red-hat-storage/rook/pull/452
This is expected to fix the issue with CephFS volumes failing to mount after the upgrade from 4.12 to 4.13.

Comment 7 Mudit Agarwal 2023-04-03 11:23:24 UTC
Fixed in version: Any latest stable 4.13 build

Comment 11 Malay Kumar parida 2023-04-05 05:57:32 UTC
@nberry Yes. Creating CephFS volumes and mounting them into a pod in an upgraded cluster will suffice for our case.

Comment 17 errata-xmlrpc 2023-06-21 15:24:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:3742

Comment 18 Red Hat Bugzilla 2023-12-08 04:32:37 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

