Bug 2238272 - [Nfs-Ganesha] Nfs mounts are taking longer time than expected when tried on a scale setup
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: NFS-Ganesha
Version: 7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 7.0
Assignee: Frank Filz
QA Contact: Manisha Saini
Docs Contact: Rivka Pollack
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-09-11 07:33 UTC by Pranav Prakash
Modified: 2023-12-13 15:23 UTC (History)
9 users

Fixed In Version: nfs-ganesha-5.6-3.el9cp, rhceph-container-7-113
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-12-13 15:22:58 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHCEPH-7412 0 None None None 2023-09-11 07:33:43 UTC
Red Hat Product Errata RHBA-2023:7780 0 None None None 2023-12-13 15:23:00 UTC

Description Pranav Prakash 2023-09-11 07:33:15 UTC
Description of problem:
When performing a scale validation of 500 exports across 50 clients, the NFS mounts take much longer than expected. This behaviour is visible only after 300+ mounts. At times the mounts take more than 10-15 hours to complete.


Version-Release number of selected component (if applicable): rhcs 7.0, ganesha 5.5


How reproducible: Always


Steps to Reproduce:
1. Deploy the RHCS 7.0 cluster
2. Create 500 exports
3. Try mounting the exports from the clients

Actual results:
Mounts are taking longer than expected


Expected results:
Mounts should succeed without major delays

Additional info:

Log: http://magna002.ceph.redhat.com/cephci-jenkins/nfs_ganesha_scale_50_clients_500_exports.log

Config: http://pastebin.test.redhat.com/1109131

Comment 7 Frank Filz 2023-09-21 16:37:24 UTC
Yes, let's get a fix in to consolidate Ganesha EXPORTs into a single (or a few) Ceph clients/mounts and see if that resolves this issue.
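Roughly, the idea (an illustrative sketch only, not the actual patch; the Export_Ids and paths below are made up) is that EXPORT blocks whose FSAL CEPH block carries the same cmount_path share a single Ceph client/mount instead of each export opening its own:

EXPORT
{
	Export_Id = 101;                # made-up id
	Path = "/volumes/group1/sub1";  # made-up path
	Pseudo = "/exp101";
	Access_Type = RW;
	Squash = No_Root_Squash;
	FSAL {
		Name = CEPH;
		cmount_path = "/";      # same value as the next export, so both reuse one cephfs mount
	}
}

EXPORT
{
	Export_Id = 102;                # made-up id
	Path = "/volumes/group1/sub2";  # made-up path
	Pseudo = "/exp102";
	Access_Type = RW;
	Squash = No_Root_Squash;
	FSAL {
		Name = CEPH;
		cmount_path = "/";
	}
}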

Comment 9 Manisha Saini 2023-10-04 06:53:07 UTC
Hi Frank,

Could you confirm if this is the new cephfs client consolidation patch - http://registry-proxy.engineering.redhat.com/rh-osbs/rhceph:7-72.0.TEST.gerrithub1170100 

Do we need to make any other changes? Or will deploying NFS with CephFS using this image be sufficient to run our test?

Comment 10 Kaleb KEITHLEY 2023-10-04 18:15:01 UTC
(In reply to Manisha Saini from comment #9)
> Hi Frank,
> 
> Could you confirm if this is the new cephfs client consolidation patch -
> http://registry-proxy.engineering.redhat.com/rh-osbs/rhceph:7-72.0.TEST.
> gerrithub1170100 

I get a 404 error on this link. Please check that it is correct. Yes, I'm on the vpn.

Comment 11 tserlin 2023-10-04 18:37:53 UTC
(In reply to Kaleb KEITHLEY from comment #10)
> (In reply to Manisha Saini from comment #9)
> > Hi Frank,
> > 
> > Could you confirm if this is the new cephfs client consolidation patch -
> > http://registry-proxy.engineering.redhat.com/rh-osbs/rhceph:7-72.0.TEST.
> > gerrithub1170100 
> 
> I get a 404 error on this link. Please check that it is correct. Yes, I'm on
> the vpn.

Visiting http://registry-proxy.engineering.redhat.com won't work. The Brew link for that container is:

https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=2701372

and you would have to pull with podman, e.g.:

$ podman pull registry-proxy.engineering.redhat.com/rh-osbs/rhceph:7-72.0.TEST.gerrithub1170100

Anyway, the -patches branch for that custom nfs-ganesha build in that container is:
https://gitlab.cee.redhat.com/ceph/nfs-ganesha/-/commits/private-tserlin-ffilz-gerrithub-1170100-rebase-back-onto-V5.5

Thomas

Comment 13 Manisha Saini 2023-10-05 20:58:34 UTC
Hi Frank,

I tried both scenarios for applying the additional changes, i.e. 1. modifying the export file and 2. modifying the ganesha.conf file with the export info.
But I am unable to get either working in a containerised deployment.


Scenario 1: Editing the export file and adding "cmount_path": "/" to it
===============

As described in the doc - https://docs.google.com/document/d/1o4DBWc9oP-ayMmHtP7XimB1FCAICGp9I0akQ9In0cLk/edit - the export file is not getting updated with "cmount_path": "/", and "user_id" keeps getting added back when the export file is updated.
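For reference, this is the kind of export spec I was expecting to be able to push (the cluster name "nfsganesha" and fs name "cephfs" are from/assumed for my setup; whether "cmount_path" is accepted by "ceph nfs export apply" in this build is exactly the part I could not get working):

# dump the current spec of the export with pseudo path /ganesha2
[ceph: root@argo018 /]# ceph nfs export get nfsganesha /ganesha2 > /tmp/export2.json

# edit /tmp/export2.json so the fsal block carries cmount_path, e.g.:
#   "fsal": {
#       "name": "CEPH",
#       "fs_name": "cephfs",
#       "cmount_path": "/"
#   }
# and push it back:
[ceph: root@argo018 /]# ceph nfs export apply nfsganesha -i /tmp/export2.json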


Scenario 2: Editing ganesha.conf with export block
=================
Below is the ganesha.conf file, which I updated with an export block.



[ceph: root@argo018 /]# ceph config-key get mgr/cephadm/services/nfs/ganesha.conf
# This file is generated by cephadm.
NFS_CORE_PARAM {
        Enable_NLM = false;
        Enable_RQUOTA = false;
        Protocols = 4;
        NFS_Port = 12049;
        HAProxy_Hosts = 10.8.128.216, 172.20.21.16, 2620:52:0:880:ae1f:6bff:fe0a:1844, 10.8.128.218, 172.20.21.18, 2620:52:0:880:ec4:7aff:fef3:e9f4, 10.8.128.219, 172.20.21.19, 2620:52:0:880:ae1f:6bff:fe0a:17ec, 10.8.128.220, 172.20.21.20, 2620:52:0:880:ec4:7aff:fef3:e9ee, 10.8.128.221, 172.20.21.21, 2620:52:0:880:ae1f:6bff:fe0a:180e;
}

NFSv4 {
        Delegations = false;
        RecoveryBackend = 'rados_cluster';
        Minor_Versions = 1, 2;
}

RADOS_KV {
        UserId = "nfs.nfsganesha.0.0.argo018.apiqnm";
        nodeid = "nfs.nfsganesha.0";
        pool = ".nfs";
        namespace = "nfsganesha";
}


RGW {
        cluster = "ceph";
        name = "client.nfs.nfsganesha.0.0.argo018.apiqnm-rgw";
}

EXPORT
{
	# Export Id (mandatory, each EXPORT must have a unique Export_Id)
	Export_Id = 2;

	# Exported path (mandatory)
	Path = "/";

	# Pseudo Path (required for NFS v4)
	Pseudo = "/ganesha2";

	# Required for access (default is None)
	# Could use CLIENT blocks instead
	Access_Type = RW;

	SecType = "sys";

	Protocols = 3,4;

	MaxRead = 9000000;
	MaxWrite = 9000000;

	Squash = No_Root_Squash;

	# Exporting FSAL
	FSAL {
		Name = CEPH;
		cmount_path = "/";
	}
}


===============
When restarting NFS Ganesha (redeploying the nfs daemon), the ganesha service started on argo018 but failed on argo019.

[ceph: root@argo016 /]# ceph orch ps | grep nfs
haproxy.nfs.nfsganesha.argo018.bayiim     argo018  *:2049,9049       running (114m)    10m ago  114m    17.2M        -  2.4.17-9f97155   bda92490ac6c  a519fb6243e6
haproxy.nfs.nfsganesha.argo019.cjdvgw     argo019  *:2049,9049       running (114m)    10m ago  114m    18.6M        -  2.4.17-9f97155   bda92490ac6c  6cb692b58a51
keepalived.nfs.nfsganesha.argo018.jcxjht  argo018                    running (114m)    10m ago  114m    1765k        -  2.2.4            b79b516c07ed  7a4d08395cde
keepalived.nfs.nfsganesha.argo019.hknukw  argo019                    running (114m)    10m ago  114m    1765k        -  2.2.4            b79b516c07ed  4fef36f212fb
nfs.nfsganesha.0.0.argo018.apiqnm         argo018  *:12049           running (10m)     10m ago  114m    46.9M        -  5.5              ffbec4c9b233  dd6cf28dce72
nfs.nfsganesha.1.0.argo019.pfkqrp         argo019  *:12049           unknown           10m ago  114m        -        -  <unknown>        <unknown>     <unknown>

================
When I tried mounting the share from the argo018 machine, the mount command hangs:

[root@argo021 mnt]# mount -t nfs -o vers=4,port=2049 10.8.128.218:/ganesha1 /mnt/test/
mount.nfs: Connection refused

================

ganesha.log of argo018 machine

42:30 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[main] nfs_start_grace :STATE :EVENT :NFS Server Now IN GRACE, duration 90
42:30 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[main] nfs_start_grace :STATE :EVENT :grace reload client info completed from backend
42:30 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[main] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
42:30 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[main] rados_cluster_grace_enforcing :CLIENT ID :EVENT :rados_cluster_grace_enforcing: ret=0
42:30 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[main] create_export :FSAL :CRIT :Unable to init Ceph handle.
42:30 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[main] mdcache_fsal_create_export :FSAL :MAJ :Failed to call create_export on underlying FSAL Ceph
42:30 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[main] fsal_cfg_commit :CONFIG :CRIT :Could not create export for (/ganesha2) to (/)
42:30 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[main] export_commit_common :CONFIG :WARN :A protocol is specified for export 2 that is not enabled in NFS_CORE_PARAM, fixing up
42:30 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[main] export_commit_common :CONFIG :CRIT :fsal_export is NULL
42:30 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[main] main :NFS STARTUP :WARN :No export entries found in configuration file !!!
42:30 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[main] config_errs_to_log :CONFIG :CRIT :Config File (/etc/ganesha/ganesha.conf:54): 1 validation errors in block FSAL
42:30 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[main] config_errs_to_log :CONFIG :CRIT :Config File (/etc/ganesha/ganesha.conf:54): Errors processing block (FSAL)
42:30 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[main] config_errs_to_log :CONFIG :CRIT :Config File (/etc/ganesha/ganesha.conf:29): 1 validation errors in block EXPORT
42:30 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[main] config_errs_to_log :CONFIG :CRIT :Config File (/etc/ganesha/ganesha.conf:29): Errors processing block (EXPORT)
42:30 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[main] config_errs_to_log :CONFIG :CRIT :Config File (/etc/ganesha/ganesha.conf:24): Unknown block (RGW)
42:30 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[main] lower_my_caps :NFS STARTUP :EVENT :CAP_SYS_RESOURCE was successfully removed for proper quota management in FSAL
42:30 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[main] lower_my_caps :NFS STARTUP :EVENT :currently set capabilities are: cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_sys>
42:30 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[main] gsh_dbus_pkginit :DBUS :CRIT :dbus_bus_get failed (Failed to connect to socket /run/dbus/system_bus_socket: No such file or directory)
42:30 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[main] gsh_dbus_register_path :DBUS :CRIT :dbus_connection_register_object_path called with no DBUS connection
42:30 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[main] gsh_dbus_register_path :DBUS :CRIT :dbus_connection_register_object_path called with no DBUS connection
42:30 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[main] gsh_dbus_register_path :DBUS :CRIT :dbus_connection_register_object_path called with no DBUS connection
42:30 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[main] nfs_Init_svc :DISP :CRIT :Cannot acquire credentials for principal nfs
42:30 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[main] gsh_dbus_register_path :DBUS :CRIT :dbus_connection_register_object_path called with no DBUS connection
42:30 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[main] nfs_Init_admin_thread :NFS CB :EVENT :Admin thread initialized
42:30 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[main] nfs_rpc_cb_init_ccache :NFS STARTUP :EVENT :Callback creds directory (/var/run/ganesha) already exists
42:30 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[main] find_keytab_entry :NFS CB :WARN :Configuration file does not specify default realm while getting default realm name
42:30 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[main] gssd_refresh_krb5_machine_credential :NFS CB :CRIT :ERROR: gssd_refresh_krb5_machine_credential: no usable keytab entry found in keytab /etc/krb5.keytab for connection with host localhost
42:30 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[main] nfs_rpc_cb_init_ccache :NFS STARTUP :WARN :gssd_refresh_krb5_machine_credential failed (-1765328160:2)
42:30 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[main] nfs_Start_threads :THREAD :EVENT :Starting delayed executor.
42:30 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[main] nfs_Start_threads :THREAD :EVENT :gsh_dbusthread was started successfully
42:30 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[dbus] gsh_dbus_thread :DBUS :CRIT :DBUS not initialized, service thread exiting
42:30 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[dbus] gsh_dbus_thread :DBUS :EVENT :shutdown
42:30 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[main] nfs_Start_threads :THREAD :EVENT :admin thread was started successfully
42:30 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[main] nfs_Start_threads :THREAD :EVENT :reaper thread was started successfully
42:30 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[main] nfs_Start_threads :THREAD :EVENT :General fridge was started successfully
42:30 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
42:30 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
42:30 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[main] nfs_start :NFS STARTUP :EVENT :             NFS SERVER INITIALIZED
42:30 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
42:40 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
42:50 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
43:00 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
43:10 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
43:20 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
43:30 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
43:40 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
43:50 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
44:00 : epoch 651f1fb6 : argo018 : ganesha.nfsd-2[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)

===============


Ganesha.log of argo019 machine


Oct 05 20:43:16 argo019 bash[4042305]: 92df895698f29bef0cc23a351104616ea136ce033668e3c04edbd8cd7832a521
Oct 05 20:43:16 argo019 podman[4042305]: 2023-10-05 20:43:16.881949627 +0000 UTC m=+0.025476943 image pull  registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:59d44bf3148860191743c61856075f58315e4ecc425ef436ba32a12e6790e4d4
Oct 05 20:43:16 argo019 systemd[1]: Started Ceph nfs.nfsganesha.1.0.argo019.pfkqrp for 0662c37c-4315-11ee-b3bd-ac1f6b0a1844.
Oct 05 20:43:16 argo019 ceph-0662c37c-4315-11ee-b3bd-ac1f6b0a1844-nfs-nfsganesha-1-0-argo019-pfkqrp[4042327]: 05/10/2023 20:43:16 : epoch 651f1fe4 : argo019 : ganesha.nfsd-2[main] init_logging :LOG :NULL :LOG: Setting log level for all components to NIV_EVENT
Oct 05 20:43:16 argo019 ceph-0662c37c-4315-11ee-b3bd-ac1f6b0a1844-nfs-nfsganesha-1-0-argo019-pfkqrp[4042327]: 05/10/2023 20:43:16 : epoch 651f1fe4 : argo019 : ganesha.nfsd-2[main] main :MAIN :EVENT :ganesha.nfsd Starting: Ganesha Version 5.5
Oct 05 20:43:16 argo019 ceph-0662c37c-4315-11ee-b3bd-ac1f6b0a1844-nfs-nfsganesha-1-0-argo019-pfkqrp[4042327]: 05/10/2023 20:43:16 : epoch 651f1fe4 : argo019 : ganesha.nfsd-2[main] nfs_set_param_from_conf :NFS STARTUP :EVENT :Configuration file successfully parsed
Oct 05 20:43:16 argo019 ceph-0662c37c-4315-11ee-b3bd-ac1f6b0a1844-nfs-nfsganesha-1-0-argo019-pfkqrp[4042327]: 05/10/2023 20:43:16 : epoch 651f1fe4 : argo019 : ganesha.nfsd-2[main] monitoring_init :NFS STARTUP :EVENT :Init monitoring at 0.0.0.0:9587
Oct 05 20:43:16 argo019 ceph-0662c37c-4315-11ee-b3bd-ac1f6b0a1844-nfs-nfsganesha-1-0-argo019-pfkqrp[4042327]: 05/10/2023 20:43:16 : epoch 651f1fe4 : argo019 : ganesha.nfsd-2[main] fsal_init_fds_limit :MDCACHE LRU :EVENT :Setting the system-imposed limit on FDs to 1048576.
Oct 05 20:43:16 argo019 ceph-0662c37c-4315-11ee-b3bd-ac1f6b0a1844-nfs-nfsganesha-1-0-argo019-pfkqrp[4042327]: 05/10/2023 20:43:16 : epoch 651f1fe4 : argo019 : ganesha.nfsd-2[main] init_server_pkgs :NFS STARTUP :EVENT :Initializing ID Mapper.
Oct 05 20:43:16 argo019 ceph-0662c37c-4315-11ee-b3bd-ac1f6b0a1844-nfs-nfsganesha-1-0-argo019-pfkqrp[4042327]: 05/10/2023 20:43:16 : epoch 651f1fe4 : argo019 : ganesha.nfsd-2[main] init_server_pkgs :NFS STARTUP :EVENT :ID Mapper successfully initialized.
Oct 05 20:43:16 argo019 ceph-0662c37c-4315-11ee-b3bd-ac1f6b0a1844-nfs-nfsganesha-1-0-argo019-pfkqrp[4042327]: 05/10/2023 20:43:16 : epoch 651f1fe4 : argo019 : ganesha.nfsd-2[main] rados_kv_connect :CLIENT ID :EVENT :Failed to connect: -13
Oct 05 20:43:16 argo019 ceph-0662c37c-4315-11ee-b3bd-ac1f6b0a1844-nfs-nfsganesha-1-0-argo019-pfkqrp[4042327]: 05/10/2023 20:43:16 : epoch 651f1fe4 : argo019 : ganesha.nfsd-2[main] rados_cluster_init :CLIENT ID :EVENT :Failed to connect to cluster: -13
Oct 05 20:43:17 argo019 podman[4042353]: 2023-10-05 20:43:17.366389648 +0000 UTC m=+0.032668356 container died 92df895698f29bef0cc23a351104616ea136ce033668e3c04edbd8cd7832a521 (image=registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:59d44bf3148860191743c61856075f58315e4ecc425ef436ba32a12e6790e4d4, name=ceph-0662c37c-4315-11ee-b3bd-ac1f6b0a1844-nfs-nfsganesha-1-0-argo019-pfkqrp, RELEASE=main, summary=Provides the latest Red Hat Ceph Storage 7 on RHEL 9 in a fully featured and supported base image., CEPH_POINT_RELEASE=, GIT_BRANCH=main, com.redhat.license_terms=https://www.redhat.com/agreements, io.openshift.tags=rhceph ceph, vendor=Red Hat, Inc., GIT_REPO=https://github.com/ceph/ceph-container.git, name=rhceph, build-date=2023-10-04T22:16:49, distribution-scope=public, com.redhat.component=rhceph-container, io.k8s.description=Red Hat Ceph Storage 7, io.k8s.display-name=Red Hat Ceph Storage 7 on RHEL 9, description=Red Hat Ceph Storage 7, io.openshift.expose-services=, url=https://access.redhat.com/containers/#/registry.access.redhat.com/rhceph/images/7-81.1.TEST.gerrithub1170100, vcs-type=git, GIT_CLEAN=True, ceph=True, io.buildah.version=1.29.0, architecture=x86_64, GIT_COMMIT=54fe819971d3d2dbde321203c5644c08d10742d5, version=7, vcs-ref=6a3109234de1e767361375a550322ef998fe07ed, maintainer=Guillaume Abrioux <gabrioux>, release=81.1.TEST.gerrithub1170100)
Oct 05 20:43:17 argo019 podman[4042353]: 2023-10-05 20:43:17.381389564 +0000 UTC m=+0.047668269 container remove 92df895698f29bef0cc23a351104616ea136ce033668e3c04edbd8cd7832a521 (image=registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:59d44bf3148860191743c61856075f58315e4ecc425ef436ba32a12e6790e4d4, name=ceph-0662c37c-4315-11ee-b3bd-ac1f6b0a1844-nfs-nfsganesha-1-0-argo019-pfkqrp, build-date=2023-10-04T22:16:49, GIT_REPO=https://github.com/ceph/ceph-container.git, url=https://access.redhat.com/containers/#/registry.access.redhat.com/rhceph/images/7-81.1.TEST.gerrithub1170100, description=Red Hat Ceph Storage 7, vendor=Red Hat, Inc., GIT_BRANCH=main, name=rhceph, release=81.1.TEST.gerrithub1170100, io.openshift.tags=rhceph ceph, com.redhat.license_terms=https://www.redhat.com/agreements, io.k8s.display-name=Red Hat Ceph Storage 7 on RHEL 9, io.buildah.version=1.29.0, vcs-ref=6a3109234de1e767361375a550322ef998fe07ed, vcs-type=git, GIT_COMMIT=54fe819971d3d2dbde321203c5644c08d10742d5, distribution-scope=public, RELEASE=main, com.redhat.component=rhceph-container, ceph=True, version=7, maintainer=Guillaume Abrioux <gabrioux>, io.openshift.expose-services=, CEPH_POINT_RELEASE=, io.k8s.description=Red Hat Ceph Storage 7, architecture=x86_64, GIT_CLEAN=True, summary=Provides the latest Red Hat Ceph Storage 7 on RHEL 9 in a fully featured and supported base image.)
Oct 05 20:43:17 argo019 systemd[1]: ceph-0662c37c-4315-11ee-b3bd-ac1f6b0a1844.1.0.argo019.pfkqrp.service: Main process exited, code=exited, status=139/n/a
Oct 05 20:43:17 argo019 systemd[1]: ceph-0662c37c-4315-11ee-b3bd-ac1f6b0a1844.1.0.argo019.pfkqrp.service: Failed with result 'exit-code'.
Oct 05 20:43:27 argo019 systemd[1]: ceph-0662c37c-4315-11ee-b3bd-ac1f6b0a1844.1.0.argo019.pfkqrp.service: Scheduled restart job, restart counter is at 5.
Oct 05 20:43:27 argo019 systemd[1]: Stopped Ceph nfs.nfsganesha.1.0.argo019.pfkqrp for 0662c37c-4315-11ee-b3bd-ac1f6b0a1844.
Oct 05 20:43:27 argo019 systemd[1]: ceph-0662c37c-4315-11ee-b3bd-ac1f6b0a1844.1.0.argo019.pfkqrp.service: Start request repeated too quickly.
Oct 05 20:43:27 argo019 systemd[1]: ceph-0662c37c-4315-11ee-b3bd-ac1f6b0a1844.1.0.argo019.pfkqrp.service: Failed with result 'exit-code'.
Oct 05 20:43:27 argo019 systemd[1]: Failed to start Ceph nfs.nfsganesha.1.0.argo019.pfkqrp for 0662c37c-4315-11ee-b3bd-ac1f6b0a1844.

Comment 14 Manisha Saini 2023-10-05 21:09:45 UTC
The most straightforward approach to accomplish this task is by directly editing the export file for each export. However, I'm having difficulty determining the exact method to achieve this. The steps I've attempted so far haven't successfully updated the export block, particularly in terms of setting "cmount_path" to "/" and removing the "user_id" field. I have been successful in modifying other entries, such as changing permissions to read-only (RO) and adding client permissions.

Could you please provide guidance on how to make this work in my test environment for testing the patch? I'm uncertain whom I should contact to gain a better understanding of how RADOS is used to update the Ganesha export file.
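For what it's worth, my current understanding (which may well be wrong, hence the question) is that each export is stored as an object named export-<id> in the .nfs pool under the cluster namespace, so in principle it could be fetched, edited and written back with rados, roughly:

[ceph: root@argo016 /]# rados ls -p .nfs --all | grep export
[ceph: root@argo016 /]# rados get -N nfsganesha -p .nfs export-2 /tmp/export-2.conf
# edit /tmp/export-2.conf (e.g. add cmount_path = "/"; inside the FSAL block), then:
[ceph: root@argo016 /]# rados put -N nfsganesha -p .nfs export-2 /tmp/export-2.conf

I assume the ganesha daemons would then still need a SIGHUP or restart to pick the change up.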

Comment 15 Frank Filz 2023-10-06 21:25:10 UTC
The best bet for an immediate workaround might be to set up exports just using sub-directories.

Add the following to your ganesha.conf:

CEPH {
    # change this to the actual location of your ceph configuration
    # that should give Ganesha access to the MDS and any cephx keys required.
    ceph_conf = "/etc/ceph/ceph.conf";
}

Then add a ceph root export:

EXPORT
{
	# pick an export id:
	Export_Id = 99999;

	# Exported path (mandatory)
	Path = "/";

	# Pseudo Path (required for NFS v4)
	Pseudo = "/ceph";

	# Required for access (default is None)
	# Could use CLIENT blocks instead
	Access_Type = RW;

	SecType = "sys";

	Protocols = 4;

	Squash = No_Root_Squash;

	# Exporting FSAL
	FSAL {
		Name = CEPH;
		cmount_path = "/";
	}
}

Then mount that from a client and create directories for your exports:

mount server:/ceph /mnt
mkdir /mnt/exp001
mkdir /mnt/exp002
etc.
You may want to write a script to create 100 directories...
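Something along these lines would do (sketch only; adjust the count and mount point as needed):

for i in $(seq -w 1 100); do
	mkdir /mnt/exp$i
done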

Then add an export for each:

EXPORT
{
	# change this for each export
	Export_Id = 1001;

	# Exported path (mandatory), change for each export
	Path = "/exp001";

	# Pseudo Path (required for NFS v4), change for each export
	Pseudo = "/ceph/exp001";

	# Required for access (default is None)
	# Could use CLIENT blocks instead
	Access_Type = RW;

	SecType = "sys";

	Protocols = 4;

	Squash = No_Root_Squash;

	# Exporting FSAL
	FSAL {
		Name = CEPH;
		cmount_path = "/";
	}
}

After adding those to the config file, you can either restart Ganesha or trigger a dynamic export config reload with SIGHUP:

killall -HUP ganesha.nfsd

That SHOULD get you a working setup with 100 exports.

Alternatively, contact Mark Kogan for how he set his system up.

Comment 20 Sachin Punadikar 2023-11-20 12:17:56 UTC
Logged into the setup:
1. Found that the NFS Ganesha container was running only on the cali019 node and not on the cali015 node
[root@cali015 ~]# cephadm ls | grep nfs
        "name": "nfs.cephfs-nfs.0.0.cali015.gxdbuc",
        "systemd_unit": "ceph-4e687a60-638e-11ee-8772-b49691cee574.0.0.cali015.gxdbuc",
        "service_name": "nfs.cephfs-nfs",
        "name": "haproxy.nfs.cephfs-nfs.cali015.xmcwkq",
        "systemd_unit": "ceph-4e687a60-638e-11ee-8772-b49691cee574.cephfs-nfs.cali015.xmcwkq",
        "service_name": "ingress.nfs.cephfs-nfs",
        "name": "keepalived.nfs.cephfs-nfs.cali015.amerio",
        "systemd_unit": "ceph-4e687a60-638e-11ee-8772-b49691cee574.cephfs-nfs.cali015.amerio",
        "service_name": "ingress.nfs.cephfs-nfs",
[root@cali015 ~]# cephadm logs --name haproxy.nfs.cephfs-nfs.cali015.xmcwkq
Inferring fsid 4e687a60-638e-11ee-8772-b49691cee574
-- No entries --
[root@cali015 ~]# cephadm logs --name nfs.cephfs-nfs.0.0.cali015.gxdbuc
Inferring fsid 4e687a60-638e-11ee-8772-b49691cee574
-- No entries --

2. On node cali019, NFS Ganesha faced syntax errors while processing the exports
Nov 17 10:57:05 cali019 ceph-4e687a60-638e-11ee-8772-b49691cee574-nfs-cephfs-nfs-1-0-cali019-tptjzg[530205]: 17/11/2023 10:57:05 : epoch 65530ef4 : cali019 : ganesha.nfsd-2[sigmgr] sigmgr_thread :MAIN :EVENT :SIGHUP_HANDLER: Received SIGHUP.... initiating export list reload
Nov 17 10:57:05 cali019 ceph-4e687a60-638e-11ee-8772-b49691cee574-nfs-cephfs-nfs-1-0-cali019-tptjzg[530205]: 17/11/2023 10:57:05 : epoch 65530ef4 : cali019 : ganesha.nfsd-2[sigmgr] ganesha_yyerror :CONFIG :CRIT :Config file (rados://.nfs/cephfs-nfs/conf-nfs.cephfs-nfs:1) error: syntax error
Nov 17 10:57:05 cali019 ceph-4e687a60-638e-11ee-8772-b49691cee574-nfs-cephfs-nfs-1-0-cali019-tptjzg[530205]:
Nov 17 10:57:05 cali019 ceph-4e687a60-638e-11ee-8772-b49691cee574-nfs-cephfs-nfs-1-0-cali019-tptjzg[530205]: 17/11/2023 10:57:05 : epoch 65530ef4 : cali019 : ganesha.nfsd-2[sigmgr] reread_config :CONFIG :CRIT :Error while parsing new configuration file /etc/ganesha/ganesha.conf
Nov 17 10:57:05 cali019 ceph-4e687a60-638e-11ee-8772-b49691cee574-nfs-cephfs-nfs-1-0-cali019-tptjzg[530205]: 17/11/2023 10:57:05 : epoch 65530ef4 : cali019 : ganesha.nfsd-2[sigmgr] config_errs_to_log :CONFIG :CRIT :Config File (rados://.nfs/cephfs-nfs/conf-nfs.cephfs-nfs:1): Unexpected character (%)
Nov 17 10:57:05 cali019 ceph-4e687a60-638e-11ee-8772-b49691cee574-nfs-cephfs-nfs-1-0-cali019-tptjzg[530205]: 17/11/2023 10:57:05 : epoch 65530ef4 : cali019 : ganesha.nfsd-2[sigmgr] config_errs_to_log :CONFIG :CRIT :Config File (/etc/ganesha/ganesha.conf:0): Configuration syntax errors found

In summary, the NFS Ganesha process is running but not serving any exports.

Comment 22 Sachin Punadikar 2023-11-21 13:26:43 UTC
More details:
NFS Ganesha refers to the file "/etc/ganesha/ganesha.conf".
This file has the entries below (the last entry points to the export-related config).

# cat /etc/ganesha/ganesha.conf
# This file is generated by cephadm.
NFS_CORE_PARAM {
        Enable_NLM = false;
        Enable_RQUOTA = false;
        Protocols = 4;
        NFS_Port = 12049;
}

NFSv4 {
        Delegations = false;
        RecoveryBackend = 'rados_cluster';
        Minor_Versions = 1, 2;
}

RADOS_KV {
        UserId = "nfs.cephfs-nfs.1.0.cali019.tptjzg";
        nodeid = "nfs.cephfs-nfs.1";
        pool = ".nfs";
        namespace = "cephfs-nfs";
}

RADOS_URLS {
        UserId = "nfs.cephfs-nfs.1.0.cali019.tptjzg";
        watch_url = "rados://.nfs/cephfs-nfs/conf-nfs.cephfs-nfs";
}

RGW {
        cluster = "ceph";
        name = "client.nfs.cephfs-nfs.1.0.cali019.tptjzg-rgw";
}

%url    rados://.nfs/cephfs-nfs/conf-nfs.cephfs-nfs

The error is reported for the first line of the rados object "rados://.nfs/cephfs-nfs/conf-nfs.cephfs-nfs".

The rados object has the contents below, which is what leads to this issue:
[ceph: root@cali013 /]# rados ls -p .nfs --all | grep conf
cephfs-nfs	conf-nfs.cephfs-nfs
[ceph: root@cali013 /]# rados get -N cephfs-nfs -p ".nfs" conf-nfs.cephfs-nfs /nfs_conf
[ceph: root@cali013 /]# cat /nfs_conf
%%url "rados://.nfs/cephfs-nfs/export-1"   <=problematic extra %

%url "rados://.nfs/cephfs-nfs/export-2"

%url "rados://.nfs/cephfs-nfs/export-3"

%url "rados://.nfs/cephfs-nfs/export-4"
...
...
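A possible manual fix (just a sketch: it corrects the stray leading % on the first line and writes the object back; Ganesha would presumably still need a SIGHUP or restart to re-read the config) would be roughly:

[ceph: root@cali013 /]# rados get -N cephfs-nfs -p ".nfs" conf-nfs.cephfs-nfs /nfs_conf
[ceph: root@cali013 /]# sed -i 's/^%%url/%url/' /nfs_conf
[ceph: root@cali013 /]# rados put -N cephfs-nfs -p ".nfs" conf-nfs.cephfs-nfs /nfs_conf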

Comment 23 Frank Filz 2023-11-21 18:55:45 UTC
This is needed for 7.0. I don't think it needs doc text.

Comment 27 Frank Filz 2023-11-28 19:44:16 UTC
That makes sense. So this is a setup issue and not a bug.

Can we close this as not a bug?

Comment 30 errata-xmlrpc 2023-12-13 15:22:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 7.0 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:7780

