Bug 1570084

Summary: [Ganesha] Rebooted node doesn't sync ongoing volume export entries in "showmount -e" in parallel when node is coming up
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Manisha Saini <msaini>
Component: nfs-ganesha
Assignee: Kaleb KEITHLEY <kkeithle>
Status: CLOSED CANTFIX
QA Contact: Manisha Saini <msaini>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.4
CC: dang, ffilz, grajoria, jthottan, msaini, pasik, rcyriac, rhs-bugs, sankarshan, storage-qa-internal
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Known Issue
Doc Text:
Cause: A volume is exported before nfs-ganesha has completed its startup.
Consequence: The dbus command used to export the volume fails.
Workaround (if any): Restart the nfs-ganesha process.
Result: The volume export fails.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-04-29 11:56:07 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1503143

Description Manisha Saini 2018-04-20 15:24:43 UTC
Description of problem:

When a node comes back up after a reboot, it does not sync the volume exports that are happening in parallel, so those entries are missing from "showmount -e" on the rebooted node. Because the node is still busy re-exporting the older volumes (those exported before the reboot), it skips the export entries that are being added in parallel on the other 3 nodes.

With ~25 volumes exported, 1-2 volume entries are skipped; with ~100 volume exports in flight, around 5-6 entries are skipped.

This mismatch in export entries across nodes causes problems during volume unexport ("staging failed" errors).
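One way to spot such a mismatch is to diff the export lists of two nodes. A minimal sketch, using sample data in place of real `showmount -e <node> --no-headers` output captured on each node (file paths and volume names here are illustrative):

```shell
# Stand-ins for "showmount -e <node> --no-headers" output from two nodes.
cat > /tmp/exports.node1 <<'EOF'
/vol57 (everyone)
/vol62 (everyone)
EOF
cat > /tmp/exports.node2 <<'EOF'
/vol57 (everyone)
/vol58 (everyone)
/vol62 (everyone)
EOF
# Keep only the export paths (first column), sorted.
awk '{print $1}' /tmp/exports.node1 | sort > /tmp/paths.node1
awk '{print $1}' /tmp/exports.node2 | sort > /tmp/paths.node2
# comm -3 prints lines unique to either list, i.e. the out-of-sync
# exports; with the sample data above, that is /vol58.
comm -3 /tmp/paths.node1 /tmp/paths.node2
```

Running the same comparison against each pair of cluster nodes after the reboot would show exactly which exports failed to sync.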


Version-Release number of selected component (if applicable):

# rpm -qa | grep ganesha
glusterfs-ganesha-3.12.2-7.el7rhgs.x86_64
nfs-ganesha-gluster-2.5.5-5.el7rhgs.x86_64
nfs-ganesha-2.5.5-5.el7rhgs.x86_64
nfs-ganesha-debuginfo-2.5.5-5.el7rhgs.x86_64



How reproducible:
3/3

Steps to Reproduce:
1. Create a 4-node ganesha cluster.
2. Create and start ~100 volumes.
3. Export all ~100 volumes in a loop.
4. While the exports are in progress, reboot one node.
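Step 3 can be sketched as a shell loop; shown here as a dry run (remove `echo` to actually issue the commands). The `ganesha.enable` volume option is what triggers the dbus export request; the volume names vol1..vol100 are assumed from the report:

```shell
# Dry-run sketch of exporting ~100 volumes in a loop.
# Remove "echo" to actually run the gluster commands.
for i in $(seq 1 100); do
    echo gluster volume set "vol$i" ganesha.enable on
done
```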

Actual results:

Some of the volume export entries are not reflected in "showmount -e" on the rebooted node.


Expected results:
All volumes exported via ganesha should be reflected on the rebooted node.


Additional info:
--
Snippet from the rebooted node, where entries for vol58 through vol61 are missing
--

----------------------------------
/vol37 (everyone)
/vol38 (everyone)
/vol39 (everyone)
/vol40 (everyone)
/vol41 (everyone)
/vol42 (everyone)
/vol43 (everyone)
/vol44 (everyone)
/vol45 (everyone)
/vol46 (everyone)
/vol47 (everyone)
/vol48 (everyone)
/vol49 (everyone)
/vol50 (everyone)
/vol51 (everyone)
/vol52 (everyone)
/vol53 (everyone)
/vol54 (everyone)
/vol55 (everyone)
/vol56 (everyone)
/vol57 (everyone)
/vol62 (everyone)
/vol63 (everyone)
/vol64 (everyone)

-------------------------

One of the other 3 nodes-

-------------

/vol38 (everyone)
/vol39 (everyone)
/vol40 (everyone)
/vol41 (everyone)
/vol42 (everyone)
/vol43 (everyone)
/vol44 (everyone)
/vol45 (everyone)
/vol46 (everyone)
/vol47 (everyone)
/vol48 (everyone)
/vol49 (everyone)
/vol50 (everyone)
/vol51 (everyone)
/vol52 (everyone)
/vol53 (everyone)
/vol54 (everyone)
/vol55 (everyone)
/vol56 (everyone)
/vol57 (everyone)
/vol58 (everyone)
/vol59 (everyone)
/vol60 (everyone)
/vol61 (everyone)
/vol62 (everyone)

---------------

Rebooted node ganesha.log snippet
---------------------------------

20/04/2018 20:47:26 : epoch c6970000 : dhcp37-218.lab.eng.blr.redhat.com : ganesha.nfsd-4152[main] glusterfs_create_export :FSAL :EVENT :Volume vol56 exported at : '/'
20/04/2018 20:47:29 : epoch c6970000 : dhcp37-218.lab.eng.blr.redhat.com : ganesha.nfsd-4152[main] glusterfs_create_export :FSAL :EVENT :Volume vol57 exported at : '/'
20/04/2018 20:47:31 : epoch c6970000 : dhcp37-218.lab.eng.blr.redhat.com : ganesha.nfsd-4152[main] lower_my_caps :NFS STARTUP :EVENT :CAP_SYS_RESOURCE was successfully removed for proper quota management in FSAL
20/04/2018 20:47:31 : epoch c6970000 : dhcp37-218.lab.eng.blr.redhat.com : ganesha.nfsd-4152[main] lower_my_caps :NFS STARTUP :EVENT :currenty set capabilities are: = cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap+ep
20/04/2018 20:47:32 : epoch c6970000 : dhcp37-218.lab.eng.blr.redhat.com : ganesha.nfsd-4152[main] nfs_Init_svc :DISP :CRIT :Cannot acquire credentials for principal nfs
20/04/2018 20:47:32 : epoch c6970000 : dhcp37-218.lab.eng.blr.redhat.com : ganesha.nfsd-4152[main] nfs_Init_admin_thread :NFS CB :EVENT :Admin thread initialized
20/04/2018 20:47:32 : epoch c6970000 : dhcp37-218.lab.eng.blr.redhat.com : ganesha.nfsd-4152[main] nfs4_start_grace :STATE :EVENT :NFS Server Now IN GRACE, duration 90
20/04/2018 20:47:32 : epoch c6970000 : dhcp37-218.lab.eng.blr.redhat.com : ganesha.nfsd-4152[main] nfs_rpc_cb_init_ccache :NFS STARTUP :EVENT :Callback creds directory (/var/run/ganesha) already exists
20/04/2018 20:47:32 : epoch c6970000 : dhcp37-218.lab.eng.blr.redhat.com : ganesha.nfsd-4152[main] nfs_rpc_cb_init_ccache :NFS STARTUP :WARN :gssd_refresh_krb5_machine_credential failed (-1765328160:0)
20/04/2018 20:47:32 : epoch c6970000 : dhcp37-218.lab.eng.blr.redhat.com : ganesha.nfsd-4152[main] nfs_Start_threads :THREAD :EVENT :Starting delayed executor.
20/04/2018 20:47:33 : epoch c6970000 : dhcp37-218.lab.eng.blr.redhat.com : ganesha.nfsd-4152[main] nfs_Start_threads :THREAD :EVENT :gsh_dbusthread was started successfully
20/04/2018 20:47:33 : epoch c6970000 : dhcp37-218.lab.eng.blr.redhat.com : ganesha.nfsd-4152[main] nfs_Start_threads :THREAD :EVENT :admin thread was started successfully
20/04/2018 20:47:33 : epoch c6970000 : dhcp37-218.lab.eng.blr.redhat.com : ganesha.nfsd-4152[main] nfs_Start_threads :THREAD :EVENT :reaper thread was started successfully
20/04/2018 20:47:33 : epoch c6970000 : dhcp37-218.lab.eng.blr.redhat.com : ganesha.nfsd-4152[main] nfs_Start_threads :THREAD :EVENT :General fridge was started successfully
20/04/2018 20:47:33 : epoch c6970000 : dhcp37-218.lab.eng.blr.redhat.com : ganesha.nfsd-4152[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
20/04/2018 20:47:33 : epoch c6970000 : dhcp37-218.lab.eng.blr.redhat.com : ganesha.nfsd-4152[main] nfs_start :NFS STARTUP :EVENT :             NFS SERVER INITIALIZED
20/04/2018 20:47:33 : epoch c6970000 : dhcp37-218.lab.eng.blr.redhat.com : ganesha.nfsd-4152[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
20/04/2018 20:47:33 : epoch c6970000 : dhcp37-218.lab.eng.blr.redhat.com : ganesha.nfsd-4152[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE
20/04/2018 20:47:49 : epoch c6970000 : dhcp37-218.lab.eng.blr.redhat.com : ganesha.nfsd-4152[dbus_heartbeat] glusterfs_create_export :FSAL :EVENT :Volume vol62 exported at : '/'

-------------------------------

Comment 2 Daniel Gryniewicz 2018-04-20 16:59:19 UTC
These exports are being added by dbus, correct?  I think there's a window between when exports are read on startup and when the DBUS thread is started, during which exports could be lost.

I'm tempted to say "don't do that".  I'm not sure we can close this hole entirely.  Frank, what are your thoughts?
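If the window cannot be closed inside ganesha itself, callers can at least avoid it by not sending export requests until ganesha's dbus export manager responds. A minimal sketch, assuming the standard org.ganesha.nfsd dbus interface; the function name and retry budget are illustrative, not part of any shipped tooling:

```shell
# Poll ganesha's dbus ExportMgr; return 0 once it answers,
# or 1 after the given number of attempts (roughly one per second).
wait_for_ganesha_dbus() {
    tries=0
    while [ "$tries" -lt "$1" ]; do
        if dbus-send --system --print-reply --dest=org.ganesha.nfsd \
               /org/ganesha/nfsd/ExportMgr \
               org.ganesha.nfsd.exportmgr.ShowExports >/dev/null 2>&1; then
            return 0
        fi
        tries=$((tries + 1))
        sleep 1
    done
    return 1
}

# Example: only export once ganesha is reachable (waits up to 60s):
# wait_for_ganesha_dbus 60 && gluster volume set vol1 ganesha.enable on
```

Gating the export loop this way would keep the dbus add-export calls out of the startup window described above, though it does not help exports initiated from other nodes while this one is still replaying its old export list.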