Description of problem:
When a node comes back up after a reboot, it does not pick up, in parallel, the volume exports that are still being added on the other nodes, so "showmount -e" on the rebooted node is missing entries. While the node is busy re-exporting the older volumes (those exported before the reboot), it skips the export entries being created concurrently on the other 3 nodes. With ~25 volumes being exported it skips 1-2 entries; with ~100 ongoing volume exports it skips around 5-6 entries. This mismatch in export entries across nodes later causes volume unexport to fail ("staging failed" errors).

Version-Release number of selected component (if applicable):
# rpm -qa | grep ganesha
glusterfs-ganesha-3.12.2-7.el7rhgs.x86_64
nfs-ganesha-gluster-2.5.5-5.el7rhgs.x86_64
nfs-ganesha-2.5.5-5.el7rhgs.x86_64
nfs-ganesha-debuginfo-2.5.5-5.el7rhgs.x86_64

How reproducible:
3/3

Steps to Reproduce:
1. Create a 4-node ganesha cluster.
2. Create and start ~100 volumes.
3. Start exporting all ~100 volumes in a loop.
4. While the exports are in progress, reboot one node.

Actual results:
Some of the volume export entries fail to appear in "showmount -e" on the rebooted node.

Expected results:
All volumes exported via ganesha should appear on the rebooted node.
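The export loop in step 3 can be sketched as below. This is an illustrative sketch only: it assumes the volumes are named vol1..volN and that "gluster volume set <vol> ganesha.enable on" is the export command in use. The commands are echoed rather than executed so the list can be reviewed before running it for real.

```shell
# Hypothetical helper: print the ganesha export command for each volume.
# Assumes volumes vol1..volN already exist and are started.
gen_export_cmds() {
  n="$1"
  i=1
  while [ "$i" -le "$n" ]; do
    echo "gluster volume set vol$i ganesha.enable on"
    i=$((i + 1))
  done
}

# Review the generated commands; pipe to "sh" (or run each line) to export.
gen_export_cmds 100
```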
Additional info:

-- Rebooted node snippet, where volume entries from vol58 to vol61 are missing --
----------------------------------
/vol37 (everyone)
/vol38 (everyone)
/vol39 (everyone)
/vol40 (everyone)
/vol41 (everyone)
/vol42 (everyone)
/vol43 (everyone)
/vol44 (everyone)
/vol45 (everyone)
/vol46 (everyone)
/vol47 (everyone)
/vol48 (everyone)
/vol49 (everyone)
/vol50 (everyone)
/vol51 (everyone)
/vol52 (everyone)
/vol53 (everyone)
/vol54 (everyone)
/vol55 (everyone)
/vol56 (everyone)
/vol57 (everyone)
/vol62 (everyone)
/vol63 (everyone)
/vol64 (everyone)
-------------------------
One of the other 3 nodes:
-------------
/vol38 (everyone)
/vol39 (everyone)
/vol40 (everyone)
/vol41 (everyone)
/vol42 (everyone)
/vol43 (everyone)
/vol44 (everyone)
/vol45 (everyone)
/vol46 (everyone)
/vol47 (everyone)
/vol48 (everyone)
/vol49 (everyone)
/vol50 (everyone)
/vol51 (everyone)
/vol52 (everyone)
/vol53 (everyone)
/vol54 (everyone)
/vol55 (everyone)
/vol56 (everyone)
/vol57 (everyone)
/vol58 (everyone)
/vol59 (everyone)
/vol60 (everyone)
/vol61 (everyone)
/vol62 (everyone)
---------------
Rebooted node ganesha.log snippet
---------------------------------
20/04/2018 20:47:26 : epoch c6970000 : dhcp37-218.lab.eng.blr.redhat.com : ganesha.nfsd-4152[main] glusterfs_create_export :FSAL :EVENT :Volume vol56 exported at : '/'
20/04/2018 20:47:29 : epoch c6970000 : dhcp37-218.lab.eng.blr.redhat.com : ganesha.nfsd-4152[main] glusterfs_create_export :FSAL :EVENT :Volume vol57 exported at : '/'
20/04/2018 20:47:31 : epoch c6970000 : dhcp37-218.lab.eng.blr.redhat.com : ganesha.nfsd-4152[main] lower_my_caps :NFS STARTUP :EVENT :CAP_SYS_RESOURCE was successfully removed for proper quota management in FSAL
20/04/2018 20:47:31 : epoch c6970000 : dhcp37-218.lab.eng.blr.redhat.com : ganesha.nfsd-4152[main] lower_my_caps :NFS STARTUP :EVENT :currenty set capabilities are: =
cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap+ep
20/04/2018 20:47:32 : epoch c6970000 : dhcp37-218.lab.eng.blr.redhat.com : ganesha.nfsd-4152[main] nfs_Init_svc :DISP :CRIT :Cannot acquire credentials for principal nfs
20/04/2018 20:47:32 : epoch c6970000 : dhcp37-218.lab.eng.blr.redhat.com : ganesha.nfsd-4152[main] nfs_Init_admin_thread :NFS CB :EVENT :Admin thread initialized
20/04/2018 20:47:32 : epoch c6970000 : dhcp37-218.lab.eng.blr.redhat.com : ganesha.nfsd-4152[main] nfs4_start_grace :STATE :EVENT :NFS Server Now IN GRACE, duration 90
20/04/2018 20:47:32 : epoch c6970000 : dhcp37-218.lab.eng.blr.redhat.com : ganesha.nfsd-4152[main] nfs_rpc_cb_init_ccache :NFS STARTUP :EVENT :Callback creds directory (/var/run/ganesha) already exists
20/04/2018 20:47:32 : epoch c6970000 : dhcp37-218.lab.eng.blr.redhat.com : ganesha.nfsd-4152[main] nfs_rpc_cb_init_ccache :NFS STARTUP :WARN :gssd_refresh_krb5_machine_credential failed (-1765328160:0)
20/04/2018 20:47:32 : epoch c6970000 : dhcp37-218.lab.eng.blr.redhat.com : ganesha.nfsd-4152[main] nfs_Start_threads :THREAD :EVENT :Starting delayed executor.
20/04/2018 20:47:33 : epoch c6970000 : dhcp37-218.lab.eng.blr.redhat.com : ganesha.nfsd-4152[main] nfs_Start_threads :THREAD :EVENT :gsh_dbusthread was started successfully
20/04/2018 20:47:33 : epoch c6970000 : dhcp37-218.lab.eng.blr.redhat.com : ganesha.nfsd-4152[main] nfs_Start_threads :THREAD :EVENT :admin thread was started successfully
20/04/2018 20:47:33 : epoch c6970000 : dhcp37-218.lab.eng.blr.redhat.com : ganesha.nfsd-4152[main] nfs_Start_threads :THREAD :EVENT :reaper thread was started successfully
20/04/2018 20:47:33 : epoch c6970000 : dhcp37-218.lab.eng.blr.redhat.com : ganesha.nfsd-4152[main] nfs_Start_threads :THREAD :EVENT :General fridge was started successfully
20/04/2018 20:47:33 : epoch c6970000 : dhcp37-218.lab.eng.blr.redhat.com : ganesha.nfsd-4152[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
20/04/2018 20:47:33 : epoch c6970000 : dhcp37-218.lab.eng.blr.redhat.com : ganesha.nfsd-4152[main] nfs_start :NFS STARTUP :EVENT :             NFS SERVER INITIALIZED
20/04/2018 20:47:33 : epoch c6970000 : dhcp37-218.lab.eng.blr.redhat.com : ganesha.nfsd-4152[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
20/04/2018 20:47:33 : epoch c6970000 : dhcp37-218.lab.eng.blr.redhat.com : ganesha.nfsd-4152[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE
20/04/2018 20:47:49 : epoch c6970000 : dhcp37-218.lab.eng.blr.redhat.com : ganesha.nfsd-4152[dbus_heartbeat] glusterfs_create_export :FSAL :EVENT :Volume vol62 exported at : '/'
-------------------------------
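The mismatch shown above can be checked mechanically by diffing "showmount -e" output captured from two nodes. The helper below is a hypothetical diagnostic, not part of any product tooling; it assumes each input file holds one node's export list, one entry per line.

```shell
# Hypothetical helper: print export entries present on a reference node but
# missing on the rebooted node. Arguments are files containing the captured
# "showmount -e" export lines from each node (rebooted first, reference second).
missing_exports() {
  sort "$1" > /tmp/missing_exports_a.$$
  sort "$2" > /tmp/missing_exports_b.$$
  # comm -13: suppress lines unique to file 1 and lines common to both,
  # leaving only lines unique to file 2 (the reference node).
  comm -13 /tmp/missing_exports_a.$$ /tmp/missing_exports_b.$$
  rm -f /tmp/missing_exports_a.$$ /tmp/missing_exports_b.$$
}
```

For the listings above, this would report /vol58 through /vol61 as missing on the rebooted node.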
These exports are being added by dbus, correct? I think there's a window between when exports are read on startup and when the DBUS thread is started, during which exports could be lost. I'm tempted to say "don't do that". I'm not sure we can close this hole entirely. Frank, what are your thoughts?
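The suspected window could in principle be narrowed by queueing DBus export requests that arrive before startup export processing completes, and replaying them afterwards. The sketch below is illustrative only; none of these names (pending list, startup_done flag, do_export) are actual nfs-ganesha code, and this is one possible mitigation, not a claim about the real fix.

```c
/* Sketch: defer DBus-originated export requests until startup export
 * processing is done, then replay them. All identifiers are hypothetical. */
#include <pthread.h>
#include <stdbool.h>
#include <string.h>

#define MAX_PENDING 128

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static bool startup_done = false;          /* startup exports finished?     */
static char pending[MAX_PENDING][32];      /* queued volume names           */
static int npending = 0;
static int nexported = 0;                  /* stand-in for real export work */

/* Stand-in for the real export path; here it just counts exports. */
static void do_export(const char *vol)
{
    (void)vol;
    nexported++;
}

/* DBus-side entry point: queue the request if startup exports are still
 * being processed, otherwise export immediately. */
static void dbus_add_export(const char *vol)
{
    pthread_mutex_lock(&lock);
    if (!startup_done && npending < MAX_PENDING)
        strncpy(pending[npending++], vol, sizeof(pending[0]) - 1);
    else
        do_export(vol);
    pthread_mutex_unlock(&lock);
}

/* Called once startup export processing completes: replay the queue. */
static void finish_startup(void)
{
    pthread_mutex_lock(&lock);
    for (int i = 0; i < npending; i++)
        do_export(pending[i]);
    npending = 0;
    startup_done = true;
    pthread_mutex_unlock(&lock);
}
```

With this shape, a request like the vol58-vol61 adds would wait in the queue rather than being dropped, at the cost of delaying them until startup finishes.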