Bug 1257548

Summary: nfs-ganesha service monitor period interval should be at least twice the gluster ping timeout
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: Soumya Koduri <skoduri>
Component: nfs-ganesha Assignee: Soumya Koduri <skoduri>
Status: CLOSED WONTFIX QA Contact: storage-qa-internal <storage-qa-internal>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: rhgs-3.1 CC: amukherj, asriram, jthottan, kkeithle, ndevos, nlevinki, pasik, rhs-bugs, sankarshan, skoduri
Target Milestone: --- Keywords: FutureFeature, RFE, Triaged, ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Known Issue
Doc Text:
The nfs-ganesha service monitor script, which triggers IP failover, runs periodically every 10 seconds. The ping-timeout of the GlusterFS server (after which the locks of an unreachable client get flushed) is 42 seconds by default. After an IP failover, some locks may not get cleaned up by the GlusterFS server process, hence reclaiming the lock state by NFS clients may fail.

Workaround (if any): It is recommended to set the nfs-ganesha service monitor interval (default 10 seconds) to at least twice the Gluster server ping-timeout (default 42 seconds). Hence, either decrease the network ping-timeout using the following command:

  # gluster volume set <volname> network.ping-timeout <ping_timeout_value>

or increase the nfs-mon monitor interval using the following commands:

  # pcs resource op remove nfs-mon monitor
  # pcs resource op add nfs-mon monitor interval=<interval_period_value> timeout=<timeout_value>
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-05-20 12:40:29 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1255689    

Description Soumya Koduri 2015-08-27 10:12:55 UTC
Description of problem:

In case of failover/failback, there could be situations wherein the VIP fails over within the glusterfs ping timeout. Since the gluster server may not have cleaned up the earlier lock state, the reclaim of the locks by the clients (if any) would fail.

To fix the same, we may need to increase the nfs-ganesha service monitor interval to at least twice the glusterfs ping timeout value, as sketched below.
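For example, with the default ping timeout of 42 seconds, the monitor interval would need to be at least 2 x 42 = 84 seconds. Either side can be tuned with the workaround commands from the doc text; the concrete values below are illustrative only, not tested recommendations:

  (lower the gluster ping timeout, e.g. to 20 seconds)
  # gluster volume set <volname> network.ping-timeout 20

  (or raise the nfs-mon monitor interval, e.g. to 90 seconds)
  # pcs resource op remove nfs-mon monitor
  # pcs resource op add nfs-mon monitor interval=90s timeout=20s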

Comment 2 Niels de Vos 2015-09-15 13:25:18 UTC
Modified the DocText a little bit. Soumya, was there not something with a grace time in the brick processes? I thought the IP-failover needed to be

  2 x network.ping-timeout
  1 x grace timeout for releasing the locks
  ------------------------------------------
  (total)
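With the defaults, and assuming the commonly used NFS grace period of 90 seconds (the actual configured value may differ), that would work out to roughly:

  2 x 42s (network.ping-timeout)  =  84s
  1 x 90s (grace timeout)         =  90s
  ----------------------------------------
  (total)                          174s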

Comment 3 Soumya Koduri 2015-09-15 13:54:57 UTC
As soon as the network ping times out, the GlusterFS server process starts flushing the locks of all the fds opened by that client. Assuming that the locks get flushed within another network.ping-timeout interval, we are recommending to have the monitor script pitch in after 2*network.ping-timeout. Is it safe to have that assumption?
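To spell out that timing assumption with the default values:

  t = 0s     the client (the failed nfs-ganesha node) becomes unreachable
  t = 42s    network.ping-timeout expires; the brick starts flushing that client's locks
  t <= 84s   locks assumed flushed (within one more ping-timeout)

i.e. the monitor should not trigger failover before 2 x network.ping-timeout has elapsed.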

Comment 4 Soumya Koduri 2015-09-15 13:57:13 UTC
Btw, the grace period of the NFS server gets included in the total failover time seen by the NFS clients before I/O resumes (which has to be documented as part of BZ#1257545).

Comment 5 Anjana Suparna Sriram 2015-09-18 07:45:08 UTC
Hi Soumya,

Please review the edited doc text and sign off to be included in the Known Issues chapter.

Regards,
Anjana

Comment 6 Soumya Koduri 2015-09-18 08:48:21 UTC
A small correction needed in the below statement 

' After an IP failover, some locks may get cleaned by the GlusterFS server process, hence reclaiming the lock state by NFS clients fails'

to

'After an IP failover, some locks may not get cleaned by the GlusterFS server process, hence reclaiming the lock state by NFS clients may fail'

Comment 7 Anjana Suparna Sriram 2015-09-18 09:59:22 UTC
Updated the doc text as per Comment 6 (https://bugzilla.redhat.com/show_bug.cgi?id=1257548#c6)

Comment 8 Soumya Koduri 2016-01-28 11:08:20 UTC
Will address this as part of the multi-protocol effort.

Comment 9 Soumya Koduri 2017-05-03 12:13:56 UTC
This should be addressed as part of Lock reclaim support in GlusterFS - https://review.gluster.org/#/c/14986/

Comment 13 Soumya Koduri 2019-05-06 12:42:57 UTC
Increasing the monitor interval (as originally described in the bug) will increase the failover/failback time. It's tricky to decide how and to what values these intervals should be configured, and that may vary from case to case. So I suggest we note down such recommendations in the admin guide (nfs-ganesha troubleshooting section). Hence converting this BZ to the doc component.


See also: bug 1608899
Doc link: https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.4/html-single/administration_guide/#nfs_ganesha