Bug 1290865
| Field | Value |
|---|---|
| Summary | nfs-ganesha server do not enter grace period during failover/failback |
| Product | [Community] GlusterFS |
| Component | common-ha |
| Reporter | Kaleb KEITHLEY <kkeithle> |
| Assignee | Kaleb KEITHLEY <kkeithle> |
| Status | CLOSED CURRENTRELEASE |
| Severity | medium |
| Priority | unspecified |
| Version | 3.7.6 |
| CC | akhakhar, jthottan, kkeithle, mzywusko, ndevos, nlevinki, rnalakka, skoduri, storage-qa-internal |
| Target Milestone | --- |
| Target Release | --- |
| Keywords | Reopened, Triaged |
| Hardware | All |
| OS | All |
| Fixed In Version | glusterfs-3.7.11 |
| Doc Type | Bug Fix |
| Story Points | --- |
| Clone Of | 1278332 |
| | 1317424 |
| Last Closed | 2016-05-31 13:12:20 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| Category | --- |
| oVirt Team | --- |
| Cloudforms Team | --- |
| Bug Blocks | 1278332, 1317424 |
Comment 1
Vijay Bellur
2015-12-14 15:25:53 UTC
REVIEW: http://review.gluster.org/12964 (common-ha: reliable grace using pacemaker notify actions) posted (#2) for review on release-3.7 by Kaleb KEITHLEY (kkeithle)

REVIEW: http://review.gluster.org/12964 (common-ha: reliable grace using pacemaker notify actions) posted (#3) for review on release-3.7 by Kaleb KEITHLEY (kkeithle)

REVIEW: http://review.gluster.org/12964 (common-ha: reliable grace using pacemaker notify actions) posted (#4) for review on release-3.7 by Kaleb KEITHLEY (kkeithle)

REVIEW: http://review.gluster.org/12964 (common-ha: reliable grace using pacemaker notify actions) posted (#5) for review on release-3.7 by Kaleb KEITHLEY (kkeithle)

COMMIT: http://review.gluster.org/12964 committed in release-3.7 by Kaleb KEITHLEY (kkeithle)
------
commit e8121c4afb3680f532b450872b5a3ffcb3766a97
Author: Kaleb S KEITHLEY <kkeithle>
Date: Mon Dec 14 09:24:57 2015 -0500

common-ha: reliable grace using pacemaker notify actions

Using *-dead_ip-1 resources to track on which nodes the ganesha.nfsd had died was found to be unreliable.

Running `pcs status` in the ganesha_grace monitor action was seen to time out during failover; the HA devs opined that it was, generally, not a good idea to run `pcs status` in a monitor action in any event. They suggested using the notify feature, where the resources on all the nodes are notified when a clone resource agent dies.

This change adds a notify action to the ganesha_grace RA.

The ganesha_mon RA monitors its ganesha.nfsd daemon. While the daemon is running, it creates two attributes: ganesha-active and grace-active. When the daemon stops for any reason, the attributes are deleted. Deleting the ganesha-active attribute triggers the failover of the virtual IP (the IPaddr RA) to another node where ganesha.nfsd is still running.

The ganesha_grace RA monitors the grace-active attribute. When the grace-active attribute is deleted, the ganesha_grace RA stops, and will not restart.
This triggers pacemaker to trigger the notify action in the ganesha_grace RAs on the other nodes in the cluster, which send a DBUS message to their ganesha.nfsd.

(N.B. grace-active is a bit of a misnomer. While the grace-active attribute exists, everything is normal and healthy. Deleting the attribute triggers putting the surviving ganesha.nfsds into GRACE.)

To ensure that the remaining/surviving ganesha.nfsds are put into NFS-GRACE before the IPaddr (virtual IP) fails over, there is a short delay (sleep) between deleting the grace-active attribute and the ganesha-active attribute. To summarize:

1. on node 2 ganesha_mon:monitor notices that ganesha.nfsd has died
2. on node 2 ganesha_mon:monitor deletes its grace-active attribute
3. on node 2 ganesha_grace:monitor notices that grace-active is gone and returns OCF_ERR_GENERIC, a.k.a. new error. When pacemaker tries to (re)start ganesha_grace, its start action will return OCF_NOT_RUNNING, a.k.a. known error, don't attempt further restarts.
4. on nodes 1, 3, etc., ganesha_grace:notify receives a post-stop notification indicating that node 2 is gone, and sends a DBUS message to its ganesha.nfsd putting it into NFS-GRACE.
5. on node 2 ganesha_mon:monitor waits a short period, then deletes its ganesha-active attribute. This triggers the IPaddr (virt IP) failover according to constraint location rules.

ganesha_nfsd modified to run for the duration: the start action is invoked to set up the /var/lib/nfs symlink, and the stop action is invoked to restore it. ganesha-ha.sh modified accordingly to create it as a clone resource.
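The ordering in the five-step summary above can be illustrated with a small simulation. This is only a sketch, not the resource agent code: the `Node` class, the `fail_node` function, and the event strings are hypothetical names invented for illustration, and the real delay (a sleep in ganesha_mon) is represented purely by the order of events.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical model of one cluster node.

    Both attributes exist while ganesha.nfsd is healthy on the node.
    """
    name: str
    attrs: set = field(default_factory=lambda: {"ganesha-active", "grace-active"})
    in_grace: bool = False


def fail_node(cluster, failed_name):
    """Simulate ganesha.nfsd dying on one node; return the ordered event log."""
    events = []
    failed = cluster[failed_name]

    # Steps 1-2: ganesha_mon:monitor sees the daemon is gone and
    # deletes grace-active first.
    failed.attrs.discard("grace-active")
    events.append(f"{failed_name}: delete grace-active")

    # Step 3: ganesha_grace on the failed node stops and is not restarted.
    events.append(f"{failed_name}: ganesha_grace stopped")

    # Step 4: the post-stop notification puts every surviving node into grace.
    for node in cluster.values():
        if node.name != failed_name:
            node.in_grace = True
            events.append(f"{node.name}: enter NFS-GRACE")

    # Step 5: only after the delay is ganesha-active deleted, which lets
    # the virtual IP move to a node that still has the attribute.
    failed.attrs.discard("ganesha-active")
    events.append(f"{failed_name}: delete ganesha-active")
    target = next(
        (n.name for n in cluster.values() if "ganesha-active" in n.attrs), None
    )
    events.append(f"{failed_name}: virtual IP moves to {target}")
    return events


demo = {name: Node(name) for name in ("node1", "node2", "node3")}
for line in fail_node(demo, "node2"):
    print(line)
```

The point of the sketch is the invariant it makes visible: every "enter NFS-GRACE" event precedes the deletion of ganesha-active, so clients reconnecting through the moved virtual IP reach a server that is already in grace.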
Change-Id: Iad60b0c5222bbd55ef95c8b8f955e791caa3ffd0
BUG: 1290865
Signed-off-by: Kaleb S KEITHLEY <kkeithle>
Reviewed-on: http://review.gluster.org/12964
Smoke: Gluster Build System <jenkins.com>
NetBSD-regression: NetBSD Build System <jenkins.org>
CentOS-regression: Gluster Build System <jenkins.com>

REVIEW: http://review.gluster.org/13725 (common-ha: reliable grace using pacemaker notify actions) posted (#1) for review on master by Kaleb KEITHLEY (kkeithle)

COMMIT: http://review.gluster.org/13725 committed in master by Kaleb KEITHLEY (kkeithle)
------
commit 40a24f5ab917863d1549508ae9cf31085955d174
Author: Kaleb S KEITHLEY <kkeithle>
Date: Mon Dec 14 09:24:57 2015 -0500

common-ha: reliable grace using pacemaker notify actions

Using *-dead_ip-1 resources to track on which nodes the ganesha.nfsd had died was found to be unreliable.

Running `pcs status` in the ganesha_grace monitor action was seen to time out during failover; the HA devs opined that it was, generally, not a good idea to run `pcs status` in a monitor action in any event. They suggested using the notify feature, where the resources on all the nodes are notified when a clone resource agent dies.

This change adds a notify action to the ganesha_grace RA.

The ganesha_mon RA monitors its ganesha.nfsd daemon. While the daemon is running, it creates two attributes: ganesha-active and grace-active. When the daemon stops for any reason, the attributes are deleted. Deleting the ganesha-active attribute triggers the failover of the virtual IP (the IPaddr RA) to another node where ganesha.nfsd is still running.

The ganesha_grace RA monitors the grace-active attribute. When the grace-active attribute is deleted, the ganesha_grace RA stops, and will not restart. This triggers pacemaker to trigger the notify action in the ganesha_grace RAs on the other nodes in the cluster, which send a DBUS message to their ganesha.nfsd.

(N.B. grace-active is a bit of a misnomer.
While the grace-active attribute exists, everything is normal and healthy. Deleting the attribute triggers putting the surviving ganesha.nfsds into GRACE.)

To ensure that the remaining/surviving ganesha.nfsds are put into NFS-GRACE before the IPaddr (virtual IP) fails over, there is a short delay (sleep) between deleting the grace-active attribute and the ganesha-active attribute. To summarize:

1. on node 2 ganesha_mon:monitor notices that ganesha.nfsd has died
2. on node 2 ganesha_mon:monitor deletes its grace-active attribute
3. on node 2 ganesha_grace:monitor notices that grace-active is gone and returns OCF_ERR_GENERIC, a.k.a. new error. When pacemaker tries to (re)start ganesha_grace, its start action will return OCF_NOT_RUNNING, a.k.a. known error, don't attempt further restarts.
4. on nodes 1, 3, etc., ganesha_grace:notify receives a post-stop notification indicating that node 2 is gone, and sends a DBUS message to its ganesha.nfsd putting it into NFS-GRACE.
5. on node 2 ganesha_mon:monitor waits a short period, then deletes its ganesha-active attribute. This triggers the IPaddr (virt IP) failover according to constraint location rules.

ganesha_nfsd modified to run for the duration: the start action is invoked to set up the /var/lib/nfs symlink, and the stop action is invoked to restore it. ganesha-ha.sh modified accordingly to create it as a clone resource.

BUG: 1290865
Change-Id: I1ba24f38fa4338b3aeb17c65645e9f439387ff57
Signed-off-by: Kaleb S KEITHLEY <kkeithle>
Reviewed-on: http://review.gluster.org/12964
Smoke: Gluster Build System <jenkins.com>
NetBSD-regression: NetBSD Build System <jenkins.org>
CentOS-regression: Gluster Build System <jenkins.com>
Reviewed-on: http://review.gluster.org/13725

This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.9, please open a new bug report.
glusterfs-3.7.9 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://www.gluster.org/pipermail/gluster-users/2016-March/025922.html
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user
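For readers following steps 3 and 4 of the commit message, here is a minimal sketch of the decision logic in Python. The actual resource agents are not reproduced here, so the function names (`grace_monitor`, `grace_start`, `nodes_to_grace_for`) are hypothetical, and returning success from start while the attribute is present is an assumption. The numeric return codes are the standard OCF values, and the `OCF_RESKEY_CRM_meta_notify_*` variable names are the ones Pacemaker is documented to pass to notify actions.

```python
# Standard OCF resource agent return codes.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7


def grace_monitor(node_attrs):
    """Step 3, monitor: report a new error once grace-active is gone."""
    return OCF_SUCCESS if "grace-active" in node_attrs else OCF_ERR_GENERIC


def grace_start(node_attrs):
    """Step 3, start: without grace-active, report 'not running'.

    Returning success when the attribute exists is an assumption of this sketch.
    """
    return OCF_SUCCESS if "grace-active" in node_attrs else OCF_NOT_RUNNING


def nodes_to_grace_for(env):
    """Step 4, notify: return the node names from a post-stop notification.

    An empty list means this notification should not trigger NFS-GRACE.
    """
    is_post_stop = (
        env.get("OCF_RESKEY_CRM_meta_notify_type") == "post"
        and env.get("OCF_RESKEY_CRM_meta_notify_operation") == "stop"
    )
    if not is_post_stop:
        return []
    # Pacemaker passes the affected node names as a space-separated list.
    return env.get("OCF_RESKEY_CRM_meta_notify_stop_uname", "").split()
```

In a real agent, a surviving node would send the DBUS grace message to its local ganesha.nfsd for each name returned by the notify check; that call is omitted here because its exact form is not given in this report.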