Bug 1461019

Summary: [Ganesha] : Grace period is not being adhered to on RHEL 7.4; Clients continue running IO even during grace.
Product: [Community] GlusterFS Reporter: Kaleb KEITHLEY <kkeithle>
Component: common-ha Assignee: Kaleb KEITHLEY <kkeithle>
Status: CLOSED EOL QA Contact:
Severity: high Docs Contact:
Priority: unspecified    
Version: 3.10 CC: amukherj, bturner, bugs, dang, jthottan, kgaillot, kkeithle, mbenjamin, msaini, nchilaka, rhinduja, rhs-bugs, skoduri, storage-qa-internal
Target Milestone: --- Keywords: Triaged
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1457179 Environment:
Last Closed: 2018-06-20 18:28:36 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1457179    
Bug Blocks: 1457558    

Comment 1 Kaleb KEITHLEY 2017-06-13 11:32:51 UTC
Description of problem:
-----------------------

4-node cluster, 4 clients accessing the export via NFSv4.

Kill NFS-Ganesha on any node.

Grace period should be entered and any and all IO should halt for 90 seconds.

I observed that other clients continued running their IO, which is unusual.

This may be a regression introduced in the latest pacemaker/corosync bits.

============================================================================

Ken Gaillot writes in email:
> clone RA is created with
>
>   pcs resource create nfs-grace ocf:heartbeat:ganesha_grace --clone meta
> notify=true

With the above command, pcs puts the notify=true meta-attribute on the
primitive instead of the clone. Looking at the pcs help, that seems
expected (--clone notify=true would put it on the clone, meta
notify=true puts it on the primitive). If you drop the "meta" above, I
think it will work again.

If that exact command worked on 7.3, pcs behavior might have changed.
Double-check, and if so, I'll ask the pcs devs to look into it.

============================================================================

Indeed, changing the resource create command from

  `pcs resource create nfs-grace ocf:heartbeat:ganesha_grace --clone meta notify=true`

to

  `pcs resource create nfs-grace ocf:heartbeat:ganesha_grace --clone notify=true`

restores the original behavior seen in RHEL 7.3 and earlier.

Please check with the pcs devs, and we will confirm that the changed command also works correctly on RHEL 7.3.
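
To tell which of the two placements a given cluster ended up with, one option is to inspect the resources section of the CIB and see whether the notify nvpair sits under the clone or under the primitive. The following is only a minimal sketch in Python: the two XML fragments and all the ids in them are made up for illustration (they follow the usual clone/primitive/meta_attributes/nvpair layout, not output captured from the cluster in this bug), and notify_locations is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical CIB fragments; every id below is invented for illustration.
# notify=true attached to the primitive (what "--clone meta notify=true" is
# described as producing in the email above).
ON_PRIMITIVE = """
<resources>
  <clone id="nfs-grace-clone">
    <primitive id="nfs-grace" class="ocf" provider="heartbeat" type="ganesha_grace">
      <meta_attributes id="nfs-grace-meta_attributes">
        <nvpair id="nfs-grace-meta_attributes-notify" name="notify" value="true"/>
      </meta_attributes>
    </primitive>
  </clone>
</resources>
"""

# notify=true attached to the clone itself (what "--clone notify=true" is
# described as producing).
ON_CLONE = """
<resources>
  <clone id="nfs-grace-clone">
    <meta_attributes id="nfs-grace-clone-meta_attributes">
      <nvpair id="nfs-grace-clone-meta_attributes-notify" name="notify" value="true"/>
    </meta_attributes>
    <primitive id="nfs-grace" class="ocf" provider="heartbeat" type="ganesha_grace"/>
  </clone>
</resources>
"""


def notify_locations(cib_xml, clone_id="nfs-grace-clone"):
    """Return the element tags ('clone', 'primitive') that directly carry notify=true."""
    root = ET.fromstring(cib_xml)
    clone = root.find(f".//clone[@id='{clone_id}']")
    if clone is None:
        return []
    found = []
    # Look at the clone first, then at each primitive nested inside it.
    # The relative path only matches meta_attributes that are direct children.
    for element in [clone] + clone.findall("primitive"):
        pairs = element.findall("meta_attributes/nvpair[@name='notify']")
        if any(pair.get("value") == "true" for pair in pairs):
            found.append(element.tag)
    return found


if __name__ == "__main__":
    print(notify_locations(ON_PRIMITIVE))  # ['primitive']
    print(notify_locations(ON_CLONE))      # ['clone']
```

On a live cluster the XML would have to come from the CIB itself (for example from the output of `pcs cluster cib`) and the clone id adjusted to match; the sketch only reports where the attribute is stored and says nothing about how pacemaker treats it.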

Comment 2 Worker Ant 2017-06-13 11:52:07 UTC
REVIEW: https://review.gluster.org/17534 (common-ha: surviving ganesha.nfsd not put in grace on fail-over) posted (#1) for review on release-3.10 by Kaleb KEITHLEY (kkeithle)

Comment 3 Worker Ant 2017-06-14 11:46:38 UTC
COMMIT: https://review.gluster.org/17534 committed in release-3.10 by Kaleb KEITHLEY (kkeithle) 
------
commit ee1a7560f8c27cc2721347dae37729aba9bac2d6
Author: Kaleb S. KEITHLEY <kkeithle>
Date:   Tue Jun 13 07:36:50 2017 -0400

    common-ha: surviving ganesha.nfsd not put in grace on fail-over
    
    Behavior change is seen in new HA in RHEL 7.4 Beta. Up to now clone
    RAs have been created with "pcs resource create ... meta notify=true".
    Their notify method is invoked with pre-start or post-stop when one of
    the clone RAs is started or stopped.
    
    In 7.4 Beta we observe that the notify method is not
    invoked when one of the clones is stopped (or started).
    
    Ken Gaillot, one of the pacemaker devs, wrote:
      With the above command, pcs puts the notify=true meta-attribute
      on the primitive instead of the clone. Looking at the pcs help,
      that seems expected (--clone notify=true would put it on the clone,
      meta notify=true puts it on the primitive). If you drop the "meta"
      above, I think it will work again.
    
    And indeed his suggested fix does work on both RHEL 7.4 Beta and RHEL
    7.3 and presumably Fedora.
    
    Change-Id: Idbb539f1366df6d39f77431c357dff4e53a2df6d
    BUG: 1461019
    Signed-off-by: Kaleb S. KEITHLEY <kkeithle>
    Reviewed-on: https://review.gluster.org/17534
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: soumya k <skoduri>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>

Comment 4 Shyamsundar 2018-06-20 18:28:36 UTC
This bug is reported against a version of Gluster that is no longer maintained
(or has been EOL'd). See https://www.gluster.org/release-schedule/ for the
versions currently maintained.

As a result this bug is being closed.

If the bug persists on a maintained version of gluster or against the mainline
gluster repository, request that it be reopened and the Version field be marked
appropriately.