Description of problem:

While working on one of the issues raised by GSS, I found that nfs-ganesha servers do not always enter the grace period during failover & failback. Below are my observations while debugging the issue:

* If a system is rebooted, nfs-mon on that system should have created dead_ip-1, but that doesn't happen.

* ganesha_grace compares the pcs status collected during its monitor() and start():

      pcs status | grep dead_ip-1 | sort > /tmp/.pcs_status
      logger "ganesha_grace_start(), comparing"
      result=$(diff /var/run/ganesha/pcs_status1 /tmp/.pcs_status | grep '^>')
      if [[ ${result} ]]; then

  It so happens that sometimes, even though we have a dead_ip (when the nfs service goes down), monitor() could have kicked in first and copied the status to /var/run/ganesha/pcs_status1, and start() then captures the same status in /tmp/.pcs_status, so the diff is empty.

* During fail-back too, the race between monitor() and start() results in the nfs-server not being in grace.

Version-Release number of selected component (if applicable):
RHGS 3.1

How reproducible:
Almost consistent, especially in reboot & VIP failback cases.
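The race above can be reproduced in isolation: when monitor() records the pcs status (dead_ip-1 already present) before start() runs, both snapshots are identical, so the '>' diff that start() relies on is empty and grace is skipped. A minimal sketch, with the file paths and simulated pcs output being illustrative stand-ins for the real /var/run/ganesha/pcs_status1, /tmp/.pcs_status and cluster state:

```shell
#!/bin/sh
# Simulate the race between ganesha_grace monitor() and start().
workdir=$(mktemp -d)
status_snapshot="$workdir/pcs_status1"   # stand-in for monitor()'s copy
status_current="$workdir/.pcs_status"    # stand-in for start()'s copy

# The node is already down: dead_ip-1 is present in pcs status.
fake_pcs_status() {
    echo " dead_ip-1 (ocf::heartbeat:Dummy): Started node2"
}

# monitor() kicks in first and records the status, dead_ip-1 included.
fake_pcs_status | sort > "$status_snapshot"

# start() then records the very same status...
fake_pcs_status | sort > "$status_current"

# ...so the diff yields no new ('>') lines and grace is never requested.
result=$(diff "$status_snapshot" "$status_current" | grep '^>')
if [ -n "$result" ]; then
    echo "grace: new dead_ip detected, entering grace"
else
    echo "grace: no change detected, grace skipped"
fi
rm -rf "$workdir"
```

Running the sketch prints "grace: no change detected, grace skipped", which is exactly the failure mode observed: the decision depends on the ordering of monitor() and start(), not on whether a dead_ip actually exists.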
The fix is posted upstream for review - http://review.gluster.org/13275
Correcting a backwards dependency chain.
(In reply to Soumya Koduri from comment #3)
> The fix is posted upstream for review -
> http://review.gluster.org/13275

Sorry, I had given the wrong link. The fix posted upstream is http://review.gluster.org/12964
This bug was accidentally moved from POST to MODIFIED via an error in automation, please see mmccune with any questions
FWIW I've tested RHGS 3.1.2 with Kaleb's patches, and it corrects the VIP failover problem for me. In my VM lab, I can pause one of the nodes in a 2-node ganesha-ha configuration, and the VIP will quickly fail over to the other node. Prior to this patch, the VIP would not fail over; in fact, the VIP on the remaining 'up' node would quickly disappear, causing a complete failure of the HA system.
Verified this bug with the latest 3.1.3 build; the original issue, where nfs-ganesha was not entering the grace period during failover/failback, can not be reproduced. However, there are other grace-related bugs observed during verification, listed below, which can be tracked separately:

>>>> Bug 1329887 - Unexpected behavior observed when nfs-ganesha enters grace period. (https://bugzilla.redhat.com/show_bug.cgi?id=1329887)
Description: During failover/failback, nfs-ganesha enters the grace period only for 60 seconds, and I/Os get stopped for somewhere around 70-75 seconds.

>>>> Bug 1330218 - Shutting down I/O serving node, takes 15-20 mins for IO to resume from failed over node. (https://bugzilla.redhat.com/show_bug.cgi?id=1330218)
Description: Shutting down the I/O serving node takes 15-20 mins for I/O to resume from the failed-over node.

Since the originally reported issue is no longer reproducible and seems to be working fine with the latest ganesha builds, marking this bug as Verified.
Providing PM approval for the accelerated fix.
Do we have a build with required fixes (both)? Kindly post the brew link on the bug so that we can pick it up for verification.
We are seeing many regressions related to failover/failback with the 3.1.3 build, and there are a couple of open bugs for 3.1.3 as of now. Verifying this bug for the hotfix would require taking care of those other existing/open bugs as well, which doesn't look like a good idea at this point. So, if everyone agrees, can we drop this bug (and its related patches) from the hotfix build and provide a new build which contains the fixes for only the 2 bugs below:

https://review.gerrithub.io/#/c/263358/ (BZ#1306691 crash fix)
http://review.gluster.org/13459 (BZ#1301542)
doc_text looks good to me.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1240