Bug 1417042

Summary: glusterd restart is starting the offline shd daemon on other node in the cluster
Product: [Community] GlusterFS Reporter: Atin Mukherjee <amukherj>
Component: glusterd Assignee: Atin Mukherjee <amukherj>
Status: CLOSED CURRENTRELEASE QA Contact:
Severity: high Docs Contact:
Priority: unspecified    
Version: 3.10 CC: bsrirama, bugs, rhs-bugs, sasundar, storage-qa-internal, vbellur
Target Milestone: --- Keywords: Triaged
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: glusterfs-3.10.0 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1383893 Environment:
Last Closed: 2017-03-06 17:44:28 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1383893    
Bug Blocks: 1381825    

Description Atin Mukherjee 2017-01-27 05:05:16 UTC
+++ This bug was initially created as a clone of Bug #1383893 +++

+++ This bug was initially created as a clone of Bug #1381825 +++

Description of problem:
=======================

Restarting glusterd on one cluster node restarts the offline self-heal daemon (shd) on another cluster node.


Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.8.4-2


How reproducible:
=================
Always

Steps to Reproduce:
===================
1. Set up a 3-node cluster.
2. Create a 1*3 volume using bricks from the cluster nodes, and start it.
3. Kill the shd daemon with kill -15 on one of the cluster nodes.
4. Restart glusterd on another cluster node, one where step 3 was not performed.
5. Check the volume status on any cluster node; shd is running again on the node where it was killed in step 3.

Actual results:
===============
A glusterd restart on one node starts the offline shd daemon on another node in the cluster.

Expected results:
=================
A glusterd restart on one node should not start the offline shd daemon on another node in the cluster.




Additional info:

--- Additional comment from Red Hat Bugzilla Rules Engine on 2016-10-05 02:54:14 EDT ---

This bug is automatically being proposed for the current release of Red Hat Gluster Storage 3 under active development, by setting the release flag 'rhgs-3.2.0' to '?'.

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from Atin Mukherjee on 2016-10-12 01:10:22 EDT ---

RCA:

This is not a regression; the behaviour has been present since server-side quorum was introduced. Unlike brick processes, daemon services are (re)started irrespective of the quorum state. In this particular case, the glusterd instance on N1 was brought down and the shd service on N2 was explicitly killed. When glusterd is restarted on N1, N2 receives a friend update request, which calls glusterd_restart_bricks () and eventually ends up spawning the shd daemon. If the same reproducer is applied to one of the brick processes, the brick does not come up, because the brick logic starts brick processes only if quorum is regained and skips them otherwise. To fix this behaviour, the other daemons should follow the same logic as bricks.
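
The difference described in the RCA above can be illustrated with a small toy model. This is only a sketch and not glusterd code: the `Node` class, the `restart_services` function, and the `quorum_ok` and `gate_daemons` parameters are hypothetical names introduced here, and the real quorum condition evaluated inside glusterd_restart_bricks () is not reproduced.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy stand-in for one peer; not glusterd's real data structures."""
    running: set = field(default_factory=set)  # names of live processes


def restart_services(node, quorum_ok, gate_daemons):
    """Hypothetical model of the restart path described in the RCA.

    quorum_ok    -- outcome of whatever quorum check the brick path applies
    gate_daemons -- False models the reported behaviour, True the proposed fix
    """
    started = []
    # Bricks: started only when the quorum check passes, otherwise skipped.
    if quorum_ok and "brick" not in node.running:
        node.running.add("brick")
        started.append("brick")
    # Daemons such as shd: unconditional when ungated, quorum-gated otherwise.
    if (quorum_ok or not gate_daemons) and "shd" not in node.running:
        node.running.add("shd")
        started.append("shd")
    return started
```

With `gate_daemons=False` and a failing quorum check, the model starts shd but not the brick, which mirrors the asymmetry the RCA points out. With `gate_daemons=True`, shd is skipped under the same condition as the brick.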

--- Additional comment from Worker Ant on 2016-10-12 03:25:42 EDT ---

REVIEW: http://review.gluster.org/15626 (glusterd: daemon restart logic should adhere server side quorum) posted (#1) for review on master by Atin Mukherjee (amukherj)

--- Additional comment from Worker Ant on 2016-10-13 01:55:51 EDT ---

REVIEW: http://review.gluster.org/15626 (glusterd: daemon restart logic should adhere server side quorum) posted (#2) for review on master by Atin Mukherjee (amukherj)

--- Additional comment from Worker Ant on 2017-01-27 00:04:33 EST ---

COMMIT: https://review.gluster.org/15626 committed in master by Atin Mukherjee (amukherj) 
------
commit a9f660bc9d2d7c87b3306a35a2088532de000015
Author: Atin Mukherjee <amukherj>
Date:   Wed Oct 5 14:59:51 2016 +0530

    glusterd: daemon restart logic should adhere server side quorum
    
    Just like brick processes, other daemon services should also follow the same
    logic of quorum checks to see if a particular service needs to come up if
    glusterd is restarted or the incoming friend add/update request is received
    (in glusterd_restart_bricks () function)
    
    Change-Id: I54a1fbdaa1571cc45eed627181b81463fead47a3
    BUG: 1383893
    Signed-off-by: Atin Mukherjee <amukherj>
    Reviewed-on: https://review.gluster.org/15626
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Prashanth Pai <ppai>

Comment 1 Worker Ant 2017-01-27 05:06:01 UTC
REVIEW: https://review.gluster.org/16472 (glusterd: daemon restart logic should adhere server side quorum) posted (#1) for review on release-3.10 by Atin Mukherjee (amukherj)

Comment 2 Worker Ant 2017-01-30 14:13:56 UTC
COMMIT: https://review.gluster.org/16472 committed in release-3.10 by Shyamsundar Ranganathan (srangana) 
------
commit 59aba1e739726b1a5e7d771b73c2c88d45113c88
Author: Atin Mukherjee <amukherj>
Date:   Wed Oct 5 14:59:51 2016 +0530

    glusterd: daemon restart logic should adhere server side quorum
    
    Just like brick processes, other daemon services should also follow the same
    logic of quorum checks to see if a particular service needs to come up if
    glusterd is restarted or the incoming friend add/update request is received
    (in glusterd_restart_bricks () function)
    
    >Reviewed-on: https://review.gluster.org/15626
    >NetBSD-regression: NetBSD Build System <jenkins.org>
    >CentOS-regression: Gluster Build System <jenkins.org>
    >Smoke: Gluster Build System <jenkins.org>
    >Reviewed-by: Prashanth Pai <ppai>
    
    Change-Id: I54a1fbdaa1571cc45eed627181b81463fead47a3
    BUG: 1417042
    Signed-off-by: Atin Mukherjee <amukherj>
    Reviewed-on: https://review.gluster.org/16472
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Shyamsundar Ranganathan <srangana>
    Reviewed-by: Samikshan Bairagya <samikshan>
    Reviewed-by: Prashanth Pai <ppai>

Comment 3 Shyamsundar 2017-03-06 17:44:28 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.10.0, please open a new bug report.

glusterfs-3.10.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2017-February/030119.html
[2] https://www.gluster.org/pipermail/gluster-users/