Description of problem:
=======================
A glusterd restart on one cluster node restarts the offline self-heal daemon (shd) on another cluster node.

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.8.4-2

How reproducible:
=================
Always

Steps to Reproduce:
===================
1. Have a 3 node cluster.
2. Create a 1x3 volume using bricks from the cluster nodes and start it.
3. Kill the shd daemon using kill -15 on one of the cluster nodes.
4. Restart glusterd on another cluster node, i.e. not the node used in step 3.
5. Check the volume status from any cluster node: shd is shown as running on the node where it was killed in step 3.

Actual results:
===============
A glusterd restart starts the offline shd daemon on another node in the cluster.

Expected results:
=================
A glusterd restart should not start the offline shd daemon on another node in the cluster.

Additional info:
RCA: This is not a regression and has existed since server-side quorum was introduced. Unlike brick processes, daemon services are (re)started irrespective of the quorum state. In this particular case, when the glusterd instance on N1 was brought down and the shd service on N2 was explicitly killed, restarting the glusterd service on N1 sends N2 a friend update request, which calls glusterd_restart_bricks () and eventually ends up spawning the shd daemon. If the same reproducer is applied to one of the brick processes, the brick does not come up, because for bricks the logic is to start the brick processes only if quorum is regained and to skip them otherwise. To fix this behaviour, the other daemons should follow the same logic as bricks.
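The intended behaviour change can be illustrated with a minimal standalone C sketch. This is not the actual glusterd code; the structure and function names (struct svc, maybe_start_daemon) are hypothetical stand-ins used only to show the brick-style gating that the RCA proposes for daemon services such as shd.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a managed daemon service; not a glusterd struct. */
struct svc {
        const char *name;
        bool        online;  /* is the daemon currently running? */
};

/* Brick-style rule: (re)start the process only when quorum is regained.
 * A friend update that leaves the quorum state unchanged must not
 * resurrect a daemon that was deliberately stopped (e.g. shd killed
 * with SIGTERM in step 3 of the reproducer). */
static void
maybe_start_daemon (struct svc *s, bool quorum_regained)
{
        if (!quorum_regained) {
                printf ("%s: quorum unchanged, not starting\n", s->name);
                return;
        }
        s->online = true;
        printf ("%s: quorum regained, started\n", s->name);
}

int
main (void)
{
        struct svc shd = { "glustershd", false }; /* killed on N2 */

        /* Friend update caused by restarting glusterd on N1: quorum was
         * already met and did not change, so shd stays down. */
        maybe_start_daemon (&shd, false);
        return 0;
}

In the real fix this gating would presumably sit in the code path reached from glusterd_restart_bricks () that currently spawns the daemon services unconditionally; see the upstream patch referenced below for the actual change.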
Upstream mainline patch http://review.gluster.org/15626 has been posted for review.
Byreddy - I'd like to see if there are any other implications of the changes done in http://review.gluster.org/15626 through the upstream review process. IMHO, given that it's not a regression nor a severe bug, this bug can be fixed post 3.2.0 too. Please let us know your thoughts here.
(In reply to Atin Mukherjee from comment #4)
> Byreddy - I'd like to see if there are any other implications of the changes
> done in http://review.gluster.org/15626 through the upstream review process.
> IMHO, given that it's not a regression nor a severe bug, this bug can be fixed
> post 3.2.0 too. Please let us know your thoughts here.

I am OK with taking this in or out of 3.2.0; there is no functionality loss without this fix.
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/101298/
Build - 3.8.4-26

Followed the steps to reproduce provided in the description: killed shd on one node and restarted glusterd on another node. This did not result in the offline shd daemon starting on the node where it had been killed. Hence, moving to Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2774