Description of problem:
=======================
In a 4-node cluster with eventing enabled, if a brick goes down or a volume is stopped, multiple BRICK_CONNECTED and BRICK_DISCONNECTED events are seen for bricks belonging to one of the affected nodes. These events keep getting generated indefinitely, until the brick is brought back up or the volume is deleted.

Firstly, we should not be seeing BRICK_CONNECTED messages at all if the brick is disconnected. Secondly, we get continuous traffic of these events at every heartbeat; I suspect that is by design, but is there a better alternative? Thirdly, I do not understand why I am getting these messages only for the bricks belonging to one of the nodes, and not the others.

Version-Release number of selected component (if applicable):
=============================================================
3.8.4-2

How reproducible:
=================
Always

Steps to Reproduce:
===================
1. Have a 4-node cluster with eventing enabled. Create a disperse volume.
2. Stop one of the bricks with the command 'kill -15 <brick_pid>', or stop the volume using 'volume stop <volname>'.
3. Monitor the events received.

Actual results:
===============
Multiple BRICK_CONNECTED and BRICK_DISCONNECTED events are seen at every heartbeat.

Expected results:
=================
Only BRICK_DISCONNECTED events should be seen. The interval at which the event should get resent is also to be discussed.
Additional info:
================
{u'message': {u'peer': u'10.70.46.240', u'volume': u'disp', u'brick': u'/bricks/brick0/disp'}, u'event': u'BRICK_CONNECTED', u'ts': 1476697365, u'nodeid': u'72c4f894-61f7-433e-a546-4ad2d7f0a176'}
10.70.46.240 - - [12/Oct/2016 11:27:16] "POST /listen HTTP/1.1" 200 -
{u'message': {u'peer': u'10.70.46.240', u'volume': u'disp', u'brick': u'/bricks/brick0/disp'}, u'event': u'BRICK_DISCONNECTED', u'ts': 1476697365, u'nodeid': u'72c4f894-61f7-433e-a546-4ad2d7f0a176'}
10.70.46.240 - - [12/Oct/2016 11:27:16] "POST /listen HTTP/1.1" 200 -
{u'message': {u'peer': u'10.70.46.240', u'volume': u'disp', u'brick': u'/bricks/brick1/disp'}, u'event': u'BRICK_CONNECTED', u'ts': 1476697368, u'nodeid': u'72c4f894-61f7-433e-a546-4ad2d7f0a176'}
10.70.46.240 - - [12/Oct/2016 11:27:19] "POST /listen HTTP/1.1" 200 -
{u'message': {u'peer': u'10.70.46.240', u'volume': u'disp', u'brick': u'/bricks/brick1/disp'}, u'event': u'BRICK_DISCONNECTED', u'ts': 1476697368, u'nodeid': u'72c4f894-61f7-433e-a546-4ad2d7f0a176'}
10.70.46.240 - - [12/Oct/2016 11:27:19] "POST /listen HTTP/1.1" 200 -
{u'message': {u'peer': u'10.70.46.240', u'volume': u'disp', u'brick': u'/bricks/brick0/disp'}, u'event': u'BRICK_DISCONNECTED', u'ts': 1476697368, u'nodeid': u'72c4f894-61f7-433e-a546-4ad2d7f0a176'}
10.70.46.240 - - [12/Oct/2016 11:27:19] "POST /listen HTTP/1.1" 200 -
{u'message': {u'peer': u'10.70.46.240', u'volume': u'disp', u'brick': u'/bricks/brick0/disp'}, u'event': u'BRICK_CONNECTED', u'ts': 1476697368, u'nodeid': u'72c4f894-61f7-433e-a546-4ad2d7f0a176'}
10.70.46.240 - - [12/Oct/2016 11:27:19] "POST /listen HTTP/1.1" 200 -
{u'message': {u'peer': u'10.70.46.240', u'volume': u'disp', u'brick': u'/bricks/brick1/disp'}, u'event': u'BRICK_CONNECTED', u'ts': 1476697371, u'nodeid': u'72c4f894-61f7-433e-a546-4ad2d7f0a176'}
10.70.46.240 - - [12/Oct/2016 11:27:22] "POST /listen HTTP/1.1" 200 -
{u'message': {u'peer': u'10.70.46.240', u'volume': u'disp', u'brick': u'/bricks/brick1/disp'}, u'event': u'BRICK_DISCONNECTED', u'ts': 1476697371, u'nodeid': u'72c4f894-61f7-433e-a546-4ad2d7f0a176'}
10.70.46.240 - - [12/Oct/2016 11:27:22] "POST /listen HTTP/1.1" 200 -
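For illustration, the flood above could be collapsed on the listener side by tracking each brick's last reported state and discarding repeats. This is a hypothetical sketch (the `filter_state_changes` function and the trimmed sample payloads are mine, not part of glustereventsd); it shows the kind of dedup that would reduce the per-heartbeat traffic to actual state transitions:

```python
# Hypothetical listener-side filter: keep only events that change a brick's
# last known state, keyed by (peer, volume, brick). Not part of glustereventsd.

def filter_state_changes(events):
    """Yield only events that differ from the brick's previous event."""
    last_state = {}  # (peer, volume, brick) -> last event name seen
    for ev in events:
        msg = ev["message"]
        key = (msg["peer"], msg["volume"], msg["brick"])
        if last_state.get(key) != ev["event"]:
            last_state[key] = ev["event"]
            yield ev

# Sample payloads modeled on the log above (brick0 flapping across heartbeats);
# 'nodeid' omitted for brevity.
events = [
    {"event": "BRICK_CONNECTED", "ts": 1476697365,
     "message": {"peer": "10.70.46.240", "volume": "disp",
                 "brick": "/bricks/brick0/disp"}},
    {"event": "BRICK_DISCONNECTED", "ts": 1476697365,
     "message": {"peer": "10.70.46.240", "volume": "disp",
                 "brick": "/bricks/brick0/disp"}},
    {"event": "BRICK_DISCONNECTED", "ts": 1476697368,
     "message": {"peer": "10.70.46.240", "volume": "disp",
                 "brick": "/bricks/brick0/disp"}},
]

changes = list(filter_state_changes(events))
# Only the first CONNECTED and the first DISCONNECTED survive; the repeated
# DISCONNECTED at the next heartbeat is suppressed.
```

This only masks the symptom at the consumer, of course; the spurious BRICK_CONNECTED emissions themselves are the bug being reported.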
I was able to see this in my setup multiple times until yesterday. Unfortunately, the volume has since been deleted, and I am not able to reproduce this on a _new_ volume with the steps mentioned in the description. Reducing the severity of this BZ for now. I will go ahead with my testing and update if I hit this again. If I am still unable to reproduce it after a substantial amount of testing, this BZ can be closed.
Alright, so I believe you hit a race similar to the one described in BZ 1387544.
Upstream patch details are available at BZ 1387544; moving this to the POST state.
upstream mainline : http://review.gluster.org/#/c/15699
downstream patch : https://code.engineering.redhat.com/gerrit/#/c/89352
upstream 3.9 patch : http://review.gluster.org/#/c/15722/

The 3.9 patch is also posted; however, given that the merge window is blocked with the 3.9 release around the corner, at worst the same will be merged for 3.9.1.
I have not seen this again in my events testing over the past few weeks. The status remains as mentioned in comment 3, and I have not seen any unnecessary events other than multiple CLIENT_CONNECT and CLIENT_DISCONNECT events. Moving this BZ to Verified in 3.2.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html