Bug 1385561

Summary: [Eventing]: BRICK_CONNECTED and BRICK_DISCONNECTED events seen at every heartbeat when a brick is killed/volume is stopped
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: Sweta Anandpara <sanandpa>
Component: glusterfs    Assignee: Atin Mukherjee <amukherj>
Status: CLOSED ERRATA QA Contact: Sweta Anandpara <sanandpa>
Severity: low Docs Contact:
Priority: unspecified    
Version: rhgs-3.2    CC: amukherj, rhinduja, sanandpa, vbellur
Target Milestone: ---   
Target Release: RHGS 3.2.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: glusterfs-3.8.4-4 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-03-23 06:10:59 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1387544    
Bug Blocks: 1351528    

Description Sweta Anandpara 2016-10-17 10:16:51 UTC
Description of problem:
========================
In a 4-node cluster with eventing enabled, if a brick goes down or a volume is stopped, multiple BRICK_CONNECTED and BRICK_DISCONNECTED events are seen for bricks belonging to one of the concerned nodes. These events keep getting generated forever, until the brick is brought back up or the volume is deleted.

Firstly, we should not be seeing BRICK_CONNECTED messages at all if the brick is disconnected. Secondly, we get a continuous stream of these events at every heartbeat, but I suspect that is by design. Is there a better alternative to it? Thirdly, I do not understand why I am getting these messages only for the bricks belonging to one of the nodes, and not the others.


Version-Release number of selected component (if applicable):
===========================================================
3.8.4-2


How reproducible:
=================
Always


Steps to Reproduce:
===================
1. Have a 4-node cluster with eventing enabled. Create a disperse volume.
2. Stop one of the bricks with 'kill -15 <brick_pid>', or stop the volume with 'volume stop <volname>'.
3. Monitor the events received by the webhook listener (a minimal listener sketch follows these steps).
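
A minimal sketch of the kind of listener referenced in step 3, assuming Python 2 and only the standard library; the port, URL, and the webhook registration shown in the comments are assumptions, not taken from this report. The dict reprs and the '"POST /listen HTTP/1.1" 200 -' lines it produces match the format of the output captured under Additional info.

#!/usr/bin/env python
# Minimal webhook listener (sketch). Register it on the cluster first, e.g.:
#   gluster-eventsapi webhook-add http://<listener-ip>:9000/listen
# (URL and port are illustrative.) Each eventsd POST is parsed and printed.
import json
from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer

class ListenHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.getheader('content-length', 0))
        body = self.rfile.read(length)
        try:
            event = json.loads(body)   # e.g. {"event": "BRICK_DISCONNECTED", "message": {...}, ...}
        except ValueError:
            event = body
        print(event)                   # printed as a Python dict repr, as seen under Additional info
        self.send_response(200)        # also emits the '"POST /listen HTTP/1.1" 200 -' access-log line
        self.end_headers()

if __name__ == '__main__':
    HTTPServer(('0.0.0.0', 9000), ListenHandler).serve_forever()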

Actual results:
==============
Multiple BRICK_CONNECTED and BRICK_DISCONNECTED events seen at every heartbeat


Expected results:
================
Only BRICK_DISCONNECTED events should be seen. Also, the interval at which the event should be resent is to be discussed.


Additional info:
===============


{u'message': {u'peer': u'10.70.46.240', u'volume': u'disp', u'brick': u'/bricks/brick0/disp'}, u'event': u'BRICK_CONNECTED', u'ts': 1476697365, u'nodeid': u'72c4f894-61f7-433e-a546-4ad2d7f0a176'}
10.70.46.240 - - [12/Oct/2016 11:27:16] "POST /listen HTTP/1.1" 200 -
{u'message': {u'peer': u'10.70.46.240', u'volume': u'disp', u'brick': u'/bricks/brick0/disp'}, u'event': u'BRICK_DISCONNECTED', u'ts': 1476697365, u'nodeid': u'72c4f894-61f7-433e-a546-4ad2d7f0a176'}
10.70.46.240 - - [12/Oct/2016 11:27:16] "POST /listen HTTP/1.1" 200 -
{u'message': {u'peer': u'10.70.46.240', u'volume': u'disp', u'brick': u'/bricks/brick1/disp'}, u'event': u'BRICK_CONNECTED', u'ts': 1476697368, u'nodeid': u'72c4f894-61f7-433e-a546-4ad2d7f0a176'}
10.70.46.240 - - [12/Oct/2016 11:27:19] "POST /listen HTTP/1.1" 200 -
{u'message': {u'peer': u'10.70.46.240', u'volume': u'disp', u'brick': u'/bricks/brick1/disp'}, u'event': u'BRICK_DISCONNECTED', u'ts': 1476697368, u'nodeid': u'72c4f894-61f7-433e-a546-4ad2d7f0a176'}
10.70.46.240 - - [12/Oct/2016 11:27:19] "POST /listen HTTP/1.1" 200 -
{u'message': {u'peer': u'10.70.46.240', u'volume': u'disp', u'brick': u'/bricks/brick0/disp'}, u'event': u'BRICK_DISCONNECTED', u'ts': 1476697368, u'nodeid': u'72c4f894-61f7-433e-a546-4ad2d7f0a176'}
10.70.46.240 - - [12/Oct/2016 11:27:19] "POST /listen HTTP/1.1" 200 -
{u'message': {u'peer': u'10.70.46.240', u'volume': u'disp', u'brick': u'/bricks/brick0/disp'}, u'event': u'BRICK_CONNECTED', u'ts': 1476697368, u'nodeid': u'72c4f894-61f7-433e-a546-4ad2d7f0a176'}
10.70.46.240 - - [12/Oct/2016 11:27:19] "POST /listen HTTP/1.1" 200 -
{u'message': {u'peer': u'10.70.46.240', u'volume': u'disp', u'brick': u'/bricks/brick1/disp'}, u'event': u'BRICK_CONNECTED', u'ts': 1476697371, u'nodeid': u'72c4f894-61f7-433e-a546-4ad2d7f0a176'}
10.70.46.240 - - [12/Oct/2016 11:27:22] "POST /listen HTTP/1.1" 200 -
{u'message': {u'peer': u'10.70.46.240', u'volume': u'disp', u'brick': u'/bricks/brick1/disp'}, u'event': u'BRICK_DISCONNECTED', u'ts': 1476697371, u'nodeid': u'72c4f894-61f7-433e-a546-4ad2d7f0a176'}
10.70.46.240 - - [12/Oct/2016 11:27:22] "POST /listen HTTP/1.1" 200 -
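
For reference, a hypothetical helper (not part of the eventing framework; the capture file name is illustrative) to tally the repeated events per brick from a saved copy of the listener output above:

import ast
from collections import Counter

counts = Counter()
with open('events.log') as capture:          # hypothetical capture of the listener output above
    for line in capture:
        line = line.strip()
        if not line.startswith('{'):
            continue                         # skip the '"POST /listen HTTP/1.1" 200 -' access-log lines
        event = ast.literal_eval(line)       # the listener prints Python dict reprs
        if event.get('event') in ('BRICK_CONNECTED', 'BRICK_DISCONNECTED'):
            counts[(event['event'], event['message']['brick'])] += 1

for (name, brick), n in counts.most_common():
    print('%-20s %-25s %d' % (name, brick, n))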

Comment 3 Sweta Anandpara 2016-10-18 08:49:39 UTC
I was able to see this in my setup multiple times until yesterday. Unfortunately, the volume has since been deleted, and I am not able to reproduce this with the steps mentioned in the description on a _new_ volume.

Reducing the severity of this BZ for now. I will go ahead with my testing and update if I hit this again. If I am not able to reproduce this after a substantial amount of testing, this BZ will be closed.

Comment 4 Atin Mukherjee 2016-10-21 17:52:17 UTC
Alright, so I believe you hit a race similar to the one in BZ 1387544.

Comment 5 Atin Mukherjee 2016-10-21 17:54:43 UTC
Upstream patch details are available at BZ 1387544; moving this to POST state.

Comment 8 Atin Mukherjee 2016-11-08 05:33:08 UTC
upstream mainline : http://review.gluster.org/#/c/15699
downstream patch : https://code.engineering.redhat.com/gerrit/#/c/89352

upstream 3.9 patch : http://review.gluster.org/#/c/15722/ is also posted; however, given that the merge window is blocked as the 3.9 release is around the corner, at worst the same will be merged for 3.9.1.

Comment 10 Sweta Anandpara 2016-11-22 09:26:07 UTC
Have not seen this again in my eventing testing over the past few weeks. The status remains as mentioned in comment 3, and I have not seen any unnecessary events other than multiple CLIENT_CONNECT and CLIENT_DISCONNECT events.

Moving this BZ to Verified in 3.2.

Comment 12 errata-xmlrpc 2017-03-23 06:10:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html