Description of problem:
=======================
In a 4-node cluster with eventing enabled, if a brick goes down or a volume is stopped, multiple BRICK_CONNECTED and BRICK_DISCONNECTED events are seen for bricks belonging to one of the affected nodes. These events keep getting generated indefinitely, until the brick is brought back up or the volume is deleted.

Firstly, we should not be seeing BRICK_CONNECTED messages at all if the brick is disconnected. Secondly, we get continuous traffic of these events at every heartbeat; I suspect that is by design, but is there a better alternative? Thirdly, I do not understand why I am getting these messages only for the bricks belonging to one of the nodes, and not the others.

Version-Release number of selected component (if applicable):
=============================================================
3.8.4-2

How reproducible:
=================
Always

Steps to Reproduce:
===================
1. Have a 4-node cluster with eventing enabled. Create a disperse volume.
2. Stop one of the bricks with the command 'kill -15 <brick_pid>', or stop the volume using 'volume stop <volname>'.
3. Monitor the events received.

Actual results:
===============
Multiple BRICK_CONNECTED and BRICK_DISCONNECTED events are seen at every heartbeat.

Expected results:
=================
Only BRICK_DISCONNECTED events should be seen. The interval at which the event should get resent is also to be discussed.
Additional info:
================
{u'message': {u'peer': u'10.70.46.240', u'volume': u'disp', u'brick': u'/bricks/brick0/disp'}, u'event': u'BRICK_CONNECTED', u'ts': 1476697365, u'nodeid': u'72c4f894-61f7-433e-a546-4ad2d7f0a176'}
10.70.46.240 - - [12/Oct/2016 11:27:16] "POST /listen HTTP/1.1" 200 -
{u'message': {u'peer': u'10.70.46.240', u'volume': u'disp', u'brick': u'/bricks/brick0/disp'}, u'event': u'BRICK_DISCONNECTED', u'ts': 1476697365, u'nodeid': u'72c4f894-61f7-433e-a546-4ad2d7f0a176'}
10.70.46.240 - - [12/Oct/2016 11:27:16] "POST /listen HTTP/1.1" 200 -
{u'message': {u'peer': u'10.70.46.240', u'volume': u'disp', u'brick': u'/bricks/brick1/disp'}, u'event': u'BRICK_CONNECTED', u'ts': 1476697368, u'nodeid': u'72c4f894-61f7-433e-a546-4ad2d7f0a176'}
10.70.46.240 - - [12/Oct/2016 11:27:19] "POST /listen HTTP/1.1" 200 -
{u'message': {u'peer': u'10.70.46.240', u'volume': u'disp', u'brick': u'/bricks/brick1/disp'}, u'event': u'BRICK_DISCONNECTED', u'ts': 1476697368, u'nodeid': u'72c4f894-61f7-433e-a546-4ad2d7f0a176'}
10.70.46.240 - - [12/Oct/2016 11:27:19] "POST /listen HTTP/1.1" 200 -
{u'message': {u'peer': u'10.70.46.240', u'volume': u'disp', u'brick': u'/bricks/brick0/disp'}, u'event': u'BRICK_DISCONNECTED', u'ts': 1476697368, u'nodeid': u'72c4f894-61f7-433e-a546-4ad2d7f0a176'}
10.70.46.240 - - [12/Oct/2016 11:27:19] "POST /listen HTTP/1.1" 200 -
{u'message': {u'peer': u'10.70.46.240', u'volume': u'disp', u'brick': u'/bricks/brick0/disp'}, u'event': u'BRICK_CONNECTED', u'ts': 1476697368, u'nodeid': u'72c4f894-61f7-433e-a546-4ad2d7f0a176'}
10.70.46.240 - - [12/Oct/2016 11:27:19] "POST /listen HTTP/1.1" 200 -
{u'message': {u'peer': u'10.70.46.240', u'volume': u'disp', u'brick': u'/bricks/brick1/disp'}, u'event': u'BRICK_CONNECTED', u'ts': 1476697371, u'nodeid': u'72c4f894-61f7-433e-a546-4ad2d7f0a176'}
10.70.46.240 - - [12/Oct/2016 11:27:22] "POST /listen HTTP/1.1" 200 -
{u'message': {u'peer': u'10.70.46.240', u'volume': u'disp', u'brick': u'/bricks/brick1/disp'}, u'event': u'BRICK_DISCONNECTED', u'ts': 1476697371, u'nodeid': u'72c4f894-61f7-433e-a546-4ad2d7f0a176'}
10.70.46.240 - - [12/Oct/2016 11:27:22] "POST /listen HTTP/1.1" 200 -
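For illustration, the flood above could be collapsed on the listener side by tracking each brick's last reported state and discarding repeats. This is a hypothetical sketch (the `filter_state_changes` function and the trimmed sample payloads are mine, not part of glustereventsd); it shows the kind of dedup that would reduce the per-heartbeat traffic to actual state transitions:

```python
# Hypothetical listener-side filter: keep only events that change a brick's
# last known state, keyed by (peer, volume, brick). Not part of glustereventsd.

def filter_state_changes(events):
    """Yield only events that differ from the brick's previous event."""
    last_state = {}  # (peer, volume, brick) -> last event name seen
    for ev in events:
        msg = ev["message"]
        key = (msg["peer"], msg["volume"], msg["brick"])
        if last_state.get(key) != ev["event"]:
            last_state[key] = ev["event"]
            yield ev

# Sample payloads modeled on the log above (brick0 flapping across heartbeats);
# 'nodeid' omitted for brevity.
events = [
    {"event": "BRICK_CONNECTED", "ts": 1476697365,
     "message": {"peer": "10.70.46.240", "volume": "disp",
                 "brick": "/bricks/brick0/disp"}},
    {"event": "BRICK_DISCONNECTED", "ts": 1476697365,
     "message": {"peer": "10.70.46.240", "volume": "disp",
                 "brick": "/bricks/brick0/disp"}},
    {"event": "BRICK_DISCONNECTED", "ts": 1476697368,
     "message": {"peer": "10.70.46.240", "volume": "disp",
                 "brick": "/bricks/brick0/disp"}},
]

changes = list(filter_state_changes(events))
# Only the first CONNECTED and the first DISCONNECTED survive; the repeated
# DISCONNECTED at the next heartbeat is suppressed.
```

This only masks the symptom at the consumer, of course; the spurious BRICK_CONNECTED emissions themselves are the bug being reported.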
I was able to see this in my setup multiple times until yesterday. Unfortunately, the volume has since been deleted, and I am not able to reproduce this on a _new_ volume with the steps mentioned in the description. Reducing the severity of this BZ for now. I will go ahead with my testing and update if I hit this again. If I am still unable to reproduce it after a substantial amount of testing, this BZ can be closed.
Alright, so I believe you hit a race similar to the one described in BZ 1387544.
Upstream patch details are available at BZ 1387544; moving this to the POST state.
upstream mainline : http://review.gluster.org/#/c/15699
downstream patch : https://code.engineering.redhat.com/gerrit/#/c/89352
upstream 3.9 patch : http://review.gluster.org/#/c/15722/

The 3.9 patch is also posted; however, given that the merge window is blocked with the 3.9 release around the corner, at worst the same will be merged for 3.9.1.
I have not seen this again in my events testing over the past few weeks. The status remains as mentioned in comment 3, and I have not seen any unnecessary events other than multiple CLIENT_CONNECT and CLIENT_DISCONNECT events. Moving this BZ to Verified in 3.2.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html