Description of problem:
========================
We chose not to send BRICK_DISCONNECTED events if the volume is not started; a plain VOLUME_STOP event should suffice to pass on the message when we execute a 'gluster volume stop <volname>'. In my 4-node cluster setup, which had a tier volume // 1*(4+2) as cold and 2*2 as hot //, volume stop generated BRICK_DISCONNECTED events for the hot tier bricks.

Version-Release number of selected component (if applicable):
=============================================================
3.8.4-2

How reproducible:
=================
Hit it twice

Steps to Reproduce:
===================
1. Have a 4-node cluster with eventing enabled.
2. Create a tier volume with 1*(4+2) as the cold tier and 2*2 as the hot tier.
3. Stop the volume using the command 'gluster volume stop disp'.

Actual results:
===============
BRICK_DISCONNECTED events seen along with VOLUME_STOP

Expected results:
=================
BRICK_DISCONNECTED events should not be seen

Additional info:
================

[root@dhcp46-218 ~]#
[root@dhcp46-218 ~]# gluster peer status
Number of Peers: 3

Hostname: dhcp46-239.lab.eng.blr.redhat.com
Uuid: ed362eb3-421c-4a25-ad0e-82ef157ea328
State: Peer in Cluster (Connected)

Hostname: 10.70.46.240
Uuid: 72c4f894-61f7-433e-a546-4ad2d7f0a176
State: Peer in Cluster (Connected)

Hostname: 10.70.46.242
Uuid: 1e8967ae-51b2-4c27-907e-a22a83107fd0
State: Peer in Cluster (Connected)

[root@dhcp46-218 ~]# rpm -qa | grep gluster
glusterfs-debuginfo-3.8.4-1.el7rhgs.x86_64
glusterfs-fuse-3.8.4-2.el7rhgs.x86_64
glusterfs-cli-3.8.4-2.el7rhgs.x86_64
glusterfs-events-3.8.4-2.el7rhgs.x86_64
glusterfs-devel-3.8.4-2.el7rhgs.x86_64
glusterfs-api-devel-3.8.4-2.el7rhgs.x86_64
glusterfs-3.8.4-2.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-2.el7rhgs.x86_64
python-gluster-3.8.4-2.el7rhgs.noarch
glusterfs-ganesha-3.8.4-2.el7rhgs.x86_64
glusterfs-server-3.8.4-2.el7rhgs.x86_64
nfs-ganesha-gluster-2.3.1-8.el7rhgs.x86_64
glusterfs-libs-3.8.4-2.el7rhgs.x86_64
glusterfs-api-3.8.4-2.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-2.el7rhgs.x86_64
glusterfs-rdma-3.8.4-2.el7rhgs.x86_64
[root@dhcp46-218 ~]#
[root@dhcp46-218 ~]#
[root@dhcp46-218 ~]# gluster v info

Volume Name: disp
Type: Tier
Volume ID: a9999464-b094-4213-a422-c11fed555674
Status: Started
Snapshot Count: 0
Number of Bricks: 10
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 2 x 2 = 4
Brick1: 10.70.46.218:/bricks/brick0/disp_hottier4
Brick2: 10.70.46.218:/bricks/brick0/disp_hottier3
Brick3: 10.70.46.218:/bricks/brick0/disp_hottier2
Brick4: 10.70.46.218:/bricks/brick0/disp_hottier1
Cold Tier:
Cold Tier Type : Disperse
Number of Bricks: 1 x (4 + 2) = 6
Brick5: 10.70.46.239:/bricks/brick0/disp1
Brick6: 10.70.46.240:/bricks/brick0/disp2
Brick7: 10.70.46.242:/bricks/brick0/disp3
Brick8: 10.70.46.242:/bricks/brick1/disp4
Brick9: 10.70.46.239:/bricks/brick1/disp5
Brick10: 10.70.46.240:/bricks/brick1/disp6
Options Reconfigured:
cluster.tier-mode: cache
features.ctr-enabled: on
features.barrier: disable
performance.readdir-ahead: on
transport.address-family: inet
cluster.watermark-low: 10
cluster.watermark-hi: 20
cluster.enable-shared-storage: disable
[root@dhcp46-218 ~]# gluster v stop disp
Stopping volume will make its data inaccessible. Do you want to continue?
(y/n) y
volume stop: disp: success
[root@dhcp46-218 ~]# gluster v info

Volume Name: disp
Type: Tier
Volume ID: a9999464-b094-4213-a422-c11fed555674
Status: Stopped
Snapshot Count: 0
Number of Bricks: 10
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 2 x 2 = 4
Brick1: 10.70.46.218:/bricks/brick0/disp_hottier4
Brick2: 10.70.46.218:/bricks/brick0/disp_hottier3
Brick3: 10.70.46.218:/bricks/brick0/disp_hottier2
Brick4: 10.70.46.218:/bricks/brick0/disp_hottier1
Cold Tier:
Cold Tier Type : Disperse
Number of Bricks: 1 x (4 + 2) = 6
Brick5: 10.70.46.239:/bricks/brick0/disp1
Brick6: 10.70.46.240:/bricks/brick0/disp2
Brick7: 10.70.46.242:/bricks/brick0/disp3
Brick8: 10.70.46.242:/bricks/brick1/disp4
Brick9: 10.70.46.239:/bricks/brick1/disp5
Brick10: 10.70.46.240:/bricks/brick1/disp6
Options Reconfigured:
cluster.tier-mode: cache
features.ctr-enabled: on
features.barrier: disable
performance.readdir-ahead: on
transport.address-family: inet
cluster.watermark-low: 10
cluster.watermark-hi: 20
cluster.enable-shared-storage: disable
[root@dhcp46-218 ~]#

######################## EVENTS ################################

bash-4.3$ grep -v "200" volume_stop | grep -v "CLIENT_CONNECT" | grep -v "CLIENT_DISCONNECT" | grep -v "#####"
{u'message': {u'subvol': u'disp-replicate-0'}, u'event': u'AFR_SUBVOLS_DOWN', u'ts': 1477033323, u'nodeid': u'ed362eb3-421c-4a25-ad0e-82ef157ea328'}
{u'message': {u'subvol': u'disp-replicate-0'}, u'event': u'AFR_SUBVOLS_DOWN', u'ts': 1477033323, u'nodeid': u'ed362eb3-421c-4a25-ad0e-82ef157ea328'}
{u'message': {u'subvol': u'disp-replicate-0'}, u'event': u'AFR_SUBVOLS_DOWN', u'ts': 1477033323, u'nodeid': u'72c4f894-61f7-433e-a546-4ad2d7f0a176'}
{u'message': {u'subvol': u'disp-replicate-0'}, u'event': u'AFR_SUBVOLS_DOWN', u'ts': 1477033323, u'nodeid': u'72c4f894-61f7-433e-a546-4ad2d7f0a176'}
{u'message': {u'subvol': u'disp-replicate-0'}, u'event': u'AFR_SUBVOLS_DOWN', u'ts': 1477033323, u'nodeid': u'0dea52e0-8c32-4616-8ef8-16db16120eaa'}
{u'message': {u'subvol': u'disp-replicate-0'}, u'event': u'AFR_SUBVOLS_DOWN', u'ts': 1477033323, u'nodeid': u'1e8967ae-51b2-4c27-907e-a22a83107fd0'}
{u'message': {u'subvol': u'disp-replicate-0'}, u'event': u'AFR_SUBVOLS_DOWN', u'ts': 1477033323, u'nodeid': u'1e8967ae-51b2-4c27-907e-a22a83107fd0'}
{u'message': {u'peer': u'10.70.46.218', u'volume': u'disp', u'brick': u'/bricks/brick0/disp_hottier3'}, u'event': u'BRICK_DISCONNECTED', u'ts': 1477033323, u'nodeid': u'0dea52e0-8c32-4616-8ef8-16db16120eaa'}
{u'message': {u'subvol': u'disp-replicate-0'}, u'event': u'AFR_SUBVOLS_DOWN', u'ts': 1477033323, u'nodeid': u'0dea52e0-8c32-4616-8ef8-16db16120eaa'}
{u'message': {u'subvol': u'disp-replicate-1'}, u'event': u'AFR_SUBVOLS_DOWN', u'ts': 1477033323, u'nodeid': u'1e8967ae-51b2-4c27-907e-a22a83107fd0'}
{u'message': {u'subvol': u'disp-replicate-1'}, u'event': u'AFR_SUBVOLS_DOWN', u'ts': 1477033323, u'nodeid': u'ed362eb3-421c-4a25-ad0e-82ef157ea328'}
{u'message': {u'subvol': u'disp-replicate-1'}, u'event': u'AFR_SUBVOLS_DOWN', u'ts': 1477033323, u'nodeid': u'72c4f894-61f7-433e-a546-4ad2d7f0a176'}
{u'message': {u'subvol': u'disp-replicate-1'}, u'event': u'AFR_SUBVOLS_DOWN', u'ts': 1477033323, u'nodeid': u'ed362eb3-421c-4a25-ad0e-82ef157ea328'}
{u'message': {u'subvol': u'disp-replicate-1'}, u'event': u'AFR_SUBVOLS_DOWN', u'ts': 1477033323, u'nodeid': u'1e8967ae-51b2-4c27-907e-a22a83107fd0'}
{u'message': {u'subvol': u'disp-replicate-1'}, u'event': u'AFR_SUBVOLS_DOWN', u'ts': 1477033323, u'nodeid': u'72c4f894-61f7-433e-a546-4ad2d7f0a176'}
{u'message': {u'peer': u'10.70.46.218', u'volume': u'disp', u'brick': u'/bricks/brick0/disp_hottier4'}, u'event': u'BRICK_DISCONNECTED', u'ts': 1477033323, u'nodeid': u'0dea52e0-8c32-4616-8ef8-16db16120eaa'}
{u'message': {u'peer': u'10.70.46.218', u'volume': u'disp', u'brick': u'/bricks/brick0/disp_hottier2'}, u'event': u'BRICK_DISCONNECTED', u'ts': 1477033323, u'nodeid': u'0dea52e0-8c32-4616-8ef8-16db16120eaa'}
{u'message': {u'subvol': u'disp-replicate-1'}, u'event': u'AFR_SUBVOLS_DOWN', u'ts': 1477033323, u'nodeid': u'0dea52e0-8c32-4616-8ef8-16db16120eaa'}
{u'message': {u'volume': u'isp'}, u'event': u'VOLUME_REBALANCE_FAILED', u'ts': 1477033323, u'nodeid': u'1e8967ae-51b2-4c27-907e-a22a83107fd0'}
{u'message': {u'subvol': u'disp-replicate-1'}, u'event': u'AFR_SUBVOLS_DOWN', u'ts': 1477033323, u'nodeid': u'0dea52e0-8c32-4616-8ef8-16db16120eaa'}
{u'message': {u'volume': u'isp'}, u'event': u'VOLUME_REBALANCE_FAILED', u'ts': 1477033324, u'nodeid': u'0dea52e0-8c32-4616-8ef8-16db16120eaa'}
{u'message': {u'volume': u'isp'}, u'event': u'VOLUME_REBALANCE_FAILED', u'ts': 1477033324, u'nodeid': u'72c4f894-61f7-433e-a546-4ad2d7f0a176'}
{u'message': {u'volume': u'isp'}, u'event': u'VOLUME_REBALANCE_FAILED', u'ts': 1477033324, u'nodeid': u'ed362eb3-421c-4a25-ad0e-82ef157ea328'}
{u'message': {u'subvol': u'disp-disperse-0'}, u'event': u'EC_MIN_BRICKS_NOT_UP', u'ts': 1477033325, u'nodeid': u'ed362eb3-421c-4a25-ad0e-82ef157ea328'}
{u'message': {u'subvol': u'disp-disperse-0'}, u'event': u'EC_MIN_BRICKS_NOT_UP', u'ts': 1477033325, u'nodeid': u'1e8967ae-51b2-4c27-907e-a22a83107fd0'}
{u'message': {u'subvol': u'disp-disperse-0'}, u'event': u'EC_MIN_BRICKS_NOT_UP', u'ts': 1477033325, u'nodeid': u'72c4f894-61f7-433e-a546-4ad2d7f0a176'}
{u'message': {u'force': u'0', u'name': u'disp'}, u'event': u'VOLUME_STOP', u'ts': 1477033327, u'nodeid': u'0dea52e0-8c32-4616-8ef8-16db16120eaa'}
bash-4.3$
So here goes the RCA: On a volume stop trigger, glusterd issues a brick-op to terminate the brick process during the brick-op phase; then, in the commit-op phase, glusterd once again tries to kill the same process if it still exists and marks the brickinfo->status flag as GF_BRICK_STOPPED. In the former case, if the brick is successfully killed, there is a possibility that glusterd will receive RPC_CLNT_DISCONNECT from the said brick process before the commit-op phase has even executed, at which point brickinfo->status is still set to GF_BRICK_STARTED. The BRICK_DISCONNECTED event should be sent only if a brick has died unexpectedly, not through a volume stop/remove-brick trigger; however, due to this race, the event is also sent out on a volume stop. This has nothing to do with the volume type; eventing has merely uncovered this race, and hence this bug is now moved to the GlusterD component.
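For illustration, here is a minimal, self-contained C sketch of the race and of the status gate the fix relies on. The types and helpers (brick_info_t, handle_rpc_disconnect, emit_event) are simplified stand-ins invented for this sketch, not glusterd's actual code; only the GF_BRICK_STARTED/GF_BRICK_STOPPED names mirror the real flags mentioned above.

/* Sketch of the volume-stop vs. RPC-disconnect race; assumed names,
 * not glusterd source. */
#include <stdio.h>

typedef enum {
    GF_BRICK_STOPPED,   /* set by the commit-op phase of volume stop */
    GF_BRICK_STARTED,
} brick_status_t;

typedef struct {
    const char     *path;
    brick_status_t  status;
} brick_info_t;

static void emit_event(const char *event, const char *brick)
{
    printf("event=%s brick=%s\n", event, brick);
}

/* Called when glusterd loses its RPC connection to a brick process.
 * The buggy path emits BRICK_DISCONNECTED unconditionally; the gate
 * below stays quiet for a planned stop, provided the status flag was
 * flipped to GF_BRICK_STOPPED before the brick was killed. */
static void handle_rpc_disconnect(brick_info_t *brick)
{
    if (brick->status == GF_BRICK_STARTED)
        emit_event("BRICK_DISCONNECTED", brick->path);
    /* else: planned stop -- VOLUME_STOP alone tells the story */
}

int main(void)
{
    brick_info_t brick = { "/bricks/brick0/disp_hottier1", GF_BRICK_STARTED };

    /* The race from the RCA: the brick-op phase kills the brick and the
     * disconnect arrives before the commit-op phase runs, so the handler
     * still sees GF_BRICK_STARTED and wrongly emits BRICK_DISCONNECTED. */
    handle_rpc_disconnect(&brick);   /* spurious event */

    /* With the status updated before the disconnect is processed, the
     * same callback correctly stays silent. */
    brick.status = GF_BRICK_STOPPED;
    handle_rpc_disconnect(&brick);   /* no event */

    return 0;
}

In other words, the window between the brick-op kill and the commit-op status update is exactly where the spurious event escapes; closing it means updating brickinfo->status before (or together with) terminating the brick, so the disconnect handler can tell a planned stop from a genuine brick death.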
Upstream mainline patch http://review.gluster.org/#/c/15699 has been posted for review.
upstream mainline : http://review.gluster.org/#/c/15699
downstream patch : https://code.engineering.redhat.com/gerrit/#/c/89352

upstream 3.9 patch : http://review.gluster.org/#/c/15722/ has also been posted; however, given that the merge window is blocked with the 3.9 release around the corner, at worst it will be merged for 3.9.1.
Verified this bug using the build glusterfs-3.8.4-5. The BRICK_DISCONNECTED event is not generated when the volume is stopped (for any volume type); it is generated only when a brick is killed. Moving to the verified state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html