Description of problem:
======================
In a 4-node cluster with a tiered volume (1 x (4+2) disperse as the cold tier, 2 x 2 distributed-replicate as the hot tier), executing 'gluster volume tier <volname> detach start' emits a VOLUME_REBALANCE_COMPLETE event along with TIER_DETACH_START. At that point the file migration from hot to cold is still in progress, so the VOLUME_REBALANCE_COMPLETE event misleads the consumer of events into believing that all files have already been moved from hot to cold.

A couple of concerns here:
Firstly, should we really be emitting a VOLUME_REBALANCE event at all on a tier detach?
Secondly, shouldn't a REBALANCE_COMPLETE event be seen _after_ the file migration from hot to cold is complete, and not at any other time?

Proposal: Emit a TIER_DETACH_COMPLETE event once file migration is complete, rather than VOLUME_REBALANCE_COMPLETE. At the very least, that name is more accurate.

Version-Release number of selected component (if applicable):
==========================================================
3.8.4-5

How reproducible:
=================
Always

Steps to Reproduce:
===================
1. Have a cluster with eventing enabled and a webhook registered as a listener (a minimal listener sketch follows these steps).
2. Have a tiered volume with about 2000 files present.
3. Execute 'gluster volume tier <volname> detach start' and monitor the events received.
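For reference, the listener in step 1 can be as simple as an HTTP endpoint that logs each JSON payload the gluster events daemon POSTs to it, registered via 'gluster-eventsapi webhook-add'. The sketch below is illustrative only; the port, path, and Python 2 style are my assumptions, not the harness actually used for this report.

#!/usr/bin/env python
# Minimal gluster-events webhook listener (illustrative sketch).
# Register with:
#   gluster-eventsapi webhook-add http://<this-host>:9000/listen
import json
from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer

class EventHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # The events daemon POSTs one JSON document per event.
        length = int(self.headers.getheader('content-length') or 0)
        event = json.loads(self.rfile.read(length))
        print '%(ts)s %(event)s %(message)s' % event
        self.send_response(200)
        self.end_headers()

if __name__ == '__main__':
    HTTPServer(('0.0.0.0', 9000), EventHandler).serve_forever()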
Actual results:
==============
Step 3 triggers a VOLUME_REBALANCE_COMPLETE along with TIER_DETACH_START.

Expected results:
================
Only the TIER_DETACH_START event should be seen, and a REBALANCE_COMPLETE should be seen only after all files have been moved from hot to cold.

Additional info:
===================
[root@dhcp46-239 ~]# rpm -qa | grep gluster
nfs-ganesha-gluster-2.3.1-8.el7rhgs.x86_64
glusterfs-api-3.8.4-5.el7rhgs.x86_64
python-gluster-3.8.4-5.el7rhgs.noarch
glusterfs-client-xlators-3.8.4-5.el7rhgs.x86_64
glusterfs-server-3.8.4-5.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-5.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-devel-3.8.4-5.el7rhgs.x86_64
gluster-nagios-addons-0.2.8-1.el7rhgs.x86_64
glusterfs-libs-3.8.4-5.el7rhgs.x86_64
glusterfs-fuse-3.8.4-5.el7rhgs.x86_64
glusterfs-api-devel-3.8.4-5.el7rhgs.x86_64
glusterfs-rdma-3.8.4-5.el7rhgs.x86_64
glusterfs-3.8.4-5.el7rhgs.x86_64
glusterfs-cli-3.8.4-5.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-5.el7rhgs.x86_64
glusterfs-debuginfo-3.8.4-4.el7rhgs.x86_64
glusterfs-events-3.8.4-5.el7rhgs.x86_64

[root@dhcp46-239 ~]# gluster peer status
Number of Peers: 3

Hostname: 10.70.46.240
Uuid: 72c4f894-61f7-433e-a546-4ad2d7f0a176
State: Peer in Cluster (Connected)

Hostname: 10.70.46.242
Uuid: 1e8967ae-51b2-4c27-907e-a22a83107fd0
State: Peer in Cluster (Connected)

Hostname: 10.70.46.218
Uuid: 0dea52e0-8c32-4616-8ef8-16db16120eaa
State: Peer in Cluster (Connected)

[root@dhcp46-239 yum.repos.d]# gluster v info

Volume Name: ozone
Type: Tier
Volume ID: 376cdde0-194f-460a-b273-3904a704a7dd
Status: Started
Snapshot Count: 0
Number of Bricks: 10
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 2 x 2 = 4
Brick1: 10.70.46.218:/bricks/brick2/ozone_tier3
Brick2: 10.70.46.218:/bricks/brick2/ozone_tier2
Brick3: 10.70.46.218:/bricks/brick2/ozone_tier1
Brick4: 10.70.46.218:/bricks/brick2/ozone_tier0
Cold Tier:
Cold Tier Type : Disperse
Number of Bricks: 1 x (4 + 2) = 6
Brick5: 10.70.46.239:/bricks/brick0/ozone0
Brick6: 10.70.46.240:/bricks/brick0/ozone2
Brick7: 10.70.46.242:/bricks/brick0/ozone2
Brick8: 10.70.46.239:/bricks/brick1/ozone3
Brick9: 10.70.46.240:/bricks/brick1/ozone4
Brick10: 10.70.46.242:/bricks/brick1/ozone5
Options Reconfigured:
features.scrub-freq: minute
features.scrub: Active
features.bitrot: on
cluster.tier-mode: cache
features.ctr-enabled: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
cluster.enable-shared-storage: disable

[root@dhcp46-239 ~]# gluster v tier ozone detach start
volume detach-tier start: success
ID: 41e86ff1-c890-45d9-a8c3-2672b4694eeb

[root@dhcp46-239 ~]# gluster v tier ozone detach status
Node            Rebalanced-files    size       scanned    failures    skipped    status         run time in h:m:s
---------       ----------------    -------    -------    --------    -------    -----------    -----------------
10.70.46.218    0                   0Bytes     0          0           0          in progress    0:0:0

[root@dhcp46-239 ~]# gluster v tier ozone detach status
Node            Rebalanced-files    size       scanned    failures    skipped    status         run time in h:m:s
---------       ----------------    -------    -------    --------    -------    -----------    -----------------
10.70.46.218    0                   0Bytes     0          0           0          in progress    0:0:0

[root@dhcp46-239 ~]# gluster v tier ozone detach status
Node            Rebalanced-files    size       scanned    failures    skipped    status         run time in h:m:s
---------       ----------------    -------    -------    --------    -------    -----------    -----------------
10.70.46.218    2131                102.1MB    2131       0           0          completed      0:16:14

EVENTS
---------
bash-4.3$ grep -v "200" tier_detach_start | grep -v "####" | grep -v "CLIENT_" | grep -v "EC_" | grep -v "SVC" | grep -v "AFR"
{u'message': {u'volume': u'ozone'}, u'event': u'VOLUME_REBALANCE_COMPLETE', u'ts': 1479207699, u'nodeid': u'ed362eb3-421c-4a25-ad0e-82ef157ea328'}
{u'message': {u'vol': u'ozone'}, u'event': u'TIER_DETACH_START', u'ts': 1479207709, u'nodeid': u'ed362eb3-421c-4a25-ad0e-82ef157ea328'}
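Until a TIER_DETACH_COMPLETE event exists, a consumer could defend itself against the premature VOLUME_REBALANCE_COMPLETE by cross-checking 'gluster volume tier <volname> detach status' before treating the migration as finished. The sketch below is a possible consumer-side workaround, not anything gluster ships: the helper names and polling approach are mine; only the volume name and the payload shape come from the capture above.

#!/usr/bin/env python
# Consumer-side guard (sketch): verify a VOLUME_REBALANCE_COMPLETE against
# the detach status CLI before acting on it. Helper names are hypothetical.
import subprocess

def detach_migration_done(volname):
    # 'completed' appears in the status column only once all files have
    # been moved off the hot tier (see the detach status output above).
    out = subprocess.check_output(
        ['gluster', 'volume', 'tier', volname, 'detach', 'status'])
    return 'completed' in out and 'in progress' not in out

def on_event(event):
    if event.get('event') == 'VOLUME_REBALANCE_COMPLETE':
        # Note the payload uses 'volume' here but 'vol' for TIER_DETACH_START.
        vol = event['message'].get('volume')
        if vol and not detach_migration_done(vol):
            print 'ignoring premature VOLUME_REBALANCE_COMPLETE for %s' % vol
            return
    # ... normal event handling ...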
Engineering discussion is still ongoing (in a mail chain) to reach an agreement between dev and QE, and I believe RCA is in progress. The bug is fairly easy to reproduce. Clearing the needinfo for now, as this BZ is not waiting on me.
Upstream patch http://review.gluster.org/#/c/15919/ has been posted for review.