Description of problem:
=======================
Have a 4-node master cluster and a 4-node slave cluster, with the volumes 'master' (2*2) and 'slave' (2*2) created on them. When a geo-rep session is stopped, the expected GEOREP_STOP event is preceded by 4 GEOREP_FAULTY events, implying that something is wrong with the session, which is not true. GEOREP_FAULTY events should be seen only when the geo-rep session turns Faulty for some reason, and not at any other time.

Version-Release number of selected component (if applicable):
=============================================================
3.8.4-2

How reproducible:
=================
Always

Steps to Reproduce:
===================
1. Have 4-node master and slave clusters created, with 2*2 volumes created in both.
2. Establish a geo-rep session between the two and monitor the events seen on the master cluster side (one way of capturing these events is sketched after the transcript in Additional info below). GEOREP_CREATE and GEOREP_START are seen as expected.
3. Stop the geo-rep session and monitor the events again.

Actual results:
===============
Step 3 results in 4 GEOREP_FAULTY events (one for every brick of the volume), and then a GEOREP_STOP event.

Expected results:
=================
Step 3 should result in a GEOREP_STOP event ONLY.

Additional info:
================
[root@dhcp46-239 ~]# gluster system:: execute gsec_create
Common secret pub file present at /var/lib/glusterd/geo-replication/common_secret.pem.pub

[root@dhcp46-239 ~]# gluster volume geo-replication master 10.70.35.115::slave create push-pem
Creating geo-replication session between master & 10.70.35.115::slave has been successful
[root@dhcp46-239 ~]# {u'message': {u'slave': u'10.70.35.115::slave', u'master': u'master', u'push_pem': u'1'}, u'event': u'GEOREP_CREATE', u'ts': 1476703797, u'nodeid': u'ed362eb3-421c-4a25-ad0e-82ef157ea328'}
-------------------------------------------------------------------------------------------------------------------------
[root@dhcp46-239 ~]# gluster volume geo-replication master 10.70.35.115::slave config use_meta_volume true
geo-replication config updated successfully
[root@dhcp46-239 ~]# {u'message': {u'slave': u'10.70.35.115::slave', u'master': u'master', u'option': u'use_meta_volume', u'value': u'true'}, u'event': u'GEOREP_CONFIG_SET', u'ts': 1476703859, u'nodeid': u'ed362eb3-421c-4a25-ad0e-82ef157ea328'}
--------------------------------------------------------------------------------------------------------------------------
[root@dhcp46-239 ~]# gluster volume geo-replication master 10.70.35.115::slave start
Starting geo-replication session between master & 10.70.35.115::slave has been successful
[root@dhcp46-239 ~]#
[root@dhcp46-239 ~]#
[root@dhcp46-239 ~]# gluster volume geo-replication master 10.70.35.115::slave status

MASTER NODE     MASTER VOL    MASTER BRICK              SLAVE USER    SLAVE                  SLAVE NODE      STATUS     CRAWL STATUS     LAST_SYNCED
-----------------------------------------------------------------------------------------------------------------------------------------------------
10.70.46.239    master        /bricks/brick0/master1    root          10.70.35.115::slave    10.70.35.115    Active     History Crawl    2016-10-17 15:53:12
10.70.46.218    master        /bricks/brick0/master4    root          10.70.35.115::slave    10.70.35.100    Active     History Crawl    2016-10-17 15:53:05
10.70.46.240    master        /bricks/brick0/master2    root          10.70.35.115::slave    10.70.35.104    Passive    N/A              N/A
10.70.46.242    master        /bricks/brick0/master3    root          10.70.35.115::slave    10.70.35.101    Passive    N/A              N/A
[root@dhcp46-239 ~]#
[root@dhcp46-239 ~]# {u'message': {u'slave': u'10.70.35.115::slave', u'master': u'master'}, u'event': u'GEOREP_START', u'ts': 1476703973, u'nodeid': u'ed362eb3-421c-4a25-ad0e-82ef157ea328'}
-------------------------------------------------------------------------------------------------------------------------------
[root@dhcp46-239 ~]# gluster volume geo-replication master 10.70.35.115::slave stop
Stopping geo-replication session between master & 10.70.35.115::slave has been successful
[root@dhcp46-239 ~]# {u'message': {u'current_slave_host': u'10.70.35.115', u'master_node': u'10.70.46.239', u'brick_path': u'/bricks/brick0/master1', u'slave_host': u'10.70.35.115', u'master_volume': u'master', u'slave_volume': u'slave'}, u'event': u'GEOREP_FAULTY', u'ts': 1476704008, u'nodeid': u'ed362eb3-421c-4a25-ad0e-82ef157ea328'}
{u'message': {u'current_slave_host': u'10.70.35.100', u'master_node': u'10.70.46.218', u'brick_path': u'/bricks/brick0/master4', u'slave_host': u'10.70.35.115', u'master_volume': u'master', u'slave_volume': u'slave'}, u'event': u'GEOREP_FAULTY', u'ts': 1476704010, u'nodeid': u'0dea52e0-8c32-4616-8ef8-16db16120eaa'}
{u'message': {u'current_slave_host': u'10.70.35.101', u'master_node': u'10.70.46.242', u'brick_path': u'/bricks/brick0/master3', u'slave_host': u'10.70.35.115', u'master_volume': u'master', u'slave_volume': u'slave'}, u'event': u'GEOREP_FAULTY', u'ts': 1476704010, u'nodeid': u'1e8967ae-51b2-4c27-907e-a22a83107fd0'}
{u'message': {u'current_slave_host': u'10.70.35.104', u'master_node': u'10.70.46.240', u'brick_path': u'/bricks/brick0/master2', u'slave_host': u'10.70.35.115', u'master_volume': u'master', u'slave_volume': u'slave'}, u'event': u'GEOREP_FAULTY', u'ts': 1476704010, u'nodeid': u'72c4f894-61f7-433e-a546-4ad2d7f0a176'}
{u'message': {u'slave': u'10.70.35.115::slave', u'master': u'master'}, u'event': u'GEOREP_STOP', u'ts': 1476704011, u'nodeid': u'ed362eb3-421c-4a25-ad0e-82ef157ea328'}
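
For reference, the event dictionaries in the transcript above are what the glusterfs eventing framework pushes to a registered webhook. The report does not show how they were captured, so the listener below is only a minimal sketch under that assumption; the host, port and path are illustrative, and the webhook would be registered on the master cluster with something like: gluster-eventsapi webhook-add http://<listener-host>:9000/listen

#!/usr/bin/env python
# Minimal listener sketch (illustrative only): prints each glusterfs event
# pushed to the registered webhook, producing dictionaries like the ones in
# the transcript above (event, ts, nodeid, message).
import json
from BaseHTTPServer import HTTPServer, BaseHTTPRequestHandler  # Python 2

class EventHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.getheader('content-length', 0))
        event = json.loads(self.rfile.read(length))
        print(event)                 # e.g. {u'event': u'GEOREP_START', ...}
        self.send_response(200)
        self.end_headers()

if __name__ == '__main__':
    # Bind address and port are assumptions, not taken from the report.
    HTTPServer(('0.0.0.0', 9000), EventHandler).serve_forever()

Running something like this on a node reachable from the master cluster and registering it as a webhook is enough to watch the events above arrive in real time.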
We have a limitation in Geo-rep process management. When Geo-rep is stopped, SIGTERM/SIGKILL is sent to the running workers; while terminating, the workers update their status to Faulty and die. There is currently no way to differentiate a worker crash from a worker being killed by glusterd during Geo-rep stop. Preventing this event needs a major change in the Geo-rep process management infrastructure. This bug can be moved out to be fixed post 3.2.0.
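
Not part of the report, but to make the limitation above concrete: the workers have no record that the SIGTERM they received came from glusterd as part of "geo-rep stop", so their termination path can only report Faulty. A rough sketch of the kind of differentiation that would be needed is below; it is purely illustrative, and the names (set_monitor_status, do_work) are stand-ins rather than actual gsyncd interfaces.

# Hypothetical sketch only: how a worker could tell "killed by glusterd during
# geo-rep stop" apart from a genuine crash.  set_monitor_status() and do_work()
# are stand-ins, not actual gsyncd code.
import signal
import sys

graceful_stop = {'requested': False}

def handle_sigterm(signum, frame):
    # glusterd sends SIGTERM on "geo-replication ... stop"; record that the
    # shutdown was requested instead of treating it as a failure.
    graceful_stop['requested'] = True
    sys.exit(0)

def set_monitor_status(status):
    print("worker status -> %s" % status)   # stand-in for updating the status file

def do_work():
    signal.pause()                           # stand-in for the replication loop

def worker():
    signal.signal(signal.SIGTERM, handle_sigterm)
    try:
        do_work()
    except Exception:
        # A real crash: the only case that should become Faulty and emit
        # a GEOREP_FAULTY event.
        set_monitor_status('Faulty')
        raise
    finally:
        if graceful_stop['requested']:
            # Intentional stop: no Faulty status, hence no GEOREP_FAULTY event.
            set_monitor_status('Stopped')

if __name__ == '__main__':
    worker()

With something along these lines, an intentional stop requested through glusterd would not be recorded as Faulty, while unexpected worker deaths would still be.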