Bug 1385602

Summary: [Eventing]: GEOREP_FAULTY events seen for every brick of the master volume, when a georep session is stopped
Product: Red Hat Gluster Storage
Reporter: Sweta Anandpara <sanandpa>
Component: geo-replication
Assignee: Bug Updates Notification Mailing List <rhs-bugs>
Status: CLOSED WONTFIX
QA Contact: Rahul Hinduja <rhinduja>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.2
CC: amukherj, avishwan, csaba, rhs-bugs, storage-qa-internal, vbellur
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-11-19 06:35:31 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Sweta Anandpara 2016-10-17 11:51:21 UTC
Description of problem:
======================
There are 4-node master and slave clusters, with a 2x2 volume 'master' on the master cluster and a 2x2 volume 'slave' on the slave cluster. When the geo-rep session between them is stopped, the expected GEOREP_STOP event is preceded by 4 GEOREP_FAULTY events, implying that something is wrong with the session, which is not the case.

GEOREP_FAULTY events should be seen only when the georep session turns Faulty for some reason, and not any other time.

Version-Release number of selected component (if applicable):
=============================================================
3.8.4-2


How reproducible:
================
Always


Steps to Reproduce:
==================
1. Create 4-node master and slave clusters, with a 2x2 volume on each.
2. Establish a geo-rep session between the two and monitor the events seen on the master cluster side. GEOREP_CREATE and GEOREP_START are seen as expected.
3. Stop the geo-rep session and monitor the events again (one way to capture the events is sketched below).
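
For reference, the events can be captured on the consumer side with a minimal webhook receiver registered via `gluster-eventsapi webhook-add <url>`. The sketch below assumes the events framework is enabled on the cluster and that the receiver's host/port (0.0.0.0:9000) is reachable from the master nodes; it simply prints each event it receives, which is the form of the payloads pasted under Additional info.

# Minimal sketch of a glustereventsd webhook receiver (host/port are arbitrary).
# Register it on the cluster with:  gluster-eventsapi webhook-add http://<this-host>:9000/
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class EventHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # glustereventsd POSTs each event as a JSON document
        length = int(self.headers.get("Content-Length", 0))
        event = json.loads(self.rfile.read(length).decode("utf-8"))
        print(event)  # e.g. {'event': 'GEOREP_STOP', 'ts': ..., 'message': {...}}
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 9000), EventHandler).serve_forever()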

Actual results:
===============
Step 3 results in 4 GEOREP_FAULTY events (one for every brick of the master volume), followed by a GEOREP_STOP event


Expected results:
=================
Step 3 should result in a GEOREP_STOP event ONLY


Additional info:
================
[root@dhcp46-239 ~]# gluster system:: execute gsec_create
Common secret pub file present at /var/lib/glusterd/geo-replication/common_secret.pem.pub
[root@dhcp46-239 ~]# gluster volume geo-replication master 10.70.35.115::slave create push-pem
Creating geo-replication session between master & 10.70.35.115::slave has been successful
[root@dhcp46-239 ~]# 

{u'message': {u'slave': u'10.70.35.115::slave', u'master': u'master', u'push_pem': u'1'}, u'event': u'GEOREP_CREATE', u'ts': 1476703797, u'nodeid': u'ed362eb3-421c-4a25-ad0e-82ef157ea328'}

-------------------------------------------------------------------------------------------------------------------------
[root@dhcp46-239 ~]# gluster volume geo-replication master 10.70.35.115::slave config use_meta_volume true
geo-replication config updated successfully
[root@dhcp46-239 ~]#

{u'message': {u'slave': u'10.70.35.115::slave', u'master': u'master', u'option': u'use_meta_volume', u'value': u'true'}, u'event': u'GEOREP_CONFIG_SET', u'ts': 1476703859, u'nodeid': u'ed362eb3-421c-4a25-ad0e-82ef157ea328'}

--------------------------------------------------------------------------------------------------------------------------

[root@dhcp46-239 ~]# gluster volume geo-replication master 10.70.35.115::slave start
Starting geo-replication session between master & 10.70.35.115::slave has been successful
[root@dhcp46-239 ~]# 
[root@dhcp46-239 ~]# 
[root@dhcp46-239 ~]# gluster volume geo-replication master 10.70.35.115::slave status
 
MASTER NODE     MASTER VOL    MASTER BRICK              SLAVE USER    SLAVE                  SLAVE NODE      STATUS     CRAWL STATUS     LAST_SYNCED                  
-----------------------------------------------------------------------------------------------------------------------------------------------------------
10.70.46.239    master        /bricks/brick0/master1    root          10.70.35.115::slave    10.70.35.115    Active     History Crawl    2016-10-17 15:53:12          
10.70.46.218    master        /bricks/brick0/master4    root          10.70.35.115::slave    10.70.35.100    Active     History Crawl    2016-10-17 15:53:05          
10.70.46.240    master        /bricks/brick0/master2    root          10.70.35.115::slave    10.70.35.104    Passive    N/A              N/A                          
10.70.46.242    master        /bricks/brick0/master3    root          10.70.35.115::slave    10.70.35.101    Passive    N/A              N/A                          
[root@dhcp46-239 ~]# 
[root@dhcp46-239 ~]# 

{u'message': {u'slave': u'10.70.35.115::slave', u'master': u'master'}, u'event': u'GEOREP_START', u'ts': 1476703973, u'nodeid': u'ed362eb3-421c-4a25-ad0e-82ef157ea328'}

-------------------------------------------------------------------------------------------------------------------------------

[root@dhcp46-239 ~]# gluster volume geo-replication master 10.70.35.115::slave stop
Stopping geo-replication session between master & 10.70.35.115::slave has been successful
[root@dhcp46-239 ~]#

{u'message': {u'current_slave_host': u'10.70.35.115', u'master_node': u'10.70.46.239', u'brick_path': u'/bricks/brick0/master1', u'slave_host': u'10.70.35.115', u'master_volume': u'master', u'slave_volume': u'slave'}, u'event': u'GEOREP_FAULTY', u'ts': 1476704008, u'nodeid': u'ed362eb3-421c-4a25-ad0e-82ef157ea328'}
{u'message': {u'current_slave_host': u'10.70.35.100', u'master_node': u'10.70.46.218', u'brick_path': u'/bricks/brick0/master4', u'slave_host': u'10.70.35.115', u'master_volume': u'master', u'slave_volume': u'slave'}, u'event': u'GEOREP_FAULTY', u'ts': 1476704010, u'nodeid': u'0dea52e0-8c32-4616-8ef8-16db16120eaa'}
{u'message': {u'current_slave_host': u'10.70.35.101', u'master_node': u'10.70.46.242', u'brick_path': u'/bricks/brick0/master3', u'slave_host': u'10.70.35.115', u'master_volume': u'master', u'slave_volume': u'slave'}, u'event': u'GEOREP_FAULTY', u'ts': 1476704010, u'nodeid': u'1e8967ae-51b2-4c27-907e-a22a83107fd0'}
{u'message': {u'current_slave_host': u'10.70.35.104', u'master_node': u'10.70.46.240', u'brick_path': u'/bricks/brick0/master2', u'slave_host': u'10.70.35.115', u'master_volume': u'master', u'slave_volume': u'slave'}, u'event': u'GEOREP_FAULTY', u'ts': 1476704010, u'nodeid': u'72c4f894-61f7-433e-a546-4ad2d7f0a176'}
{u'message': {u'slave': u'10.70.35.115::slave', u'master': u'master'}, u'event': u'GEOREP_STOP', u'ts': 1476704011, u'nodeid': u'ed362eb3-421c-4a25-ad0e-82ef157ea328'}
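
Until the root cause is addressed, one possible consumer-side workaround (purely hypothetical, not part of the product) is to hold GEOREP_FAULTY events briefly and drop them when a GEOREP_STOP for the same master/slave pair arrives shortly after, as in the timestamps above. The field names used below are the ones visible in the payloads in this report; the 5-second window is an arbitrary assumption.

# Hypothetical consumer-side filter: suppress GEOREP_FAULTY events that are
# immediately followed by a GEOREP_STOP for the same session. This only hides
# the spurious events at the consumer; it does not fix geo-rep itself.
import time

WINDOW = 5                # seconds; arbitrary choice for this sketch
pending_faulty = []       # [(received_at, event), ...]

def session_key(event):
    msg = event["message"]
    # GEOREP_FAULTY carries master_volume/slave_volume;
    # GEOREP_STOP carries master and slave ("<host>::<volume>")
    master = msg.get("master_volume") or msg.get("master")
    slave = msg.get("slave_volume") or msg.get("slave", "").split("::")[-1]
    return (master, slave)

def handle(event):
    now = time.time()
    if event["event"] == "GEOREP_FAULTY":
        pending_faulty.append((now, event))
        return
    if event["event"] == "GEOREP_STOP":
        key = session_key(event)
        # drop FAULTY events for the same session seen within the window
        pending_faulty[:] = [(t, e) for t, e in pending_faulty
                             if session_key(e) != key or now - t > WINDOW]
    flush(now)
    process(event)

def flush(now):
    # emit FAULTY events that were not followed by a STOP in time
    # (a real consumer would also call this periodically, not only on new events)
    for t, e in list(pending_faulty):
        if now - t > WINDOW:
            pending_faulty.remove((t, e))
            process(e)

def process(event):
    print(event)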

Comment 2 Aravinda VK 2016-10-25 08:14:41 UTC
We have a limitation in geo-rep process management. When geo-rep is stopped, SIGTERM/SIGKILL is sent to the running workers; while terminating, the workers update their status as Faulty and die. At the moment there is no way to differentiate a worker crash from a worker being killed by glusterd during geo-rep stop.

Preventing this event needs a major change in the geo-rep process management infrastructure. This bug can be moved to be fixed post 3.2.0.
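
To illustrate the kind of change implied above, here is a conceptual sketch only (not the actual gsyncd code; the Worker class and its methods are made up for illustration): the worker could record that a SIGTERM was received so that its termination path reports a clean stop instead of Faulty. This would not cover the SIGKILL case, which cannot be trapped.

# Conceptual sketch: distinguish "killed by glusterd during stop" (SIGTERM)
# from a genuine crash, so a clean stop need not emit GEOREP_FAULTY.
# SIGKILL cannot be intercepted, so that path would still need other handling.
import signal

class Worker:
    def __init__(self):
        self.stop_requested = False
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        # glusterd sends SIGTERM as part of "geo-replication ... stop"
        self.stop_requested = True
        raise SystemExit(0)

    def shutdown(self):
        # Today the worker unconditionally reports Faulty on termination;
        # with the flag, a requested stop could report a non-Faulty status.
        self.report_status("Stopped" if self.stop_requested else "Faulty")

    def report_status(self, status):
        print("worker status:", status)

if __name__ == "__main__":
    w = Worker()
    try:
        signal.pause()          # wait until a signal arrives (Unix only)
    except SystemExit:
        pass
    finally:
        w.shutdown()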