Bug 1385602 - [Eventing]: GEOREP_FAULTY events seen for every brick of the master volume, when a georep session is stopped
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: geo-replication
Version: rhgs-3.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Bug Updates Notification Mailing List
QA Contact: Rahul Hinduja
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-10-17 11:51 UTC by Sweta Anandpara
Modified: 2018-11-19 06:35 UTC (History)
6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-11-19 06:35:31 UTC
Target Upstream Version:



Description Sweta Anandpara 2016-10-17 11:51:21 UTC
Description of problem:
======================
Have 4-node master and slave clusters, with the volumes 'master' (2*2) and 'slave' (2*2) created on them. When a geo-rep session is stopped, the expected GEOREP_STOP event is preceded by 4 GEOREP_FAULTY events, implying that something is wrong with the session, which is not true.

GEOREP_FAULTY events should be seen only when the georep session turns Faulty for some reason, and not at any other time.

Version-Release number of selected component (if applicable):
=============================================================
3.8.4-2


How reproducible:
================
Always


Steps to Reproduce:
==================
1. Have 4-node master and slave clusters created, with a 2*2 volume created on each.
2. Establish a georep session between the two and monitor the events seen on the master cluster side (for example with a webhook receiver such as the sketch after this list). GEOREP_CREATE and GEOREP_START are seen as expected.
3. Stop the georep session and monitor the events again.
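
(For reference, events can be monitored by registering a webhook with the eventing framework, e.g. gluster-eventsapi webhook-add http://<listener-host>:9000/listen, and running a small receiver. The sketch below is illustrative only; the port, the /listen path and the script are arbitrary choices, not product code.)

# Minimal, illustrative webhook receiver for Gluster events (not product code).
# Each event arrives as an HTTP POST with a JSON body: event, ts, nodeid, message.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class EventListener(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get('Content-Length', 0))
        body = self.rfile.read(length)
        try:
            print(json.loads(body))   # e.g. {'event': 'GEOREP_STOP', ...}
        except ValueError:
            print(body)               # fall back to printing the raw payload
        self.send_response(200)
        self.end_headers()

    def log_message(self, fmt, *args):
        pass                          # keep the output limited to the events

if __name__ == '__main__':
    HTTPServer(('0.0.0.0', 9000), EventListener).serve_forever()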

Actual results:
===============
Step 3 results in 4 GEOREP_FAULTY events (one for every brick of the master volume), followed by a GEOREP_STOP event


Expected results:
=================
Step 3 should result in a GEOREP_STOP event ONLY


Additional info:
================
[root@dhcp46-239 ~]# gluster system:: execute gsec_create
Common secret pub file present at /var/lib/glusterd/geo-replication/common_secret.pem.pub
[root@dhcp46-239 ~]# gluster volume geo-replication master 10.70.35.115::slave create push-pem
Creating geo-replication session between master & 10.70.35.115::slave has been successful
[root@dhcp46-239 ~]# 

{u'message': {u'slave': u'10.70.35.115::slave', u'master': u'master', u'push_pem': u'1'}, u'event': u'GEOREP_CREATE', u'ts': 1476703797, u'nodeid': u'ed362eb3-421c-4a25-ad0e-82ef157ea328'}

-------------------------------------------------------------------------------------------------------------------------
[root@dhcp46-239 ~]# gluster volume geo-replication master 10.70.35.115::slave config use_meta_volume true
geo-replication config updated successfully
[root@dhcp46-239 ~]#

{u'message': {u'slave': u'10.70.35.115::slave', u'master': u'master', u'option': u'use_meta_volume', u'value': u'true'}, u'event': u'GEOREP_CONFIG_SET', u'ts': 1476703859, u'nodeid': u'ed362eb3-421c-4a25-ad0e-82ef157ea328'}

--------------------------------------------------------------------------------------------------------------------------

[root@dhcp46-239 ~]# gluster volume geo-replication master 10.70.35.115::slave start
Starting geo-replication session between master & 10.70.35.115::slave has been successful
[root@dhcp46-239 ~]# 
[root@dhcp46-239 ~]# 
[root@dhcp46-239 ~]# gluster volume geo-replication master 10.70.35.115::slave status
 
MASTER NODE     MASTER VOL    MASTER BRICK              SLAVE USER    SLAVE                  SLAVE NODE      STATUS     CRAWL STATUS     LAST_SYNCED                  
-----------------------------------------------------------------------------------------------------------------------------------------------------------
10.70.46.239    master        /bricks/brick0/master1    root          10.70.35.115::slave    10.70.35.115    Active     History Crawl    2016-10-17 15:53:12          
10.70.46.218    master        /bricks/brick0/master4    root          10.70.35.115::slave    10.70.35.100    Active     History Crawl    2016-10-17 15:53:05          
10.70.46.240    master        /bricks/brick0/master2    root          10.70.35.115::slave    10.70.35.104    Passive    N/A              N/A                          
10.70.46.242    master        /bricks/brick0/master3    root          10.70.35.115::slave    10.70.35.101    Passive    N/A              N/A                          
[root@dhcp46-239 ~]# 
[root@dhcp46-239 ~]# 

{u'message': {u'slave': u'10.70.35.115::slave', u'master': u'master'}, u'event': u'GEOREP_START', u'ts': 1476703973, u'nodeid': u'ed362eb3-421c-4a25-ad0e-82ef157ea328'}

-------------------------------------------------------------------------------------------------------------------------------

[root@dhcp46-239 ~]# gluster volume geo-replication master 10.70.35.115::slave stop
Stopping geo-replication session between master & 10.70.35.115::slave has been successful
[root@dhcp46-239 ~]#

{u'message': {u'current_slave_host': u'10.70.35.115', u'master_node': u'10.70.46.239', u'brick_path': u'/bricks/brick0/master1', u'slave_host': u'10.70.35.115', u'master_volume': u'master', u'slave_volume': u'slave'}, u'event': u'GEOREP_FAULTY', u'ts': 1476704008, u'nodeid': u'ed362eb3-421c-4a25-ad0e-82ef157ea328'}
{u'message': {u'current_slave_host': u'10.70.35.100', u'master_node': u'10.70.46.218', u'brick_path': u'/bricks/brick0/master4', u'slave_host': u'10.70.35.115', u'master_volume': u'master', u'slave_volume': u'slave'}, u'event': u'GEOREP_FAULTY', u'ts': 1476704010, u'nodeid': u'0dea52e0-8c32-4616-8ef8-16db16120eaa'}
{u'message': {u'current_slave_host': u'10.70.35.101', u'master_node': u'10.70.46.242', u'brick_path': u'/bricks/brick0/master3', u'slave_host': u'10.70.35.115', u'master_volume': u'master', u'slave_volume': u'slave'}, u'event': u'GEOREP_FAULTY', u'ts': 1476704010, u'nodeid': u'1e8967ae-51b2-4c27-907e-a22a83107fd0'}
{u'message': {u'current_slave_host': u'10.70.35.104', u'master_node': u'10.70.46.240', u'brick_path': u'/bricks/brick0/master2', u'slave_host': u'10.70.35.115', u'master_volume': u'master', u'slave_volume': u'slave'}, u'event': u'GEOREP_FAULTY', u'ts': 1476704010, u'nodeid': u'72c4f894-61f7-433e-a546-4ad2d7f0a176'}
{u'message': {u'slave': u'10.70.35.115::slave', u'master': u'master'}, u'event': u'GEOREP_STOP', u'ts': 1476704011, u'nodeid': u'ed362eb3-421c-4a25-ad0e-82ef157ea328'}

Comment 2 Aravinda VK 2016-10-25 08:14:41 UTC
We have a limitation in Geo-rep process management. When Geo-rep is stopped, SIGTERM/SIGKILL is sent to the running workers. While terminating, each worker updates its status to Faulty and dies. At the moment there is no way to differentiate a worker crash from a worker being killed by glusterd during Geo-rep stop.

To prevent this event, we need a major change in the Geo-rep process management infrastructure. This bug can be moved to be fixed post 3.2.0.
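
Purely to illustrate the kind of change meant here, a minimal sketch of a worker that remembers whether its SIGTERM came from a deliberate stop and suppresses the Faulty report in that case. All names (worker_loop, emit_faulty_event, stop_requested) are hypothetical, not gsyncd code, and the SIGKILL path obviously cannot be intercepted at all.

# Hypothetical sketch, not gsyncd code: distinguish "killed by glusterd on
# geo-rep stop" (SIGTERM) from a genuine worker crash before reporting Faulty.
import signal
import time

stop_requested = False            # set when a deliberate stop is signalled

def handle_sigterm(signum, frame):
    global stop_requested
    stop_requested = True         # remember that this is an intentional shutdown
    raise SystemExit(0)           # unwind the worker loop cleanly

def emit_faulty_event():
    print("GEOREP_FAULTY")        # placeholder for the real status/event update

def worker_loop():
    signal.signal(signal.SIGTERM, handle_sigterm)
    try:
        while True:
            time.sleep(1)         # placeholder for the actual sync/crawl work
    finally:
        if not stop_requested:    # real crash: report Faulty as today
            emit_faulty_event()

if __name__ == '__main__':
    worker_loop()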

