Description of problem:
=======================
After upgrading the nodes from RHEL 7.2 to RHEL 7.3, a reboot was required because of a kernel update. After the reboot, the bricks from one of the nodes were not listed in the geo-replication status command, even though the peer was in the Connected state and all of its bricks were online. On checking the geo-replication directory, the node was missing monitor.pid. Creating the file with touch resolved the issue.

1. It is not clear what removed monitor.pid. From the user's perspective, the only operation performed was a reboot of the whole cluster at once.
2. Even if the file is removed, gsyncd should handle the ENOENT case (a minimal sketch of such handling follows the reproducibility note below).

Initial:
========
[root@dhcp37-81 ~]# gluster volume geo-replication master 10.70.37.80::slave status

MASTER NODE     MASTER VOL    MASTER BRICK      SLAVE USER    SLAVE                 SLAVE NODE      STATUS     CRAWL STATUS       LAST_SYNCED
----------------------------------------------------------------------------------------------------------------------------------------------------
10.70.37.81     master        /rhs/brick1/b1    root          10.70.37.80::slave    10.70.37.80     Active     Changelog Crawl    2016-08-22 16:11:07
10.70.37.81     master        /rhs/brick2/b4    root          10.70.37.80::slave    10.70.37.208    Active     Changelog Crawl    2016-08-22 16:11:11
10.70.37.200    master        /rhs/brick1/b3    root          10.70.37.80::slave    10.70.37.80     Passive    N/A                N/A
10.70.37.200    master        /rhs/brick2/b6    root          10.70.37.80::slave    10.70.37.208    Active     Changelog Crawl    2016-08-22 16:10:59
10.70.37.100    master        /rhs/brick1/b2    root          10.70.37.80::slave    10.70.37.208    Passive    N/A                N/A
10.70.37.100    master        /rhs/brick2/b5    root          10.70.37.80::slave    10.70.37.80     Passive    N/A                N/A
[root@dhcp37-81 ~]#

After the RHEL platform is updated and rebooted:
================================================
[root@dhcp37-81 ~]# gluster volume geo-replication master 10.70.37.80::slave status

MASTER NODE     MASTER VOL    MASTER BRICK      SLAVE USER    SLAVE                 SLAVE NODE      STATUS     CRAWL STATUS       LAST_SYNCED
----------------------------------------------------------------------------------------------------------------------------------------------------
10.70.37.81     master        /rhs/brick1/b1    root          10.70.37.80::slave    10.70.37.80     Passive    N/A                N/A
10.70.37.81     master        /rhs/brick2/b4    root          10.70.37.80::slave    10.70.37.208    Active     Changelog Crawl    2016-08-22 16:11:11
10.70.37.100    master        /rhs/brick1/b2    root          10.70.37.80::slave    10.70.37.208    Active     Changelog Crawl    2016-08-22 16:11:04
10.70.37.100    master        /rhs/brick2/b5    root          10.70.37.80::slave    10.70.37.80     Active     Changelog Crawl    2016-08-22 16:10:59
[root@dhcp37-81 ~]#

Peer and bricks of node 200 are all online:
===========================================
[root@dhcp37-81 ~]# gluster peer status
Number of Peers: 2

Hostname: 10.70.37.100
Uuid: 951c7434-89c2-4a66-a224-f3c2e5c7b06a
State: Peer in Cluster (Connected)

Hostname: 10.70.37.200
Uuid: db8ede6b-99b2-4369-8e65-8dd4d2fa54dc
State: Peer in Cluster (Connected)
[root@dhcp37-81 ~]#

[root@dhcp37-81 ~]# gluster volume status master
Status of volume: master
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.81:/rhs/brick1/b1            49152     0          Y       1639
Brick 10.70.37.100:/rhs/brick1/b2           49152     0          Y       1672
Brick 10.70.37.200:/rhs/brick1/b3           49152     0          Y       1683
Brick 10.70.37.81:/rhs/brick2/b4            49153     0          Y       1662
Brick 10.70.37.100:/rhs/brick2/b5           49153     0          Y       1673
Brick 10.70.37.200:/rhs/brick2/b6           49153     0          Y       1678
Snapshot Daemon on localhost                49155     0          Y       1776
NFS Server on localhost                     2049      0          Y       1701
Self-heal Daemon on localhost               N/A       N/A        Y       1709
Quota Daemon on localhost                   N/A       N/A        Y       1718
Snapshot Daemon on 10.70.37.100             49155     0          Y       1798
NFS Server on 10.70.37.100                  2049      0          Y       1729
Self-heal Daemon on 10.70.37.100            N/A       N/A        Y       1737
Quota Daemon on 10.70.37.100                N/A       N/A        Y       1745
Snapshot Daemon on 10.70.37.200             49155     0          Y       1817
NFS Server on 10.70.37.200                  2049      0          Y       1644
Self-heal Daemon on 10.70.37.200            N/A       N/A        Y       1649
Quota Daemon on 10.70.37.200                N/A       N/A        Y       1664

Task Status of Volume master
------------------------------------------------------------------------------
There are no active volume tasks

[root@dhcp37-81 ~]#

Problematic node 200:
=====================
[root@dhcp37-200 ~]# python /usr/libexec/glusterfs/python/syncdaemon/gsyncd.py -c /var/lib/glusterd/geo-replication/master_10.70.37.80_slave/gsyncd.conf --status-get :master 10.70.37.80::slave --path /rhs/brick1/b3/
[2016-08-23 08:56:10.248389] E [syncdutils:276:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 201, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 681, in main_i
    brick_status.print_status(checkpoint_time=checkpoint_time)
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncdstatus.py", line 343, in print_status
    for key, value in self.get_status(checkpoint_time).items():
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncdstatus.py", line 262, in get_status
    with open(self.monitor_pid_file, "r+") as f:
IOError: [Errno 2] No such file or directory: '/var/lib/glusterd/geo-replication/master_10.70.37.80_slave/monitor.pid'
failed with IOError.

[root@dhcp37-200 ~]# ls /var/lib/glusterd/geo-replication/
gsyncd_template.conf  master_10.70.37.80_slave/  secret.pem  secret.pem.pub  tar_ssh.pem  tar_ssh.pem.pub

[root@dhcp37-200 ~]# ls /var/lib/glusterd/geo-replication/master_10.70.37.80_slave/
brick_%2Frhs%2Fbrick1%2Fb3%2F.status  brick_%2Frhs%2Fbrick1%2Fb3.status  brick_%2Frhs%2Fbrick2%2Fb6.status  gsyncd.conf  monitor.status

[root@dhcp37-200 ~]# touch /var/lib/glusterd/geo-replication/master_10.70.37.80_slave/monitor.pid
[root@dhcp37-200 ~]# ls /var/lib/glusterd/geo-replication/master_10.70.37.80_slave/
brick_%2Frhs%2Fbrick1%2Fb3%2F.status  brick_%2Frhs%2Fbrick1%2Fb3.status  brick_%2Frhs%2Fbrick2%2Fb6.status  gsyncd.conf  monitor.pid  monitor.status
[root@dhcp37-200 ~]#

After touch, status lists the node's bricks, but as Stopped:
============================================================
[root@dhcp37-81 ~]# gluster volume geo-replication master 10.70.37.80::slave status

MASTER NODE     MASTER VOL    MASTER BRICK      SLAVE USER    SLAVE                 SLAVE NODE      STATUS     CRAWL STATUS       LAST_SYNCED
----------------------------------------------------------------------------------------------------------------------------------------------------
10.70.37.81     master        /rhs/brick1/b1    root          10.70.37.80::slave    10.70.37.80     Passive    N/A                N/A
10.70.37.81     master        /rhs/brick2/b4    root          10.70.37.80::slave    10.70.37.208    Active     Changelog Crawl    2016-08-22 16:11:11
10.70.37.100    master        /rhs/brick1/b2    root          10.70.37.80::slave    10.70.37.208    Active     Changelog Crawl    2016-08-22 16:11:04
10.70.37.100    master        /rhs/brick2/b5    root          10.70.37.80::slave    10.70.37.80     Active     Changelog Crawl    2016-08-22 16:10:59
10.70.37.200    master        /rhs/brick1/b3    root          10.70.37.80::slave    N/A             Stopped    N/A                N/A
10.70.37.200    master        /rhs/brick2/b6    root          10.70.37.80::slave    N/A             Stopped    N/A                N/A
[root@dhcp37-81 ~]#

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-geo-replication-3.7.9-10.el7rhgs.x86_64
glusterfs-3.7.9-10.el7rhgs.x86_64

How reproducible:
=================
Reboot cases were tried in 3.1.3 and this issue was not seen.
However, this case was different in that all the nodes in the cluster were brought offline at the same time while geo-replication was in the Started state; a kind of negative test.
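The traceback above shows get_status() in gsyncdstatus.py opening monitor.pid unconditionally, so a missing file aborts the whole status listing. The following is a minimal sketch of the kind of guard the fix needs, not the actual upstream patch (that is linked in the comments below); the helper name monitor_status is hypothetical.

import os
from errno import ENOENT, ESRCH

# Sketch only: map a missing monitor.pid (ENOENT) or a stale pid
# (ESRCH from the probe) to "Stopped" instead of letting the
# IOError escape and abort the status command.
def monitor_status(monitor_pid_file):
    try:
        with open(monitor_pid_file, "r+") as f:
            pid = f.read().strip()
            if not pid:
                return "Stopped"    # file exists but records no pid
            os.kill(int(pid), 0)    # signal 0 only probes the process
            return "Started"
    except (IOError, OSError) as e:
        if e.errno in (ENOENT, ESRCH):
            return "Stopped"
        raise

With a guard like this, a node whose monitor.pid disappears would report its bricks as Stopped, matching the post-touch behaviour shown above, instead of dropping them from the listing entirely.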
Upstream patch sent to fix the gsyncdstatus.py traceback: http://review.gluster.org/15416
Upstream mainline: http://review.gluster.org/15416
Upstream 3.8: http://review.gluster.org/15448
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/85005
Verified with the build:
glusterfs-geo-replication-3.8.4-14.el6rhs.x86_64

Since the original issue is not reproducible on demand, the scenario was simulated by moving monitor.pid aside. The status now correctly shows the affected bricks as Stopped instead of not showing them at all.

[root@rhel6-1 ~]# ls /var/lib/glusterd/geo-replication/master_10.70.37.56_slave/monitor.pid
/var/lib/glusterd/geo-replication/master_10.70.37.56_slave/monitor.pid
[root@rhel6-1 ~]#
[root@rhel6-1 ~]# mv /var/lib/glusterd/geo-replication/master_10.70.37.56_slave/monitor.pid /var/lib/glusterd/geo-replication/
[root@rhel6-1 ~]# gluster volume geo-replication master 10.70.37.56::slave status

MASTER NODE     MASTER VOL    MASTER BRICK                     SLAVE USER    SLAVE                 SLAVE NODE      STATUS     CRAWL STATUS       LAST_SYNCED
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
10.70.37.94     master        /bricks/brick0/master_brick0     root          10.70.37.56::slave    N/A             Stopped    N/A                N/A
10.70.37.94     master        /bricks/brick1/master_brick4     root          10.70.37.56::slave    N/A             Stopped    N/A                N/A
10.70.37.94     master        /bricks/brick2/master_brick8     root          10.70.37.56::slave    N/A             Stopped    N/A                N/A
10.70.37.157    master        /bricks/brick0/master_brick1     root          10.70.37.56::slave    10.70.37.205    Active     Changelog Crawl    2017-02-19 14:59:11
10.70.37.157    master        /bricks/brick1/master_brick5     root          10.70.37.56::slave    10.70.37.205    Active     Changelog Crawl    2017-02-19 14:59:22
10.70.37.157    master        /bricks/brick2/master_brick9     root          10.70.37.56::slave    10.70.37.205    Active     Changelog Crawl    2017-02-19 14:59:22
10.70.37.41     master        /bricks/brick0/master_brick3     root          10.70.37.56::slave    10.70.37.200    Active     Changelog Crawl    2017-02-19 14:59:08
10.70.37.41     master        /bricks/brick1/master_brick7     root          10.70.37.56::slave    10.70.37.200    Active     Changelog Crawl    2017-02-19 14:59:18
10.70.37.41     master        /bricks/brick2/master_brick11    root          10.70.37.56::slave    10.70.37.200    Active     Changelog Crawl    2017-02-19 14:59:14
10.70.37.199    master        /bricks/brick0/master_brick2     root          10.70.37.56::slave    10.70.37.63     Passive    N/A                N/A
10.70.37.199    master        /bricks/brick1/master_brick6     root          10.70.37.56::slave    10.70.37.63     Passive    N/A                N/A
10.70.37.199    master        /bricks/brick2/master_brick10    root          10.70.37.56::slave    10.70.37.63     Passive    N/A                N/A
[root@rhel6-1 ~]#

[root@rhel6-1 ~]# python /usr/libexec/glusterfs/python/syncdaemon/gsyncd.py -c /var/lib/glusterd/geo-replication/master_10.70.37.56_slave/gsyncd.conf --status-get :master 10.70.37.56::slave --path /bricks/brick0/master_brick0
checkpoint_time: N/A
last_synced_utc: N/A
checkpoint_completion_time_utc: N/A
checkpoint_completed: N/A
meta: N/A
entry: N/A
slave_node: N/A
data: N/A
worker_status: Stopped
checkpoint_completion_time: N/A
checkpoint_completed_time: N/A
last_synced: N/A
checkpoint_time_utc: N/A
failures: N/A
crawl_status: N/A
[root@rhel6-1 ~]#
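For regression runs, the same simulation can be scripted. The following is a hypothetical helper, not a shipped test; it reuses the exact gsyncd.py arguments and session paths from the transcript above and assumes it runs as root on the master node.

import shutil
import subprocess

# Hypothetical regression check: move monitor.pid aside, then assert
# that --status-get degrades to worker_status "Stopped" instead of
# tracing back. Session path and arguments come from the transcript.
SESSION = "/var/lib/glusterd/geo-replication/master_10.70.37.56_slave"
GSYNCD = "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py"

shutil.move(SESSION + "/monitor.pid",
            "/var/lib/glusterd/geo-replication/monitor.pid")

out = subprocess.check_output(
    ["python", GSYNCD, "-c", SESSION + "/gsyncd.conf",
     "--status-get", ":master", "10.70.37.56::slave",
     "--path", "/bricks/brick0/master_brick0"]).decode()
assert "worker_status: Stopped" in out, out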
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html