+++ This bug was initially created as a clone of Bug #1373741 +++

+++ This bug was initially created as a clone of Bug #1369384 +++

Description of problem:
=======================
After upgrading the nodes from RHEL 7.2 to RHEL 7.3, a reboot was required due to a kernel update. After the reboot, the bricks from one of the nodes were not listed in the geo-replication status command output, even though the peer was in the Connected state and all of its bricks were online. On checking the geo-replication directory on that node, monitor.pid was missing; creating it with touch resolved the issue.

1. It is not clear what caused monitor.pid to be removed. From the user's perspective, the only operation performed was a reboot of the whole cluster at once.
2. Even if the file is removed, the ENOENT case should be handled gracefully.

Initial:
========
[root@dhcp37-81 ~]# gluster volume geo-replication master 10.70.37.80::slave status

MASTER NODE     MASTER VOL    MASTER BRICK      SLAVE USER    SLAVE                 SLAVE NODE      STATUS     CRAWL STATUS       LAST_SYNCED
----------------------------------------------------------------------------------------------------------------------------------------------------
10.70.37.81     master        /rhs/brick1/b1    root          10.70.37.80::slave    10.70.37.80     Active     Changelog Crawl    2016-08-22 16:11:07
10.70.37.81     master        /rhs/brick2/b4    root          10.70.37.80::slave    10.70.37.208    Active     Changelog Crawl    2016-08-22 16:11:11
10.70.37.200    master        /rhs/brick1/b3    root          10.70.37.80::slave    10.70.37.80     Passive    N/A                N/A
10.70.37.200    master        /rhs/brick2/b6    root          10.70.37.80::slave    10.70.37.208    Active     Changelog Crawl    2016-08-22 16:10:59
10.70.37.100    master        /rhs/brick1/b2    root          10.70.37.80::slave    10.70.37.208    Passive    N/A                N/A
10.70.37.100    master        /rhs/brick2/b5    root          10.70.37.80::slave    10.70.37.80     Passive    N/A                N/A
[root@dhcp37-81 ~]#

After RHEL Platform is updated and rebooted:
============================================
[root@dhcp37-81 ~]# gluster volume geo-replication master 10.70.37.80::slave status

MASTER NODE     MASTER VOL    MASTER BRICK      SLAVE USER    SLAVE                 SLAVE NODE      STATUS     CRAWL STATUS       LAST_SYNCED
----------------------------------------------------------------------------------------------------------------------------------------------------
10.70.37.81     master        /rhs/brick1/b1    root          10.70.37.80::slave    10.70.37.80     Passive    N/A                N/A
10.70.37.81     master        /rhs/brick2/b4    root          10.70.37.80::slave    10.70.37.208    Active     Changelog Crawl    2016-08-22 16:11:11
10.70.37.100    master        /rhs/brick1/b2    root          10.70.37.80::slave    10.70.37.208    Active     Changelog Crawl    2016-08-22 16:11:04
10.70.37.100    master        /rhs/brick2/b5    root          10.70.37.80::slave    10.70.37.80     Active     Changelog Crawl    2016-08-22 16:10:59
[root@dhcp37-81 ~]#

Peer and bricks of node 200 are all online:
===========================================
[root@dhcp37-81 ~]# gluster peer status
Number of Peers: 2

Hostname: 10.70.37.100
Uuid: 951c7434-89c2-4a66-a224-f3c2e5c7b06a
State: Peer in Cluster (Connected)

Hostname: 10.70.37.200
Uuid: db8ede6b-99b2-4369-8e65-8dd4d2fa54dc
State: Peer in Cluster (Connected)
[root@dhcp37-81 ~]#

[root@dhcp37-81 ~]# gluster volume status master
Status of volume: master
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.81:/rhs/brick1/b1            49152     0          Y       1639
Brick 10.70.37.100:/rhs/brick1/b2           49152     0          Y       1672
Brick 10.70.37.200:/rhs/brick1/b3           49152     0          Y       1683
Brick 10.70.37.81:/rhs/brick2/b4            49153     0          Y       1662
Brick 10.70.37.100:/rhs/brick2/b5           49153     0          Y       1673
Brick 10.70.37.200:/rhs/brick2/b6           49153     0          Y       1678
Snapshot Daemon on localhost                49155     0          Y       1776
NFS Server on localhost                     2049      0          Y       1701
Self-heal Daemon on localhost               N/A       N/A        Y       1709
Quota Daemon on localhost                   N/A       N/A        Y       1718
Snapshot Daemon on 10.70.37.100             49155     0          Y       1798
NFS Server on 10.70.37.100                  2049      0          Y       1729
Self-heal Daemon on 10.70.37.100            N/A       N/A        Y       1737
Quota Daemon on 10.70.37.100                N/A       N/A        Y       1745
Snapshot Daemon on 10.70.37.200             49155     0          Y       1817
NFS Server on 10.70.37.200                  2049      0          Y       1644
Self-heal Daemon on 10.70.37.200            N/A       N/A        Y       1649
Quota Daemon on 10.70.37.200                N/A       N/A        Y       1664

Task Status of Volume master
------------------------------------------------------------------------------
There are no active volume tasks

[root@dhcp37-81 ~]#

Problematic node 200:
=====================
[root@dhcp37-200 ~]# python /usr/libexec/glusterfs/python/syncdaemon/gsyncd.py -c /var/lib/glusterd/geo-replication/master_10.70.37.80_slave/gsyncd.conf --status-get :master 10.70.37.80::slave --path /rhs/brick1/b3/
[2016-08-23 08:56:10.248389] E [syncdutils:276:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 201, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 681, in main_i
    brick_status.print_status(checkpoint_time=checkpoint_time)
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncdstatus.py", line 343, in print_status
    for key, value in self.get_status(checkpoint_time).items():
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncdstatus.py", line 262, in get_status
    with open(self.monitor_pid_file, "r+") as f:
IOError: [Errno 2] No such file or directory: '/var/lib/glusterd/geo-replication/master_10.70.37.80_slave/monitor.pid'
failed with IOError.

[root@dhcp37-200 ~]# ls /var/lib/glusterd/geo-replication/
gsyncd_template.conf  master_10.70.37.80_slave/  secret.pem  secret.pem.pub  tar_ssh.pem  tar_ssh.pem.pub

[root@dhcp37-200 ~]# ls /var/lib/glusterd/geo-replication/master_10.70.37.80_slave/
brick_%2Frhs%2Fbrick1%2Fb3%2F.status  brick_%2Frhs%2Fbrick1%2Fb3.status  brick_%2Frhs%2Fbrick2%2Fb6.status  gsyncd.conf  monitor.status

[root@dhcp37-200 ~]# touch /var/lib/glusterd/geo-replication/master_10.70.37.80_slave/monitor.pid

[root@dhcp37-200 ~]# ls /var/lib/glusterd/geo-replication/master_10.70.37.80_slave/
brick_%2Frhs%2Fbrick1%2Fb3%2F.status  brick_%2Frhs%2Fbrick1%2Fb3.status  brick_%2Frhs%2Fbrick2%2Fb6.status  gsyncd.conf  monitor.pid  monitor.status
[root@dhcp37-200 ~]#

After touch, status shows the node, but as Stopped:
===================================================
[root@dhcp37-81 ~]# gluster volume geo-replication master 10.70.37.80::slave status

MASTER NODE     MASTER VOL    MASTER BRICK      SLAVE USER    SLAVE                 SLAVE NODE      STATUS     CRAWL STATUS       LAST_SYNCED
----------------------------------------------------------------------------------------------------------------------------------------------------
10.70.37.81     master        /rhs/brick1/b1    root          10.70.37.80::slave    10.70.37.80     Passive    N/A                N/A
10.70.37.81     master        /rhs/brick2/b4    root          10.70.37.80::slave    10.70.37.208    Active     Changelog Crawl    2016-08-22 16:11:11
10.70.37.100    master        /rhs/brick1/b2    root          10.70.37.80::slave    10.70.37.208    Active     Changelog Crawl    2016-08-22 16:11:04
10.70.37.100    master        /rhs/brick2/b5    root          10.70.37.80::slave    10.70.37.80     Active     Changelog Crawl    2016-08-22 16:10:59
10.70.37.200    master        /rhs/brick1/b3    root          10.70.37.80::slave    N/A             Stopped    N/A                N/A
10.70.37.200    master        /rhs/brick2/b6    root          10.70.37.80::slave    N/A             Stopped    N/A                N/A
[root@dhcp37-81 ~]#

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-geo-replication-3.7.9-10.el7rhgs.x86_64
glusterfs-3.7.9-10.el7rhgs.x86_64

How reproducible:
=================
This case was different in the sense that all the nodes in the cluster were brought offline at the same time while geo-replication was in the Started state; a kind of negative testing.

--- Additional comment from Worker Ant on 2016-09-07 02:15:58 EDT ---

REVIEW: http://review.gluster.org/15416 (geo-rep: Fix Geo-rep status if monitor.pid file not exists) posted (#1) for review on master by Aravinda VK (avishwan)

--- Additional comment from Worker Ant on 2016-09-08 12:15:19 EDT ---

COMMIT: http://review.gluster.org/15416 committed in master by Aravinda VK (avishwan)
------
commit c7118a92f52a2fa33ab69f3e3ef1bdabfee847cf
Author: Aravinda VK <avishwan>
Date:   Wed Sep 7 11:39:39 2016 +0530

    geo-rep: Fix Geo-rep status if monitor.pid file not exists

    If the monitor.pid file does not exist, gsyncd fails with the following
    traceback:

    Traceback (most recent call last):
      File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 201, in main
        main_i()
      File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 681, in main_i
        brick_status.print_status(checkpoint_time=checkpoint_time)
      File "/usr/libexec/glusterfs/python/syncdaemon/gsyncdstatus.py", line 343, in print_status
        for key, value in self.get_status(checkpoint_time).items():
      File "/usr/libexec/glusterfs/python/syncdaemon/gsyncdstatus.py", line 262, in get_status
        with open(self.monitor_pid_file, "r+") as f:
    IOError: [Errno 2] No such file or directory: '/var/lib/glusterd/geo-replication/master_node_slave/monitor.pid'

    In the geo-rep status command output, this worker's status is not displayed,
    since the expected status output is not returned.

    BUG: 1373741
    Change-Id: I600a2f5d9617f993d635b9bc6e393108500db5f9
    Signed-off-by: Aravinda VK <avishwan>
    Reviewed-on: http://review.gluster.org/15416
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Kotresh HR <khiremat>
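The patch above makes the status query tolerate a missing monitor.pid instead of aborting with IOError. The snippet below is only a minimal sketch of that idea, not the actual gsyncdstatus.py change: the function name, the returned strings, and the assumption that a running monitor holds an exclusive lock on its pid file are illustrative.

# Hypothetical sketch (not the real gsyncdstatus.py patch): one way to make a
# status query survive a missing monitor pid file.
import errno
import fcntl


def monitor_status(monitor_pid_file):
    """Return a best-effort monitor status string.

    Assumes the running monitor keeps an exclusive lock on its pid file,
    which is a common pid-file liveness convention.
    """
    try:
        with open(monitor_pid_file, "r+") as f:
            try:
                # If we can take the lock, nothing is holding it, so the
                # monitor is not running.
                fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
                fcntl.lockf(f, fcntl.LOCK_UN)
                return "Stopped"
            except (IOError, OSError):
                return "Started"
    except (IOError, OSError) as e:
        if e.errno == errno.ENOENT:
            # monitor.pid missing (the case hit in this bug): report a sane
            # default instead of crashing the whole status command.
            return "Stopped"
        raise


if __name__ == "__main__":
    pid_file = ("/var/lib/glusterd/geo-replication/"
                "master_10.70.37.80_slave/monitor.pid")
    print(monitor_status(pid_file))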
REVIEW: http://review.gluster.org/15448 (geo-rep: Fix Geo-rep status if monitor.pid file not exists) posted (#1) for review on release-3.8 by Aravinda VK (avishwan)
All 3.8.x bugs are now reported against version 3.8 (without .x). For more information, see http://www.gluster.org/pipermail/gluster-devel/2016-September/050859.html
COMMIT: http://review.gluster.org/15448 committed in release-3.8 by Aravinda VK (avishwan)
------
commit 0d3e879b6cb65b2e42d9fc2e1a1cfe4cc38d5296
Author: Aravinda VK <avishwan>
Date:   Wed Sep 7 11:39:39 2016 +0530

    geo-rep: Fix Geo-rep status if monitor.pid file not exists

    If the monitor.pid file does not exist, gsyncd fails with the following
    traceback:

    Traceback (most recent call last):
      File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 201, in main
        main_i()
      File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 681, in main_i
        brick_status.print_status(checkpoint_time=checkpoint_time)
      File "/usr/libexec/glusterfs/python/syncdaemon/gsyncdstatus.py", line 343, in print_status
        for key, value in self.get_status(checkpoint_time).items():
      File "/usr/libexec/glusterfs/python/syncdaemon/gsyncdstatus.py", line 262, in get_status
        with open(self.monitor_pid_file, "r+") as f:
    IOError: [Errno 2] No such file or directory: '/var/lib/glusterd/geo-replication/master_node_slave/monitor.pid'

    In the geo-rep status command output, this worker's status is not displayed,
    since the expected status output is not returned.

    > Reviewed-on: http://review.gluster.org/15416
    > Smoke: Gluster Build System <jenkins.org>
    > NetBSD-regression: NetBSD Build System <jenkins.org>
    > CentOS-regression: Gluster Build System <jenkins.org>
    > Reviewed-by: Kotresh HR <khiremat>

    BUG: 1374632
    Change-Id: I600a2f5d9617f993d635b9bc6e393108500db5f9
    Signed-off-by: Aravinda VK <avishwan>
    Reviewed-on: http://review.gluster.org/15448
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Reviewed-by: Saravanakumar Arumugam <sarumuga>
    CentOS-regression: Gluster Build System <jenkins.org>
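A side note on the touch workaround shown earlier: it only recreates an empty file, so the status command runs again, but the node's workers still report Stopped because the file no longer identifies a live monitor process. The helper below is a hypothetical diagnostic sketch; check_monitor() and its return strings are not part of gsyncd, and the pid-alive check via signal 0 is a general convention rather than the exact gsyncd logic.

# Hypothetical diagnostic helper (not part of gsyncd): distinguish
# "pid file missing", "pid file empty or stale", and "monitor alive".
import errno
import os


def check_monitor(pid_file):
    try:
        with open(pid_file) as f:
            content = f.read().strip()
    except (IOError, OSError) as e:
        if e.errno == errno.ENOENT:
            return "missing"                 # the state hit in this bug
        raise

    if not content:
        return "empty (e.g. created by touch)"

    pid = int(content)
    try:
        os.kill(pid, 0)                      # signal 0: existence check only
    except OSError as e:
        if e.errno == errno.ESRCH:
            return "stale (pid %d not running)" % pid
        if e.errno == errno.EPERM:
            return "alive (pid %d, owned by another user)" % pid
        raise
    return "alive (pid %d)" % pid


if __name__ == "__main__":
    path = ("/var/lib/glusterd/geo-replication/"
            "master_10.70.37.80_slave/monitor.pid")
    print("monitor.pid state:", check_monitor(path))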
This bug is getting closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.8.5, please open a new bug report.

glusterfs-3.8.5 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://www.gluster.org/pipermail/announce/2016-October/000061.html
[2] https://www.gluster.org/pipermail/gluster-users/