+++ This bug was initially created as a clone of Bug #1373741 +++

+++ This bug was initially created as a clone of Bug #1369384 +++

Description of problem:
=======================
After upgrading the nodes from RHEL 7.2 to RHEL 7.3, a reboot was required due to a kernel update. After the reboot, the bricks from one of the nodes were not listed in the geo-replication status command output, even though the peer was in the Connected state and all of its bricks were online. On checking the geo-replication directory on that node, monitor.pid was missing; creating it with touch resolved the issue.

1. It is not clear what caused monitor.pid to be removed. From the user's perspective, the only operation performed was a reboot of the whole cluster at once.
2. Even if the file is removed, the ENOENT case should be handled gracefully.

Initial:
========
[root@dhcp37-81 ~]# gluster volume geo-replication master 10.70.37.80::slave status

MASTER NODE     MASTER VOL    MASTER BRICK      SLAVE USER    SLAVE                 SLAVE NODE      STATUS     CRAWL STATUS       LAST_SYNCED
----------------------------------------------------------------------------------------------------------------------------------------------------
10.70.37.81     master        /rhs/brick1/b1    root          10.70.37.80::slave    10.70.37.80     Active     Changelog Crawl    2016-08-22 16:11:07
10.70.37.81     master        /rhs/brick2/b4    root          10.70.37.80::slave    10.70.37.208    Active     Changelog Crawl    2016-08-22 16:11:11
10.70.37.200    master        /rhs/brick1/b3    root          10.70.37.80::slave    10.70.37.80     Passive    N/A                N/A
10.70.37.200    master        /rhs/brick2/b6    root          10.70.37.80::slave    10.70.37.208    Active     Changelog Crawl    2016-08-22 16:10:59
10.70.37.100    master        /rhs/brick1/b2    root          10.70.37.80::slave    10.70.37.208    Passive    N/A                N/A
10.70.37.100    master        /rhs/brick2/b5    root          10.70.37.80::slave    10.70.37.80     Passive    N/A                N/A
[root@dhcp37-81 ~]#

After RHEL Platform is updated and rebooted:
============================================
[root@dhcp37-81 ~]# gluster volume geo-replication master 10.70.37.80::slave status

MASTER NODE     MASTER VOL    MASTER BRICK      SLAVE USER    SLAVE                 SLAVE NODE      STATUS     CRAWL STATUS       LAST_SYNCED
----------------------------------------------------------------------------------------------------------------------------------------------------
10.70.37.81     master        /rhs/brick1/b1    root          10.70.37.80::slave    10.70.37.80     Passive    N/A                N/A
10.70.37.81     master        /rhs/brick2/b4    root          10.70.37.80::slave    10.70.37.208    Active     Changelog Crawl    2016-08-22 16:11:11
10.70.37.100    master        /rhs/brick1/b2    root          10.70.37.80::slave    10.70.37.208    Active     Changelog Crawl    2016-08-22 16:11:04
10.70.37.100    master        /rhs/brick2/b5    root          10.70.37.80::slave    10.70.37.80     Active     Changelog Crawl    2016-08-22 16:10:59
[root@dhcp37-81 ~]#

Peer and bricks of node 200 are all online:
===========================================
[root@dhcp37-81 ~]# gluster peer status
Number of Peers: 2

Hostname: 10.70.37.100
Uuid: 951c7434-89c2-4a66-a224-f3c2e5c7b06a
State: Peer in Cluster (Connected)

Hostname: 10.70.37.200
Uuid: db8ede6b-99b2-4369-8e65-8dd4d2fa54dc
State: Peer in Cluster (Connected)
[root@dhcp37-81 ~]#

[root@dhcp37-81 ~]# gluster volume status master
Status of volume: master
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.81:/rhs/brick1/b1            49152     0          Y       1639
Brick 10.70.37.100:/rhs/brick1/b2           49152     0          Y       1672
Brick 10.70.37.200:/rhs/brick1/b3           49152     0          Y       1683
Brick 10.70.37.81:/rhs/brick2/b4            49153     0          Y       1662
Brick 10.70.37.100:/rhs/brick2/b5           49153     0          Y       1673
Brick 10.70.37.200:/rhs/brick2/b6           49153     0          Y       1678
Snapshot Daemon on localhost                49155     0          Y       1776
NFS Server on localhost                     2049      0          Y       1701
Self-heal Daemon on localhost               N/A       N/A        Y       1709
Quota Daemon on localhost                   N/A       N/A        Y       1718
Snapshot Daemon on 10.70.37.100             49155     0          Y       1798
NFS Server on 10.70.37.100                  2049      0          Y       1729
Self-heal Daemon on 10.70.37.100            N/A       N/A        Y       1737
Quota Daemon on 10.70.37.100                N/A       N/A        Y       1745
Snapshot Daemon on 10.70.37.200             49155     0          Y       1817
NFS Server on 10.70.37.200                  2049      0          Y       1644
Self-heal Daemon on 10.70.37.200            N/A       N/A        Y       1649
Quota Daemon on 10.70.37.200                N/A       N/A        Y       1664

Task Status of Volume master
------------------------------------------------------------------------------
There are no active volume tasks

[root@dhcp37-81 ~]#

Problematic node 200:
=====================
[root@dhcp37-200 ~]# python /usr/libexec/glusterfs/python/syncdaemon/gsyncd.py -c /var/lib/glusterd/geo-replication/master_10.70.37.80_slave/gsyncd.conf --status-get :master 10.70.37.80::slave --path /rhs/brick1/b3/
[2016-08-23 08:56:10.248389] E [syncdutils:276:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 201, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 681, in main_i
    brick_status.print_status(checkpoint_time=checkpoint_time)
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncdstatus.py", line 343, in print_status
    for key, value in self.get_status(checkpoint_time).items():
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncdstatus.py", line 262, in get_status
    with open(self.monitor_pid_file, "r+") as f:
IOError: [Errno 2] No such file or directory: '/var/lib/glusterd/geo-replication/master_10.70.37.80_slave/monitor.pid'
failed with IOError.

[root@dhcp37-200 ~]# ls /var/lib/glusterd/geo-replication/
gsyncd_template.conf  master_10.70.37.80_slave/  secret.pem  secret.pem.pub  tar_ssh.pem  tar_ssh.pem.pub

[root@dhcp37-200 ~]# ls /var/lib/glusterd/geo-replication/master_10.70.37.80_slave/
brick_%2Frhs%2Fbrick1%2Fb3%2F.status  brick_%2Frhs%2Fbrick1%2Fb3.status  brick_%2Frhs%2Fbrick2%2Fb6.status  gsyncd.conf  monitor.status

[root@dhcp37-200 ~]# touch /var/lib/glusterd/geo-replication/master_10.70.37.80_slave/monitor.pid

[root@dhcp37-200 ~]# ls /var/lib/glusterd/geo-replication/master_10.70.37.80_slave/
brick_%2Frhs%2Fbrick1%2Fb3%2F.status  brick_%2Frhs%2Fbrick1%2Fb3.status  brick_%2Frhs%2Fbrick2%2Fb6.status  gsyncd.conf  monitor.pid  monitor.status
[root@dhcp37-200 ~]#

After touch, status shows the node, but as Stopped:
===================================================
[root@dhcp37-81 ~]# gluster volume geo-replication master 10.70.37.80::slave status

MASTER NODE     MASTER VOL    MASTER BRICK      SLAVE USER    SLAVE                 SLAVE NODE      STATUS     CRAWL STATUS       LAST_SYNCED
----------------------------------------------------------------------------------------------------------------------------------------------------
10.70.37.81     master        /rhs/brick1/b1    root          10.70.37.80::slave    10.70.37.80     Passive    N/A                N/A
10.70.37.81     master        /rhs/brick2/b4    root          10.70.37.80::slave    10.70.37.208    Active     Changelog Crawl    2016-08-22 16:11:11
10.70.37.100    master        /rhs/brick1/b2    root          10.70.37.80::slave    10.70.37.208    Active     Changelog Crawl    2016-08-22 16:11:04
10.70.37.100    master        /rhs/brick2/b5    root          10.70.37.80::slave    10.70.37.80     Active     Changelog Crawl    2016-08-22 16:10:59
10.70.37.200    master        /rhs/brick1/b3    root          10.70.37.80::slave    N/A             Stopped    N/A                N/A
10.70.37.200    master        /rhs/brick2/b6    root          10.70.37.80::slave    N/A             Stopped    N/A                N/A
[root@dhcp37-81 ~]#

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-geo-replication-3.7.9-10.el7rhgs.x86_64
glusterfs-3.7.9-10.el7rhgs.x86_64

How reproducible:
=================
This case was different in the sense that all the nodes in the cluster were brought offline at the same time while geo-replication was in the Started state; a kind of negative testing.

--- Additional comment from Worker Ant on 2016-09-07 02:15:58 EDT ---

REVIEW: http://review.gluster.org/15416 (geo-rep: Fix Geo-rep status if monitor.pid file not exists) posted (#1) for review on master by Aravinda VK (avishwan)

--- Additional comment from Worker Ant on 2016-09-08 12:15:19 EDT ---

COMMIT: http://review.gluster.org/15416 committed in master by Aravinda VK (avishwan)
------
commit c7118a92f52a2fa33ab69f3e3ef1bdabfee847cf
Author: Aravinda VK <avishwan>
Date:   Wed Sep 7 11:39:39 2016 +0530

    geo-rep: Fix Geo-rep status if monitor.pid file not exists

    If the monitor.pid file does not exist, gsyncd fails with the following
    traceback:

    Traceback (most recent call last):
      File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 201, in main
        main_i()
      File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 681, in main_i
        brick_status.print_status(checkpoint_time=checkpoint_time)
      File "/usr/libexec/glusterfs/python/syncdaemon/gsyncdstatus.py", line 343, in print_status
        for key, value in self.get_status(checkpoint_time).items():
      File "/usr/libexec/glusterfs/python/syncdaemon/gsyncdstatus.py", line 262, in get_status
        with open(self.monitor_pid_file, "r+") as f:
    IOError: [Errno 2] No such file or directory: '/var/lib/glusterd/geo-replication/master_node_slave/monitor.pid'

    In the geo-rep status command output, this worker's status is not displayed,
    since the expected status output is not returned.

    BUG: 1373741
    Change-Id: I600a2f5d9617f993d635b9bc6e393108500db5f9
    Signed-off-by: Aravinda VK <avishwan>
    Reviewed-on: http://review.gluster.org/15416
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Kotresh HR <khiremat>
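The patch above makes the status query tolerate a missing monitor.pid instead of aborting with IOError. The snippet below is only a minimal sketch of that idea, not the actual gsyncdstatus.py change: the function name, the returned strings, and the assumption that a running monitor holds an exclusive lock on its pid file are illustrative.

# Hypothetical sketch (not the real gsyncdstatus.py patch): one way to make a
# status query survive a missing monitor pid file.
import errno
import fcntl


def monitor_status(monitor_pid_file):
    """Return a best-effort monitor status string.

    Assumes the running monitor keeps an exclusive lock on its pid file,
    which is a common pid-file liveness convention.
    """
    try:
        with open(monitor_pid_file, "r+") as f:
            try:
                # If we can take the lock, nothing is holding it, so the
                # monitor is not running.
                fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
                fcntl.lockf(f, fcntl.LOCK_UN)
                return "Stopped"
            except (IOError, OSError):
                return "Started"
    except (IOError, OSError) as e:
        if e.errno == errno.ENOENT:
            # monitor.pid missing (the case hit in this bug): report a sane
            # default instead of crashing the whole status command.
            return "Stopped"
        raise


if __name__ == "__main__":
    pid_file = ("/var/lib/glusterd/geo-replication/"
                "master_10.70.37.80_slave/monitor.pid")
    print(monitor_status(pid_file))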
REVIEW: http://review.gluster.org/15448 (geo-rep: Fix Geo-rep status if monitor.pid file not exists) posted (#1) for review on release-3.8 by Aravinda VK (avishwan)
All 3.8.x bugs are now reported against version 3.8 (without .x). For more information, see http://www.gluster.org/pipermail/gluster-devel/2016-September/050859.html
COMMIT: http://review.gluster.org/15448 committed in release-3.8 by Aravinda VK (avishwan)
------
commit 0d3e879b6cb65b2e42d9fc2e1a1cfe4cc38d5296
Author: Aravinda VK <avishwan>
Date:   Wed Sep 7 11:39:39 2016 +0530

    geo-rep: Fix Geo-rep status if monitor.pid file not exists

    If the monitor.pid file does not exist, gsyncd fails with the following
    traceback:

    Traceback (most recent call last):
      File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 201, in main
        main_i()
      File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 681, in main_i
        brick_status.print_status(checkpoint_time=checkpoint_time)
      File "/usr/libexec/glusterfs/python/syncdaemon/gsyncdstatus.py", line 343, in print_status
        for key, value in self.get_status(checkpoint_time).items():
      File "/usr/libexec/glusterfs/python/syncdaemon/gsyncdstatus.py", line 262, in get_status
        with open(self.monitor_pid_file, "r+") as f:
    IOError: [Errno 2] No such file or directory: '/var/lib/glusterd/geo-replication/master_node_slave/monitor.pid'

    In the geo-rep status command output, this worker's status is not displayed,
    since the expected status output is not returned.

    > Reviewed-on: http://review.gluster.org/15416
    > Smoke: Gluster Build System <jenkins.org>
    > NetBSD-regression: NetBSD Build System <jenkins.org>
    > CentOS-regression: Gluster Build System <jenkins.org>
    > Reviewed-by: Kotresh HR <khiremat>

    BUG: 1374632
    Change-Id: I600a2f5d9617f993d635b9bc6e393108500db5f9
    Signed-off-by: Aravinda VK <avishwan>
    Reviewed-on: http://review.gluster.org/15448
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Reviewed-by: Saravanakumar Arumugam <sarumuga>
    CentOS-regression: Gluster Build System <jenkins.org>
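A side note on the touch workaround shown earlier: it only recreates an empty file, so the status command runs again, but the node's workers still report Stopped because the file no longer identifies a live monitor process. The helper below is a hypothetical diagnostic sketch; check_monitor() and its return strings are not part of gsyncd, and the pid-alive check via signal 0 is a general convention rather than the exact gsyncd logic.

# Hypothetical diagnostic helper (not part of gsyncd): distinguish
# "pid file missing", "pid file empty or stale", and "monitor alive".
import errno
import os


def check_monitor(pid_file):
    try:
        with open(pid_file) as f:
            content = f.read().strip()
    except (IOError, OSError) as e:
        if e.errno == errno.ENOENT:
            return "missing"                 # the state hit in this bug
        raise

    if not content:
        return "empty (e.g. created by touch)"

    pid = int(content)
    try:
        os.kill(pid, 0)                      # signal 0: existence check only
    except OSError as e:
        if e.errno == errno.ESRCH:
            return "stale (pid %d not running)" % pid
        if e.errno == errno.EPERM:
            return "alive (pid %d, owned by another user)" % pid
        raise
    return "alive (pid %d)" % pid


if __name__ == "__main__":
    path = ("/var/lib/glusterd/geo-replication/"
            "master_10.70.37.80_slave/monitor.pid")
    print("monitor.pid state:", check_monitor(path))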
This bug is getting closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.8.5, please open a new bug report.

glusterfs-3.8.5 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://www.gluster.org/pipermail/announce/2016-October/000061.html
[2] https://www.gluster.org/pipermail/gluster-users/