Description of problem:
=======================
After upgrading the nodes from RHEL 7.2 to RHEL 7.3, a reboot was required because of a kernel update. After the reboot, the bricks from one of the nodes were not listed in the geo-replication status command, even though the peer was in the Connected state and all of its bricks were online. On checking the geo-replication directory, the node was missing monitor.pid. Creating the file with touch resolved the issue.

1. It is not clear what removed monitor.pid. From the user's perspective, the only operation performed was a reboot of the whole cluster at once.
2. Even if the file is removed, gsyncd should handle the ENOENT case (a minimal sketch of such handling follows the reproducibility note below).

Initial:
========
[root@dhcp37-81 ~]# gluster volume geo-replication master 10.70.37.80::slave status

MASTER NODE     MASTER VOL    MASTER BRICK      SLAVE USER    SLAVE                 SLAVE NODE      STATUS     CRAWL STATUS       LAST_SYNCED
----------------------------------------------------------------------------------------------------------------------------------------------------
10.70.37.81     master        /rhs/brick1/b1    root          10.70.37.80::slave    10.70.37.80     Active     Changelog Crawl    2016-08-22 16:11:07
10.70.37.81     master        /rhs/brick2/b4    root          10.70.37.80::slave    10.70.37.208    Active     Changelog Crawl    2016-08-22 16:11:11
10.70.37.200    master        /rhs/brick1/b3    root          10.70.37.80::slave    10.70.37.80     Passive    N/A                N/A
10.70.37.200    master        /rhs/brick2/b6    root          10.70.37.80::slave    10.70.37.208    Active     Changelog Crawl    2016-08-22 16:10:59
10.70.37.100    master        /rhs/brick1/b2    root          10.70.37.80::slave    10.70.37.208    Passive    N/A                N/A
10.70.37.100    master        /rhs/brick2/b5    root          10.70.37.80::slave    10.70.37.80     Passive    N/A                N/A
[root@dhcp37-81 ~]#

After the RHEL platform is updated and rebooted:
================================================
[root@dhcp37-81 ~]# gluster volume geo-replication master 10.70.37.80::slave status

MASTER NODE     MASTER VOL    MASTER BRICK      SLAVE USER    SLAVE                 SLAVE NODE      STATUS     CRAWL STATUS       LAST_SYNCED
----------------------------------------------------------------------------------------------------------------------------------------------------
10.70.37.81     master        /rhs/brick1/b1    root          10.70.37.80::slave    10.70.37.80     Passive    N/A                N/A
10.70.37.81     master        /rhs/brick2/b4    root          10.70.37.80::slave    10.70.37.208    Active     Changelog Crawl    2016-08-22 16:11:11
10.70.37.100    master        /rhs/brick1/b2    root          10.70.37.80::slave    10.70.37.208    Active     Changelog Crawl    2016-08-22 16:11:04
10.70.37.100    master        /rhs/brick2/b5    root          10.70.37.80::slave    10.70.37.80     Active     Changelog Crawl    2016-08-22 16:10:59
[root@dhcp37-81 ~]#

Peer and bricks of node 200 are all online:
===========================================
[root@dhcp37-81 ~]# gluster peer status
Number of Peers: 2

Hostname: 10.70.37.100
Uuid: 951c7434-89c2-4a66-a224-f3c2e5c7b06a
State: Peer in Cluster (Connected)

Hostname: 10.70.37.200
Uuid: db8ede6b-99b2-4369-8e65-8dd4d2fa54dc
State: Peer in Cluster (Connected)
[root@dhcp37-81 ~]#

[root@dhcp37-81 ~]# gluster volume status master
Status of volume: master
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.81:/rhs/brick1/b1            49152     0          Y       1639
Brick 10.70.37.100:/rhs/brick1/b2           49152     0          Y       1672
Brick 10.70.37.200:/rhs/brick1/b3           49152     0          Y       1683
Brick 10.70.37.81:/rhs/brick2/b4            49153     0          Y       1662
Brick 10.70.37.100:/rhs/brick2/b5           49153     0          Y       1673
Brick 10.70.37.200:/rhs/brick2/b6           49153     0          Y       1678
Snapshot Daemon on localhost                49155     0          Y       1776
NFS Server on localhost                     2049      0          Y       1701
Self-heal Daemon on localhost               N/A       N/A        Y       1709
Quota Daemon on localhost                   N/A       N/A        Y       1718
Snapshot Daemon on 10.70.37.100             49155     0          Y       1798
NFS Server on 10.70.37.100                  2049      0          Y       1729
Self-heal Daemon on 10.70.37.100            N/A       N/A        Y       1737
Quota Daemon on 10.70.37.100                N/A       N/A        Y       1745
Snapshot Daemon on 10.70.37.200             49155     0          Y       1817
NFS Server on 10.70.37.200                  2049      0          Y       1644
Self-heal Daemon on 10.70.37.200            N/A       N/A        Y       1649
Quota Daemon on 10.70.37.200                N/A       N/A        Y       1664

Task Status of Volume master
------------------------------------------------------------------------------
There are no active volume tasks

[root@dhcp37-81 ~]#

Problematic node 200:
=====================
[root@dhcp37-200 ~]# python /usr/libexec/glusterfs/python/syncdaemon/gsyncd.py -c /var/lib/glusterd/geo-replication/master_10.70.37.80_slave/gsyncd.conf --status-get :master 10.70.37.80::slave --path /rhs/brick1/b3/
[2016-08-23 08:56:10.248389] E [syncdutils:276:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 201, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 681, in main_i
    brick_status.print_status(checkpoint_time=checkpoint_time)
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncdstatus.py", line 343, in print_status
    for key, value in self.get_status(checkpoint_time).items():
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncdstatus.py", line 262, in get_status
    with open(self.monitor_pid_file, "r+") as f:
IOError: [Errno 2] No such file or directory: '/var/lib/glusterd/geo-replication/master_10.70.37.80_slave/monitor.pid'
failed with IOError.

[root@dhcp37-200 ~]# ls /var/lib/glusterd/geo-replication/
gsyncd_template.conf  master_10.70.37.80_slave/  secret.pem  secret.pem.pub  tar_ssh.pem  tar_ssh.pem.pub

[root@dhcp37-200 ~]# ls /var/lib/glusterd/geo-replication/master_10.70.37.80_slave/
brick_%2Frhs%2Fbrick1%2Fb3%2F.status  brick_%2Frhs%2Fbrick1%2Fb3.status  brick_%2Frhs%2Fbrick2%2Fb6.status  gsyncd.conf  monitor.status

[root@dhcp37-200 ~]# touch /var/lib/glusterd/geo-replication/master_10.70.37.80_slave/monitor.pid
[root@dhcp37-200 ~]# ls /var/lib/glusterd/geo-replication/master_10.70.37.80_slave/
brick_%2Frhs%2Fbrick1%2Fb3%2F.status  brick_%2Frhs%2Fbrick1%2Fb3.status  brick_%2Frhs%2Fbrick2%2Fb6.status  gsyncd.conf  monitor.pid  monitor.status
[root@dhcp37-200 ~]#

After touch, status lists the node's bricks, but as Stopped:
============================================================
[root@dhcp37-81 ~]# gluster volume geo-replication master 10.70.37.80::slave status

MASTER NODE     MASTER VOL    MASTER BRICK      SLAVE USER    SLAVE                 SLAVE NODE      STATUS     CRAWL STATUS       LAST_SYNCED
----------------------------------------------------------------------------------------------------------------------------------------------------
10.70.37.81     master        /rhs/brick1/b1    root          10.70.37.80::slave    10.70.37.80     Passive    N/A                N/A
10.70.37.81     master        /rhs/brick2/b4    root          10.70.37.80::slave    10.70.37.208    Active     Changelog Crawl    2016-08-22 16:11:11
10.70.37.100    master        /rhs/brick1/b2    root          10.70.37.80::slave    10.70.37.208    Active     Changelog Crawl    2016-08-22 16:11:04
10.70.37.100    master        /rhs/brick2/b5    root          10.70.37.80::slave    10.70.37.80     Active     Changelog Crawl    2016-08-22 16:10:59
10.70.37.200    master        /rhs/brick1/b3    root          10.70.37.80::slave    N/A             Stopped    N/A                N/A
10.70.37.200    master        /rhs/brick2/b6    root          10.70.37.80::slave    N/A             Stopped    N/A                N/A
[root@dhcp37-81 ~]#

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-geo-replication-3.7.9-10.el7rhgs.x86_64
glusterfs-3.7.9-10.el7rhgs.x86_64

How reproducible:
=================
Reboot cases were tried in 3.1.3 and this issue was not seen.
However, this case was different in that all the nodes in the cluster were brought offline at the same time while geo-replication was in the Started state; a kind of negative test.
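The traceback above shows get_status() in gsyncdstatus.py opening monitor.pid unconditionally, so a missing file aborts the whole status listing. The following is a minimal sketch of the kind of guard the fix needs, not the actual upstream patch (that is linked in the comments below); the helper name monitor_status is hypothetical.

import os
from errno import ENOENT, ESRCH

# Sketch only: map a missing monitor.pid (ENOENT) or a stale pid
# (ESRCH from the probe) to "Stopped" instead of letting the
# IOError escape and abort the status command.
def monitor_status(monitor_pid_file):
    try:
        with open(monitor_pid_file, "r+") as f:
            pid = f.read().strip()
            if not pid:
                return "Stopped"    # file exists but records no pid
            os.kill(int(pid), 0)    # signal 0 only probes the process
            return "Started"
    except (IOError, OSError) as e:
        if e.errno in (ENOENT, ESRCH):
            return "Stopped"
        raise

With a guard like this, a node whose monitor.pid disappears would report its bricks as Stopped, matching the post-touch behaviour shown above, instead of dropping them from the listing entirely.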
Upstream patch sent to fix the gsyncdstatus.py traceback: http://review.gluster.org/15416
Upstream mainline: http://review.gluster.org/15416
Upstream 3.8: http://review.gluster.org/15448
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/85005
Verified with the build:
glusterfs-geo-replication-3.8.4-14.el6rhs.x86_64

Since the original issue is not reproducible on demand, the scenario was simulated by moving monitor.pid aside. The status now correctly shows the affected bricks as Stopped instead of not showing them at all.

[root@rhel6-1 ~]# ls /var/lib/glusterd/geo-replication/master_10.70.37.56_slave/monitor.pid
/var/lib/glusterd/geo-replication/master_10.70.37.56_slave/monitor.pid
[root@rhel6-1 ~]#
[root@rhel6-1 ~]# mv /var/lib/glusterd/geo-replication/master_10.70.37.56_slave/monitor.pid /var/lib/glusterd/geo-replication/
[root@rhel6-1 ~]# gluster volume geo-replication master 10.70.37.56::slave status

MASTER NODE     MASTER VOL    MASTER BRICK                     SLAVE USER    SLAVE                 SLAVE NODE      STATUS     CRAWL STATUS       LAST_SYNCED
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
10.70.37.94     master        /bricks/brick0/master_brick0     root          10.70.37.56::slave    N/A             Stopped    N/A                N/A
10.70.37.94     master        /bricks/brick1/master_brick4     root          10.70.37.56::slave    N/A             Stopped    N/A                N/A
10.70.37.94     master        /bricks/brick2/master_brick8     root          10.70.37.56::slave    N/A             Stopped    N/A                N/A
10.70.37.157    master        /bricks/brick0/master_brick1     root          10.70.37.56::slave    10.70.37.205    Active     Changelog Crawl    2017-02-19 14:59:11
10.70.37.157    master        /bricks/brick1/master_brick5     root          10.70.37.56::slave    10.70.37.205    Active     Changelog Crawl    2017-02-19 14:59:22
10.70.37.157    master        /bricks/brick2/master_brick9     root          10.70.37.56::slave    10.70.37.205    Active     Changelog Crawl    2017-02-19 14:59:22
10.70.37.41     master        /bricks/brick0/master_brick3     root          10.70.37.56::slave    10.70.37.200    Active     Changelog Crawl    2017-02-19 14:59:08
10.70.37.41     master        /bricks/brick1/master_brick7     root          10.70.37.56::slave    10.70.37.200    Active     Changelog Crawl    2017-02-19 14:59:18
10.70.37.41     master        /bricks/brick2/master_brick11    root          10.70.37.56::slave    10.70.37.200    Active     Changelog Crawl    2017-02-19 14:59:14
10.70.37.199    master        /bricks/brick0/master_brick2     root          10.70.37.56::slave    10.70.37.63     Passive    N/A                N/A
10.70.37.199    master        /bricks/brick1/master_brick6     root          10.70.37.56::slave    10.70.37.63     Passive    N/A                N/A
10.70.37.199    master        /bricks/brick2/master_brick10    root          10.70.37.56::slave    10.70.37.63     Passive    N/A                N/A
[root@rhel6-1 ~]#

[root@rhel6-1 ~]# python /usr/libexec/glusterfs/python/syncdaemon/gsyncd.py -c /var/lib/glusterd/geo-replication/master_10.70.37.56_slave/gsyncd.conf --status-get :master 10.70.37.56::slave --path /bricks/brick0/master_brick0
checkpoint_time: N/A
last_synced_utc: N/A
checkpoint_completion_time_utc: N/A
checkpoint_completed: N/A
meta: N/A
entry: N/A
slave_node: N/A
data: N/A
worker_status: Stopped
checkpoint_completion_time: N/A
checkpoint_completed_time: N/A
last_synced: N/A
checkpoint_time_utc: N/A
failures: N/A
crawl_status: N/A
[root@rhel6-1 ~]#
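For regression runs, the same simulation can be scripted. The following is a hypothetical helper, not a shipped test; it reuses the exact gsyncd.py arguments and session paths from the transcript above and assumes it runs as root on the master node.

import shutil
import subprocess

# Hypothetical regression check: move monitor.pid aside, then assert
# that --status-get degrades to worker_status "Stopped" instead of
# tracing back. Session path and arguments come from the transcript.
SESSION = "/var/lib/glusterd/geo-replication/master_10.70.37.56_slave"
GSYNCD = "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py"

shutil.move(SESSION + "/monitor.pid",
            "/var/lib/glusterd/geo-replication/monitor.pid")

out = subprocess.check_output(
    ["python", GSYNCD, "-c", SESSION + "/gsyncd.conf",
     "--status-get", ":master", "10.70.37.56::slave",
     "--path", "/bricks/brick0/master_brick0"]).decode()
assert "worker_status: Stopped" in out, out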
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html