Description of problem:
=======================
Observed a scenario where the last sync time became zero during history crawl after an upgrade/reboot.

Before the upgrade started, the crawl status was "Changelog Crawl" with a last sync time of "2017-07-21 12:51:55". However, after the upgrade and restart of geo-replication, the last sync time for a few workers was shown as "N/A", and the corresponding status files show "0":

[root@dhcp42-79 ~]# gluster volume geo-replication master 10.70.41.209::slave status

MASTER NODE     MASTER VOL    MASTER BRICK       SLAVE USER    SLAVE                  SLAVE NODE      STATUS     CRAWL STATUS     LAST_SYNCED
----------------------------------------------------------------------------------------------------------------------------------------------------
10.70.42.79     master        /rhs/brick1/b1     root          10.70.41.209::slave    10.70.41.209    Active     History Crawl    2017-07-21 12:51:55
10.70.42.79     master        /rhs/brick2/b5     root          10.70.41.209::slave    10.70.41.209    Active     History Crawl    2017-07-21 12:51:55
10.70.42.79     master        /rhs/brick3/b9     root          10.70.41.209::slave    10.70.41.209    Active     History Crawl    2017-07-21 12:51:55
10.70.42.74     master        /rhs/brick1/b3     root          10.70.41.209::slave    10.70.41.202    Active     History Crawl    N/A
10.70.42.74     master        /rhs/brick2/b7     root          10.70.41.209::slave    10.70.41.202    Active     History Crawl    N/A
10.70.42.74     master        /rhs/brick3/b11    root          10.70.41.209::slave    10.70.41.202    Active     History Crawl    N/A
10.70.41.217    master        /rhs/brick1/b4     root          10.70.41.209::slave    10.70.42.177    Passive    N/A              N/A
10.70.41.217    master        /rhs/brick2/b8     root          10.70.41.209::slave    10.70.42.177    Passive    N/A              N/A
10.70.41.217    master        /rhs/brick3/b12    root          10.70.41.209::slave    10.70.42.177    Passive    N/A              N/A
10.70.43.210    master        /rhs/brick1/b2     root          10.70.41.209::slave    10.70.41.194    Passive    N/A              N/A
10.70.43.210    master        /rhs/brick2/b6     root          10.70.41.209::slave    10.70.41.194    Passive    N/A              N/A
10.70.43.210    master        /rhs/brick3/b10    root          10.70.41.209::slave    10.70.41.194    Passive    N/A              N/A

[root@dhcp42-79 ~]# date
Sun Jul 23 11:04:25 IST 2017

[root@dhcp42-74 ~]# cd /var/lib/glusterd/geo-replication/master_10.70.41.209_slave/
[root@dhcp42-74 master_10.70.41.209_slave]# ls
brick_%2Frhs%2Fbrick1%2Fb3.status  brick_%2Frhs%2Fbrick2%2Fb7.status  brick_%2Frhs%2Fbrick3%2Fb11.status  gsyncd.conf  monitor.pid  monitor.status

[root@dhcp42-74 master_10.70.41.209_slave]# cat brick_%2Frhs%2Fbrick1%2Fb3.status
{"checkpoint_time": 0, "last_synced": 0, "checkpoint_completed": "N/A", "meta": 0, "failures": 0, "entry": 583, "slave_node": "10.70.41.202", "data": 2083, "worker_status": "Active", "crawl_status": "History Crawl", "checkpoint_completion_time": 0}

[root@dhcp42-74 master_10.70.41.209_slave]# cat brick_%2Frhs%2Fbrick2%2Fb7.status
{"checkpoint_time": 0, "last_synced": 0, "checkpoint_completed": "N/A", "meta": 0, "failures": 0, "entry": 584, "slave_node": "10.70.41.202", "data": 2059, "worker_status": "Active", "crawl_status": "History Crawl", "checkpoint_completion_time": 0}

[root@dhcp42-74 master_10.70.41.209_slave]# cat brick_%2Frhs%2Fbrick3%2Fb11.status
{"checkpoint_time": 0, "last_synced": 0, "checkpoint_completed": "N/A", "meta": 0, "failures": 0, "entry": 586, "slave_node": "10.70.41.202", "data": 2101, "worker_status": "Active", "crawl_status": "History Crawl", "checkpoint_completion_time": 0}

[root@dhcp42-74 master_10.70.41.209_slave]# cat monitor.status
Started
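For reference, a last_synced of 0 in the brick status file is what the CLI renders as "N/A" in the LAST_SYNCED column. A minimal sketch of that mapping, assuming only the JSON layout shown in the cat output above (illustrative, not gsyncd's actual code):

import json
import time

def last_synced_display(status_path):
    """Return a human-readable last-synced time, or 'N/A' when unset (0)."""
    with open(status_path) as f:
        status = json.load(f)
    ts = status.get("last_synced", 0)
    if ts == 0:
        # 0 means no sync time has been recorded yet; shown as N/A.
        return "N/A"
    return time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(ts))

# Example, using a path from the session above:
print(last_synced_display(
    "/var/lib/glusterd/geo-replication/master_10.70.41.209_slave/"
    "brick_%2Frhs%2Fbrick1%2Fb3.status"))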
The status remained the same for more than 10 minutes, until the first batch of changes was synced:

MASTER NODE     MASTER VOL    MASTER BRICK       SLAVE USER    SLAVE                  SLAVE NODE      STATUS     CRAWL STATUS     LAST_SYNCED
----------------------------------------------------------------------------------------------------------------------------------------------------
10.70.42.79     master        /rhs/brick1/b1     root          10.70.41.209::slave    10.70.41.209    Active     History Crawl    2017-07-21 12:51:55
10.70.42.79     master        /rhs/brick2/b5     root          10.70.41.209::slave    10.70.41.209    Active     History Crawl    2017-07-21 12:51:55
10.70.42.79     master        /rhs/brick3/b9     root          10.70.41.209::slave    10.70.41.209    Active     History Crawl    2017-07-21 12:51:55
10.70.41.217    master        /rhs/brick1/b4     root          10.70.41.209::slave    10.70.42.177    Passive    N/A              N/A
10.70.41.217    master        /rhs/brick2/b8     root          10.70.41.209::slave    10.70.42.177    Passive    N/A              N/A
10.70.41.217    master        /rhs/brick3/b12    root          10.70.41.209::slave    10.70.42.177    Passive    N/A              N/A
10.70.42.74     master        /rhs/brick1/b3     root          10.70.41.209::slave    10.70.41.202    Active     History Crawl    N/A
10.70.42.74     master        /rhs/brick2/b7     root          10.70.41.209::slave    10.70.41.202    Active     History Crawl    N/A
10.70.42.74     master        /rhs/brick3/b11    root          10.70.41.209::slave    10.70.41.202    Active     History Crawl    N/A
10.70.43.210    master        /rhs/brick1/b2     root          10.70.41.209::slave    10.70.41.194    Passive    N/A              N/A
10.70.43.210    master        /rhs/brick2/b6     root          10.70.41.209::slave    10.70.41.194    Passive    N/A              N/A
10.70.43.210    master        /rhs/brick3/b10    root          10.70.41.209::slave    10.70.41.194    Passive    N/A              N/A

Sun Jul 23 11:14:50 IST 2017

Version-Release number of selected component (if applicable):
=============================================================
mainline

How reproducible:
=================
Seen this only once before, upon stop/start. The upgrade has been tried twice, and this was seen once.

Steps to Reproduce:
===================
No specific steps; the systems were upgraded, and as part of the upgrade geo-replication was stopped and started.

Actual results:
===============
Last sync is "0".

Expected results:
=================
Last sync should be what it was before geo-replication was stopped. It looks like the brick status file was overwritten with "0" as the last synced time.
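To make the suspected failure mode concrete: if a restarting worker writes a fresh default status dict instead of merging with the existing file, the previously recorded last_synced is clobbered with 0. A sketch of that pattern, with hypothetical helper names (this is the reporter's hypothesis, not gsyncd's actual code):

import json
import os

# Defaults mirroring the fields seen in the status files above.
DEFAULTS = {"last_synced": 0, "checkpoint_time": 0, "entry": 0, "data": 0,
            "meta": 0, "failures": 0, "checkpoint_completed": "N/A",
            "checkpoint_completion_time": 0}

def reset_status_buggy(path):
    # Suspected buggy behavior: unconditionally write the defaults,
    # losing the last_synced value recorded before the restart.
    with open(path, "w") as f:
        json.dump(DEFAULTS, f)

def reset_status_preserving(path):
    # Safe variant: start from the existing file so last_synced survives.
    status = dict(DEFAULTS)
    if os.path.exists(path):
        with open(path) as f:
            status.update(json.load(f))
    with open(path, "w") as f:
        json.dump(status, f)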
REVIEW: https://review.gluster.org/18468 (geo-rep: Fix passive brick's last sync time) posted (#1) for review on master by Kotresh HR (khiremat)
COMMIT: https://review.gluster.org/18468 committed in master by Kotresh HR (khiremat)
------
commit f18a47ee7e6e06c9a9a8893aef7957f23a18de53
Author: Kotresh HR <khiremat>
Date:   Tue Oct 10 08:25:19 2017 -0400

    geo-rep: Fix passive brick's last sync time

    The passive brick's stime was not updated in the status file
    immediately after updating the brick root. As a result, if a
    passive worker became active after a restart, the last sync time
    showed '0' until the first crawl finished. The fix is to update
    the status file immediately after updating the brick root.

    Change-Id: I248339497303bad20b7f5a1d42ab44a1fe6bca99
    BUG: 1500346
    Signed-off-by: Kotresh HR <khiremat>
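A minimal sketch of the fix as described in the commit message, with hypothetical names (not the actual gsyncd code): the passive worker now flushes the stime to the status file at the same point it updates the brick root, instead of leaving last_synced at 0 until the first crawl finishes.

import json

class StatusFile:
    """Stand-in for the brick status file shown in the report."""
    def __init__(self, path):
        self.path = path

    def set_last_synced(self, stime):
        # Read-modify-write so the other counters in the file are kept.
        with open(self.path) as f:
            status = json.load(f)
        status["last_synced"] = stime
        with open(self.path, "w") as f:
            json.dump(status, f)

def passive_update_stime(brick_xattrs, status_file, stime):
    # Passive workers mirror the active worker's stime onto the brick root.
    brick_xattrs["stime"] = stime  # stand-in for the real xattr update
    # The fix: persist the stime to the status file right away, so a worker
    # that turns active after a restart reports the real last-synced time.
    status_file.set_last_synced(stime)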
This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.13.0, please open a new bug report.

glusterfs-3.13.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-December/000087.html
[2] https://www.gluster.org/pipermail/gluster-users/