Bug 1500346 - [geo-rep]: Incorrect last sync "0" during history crawl after upgrade/stop-start
Summary: [geo-rep]: Incorrect last sync "0" during history crawl after upgrade/stop-start
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: geo-replication
Version: mainline
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Kotresh HR
QA Contact:
URL:
Whiteboard:
Depends On: 1474012
Blocks: 1500853
 
Reported: 2017-10-10 12:32 UTC by Kotresh HR
Modified: 2017-12-08 17:43 UTC
CC List: 5 users

Fixed In Version: glusterfs-3.13.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1474012
: 1500853 (view as bug list)
Environment:
Last Closed: 2017-12-08 17:43:00 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Comment 1 Kotresh HR 2017-10-10 12:33:25 UTC
Description of problem:
=======================

Observed a scenario where the last sync became zero post upgrade/reboot during history crawl. Before the upgrade started, the crawl status was "Changelog Crawl" with a last sync time of "2017-07-21 12:51:55". However, after upgrading and starting geo-rep, the last sync for a few workers was shown as "0". The corresponding status files also show "0".

[root@dhcp42-79 ~]# gluster volume geo-replication master 10.70.41.209::slave status
 
MASTER NODE     MASTER VOL    MASTER BRICK       SLAVE USER    SLAVE                  SLAVE NODE      STATUS     CRAWL STATUS     LAST_SYNCED                  
----------------------------------------------------------------------------------------------------------------------------------------------------
10.70.42.79     master        /rhs/brick1/b1     root          10.70.41.209::slave    10.70.41.209    Active     History Crawl    2017-07-21 12:51:55          
10.70.42.79     master        /rhs/brick2/b5     root          10.70.41.209::slave    10.70.41.209    Active     History Crawl    2017-07-21 12:51:55          
10.70.42.79     master        /rhs/brick3/b9     root          10.70.41.209::slave    10.70.41.209    Active     History Crawl    2017-07-21 12:51:55          
10.70.42.74     master        /rhs/brick1/b3     root          10.70.41.209::slave    10.70.41.202    Active     History Crawl    N/A                          
10.70.42.74     master        /rhs/brick2/b7     root          10.70.41.209::slave    10.70.41.202    Active     History Crawl    N/A                          
10.70.42.74     master        /rhs/brick3/b11    root          10.70.41.209::slave    10.70.41.202    Active     History Crawl    N/A                          
10.70.41.217    master        /rhs/brick1/b4     root          10.70.41.209::slave    10.70.42.177    Passive    N/A              N/A                          
10.70.41.217    master        /rhs/brick2/b8     root          10.70.41.209::slave    10.70.42.177    Passive    N/A              N/A                          
10.70.41.217    master        /rhs/brick3/b12    root          10.70.41.209::slave    10.70.42.177    Passive    N/A              N/A                          
10.70.43.210    master        /rhs/brick1/b2     root          10.70.41.209::slave    10.70.41.194    Passive    N/A              N/A                          
10.70.43.210    master        /rhs/brick2/b6     root          10.70.41.209::slave    10.70.41.194    Passive    N/A              N/A                          
10.70.43.210    master        /rhs/brick3/b10    root          10.70.41.209::slave    10.70.41.194    Passive    N/A              N/A                          
[root@dhcp42-79 ~]# 
[root@dhcp42-79 ~]# date
Sun Jul 23 11:04:25 IST 2017
[root@dhcp42-79 ~]#


[root@dhcp42-74 ~]# cd /var/lib/glusterd/geo-replication/master_10.70.41.209_slave/
[root@dhcp42-74 master_10.70.41.209_slave]# ls
brick_%2Frhs%2Fbrick1%2Fb3.status  brick_%2Frhs%2Fbrick2%2Fb7.status  brick_%2Frhs%2Fbrick3%2Fb11.status  gsyncd.conf  monitor.pid  monitor.status
[root@dhcp42-74 master_10.70.41.209_slave]# cat brick_%2Frhs%2Fbrick1%2Fb3.status
{"checkpoint_time": 0, "last_synced": 0, "checkpoint_completed": "N/A", "meta": 0, "failures": 0, "entry": 583, "slave_node": "10.70.41.202", "data": 2083, "worker_status": "Active", "crawl_status": "History Crawl", "checkpoint_completion_time": 0}[root@dhcp42-74 master_10.70.41.209_slave]# 
[root@dhcp42-74 master_10.70.41.209_slave]# cat brick_%2Frhs%2Fbrick2%2Fb7.status
{"checkpoint_time": 0, "last_synced": 0, "checkpoint_completed": "N/A", "meta": 0, "failures": 0, "entry": 584, "slave_node": "10.70.41.202", "data": 2059, "worker_status": "Active", "crawl_status": "History Crawl", "checkpoint_completion_time": 0}[root@dhcp42-74 master_10.70.41.209_slave]# 
[root@dhcp42-74 master_10.70.41.209_slave]# cat brick_%2Frhs%2Fbrick3%2Fb11.status
{"checkpoint_time": 0, "last_synced": 0, "checkpoint_completed": "N/A", "meta": 0, "failures": 0, "entry": 586, "slave_node": "10.70.41.202", "data": 2101, "worker_status": "Active", "crawl_status": "History Crawl", "checkpoint_completion_time": 0}[root@dhcp42-74 master_10.70.41.209_slave]# 
[root@dhcp42-74 master_10.70.41.209_slave]# cat monitor.status
Started[root@dhcp42-74 master_10.70.41.209_slave]# 
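
For reference, the per-brick status file is plain JSON, so the stored last_synced epoch can be inspected directly. A minimal sketch in Python (the path and field name match the output above; treat the rest as illustrative, not actual gsyncd code):

import json
import time

# Path of one of the brick status files shown above.
STATUS_FILE = ("/var/lib/glusterd/geo-replication/"
               "master_10.70.41.209_slave/brick_%2Frhs%2Fbrick1%2Fb3.status")

with open(STATUS_FILE) as f:
    status = json.load(f)

last_synced = status.get("last_synced", 0)
if last_synced == 0:
    # An epoch of 0 is what shows up as "0"/"N/A" under LAST_SYNCED.
    print("last_synced: 0 (not recorded)")
else:
    print("last_synced:", time.strftime("%Y-%m-%d %H:%M:%S",
                                        time.localtime(last_synced)))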


The status remained the same for more than 10 minutes, until one batch had synced.



MASTER NODE     MASTER VOL    MASTER BRICK       SLAVE USER    SLAVE                  SLAVE NODE      STATUS     CRAWL STATUS     LAST_SYNCED                  
----------------------------------------------------------------------------------------------------------------------------------------------------
10.70.42.79     master        /rhs/brick1/b1     root          10.70.41.209::slave    10.70.41.209    Active     History Crawl    2017-07-21 12:51:55          
10.70.42.79     master        /rhs/brick2/b5     root          10.70.41.209::slave    10.70.41.209    Active     History Crawl    2017-07-21 12:51:55          
10.70.42.79     master        /rhs/brick3/b9     root          10.70.41.209::slave    10.70.41.209    Active     History Crawl    2017-07-21 12:51:55          
10.70.41.217    master        /rhs/brick1/b4     root          10.70.41.209::slave    10.70.42.177    Passive    N/A              N/A                          
10.70.41.217    master        /rhs/brick2/b8     root          10.70.41.209::slave    10.70.42.177    Passive    N/A              N/A                          
10.70.41.217    master        /rhs/brick3/b12    root          10.70.41.209::slave    10.70.42.177    Passive    N/A              N/A                          
10.70.42.74     master        /rhs/brick1/b3     root          10.70.41.209::slave    10.70.41.202    Active     History Crawl    N/A                          
10.70.42.74     master        /rhs/brick2/b7     root          10.70.41.209::slave    10.70.41.202    Active     History Crawl    N/A                          
10.70.42.74     master        /rhs/brick3/b11    root          10.70.41.209::slave    10.70.41.202    Active     History Crawl    N/A                          
10.70.43.210    master        /rhs/brick1/b2     root          10.70.41.209::slave    10.70.41.194    Passive    N/A              N/A                          
10.70.43.210    master        /rhs/brick2/b6     root          10.70.41.209::slave    10.70.41.194    Passive    N/A              N/A                          
10.70.43.210    master        /rhs/brick3/b10    root          10.70.41.209::slave    10.70.41.194    Passive    N/A              N/A                          
Sun Jul 23 11:14:50 IST 2017


Version-Release number of selected component (if applicable):
=============================================================
mainline


How reproducible:
=================

I remember seeing this only once before, upon stop/start. I have tried the upgrade twice and seen this once.

Steps to Reproduce:
===================

No specific steps; the systems were upgraded, and as part of the upgrade geo-replication was stopped and started.

Actual results:
===============

Last sync is "0"


Expected results:
=================

Last sync should be what it was before geo-rep was stopped. It looks like the brick status file was overwritten with "0" as the last synced time.

Comment 2 Worker Ant 2017-10-10 12:34:12 UTC
REVIEW: https://review.gluster.org/18468 (geo-rep: Fix passive brick's last sync time) posted (#1) for review on master by Kotresh HR (khiremat)

Comment 3 Worker Ant 2017-10-11 15:16:39 UTC
COMMIT: https://review.gluster.org/18468 committed in master by Kotresh HR (khiremat) 
------
commit f18a47ee7e6e06c9a9a8893aef7957f23a18de53
Author: Kotresh HR <khiremat>
Date:   Tue Oct 10 08:25:19 2017 -0400

    geo-rep: Fix passive brick's last sync time
    
    Passive brick's stime was not updated to the
    status file immediately after updating the brick
    root. As a result the last sync time was showing
    '0' until it finishes first crawl if passive
    worker becomes active after restart. Fix is to
    update the status file immediately after upgrading
    the brick root.
    
    Change-Id: I248339497303bad20b7f5a1d42ab44a1fe6bca99
    BUG: 1500346
    Signed-off-by: Kotresh HR <khiremat>
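
Conceptually, the change persists last_synced right after the brick-root stime is updated, instead of only later in the crawl. A rough, hypothetical sketch of that ordering in Python (stand-in names, not the actual gsyncd code):

import json
import os
import tempfile

class BrickStatus:
    """Stand-in for gsyncd's per-brick status file writer."""
    def __init__(self, path):
        self.path = path

    def set_last_synced(self, ts):
        data = {}
        if os.path.exists(self.path):
            with open(self.path) as f:
                data = json.load(f)
        data["last_synced"] = ts
        # Write atomically so a stop/start never reads a half-written file.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(self.path) or ".")
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
        os.rename(tmp, self.path)

def set_brick_root_stime(ts):
    # Stand-in for updating the stime xattr on the brick root.
    pass

def finish_batch(ts, status):
    set_brick_root_stime(ts)
    # The fix described above: update the status file immediately as well, so
    # a worker that restarts into History Crawl does not report last_synced "0".
    status.set_last_synced(ts)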

Comment 4 Shyamsundar 2017-12-08 17:43:00 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.13.0, please open a new bug report.

glusterfs-3.13.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-December/000087.html
[2] https://www.gluster.org/pipermail/gluster-users/

